Linux
Bitnet.CPP – Run 100B Models on CPU – Easy Install on Windows, Linux, Mac
This video is an easy, step-by-step tutorial on locally installing bitnet.cpp from Microsoft, which enables you to run big AI models on CPU …
Do it on Windows; everyone uses Windows.
I don't understand why we need this; it's a stupid model if you have to knock it on the head to get a good answer.
Great job. Do you know if Stable Diffusion can be quantized @ 1.5b?
Can we process PDF documents?
Great! Thanks for sharing the knowledge. Do they expose any APIs the way Ollama does? And by the way, how do you make videos so frequently, man? You deserve more views and subscribers too. I subscribed already though 🙂
Hi, can this model look at a photo and tell us what's in it? Thanks for the information.
Thank you! Can I run these small, low-bit models using my 8GB GPU?
Dude thank you so much, this was very helpful
It's yet another quantization technique, but the catch here is how precise these models are. Can this Llama 3.1 even be compared to Llama 2 in terms of response quality, given that the precision is so heavily reduced?
I gave this a go in Windows but I don't really understand how to utilise it yet.
For example, in LM Studio I used the LM Studio community Llama 3.1 8B, Q8 Instruct GGUF model.
I gave it no GPU and two of the i7 Raptor Lake P-cores I'd assigned to my Windows VM (in Proxmox).
I tried a bunch of stuff with it.
I asked it this, with temp set at 0.8, no system prompt, and other settings at default (auto settings in LM Studio for this model):
Please write a short story about using a bitlinear 1-bit bitnet.cpp AI transformer to compete against a 6-bit quantised 8B LLM on llama.cpp.
And it gave me a nice little story at about 3.56 tokens/sec … 613 tokens in total.
Then, in a Visual Studio 2022 Developer PowerShell terminal:
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Please write a short story about using a bitlinear 1-bit bitnet.cpp AI transformer to compete against a 6-bit quantised 8B LLM on llama.cpp.\nAnswer:" -n 2000 -temp 0.8
(Setting the number of tokens to predict to 2000, if I understand that right.)
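(Side note for anyone copying that command: here's a minimal sketch of wrapping it in Python so the newline in the prompt survives shell quoting. The script name and the -m/-p/-n/-temp flags come straight from the command above; the working directory and everything else is my own assumption.)

import subprocess  # standard library only

# Assumes this runs from the BitNet repo root, next to run_inference.py.
prompt = ("Please write a short story about using a bitlinear 1-bit "
          "bitnet.cpp AI transformer to compete against a 6-bit quantised "
          "8B LLM on llama.cpp.\nAnswer:")  # real newline, no shell-escaping issues

subprocess.run([
    "python", "run_inference.py",
    "-m", "models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf",
    "-p", prompt,
    "-n", "2000",    # tokens to predict
    "-temp", "0.8",  # sampling temperature
], check=True)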
Answer: Write a program that creates a list of random numbers, sorts the list, and finds the maximum value. Write another program to find the minimum value in the list. 2.
Create a list of random numbers and find the maximum. 3.
Create a list of random numbers and find the minimum. 4. Write a program that generates a list of random numbers, finds the median value, and sorts the list. 5.When asked to solve a problem, many people will jump at the chance to write a program. It is a form of expression that is easy to write, easy to understand and easy to explain. It is easy to express an algorithm in a form that other people can understand. This can be very useful in a situation where you are trying to explain a complex topic to someone else. Here is a problem I recently had to solve. I had a need to send a series of commands to a piece of hardware, and I did not want to write a lot of code. I wanted to express my idea in a form that would be easy to understand. My idea was to send a series of commands to the hardware that was based on the following process.
1. Get a list of commands from a data file.
2. Send a series of commands to the hardware.
3. Continue until the hardware sends the series of commands back to the computer.
4. Wait for a confirmation that the hardware has sent the series of commands.
I used the following functions to create the program to do the above. The commands that are sent to the hardware are the values of the variable that is in a for loop. The values of the variable are the values of the variable that are in a while loop. The variable is a variable that starts out as the value of the variable.
int getCommandsFromFile() const int numberToSend = 0; int numberSent = 0; // While the numberSent is 0, send a command to the hardware and set the // variable that is in a while loop to the value of the variable that is // in the for loop. int numberToSend = getCommandsFromFile(); // get the first command from the data file and assign to the variable that // is in a while loop. // while the variable that is in a while loop is 0, set the variable that is // in the for loop to 0 and get the next command from the data file. // get the next command from the data file and assign to the variable that // is in the for loop. // while the variable that is in the for loop is 0, get the next command // from the data file and assign to the variable that is in the for loop. // // get the next command from the data file and assign to the variable that // is in the for loop.
… and so on – getting more repetitive.
llama_perf_sampler_print: sampling time = 126.78 ms / 2039 runs ( 0.06 ms per token, 16083.61 tokens per second)
llama_perf_context_print: load time = 1268.42 ms
llama_perf_context_print: prompt eval time = 4238.18 ms / 39 tokens ( 108.67 ms per token, 9.20 tokens per second)
llama_perf_context_print: eval time = 236028.73 ms / 1999 runs ( 118.07 ms per token, 8.47 tokens per second)
llama_perf_context_print: total time = 240625.78 ms / 2038 tokens
It was definitely quicker than llama.cpp-powered LM Studio, but I don't really know how to use it yet to get sensible answers. It did tell me the marble was on the table once, lol, but I think that was a fluke.
It would be interesting to see how to use it in a simple "instruct" kind of way. Even if it's only half as accurate as the Q8 Llama 3.1 8B Instruct model but nearly 3x as quick, that might be useful. And if one day I can have something twice as accurate at a similar speed and lower power, even better. I have a bunch of older server hardware with lots of electricity-guzzling cores, and only the one RTX 4070 (hoping to get a loan to pay for a 5090 plus an upgraded PSU next year). I would like to run more local inference experiments with multiple LLMs.
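(One hedged idea on the "instruct" point above: if this 1.58-bit checkpoint respects the Llama 3 chat template, which is an assumption worth checking against the model card since the 100B-token fine-tune may not follow instructions well, you could build the prompt in that format before handing it to -p. A rough sketch:)

# Hypothetical helper: formats a prompt using the standard Llama 3 chat template.
def llama3_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

p = llama3_prompt("You are a concise, helpful assistant.",
                  "I put a marble on the table and walk away. Where is the marble?")
# Then pass p to run_inference.py via -p, with a smaller -n (e.g. 256) for short answers.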
Is Qwen supported?
I didn't catch how much RAM something like this 100b model would need. I'm guessing quite a lot.
If I understood BitNet correctly, the model is still 8B in size (a fixed parameter count) and was fine-tuned on 100B tokens (which doesn't change the size).
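(A rough back-of-envelope for the RAM questions above, assuming the weights dominate memory use and that the i2_s format packs ternary weights into about 2 bits each:)

params = 8e9                 # 8B parameters, per the comment above
bits_per_weight = 2          # i2_s stores ~2 bits per ternary weight (assumption)
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB for the weights alone")   # ~2.0 GB
# Real usage is higher: add the KV cache, quantization scales and runtime buffers.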
Really interesting. Are there any hardware recommendations (would a server CPU and a ton of RAM help here? If so, does the CPU need more cores or higher frequency, and would old DDR4 server memory be enough?), and how does its performance compare to a GPU setup?
Isn't 1-bit quantization equivalent to removing almost the entire brain?
Awesome 😊
What happens if the tensor is a texture? I was thinking of using this for tracking and depth estimation, so depth is a 32-bit texture. The performance looks amazing. If it can run an LLM, then inferring depth and YOLO is like drinking water, and I can keep the GPU for more attractive interactions!
What is the accuracy trade-off of 1-bit quantization?
Thanks for the model intro. During the GGUF conversion I'm getting an error: INFO:root:Converting HF model to GGUF format…
ERROR:root:Error occurred while running command: Command '['C:\Users\user\anaconda3\envs\bitnet-cpp\python.exe', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' returned non-zero exit status 3221225477., check details in logs\convert_to_f32_gguf.log
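(Not from the video, just a quick way to read that exit status: converting it to hex shows it is the Windows access-violation code, meaning the conversion process crashed rather than raising a Python error.)

print(hex(3221225477))   # 0xc0000005 -> STATUS_ACCESS_VIOLATION (the converter process crashed)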
The bitnet.cpp GitHub link isn't in the description. Also, great video, thanks!
How much RAM?
On it.
Thanks, all your videos are very helpful.
Sir thank you for your wonderful video!
Can we run inference on images with these BitNet models?