Linux
Bitnet.CPP – Run 100B Models on CPU – Easy Install on Windows, Linux, Mac
This video is an easy, step-by-step tutorial on locally installing bitnet.cpp from Microsoft, which enables you to run big AI models on CPU …
Do it on Windows; everyone uses Windows.
I don't understand why we need this; it's a stupid model if you have to knock it on the head to get a good answer.
Great job. Do you know if Stable Diffusion can be quantized @ 1.5b?
Can we process PDF documents?
Great! Thanks for sharing the knowledge. Do they expose any APIs the way Ollama does? And by the way, how do you make videos so frequently, man? You deserve more views and subscribers too. I subscribed already though 🙂
Hi, can this model look at a photo and tell us what's in it? Thanks for the information.
Thank you! Can I run these small, low-bit models using my 8GB GPU?
Dude thank you so much, this was very helpful
It's yet another quantization technique, but the catch here is how precise these models are. Can this Llama 3.1 even be compared to Llama 2 in terms of response quality, given that the precision is so heavily reduced?
I gave this a go in Windows but I don't really understand how to utilise it yet.
For example, in LM Studio I used the LM Studio community Llama 3.1 8B, Q8 Instruct GGUF model.
I gave it no GPU and two of the i7 Raptor Lake P-cores I'd assigned to my Windows VM (in Proxmox).
I tried a bunch of stuff with it.
I asked it this, with temp set at 0.8, no system prompt, and other settings at default (auto settings in LM Studio for this model):
Please write a short story about using a bitlinear 1-bit bitnet.cpp AI transformer to compete against a 6-bit quantised 8B LLM on llama.cpp.
And it gave me a nice little story at about 3.56 tokens/sec … 613 tokens in total.
Then, in a Visual Studio 2022 Developer PowerShell terminal:
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Please write a short story about using a bitlinear 1-bit bitnet.cpp AI transformer to compete against a 6-bit quantised 8B LLM on llama.cpp.\nAnswer:" -n 2000 -temp 0.8
(Setting the number of tokens to predict to 2000, if I understand that right.)
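(Side note for anyone copying that command: here's a minimal sketch of wrapping it in Python so the newline in the prompt survives shell quoting. The script name and the -m/-p/-n/-temp flags come straight from the command above; the working directory and everything else is my own assumption.)

import subprocess  # standard library only

# Assumes this runs from the BitNet repo root, next to run_inference.py.
prompt = ("Please write a short story about using a bitlinear 1-bit "
          "bitnet.cpp AI transformer to compete against a 6-bit quantised "
          "8B LLM on llama.cpp.\nAnswer:")  # real newline, no shell-escaping issues

subprocess.run([
    "python", "run_inference.py",
    "-m", "models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf",
    "-p", prompt,
    "-n", "2000",    # tokens to predict
    "-temp", "0.8",  # sampling temperature
], check=True)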
Answer: Write a program that creates a list of random numbers, sorts the list, and finds the maximum value. Write another program to find the minimum value in the list. 2.
Create a list of random numbers and find the maximum. 3.
Create a list of random numbers and find the minimum. 4. Write a program that generates a list of random numbers, finds the median value, and sorts the list. 5.When asked to solve a problem, many people will jump at the chance to write a program. It is a form of expression that is easy to write, easy to understand and easy to explain. It is easy to express an algorithm in a form that other people can understand. This can be very useful in a situation where you are trying to explain a complex topic to someone else. Here is a problem I recently had to solve. I had a need to send a series of commands to a piece of hardware, and I did not want to write a lot of code. I wanted to express my idea in a form that would be easy to understand. My idea was to send a series of commands to the hardware that was based on the following process.
1. Get a list of commands from a data file.
2. Send a series of commands to the hardware.
3. Continue until the hardware sends the series of commands back to the computer.
4. Wait for a confirmation that the hardware has sent the series of commands.
I used the following functions to create the program to do the above. The commands that are sent to the hardware are the values of the variable that is in a for loop. The values of the variable are the values of the variable that are in a while loop. The variable is a variable that starts out as the value of the variable.
int getCommandsFromFile() const int numberToSend = 0; int numberSent = 0; // While the numberSent is 0, send a command to the hardware and set the // variable that is in a while loop to the value of the variable that is // in the for loop. int numberToSend = getCommandsFromFile(); // get the first command from the data file and assign to the variable that // is in a while loop. // while the variable that is in a while loop is 0, set the variable that is // in the for loop to 0 and get the next command from the data file. // get the next command from the data file and assign to the variable that // is in the for loop. // while the variable that is in the for loop is 0, get the next command // from the data file and assign to the variable that is in the for loop. // // get the next command from the data file and assign to the variable that // is in the for loop.
… and so on – getting more repetitive.
llama_perf_sampler_print: sampling time = 126.78 ms / 2039 runs ( 0.06 ms per token, 16083.61 tokens per second)
llama_perf_context_print: load time = 1268.42 ms
llama_perf_context_print: prompt eval time = 4238.18 ms / 39 tokens ( 108.67 ms per token, 9.20 tokens per second)
llama_perf_context_print: eval time = 236028.73 ms / 1999 runs ( 118.07 ms per token, 8.47 tokens per second)
llama_perf_context_print: total time = 240625.78 ms / 2038 tokens
It was definitely quicker than llama.cpp-powered LM Studio, but I don't really know how to use it yet to get sensible answers. It did tell me the marble was on the table once, lol, but I think that was a fluke.
It would be interesting to see how to use it in a simple "instruct" kind of way. Even if it's only half as accurate as the Q8 Llama 3.1 8B Instruct model but nearly 3x as quick, that might be useful. And if one day I can have something twice as accurate at a similar speed and lower power, even better. I have a bunch of older server hardware with lots of electricity-guzzling cores, and only the one RTX 4070 (hoping to get a loan to pay for a 5090 plus an upgraded PSU next year). I would like to run more local inference experiments with multiple LLMs.
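(One hedged idea on the "instruct" point above: if this 1.58-bit checkpoint respects the Llama 3 chat template, which is an assumption worth checking against the model card since the 100B-token fine-tune may not follow instructions well, you could build the prompt in that format before handing it to -p. A rough sketch:)

# Hypothetical helper: formats a prompt using the standard Llama 3 chat template.
def llama3_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

p = llama3_prompt("You are a concise, helpful assistant.",
                  "I put a marble on the table and walk away. Where is the marble?")
# Then pass p to run_inference.py via -p, with a smaller -n (e.g. 256) for short answers.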
Is Qwen supported?
I didn't catch how much RAM something like this 100b model would need. I'm guessing quite a lot.
If I understood BitNet correctly, the model is still 8B in size (a fixed parameter count) and was fine-tuned on 100B tokens (which doesn't change the size).
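(A rough back-of-envelope for the RAM questions above, assuming the weights dominate memory use and that the i2_s format packs ternary weights into about 2 bits each:)

params = 8e9                 # 8B parameters, per the comment above
bits_per_weight = 2          # i2_s stores ~2 bits per ternary weight (assumption)
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB for the weights alone")   # ~2.0 GB
# Real usage is higher: add the KV cache, quantization scales and runtime buffers.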
Really interesting. Are there any hardware recommendations (would a server CPU and a ton of RAM help here? If so, does the CPU need more cores or higher frequency, and would old DDR4 server memory be enough?), and how does its performance compare to a GPU setup?
Isn't 1-bit quantization equivalent to removing almost the entire brain?
Awesome 😊
What happens if the tensor is a texture? I was thinking of using this for tracking and depth estimation, so depth is a 32-bit texture. The performance looks amazing. If it can run an LLM, then inferring depth and YOLO is like drinking water, and I can keep the GPU for more attractive interactions!
What is the accuracy trade-off of 1-bit quantization?
Thanks for the model intro. During the GGUF conversion I'm getting an error: INFO:root:Converting HF model to GGUF format…
ERROR:root:Error occurred while running command: Command '['C:\Users\user\anaconda3\envs\bitnet-cpp\python.exe', 'utils/convert-hf-to-gguf-bitnet.py', 'models/Llama3-8B-1.58-100B-tokens', '--outtype', 'f32']' returned non-zero exit status 3221225477., check details in logs\convert_to_f32_gguf.log
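(Not from the video, just a quick way to read that exit status: converting it to hex shows it is the Windows access-violation code, meaning the conversion process crashed rather than raising a Python error.)

print(hex(3221225477))   # 0xc0000005 -> STATUS_ACCESS_VIOLATION (the converter process crashed)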
The bitnet.cpp GitHub link isn't in the description. Also, great video, thanks!
How much RAM?
On it.
Thanks, all your videos are very helpful.
Sir thank you for your wonderful video!
Can we run inference on images with these BitNet models?