
4090 Local AI Server Benchmarks



How many tokens per second could you expect to hit with an RTX 4090 GPU on modern LLMs, and which would be the best …

20 Comments

  1. I love playing with Flux and some other models on a mid-to-high-end GPU… but your use case seems to be spending $5k to have two high-end GPUs tell you what you can cook with the contents of your pantry lol. Not hating on the love for crazy good hardware… but I don't understand the application of it in this case.

  2. Because people will ask me – 3x 4060 Ti 16GB GPUs – here are some of my numbers: Llama 3.2 3B_instruct_fp16: 38 T/s on 1 GPU. Qwen2.5 32b_instruct Q6_K: 9.6 T/s across 3 GPUs. Qwen does cause Ollama to freak out, however, as noted – if anyone can suggest how to get these models working, it would be appreciated. (I got "GGGGGGGGGG" as the output for the story question, and then it was unresponsive until reset.)

  3. Can we use 2 RTX 3060 12GB cards so combined we get 24GB of VRAM? One more question: how many TOPS will I get from a single RTX 3060 12GB card on average, irrespective of which model we are using?
    I love your content

  4. For personal use anyway, I've come down to this: it's not really about tokens/sec once you pass a certain threshold of, say, 9 or 10 – anything past that is gravy. It's more about memory usage on the GPU now, and running the larger models is still just out of reach for a single consumer card. And once you get past the consumer cards, it gets expensive real fast, and you need special cooling, etc. Finally, holy fat bottoms, Batman! That card is massive! I don't have a single case that would fit that monster, not even the Supermicro AI server I built, which is in a 4U rack! 32GB in the 4090 would be better, and really what we need is a cheaper H100 with modern cores and a consumer package.

  5. I think I know why Qwen went crazy – by default, Ollama uses a 2048-token context limit, so I think Qwen exceeded the limit and couldn't see "limit to 5000 words" anymore, so it just kept going. In Open WebUI, you can set the context length.
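
    For anyone who hit the same runaway-generation issue, here is a minimal sketch of raising the context window per request through Ollama's REST API instead of Open WebUI. It assumes a local Ollama server on its default port; the model tag and the 8192-token num_ctx value are placeholders to adjust for your own setup.

    ```python
    # Minimal sketch: override Ollama's default 2048-token context per request.
    # Assumes Ollama is running locally on its default port (11434); the model
    # tag and num_ctx value below are illustrative.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:32b-instruct-q6_K",  # adjust to whatever tag you pulled
            "prompt": "Write a story about a local AI server, limit to 5000 words.",
            "stream": False,
            "options": {"num_ctx": 8192},  # default is 2048; long outputs can overflow it
        },
        timeout=600,
    )
    print(resp.json()["response"])
    ```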

  6. If I were paying an outrageous amount to get a GPU early, I would much rather pay the scalper, because most of them are just normal people trying to make a bit of cash on the side because life is hard. Nvidia is not hard up for money.

  7. Current research shows the 4090 provides the best price per token compared to large-VRAM cards like the RTX 6000 Ada, A100, and H100/H200, and I would expect 5090s to follow this trend. The tricky part is fitting enough 4090s/5090s into a single machine for an equivalent amount of VRAM to load larger models like a 70B without quantization.

    I started out with a single-processor Epyc using PCIe switches but ran into motherboard PCI resource limitations that prevented adding more than 12 cards, plus the PCIe switches always seemed to downgrade the number of lanes and lane speeds for no good reason. They're difficult to source, but switching to a dual Epyc with 20x MCIO gave me 10x 4090s at full PCIe 4.0 x16 each, which is sufficient for Llama 70B fp16 with plenty of VRAM headroom to run additional tokenizer models and a vector DB, and is also killer for fine-tuning (see the rough numbers sketched below). I will soon have this converted to 16x 4090s at PCIe 4.0 x8 each, since most LLM inference engines prefer multiples of 2/4/8/16 and, unlike training, cross-GPU bus transfer is less of a concern for inference.

    There is experimental research in llama.cpp (look in examples/rpc) and Exo that allows distributing model inference layers across multiple computers and GPUs. With 100G networking, that could make for an interesting video.
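
    To put rough numbers on the VRAM headroom mentioned above, here is a minimal sketch. It assumes Llama 2/3 70B architecture figures (80 layers, 8 grouped-query KV heads, head dim 128), an 8192-token context, and ignores engine-specific runtime overhead, which varies.

    ```python
    # Back-of-envelope VRAM estimate: Llama 70B fp16 weights plus KV cache
    # versus 10x 24 GiB cards. Architecture constants match Llama 2/3 70B;
    # the context length and GPU count are assumptions for illustration.
    def weights_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
        return params_billion * 1e9 * bytes_per_param / 1024**3

    def kv_cache_gib(ctx_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                     head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
        # one K and one V vector per layer, per KV head, per token
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

    gpus, vram_per_gpu = 10, 24
    need = weights_gib(70) + kv_cache_gib(8192)
    have = gpus * vram_per_gpu
    print(f"need ~{need:.0f} GiB, have {have} GiB, headroom ~{have - need:.0f} GiB")
    # -> need ~133 GiB, have 240 GiB, headroom ~107 GiB
    ```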

  8. Great video! I downloaded both recommended models and they are super fast. Is there a site that lets one know which models will fit completely in a GPU at certain quantization levels? Do you have a RAG video with Open WebUI and a 3090? Thanks
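
    One way to get a quick answer without a site is the usual back-of-envelope check, sketched below. It assumes the weights dominate and applies a ~1.2x factor for KV cache and runtime buffers; that factor and the bits-per-weight figures are approximations, not exact numbers for any particular engine.

    ```python
    # Rough "will it fit?" check: model size at a given quantization versus GPU VRAM.
    # The 1.2x overhead factor (KV cache, CUDA context, buffers) is an assumption.
    def fits(params_billion: float, bits_per_weight: float,
             vram_gib: float, overhead: float = 1.2) -> bool:
        weights_gib = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
        return weights_gib * overhead <= vram_gib

    # Examples on a 24 GiB card such as a 3090 or 4090:
    print(fits(32, 6.6, 24))    # ~Q6_K 32B    -> False, spills past 24 GiB
    print(fits(32, 4.85, 24))   # ~Q4_K_M 32B  -> True, fits with some room
    print(fits(8, 16, 24))      # 8B at fp16   -> True
    ```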
