4090 Local AI Server Benchmarks
How many tokens per second could you expect to hit with an RTX 4090 GPU on modern LLMs, and which would be the best …
Man, you have no idea how much help this video has been. There are so few reviews of the 4090 for LLMs. Awesome video, kudos!!
For the 5090, Nvidia should include a trolley
Awesome video. Would love to see an AMD MI60 in a rig! Older but cheap, and 32GB
The 4090 is that big? Or are you a small person? I'm confused.
The video I was looking for ❤. Thank you so much. Would it be possible to “cluster” these GPUs and potentially run larger models?
I love playing with Flux and some other models with a mid-to-high-end GPU… but your use case seems to be spending $5k to have two high-end GPUs tell you what you can cook with the contents of your pantry lol. Not hating on the love for crazy good hardware… but I don't understand the application of it in this case.
4090 almost bigger than you 😀
How much would you accept for 1 hour of consultation?
I thought this video was either a joke or that you were a midget until I realized video cards are actually that big now.
1:22 What kind of motherboard?
Because people will ask me – 3x 4060 Ti 16GB GPUs – here are some of my numbers: Llama 3.2 3B_instruct_fp16: 38 T/s on 1 GPU. Qwen2.5 32B_instruct Q6_K: 9.6 T/s on 3 GPUs. Qwen does cause Ollama to freak out, however, as noted – if anyone can suggest how to get these models working it would be appreciated. (I got "GGGGGGGGGG" as the output for the story question, and then it was unresponsive until reset.)
I wonder… next, AMD RDNA4 with 42GB VRAM? AMD, wake up already, hurry up.
🎉😅
I think it would be fun to see a GPU showdown for these AI tasks. Compare some Tesla GPUs and some budget consumer GPUs.
Can we use two RTX 3060 12GB cards so that combined we get 24GB of VRAM? One more question: how many TOPS will I get with a single RTX 3060 12GB card on average, irrespective of which model we are using?
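For what it's worth, llama.cpp (which Ollama builds on) can split a model's layers across two cards, so a pair of 12GB 3060s behaves roughly like a 24GB pool for model weights, minus some per-card overhead. A minimal sketch using llama-cpp-python, assuming a CUDA build and a local GGUF file (the model path below is hypothetical):

    # Sketch: splitting one model across two GPUs with llama-cpp-python.
    # Assumes llama-cpp-python was built with CUDA support and that both
    # GPUs are visible; the GGUF path is a placeholder, not from the video.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/qwen2.5-14b-instruct-q4_k_m.gguf",  # hypothetical path
        n_gpu_layers=-1,          # offload all layers to GPU
        tensor_split=[0.5, 0.5],  # put roughly half the layers/VRAM on each card
        n_ctx=4096,               # context window
    )

    out = llm("Explain KV cache memory use in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])

Note that splitting helps you fit a bigger model, but tokens/sec is still limited by each card's memory bandwidth, so two 3060s won't match a single 4090 on speed.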
I love your content
I've come down to this, for personal use anyway: it's not really about tokens/sec once you pass a certain threshold of, say, 9 or 10 – anything past that is gravy. It's more about memory usage on the GPU now, and running the larger models is still just out of reach for a single consumer card. And once you get past the consumer cards, it gets expensive real fast, and you need special cooling, etc. Finally, holy fat bottoms, Batman! That card is massive! I don't have a single case that would fit that monster, not even the Supermicro AI server I built, which is in a 4U rack. 32GB in the 4090 would be better, and really what we need is a cheaper H100 with modern cores in a consumer package.
Hi, another great video! Regarding the speed: is there a big difference in speed between the 4090 and the 3090, because of the extra CUDA cores?
I think I know why Qwen went crazy – by default, Ollama uses a 2048-token context limit, so I think Qwen exceeded the limit and couldn't see "limit to 5000 words" anymore, so it just kept going. In Open WebUI, you can set the context length.
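As a concrete illustration of that fix, here is a minimal sketch that raises num_ctx per request through Ollama's REST API, assuming Ollama is running locally on its default port; the model name and prompt are just placeholders:

    # Sketch: overriding Ollama's default 2048-token context for one request.
    # num_ctx is passed in the "options" field of the /api/generate call.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:32b",                       # placeholder model tag
            "prompt": "Write a short story, limited to 5000 words.",
            "stream": False,
            "options": {"num_ctx": 8192},                 # larger context window
        },
        timeout=600,
    )
    print(resp.json()["response"])

The same option can be set persistently via a Modelfile or in Open WebUI's model settings, as the comment above notes.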
If I was paying an outrageous amount to get a GPU early – I would much rather pay the scalper because most of them are just normal people trying to make a bit of cash on the side because life is hard. Nvidia is not hard up for money.
Current research shows the 4090 provides the best price per token compared to large-VRAM cards like the RTX 6000 Ada, A100, and H100/H200, and I would expect 5090s to follow this trend. The tricky part is fitting enough 4090s/5090s into a single machine for an equivalent amount of VRAM to load larger models like 70B without quantization.
I started out with a single-processor Epyc using PCIe switches but ran into motherboard PCIe resource limitations which prevented adding more than 12 cards, plus the PCIe switches always seemed to downgrade the number of lanes and lane speeds for no good reason. They're difficult to source, but switching to a dual Epyc with 20x MCIO gave me 10x 4090s at full PCIe 4.0 x16 each, which is sufficient for Llama 70B fp16 (rough sizing math below) with plenty of VRAM headroom to run additional tokenizer models and a vector DB, and is also killer for fine-tuning. Will soon have this converted to 16x 4090s at PCIe 4.0 x8 each, since most LLM inference engines prefer multiples of 2/4/8/16 and, unlike training, cross-GPU bus transfer is less of a concern with inference.
There is experimental research in llama.cpp (look in examples/rpc) and Exo that allows distributing model inference layers across multiple computers and GPUs. With 100G networking, that could make for an interesting video.
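To put rough numbers on the 70B-fp16 sizing mentioned above, a quick back-of-the-envelope sketch (weights only; exact figures vary by model and runtime, and KV cache and overhead come on top):

    import math

    # Approximate memory footprint of a 70B-parameter model at fp16.
    params = 70e9
    bytes_per_param = 2                            # fp16 = 2 bytes per parameter
    weights_gb = params * bytes_per_param / 1e9    # ~140 GB of weights
    per_card_gb = 24                               # RTX 4090 VRAM

    cards = math.ceil(weights_gb / per_card_gb)    # 6 cards just to hold the weights
    print(f"~{weights_gb:.0f} GB of weights -> at least {cards} x 24 GB cards, "
          f"plus headroom for KV cache and runtime overhead")

That works out to roughly six 24GB cards for the weights alone, which is why a 10-card rig leaves comfortable headroom for long contexts and auxiliary models.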
Great video! I downloaded both recommended models and they are super fast. Is there a site that lets one know what models will fit completely in a GPU and at certain quantization levels? Do you have a RAG video with Open WebUI and a 3090? Thanks