
4090 Local AI Server Benchmarks



How many tokens per second could you expect to hit with an RTX 4090 GPU on modern LLMs, and which would be the best …

20 Comments

  1. I love playing with Flux and some other models on a mid-to-high-end GPU… but your use case seems to be spending $5k to have two high-end GPUs tell you what you can cook with the contents of your pantry lol. Not hating on the love for crazy good hardware… but I don't understand the application of it in this case.

  2. Because people will ask me – 3x 4060 Ti 16GB GPUs – here are some of my numbers: Llama 3.2 3B_instruct_fp16: 38 T/s on 1 GPU. Qwen2.5 32b_instruct Q6_K: 9.6 T/s across 3 GPUs. Qwen does cause Ollama to freak out, however, as noted – if anyone can suggest how to get these models working, it would be appreciated. (I got "GGGGGGGGGG" as the output for the story question, and then it was unresponsive until reset.)

  3. Can we use 2 RTX 3060 12GB cards so combined we get 24GB of VRAM? One more question: how many TOPS will I get from a single RTX 3060 12GB card on average, irrespective of which model we are using?
    I love your content

  4. For personal use anyway, I've come down to this: it's not really about tokens/sec once you pass a certain threshold of, say, 9 or 10 – anything past that is gravy. It's more about memory usage on the GPU now, and running the larger models is still just out of reach for a single consumer card. And once you get past the consumer cards, it gets expensive real fast, and you need special cooling, etc. Finally, holy fat bottoms, Batman! That card is massive! I don't have a single case that would fit that monster, not even the Supermicro AI server I built, which is in a 4U rack! 32GB in the 4090 would be better, and really what we need is a cheaper H100 with modern cores and a consumer package.

  5. I think I know why Qwen went crazy – by default, Ollama uses a 2048-token context limit, so I think Qwen exceeded the limit and couldn't see "limit to 5000 words" anymore, so it just kept going. In Open WebUI, you can set the context length.
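
    For anyone who hit the same runaway-generation issue, here is a minimal sketch of raising the context window per request through Ollama's REST API instead of Open WebUI. It assumes a local Ollama server on its default port; the model tag and the 8192-token num_ctx value are placeholders to adjust for your own setup.

    ```python
    # Minimal sketch: override Ollama's default 2048-token context per request.
    # Assumes Ollama is running locally on its default port (11434); the model
    # tag and num_ctx value below are illustrative.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:32b-instruct-q6_K",  # adjust to whatever tag you pulled
            "prompt": "Write a story about a local AI server, limit to 5000 words.",
            "stream": False,
            "options": {"num_ctx": 8192},  # default is 2048; long outputs can overflow it
        },
        timeout=600,
    )
    print(resp.json()["response"])
    ```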

  6. If I were paying an outrageous amount to get a GPU early, I would much rather pay the scalper, because most of them are just normal people trying to make a bit of cash on the side because life is hard. Nvidia is not hard up for money.

  7. Current research shows the 4090 provides the best price per token compared to large-VRAM cards like the RTX 6000 Ada, A100, and H100/H200, and I would expect 5090s to follow this trend. The tricky part is fitting enough 4090s/5090s into a single machine for an equivalent amount of VRAM to load larger models like a 70B without quantization.

    I started out with a single-processor Epyc using PCIe switches but ran into motherboard PCI resource limitations that prevented adding more than 12 cards, plus the PCIe switches always seemed to downgrade the number of lanes and lane speeds for no good reason. They're difficult to source, but switching to a dual Epyc with 20x MCIO gave me 10x 4090s at full PCIe 4.0 x16 each, which is sufficient for Llama 70B fp16 with plenty of VRAM headroom to run additional tokenizer models and a vector DB, and is also killer for fine-tuning (see the rough numbers sketched below). I will soon have this converted to 16x 4090s at PCIe 4.0 x8 each, since most LLM inference engines prefer multiples of 2/4/8/16 and, unlike training, cross-GPU bus transfer is less of a concern for inference.

    There is experimental research in llama.cpp (look in examples/rpc) and Exo that allows distributing model inference layers across multiple computers and GPUs. With 100G networking, that could make for an interesting video.
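
    To put rough numbers on the VRAM headroom mentioned above, here is a minimal sketch. It assumes Llama 2/3 70B architecture figures (80 layers, 8 grouped-query KV heads, head dim 128), an 8192-token context, and ignores engine-specific runtime overhead, which varies.

    ```python
    # Back-of-envelope VRAM estimate: Llama 70B fp16 weights plus KV cache
    # versus 10x 24 GiB cards. Architecture constants match Llama 2/3 70B;
    # the context length and GPU count are assumptions for illustration.
    def weights_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
        return params_billion * 1e9 * bytes_per_param / 1024**3

    def kv_cache_gib(ctx_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                     head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
        # one K and one V vector per layer, per KV head, per token
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

    gpus, vram_per_gpu = 10, 24
    need = weights_gib(70) + kv_cache_gib(8192)
    have = gpus * vram_per_gpu
    print(f"need ~{need:.0f} GiB, have {have} GiB, headroom ~{have - need:.0f} GiB")
    # -> need ~133 GiB, have 240 GiB, headroom ~107 GiB
    ```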

  8. Great video! I downloaded both recommended models and they are super fast. Is there a site that lets one know which models will fit completely in a GPU at certain quantization levels? Do you have a RAG video with Open WebUI and a 3090? Thanks
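
    One way to get a quick answer without a site is the usual back-of-envelope check, sketched below. It assumes the weights dominate and applies a ~1.2x factor for KV cache and runtime buffers; that factor and the bits-per-weight figures are approximations, not exact numbers for any particular engine.

    ```python
    # Rough "will it fit?" check: model size at a given quantization versus GPU VRAM.
    # The 1.2x overhead factor (KV cache, CUDA context, buffers) is an assumption.
    def fits(params_billion: float, bits_per_weight: float,
             vram_gib: float, overhead: float = 1.2) -> bool:
        weights_gib = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
        return weights_gib * overhead <= vram_gib

    # Examples on a 24 GiB card such as a 3090 or 4090:
    print(fits(32, 6.6, 24))    # ~Q6_K 32B    -> False, spills past 24 GiB
    print(fits(32, 4.85, 24))   # ~Q4_K_M 32B  -> True, fits with some room
    print(fits(8, 16, 24))      # 8B at fp16   -> True
    ```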
