Llama 3.1 405b LOCAL AI Home Server on 7995WX Threadripper and 4090
Running a 405B LLM on your home server is possible? YES! Today we give this a spin on our 7995WX and we also test the 3090 …
Great channel. Have you done any videos on creating a small GPU cluster? Obviously using 4 GPUs in a single chassis is better than 4 single-GPU systems in a cluster, but I am wondering where the tipping point is. For a hobbyist seeking to upskill, it seems like learning to work with a cluster is worth the performance tradeoff, considering all deployments at scale are clusters.
I request a 4x Intel A770 version of your testing – I suspect Intel is a sleeping giant for 32B inference.
Two EPYC 7002/7003-series CPUs on one motherboard = 250-320 GB/s of RAM bandwidth. Two EPYC 9004-series CPUs on one motherboard = 400-500 GB/s.
Ya, 405B is crazy for any enthusiast build. That's for AI developers with multiple A** cards. 32B is the max for single-GPU boards. I hope in a couple more years all video cards are 32GB minimum, with 128GB enthusiast RTX 9990 / TITAN builds …
You use bitsandbytes to quantize this model to int4, right? Why doesn't this PC use all of its RAM and VRAM?
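For reference, loading a model with bitsandbytes int4 (NF4) through Hugging Face transformers looks roughly like the sketch below; the model ID and device-map settings are assumptions, not confirmation of what the video actually used.

```python
# Minimal sketch of bitsandbytes 4-bit (NF4) loading via transformers.
# The model ID and memory split are illustrative, not taken from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-405B-Instruct"  # assumed Hub ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate spill layers to CPU RAM when VRAM runs out
)
```

Even at int4, 405B parameters are on the order of 200GB of weights, so how much of that lands in VRAM versus system RAM depends entirely on how the device map splits the layers.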
Read about NUMA nodes.
What a shame: $10,000 and more for the machine, and you save on the cooler by using a last-gen Threadripper cooler.
We need to start rating AIs to give them a resource score that indicates how much the hardware to run them would cost.
Massive amounts of CPU and RAM aren't going to do much for running a single model on a GPU. The GPU is still doing all the work.
The difference would be seen in multi-agent setups where there's a lot of orchestration involved across several models and GPUs, or when operating on the CPU itself and not the GPU.
Great video! Any recommendations for a small vision model on a 1080 Ti with 11GB? If not, what hardware would you say is the minimum required?
Why was your 405B model not showing much RAM use during your run, with less than 30GB shown in use on the CPU side? Something is definitely wrong with your setup… maybe you were streaming from disk instead of RAM? Of course it doesn't matter for the 3B model… that thing fits entirely in your GPU and doesn't hit system RAM at all. Also, part of the reason you are slow when inferencing on the CPU is potentially NUMA and going across domains.
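To illustrate the NUMA point, one common mitigation is to pin the process to a single node's cores before the weights are loaded, so that first-touch allocation keeps the model in local memory. A minimal Linux-only sketch, assuming node 0 and that the runtime is launched from Python:

```python
# Sketch: pin this process to the CPUs of one NUMA node so that, under the
# default first-touch policy, the model weights get allocated in local memory.
# Linux-only; node 0 is an assumption about the machine's topology.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse a /sys cpulist such as '0-31,64-95' into a set of CPU ids."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus: set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

os.sched_setaffinity(0, cpus_of_node(0))            # restrict to node 0's cores
print("running on CPUs:", sorted(os.sched_getaffinity(0)))
# ...now load the model / start inference so threads and allocations stay local
```

The trade-off is that you give up the other node's cores and bandwidth; if you're on llama.cpp, its `--numa` option exists for the same kind of tuning.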
If you have it enabled, try disabling IOMMU, and make sure CPU tuning is set to "Maximum Performance" in the BIOS. It would also be interesting to see results with and without hyper-threading enabled.
0:18 For the love of everything holy, do not use a morph cut like this again, or I will have nightmares about it.
I honestly don't understand what any of this means lol. I just wish the hardware was a little bit more price friendly towards consumers. 😅
Hello
Thank you so much for your time and insights.
I am managing a team, and we are looking to build a server for ML.
Thank you for your previous video. May I ask where I can find the Threadripper configuration you are testing here? Is it a custom build?
Thanks
I think you should run the 7955WX if you have access to it; the 7995WX is not a good candidate for this scenario.
With some super heavy optimization, all the possible strategies to make it more efficient, and some delulu spirit, we can make it work, buddy.
Can you try with AirLLM?
This proves that the 3090 is still the best value for the money and will stay that way for some time. I just got my first 3090 last week. I managed to lower temps by changing the thermal pads and paste, from a throttling 85°C down to 79°C, and I'm happy with the purchase. For now I will only upgrade the RAM to 64GB with a Ryzen 9 5950X, which is the max for my current motherboard; maybe I'll buy a second 3090 (the motherboard theoretically allows it), and for small experiments and learning that will be enough for some time. For the next level I will probably build something on WRX90. The first question will be about the CPU – a 5955WX for $1000, or is a 3945WX for $200 OK? Will it be enough if the main work is done on the GPU?
These benchmarking videos are actually useful!
This video shows clearly that you shouldn't run the 405B model at home: even on a computer with the fastest CPU, the 405B model only answers after 20 minutes. What your home setup needs is a graphics card with a lot of VRAM. I wish graphics cards with 48GB of VRAM were manufactured and sold on older chipsets like the 3070 and 3080. And if it were possible, I'd love for the graphics card companies to make and sell cards with 480GB of VRAM.
When does your origin story arc begin?
You should run a benchmark like STREAM TRIAD to see what your actual memory bandwidth is for vector math vs theoretical bandwidth. On a recent test of a bunch of EPYC chips, many models have much lower TRIAD results vs their theoretical memory bandwidth.
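If building the full STREAM suite is too much trouble, a back-of-the-envelope TRIAD-style check can be done in NumPy, as in the sketch below. The array size and repeat count are arbitrary, and since NumPy runs these ufuncs on a single thread and cannot fuse the two passes, treat the result as a rough single-core sanity check rather than the all-core STREAM figure.

```python
# Rough TRIAD-style bandwidth estimate (a = b + scalar * c), not the official
# STREAM benchmark. Array size and repeat count are arbitrary choices.
import time
import numpy as np

n = 100_000_000          # 0.8 GB per float64 array, far larger than any cache
scalar = 3.0
a = np.zeros(n)
b = np.random.rand(n)
c = np.random.rand(n)

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    np.multiply(c, scalar, out=a)  # a = scalar * c   (read c, write a)
    a += b                         # a = a + b        (read a, read b, write a)
    best = min(best, time.perf_counter() - t0)

bytes_moved = 5 * 8 * n            # five 8-byte accesses per element across both passes
print(f"~{bytes_moved / best / 1e9:.0f} GB/s effective bandwidth")
```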
BTW, while it might not make for as entertaining videos, in general, if you use `llama-bench` as a standardized benchmark for all your different testing, I bet you'll get more useful/standardized results for your spreadsheet (I also recommend recording the build number for your tests).
For those interested, on a 24C EPYC 9274F (395GB/s TRIAD vs 460.8GB/s theoretical), with a Qwen2.5-32B q4_0 GGUF and llama-bench (b3985), CPU-only gives pp/tg of 24.7 t/s and 8 t/s. With a W7900 (ROCm) with a display attached, I get pp/tg of 656.5 t/s and 25.9 t/s. If I run with `-ngl 0`, using the Navi31 (gfx1100) GPU for compute but with the model loaded into system memory rather than VRAM, I am able to get pp/tg of 344.3 t/s and 7.6 t/s, which is actually not bad (e.g., even if you didn't have enough memory to load any layers onto the GPU, you'd still get a pretty huge compute benefit and barely any loss on MBW).
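On the llama-bench suggestion, a small wrapper like the hypothetical sketch below (binary path, model file, and parameter values are all placeholders) makes it easy to sweep CPU-only versus offloaded runs and collect everything into one CSV for the spreadsheet.

```python
# Hypothetical helper around llama.cpp's llama-bench for repeatable sweeps.
# Binary path, model path, and parameter values are assumptions.
import subprocess

LLAMA_BENCH = "./llama-bench"                      # assumed path to the binary
MODEL = "models/qwen2.5-32b-instruct-q4_0.gguf"    # assumed model file

with open("bench_results.csv", "ab") as out:
    for ngl in (0, 99):                            # CPU-only vs. fully offloaded
        result = subprocess.run(
            [LLAMA_BENCH, "-m", MODEL,
             "-p", "512",                          # prompt-processing test size
             "-n", "128",                          # token-generation test size
             "-ngl", str(ngl),                     # layers offloaded to the GPU
             "-o", "csv"],                         # machine-readable output
            check=True, capture_output=True,
        )
        out.write(result.stdout)                   # append this run's CSV block
```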
Is the MZ31-AR0 version 2.1 also okay for a rig like yours?
Great content!
My conclusion is that AI-at-home is profoundly I/O bound — MMA VRAM capacity specifically. Pat Gelsinger cancelled Beast Lake (a.k.a. "Royal Core") because he doesn't think the world needs powerful CPUs anymore. A capable CPU & nominally clocked RAM are required but nothing spicy seems to be necessary.
For I/O capability, TRP is the most I can afford. Entry into TRP is a 24-core CPU. That works.
4 x 5090s (128GB combined VRAM; NVIDIA's popular stack) seems to be a reasonable target for 2025-2026, albeit still insufficient for responsiveness with the larger models.
(1) engage an electrician to add 2 new 20A circuits & put in two new UPSs to my home office
(2) buy/build a rig capable of holding 4 x GPUs — 2 power supplies
(3) custom loop water cooling (external radiator?)
(4) get a second job & change my diet to Ramen-only to be able to pay for all of this
Maybe you should get a new SSD for the new chip and GPU and do a fresh install on it, then compare the two setups.
Yess I've wanted a video like this