Llama 3.1 405b LOCAL AI Home Server on 7995WX Threadripper and 4090
Running a 405B LLM on your home server is possible? YES! Today we give this a spin on our 7995WX and we also test the 3090 …
Great channel. Have you done any videos on creating a small GPU cluster? Obviously using 4 GPUs in a single chassis is better than 4 single-GPU systems in a cluster, but I am wondering where the tipping point is. For a hobbyist seeking to upskill, it seems like learning to work with a cluster is worth the performance tradeoff, considering all deployments at scale are clusters.
I request a 4x Intel A770 version of your testing – I suspect Intel is a sleeping giant for 32B inference.
Two EPYC 7002/7003-series CPUs on one motherboard = 250-320 GB/s of RAM bandwidth. Two EPYC 9004-series CPUs on one motherboard = 400-500 GB/s.
Ya, 405B is crazy for any enthusiast build. That's for AI developers with multiple A** cards. 32B is the max for single-GPU boards. I hope in a couple more years all video cards are 32GB minimum, with 128GB enthusiast RTX 9990 / TITAN builds …
You use bitsandbytes to quantize this model to int4, right? Why doesn't this PC use all of its RAM and VRAM?
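For reference, loading a model with bitsandbytes int4 (NF4) through Hugging Face transformers looks roughly like the sketch below; the model ID and device-map settings are assumptions, not confirmation of what the video actually used.

```python
# Minimal sketch of bitsandbytes 4-bit (NF4) loading via transformers.
# The model ID and memory split are illustrative, not taken from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-405B-Instruct"  # assumed Hub ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate spill layers to CPU RAM when VRAM runs out
)
```

Even at int4, 405B parameters are on the order of 200GB of weights, so how much of that lands in VRAM versus system RAM depends entirely on how the device map splits the layers.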
Read about NUMA nodes.
What a shame: $10,000 and more for the machine, and you save on the cooler by using a last-gen Threadripper cooler.
We need to start rating AIs to give them a resource score that indicates how much the hardware to run them would cost.
Massive amounts of CPU and RAM aren't going to do much for running a single model on a GPU. The GPU is still doing all the work.
The difference would be seen in multi-agent setups where there's a lot of orchestration involved across several models and GPUs, or when operating on the CPU itself and not the GPU.
Great video! Any recommendations for a small vision model on a 1080 Ti with 11GB? If not, what hardware would you say is the minimum required?
Why was your 405B model not showing much RAM use during your run, with less than 30GB shown in use on the CPU side? Something is definitely wrong with your setup… maybe you were streaming from disk instead of RAM? Of course it doesn't matter for the 3B model… that thing fits entirely in your GPU and doesn't hit system RAM at all. Also, part of the reason you are slow when inferencing on the CPU is potentially NUMA and going across domains.
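To illustrate the NUMA point, one common mitigation is to pin the process to a single node's cores before the weights are loaded, so that first-touch allocation keeps the model in local memory. A minimal Linux-only sketch, assuming node 0 and that the runtime is launched from Python:

```python
# Sketch: pin this process to the CPUs of one NUMA node so that, under the
# default first-touch policy, the model weights get allocated in local memory.
# Linux-only; node 0 is an assumption about the machine's topology.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse a /sys cpulist such as '0-31,64-95' into a set of CPU ids."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus: set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

os.sched_setaffinity(0, cpus_of_node(0))            # restrict to node 0's cores
print("running on CPUs:", sorted(os.sched_getaffinity(0)))
# ...now load the model / start inference so threads and allocations stay local
```

The trade-off is that you give up the other node's cores and bandwidth; if you're on llama.cpp, its `--numa` option exists for the same kind of tuning.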
If you have it enabled, try disabling IOMMU, and make sure CPU tuning is set to "Maximum Performance" in the BIOS. It would also be interesting to see results with and without hyper-threading enabled.
0:18 For the love of everything holy, do not use a morph cut like this again, or I will have nightmares about it.
I honestly don't understand what any of this means lol. I just wish the hardware was a little bit more price friendly towards consumers. 😅
Hello
Thank you so much for your time and insights.
I am managing a team, and we are looking to build a server for ML.
Thank you for your previous video. May I ask where I can find the Threadripper configuration you are testing here? Is it a custom build?
Thanks
I think you should run the 7955WX if you have access to it; the 7995WX is not a good candidate for this scenario.
With some super heavy optimization, all the possible strategies to make it more efficient, and some delulu spirit, we can make it work, buddy.
Can you try with AirLLM?
This proves that the 3090 is still the best value for the money and will stay that way for some time. I just got my first 3090 last week. I managed to lower temps by changing the thermal pads and paste, from a throttling 85°C down to 79°C, and I'm happy with the purchase. For now I will only upgrade the RAM to 64GB with a Ryzen 9 5950X, which is the max for my current motherboard; maybe I'll buy a second 3090 (the motherboard theoretically allows it), and for small experiments and learning that will be enough for some time. For the next level I will probably build something on WRX90. The first question will be about the CPU – a 5955WX for $1000, or is a 3945WX for $200 OK? Will it be enough if the main work is done on the GPU?
These benchmarking videos are actually useful!
This video shows clearly that you shouldn't run the 405B model at home: even on a computer with the fastest CPU, the 405B model only answers after 20 minutes. What your home setup needs is a graphics card with a lot of VRAM. I wish graphics cards with 48GB of VRAM were manufactured and sold on older chipsets like the 3070 and 3080. And if it were possible, I'd love for the graphics card companies to make and sell cards with 480GB of VRAM.
When does your origin story arc begin?
You should run a benchmark like STREAM TRIAD to see what your actual memory bandwidth is for vector math vs theoretical bandwidth. On a recent test of a bunch of EPYC chips, many models have much lower TRIAD results vs their theoretical memory bandwidth.
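If building the full STREAM suite is too much trouble, a back-of-the-envelope TRIAD-style check can be done in NumPy, as in the sketch below. The array size and repeat count are arbitrary, and since NumPy runs these ufuncs on a single thread and cannot fuse the two passes, treat the result as a rough single-core sanity check rather than the all-core STREAM figure.

```python
# Rough TRIAD-style bandwidth estimate (a = b + scalar * c), not the official
# STREAM benchmark. Array size and repeat count are arbitrary choices.
import time
import numpy as np

n = 100_000_000          # 0.8 GB per float64 array, far larger than any cache
scalar = 3.0
a = np.zeros(n)
b = np.random.rand(n)
c = np.random.rand(n)

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    np.multiply(c, scalar, out=a)  # a = scalar * c   (read c, write a)
    a += b                         # a = a + b        (read a, read b, write a)
    best = min(best, time.perf_counter() - t0)

bytes_moved = 5 * 8 * n            # five 8-byte accesses per element across both passes
print(f"~{bytes_moved / best / 1e9:.0f} GB/s effective bandwidth")
```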
BTW, while it might not make for as entertaining videos, in general, if you use `llama-bench` as a standardized benchmark for all your different testing, I bet you'll get more useful/standardized results for your spreadsheet (I also recommend recording the build number for your tests).
For those interested, on a 24C EPYC 9274F (395GB/s TRIAD vs 460.8GB/s theoretical), with a Qwen2.5-32B q4_0 GGUF and llama-bench (b3985), CPU-only gives pp/tg of 24.7 t/s and 8 t/s. With a W7900 (ROCm) with a display attached, I get pp/tg of 656.5 t/s and 25.9 t/s. If I run with `-ngl 0`, using the Navi31 (gfx1100) GPU for compute but with the model loaded into system memory rather than VRAM, I am able to get pp/tg of 344.3 t/s and 7.6 t/s, which is actually not bad (e.g., even if you didn't have enough memory to load any layers onto the GPU, you'd still get a pretty huge compute benefit and barely any loss on MBW).
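On the llama-bench suggestion, a small wrapper like the hypothetical sketch below (binary path, model file, and parameter values are all placeholders) makes it easy to sweep CPU-only versus offloaded runs and collect everything into one CSV for the spreadsheet.

```python
# Hypothetical helper around llama.cpp's llama-bench for repeatable sweeps.
# Binary path, model path, and parameter values are assumptions.
import subprocess

LLAMA_BENCH = "./llama-bench"                      # assumed path to the binary
MODEL = "models/qwen2.5-32b-instruct-q4_0.gguf"    # assumed model file

with open("bench_results.csv", "ab") as out:
    for ngl in (0, 99):                            # CPU-only vs. fully offloaded
        result = subprocess.run(
            [LLAMA_BENCH, "-m", MODEL,
             "-p", "512",                          # prompt-processing test size
             "-n", "128",                          # token-generation test size
             "-ngl", str(ngl),                     # layers offloaded to the GPU
             "-o", "csv"],                         # machine-readable output
            check=True, capture_output=True,
        )
        out.write(result.stdout)                   # append this run's CSV block
```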
Is the MZ31-AR0 version 2.1 also okay for a rig like yours?
Great content!
My conclusion is that AI-at-home is profoundly I/O bound — MMA VRAM capacity specifically. Pat Gelsinger cancelled Beast Lake (a.k.a. "Royal Core") because he doesn't think the world needs powerful CPUs anymore. A capable CPU & nominally clocked RAM are required but nothing spicy seems to be necessary.
For I/O capability, TRP is the most I can afford. Entry into TRP is a 24-core CPU. That works.
4 x 5090s (128GB combined VRAM; NVIDIA's popular stack) seems to be a reasonable target for 2025-2026, albeit still insufficient for responsiveness with the larger models.
(1) engage an electrician to add 2 new 20A circuits & put in two new UPSs to my home office
(2) buy/build a rig capable of holding 4 x GPUs — 2 power supplies
(3) custom loop water cooling (external radiator?)
(4) get a second job & change my diet to Ramen-only to be able to pay for all of this
Maybe you should get a new SSD for the new chip and GPU and do a fresh install on it, then compare the two setups.
Yess I've wanted a video like this