LocalAI LLM Testing: i9 CPU vs Tesla M40 vs 4060Ti vs A4500

Sitting down to run some tests with an i9-9820X, Tesla M40 (24GB), 4060 Ti (16GB), and an A4500 (20GB)
Rough edit in lab session
Recorded and best viewed in 4K

Comments: 33

  • @andrewowens5653 · 1 month ago

    Thank you. It would be interesting to see some evaluation of multiple consumer GPUs working on the same LLM.

  • @RoboTFAI · 1 month ago

    I have another video of testing 1, 2, 3, 4, and 6 4060s (which I consider consumer level) together on the same LLM here - kzread.info/dash/bejne/jKlmm66Be7urmtY.html - but if you have more specific ideas, please let me know.

  • @fooboomoo · 1 month ago

    Great content, and relevant to me since I recently bought a 4060 Ti 16GB for AI.

  • @RoboTFAI · 29 days ago

    Thanks for watching!

  • @nithinbhandari3075 · 23 days ago

    Thanks for comparing the different GPU hardware. Can you run a test with something like 6k input tokens and 1k output tokens? That way we can see how a large LLM performs under a 6k-input / 1k-output workload.

  • @RoboTFAI · 22 days ago

    Yeah, we can absolutely run some tests with much larger prompts/etc.!
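    (For anyone who wants to try this at home: a minimal sketch of a long-prompt throughput test against LocalAI's OpenAI-compatible API. The endpoint URL, model name, and filler prompt are assumptions, not the channel's actual test harness.)

    ```python
    # Minimal long-context throughput test against a LocalAI endpoint (sketch only).
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # LocalAI ignores the key

    prompt = "Summarize the following text.\n" + ("lorem ipsum " * 3000)  # roughly 6k tokens of filler

    start = time.time()
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instruct",   # hypothetical model name configured in LocalAI
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,                 # ~1k output tokens
    )
    elapsed = time.time() - start

    out_tokens = resp.usage.completion_tokens
    print(f"{out_tokens} output tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.2f} tok/s")
    ```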

  • @fulldivemedia · 25 days ago

    Great content. My problem is choosing an AM5 motherboard. I have three I've got my eye on, but I don't know which one is more future-proof: MSI MEG X670E ACE, ASUS ProArt X670E, or ASUS ROG Strix X670E-E Gaming. Can you help? I want it mostly for AI art and such. The MSI costs more; the ROG and ProArt are the same price (but I still don't know which of those two is better: the ProArt runs two PCIe slots at x8/x8, while the ROG is x8/x4). Is the MSI better than the ProArt?

  • @jeroenadamdevenijn4067 · 1 month ago

    If I run Codestral 22B Q4_K_M on my P5000 (Pascal architecture), I get 11 t/s evaluation, so that means the P5000 performs at around 75% of a 4060 Ti. But when I open Nvidia Power Management I can observe it only consumes 140W under load, while it should be able to go up to 180W. BTW, both these cards have 288GB/s memory bandwidth. I must have a bottleneck in my system, which is an Intel 11th-gen i7 laptop (4-core CPU) with the eGPU over Thunderbolt 3.

  • @RoboTFAI · 1 month ago

    That's pretty decent speed for that setup.

  • @jeroenadamdevenijn4067 · 1 month ago

    @RoboTFAI It does slow down with larger context though, let's say 8-9 t/s, and when I go for Q5_K_S that becomes 7-8 t/s. Still doable.

  • @stevenwhiting2452 · 1 month ago

    Play with your data chunk sizes; it's usually unoptimised memory movement that limits throughput. Nvidia has a tutorial that explains CUDA much better than I can. The P40 and P100 do the same thing on some models too.
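    (Illustrative only: a small PyTorch sketch of the chunked, pinned-memory host-to-device copy idea mentioned above. The chunk size and tensor shapes are arbitrary, and this is not how llama.cpp moves its buffers internally.)

    ```python
    # Chunked host-to-device copies from pinned memory (sketch; sizes are arbitrary).
    import torch

    assert torch.cuda.is_available()

    data = torch.randn(64_000_000)               # ~256 MB of fp32 on the host
    pinned = data.pin_memory()                   # page-locked memory enables async DMA copies
    device_buf = torch.empty_like(data, device="cuda")

    chunk = 4_000_000                            # tune: too small = launch overhead, too large = stalls
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        for i in range(0, pinned.numel(), chunk):
            device_buf[i:i + chunk].copy_(pinned[i:i + chunk], non_blocking=True)
    torch.cuda.synchronize()
    print("copied", device_buf.numel(), "elements")
    ```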

  • @six1free · 1 month ago

    So I swung a 4060 laptop and a 4070 Ti Super and have spent the last couple of days migrating my PC into an AI server. I haven't gotten to the AI yet, but in the meanwhile I'm putting the warranties to the test with some hardcore mining, almost nostalgic for when Bitcoin was $10/BTC. I am realizing the 16GB of VRAM is a bit of a bottleneck though. Do you think adding an M40 or two would help? Will the GPUs be able to crosstalk each other's VRAM?

  • @RoboTFAI · 1 month ago

    Yes, and I will answer some of this question in the next video! Mixing GPUs / tensor splitting.
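    (A taste of what tensor splitting looks like in practice: a minimal sketch with llama-cpp-python. The model path and split ratios are assumptions; the ratios just mirror each card's share of total VRAM.)

    ```python
    # Splitting one model across two mismatched GPUs with llama-cpp-python (sketch).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/codestral-22b-q4_k_m.gguf",  # hypothetical local GGUF file
        n_gpu_layers=-1,             # offload every layer to the GPUs
        tensor_split=[0.55, 0.45],   # e.g. a 20GB + 16GB pair -> roughly proportional VRAM split
        n_ctx=8192,
    )
    out = llm("Q: What limits multi-GPU LLM speed?\nA:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```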

  • @six1free · 1 month ago

    @RoboTFAI Sweet, sounds like a good video.

  • @georgepongracz3282 · 18 days ago

    It would be interesting to compare a 4070 Ti Super to the 4060 Ti, to see if the scaling is proportional to cost.

  • @RoboTFAI · 18 days ago

    I don't have one to test with, but if you want to send me one I'm happy to throw it through the gauntlet, hahaha.

  • @mohammdmodan5038 · 20 days ago

    I'm planning to buy a GPU and I have two choices: P100 and M40 24GB. I want to run an 8B model - is that enough for it? Currently I have a Ryzen 5 3600, 16GB DDR4, and a 1TB NVMe.

  • @mohammdmodan5038 · 20 days ago

    You have an M40, right? Can you provide tokens/s?

  • @RoboTFAI · 20 days ago

    The P100 is Pascal architecture and newer than the M40, which is Maxwell architecture, so I would always recommend the newer card, depending of course on your budget and needs. Both will be power hungry. Llama 3.1 8B? Depends on context size... it defaults to 128k, which is going to be heavy on your VRAM depending on quant/etc. To give an idea, Meta publishes this as a guide (taken from huggingface.co/blog/llama31) on just context size vs KV cache size. You still have to load the model, other layers, etc.:

        Model size    1k tokens    16k tokens    128k tokens
        8B            0.125 GB     1.95 GB       15.62 GB
        70B           0.313 GB     4.88 GB       39.06 GB
        405B          0.984 GB     15.38 GB      123.05 GB

    I actually have 3 old M40s sitting around in the lab, as that is where I started my AI journey over a year ago! So yeah, I can do testing with them.
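    (For anyone curious where those numbers come from, here is a back-of-the-envelope sketch, assuming Llama 3.1 8B's published dimensions of 32 layers, 8 KV heads via grouped-query attention, head dim 128, and an fp16 cache; it roughly reproduces the 8B row of the table above.)

    ```python
    # Rough KV-cache size estimate; dimensions assumed for Llama 3.1 8B (32 layers,
    # 8 KV heads via GQA, head dim 128, fp16 cache). Approximates Meta's table above.
    def kv_cache_gib(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # 2 = one K + one V entry
        return n_tokens * per_token / 1024**3

    for ctx in (1_000, 16_000, 128_000):
        print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.2f} GiB")
    # ~0.12, ~1.95, ~15.62 GiB for the 8B model
    ```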

  • @tsclly2377 · 1 month ago

    P40 vs 3090 Ti... just because there is so much of a price difference. And what can you get in loading speeds if your files are on a P900 Optane (280GB)? [Assuming one is setting up batch processing.]

  • @RoboTFAI · 1 month ago

    I don't have either card to do testing with; I'll ask around friends/etc. Or I might try to trade for a 3090, since everyone goes after them for their rigs... power hungry though.

  • @fulldivemedia · 25 days ago

    thanks

  • @RoboTFAI · 24 days ago

    You're welcome!

  • @jackflash6377 · 1 month ago

    A4500 vs RTX 3090?

  • @RoboTFAI · 22 days ago

    Attempting to acquire a 3090 for the channel, stand by!

  • @marsrocket · 18 days ago

    Llama 3 7B runs in near real-time on an Apple M1 processor, and presumably faster on an M2 or M3.

  • @RoboTFAI · 18 days ago

    It does. I haven't brought Apple Silicon into the mix on the channel just yet, but I have a few M1 and M1 Max machines as my daily drivers.

  • @donaldrudquist · 29 days ago

    What application are you using to run this?

  • @RoboTFAI · 29 days ago

    It's custom-built by me - a combo of Streamlit, Python, LangChain, etc.
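    (Not the channel's actual app, but a minimal sketch of the same combo: a Streamlit chat front-end using LangChain to talk to a LocalAI server. The endpoint URL and model name are assumptions; you would run it with `streamlit run app.py`.)

    ```python
    # Minimal Streamlit + LangChain chat front-end for a LocalAI server (sketch only).
    import streamlit as st
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        base_url="http://localhost:8080/v1",   # LocalAI exposes an OpenAI-compatible API
        api_key="not-needed",
        model="llama-3.1-8b-instruct",         # hypothetical model name configured in LocalAI
    )

    st.title("LocalAI chat")
    if prompt := st.chat_input("Ask something"):
        with st.chat_message("user"):
            st.write(prompt)
        with st.chat_message("assistant"):
            st.write(llm.invoke(prompt).content)
    ```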

  • @Johan-rm6ec · 5 days ago

    With these kinds of tests, 2x 4060 Ti 16GB must be included, and how it performs. 24GB is not enough, and 32GB on a Quadro-class card is around 2700 euros, so 2x 16GB seems like a sweet spot that you should cover. Know your audience, know the sweet spots; those are the videos people want to see.

  • @RoboTFAI · 4 days ago

    Adding in 2x 4060s won't really increase the speed over one of them, at least not noticeably. There are some other videos on the channel addressing this topic a bit. Scaling out on the number of video cards is really meant to just gain you that extra VRAM. So it's always a balance of your budget, costs, power usage, and your expectations (that last one is the most important). Lower, lower your expectations until your goals are met! Haha.

  • @STEELFOX2000 · 28 days ago

    Is it possible to use an RX 6800 to do this task?

  • @RoboTFAI · 25 days ago

    I do not have any AMD cards to test with, but there is ROCm for AMD, and llama.cpp/LocalAI/etc. do support it these days.