Serve a Custom LLM for Over 100 Customers

Science & Technology

➡️ ADVANCED-inference Repo: trelis.com/enterprise-server-...
➡️ ADVANCED-fine-tuning Repo: trelis.com/advanced-fine-tuni...
➡️ Trelis Function-calling Models: trelis.com/function-calling/
➡️ ADVANCED Vision Fine-tuning Repo: trelis.com/advanced-vision/
➡️ ADVANCED Transcription Repo: trelis.com/advanced-transcrip...
➡️ One-click Fine-tuning & Inference Templates: github.com/TrelisResearch/one...
➡️ Trelis Newsletter: Trelis.Substack.com
➡️ Tip Jar and Discord: ko-fi.com/trelisresearch
Affiliate Links (support the channel):
- Vast AI - cloud.vast.ai/?ref_id=98762
- RunPod - tinyurl.com/4b6ecbbn
Chapters:
0:00 Serving a model for 100 customers
0:25 Video Overview
1:08 Choosing a server
7:45 Choosing software to serve an API
11:26 One-click templates
12:13 Tips on GPU selection
17:34 Using quantisation to fit in a cheaper GPU
21:31 Vast.ai setup
22:25 Serve Mistral with vLLM and AWQ, incl. concurrent requests
35:22 Serving a function calling model
45:00 API speed tests, including concurrent
49:56 Video Recap

Comments: 46

  • @xdrap1 · 5 months ago

    Great video. Almost no one talks about how to create a server and API with a customizable LLM. I'd love to see more videos on this. Your channel is awesome.

  • @93cutty · 5 months ago

    just started watching, excited for the vid!

  • @Paris-vz6uv · 3 months ago

    This is exactly what I needed; everyone else only covers the very basics, so it's good to see someone going beyond that. Please keep it coming. Thank you again :)

  • @wood6454 · 5 months ago

    Thank you for this comprehensive and easy-to-understand guide. I will be serving an LLM for my friends.

  • @hickam16 · 5 months ago

    One of the best and most comprehensive explanations, thank you!

  • @omarzidan6840 · 5 months ago

    We love you, man! Good job. I was lost, but now I really understand everything.

  • @cprashanthreddy · 5 months ago

    Good Video. Explains most of the parameters required to deploy the solution. Thank you. :)

  • @dimknaf · 5 months ago

    Very nice explanations in all the videos I've seen. I subscribed. Keep up the good work!

  • @PunitPandey · 5 months ago

    Excellent video. Very useful.

  • @hadebeh2588 · 1 month ago

    Thank you so much for your video and all the great content you put out. Your channel is a gold mine of knowledge.

  • @jonathanvandenberg3571 · 5 months ago

    Great content!

  • @sania3631 · 2 months ago

    Thank you! Outstanding video, bro!

  • @jonyfrany1319 · 5 months ago

    Fantastic video sir, very informative.

  • @TrelisResearch · 5 months ago

    thank you sir

  • @sherpya · 4 months ago

    You can pipe curl API calls that return JSON to the jq utility to colorize/format the output.
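
    For example, here's a minimal Python equivalent of that curl-plus-jq pattern (the endpoint URL is a placeholder, not one from the video):

    ```python
    # Hypothetical example: fetch a JSON API response and pretty-print it,
    # similar to piping curl output through `jq .` (jq also adds colour).
    import json
    import urllib.request

    url = "http://localhost:8000/v1/models"  # placeholder: a local OpenAI-compatible endpoint
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    print(json.dumps(data, indent=2))  # nicely indented output
    ```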

  • @winsonsou · 5 months ago

    Very, very comprehensive and detailed explanation! Could I request a video on calculating how much VRAM is required when fine-tuning Mistral, for example?

  • @TrelisResearch · 5 months ago

    Well, Mistral is a 7B model and Mixtral is about 45B parameters. To fine-tune in 16-bit (bf16), you need at least 2x the model size in VRAM (because there are two bytes per parameter), so roughly 14 GB for Mistral or 90 GB for Mixtral. In practice, for Mixtral, you probably need 2x A6000 or 2x A100. Alternatively, you can fine-tune with QLoRA (see my earlier vid on that). Actually, you'll notice in this video that there is a line (bitsandbytes nf4) that is commented out when loading the model; if you comment it back in, you can cut the VRAM by roughly 3x. So then you could train Mistral in about 5 GB of VRAM, or Mixtral in about 30 GB. Last thing: you also need some extra VRAM that depends on the sequence length you're training with, so maybe add 20% as a buffer.
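
    As a rough illustration, that arithmetic as a small hypothetical Python helper (the 2-bytes-per-parameter, ~3x nf4 saving and ~20% buffer figures are the rules of thumb from the reply above, not exact numbers):

    ```python
    def finetune_vram_gb(params_billions: float, use_nf4: bool = False) -> float:
        """Back-of-the-envelope VRAM estimate for fine-tuning, per the reply above."""
        gb = params_billions * 2              # bf16: two bytes per parameter
        if use_nf4:
            gb /= 3                           # bitsandbytes nf4 (QLoRA) cuts VRAM by roughly 3x
        return gb * 1.2                       # ~20% buffer for sequence length

    print(finetune_vram_gb(7))                # Mistral 7B, bf16    -> ~17 GB
    print(finetune_vram_gb(45))               # Mixtral ~45B, bf16  -> ~108 GB
    print(finetune_vram_gb(7, use_nf4=True))  # Mistral 7B, QLoRA   -> ~6 GB
    print(finetune_vram_gb(45, use_nf4=True)) # Mixtral ~45B, QLoRA -> ~36 GB
    ```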

  • @winsonsou · 5 months ago

    @@TrelisResearch thank you for the detailed response! Do you have a Discord community or something?

  • @TrelisResearch · 5 months ago

    @@winsonsou no Discord community. I just use YouTube as the public forum and then offer a paid lifetime membership (with scripts) to an Inference repo and also a Fine-tuning repo. There are quite a few members who post issues there. There's more info on Trelis.com

  • @winsonsou · 5 months ago

    @@TrelisResearch thank you for your insights and responses! Very helpful and much appreciated!

  • @danieldemillard9412 · 5 months ago

    Have you explored serverless on RunPod? It seems like this would be a good way of minimizing idle time and saving costs in production, as you would only pay for what your customers are actually using. This might bring costs closer to a per-token calculation and be competitive with OpenAI. I think for single concurrent requests it is still much more expensive than OpenAI, but I'm curious about the economics of saturating a serverless GPU server and only paying for when it is active (scaling down to 0). It would be great to see a video on this, as well as on the impact on startup latency for the overall API call. I have worked with non-GPU serverless and usually it only adds a couple of seconds to go from 0 to 1 instances. I would also be curious how many parallel requests one of these could handle.

  • @TrelisResearch · 5 months ago

    Many thanks, that's a solid idea and I'm going to think through how to make a vid on it. Yes, serverless is about 4x more expensive per second, but downtime on a full GPU is very problematic so you're absolutely right.

  • @danieldemillard9412 · 5 months ago

    @@TrelisResearch Fantastic, looking forward to watching it! I'm also probably going to buy your inference repo, so I would love some starter code for this. My specific use case would be running Mixtral in 8-bit precision using RunPod's "48 GB GPU" option, but I can work through a general case too. One other thing I am curious about is how you might pre-bake the models into the Docker image so that the load times are reasonable, since downloading the model every time in serverless is a no-go. 80+ GB seems like a pretty massive Docker image, but they must have figured out how to make that efficient with their "Quick Deploy" models.

  • @TrelisResearch · 5 months ago

    @@danieldemillard9412 yeah, I'm going to dig in on the serverless options.

  • @alchemication · 5 months ago

    This is super cool!! I tried to check the performance when using a RunPod template with multiple GPUs, but adding the `--gpus all` flag to the docker command as per the vLLM docs did not work. Did you try running even more requests across N GPUs?

  • @TrelisResearch · 5 months ago

    RunPod makes it hard to add flags without updating the image. Can you use a TGI template instead, from here: github.com/TrelisResearch/one-click-llms ? It's faster and supports multi-GPU out of the box. If you really want vLLM, I have the gpus flag set on the Vast.ai one-click template.

  • @LinkSF1 · 5 months ago

    Do you have some kind of lifetime membership? I've become a fan and want to continue following you as you create more content and tutorials.

  • @TrelisResearch · 5 months ago

    Howdy! A few options:
    - Free option: the Trelis LLM Updates Newsletter - get on it at Trelis.Substack.com
    - Advanced fine-tuning repo (a lifetime membership to the fine-tuning scripts I make and regularly update): trelis.com/advanced-fine-tuning-scripts/
    - Advanced inference repo (again, a lifetime membership that includes inference scripts): trelis.com/enterprise-server-api-and-inference-guide/
    Access to either of those repos also allows you to create Issues to get some support.

  • @paolo-e-basta · 5 months ago

    The video content I was looking for, very nice. However, the repo is not accessible anymore.

  • @TrelisResearch · 5 months ago

    Howdy. The repo is private so you won’t see it until after the purchase completes. See the top of the description for the link.

  • @optalgin2371 · 1 month ago

    Question: if, say, I change servers or migrate to a different provider, is all the session info and all the logs gone?

  • @TrelisResearch · 1 month ago

    Best options are to:
    1. Push session info and logs elsewhere while running.
    2. Use a persistent volume (either from RunPod or by connecting up another service). I believe you can connect up most cloud services as your data volume.
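
    A minimal sketch of option 1 (shipping logs off the box while the server runs); the collector host and path below are placeholders, not a service from the video:

    ```python
    # Hypothetical example: send server logs to an external collector over HTTPS,
    # so nothing is lost when the GPU instance is destroyed or migrated.
    import logging
    import logging.handlers

    handler = logging.handlers.HTTPHandler(
        host="logs.example.com",  # placeholder external log collector
        url="/ingest",
        method="POST",
        secure=True,
    )
    logger = logging.getLogger("llm-server")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    logger.info("served request: 128 prompt tokens, 256 completion tokens")
    ```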

  • @eric-theodore-cartman6151 · 3 months ago

    I am looking for something extremely cheap and somewhat fast, for a natural-language-to-SQL project. Hardly 30-40 concurrent users, fewer than 100 visitors a day. What do you suggest?

  • @TrelisResearch · 3 months ago

    The OpenChat 3.5 7B model! You can run it on an A10 on Vast.ai. Check out one-click-llms on the Trelis GitHub.

  • @AlexBerg1 · 5 months ago

    Economies of scale make a company like OpenAI, which specializes in efficiently serving a single, general-purpose model, so cheap. It makes serving a tuned model look so much more expensive by comparison, which is unfortunate.

  • @AlexBerg1 · 5 months ago

    Actually, I see Anyscale has seemingly affordable fine-tuning on a Llama 2 base model. "Fine Tuning is billed at a fixed cost of $5 per run and $/million-tokens. For example, a fine tuning job of Llama-2-13b-chat-hf with 10M tokens would cost $5 + $2x10 = $25. Querying the fine-tuned models is billed on a $/million-tokens basis."

  • @TrelisResearch · 5 months ago

    Yup, even though this vid is about serving custom models, I felt I had to say (twice) that in most cases it's best to use OpenAI/Gemini. That said:
    a) If your business has a lot of customers, then you also benefit from economies of scale on serving (once you're doing parallel requests you can start getting towards good economics).
    b) If you have a high-value use case for a custom model, then it's not a problem paying $0.10/hr or $0.50/hr for your own GPU.

  • @jonathancat · 5 months ago

    Hey, what about Google Cloud?

  • @TrelisResearch · 5 months ago

    Initially I tried out Google Cloud, AWS and Azure, and it was really hard and expensive to get GPU access. That could have changed by now. Do you have experience with it? What's the hourly price of an A6000 on demand?

  • @fkxfkx · 5 months ago

    Why can't any of these YouTubers afford a decent haircut?

  • @jonathanvandenberg3571 · 5 months ago

    ??

  • @MMABeijing · 5 months ago

    Because none of the people talking about AI care about your opinion. Who would have guessed?

  • @anglikai9517 · 5 months ago

    If they thought a haircut was important, they wouldn't be intelligent enough for AI stuff.

  • @TrelisResearch · 5 months ago

    @@anglikai9517 🤣

  • @MrQuicast · 2 months ago

    great video
