vLLM - Turbo Charge your LLM Inference

Science & Technology

Blog post: vllm.ai/
Github: github.com/vllm-project/vllm
Docs: vllm.readthedocs.io/en/latest...
Colab: drp.li/5ugU2
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/samwit/langchain-t...
github.com/samwit/llm-tutorials
Timestamps:
00:00 Intro
01:17 vLLM Blog
04:27 vLLM Github
05:40 Code Time

Comments: 57

  • @rajivmehtapy
    11 months ago

    As always, you are one of the few people who cover this topic on YouTube.

  • @sp4yke
    11 months ago

    Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI API emulation layers such as LocalAI, Oobabooga, and vLLM.

  • @user-ew8ld1cy4d
    11 months ago

    Sam, I love your videos but this one takes the cake. Thank you!!!

  • @g-program-it
    11 months ago

    Finally, AI models that don't take a year to give a response. Cheers for sharing this, Sam.

  • @clray123
    11 months ago

    Uhh... you already get an instant response from GGML/llama.cpp (apart from the model weight loading time, but that is not something PagedAttention improves on). The deal with PagedAttention is that it prevents the KV cache from wasting memory: instead of over-allocating the entire context length at once, it allocates in chunks as the sequence grows (and possibly shares chunks among different inference beams or users). This allows the same model to serve more users (throughput), provided, of course, that they generate sequences shorter than the context length. It should not affect the response time for any individual user (if anything, it makes it slightly worse because of the overhead of mapping virtual to physical memory blocks). So if it improves on HF in that respect, it just demonstrates that either HF's KV cache implementation sucks or Sam is comparing non-KV-cached generation with KV-cached generation.
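
    To illustrate the chunked allocation idea, here is a toy sketch in Python (a conceptual illustration only, not vLLM's actual allocator; the context length and block size below are made-up values):

      # Naive KV-cache allocation reserves the full context window per sequence;
      # block-based ("paged") allocation only reserves what each sequence actually uses.
      MAX_CONTEXT_LEN = 2048   # slots a naive allocator reserves per sequence
      BLOCK_SIZE = 16          # tokens per block in a paged allocator

      def naive_slots(seq_lens):
          """Every sequence reserves the whole context window up front."""
          return len(seq_lens) * MAX_CONTEXT_LEN

      def paged_slots(seq_lens):
          """Each sequence holds only enough fixed-size blocks for its tokens."""
          blocks = sum(-(-n // BLOCK_SIZE) for n in seq_lens)  # ceiling division
          return blocks * BLOCK_SIZE

      if __name__ == "__main__":
          # 32 concurrent users, most generating far less than the full context
          seq_lens = [180, 95, 400, 60] * 8
          print("naive slots:", naive_slots(seq_lens))   # 65536
          print("paged slots:", paged_slots(seq_lens))   # 6016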

  • @MultiSunix
    11 months ago

    Talked to its core developers; they don't have plans to support quantized models yet, so you really need powerful GPU(s) to run it.

  • @mayorc
    11 months ago

    A very good test for this, which you could make a video on, would be to use the OpenAI-compatible server functionality with a well-performing, optimized local model trained for coding, and test it with great new tools like GPT-Engineer or Aider to see how it compares to GPT-4 in real-world application-writing scenarios.
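
    For reference, pointing a client (or a tool like Aider) at that server boils down to an OpenAI-style HTTP call. A minimal sketch, assuming a vLLM OpenAI-compatible server is already running locally on port 8000; "<your-coder-model>" is a placeholder for whatever coding model it was started with:

      # Query a locally running vLLM OpenAI-compatible server.
      # (Server assumed started separately, e.g. with
      #  python -m vllm.entrypoints.openai.api_server --model <your-coder-model>)
      import requests

      resp = requests.post(
          "http://localhost:8000/v1/completions",
          json={
              "model": "<your-coder-model>",   # placeholder model name
              "prompt": "Write a Python function that reverses a string.",
              "max_tokens": 128,
              "temperature": 0.2,
          },
      )
      print(resp.json()["choices"][0]["text"])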

  • @Rems766
    11 months ago

    Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.

  • @henkhbit5748
    11 months ago

    Hmm, did not know that Red Bull and Verstappen were in the race for turbocharging LLMs 😉 Thanks for demonstrating vLLM in combination with an open-source model 👍

  • @mayorc
    11 months ago

    This looks very useful.

  • @guanjwcn
    11 months ago

    This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.

  • @wilfredomartel7781
    11 months ago

    Finally we can achieve fast responses.

  • @MeanGeneHacks
    11 months ago

    Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit? Edit: noticed there's no Falcon support.

  • @samwitteveenai
    11 months ago

    AFAIK I don't think they are supporting bitsandbytes etc., which doesn't surprise me, as what they are mainly using it for is comparing models, and low-resolution quantization is not ideal for that.

  • @akiempaul1117
    8 months ago

    Great Great Great

  • @jasonwong8934
    11 months ago

    I'm surprised the bottleneck was memory inefficiency in the attention mechanism and not the volume of matrix multiplications.

  • @MrRadziu86
    11 months ago

    @Sam Witteveen do you know by any chance how it compares to other recent techniques for speeding up models? I don't remember exactly, but sometimes it is just a setting, a parameter nobody used until somebody shared it, as well as other techniques. Also, if you happen to know, which are better suited for Falcon, LLaMA, etc.?

  • @samwitteveenai
    11 months ago

    For many of the options I have looked at, this compares well for the models that it works with.

  • @NickAubert
    11 months ago

    It looks like vLLM itself is CUDA-based, but I wonder if these techniques could apply to CPU-based runtimes like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.

  • @harlycorner
    11 months ago

    Thanks for this video. Although I should mention that, at least on my RTX 3090 Ti, the GPTQ 13B models with the ExLlama loader are absolutely flying. Faster than ChatGPT 3.5 Turbo. But I'll definitely take a look.

  • @MariuszWoloszyn
    11 months ago

    vLLM is great but lacks support for some models (and some are still buggy, like mpt-30b with streaming, but MPT was added about two days ago so expect that to be fixed soon). For example, it's less likely to support Falcon-40B soon. In that case, use huggingface/text-generation-inference, which can load Falcon-40B in 8-bit flawlessly!

  • @samwitteveenai
    11 months ago

    Yes, none of these are flawless. I might make a video about hosting with HF text-generation-inference as well.

  • @shishirsinha6344
    10 months ago

    Where is the model comparison in terms of execution time with respect to Hugging Face?

  • @Gerald-xg3rq
    3 months ago

    Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?

  • @clray123
    11 months ago

    It should be noted that for whatever reason it does not work with CUDA 12.x (yet).

  • @samwitteveenai
    11 months ago

    My guess is it's just because their setup is not using that yet, and it will come. I actually just checked my Colab and that seems to be running CUDA 12.0, but maybe that is not optimal.

  • @asmac001nolastname6
    11 months ago

    Can this package be used with quantized 4-bit models? I don't see any support for them in the docs.

  • @samwitteveenai
    11 months ago

    No, I don't think it will work with that.

  • @frazuppi4897
    11 months ago

    Not sure, since they compared with HF Transformers, and HF doesn't use FlashAttention to my knowledge, so it is quite slow by default.

  • @samwitteveenai
    11 months ago

    They compared to TGI as well, which does have FlashAttention (huggingface.co/text-generation-inference), and vLLM is still quite a bit faster.

  • @io9021
    11 months ago

    I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅

  • @s0ckpupp3t
    11 months ago

    Does ONNX have streaming ability? I can't see any mention of WebSockets or HTTP/2.

  • @io9021
    11 months ago

    @@s0ckpupp3t Not that I know of. I converted bloom-560 to ONNX and got similar latency to vLLM. I guess with ONNX one could optimise it a bit further, but I'm impressed by vLLM because it's much easier to use.

  • @user-hf3fu2xt2j
    11 months ago

    Now I wonder if it is possible to launch this on a CPU. Some models would work tolerably.

  • @andrewdang3401
    11 months ago

    Is this possible with LangChain and a GUI?

  • @navneetkrc
    11 months ago

    So can I use this with models downloaded from Hugging Face directly? Context: in my office setup I can only use model weights that are downloaded separately.

  • @samwitteveenai
    11 months ago

    Yes, totally. The Colab I show was downloading a model from Hugging Face. Not all LLMs are compatible, but most of the popular ones are.

  • @navneetkrc
    11 months ago

    @@samwitteveenai In my office setup these models cannot be downloaded (blocked), so I download them separately and use their weights via Hugging Face pipelines as the LLM for LangChain and other use cases. I will try a similar approach for vLLM, hoping that it works.

  • @samwitteveenai
    11 months ago

    @@navneetkrc Yes, totally; you will just need to load the weights locally.
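
    A minimal sketch of that local-weights route, assuming the directory holds one of the architectures vLLM supports (the path is a placeholder):

      # Load weights from a local directory instead of a Hugging Face repo id.
      from vllm import LLM, SamplingParams

      llm = LLM(model="/data/models/my-llama-7b")        # placeholder local path
      params = SamplingParams(temperature=0.7, max_tokens=128)

      outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
      print(outputs[0].outputs[0].text)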

  • @navneetkrc
    11 months ago

    @@samwitteveenai thanks a lot for the quick replies. You are the best 🤗

  • @TheNaive
    7 months ago

    Could you show how to add any Hugging Face model to vLLM? Also, the Colab above isn't working.

  • @chenqu773
    11 months ago

    I am wondering if it works with Hugging Face 8-bit and 4-bit quantization.

  • @samwitteveenai
    11 months ago

    If you are talking about bitsandbytes, I don't think it does just yet.

  • @ColinKealty
    11 months ago

    Is this usable as a model in LangChain for tool use?

  • @samwitteveenai
    11 months ago

    You can use it as an LLM in LangChain. Whether it will work with tools will depend on which model you serve.
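
    One way to wire that up (a sketch, not an official integration; it assumes the OpenAI-compatible server route, and the port and model name are placeholders) is to point LangChain's OpenAI wrapper at the local vLLM endpoint:

      # Use a locally served vLLM model through LangChain's OpenAI LLM wrapper.
      from langchain.llms import OpenAI

      llm = OpenAI(
          openai_api_base="http://localhost:8000/v1",   # vLLM's OpenAI-compatible endpoint
          openai_api_key="not-needed-locally",          # placeholder; a default local server doesn't check it
          model_name="<the-model-you-served>",          # placeholder model name
      )

      print(llm("List three things PagedAttention improves."))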

  • @ColinKealty
    11 months ago

    @@samwitteveenai I assume it doesn't support quants? I don't see any mention.

  • @keemixvico975
    11 months ago

    It doesn't work... damn it. I don't want to use Docker to make this work, so I'm stuck.

  • @samwitteveenai
    11 months ago

    What model are you trying to get to work? It also doesn't support quantized models, if that is what you are trying.

  • @saraili3971
    11 months ago

    @@samwitteveenai Hi Sam, thanks for sharing (a life-saver for newbies). I wonder what your recommendation is for quantized models?

  • @napent
    11 months ago

    What about data privacy?

  • @samwitteveenai
    11 months ago

    You are running it on a machine you control. What are the privacy issues?

  • @napent
    11 months ago

    @@samwitteveenai I thought it was cloud-based 🎩

  • @seinaimut
    11 months ago

    Can it be used with GGML models?

  • @samwitteveenai
    11 months ago

    No, so far it is for full-resolution models only.

  • @sherryhp10
    11 months ago

    Still very slow.

  • @eyemazed
    8 months ago

    It doesn't work on Windows, folks. Trash.

  • @eljefea2802
    8 months ago

    They have a Docker image. That's what I'm using right now.
