Mastering Google's VLM PaliGemma: Tips And Tricks For Success and Fine Tuning

Science & Technology

Colab (code) Inference: drp.li/GVIjV
Colab (code) Fine-Tuning: drp.li/I0w8d
HF Blog: huggingface.co/blog/paligemma
HF Spaces: huggingface.co/spaces/big-vis...
Models : huggingface.co/collections/go...
🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: drp.li/dIMes
👨‍💻 GitHub:
github.com/samwit/langchain-t... (updated)
github.com/samwit/llm-tutorials
⏱️Time Stamps:
00:00 Intro
00:50 What is PaliGemma?
01:13 PaLI-3 Paper
01:26 SigLIP Paper
01:36 Hugging Face Blog: PaliGemma
03:19 PaliGemma: Three Pre-trained Checkpoints
05:11 PaliGemma different Sizes and Releases
05:53 PaliGemma Hugging Face Spaces Demo
09:39 ScreenAI Datasets
10:44 Code Time
10:55 Using PaliGemma with Transformers (inference sketch below)
14:54 PaliGemma Finetuning
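
As a quick companion to the inference colab, here is a minimal sketch of running PaliGemma with Hugging Face transformers, along the lines of the "Using PaliGemma with Transformers" segment (10:55). The checkpoint id, image URL, and prompt are illustrative placeholders, not the exact values from the video:

```python
# Minimal PaliGemma inference sketch (hedged: checkpoint id, URL, and
# prompt are placeholders, not necessarily what the video uses).
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # a "mix" checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

# Any RGB image works; this URL is a placeholder.
image = Image.open(
    requests.get("https://example.com/some_image.jpg", stream=True).raw
)

prompt = "caption en"  # the mix checkpoints expect task-prefix prompts
inputs = processor(text=prompt, images=image, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=30)
# Decode only the newly generated tokens, skipping the prompt.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```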

Comments: 19

  • @paulmiller591 · 21 days ago

    This is an exciting sub-field. We have a lot of clients making observations, so we're keen to try this. Happy travels, Sam.

  • @amandamate9117 · 21 days ago

    Excellent video, can't wait for more vision model examples, especially with ScreenAI for agents that browse the web.

  • @user-en4ek6xt6w · 21 days ago

    Thank you for your video

  • @ricardocosta9336 · 21 days ago

    Ty my dude

  • @SonGoku-pc7jl · 19 days ago

    Thanks, we will see Phi-3 Vision for comparison :)

  • @miguelalba2106 · 15 days ago

    Do you know how complete the dataset needs to be for fine-tuning? I have lots of image-text pairs of clothes, but some have more details than others, so I guess the model will get confused during training. For example, there are thousands of images of dresses labelled with only the color, and thousands labelled with the color plus other details.

  • @unclecode · 21 days ago

    Fascinating. I wonder if there is an example of fine-tuning for segmentation; if so, the way we collate the data would have to be different. I have one question about the code around the 15:30 mark: I noticed a part that splits the dataset into train and test, but after the split it says `train_ds = split_ds["test"]`. Shouldn't that be "train"? I think it might be a mistake. What do you think? Very interesting content, especially if the model has the general knowledge to get into a game like your McDonald's example. This definitely has great applications in the medical and education fields as well. Thank you for the content.

  • @samwitteveenai · 20 days ago

    Just look at the output from the model when you do segmentation and copy that. Yes, you will need to update the collate function. The "test" part is correct because it is just setting it to train on a very small number of examples; for a real training run, use the 'train' split, which is 95% of the data, as opposed to the 5% in the test split.
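
    For reference, a minimal sketch of the split being discussed, assuming a Hugging Face `datasets` Dataset (the dataset id here is illustrative, not necessarily what the colab loads):

    ```python
    from datasets import load_dataset

    # Illustrative dataset id; substitute whatever the colab actually loads.
    ds = load_dataset("HuggingFaceM4/VQAv2", split="train")

    split_ds = ds.train_test_split(test_size=0.05)  # 95% train / 5% test
    train_ds = split_ds["test"]  # demo: train on the tiny 5% slice for speed
    # Real fine-tune: use the big slice instead:
    # train_ds = split_ds["train"]
    ```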

  • @unclecode · 20 days ago

    @@samwitteveenai Oh ok, that was just for the video demo, thanks for the clarification 👍

  • @unclecode · 17 days ago

    @@samwitteveenai Thanks, I get it now; the "test" split is just for the demo in this colab. Although it would've been clearer if they had used a subset of, say, 100 rows from the train split. I experimented a bit, and the model is super friendly to fine-tuning. Whatever they did made this model really easy to tune. We're at a point where "tune-friendly" actually makes sense.

  • @SenderyLutson · 21 days ago

    I think the Aria dataset from Meta is also open

  • @samwitteveenai · 20 days ago

    interesting dataset. Didn't know about this. Thanks

  • @AngusLou · 20 days ago

    Is it possible to make the whole thing local?

  • @willjohnston8216 · 21 days ago

    Do you know if they are going to release a model for real time video sentiment analysis? I thought there was a demo of that by either Google or OpenAI?

  • @samwitteveenai · 20 days ago

    Not sure, but you can do some of this already with Gemini, just not in real time (publicly at least).

  • @FirstArtChannel · 21 days ago

    The inference speed and size of the model still seem somewhat slower/larger than a multimodal LLM such as LLaVA, or am I wrong?

  • @samwitteveenai · 20 days ago

    Honestly, it's been a while since I played with LLaVA, and I've mostly just used it via Ollama, so I'm not sure how it compares. Phi-3 Vision is also worth checking out; I may make a video on that as well.

  • @SenderyLutson · 21 days ago

    How much VRAM does this model consume while running? And the Q4 version?

  • @samwitteveenai · 20 days ago

    The inference was running on a T4, so it is manageable. The fine-tuning was on an A100.
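
    For the Q4 question above, here is a hedged sketch of loading PaliGemma in 4-bit via bitsandbytes to reduce VRAM; the quantization settings and checkpoint id are illustrative assumptions, not what the video used:

    ```python
    import torch
    from transformers import (
        BitsAndBytesConfig,
        PaliGemmaForConditionalGeneration,
    )

    # 4-bit quantization config; these values are illustrative defaults.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = PaliGemmaForConditionalGeneration.from_pretrained(
        "google/paligemma-3b-mix-224",  # illustrative checkpoint id
        quantization_config=bnb_config,
        device_map="auto",  # requires accelerate; places layers on the GPU
    )
    ```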
