Implement and Train ViT From Scratch for Image Recognition - PyTorch

Science & Technology

We're going to implement ViT (Vision Transformer) and train our implementation on the MNIST dataset to classify images! The video where I explain the ViT paper and the GitHub links are below ↓
Want to support the channel? Hit that like button and subscribe!
ViT (Vision Transformer) - An Image Is Worth 16x16 Words (Paper Explained)
kzread.info/dash/bejne/aqScr5NvmNexkrg.html
GitHub Link of the Code
github.com/uygarkurt/ViT-PyTorch
Notebook
github.com/uygarkurt/ViT-PyTorch/blob/main/vit-implementation.ipynb
ViT (Vision Transformer) is introduced in the paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
arxiv.org/abs/2010.11929
What should I implement next? Let me know in the comments!
00:00:00 Introduction
00:00:09 Paper Overview
00:02:41 Imports and Hyperparameter Definitions
00:11:09 Patch Embedding Implementation
00:19:36 ViT Implementation
00:29:00 Dataset Preparation
00:51:16 Train Loop
01:09:27 Prediction Loop
01:12:05 Classifying Our Own Images

Comments: 48

  • @uygarkurtai
    8 months ago

    In order to use this code for images with multiple channels: change self.cls_token = nn.Parameter(torch.randn(size=(1, in_channels, embed_dim)), requires_grad=True) to self.cls_token = nn.Parameter(torch.randn(size=(1, 1, embed_dim)), requires_grad=True). Thanks @Yingjie-Li for pointing it out.
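    A minimal sketch of how that fix slots into the patch embedding (toy dimensions; the class and attribute names follow the repo's style but are assumptions here):

    ```python
    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Minimal sketch of the video's patch embedding; names are assumed."""
        def __init__(self, embed_dim, patch_size, num_patches, in_channels):
            super().__init__()
            # Conv with kernel_size == stride == patch_size cuts the image into
            # patches and projects each one to embed_dim in a single step.
            self.patcher = nn.Sequential(
                nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size),
                nn.Flatten(2),
            )
            # The fix: one learned CLS vector per sequence, so the middle
            # dimension is 1 -- not in_channels.
            self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim), requires_grad=True)
            self.position_embedding = nn.Parameter(
                torch.randn(1, num_patches + 1, embed_dim), requires_grad=True
            )

        def forward(self, x):
            cls_token = self.cls_token.expand(x.shape[0], -1, -1)  # one CLS per image
            x = self.patcher(x).permute(0, 2, 1)  # (B, num_patches, embed_dim)
            x = torch.cat([cls_token, x], dim=1)  # prepend CLS -> num_patches + 1 tokens
            return x + self.position_embedding

    # Works for RGB (in_channels=3) just as well as for MNIST's single channel:
    emb = PatchEmbedding(embed_dim=16, patch_size=4, num_patches=4, in_channels=3)
    out = emb(torch.randn(2, 3, 8, 8))  # shape: (2, 5, 16)
    ```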

  • @learntestenglish
    9 months ago

    Thank you so much, a video like this is hard to find anywhere else on the internet 👏👏

  • @uygarkurtai
    9 months ago

    thank you :)

  • @goktankurnaz
    9 months ago

    Another invaluable guide!!

  • @uygarkurtai
    9 months ago

    Thank you :)

  • @BOankur
    5 months ago

    Thank you for the tutorial! Great work.

  • @uygarkurtai
    5 months ago

    Thank you!

  • @spml_css
    9 months ago

    Very useful tutorial. Thank you.

  • @uygarkurtai
    9 months ago

    Thank you :)

  • @arnabdutta5281
    6 months ago

    Great video. Really loved it.

  • @uygarkurtai
    6 months ago

    Thank you :)

  • @Yingjie-Li
    9 months ago

    Thank you so much

  • @uygarkurtai
    8 months ago

    thank you :)

  • @prashlovessamosa
    8 months ago

    Thanks for sharing

  • @uygarkurtai
    8 months ago

    thank you :)

  • @h2o11h2o
    7 months ago

    Well done. Thank you!

  • @uygarkurtai
    7 months ago

    Thank you :)

  • @emrahe468
    2 months ago

    Uygar hocam, greetings. As in many other tutorials, we use nn.Conv2d to split the image into small patches (@14:15). But as far as I know, wherever there's a Conv2d there are also randomly initialized weights. So yes, we split the image into small patches, but at the same time we apply the convolution's weights to those patches. If you like, compare the small images produced after nn.Conv2d/patcher with the original image patches; you'll see that they differ. Maybe I'm the one making a mistake. Good luck with your work!

  • @uygarkurtai
    2 months ago

    Hello, the authors themselves also use conv layers to split into patches in their implementation; they don't mention at first how the patching is done. There are two reasons for using a conv layer: 1) it improves performance, see arxiv.org/abs/2106.14881; 2) a conv layer runs in parallel on the GPU, which makes everything faster.
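    To make the point above concrete, here is a small check (a sketch with toy sizes) that a Conv2d with kernel_size == stride == patch_size is exactly "cut into raw patches, then apply one shared linear projection": the weights are indeed applied, and that projection is the ViT patch embedding itself.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    patch, in_ch, embed_dim = 4, 3, 8
    img = torch.randn(1, in_ch, 8, 8)  # toy image -> 2x2 = 4 patches

    # ViT-style patcher: kernel_size == stride == patch_size
    conv = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
    via_conv = conv(img).flatten(2).permute(0, 2, 1)  # (1, 4, embed_dim)

    # The same computation by hand: cut out the raw patches, then apply the
    # conv's weights as one shared linear projection per patch.
    patches = F.unfold(img, kernel_size=patch, stride=patch).permute(0, 2, 1)  # (1, 4, 48)
    weight = conv.weight.reshape(embed_dim, -1)  # (embed_dim, in_ch * patch * patch)
    via_linear = patches @ weight.T + conv.bias

    print(torch.allclose(via_conv, via_linear, atol=1e-5))  # True
    ```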

  • @FernandoPC25
    2 months ago

    Hey Uygar, Thanks a lot for the tutorial, you're like my coding sensei! I was wondering about something while coding the ViT. Why do you define hidden_dim if you're not using it later on? Or maybe you are using it and I just haven't noticed? Appreciate your help!

  • @uygarkurtai
    2 months ago

    Thank you! It seems I don't use it, yes. I don't remember exactly why I put it in in the first place; probably to make a deeper MLP or something. In this case you can skip it.

  • @abrarluvrabit
    4 months ago

    Hi, thank you so much for this video, I really needed it to understand how to train a ViT. Could you please make a video on the Multiscale Vision Transformer (MViT and MViTv2), training them from scratch? I really appreciate all your efforts for the ML, DL, and CV community.

  • @uygarkurtai
    4 months ago

    Thank you! I noted it down and will look into it.

  • @federikky98
    6 months ago

    Hello, very good explanation. I'm wondering: how can I visualize the attention maps of the transformer?

  • @uygarkurtai
    6 months ago

    Hey, thank you. There's a tool for this: github.com/jessevig/bertviz. You can play around with it.

  • @MrMadmaggot
    3 months ago

    That's cool man, your coding skills and how smoothly you code are almost scary; maybe AI is not for me xdddd. Anyway, my question: you are using only one layer. What if I want to use multiple layers? At 22:44, after encoder_layer, should I add another encoder_layer_2 with different parameters?

  • @uygarkurtai
    3 months ago

    Hey, thank you for your kind words :) You can do that. The thing is, you have to experiment with this kind of stuff. In AI, let's say there's an architecture that works. Why is it like that? Because it works. Will adding another encoder work? Probably. Will it improve performance? I don't know. You have to try.
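    For stacking, there's no need to hand-write an encoder_layer_2: PyTorch's nn.TransformerEncoder takes a single layer definition and clones it num_layers times, each clone with its own weights. A sketch with toy dimensions:

    ```python
    import torch
    import torch.nn as nn

    # Define the layer once; nn.TransformerEncoder deep-copies it num_layers
    # times, and each copy trains its own weights.
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=16, nhead=4, dim_feedforward=32, batch_first=True
    )
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

    x = torch.randn(2, 5, 16)  # (batch, sequence_length, embed_dim)
    out = encoder(x)           # shape preserved: (2, 5, 16)
    ```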

  • @PheaKhayMSumo
    6 months ago

    Hi, I am a student and I was wondering if I could use your code as the basis for my thesis, which is centered on sorting ripe and unripe strawberries?

  • @uygarkurtai
    6 months ago

    Hey sure. It's open source. Feel free to use it.

  • @Movies_Daily_
    1 month ago

    Can you tell me which versions of Python, torch, scikit-learn, and the other packages you used?

  • @uygarkurtai
    1 month ago

    Hey I didn't use a specific version. You can just use the latest one.

  • @arturovalle5990
    2 months ago

    Could you implement a DiT (Diffusion Transformer)?

  • @uygarkurtai
    2 months ago

    Hey that's a great idea! I added it to my list.

  • @Yingjie-Li
    8 months ago

    Hi, I have some advice for this code. I work with images where in_channels = 3, but your code cannot handle the in_channels = 3 case. I made a fix based on your code: self.position_embedding = nn.Parameter(torch.randn(size=(1, num_patches + in_channels, embed_dim)), requires_grad=True). After that, the code works for in_channels = 3 images. HOPE YOUR REPLY! -China-Beijing

  • @uygarkurtai
    8 months ago

    Hey, that's a great catch! Thank you for pointing it out :) However, you may not want to change the position embedding dimensions, because that "+1" stands for the extra CLS token. Try the following -> self.cls_token = nn.Parameter(torch.randn(size=(1, 1, embed_dim)), requires_grad=True). Let me know if it works!

  • @Yingjie-Li
    8 months ago

    You are right! Thanks! I know that the "+1" means the extra CLS token. I changed the cls_token to size=(1, 1, embed_dim) and it works well! @@uygarkurtai

  • @uygarkurtai
    8 months ago

    @@Yingjie-Li Good to know that :)

  • @FernandoPC25
    5 months ago

    ​@@uygarkurtai If my understanding is correct, wouldn't it be preferable to hard-code the second dimension of self.cls_token to always be 1, regardless of the actual value of in_channels? That is, "self.cls_token = nn.Parameter(torch.randn(size=(1, 1, embed_dim)), requires_grad=True)" in every case, since the CLS token is always a single token. Thank you very much, both of you! ^^

  • @uygarkurtai
    5 months ago

    @@FernandoPC25 Hard-coding is fine too, actually, if all your images have the same number of channels. I just made it more generalizable like this.

  • @muhammadatique4293
    5 months ago

    Can you please change the theme to white? It's hard to see with the black theme.

  • @uygarkurtai
    5 months ago

    I wasn't aware of that. It'll improve in the future!

  • @ABHISHEKRAJ-wx4vq
    3 months ago

    import torch
    import maxvit
    # from .maxvit import MaxViT, max_vit_tiny_224, max_vit_small_224, max_vit_base_224, max_vit_large_224
    # Tiny model
    network: maxvit.MaxViT = maxvit.max_vit_tiny_224(num_classes=1000)
    input = torch.rand(1, 3, 224, 224)
    output = network(input)

    My purpose is to give an image (1, 3, 224, 224) as input and generate its description as output. How should I do that? What more should I add to this code?

  • @uygarkurtai
    3 months ago

    Hey, I have no idea about maxvit. If your input's channels and sizes match the model's input channels and sizes, there should be no problem. I suggest you check those.

  • @sidbhattnoida
    2 months ago

    Please implement CLIP if you can.....

  • @uygarkurtai
    2 months ago

    Noted.

  • @staffankonstholm3506
    26 days ago

    Shouldn't x be first in x = torch.cat([x, cls_token], dim=1) ?

  • @uygarkurtai
    22 days ago

    Hey, I'm not sure if it makes a difference. You can do it like that too.

  • @staffankonstholm3506
    7 days ago

    @@uygarkurtai I take it back, the cls_token has to be first.
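    A toy sketch of why the order matters: in the standard ViT setup the classification head reads position 0 of the sequence, so the CLS token has to be the first thing concatenated.

    ```python
    import torch

    embed_dim = 16
    cls_token = torch.randn(2, 1, embed_dim)  # one CLS vector per batch item
    patches = torch.randn(2, 4, embed_dim)    # 4 patch embeddings

    # CLS first: the head later reads x[:, 0], which must be the CLS token.
    # torch.cat([patches, cls_token], dim=1) would put a patch there instead.
    x = torch.cat([cls_token, patches], dim=1)  # (2, 5, 16)
    print(torch.equal(x[:, 0], cls_token[:, 0]))  # True
    ```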