Vision Transformers explained
Science and Technology
Vision Transformer, also known as ViT, is a deep learning model that applies the Transformer architecture, originally developed for natural language processing, to computer vision tasks. It has gained attention for its ability to achieve competitive performance on image classification and other vision tasks, even without relying on convolutional neural networks (CNNs).
Transformers: • Transformers for begin...
**************************************************************************************
For queries: comment in the comment section or email me at aarohisingla1987@gmail.com
**************************************************************************************
The key idea behind the Vision Transformer is to divide an input image into smaller patches and treat them as tokens, similar to how words are treated in natural language processing. Each patch is flattened and linearly projected to an embedding, and a position embedding is added so the model knows where in the image the patch came from. The resulting sequence of embeddings is fed into a stack of Transformer encoder layers.
The Vision Transformer has shown competitive performance on image classification, object detection, and semantic segmentation.
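The patch-and-embed step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the image size (48x48, grayscale), the patch size (16), the embedding width (8), and the helper names `split_into_patches` and `linear_project` are all made up for the example, and the random matrices stand in for parameters that a real model would learn. In practice you would use a tensor library rather than nested lists.

```python
import random


def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def linear_project(vector, weights):
    """Multiply a length-n vector by an n x d weight matrix, giving a length-d vector."""
    return [sum(v * w_row[j] for v, w_row in zip(vector, weights))
            for j in range(len(weights[0]))]


random.seed(0)
IMAGE_SIZE, PATCH_SIZE, EMBED_DIM = 48, 16, 8  # illustrative values, not from the video

# A random grayscale "image"; 48 / 16 = 3, so we get a 3 x 3 grid of patches.
image = [[random.random() for _ in range(IMAGE_SIZE)] for _ in range(IMAGE_SIZE)]
patches = split_into_patches(image, PATCH_SIZE)

# Random stand-ins for the learned projection matrix and position embeddings.
projection = [[random.gauss(0, 0.02) for _ in range(EMBED_DIM)]
              for _ in range(PATCH_SIZE * PATCH_SIZE)]
position_embeddings = [[random.gauss(0, 0.02) for _ in range(EMBED_DIM)]
                       for _ in range(len(patches))]

# Project each flattened patch, then add its position embedding element-wise.
tokens = []
for patch, position in zip(patches, position_embeddings):
    embedded = linear_project(patch, projection)
    tokens.append([e + p for e, p in zip(embedded, position)])

print(len(patches), len(patches[0]))  # 9 256
print(len(tokens), len(tokens[0]))    # 9 8
```

The `tokens` list is what would be handed to the Transformer encoder; the original ViT design also prepends one extra learnable class token, which this sketch leaves out.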
#computervision #transformers
Comments: 86
Aarohi, I have been watching you for 3 years now, and each time I understand the subject. You're literally the best
@CodeWithAarohi
29 days ago
Thank you so much for your incredibly kind words! It means a lot to me😊
Thanks again for this very well explained tutorial.
@CodeWithAarohi
8 months ago
Glad it was helpful!
Your explanation is one of the best I’ve heard about ViT, thank you very much
@CodeWithAarohi
8 months ago
Glad it was helpful!
@naziahossain3950
8 months ago
I agree
Thanks for this tutorial; it's simple and deep
@CodeWithAarohi
9 months ago
You're welcome 😊
Great... crystal clear, the concepts are very well explained 😊
@CodeWithAarohi
3 months ago
Glad it helped!
The video was awesome. Could you explain the full Transformer model with all 6 encoders and 6 decoders? I am confused about the input architecture of the decoders. Thank you, ma'am
Beautifully explained!
@CodeWithAarohi
4 months ago
Glad it was helpful!
What about the extra class token? I think only the extra class token is used for the classification. Could you please explain this point?
Thanks very much, the videos are awesome and genuine.
@CodeWithAarohi
3 months ago
Glad you like them!
Very nice Presentation
@CodeWithAarohi
8 months ago
Thanks a lot
Hello Ma’am Your AI and Data Science content is consistently impressive! Thanks for making complex concepts so accessible. Keep up the great work! 🚀 #ArtificialIntelligence #DataScience #ImpressiveContent 👏👍
@CodeWithAarohi
7 months ago
My pleasure 😊
Excellent explanation
@CodeWithAarohi
9 months ago
Glad it was helpful!
Very good explanation. Thank you
@CodeWithAarohi
3 months ago
You are welcome!
Your tutorials are always the best, thank you very much. I hope you would create tutorials on Segformer soon.
@CodeWithAarohi
3 months ago
Thank you, I will
The content is amazing! Very informative, short, and to the point, which is great for beginners. Thank you for these amazing videos 😍I have only one small feedback for your future videos. The audio quality is a little bit bad and noisy. You might consider checking your microphone.
@CodeWithAarohi
4 months ago
Thank you for the feedback. I will take care of the noise.
Can you please suggest how to use a Vision Transformer for text classification?
Thanks for making the video
@CodeWithAarohi
11 months ago
My pleasure!
Thank you. Can you explain the difference between CNN and ViT side by side?
Thanks for this video. This tutorial is very clear, and we learned how to split the image into patches
@CodeWithAarohi
2 months ago
You are welcome 😊
Thank you so much, ma'am, for this amazing video
@CodeWithAarohi
9 months ago
Most welcome 😊
I came to this video to learn how to do positional encoding for 2D images -- the precise math. When you come to that portion, you simply reference your intro video, re Transformers for linear text (and in which even the linear positional encoding isn't really explained).
@CodeWithAarohi
9 months ago
Sorry for the inconvenience. I will try to cover the math in a separate video.
@ervinperetz5973
9 months ago
@@CodeWithAarohi Thanks for responding. Your videos are terrific otherwise. Thanks for sharing your work and insights.
@MP-sx6tg
5 months ago
‘The precise math for encoding’ Bro it’s deep learning and you talk about precise math 😂 Literally those people encoded 1,2,…256 for each patch.
Very well explained. Thanks a lot
@CodeWithAarohi
21 days ago
Glad it was helpful!
Nicely Explained..!
@CodeWithAarohi
11 months ago
Thank you
very nice video.
@CodeWithAarohi
8 months ago
Many many thanks
you are a genius ma Shaa Allah, thanks a lot
@CodeWithAarohi
8 months ago
You are most welcome
How is the dimension reduced for each 1D vector when each pixel of the 1D vector is multiplied by weights? Can you clarify?
@QubitBrain
8 months ago
Matrix multiplication! Let's assume an image is split into a 3x3 grid of patches, and each patch is 16x16 pixels, which is flattened to a 256x1 vector (256 rows and 1 column). A 3x3 grid gives 9 patches in total. If we stack the flattened patches side by side (each one is 256x1), we get a 256x9 matrix, i.e. 256 rows and 9 columns. Now we pass this through a linear layer. Let's say the linear layer has 5 neurons. Each neuron has a 256x1 weight vector, so the 5 neurons together form a 256x5 weight matrix (256 rows and 5 columns). We cannot multiply the input by the weights directly, because the input is 256x9 and the weights are 256x5: to multiply two matrices, the number of columns of Matrix A must equal the number of rows of Matrix B. So we transpose the input from 256x9 to 9x256. Now Matrix A is 9x256 and Matrix B is 256x5, so the product is defined and gives a new 9x5 matrix. The patch matrix has been reduced from 9x256 to 9x5. Doing this with three separate weight matrices gives three 9x5 matrices: Query, Key and Value. (In the actual ViT the patches first go through the patch-embedding projection, and Query, Key and Value are computed from those embeddings, but the shape logic is the same.) In the attention model we then multiply Query by Key. Both have the same shape (Query 9x5, Key 9x5), so we transpose the Key matrix to 5x9, which makes the multiplication possible (9x5 times 5x9). The output is a 9x9 matrix of attention scores, which is called the Attention Filter.
These scores are divided by the square root of the Key dimension and passed through the softmax activation function row by row, so every value lies between 0 and 1 and each row sums to 1. Note that the 9x9 matrix itself is not a trained parameter: training updates the weight matrices, and the attention scores are recomputed for every input image. The scaled attention matrix (9x9) is then multiplied by the Value matrix (9x5), which gives the filtered 9x5 output. Hence, based on the attention matrix, we get the important features of an image. This is how a single attention head extracts features. We use multi-head attention to extract various important features of an image. Each head focuses on different combinations of features.
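The shape bookkeeping in the explanation above can be checked with a small plain-Python sketch of one attention head. The sizes (9 patches, 256 values per patch, 5 neurons) come from the comment; everything else is an assumption for illustration: the helper names are hypothetical, the weights are random rather than trained, and the sketch feeds raw flattened patches straight into the Query, Key and Value projections, which a real ViT would not do.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols):
    """Random stand-in for data or for weights that would normally be learned."""
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


def shape(m):
    return (len(m), len(m[0]))


random.seed(0)
NUM_PATCHES, PATCH_DIM, HEAD_DIM = 9, 256, 5

patches = random_matrix(NUM_PATCHES, PATCH_DIM)  # 9 x 256, one row per patch

# Three separate weight matrices, each 256 x 5.
w_query = random_matrix(PATCH_DIM, HEAD_DIM)
w_key = random_matrix(PATCH_DIM, HEAD_DIM)
w_value = random_matrix(PATCH_DIM, HEAD_DIM)

query = matmul(patches, w_query)  # 9 x 5
key = matmul(patches, w_key)      # 9 x 5
value = matmul(patches, w_value)  # 9 x 5

# Query times Key transposed gives one score per pair of patches: 9 x 9.
scores = matmul(query, transpose(key))

# Scale by sqrt(head dimension), then apply softmax to each row.
attention = [softmax([s / math.sqrt(HEAD_DIM) for s in row]) for row in scores]

# Weighted sum of the Value rows: 9 x 5.
output = matmul(attention, value)

print(shape(query), shape(scores), shape(output))  # (9, 5) (9, 9) (9, 5)
```

Multi-head attention repeats this with a different set of three weight matrices per head and concatenates the per-head outputs.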
Awesome explanation mam
@CodeWithAarohi
7 months ago
Glad you liked it
Good one ma’am
@CodeWithAarohi
7 months ago
Thanks a lot
Please can you explain or make a series about vanilla Vision Transformers, from the paper to the programming side? 🙏🙏
@CodeWithAarohi
3 months ago
The terms "Vanilla Vision Transformers" and "Vision Transformers" are often used interchangeably, and both refer to the same fundamental concept which is applying the Transformer architecture directly to image data for computer vision tasks.
Thanks a lot! It helps me :3
@CodeWithAarohi
6 months ago
I'm glad!
Can you make a video on SegFormer? Thanks in advance for the amazing explanation!
@CodeWithAarohi
4 months ago
I will try!
How can we know the importance of the features generated by ViT? Which features drive the classification?
@CodeWithAarohi
11 months ago
While ViT doesn't inherently provide feature importance scores like some other models, you can get a rough picture of which regions matter by examining the attention maps the model produces. An attention map shows how strongly one token attends to each image patch; the attention from the class token to the patches is commonly visualized, and higher values suggest that a patch contributed more to the prediction. Visualizing these maps gives insight into which image regions the model relied on, though attention weights are only an approximate indicator of importance.
@RAZZKIRAN
11 months ago
@@CodeWithAarohi Please make a video on it, madam, for one classification task, for example dog vs. cat classification
Thank you so much, Aarohi. Could you please explain the Swin Transformer too, with its code?
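As a rough sketch of the attention-map idea discussed in this thread: given one attention matrix whose first token is the class token, the class token's row can be reshaped into a patch-grid heat map. Everything here is an assumption for illustration: the 3x3 grid, the token layout with the class token at index 0, and the random matrix, which stands in for attention weights that you would normally read out of a trained model (often averaged over heads).

```python
import random

GRID = 3                       # hypothetical 3 x 3 grid of patches
NUM_TOKENS = 1 + GRID * GRID   # assumed layout: class token first, then the patches

random.seed(0)

# Stand-in for one attention matrix: every row is non-negative and sums to 1.
attention = []
for _ in range(NUM_TOKENS):
    raw = [random.random() for _ in range(NUM_TOKENS)]
    total = sum(raw)
    attention.append([x / total for x in raw])

# Row 0 is the class token; drop its attention to itself and keep the patches.
cls_to_patches = attention[0][1:]

# Renormalise so the patch weights sum to 1, then fold them back into the grid.
norm = sum(cls_to_patches)
cls_to_patches = [a / norm for a in cls_to_patches]
heatmap = [cls_to_patches[r * GRID:(r + 1) * GRID] for r in range(GRID)]

for row in heatmap:
    print(" ".join(f"{value:.3f}" for value in row))
```

Upsampling `heatmap` to the image size and overlaying it on the input is the usual way to visualize it; treat the result as a hint about where the model looked, not as a precise importance score.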
@CodeWithAarohi
11 months ago
Sure, I have started a playlist on transformers and will try to cover every important topic which comes under transformers
@fayezalhussein7115
11 months ago
@@CodeWithAarohi Thank you again, waiting for it
Excellent explanation. I want to make a suggestion: maybe you should buy a microphone. There is a lot of noise in the background.
@CodeWithAarohi
9 months ago
Thank you, I will
Can these be applied to bank cheque processing?
@CodeWithAarohi
3 months ago
Yes, vision transformers (ViTs) can be applied to bank cheque processing tasks.
Thank you for making this video. Please share Python code for ViT, if possible. Thank you.
@CodeWithAarohi
11 months ago
Working on it!
@umarjibrilmohd8660
8 months ago
Please make one on how to train ViT for semantic segmentation tasks.
Code with Aarohi is Best KZread channel for Artificial Intelligence #CodeWithAarohi
please create a ViT on landmark detection
@CodeWithAarohi
8 months ago
I will try
@sanjoetv5748
8 months ago
@@CodeWithAarohi Thank you so much, you are the best
Why only 16×16?
@CodeWithAarohi
5 months ago
The original paper used this patch size. You can also try other sizes.
Can you provide the code?
@CodeWithAarohi
4 months ago
In this video, I explained the theory of Vision Transformers. Check the next video; the code link is in the description section of that video.
Nice. Can you share the slides with me?
Will you do Vision Transformers with TensorFlow?
@CodeWithAarohi
8 months ago
I will try.
@alis5893
8 months ago
@@CodeWithAarohi Thank you. Your method of teaching is amazing, but I am never comfortable with Torch. TensorFlow is so natural for deep learning. I look forward to this.