Swin Transformer - Paper Explained
A brief explanation of the Swin Transformer paper.
Paper link: arxiv.org/abs/2103.14030
Table of Contents:
00:00 Intro
00:13 Patch Embedding
02:56 Swin transformer block
03:57 W-MSA
05:14 SW-MSA
08:56 Masked MSA implementation
14:58 Patch Merging
16:12 Stages
18:28 Image classification result
19:12 Relative position bias
Icon made by Freepik from flaticon.com
Comments: 24
By far one of the best and most complete Swin Transformer explanations on the entire Internet.
@soroushmehraban
29 days ago
Thanks!
@FinalProject-rw1yf
27 days ago
@@soroushmehraban Hi sir, could you also explain the FasterViT and GCViT papers?
Thorough! Very comprehensible, thank you.
Thanks for the good explanation!
Really informative. It helped me a lot to understand many concepts here. Keep up the good work!
@soroushmehraban
A year ago
Thanks! I’ll try my best.
Very well explained, thank you Soroush.
@soroushmehraban
10 months ago
Glad you liked it
Great video! Thanks
@soroushmehraban
A year ago
Thanks for the feedback 🙂
I enjoyed it very much.
That's The Most Illustrative Video Of Swin-Transformers on The Internet!
@soroushmehraban
11 months ago
Glad you enjoyed it 😃
@omarabubakr6408
11 months ago
@@soroushmehraban Yes, thanks so much! I have a quick question more related to PyTorch: at 12:49, in line 239 of the code, first, what does the -1 mean and what exactly does it do to the tensor? Second, where does the [4, 16] come from? The 4 isn't mentioned in the reshaping. Thanks in advance.
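For the first part of the question, PyTorch's reshape semantics are well defined: -1 tells PyTorch to infer that dimension from the tensor's total element count. A general illustration (not the video's exact code, whose input shapes are only an assumption here):

```python
import torch

# -1 asks PyTorch to infer the missing dimension from the total
# number of elements; only one -1 is allowed per reshape call.
x = torch.arange(64)       # 64 elements in a flat tensor
y = x.reshape(4, -1)       # 4 rows requested -> -1 is inferred as 16

print(y.shape)             # torch.Size([4, 16])
```

So a [4, 16] result simply means the fixed dimensions multiplied out leave 16 elements per row; the 4 comes from wherever the other dimensions were fixed earlier in the code.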
perfect description.
@soroushmehraban
11 months ago
Glad it was helpful 🙂
Amazing video !
@soroushmehraban
8 months ago
Thanks!
You deserve more likes and subscribers
@soroushmehraban
6 months ago
Thanks man🙂 appreciated
Why does the channel dimension increase from C to 4C after merging?
@soroushmehraban
6 months ago
Because we downsample the width by 2 and the height by 2. That's a 4x downsampling in spatial resolution, and those values are moved into the channel dimension. It's just a simple tensor reshaping. For example, 10x10x2 = 200 values; after merging, it's 5x5x8 = 200.
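The reshaping described above can be sketched in PyTorch. This is a minimal illustration with the small hypothetical shapes from the reply (10x10x2), gathering each 2x2 neighborhood of patches into the channel dimension:

```python
import torch

# Patch merging sketch: each 2x2 block of patches is concatenated
# along channels, so H x W x C becomes (H/2) x (W/2) x 4C.
# The total number of values is unchanged: 10*10*2 == 5*5*8 == 200.
H, W, C = 10, 10, 2
x = torch.randn(1, H, W, C)               # (B, H, W, C)

x0 = x[:, 0::2, 0::2, :]                  # top-left of each 2x2 block
x1 = x[:, 1::2, 0::2, :]                  # bottom-left
x2 = x[:, 0::2, 1::2, :]                  # top-right
x3 = x[:, 1::2, 1::2, :]                  # bottom-right
merged = torch.cat([x0, x1, x2, x3], -1)  # (B, H/2, W/2, 4C)

print(merged.shape)                       # torch.Size([1, 5, 5, 8])
```

In the paper, a linear layer then projects 4C down to 2C, so each stage halves the spatial resolution while doubling the channels.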
2:43 C would be equal to the number of filters, not the number of kernels. In the torch.nn.Conv2d operation being performed, there are C filters, and each filter has 3 kernels (one per input channel), not C kernels.
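The filter/kernel distinction above shows up directly in the weight shape of the patch-embedding convolution. A sketch assuming Swin-T's values (embedding dim C = 96, 4x4 patches) and a standard 224x224 RGB input:

```python
import torch

# Conv2d weight has shape (out_channels, in_channels, kH, kW):
# C filters, each composed of 3 kernels, one per RGB input channel.
C = 96
conv = torch.nn.Conv2d(in_channels=3, out_channels=C,
                       kernel_size=4, stride=4)
print(conv.weight.shape)           # torch.Size([96, 3, 4, 4])

img = torch.randn(1, 3, 224, 224)  # (B, 3, H, W)
out = conv(img)                    # (B, C, H/4, W/4)
print(out.shape)                   # torch.Size([1, 96, 56, 56])
```

Each of the 96 filters produces one output channel by summing its 3 per-channel kernel responses, which is why the output channel count equals the number of filters.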