Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

Science & Technology

In this video I teach how to code a Transformer model from scratch using PyTorch. I highly recommend watching my previous video to understand the underlying concepts, but I will also review them again in this video while coding. All of the code is mine, except for the attention visualization function used to plot the chart, which I found online on Harvard University's website.
Paper: Attention is all you need - arxiv.org/abs/1706.03762
The full code is available on GitHub: github.com/hkproj/pytorch-transformer
It also includes a Colab Notebook so you can train the model directly on Colab.
Chapters
00:00:00 - Introduction
00:01:20 - Input Embeddings
00:04:56 - Positional Encodings
00:13:30 - Layer Normalization
00:18:12 - Feed Forward
00:21:43 - Multi-Head Attention
00:42:41 - Residual Connection
00:44:50 - Encoder
00:51:52 - Decoder
00:59:20 - Linear Layer
01:01:25 - Transformer
01:17:00 - Task overview
01:18:42 - Tokenizer
01:31:35 - Dataset
01:55:25 - Training loop
02:20:05 - Validation loop
02:41:30 - Attention visualization

Comments: 285

  • @comedyman4896 (8 months ago)

    personally, I find that seeing someone actually code something from scratch is the best way to get a basic understanding

  • @zhilinwang6303 (4 months ago)

    indeed

  • @user-jb2ex2ux9i (4 months ago)

    indeed

  • @CM-mo7mv (3 months ago)

    I don't need to see someone typing... but you might also enjoy watching the grass grow or paint dry.

  • @FireFly969 (a month ago)

    Yeah, and you see how these technologies work. It's insane that, in the end, it looks easy enough that you can do something that companies worth millions and billions of dollars do. On a smaller scale, but the same idea in the end.

  • @maskedvillainai (a month ago)

    Yeah kinda ironic how that works. The simplest stuff required the most complex explanations

  • @raviparihar3298 (5 days ago)

    Best video I have ever seen on the whole of YouTube on the transformer model. Thank you so much, sir!

  • @umarjamilai (a year ago)

    The full code is available on GitHub: github.com/hkproj/pytorch-transformer. It also includes a Colab Notebook so you can train the model directly on Colab. Of course nobody reinvents the wheel, so I watched many resources about the transformer to learn how to code it. All of the code is written by me from zero, except for the code to visualize the attention, which I took from the Harvard NLP group's article about the Transformer. I highly recommend all of you do the same: watch my video and try to code your own version of the Transformer... that's the best way to learn it. Another suggestion I can give is to download my git repo and run it on your computer, debugging the training and inference line by line while trying to guess the tensor size at each step. This will make sure you understand all the operations. Plus, if some operation is not clear to you, you can just watch the variables in real time to understand the shapes involved. Have a wonderful day!

  • @AiEdgar (11 months ago)

    The best video ever

  • @odyssey0167 (8 months ago)

    Can you provide the pretrained models?

  • @wilfredomartel7781 (2 months ago)

    🎉 Is this the BERT architecture?

  • @terryliu3635 (4 days ago)

    I learnt a lot from following the steps in this video and creating a transformer myself step by step!! Thank you!!

  • @yangrichard7874 (5 months ago)

    Greetings from China! I am a PhD student focused on AI. Your video really helped me a lot. Thank you so much, and I hope you enjoy your life in China.

  • @umarjamilai (5 months ago)

    Thank you! Let's connect on LinkedIn.

  • @ArslanmZahid (6 months ago)

    I have browsed YouTube for the perfect set of videos on the transformer, and your videos (the explanation you did of the transformer architecture) plus this one are by far the best!! Take a bow, brother, you have contributed to the viewers in an amount you can't even imagine. Really appreciate this!!!

  • @aiden3085 (5 months ago)

    Thank you Umar for your extraordinarily excellent work! Best transformer tutorial I have ever seen!

  • @maxmustermann1066 (8 months ago)

    This video is incredible, never understood it like this before. I will watch your next videos for sure, thank you so much!

  • @abdulkarimasif6457 (a year ago)

    Dear Umar, your video is full of knowledge; thanks for sharing.

  • @zhengwang1402 (6 months ago)

    It feels really fantastic watching someone write a program from the bottom up.

  • @kozer1986 (7 months ago)

    I'm not sure if it is because I have studied this content 1,000,000 times, but this is the first time I have understood the code and feel confident about it. Thanks!

  • @shresthsomya7419 (3 months ago)

    Thanks a lot for such a detailed video. Your videos on transformers are the best.

  • @shakewingo3216 (7 months ago)

    Thanks for making it so easy to understand. I definitely learn a lot and gain much more confidence from this!

  • @MuhammadArshad (7 months ago)

    Thank God, it's not one of those 'ML in 5 lines of Python code' or 'learn AI in 5 minutes' videos. Thank you. I cannot imagine how much time you must have spent on making this tutorial. Thank you so much. I have watched it three times already and wrote the code while watching the second time (with a lot of typos :D).

  • @VishnuVardhan-sx6bq (5 months ago)

    This is such great work. I don't really know how to thank you, but this is an amazing explanation of an advanced topic such as the transformer.

  • @manishsharma2211 (6 months ago)

    WOW WOW WOW, though it was a bit tough for me to understand it, I was able to understand around 80 % of the code, beautiful. Thank you soo much

  • @lyte69 (7 months ago)

    Hey there! I enjoyed watching that video, you did a wonderful job explaining everything, and I found it super easy to follow along. Overall, it was a really great experience!

  • @salmagamal5676 (4 months ago)

    I can't possibly thank you enough for this incredibly informative video

  • @abdullahahsan3859 (7 months ago)

    Keep doing what you are doing. I really appreciate you taking out so much time to spread such knowledge for free. I have been studying transformers for a long time, but never have I understood them so well. The theoretical explanation in the other video combined with this practical implementation, just splendid. I will be going through your other tutorials as well. I know how time-consuming it is to produce such high-level content, and all I can really say is that I am grateful for what you are doing and hope that you continue doing it. Wish you a great day!

  • @umarjamilai (7 months ago)

    Thank you for your kind words. I wish you a wonderful day and success for your journey in deep learning!

  • @user-db8nb5wz2z (7 months ago)

    Really great explanation for understanding the Transformer; many thanks to you.

  • @michaelscheinfeild9768 (9 months ago)

    I'm enjoying the clear explanation of the Transformer coding!

  • @sagarpadhiyar3666 (a month ago)

    Best video I came across for transformer from scratch.

  • @mikehoops (7 months ago)

    Just to repeat what everyone else is saying here - many thanks for an amazing explanation! Looking forward to more of your videos.

  • @balajip5030 (7 months ago)

    Thanks, bro. With your explanation, I was able to build the transformer model for my application. You explained it so well. Please keep doing what you are doing.

  • @codevacaphe3763 (5 days ago)

    Hi, I just happened to see your video. It's really amazing; your channel is so good, with valuable information. I hope you keep this up, because I really love your content.

  • @goldentime11 (a month ago)

    Thanks for your detailed tutorial. Learned a lot!

  • @saziedhassan3976 (9 months ago)

    Thank you so much for taking the time to code and explain the transformer model in such detail. You are amazing and please do a series on how transformers can be used for time series anomaly detection and forecasting!

  • @amiralioghli8622 (8 months ago)

    My question and request are the same as yours. If you find any tutorial, please share it with me.

  • @prajolshrestha9686 (7 months ago)

    I appreciate you for this explanation. Great video!

  • @mohamednabil374 (11 months ago)

    Thanks Umar for this comprehensive tutorial; after watching many videos, I would say this is AWESOME! It would be really nice if you could provide us with more tutorials on Transformers, especially training them on longer sequences. :)

  • @umarjamilai (10 months ago)

    Hi mohamednabil374, stay tuned for my next video on the LongNet, a new transformer architecture that can scale up to 1 billion tokens.

  • @ansonlau7040 (2 months ago)

    Big thank you for the video, it makes the transformer so easy to learn (also the explanation video) 👍👍

  • @amiralioghli8622 (8 months ago)

    Thank you so much for taking the time to code and explain the transformer model in such detail. You are amazing, and if possible please do a series on how transformers can be used for time series anomaly detection and forecasting. It is extremely needed on YouTube from someone! Thanks in advance.

  • @physicswithbilalasmatullah (2 months ago)

    Hi Umar. I am a first year student at MIT who wants to do AI startups. Your explanation and comments during coding were really helpful. After spending about 10 hours on the video, I walk away with great learnings and great inspiration. Thank you so much, you are an amazing teacher!

  • @umarjamilai (2 months ago)

    Best of luck with your studies and thank you for your support!

  • @qikangdeng1487 (a month ago)

    What a WONDERFUL example of transformer! I am Chinese and I am doing my PhD program in Korea. My research is also about AI. This video helps me a lot. Thank you! BTW, your Chinese is very good!😁😁

  • @ghabcdef (4 months ago)

    Thanks a ton for making this video and all your other videos. Incredibly useful.

  • @umarjamilai (4 months ago)

    Thanks for your support!

  • @user-ru4nb8tk6f (7 months ago)

    You are a great professional, thanks a ton for this

  • @jeremyregamey495 (6 months ago)

    I love your videos. Thank you for sharing your knowledge; I can't wait to learn more.

  • @si0n4ra (10 months ago)

    Umar, thank you for the amazing example and clear explanation of all your steps and actions.

  • @umarjamilai (10 months ago)

    Thank you for watching my video and your kind words! Subscribe for more videos coming soon!

  • @si0n4ra (10 months ago)

    @umarjamilai, mission completed 😎. Already subscribed. All the best, Umar.

  • @JohnSmith-he5xg (7 months ago)

    Loving this video (only 13 minutes in); I really like you using type hints, comments, descriptive variable names, etc. Way better coding practices than most of the ML code I've looked at. At 13:00, for the 2nd arg of the array indexing, you could just use ":" and it would be identical.
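
    A tiny sketch of the equivalence mentioned above (the tensor name pe and the sizes are illustrative, not necessarily the exact variables from the video):

        import torch

        pe = torch.zeros(1, 10, 8)   # (batch, seq_len, d_model)
        a = pe[:, :5, :]             # explicit ":" for the last dimension
        b = pe[:, :5]                # trailing dimensions are kept implicitly
        assert torch.equal(a, b)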

  • @tonyt1343 (5 months ago)

    Thank you for this comment! I'm coding along with this video and I wasn't sure if my understanding was correct. I'm glad someone else was thinking the same thing. Just to be clear, I am VERY THANKFUL for this video and am in no way complaining. I just wanted to make sure I understand because I want to fully internalize this information.

  • @nhutminh1552 (5 months ago)

    Thank you admin. Your video is great. It helps me understand. Thank you very much.

  • @Patrick-wn6uj (2 months ago)

    Hi Umar, thank you for all the work you are doing. Please consider making a video like this on vision transformers.

  • @godswillanosike896 (2 months ago)

    Great explanation! Thanks very much

  • @dapostop7384 (23 days ago)

    Wow, super useful! Coding really helps me understand the process better than visuals.

  • @skirazai7591 (5 months ago)

    Great video, you are insanely talented btw.

  • @gunnvant (8 months ago)

    This was really good. I understood multihead attention better with the code explanation.

  • @linyang9536 (5 months ago)

    This is the most detailed from-scratch Transformer tutorial I have ever seen: from code implementation to data processing to visualization, the creator really chews everything over in fine detail. Thank you!

  • @decarteao (4 months ago)

    I didn't understand anything! But I left my like.

  • @astrolillo (2 months ago)

    @decarteao The guy in China is very funny in the video.

  • @Mostafa-cv8jc (6 months ago)

    Very good video. Tysm for making this, you are making a difference

  • @MrSupron00 (8 months ago)

    This is excellent! Thank you for putting this together. I do have one point of confusion with how the final multi-head attention concatenation takes place. I believe the concatenation happens on line 110, where V' = (V1, V2, ..., Vh) has shape (seq_len, h*d_k). This is intended to be multiplied by the matrix W_O of shape (h*d_k, d_model) to give something of shape (seq_len, d_model), as required. However, here you implement a linear layer, which takes the concatenated V' of shape (seq_len, d_model) and computes W*V' + b, where the dimensions of W and b are chosen to satisfy the output dimension. This is different from multiplying directly by a predefined trainable matrix W_O. Now, I can see how these are nearly the same thing, and in practice it may not matter, but it would be helpful to point out these tricks of the trade so folks like myself don't get bogged down in the subtleties. Thanks.
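
    A minimal sketch of the equivalence discussed above, assuming the usual shapes (the variable names here are illustrative): an nn.Linear output projection is the same multiplication by a trainable W_O, just with an extra learned bias.

        import torch
        import torch.nn as nn

        batch, seq_len, h, d_k = 2, 5, 8, 64
        d_model = h * d_k

        # Heads already concatenated back into (batch, seq_len, h * d_k):
        x = torch.randn(batch, seq_len, h * d_k)

        # nn.Linear(d_model, d_model) computes x @ W.T + b, so its weight plays the
        # role of W_O from the paper; the only difference is the bias term.
        w_o = nn.Linear(d_model, d_model)
        out = w_o(x)                                   # (batch, seq_len, d_model)

        # The same operation written as an explicit matrix product:
        out_manual = x @ w_o.weight.T + w_o.bias
        assert torch.allclose(out, out_manual, atol=1e-6)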

  • @toxicbisht4344 (4 months ago)

    Amazing explanation. Thank you for this.

  • @jihyunkim4315 (7 months ago)

    Perfect video!! Thank you so much. I always wondered about the detailed code and its explanation, and now I understand almost all of it. Thanks :) You are the best!

  • @umarjamilai (7 months ago)

    You're welcome!

  • @keflatspiral4633 (5 months ago)

    What to say... just WOW! Thank you so much!!

  • @oborderies (6 months ago)

    Sincere congratulations on this fine and very useful tutorial! Much appreciated 👏🏻

  • @texwiller7577 (2 months ago)

    Doctor... you're great!

  • @divyanshbansal2321 (3 months ago)

    Thank you mate. You are a godsend!

  • @SyntharaPrime (6 months ago)

    Great Job. Amazing. Thanks a lot. I really appreciate you. It is so much effort.

  • @FailingProject185 (3 months ago)

    You are one of the coolest dudes in this area. It'd be helpful if you provided a roadmap for reaching your level of expertise. I'd really love to learn from you, but I can't follow everything yet. A roadmap would help so many of your subscribers.

  • @angelinakoval8360 (6 months ago)

    Dear Umar, thank you so, so much for the video! I don't have much experience in deep learning, but your explanations are so clear and detailed that I understood almost everything 😄. It will be a great help for me at my work. Wish you all the best! ❤

  • @umarjamilai (6 months ago)

    Thank you for your kind words, @angelinakoval8360!

  • @forresthu6204 (7 months ago)

    At 22:39, it describes the essentials of the self-attention computation in a very clear and easy-to-understand way.

  • @aspboss1973 (9 months ago)

    It's a really awesome video with clear explanations, and the flow of the code is very easy to understand. One question: how would you implement this transformer architecture for a question-answering model? (Q/A on a very specific topic, let's say the manual of an instrument.) Thank you so much for this video!!!

  • @user-qo7vr3ml4c (20 days ago)

    Thank you very much, this is very useful.

  • @user-eu3ok8dc8b (8 months ago)

    One of the best videos. Thanks a lot for the video.

  • @LeoDaLionEdits (10 days ago)

    thank you so much for these videos

  • @coc2912 (8 months ago)

    Thanks for your video and code.

  • @cicerochen313 (6 months ago)

    Awesome! Highly appreciated. Super awesome! Thank you very much.

  • @pawanmoon (8 months ago)

    Great work!!

  • @kailazarov107 (8 months ago)

    Really great video - I learned a lot. Your inference notebook uses the dataset batching, but how can you build inference with user-typed sentences?

  • @wellhellothere1785 (9 months ago)

    How do you suggest applying this model to time-series forecasting, and what do you think should be changed? So far I believe that there is no source language and target language in forecasting; it is only time-based. Also, for the error or loss function I should use MSE in this case. Is there anything else I might be missing?

  • @phanindraparashar8930 (a year ago)

    It is a really amazing video. I tried understanding the code from various other YouTube channels but was always getting confused. Thanks a lot :) Can you make a series on BERT & GPT as well, where you build these models and train them on custom data?

  • @umarjamilai (a year ago)

    Hi Phanindra! I'll definitely continue making more videos. It takes a lot of time and patience to make just one video, not counting the preparation time to study the model, write the code and test it. Please share the channel and subscribe; that's the biggest motivation to continue providing high-quality content to you all.

  • @rubelahmed5458 (6 months ago)

    A coding example for BERT would be great, @umarjamilai!

  • @user-ul2mw6fu2e (5 months ago)

    Wow, your explanation is amazing.

  • @sypen1 (6 months ago)

    This is amazing thank you 🙏

  • @omidsa8323 (4 months ago)

    Great Job!

  • @sypen1 (6 months ago)

    Mate you are a beast!

  • @rafa_br34 (5 days ago)

    Great video! I'm wondering, is there any reason to save the positional encoding vector? I don't see why you would need to save it since it seems to always be the same value considering the init parameters don't change.

  • @Hdjandbkwk (9 months ago)

    Just want to say thank you!! This is easily one of my favorite videos on YouTube! I have watched a few videos on transformers, but none explained it as clearly as you. At first I was scared by the length of the video, but you managed to keep my attention for the full 3 hours! Following your instructions I am now able to train my very first transformer! Btw, I am using the tokenizer the way you are, but looking at the tokenizer file it looks like my tokenizer didn't split the sentences into words and is using the whole sentence as a token. Do you have any idea why? I am using a Mac, if that matters.

  • @umarjamilai (9 months ago)

    Hi! Thanks for your kind words! Make sure your PreTokenizer is the "Whitespace" one and that the Tokenizer is the "WordLevel" tokenizer. As a last resort, you can clone the repository from my GitHub and compare my code with yours. Have a wonderful rest of the day!

  • @Hdjandbkwk (9 months ago)

    @umarjamilai I have the PreTokenizer set to Whitespace and am using the WordLevel tokenizer and trainer, but it still encodes the sentence as a whole. I did a direct swap to the BPE tokenizer and that encodes the sentences correctly; maybe there is a bug in the WordLevel tokenizer on macOS. Another question that I have: what determines the max context size for LLMs? Is it the d_model size?
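
    For reference, a minimal sketch of the WordLevel + Whitespace setup being discussed, using the Hugging Face tokenizers library (the special-token names and the example sentences are assumptions, not necessarily the exact values from the repository):

        from tokenizers import Tokenizer
        from tokenizers.models import WordLevel
        from tokenizers.trainers import WordLevelTrainer
        from tokenizers.pre_tokenizers import Whitespace

        # Word-level model with an explicit unknown token.
        tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
        # Without a pre-tokenizer, the whole sentence stays a single token,
        # which matches the symptom described above.
        tokenizer.pre_tokenizer = Whitespace()

        trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
        sentences = ["I love machine learning", "Transformers are fun"]
        tokenizer.train_from_iterator(sentences, trainer=trainer)

        print(tokenizer.encode("I love machine learning").tokens)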

  • @vigenisayan2343 (3 months ago)

    It was very useful to watch. Question: what books or learning sources would you suggest for learning PyTorch deeply? Thanks.

  • @JohnSmith-he5xg (7 months ago)

    OMG. And you also note Matrix shapes in comments! Beautiful. I actually know the shapes without having to trace some variable backwards.

  • @michaelscheinfeild9768 (9 months ago)

    I enjoyed the video! Now I can transform the world!

  • @user-wr4yl7tx3w (6 months ago)

    The code is really well written. Very easy to follow and nicely organized.

  • @user-sp5pf5du3m (2 months ago)

    You are a genius

  • @user-gj2cl2rr9x (3 months ago)

    At 13:13 (of 2:59:23), when we build the PositionalEncoding module, in the line x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False), the x.shape[1] does not really vary in this transformer model: when we build dataset.py we pad all the sentences to the same length, so what gets loaded into PositionalEncoding is always (batch, seq_len, input_embedding_dim), where x.shape[1] equals seq_len for every batch instead of the original sentence length.
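
    For context, a minimal sketch of a PositionalEncoding module in the spirit of the one discussed (names and defaults here are illustrative, not necessarily the exact code from the repository):

        import math
        import torch
        import torch.nn as nn

        class PositionalEncoding(nn.Module):
            def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
                super().__init__()
                self.dropout = nn.Dropout(dropout)
                pe = torch.zeros(seq_len, d_model)
                position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
                div_term = torch.exp(torch.arange(0, d_model, 2).float()
                                     * (-math.log(10000.0) / d_model))
                pe[:, 0::2] = torch.sin(position * div_term)
                pe[:, 1::2] = torch.cos(position * div_term)
                # A buffer is saved in the state_dict and moves with .to(device),
                # but is not a trainable parameter.
                self.register_buffer("pe", pe.unsqueeze(0))  # (1, seq_len, d_model)

            def forward(self, x):
                # The slice is a no-op when every batch is padded to seq_len,
                # but it keeps the module correct for shorter inputs at inference time.
                x = x + self.pe[:, :x.shape[1], :].requires_grad_(False)
                return self.dropout(x)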

  • @FireFly969 (a month ago)

    Thank you Umar Jamil for this wonderful content. To be honest, as a beginner in PyTorch I find it hard to keep track of each part and what happens in each line of code. I wonder what I need to know before starting one of your videos. I think I need to read the paper multiple times until I understand it?

  • @ageofkz (3 months ago)

    At 29:14, in the part on multi-head attention, we multiply each of Q, K, V by Wq, Wk, Wv, then split them into h heads, apply the scaled dot-product attention, and concat them again. But should we not split them first and then apply Wq_h, where Wq_h is the weight matrix for the h-th query head (and the same for K and V)? Because it seems like we just split them, apply attention, then concat.
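
    A small sketch of why the two views coincide: projecting with one big Wq and then splitting along the feature dimension is the same as giving each head its own (d_model x d_k) projection, because the rows of the big weight matrix partition into per-head blocks (the shapes below are illustrative):

        import torch
        import torch.nn as nn

        batch, seq_len, h, d_k = 2, 5, 8, 64
        d_model = h * d_k

        x = torch.randn(batch, seq_len, d_model)
        w_q = nn.Linear(d_model, d_model, bias=False)

        # Project once, then split the last dimension into h heads.
        q = w_q(x)                                          # (batch, seq_len, d_model)
        q = q.view(batch, seq_len, h, d_k).transpose(1, 2)  # (batch, h, seq_len, d_k)

        # Head 0 computed with its own slice of the big projection matrix:
        w_q0 = w_q.weight[:d_k, :]                          # (d_k, d_model)
        q0 = x @ w_q0.T                                     # (batch, seq_len, d_k)
        assert torch.allclose(q[:, 0], q0, atol=1e-6)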

  • @sudo_codex (10 months ago)

    Thanks Umar for the amazing video, but I'm still confused about how we can build and apply transformers from scratch for multi-label classification.

  • @rafabaranowski513 (3 months ago)

    At 1:42:03 you are using the SOS special token from the source-language tokenizer in a sentence of the target language. The tokenizers are trained on different languages, so is it correct to share special tokens between them? Won't the SOS token from the source-language tokenizer have a different id than the SOS token from the target-language tokenizer?
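
    A short, hedged note on the point above: the ids only coincide if both tokenizers were trained with the same special_tokens list in the same order. The snippet below is a sketch with assumed names (tokenizer_src and tokenizer_tgt stand for the two trained Hugging Face tokenizers); looking each special token up in its own tokenizer is the unambiguous choice.

        import torch

        # Assumed objects: tokenizer_src and tokenizer_tgt are trained tokenizers.Tokenizer instances.
        assert tokenizer_src.token_to_id("[SOS]") is not None
        assert tokenizer_tgt.token_to_id("[SOS]") is not None

        # Decoder-side special tokens taken from the *target* tokenizer:
        sos_token = torch.tensor([tokenizer_tgt.token_to_id("[SOS]")], dtype=torch.int64)
        eos_token = torch.tensor([tokenizer_tgt.token_to_id("[EOS]")], dtype=torch.int64)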

  • @user-zx7un1yq4d (10 months ago)

    Excellent video. Well done! Can I request a yaml file in the repo to set up the environment (with the version numbers)?

  • @therealvortex100 (10 months ago)

    Hello, at 51:16 why do we add normalization at the end of the encoder?

  • @umarjamilai (10 months ago)

    Hi! About the layer normalization, there are different opinions on where to add it in the model. I suggest you read this paper (arxiv.org/abs/2002.04745) which discusses this issue. Have a nice day!

  • @julianmejiez8006 (10 months ago)

    Hi, thank you for the videos. I have a question: during training I'm using the (en-fr) dataset, but I'm getting an error with the padding: raise ValueError("Sentence is too long"). How can I avoid this while training? Any thoughts?

  • @umarjamilai (10 months ago)

    If I remember correctly, that exception is raised when the model's sequence length is too short to accommodate the sentences present in the dataset. Try increasing the sequence length of the model.
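
    A hypothetical illustration of that fix, assuming a config dictionary like the one in the repository (the helper name and key are assumptions; adjust them to whatever your config actually defines):

        from config import get_config   # assumed helper from the repo's config.py

        config = get_config()
        # Must cover the longest tokenized sentence in the dataset,
        # plus the special tokens added by the dataset class.
        config['seq_len'] = 500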

  • @babaka1850 (21 days ago)

    For determining the max length of the target sentence, I believe you should point to tokenizer_tgt rather than tokenizer_src: tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids

  • @md.shahabulalam9484 (a month ago)

    Umar thank you so much.........

  • @guoweishieh775 (5 months ago)

    Thanks for this video. Super cool. I have one question though: what determines whether a module should have dropout or not? InputEmbedding has no dropout, but something as simple as ResidualConnection has dropout? LayerNorm has no dropout. I don't know what the pattern is there.

  • @zhuxindedongchang4229 (a month ago)

    Hello Umar, really impressive work on the Transformer. I have followed your steps in this experiment. One small thing I am not sure about: when you compute the loss you use nn.CrossEntropyLoss(), and that method already applies the softmax itself. As its documentation says: "The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general)." But the project method in the built Transformer model applies a softmax. I wonder if we should output only the logits, without this softmax, to fit the nn.CrossEntropyLoss() method? Thank you anyway.
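
    For reference, a minimal sketch of the two standard pairings in PyTorch (the projection layer, vocabulary size and padding id below are illustrative assumptions, not the repository's exact values):

        import torch
        import torch.nn as nn

        vocab_size, pad_id = 100, 0
        proj = nn.Linear(512, vocab_size)             # final projection to the vocabulary
        x = torch.randn(2, 7, 512)                    # (batch, seq_len, d_model)
        target = torch.randint(0, vocab_size, (2, 7))

        # Pairing 1: raw logits + CrossEntropyLoss (the loss applies log-softmax internally).
        logits = proj(x)
        ce = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)
        loss1 = ce(logits.view(-1, vocab_size), target.view(-1))

        # Pairing 2: log-probabilities + NLLLoss (the model applies log-softmax itself).
        log_probs = torch.log_softmax(proj(x), dim=-1)
        nll = nn.NLLLoss(ignore_index=pad_id)
        loss2 = nll(log_probs.view(-1, vocab_size), target.view(-1))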

  • @ngocchienchu109 (10 months ago)

    thank you so much

  • @mohsinansari3584 (a year ago)

    Just finished watching. Thanks so much for the detailed video. I plan to spend this weekend on coding this model. How long did it take to train on your hardware?

  • @umarjamilai (a year ago)

    Hi Mohsin! Good job! It took around 3 hours to train 30 epochs on my computer. You can train even for 20 epochs to see good results. Have a wonderful day!

  • @NaofumiShinomiya (a year ago)

    @umarjamilai What is your hardware? I just started studying deep learning a few days ago and I didn't know transformers could take this long to train.

  • @umarjamilai (a year ago)

    @NaofumiShinomiya Training time depends on the architecture of the network, on your hardware and on the amount of data you use, plus other factors like learning rate, learning-rate scheduler, optimizer, etc. So many conditions to factor in.

  • @marlonfermat8115 (7 months ago)

    Great video by the way, thank you! Around 6:30, why does the positional encoding need d_model? Wouldn't seq_len suffice?

  • @umarjamilai (7 months ago)

    Because each vector of the positional encoding has d_model dimensions. Otherwise you wouldn't be able to add the embedding and the position vectors together, they need to have the same dimensions.

  • @kareemsaid8863 (29 days ago)

    I tried to modify the code a bit for a code-generation task and I am stuck at a loss of 1.2. What do you think is the problem? It is literally the same code; I just changed a bit in the get_ds function and the get_sequances function, and I treated the src tokenizer and tgt tokenizer as the same tokenizer. What is wrong with my changes to your implementation?

  • @ghabcdef (4 months ago)

    In the matrices w_k, w_q, w_v and w_o in the MultiHeadAttention module, why did you not set the bias=False? Don't you need that for this to work properly?

  • @albert4392 (8 months ago)

    This is an excellent video; your explanation is so clear and the live coding helps understanding! Can you give us tips for debugging such a huge model? It is really hard to make sure the model works well. My tip for debugging is to print out the shape of the tensor at each step, but this only makes sure the shape is correct; there may be logical errors I miss. Thank you!

  • @umarjamilai (8 months ago)

    Hi! I'd love to give a golden rule for debugging models, but unfortunately, it depends highly on the architecture/loss/data itself. One thing that you can do is, before training the model on a big dataset, it is recommended to train it on a very small dataset to make sure everything is working and the model should overfit on the small dataset. For example, if instead of training on many books, you train a LLM on a single book, hopefully it should be able to write sentences from that book, given a prompt. The second most important thing is to validate the model as the training is proceeding to verify that the quality is improving over time. Last but not least, use metrics to decide if the model is going in the right direction and make experiments on hyper parameters to verify assumptions, do not just make assumptions without validating them. When you have a model with billions of parameters, it is difficult to predict patterns, so every assumption must be verified experimentally. Have a nice day!

  • @tonyt1343 (5 months ago)

    Thanks!

  • @omarbouaziz2303 (5 months ago)

    I'm working on speech-to-text conversion using Transformers. This was very helpful, but how can I change the code to suit my task?

  • @txxie (5 months ago)

    This video is great! But can you explain how you convert the formula of positional embeddings into log form?
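
    For reference, the identity behind the log-form trick used in most implementations (a standard algebraic rewrite, not something specific to this repository); computing the divisor in log space avoids raising 10000 directly to large fractional powers:

        \frac{1}{10000^{2i/d_{\text{model}}}}
            = \exp\!\left(\ln 10000^{-2i/d_{\text{model}}}\right)
            = \exp\!\left(-\frac{2i}{d_{\text{model}}}\,\ln 10000\right)

        \text{so that}\quad
        PE(pos, 2i) = \sin\!\left(pos \cdot \exp\!\left(-\tfrac{2i}{d_{\text{model}}}\ln 10000\right)\right)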

  • @cocohand781 (3 months ago)

    I wonder what the difference is between using a lambda function and calling the function directly on lines 163, 164, 165. Thank you.
