How a Transformer works at inference vs training time

Science & Technology

I made this video to illustrate the difference between how a Transformer is used at inference time (i.e. when generating text) vs. how a Transformer is trained.
Disclaimer: this video assumes that you are familiar with the basics of deep learning and that you've used HuggingFace Transformers at least once. If that's not the case, I highly recommend Stanford's CS231n course (cs231n.stanford.edu/), which will teach you the basics of deep learning. To learn HuggingFace, I recommend our free course: huggingface.co/course.
The video explains in detail the difference between input_ids, decoder_input_ids and labels:
- the input_ids are the inputs to the encoder
- the decoder_input_ids are the inputs to the decoder
- the labels are the targets for the decoder (a minimal code sketch illustrating all three follows below).
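To make the three arguments concrete, here is a minimal sketch. It uses t5-small purely as an illustrative checkpoint (any encoder-decoder model in the Transformers library behaves the same way), and the helper method shown at the end is assumed to be available in a recent version of the library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# input_ids: what the encoder reads
input_ids = tokenizer("translate English to French: My dog is cute", return_tensors="pt").input_ids
# labels: the target tokens the decoder should learn to produce
labels = tokenizer("Mon chien est mignon", return_tensors="pt").input_ids

# When only labels are passed, the model builds decoder_input_ids itself by shifting
# the labels one position to the right and prepending the decoder start token
# (teacher forcing), then computes the cross-entropy loss.
outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)

# The same shift can also be done explicitly (method name assumed from recent versions):
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)
print(labels[0][:5])
print(decoder_input_ids[0][:5])
```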
Resources:
- Transformer paper: arxiv.org/abs/1706.03762
- Jay Alammar's The Illustrated Transformer blog post: jalammar.github.io/illustrate...
- HuggingFace Transformers: github.com/huggingface/transf...
- Transformers-Tutorials, a repository containing several demos for Transformer-based models: github.com/NielsRogge/Transfo....

Comments: 106

  • @vsucc3176
    @vsucc31769 күн бұрын

    I didn't find a lot of resources that include both drawings of the process and code examples / snippets that demonstrate the drawings practically. Thank you, this helps me a lot :)

  • @TempusWarrior
    @TempusWarrior9 ай бұрын

    I rarely comment on YT videos, but I wanted to say thanks. This video doesn't have all the marketing BS and provides the type of understanding I was looking for.

  • @waynelau3256

    @waynelau3256

    8 ай бұрын

    Gosh, imagine the day videos were ranked based on content and not fake marketing tactics 😂

  • @NielsRogge

    @NielsRogge

    7 ай бұрын

    Thanks for the kind words!

  • @sohelshaikhh
    @sohelshaikhh5 ай бұрын

    Beautifully explained! I want to shamelessly request you for a series where you go one step deeper to explain this beautiful architecture.

  • @kevinsummerian
    @kevinsummerian6 ай бұрын

    For someone coming from a software engineering background this was hands down the most useful explanation of the Transformer architecture.

  • @zobinhuang3955
    @zobinhuang39559 ай бұрын

    The clearest explanation of the Transformer model I have seen. Thanks Niels!

  • @marcoxs36
    @marcoxs36 Жыл бұрын

    Thank you Niels, this was really helpful to me for understanding this complex topic. These aspects of the model are not normally covered in most resources I've seen.

  • @RamDhiwakarSeetharaman
    @RamDhiwakarSeetharaman Жыл бұрын

    Unbelievably great and intuitive explanation. Something for us to learn. Thanks a lot, Niels.

  • @ashishnegi9663
    @ashishnegi9663 Жыл бұрын

    You are a great teacher Niels! Would really appreciate if you add more such videos on hot ML/DL topics.

  • @jagadeeshm9526
    @jagadeeshm9526 Жыл бұрын

    Amazing video... it covered exactly what most other resources on this topic are missing. Keep this great work going, Niels!

  • @jasonzhang5378
    @jasonzhang53789 ай бұрын

    This is one of the cleanest explanations of Transformer inference and training on the web. Great video!

  • @henrik-ts
    @henrik-ts4 күн бұрын

    Great video, very comprehensible explanation of a complex subject.

  • @farrugiamarc0
    @farrugiamarc02 ай бұрын

    This is the best explanation I have met so far on this particular topic (inference vs training). I hope that more videos like this are released in the future. Well done!

  • @user-yk4hv8tz7m
    @user-yk4hv8tz7m9 ай бұрын

    Inference:
    1. Tokens are generated one at a time, conditioned on the input + previous generations.
    2. The language modelling head converts the hidden states to logits.
    3. Greedy search or beam search is possible.
    Training:
    1. input_ids: the input prompt; labels: the output.
    2. decoder_input_ids are copied from the labels and prepended with the decoder start token.
    3. The decoder generates text all at once, but uses a causal attention mask to mask out future tokens from the decoder input ids.
    4. -100 is assigned to padded positions in the labels to tell the cross-entropy function not to compute the loss there.
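To make the inference part of this summary concrete, here is a greedy-decoding loop written by hand; t5-small is just an illustrative checkpoint, and in practice you would simply call model.generate(), which implements this loop for you:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

encoder_inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

# Greedy decoding by hand: at every step, feed everything generated so far,
# take the logits of the last position, and append the argmax token.
with torch.no_grad():
    for _ in range(20):
        outputs = model(**encoder_inputs, decoder_input_ids=decoder_input_ids)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:  # stop at the end-of-sequence token
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```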

  • @shivamsengupta121
    @shivamsengupta1215 ай бұрын

    This is the best video on transformers. Everybody explains about the structure and attention mechanism but you choose to explain the training and inference phase. Thank you so much for this video. You are awesome 😎. Love from India ❤

  • @omgwenxx
    @omgwenxx3 ай бұрын

    I am using the HuggingFace library, and this video finally gave me a clear understanding of the wording used and the Transformer architecture flow. Thank you!

  • @chenqu773
    @chenqu773 Жыл бұрын

    Very intuitive, concise explanation of a very important topic. Thank you very much!

  • @thomasvrancken895
    @thomasvrancken895 Жыл бұрын

    Great video! I like the pace and easy explanation on things that are not necessarily straightforward. And clean excalidraw skills 😉 Hope to see more soon

  • @abhikhubby
    @abhikhubby10 ай бұрын

    Best video on AI I've seen so far. Thank you so much for making & sharing! The only parts that might need a bit more explanation are the logits area + vector embedding creation (but the latter already has lots of content).

  • @amitsingha1637
    @amitsingha16376 ай бұрын

    Thanks man. We need more videos like this.

  • @sanjaybhatikar
    @sanjaybhatikar5 ай бұрын

    Thanks so much, you hit upon the points that are confusing for a first-time user of LLMs. Thank you!

  • @forecenterforcustomermanag7715
    @forecenterforcustomermanag7715Ай бұрын

    Excellent overview of how the encoder and decoder work together. Thanks.

  • @VitalContribution
    @VitalContribution Жыл бұрын

    I watched the whole video and I understand now so much more. Thank you very much for this great video! Please keep it up!

  • @mytr8986
    @mytr8986 Жыл бұрын

    Excellent and simple video for understanding the workings of the Transformer, thanks a lot!

  • @lucasbandeira5392
    @lucasbandeira53924 ай бұрын

    Niels, thank you very much for this video! It was really helpful! The concept behind Transformers is pretty complicated, but your explanation definitely helped me to understand it.

  • @lovekesh88
    @lovekesh883 ай бұрын

    Thanks Niels for the video. I look forward to more content on the topic.

  • @thorty24
    @thorty242 ай бұрын

    This is one of the greatest explanations I know. Thanks!

  • @zagorot
    @zagorot Жыл бұрын

    Great video! I have to say thank you. This video is just what I needed: I had learned some basic ideas about word2vec, LSTMs, RNNs and the like, but I couldn't understand how the Transformer works and what the inputs and outputs are, and your video made all of that clear to me. Yes, some people dropped comments saying this video is "pointless" or something; I cannot agree with that, as different audiences have different backgrounds, so it is really hard to make something that works for everyone. Someone lacking basic ideas like word2vec (why we use input_ids) would not be able to understand this video, and conversely someone who is already very good at Transformers/Diffusion won't need to watch it! So how can they say that? This video taught me how the encoder and decoder work at every single step, very detailed, really appreciated!

  • @samilyalciner
    @samilyalciner Жыл бұрын

    Thanks Niels. Such a great explanation!

  • @HerrBundesweit
    @HerrBundesweit Жыл бұрын

    Very informative. Thanks Niels!

  • @mathlife5495
    @mathlife54959 ай бұрын

    Very nice lecture. It clarified so many concepts for me.

  • @trilovio
    @trilovio5 ай бұрын

    This explanation is gold! Thank you so much! 💯

  • @fabianaltendorfer11
    @fabianaltendorfer1110 ай бұрын

    Wonderful, thank you Niels!

  • @minhajulhoque2113
    @minhajulhoque2113 Жыл бұрын

    Great explanation video, really informative!

  • @user-kd2st5vc5t
    @user-kd2st5vc5tАй бұрын

    Thank you, this was explained very well. Before, I only had a rough understanding; now I'm much clearer on the details. Many thanks, love from China.

  • @kmsravindra
    @kmsravindra Жыл бұрын

    Thanks Niels. This is pretty useful

  • @sambitmukherjee1713
    @sambitmukherjee1713 Жыл бұрын

    Very clearly explained!

  • @nageswarsahoo1132
    @nageswarsahoo11329 ай бұрын

    Amazing video. Cleared up a lot of doubts. Thanks Niels.

  • @phucdoitoanable
    @phucdoitoanable10 ай бұрын

    Nice explanation! Thank you!

  • @sebastianconrady7696
    @sebastianconrady7696 Жыл бұрын

    Awesome! Great explanation

  • @imatrixx572
    @imatrixx5728 ай бұрын

    Thank you very much! Now I can say that I completely understand the Transformer!

  • @PravasMohanty
    @PravasMohanty8 ай бұрын

    Great tutorial!! It would be great if you made a video on personalizing GPT: how to keep the trained data and load it for Q&A. Any recommendations?

  • @muhammadramismajeedrajput5632
    @muhammadramismajeedrajput56322 ай бұрын

    Loved your explanation

  • @nizamphoenix
    @nizamphoenix8 ай бұрын

    One word, Perfect!

  • @junaidbutt3000
    @junaidbutt3000 Жыл бұрын

    Very clearly explained, Niels. I have a question about the decoder inputs. At training time, we added padding to the source and target sequences to make them a particular length. But at inference time at t=1, we only feed the start-of-sequence token to the decoder. Do we not require padding to make the sequence lengths consistent as well? It seems that at inference time we're feeding different sequence lengths to the decoder. Is this true, or is there implicit padding being applied here as well?

  • @user-sr8zf9ms3o
    @user-sr8zf9ms3o6 ай бұрын

    Great vid, thanks!

  • @mohammedal-hitawi4667
    @mohammedal-hitawi4667 Жыл бұрын

    Very nice work! Could you please show how to modify the decoder part of the TrOCR model, e.g. replacing the language model with GPT-2?

  • @user-cv2fh3sh9x
    @user-cv2fh3sh9x5 ай бұрын

    Very nice explanation. I'd request a video on how an LLM can be adapted via prompt engineering and fine-tuning, and on generating a new LLM, with a practical approach. ❤

  • @achyutanandasahoo4775
    @achyutanandasahoo47759 ай бұрын

    thank you. great explanation.

  • @omerali3320
    @omerali33202 ай бұрын

    I learned a lot thank you.

  • @sitrakaforler8696
    @sitrakaforler8696 Жыл бұрын

    Really great video! Thank you very much!

  • @dhirajkumarsahu999
    @dhirajkumarsahu9992 ай бұрын

    Thank you so Much!! Subscribed

  • @aspboss1973
    @aspboss197310 ай бұрын

    Nice explanation! I have these doubts:
    - During training, do we learn the Query, Key and Value matrices? In short, do we learn the final embeddings of the encoder through backpropagation?
    - During training, do we supply the encoder's final embeddings to the decoder one at a time? (Suppose we have 5 final encoder embeddings; for the first time step, do we supply only the first of the 5 embeddings to the decoder?)
    - How is this architecture used in a QA model? (I am confused!!!)

  • @FalguniDasShuvo
    @FalguniDasShuvo10 ай бұрын

    Awesome!🎉

  • @pulkitsingh2149
    @pulkitsingh21497 ай бұрын

    Hi Niels, great explanation. I just couldn't get my head around one point: at each time step we produce n vectors (the same number as the decoder inputs). Is it guaranteed that the vectors of previously predicted tokens won't change? What if a decoded token's vector changes as we include more tokens in the decoder input?

  • @giofou711
    @giofou7115 ай бұрын

    @NielsRogge thanks for the super clear and helpful video! It's really one of the cleanest and most concise presentations I've watched on this topic! 🙌 I had a question though: at 24:09, you are saying that *during inference*, in the *last hidden state of the decoder*, we get a hidden vector *for each of the decoder input ids*. In your example, after 6 time steps we have 6 decoder tokens: the decoder start token, salut, ..., mignon, which means the last hidden state (at time step t = 6) would produce a 6 x 768 matrix. Is that true though? I thought the last hidden state of the decoder produces the embedding of the *next token*, in other words a 1 x 768 vector, which is later passed through a `nn.Linear(768, 50000)` layer to give us the next decoder input id; that is, the 1 x 768 vector is passed to `nn.Linear(768, 50000)` and gives us a 1 x 50000 logit vector. But if what you say is true, then a 6 x 768 matrix is created at time step t = 6, and the end result after the last linear head would be a 6 x 50000 logit matrix. No?

  • @lucasbandeira5392
    @lucasbandeira5392Ай бұрын

    Thank you very much for the explanation, Niels. It was excellent. I have just one question regarding 'conditioning the decoder' during inference: how exactly does it work? Does it operate in the same way as during training, i.e. the encoder hidden states are projected into queries, keys, and values, and then the dot products between the decoder and encoder hidden states are computed to generate the new hidden states? It seems like a lot of computation to me, and this way the text generation process would be very slow, wouldn't it?

  • @atmismahir
    @atmismahir7 ай бұрын

    great content thank you very much for the detailed explanation :)

  • @botfactory1510
    @botfactory1510 Жыл бұрын

    Thanks Niels

  • @syerwinD
    @syerwinD11 ай бұрын

    Thank you

  • @yo-yoyo2303
    @yo-yoyo23037 ай бұрын

    This is sooooooo good

  • @mbrochh82
    @mbrochh82 Жыл бұрын

    Great video. The only thing that literally all videos on Transformers don't mention is: how and when does backpropagation happen? I understand how it works for a simple neural network with a hidden layer, where we use gradient descent to update all the weights... but in the Transformer architecture I find it hard to visualize which numbers get updated after we calculate the loss.

  • @jeffrey5602

    @jeffrey5602

    Жыл бұрын

    Yeah, conceptually at first maybe, but I would argue the transformations themselves are not more complicated than a normal NN for classification, because it's really doing just that: predicting the most probable token from the vocabulary. At least it's way easier than backprop for RNNs, LSTMs, etc. The Transformers book from HuggingFace has a great explanation of attention, which is really all you need to know to demystify the whole Transformer architecture. And attention is really just adding a few linear projections and doing a dot product.
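To illustrate the point of this thread: backpropagation in a Transformer is the same single loss.backward() call as in any other network; every weight (token embeddings, the Q/K/V projections inside attention, feed-forward layers, the language-modelling head) receives a gradient and is then updated by the optimizer. A minimal sketch, with t5-small as an illustrative checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

input_ids = tokenizer("translate English to French: My dog is cute", return_tensors="pt").input_ids
labels = tokenizer("Mon chien est mignon", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Forward pass: the model returns the cross-entropy loss when labels are given.
loss = model(input_ids=input_ids, labels=labels).loss

# Backward pass: gradients are computed for every trainable parameter,
# embeddings and attention projections included - exactly like a plain MLP.
loss.backward()
print(sum(p.grad is not None for p in model.parameters()), "parameter tensors received a gradient")

# Update step, then clear the gradients for the next batch.
optimizer.step()
optimizer.zero_grad()
```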

  • @leiyang2176
    @leiyang217611 ай бұрын

    That's a great video. I just have one question related to it: in translation, there can be multiple valid translations. In this example the English output could be 'Hello, my dog is cute' or 'Hi, my dog is a cute dog', etc. In a real translation product, would a metric like the BLEU score be used, and how would it be used to evaluate and improve product quality?

  • @NaveenRock1
    @NaveenRock1 Жыл бұрын

    Great work, thanks a lot for this video. I had a small doubt: during the Transformer inference you mentioned we stop generating the sequence when we reach the end-of-sequence token. But during training, in the decoder_input_ids, I noticed you didn't add the end-of-sequence token to the sentence; did I miss something here?

  • @NielsRogge

    @NielsRogge

    Жыл бұрын

    Hi, during training the end-of-sequence token is indeed added to the labels (and in turn, to the decoder input ids), I should have mentioned that!

  • @NaveenRock1

    @NaveenRock1

    Жыл бұрын

    @@NielsRogge Got it, thanks. I believe the end-of-sequence token will be added before the padding tokens? I.e. sentence tokens + end-of-sequence token + padding tokens to reach the fixed sequence length. Am I correct?

  • @NielsRogge

    @NielsRogge

    Жыл бұрын

    @@NaveenRock1 yes correct!

  • @NaveenRock1

    @NaveenRock1

    Жыл бұрын

    @@NielsRogge Awesome. Thank you. :)

  • @norman9174
    @norman9174 Жыл бұрын

    Sir, could you please provide the Excalidraw notes? Thanks for this amazing explanation.

  • @BB-uy4bb
    @BB-uy4bb Жыл бұрын

    In the explanation around 45:00, isn't there an end-of-sequence token missing in the labels, which the model should predict after the last label (231)?

  • @zbynekba
    @zbynekba9 ай бұрын

    Hi Niels, I greatly appreciate that you've taken the time to create a fantastic summary of training and inference from the user's perspective.
    Q1: during training, do you also include generation of the end-of-sentence token in the loss function? You haven't mentioned it, though IMHO a good model must detect the end of the translation.
    Q2: why do you need to introduce padding? Everything works perfectly with arbitrary lengths of input and output sentences, which is a true beauty. Why is it needed for batch training? Thank you.

  • @nouamaneelgueddari7518

    @nouamaneelgueddari7518

    8 ай бұрын

    He said in the video that padding is introduced because training is done in batches. The elements of a batch can have very different lengths; if we didn't use padding, we would have to dynamically allocate memory for every element in the batch, which is not very efficient for the computation.

  • @zbynekba

    @zbynekba

    8 ай бұрын

    @@nouamaneelgueddari7518 Makes sense to me. Thanks.
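For reference, this is what batch padding looks like in practice with a tokenizer (t5-small is just an illustrative checkpoint): shorter sequences are padded to the longest one in the batch, and the attention mask tells the model which positions are real tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

batch = ["My dog is cute", "Transformers are great for machine translation tasks"]
encoding = tokenizer(batch, padding=True, return_tensors="pt")

# Both rows now have the same length; padded positions get the pad token id
# and a 0 in the attention mask, so the model ignores them.
print(encoding["input_ids"])
print(encoding["attention_mask"])
```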

  • @Wlodixpro
    @Wlodixpro4 ай бұрын

    🎯 Key Takeaways for quick navigation:

    00:00 🧭 Overview of Transformer Model Functionality
    - Provides an overview of the Transformer model.
    - Discusses the distinction between using a Transformer during training versus inference.
    - Highlights the importance of understanding Transformer usage for tasks like text generation.

    02:05 🤖 Tokenization Process
    - Describes the tokenization process where input text is converted into tokens.
    - Explains the mapping of tokens to integer indices using the vocabulary.
    - Discusses the role of input IDs in feeding data to the model.

    06:06 📚 Vocabulary in Transformer Models
    - Explores the concept of vocabulary in Transformer models.
    - Illustrates how tokens are mapped to integer indices in the vocabulary.
    - Emphasizes the importance of the vocabulary in processing text inputs for Transformer models.

    07:44 🧠 Transformer Encoder Functionality
    - Details the process of the Transformer encoder, converting tokens into embedding vectors.
    - Explains how the encoder generates hidden representations of input tokens.
    - Highlights the role of embedding vectors in representing input sequences.

    10:45 🛠️ Transformer Decoder Operation at Inference
    - Demonstrates how the Transformer decoder operates during inference.
    - Discusses the generation process of new text using the decoder.
    - Describes the use of cached embedding vectors for generating subsequent tokens.

    23:04 🔄 Iterative Generation Process
    - Illustrates the iterative process of token generation by the Transformer decoder.
    - Explains how the decoder predicts subsequent tokens based on previous predictions.
    - Discusses the termination condition of the generation process upon predicting the end-of-sequence token.

    25:33 🧠 Illustrating the Inference Process with Transformers
    - At inference time, text generation with Transformer models occurs in a loop, generating one token at a time.
    - Transformer models like GPT use a generation loop, allowing for flexibility in text generation.
    - Different decoding strategies, such as greedy decoding and beam search, impact the text generation process.

    30:59 🛠️ Explaining Decoding Strategies for Transformers
    - Greedy decoding is a basic method where the token with the highest probability is chosen at each step.
    - Beam search is a more advanced decoding strategy that considers multiple potential sequences simultaneously.
    - Various decoding strategies, including beam search, are available in the `generate` method of Transformer libraries like Hugging Face's Transformers.

    31:13 🎓 Training Process of Transformer Models
    - During training, the model learns to generate text by minimizing a loss function based on input sequences and target labels.
    - Teacher forcing is used during training, where the model is provided with ground-truth tokens at each step.
    - The training process involves tokenizing input sequences, encoding them, and using labeled sequences to compute the loss via cross-entropy calculations.

    48:58 🤯 Understanding Causal Attention Masking in Transformers
    - Causal attention masking prevents the model from "cheating" by looking into the future during training.
    - At training time, the model predicts subsequent tokens based on the ground-truth sequence, with the help of the causal attention mask.
    - This mechanism ensures that the model generates text one step at a time during training, similar to the inference process.

    Made with HARPA AI
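As a companion to the decoding-strategy part of this summary, here is a small sketch of greedy decoding versus beam search via generate(); t5-small is just an illustrative checkpoint:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: My dog is cute", return_tensors="pt")

# Greedy decoding: pick the highest-probability token at every step.
greedy_ids = model.generate(**inputs, max_new_tokens=20)

# Beam search: keep the 4 most probable partial sequences at every step.
beam_ids = model.generate(**inputs, max_new_tokens=20, num_beams=4)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```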

  • @VaibhavPatil-rx7pc
    @VaibhavPatil-rx7pc Жыл бұрын

    NICE!!!!

  • @kaustuvray5066
    @kaustuvray50667 ай бұрын

    31:02 Training

  • @shaxy6689
    @shaxy66893 ай бұрын

    It was so helpful. Could you please share the drawing notes? Thank you!

  • @braunagn
    @braunagn7 ай бұрын

    Question on the tensor shapes of the Encoder that go into the Decoder during inference: If the Encoder output is of shape (1,6,768), during cross attention, how can this be combined with the Decoder's input which is only one token in length [e.g. Shape (1,1,768)]?

  • @bhujithmadav1481
    @bhujithmadav14812 ай бұрын

    Superb video. Just a doubt: @11:46 you mention that the decoder uses the embeddings from the encoder and the start-of-sequence token to generate the first output token. By embeddings, did you mean the key/value vectors from the last encoder stage? Also, if an encoder is being used to encode the input question, then why are GPT, Llama, etc. called decoder-only models? Thanks

  • @NielsRogge

    @NielsRogge

    2 ай бұрын

    Yes the embeddings from the encoder (after the last layer) are used as keys and values in the cross-attention operations of the decoder. The decoder inputs serve as queries. Decoder-only models like ChatGPT and Llama don't have an encoder. They directly feed the text to the decoder, and only use self-attention (with a causal mask to prevent future leakage).

  • @bhujithmadav1481

    @bhujithmadav1481

    2 ай бұрын

    @@NielsRogge Thanks for the quick reply. But my confusion is that when we ask a question to GPT or Llama like "what is a transformer?", as per all the sources, including this video, they mention that decoders start with the SOS or EOS token to generate the output. But from where does the decoder learn the context? Even in this video you use the encoder to encode the input question and then pass the encoded embeddings to the decoder, right?
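A small sketch of the cross-attention Niels describes above, with illustrative shapes (batch 1, 6 source tokens, a single decoder token so far, hidden size 768); this is a stripped-down single head, not the library's implementation:

```python
import torch
import torch.nn as nn

hidden = 768
q_proj = nn.Linear(hidden, hidden)   # applied to decoder states -> queries
k_proj = nn.Linear(hidden, hidden)   # applied to encoder states -> keys
v_proj = nn.Linear(hidden, hidden)   # applied to encoder states -> values

encoder_states = torch.randn(1, 6, hidden)   # one vector per source token
decoder_states = torch.randn(1, 1, hidden)   # only the decoder start token so far

q = q_proj(decoder_states)                   # (1, 1, 768)
k = k_proj(encoder_states)                   # (1, 6, 768)
v = v_proj(encoder_states)                   # (1, 6, 768)

# One row of attention weights over the 6 source tokens...
weights = (q @ k.transpose(-2, -1) / hidden ** 0.5).softmax(dim=-1)   # (1, 1, 6)
# ...and a weighted sum of the encoder values, back to one vector per decoder token.
context = weights @ v                                                  # (1, 1, 768)
print(context.shape)
```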

  • @sporrow
    @sporrow Жыл бұрын

    are attention vectors used during inference?

  • @robmarks6800
    @robmarks680011 ай бұрын

    Can you elaborate on why seemingly all new models are decoder-only, and are trained with the sole objective of next-token prediction? Does the enc-dec architecture of T5 have any advantages? And is there any reason to train in different ways than T5 does?

  • @NielsRogge

    @NielsRogge

    11 ай бұрын

    Hi, great question! Encoder-decoder architectures are typically good at tasks where the goal is to predict some output given a structured input, like machine translation or text-to-SQL. One first encodes the structured input, and then uses that as condition to the decoder using cross-attention. However, nowadays you can actually perfectly do these tasks with decoder-only models as well, like ChatGPT or LLaMa. The main disadvantage of encoder-decoders is that you need to recompute the keys/values at every time step, which is why all companies are using decoder-only at the moment (much faster at inference time)

  • @schwajj

    @schwajj

    11 ай бұрын

    Thanks so much for the video, and answering questions! Can you explain (or provide a pointer to a paper) how the key/values can be cached to avoid recomputation in a decoder-only transformer? Edit: I figured it out while re-watching the training part of your video, so you needn’t answer unless you think others would benefit (I wouldn’t be able to explain very well, I fear)

  • @robmarks6800

    @robmarks6800

    11 ай бұрын

    Don’t you have to recalculate in the decoder-only architecture aswell? Or is this where the non-default KV-cache comes in?

  • @37-2ensoiree7
    @37-2ensoiree7 Жыл бұрын

    The softmax is missing during training; it's mandatory for calculating the cross-entropy loss. An unrelated question: am I understanding right that there is thus a maximum length for all these sentences, like 512 tokens? Isn't that an issue?

  • @navdeep8697

    @navdeep8697

    Жыл бұрын

    I think the cross-entropy loss in PyTorch (at least!) applies the softmax internally. Yes, the token limit is a sort of limitation because of how the encoder and decoder work internally, but it can be handled when building the dataset pipeline for training and inference.
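That's right for PyTorch's nn.CrossEntropyLoss: it applies log-softmax internally and skips any position whose label equals the ignore_index (-100 by default), which is exactly why padded label positions are set to -100. A minimal sketch with made-up numbers:

```python
import torch
import torch.nn as nn

vocab_size = 50000
logits = torch.randn(1, 4, vocab_size)        # raw decoder outputs for 4 positions, no softmax applied
labels = torch.tensor([[231, 87, 1, -100]])   # -100 marks a padded position

# CrossEntropyLoss applies log-softmax internally and ignores positions
# whose label equals ignore_index (-100 is the default value).
loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```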

  • @DmitryPesegov
    @DmitryPesegov7 ай бұрын

    What is the shape of the target tensor in training phase? (batch_size, maximum_supported_sequence_len_by_model, 50000) ? ( PLEASE answer anybody )

  • @arjunwankhede3706
    @arjunwankhede37063 ай бұрын

    Can you share the Excalidraw explanation link here?

  • @andygrouwstra1384
    @andygrouwstra1384 Жыл бұрын

    Hi Niels, you describe a lot of steps that are taken, but don't really explain why they are taken. It becomes a kind of magic formula. For example, you have a sentence and break it up in tokens. OK. But hang on, why break it up in tokens rather than in words? What's different? Then you look up the tokens in a dictionary to replace them by numbers. Is that because it is easier to deal with numbers than with words? Then you do "something" and each number turns into a vector of 768 numbers. What is it that you do there, and why? What is the information in the other 767 numbers and where does that information come from? What do you want it for? It would be nice if you could give the context, both the big picture and the details.

  • @NielsRogge

    @NielsRogge

    Жыл бұрын

    Yes good point! I indeed assume in the video that you take the architecture of the Transformer as is, without asking why it looks that way. Let me give you some pointers:
    - Subword tokens rather than words are used because papers prior to the Transformer paper showed that they improve performance on machine translation benchmarks, see e.g. arxiv.org/abs/1609.08144.
    - We deal with numbers rather than text since computers only work with numbers; we can't do linear algebra on text. Each token ID (an integer) is turned into a numerical representation, also called an embedding. Tokens that have a similar meaning (like "cat" and "dog") will be closer in the embedding space (when you project these embeddings into an n-dimensional space, with n = 768 for instance). The whole idea of creating embeddings for words or subword tokens comes from the Word2Vec paper: en.wikipedia.org/wiki/Word2vec.
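A minimal sketch of those two steps; t5-small is just an illustrative checkpoint, and the nn.Embedding here is randomly initialized purely to show the shapes (in a real model it is learned during training):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=768)

# Step 1: subword tokenization maps text to integer ids from the vocabulary.
input_ids = tokenizer("My dog is cute", return_tensors="pt").input_ids
print(input_ids)

# Step 2: an embedding lookup turns each id into a 768-dimensional vector.
print(embedding(input_ids).shape)   # (1, number_of_tokens, 768)
```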

  • @EkShunya

    @EkShunya

    Жыл бұрын

    I like the video. Crisp and concise. Keep it up!

  • @dhirajkumarsahu999
    @dhirajkumarsahu9992 ай бұрын

    One doubt please: does ChatGPT (a decoder-only model) also use the teacher forcing technique during training?

  • @NielsRogge

    @NielsRogge

    2 ай бұрын

    Yes it does!

  • @dhirajkumarsahu999

    @dhirajkumarsahu999

    2 ай бұрын

    @@NielsRogge Thanks a lot for your reply !!

  • @adrienforbu5165
    @adrienforbu5165 Жыл бұрын

    Perfect french :)

  • @SanKum7
    @SanKum72 ай бұрын

    Transformers are "complicated"? Not really, after this video. Thanks.

  • @frazuppi4897
    @frazuppi489710 ай бұрын

    heyy niels

  • @IevaSimas
    @IevaSimas2 ай бұрын

    Unless the token is predicted with 100% probability, you will still have a non-zero loss.

  • @acasualviewer5861
    @acasualviewer58616 ай бұрын

    It seems wasteful to run the entire decoder each time, since it will do computations for all 6 positions regardless. There seems to be an opportunity to optimize this by only using the relevant part of the decoder mask at each iteration.

  • @NielsRogge

    @NielsRogge

    5 ай бұрын

    Yes indeed! That's where the key-value cache comes in: huggingface.co/blog/optimize-llm#32-the-key-value-cache
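A quick way to see the cache in action (gpt2 is just an illustrative decoder-only checkpoint): with use_cache=True, generate() reuses the keys/values of tokens that were already processed instead of recomputing them at every step; disabling it gives the same output, only slower.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("My dog is", return_tensors="pt")

# Default: keys/values of previous tokens are cached and reused at each step.
with_cache = model.generate(**inputs, max_new_tokens=20, use_cache=True)

# Same result, but every step recomputes attention over the full prefix.
without_cache = model.generate(**inputs, max_new_tokens=20, use_cache=False)

print(tokenizer.decode(with_cache[0]))
```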

  • @isiisorisiaint
    @isiisorisiaint Жыл бұрын

    OK man, you tried, but honestly this is a totally pointless video. Someone who knows what the Transformer is about learns absolutely nothing except that -100 means 'ignore', and somebody who's still trying to wrap their head around the Transformer won't understand a single piece of what you kept typing in there. There you go, it's not just a thumbs-down from me, I also took a couple of minutes to write this reply. Just try to define what the target audience of this video is, and you'll instantly see how meaningless it is.

  • @navdeep8697

    @navdeep8697

    Жыл бұрын

    I agree a little... this is good for an audience that's especially interested in using the HuggingFace library, but not for understanding the Transformer and attention in a generic way!
