Lecture 12.1 Self-attention

ERRATA:
- In slide 23, the indices are incorrect. The index of the key and value should match (j), and the index of the query should be different (i).
- In slide 25, the diagram illustrating how multi-head self-attention is computed departs slightly from how it's usually done (the implementation in the subsequent slide is correct, but the two are not quite functionally equivalent). See the slides PDF below for an updated diagram.
In this video, we discuss the self-attention mechanism: a very simple and powerful sequence-to-sequence layer that is at the heart of transformer architectures.
annotated slides: dlvu.github.io/sa
Lecturer: Peter Bloem
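
For reference, a minimal sketch of the basic, parameter-free self-attention operation discussed in the video; the tensor sizes and names below are illustrative, not taken from the lecture code.

```python
import torch
import torch.nn.functional as F

# Basic (parameter-free) self-attention over one sequence:
# x has shape (t, k): t tokens, each a k-dimensional vector.
t, k = 4, 8
x = torch.randn(t, k)

raw_weights = x @ x.t()                    # (t, t): dot product between every pair of inputs
weights = F.softmax(raw_weights, dim=-1)   # rows sum to one
y = weights @ x                            # (t, k): each output is a weighted sum of all inputs
print(y.shape)                             # torch.Size([4, 8])
```

Every output vector is a weighted average over all inputs, with the weights derived from dot products between the inputs themselves.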

Comments: 97

  • @derkontrolleur1904
    3 years ago

    Finally an actual _explanation_ of self-attention, particularly of the key, value and query that was bugging me a lot. Thanks so much!

  • @Epistemophilos
    2 years ago

    Exactly! Thanks Mr. Bloem!

  • @rekarrkr5109
    a year ago

    OMG, me too! I was thinking of relational databases because they were saying "database" and it wasn't making any sense.

  • @constantinfry3087
    3 years ago

    Wow - only 700 views for probably the best explanation of Transformers I came across so far! Really nice work! Keep it up!!! (FYI: I also read the blog post)

  • @ArashKhoeini
    a year ago

    This is the best explanation of self-attention I have ever seen! Thank you VERY MUCH!

  • @Mars.2024
    7 days ago

    Finally I have an intuitive view of self-attention. Thank you 😇

  • @MrOntologue
    7 months ago

    Google should rank videos according to likes and the number of previously viewed videos on the same topics: this should go straight to the top for Attention/Transformer searches. I have seen and read plenty, and this is the first time the QKV-as-dictionary vs. RDBMS comparison made sense; that confusion had been so bad that it literally stopped my thinking every time I had to consider Q, or K, or V, and thus prevented me from grokking the big idea. I now want to watch/read everything by you.

  • @sohaibzahid1188
    2 years ago

    A very clear and broken down explanation of self-attention. Definitely deserves much more recognition.

  • @dhruvjain4372
    3 years ago

    Best explanation out there, highly recommended. Thank you!

  • @nengyunzhang6341
    2 years ago

    Thank you! This is the best introductory video to self-attention!

  • @thcheung
    2 years ago

    The best ever video showing how self-attention works.

  • @juliogodel
    3 years ago

    This is a spectacular explanation of transformers. Thank you very much!

  • @josemariabraga3380
    2 years ago

    This is the best explanation of multi-head self attention I've seen.

  • @muhammadumerjavaid6663
    3 years ago

    Thanks, man! You packed some really complex concepts into a very short video. Going to watch more of the material that you are producing.

  • @maxcrous
    3 years ago

    Read the blog post and then found this presentation, what a gift!

  • @farzinhaddadpour7192
    a year ago

    I think one of the best videos describing self-attention. Thank you for sharing.

  • @sathyanarayanankulasekaran5928
    2 years ago

    I have gone through 10+ videos on this, but this is the best ...hats off

  • @Ariel-px7hz
    a year ago

    This is a really excellent video. I was finding this a very confusing topic but I found it really clarifying.

  • @MonicaRotulo
    2 years ago

    The best explanation of transformers and self-attention! I am watching all of your videos :)

  • @szilike_10
    a year ago

    This is the kind of content that deserves the like, subscribe and share promotion. Thank you for your efforts, keep it up!

  • @xkalash1
    3 years ago

    I had to leave a comment, the best explanation of Query, Key, Value I have seen!

  • @huitangtt
    2 years ago

    Best transformer explanation so far !!!

  • @olileveque
    a month ago

    Absolutely amazing series of videos! Congrats!

  • @AlirezaAroundItaly
    a year ago

    Best explanation I found for self-attention and multi-head attention on the internet, thank you sir.

  • @RioDeDoro
    a year ago

    Great lecture! I really appreciated your presentation by starting with simple self-attention, very helpful.

  • @marcolehmann6477
    3 years ago

    Thank you for the video and the slides. Your explanations are very clear.

  • @davidadewoyin468
    a year ago

    This is the best explanation I have ever heard.

  • @HiHi-iu8gf
    11 months ago

    holy shit, been trying to wrap my head around self-attention for a while, but it all finally clicked together with this video. very well explained, very good video :)

  • @BenRush
    a year ago

    This really is a spectacular explanation.

  • @peregudovoleg
    a year ago

    Great video explanation, and there is also a good write-up of this for those interested. Thank you very much, professor.

  • @impracticaldev
    a year ago

    Thank you. This is as simple as it can get. Thanks a lot!!!

  • @workstuff6094
    2 years ago

    Literally the BEST explanation of attention and transformer EVER!! Agree with everyone else about why this is not ranked higher :( I'm just glad I found it !

  • @maganaluis92
    a year ago

    This is a great explanation. I have to admit I read your blog thinking the video was just a summary of it, but it's much better than expected. I would appreciate it if you could create lectures in the future on how transformers are used for image recognition; I suspect we are just getting started with self-attention and we'll start seeing more of it in CV.

  • @fredoliveira7569
    8 months ago

    Best explanation ever! Congratulations and thank you!

  • @shandou5276
    3 years ago

    This is incredible and deserves a lot more views! (Glad YouTube ranked it high for me to discover it :))

  • @mohammadyahya78
    a year ago

    Fantastic explanation for self-attention

  • @rahulmodak6318
    3 years ago

    Finally found the Best explanation TY.

  • @MrFunlive
    3 years ago

    such a great explanation with examples :-) one has to love it. thank you

  • @zadidhasan4698
    9 months ago

    You are a great teacher.

  • @imanmossavat9383
    2 years ago

    This is a very clear explanation. Why does YouTube not recommend it?!

  • @martian.07_
    2 years ago

    Take my money, you deserve everything, greatest of all time. GOT

  • @junliu7398
    a year ago

    Very good course which is easy to understand!

  • @senthil2sg
    11 months ago

    Better than the Karpathy explainer video. Enough said!

  • @clapathy
    2 years ago

    Thanks for such a nice explanation!

  • @deepu1234able
    3 years ago

    best explanation ever!

  • @user-fd5px2cw9v
    6 months ago

    Thanks for your sharing! Nice and clear video!

  • @Raven-bi3xn
    a year ago

    This is the best video I've seen on attention models. The only thing is that I don't think the explanation of the multi-head part at minute 19 is accurate. What multi-head does is not treating the words "too" and "terrible" differently from the word "restaurant". What it does is that, instead of using the same weight for all elements of the embedding vector, as shown at 5:30, it calculates 2 weights, one for each half of the embedding vector. So, in other words, we break down the embedding vectors of the input words into small pieces and do self-attention on ALL embedding sub-vectors, as opposed to doing self-attention for the embeddings of "too" and "terrible" differently from that of "restaurant".

  • @bello3137
    a year ago

    Very nice explanation of self-attention.

  • @LukeZhang1
    3 years ago

    Thanks for the video! It was super helpful

  • @laveenabachani
    2 years ago

    Thank you so much! This was amazing! Keep it up! This video is so underrated. I will share. :)

  • @darkfuji196
    2 years ago

    This is a great explanation, thanks so much! I got really sick of explanations just skipping over most of the details.

  • @HeLLSp4Wn123
    2 years ago

    Thanks, found this very useful !!

  • @randomdudepostingstuff9696
    a year ago

    Excellent, excellent, excellent!

  • @Markste-in
    2 years ago

    Best explanation I have seen so far on the topic! One of the few that describe the underlying math and not just show a simple flowchart. The only thing that confuses me: at 6:24 you say W = X_T*X, but on your website you show a PyTorch implementation with W = X*X_T. Depending on which you use, you get either a [k x k] or a [t x t] matrix?
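
For what it's worth, the apparent discrepancy usually comes down to whether the tokens are stored as the rows or the columns of X. A small sketch (shapes chosen arbitrarily) of why the attention matrix should be the (t, t) product:

```python
import torch

# With tokens as the rows of X, X has shape (t, k).
t, k = 4, 8
X = torch.randn(t, k)

W_tt = X @ X.t()   # (t, t): one raw weight per (output position, input position) pair
W_kk = X.t() @ X   # (k, k): correlations between embedding dimensions, not positions

print(W_tt.shape, W_kk.shape)   # torch.Size([4, 4]) torch.Size([8, 8])
# Self-attention needs the (t, t) matrix; whether that is written X X^T or X^T X
# just depends on whether tokens are stored as rows or as columns of X.
```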

  • @somewisealien
    2 years ago

    VU Master's student here revisiting this lecture to help for my thesis. Super easy to get back into after a few months away from the concept. I did deep learning last December and I have to say it's my favourite course of the entire degree, mostly due to the clear and concise explanations given by the lecturers. I have one question though, I'm confused as to how simple self-attention would learn since it essentially doesn't use any parameters? I feel I'm missing something here. Thanks!

  • @manojkumarthangaraj2122
    2 years ago

    I know this is the best explanation of transformers I've come across so far. Still, I'm having a problem understanding the key, query and value part. Is there any recommendation where I can learn it completely from the basics? Thanks in advance.

  • @jiaqint961
    a year ago

    This is gold.

  • @ChrisHalden007
    11 months ago

    Great video. Thanks

  • @adrielcabral6634
    9 months ago

    I loved your explanation!!!

  • @TheCrmagic
    2 years ago

    You are a God.

  • @stephaniemartinez1294
    a year ago

    Good sir, I thank ye for this educational video with nice visuals

  • @aiapplicationswithshailey3600
    a year ago

    So far the best video describing this topic. The only question I have is how we get around the fact that a word will have the highest self-attention with itself. You said you would clarify this, but I could not find that point.

  • @WahranRai
    2 years ago

    9:55 If the sequence gets longer, the weights become smaller (softmax with many components): is it better to have shorter sequences?

  • @wolfisraging
    2 years ago

    Amazing video!!

  • @kafaayari
    2 years ago

    Wow, this is unique.

  • @abhishektyagi154
    2 years ago

    Thank you very much

  • @karimedx
    2 years ago

    Man, I was looking for this for a long time, thank you very much for this explanation, yep it's the best. Btw, YouTube recommended this video; I guess this is the power of self-attention in recommender systems.

  • @WM_1310
    a year ago

    Man, if only I had found this video early on during my academic project, I would've probably been able to do a whole lot better. Shame it's already about to end.

  • @user-oq1rb8vb7y
    10 months ago

    Thanks for the great explanation! Just one question: if simple self-attention has no parameters, how can we expect it to learn? It is not trainable.

  • @saurabhmahra4084
    8 months ago

    Watching this video feels like trying to decipher alien scriptures with a blindfold on.

  • @geoffreysworkaccount160
    2 years ago

    This video was raaaaad THANK YOU

  • @VadimSchulz
    3 years ago

    thank you so much

  • @ax5344
    3 years ago

    I love it when you talk about the different ways of implementing multi-head attention; there are so many tutorials just glossing over it or taking it for granted, but I wanted to know more details @ 20:30. I came here because your article discussed it, but I did not feel I had a very clear picture. Here, with the video, I still feel unclear. Which one was implemented in the Transformer and which one for BERT? Supposing they cut the original input vector matrix into 8 or 12 chunks, why did I not see in their code the start of each section? I only saw a line dividing the input dimension by the number of heads. That's all. How would the attention heads know which input vector indices they need to work on? Somehow I feel the heads need to know the starting index...

  • @dlvu6202
    3 years ago

    Thanks for your kind words! In the slide you point to, the bottom version is used in every implementation I've seen. The way this "cutting up" is usually done is with a view operation. If I take a vector x of length 128 and do x.view(8, 16), I get a matrix with 8 rows and 16 columns, which I can then interpret as the 8 vectors of length 16 that will go into the 8 different heads. Here is that view() operation in the Huggingface GPT2 implementation: github.com/huggingface/transformers/blob/8719afa1adc004011a34a34e374b819b5963f23b/src/transformers/models/gpt2/modeling_gpt2.py#L208
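
A small sketch of the view-based split described in that reply; the sizes match the reply's example, and the batched variant at the end is an illustrative assumption about typical tensor shapes.

```python
import torch

# One 128-dimensional vector split into 8 head-vectors of length 16 with a view,
# as in the reply above.
x = torch.randn(128)
heads = x.view(8, 16)    # 8 rows, one per head, each of length 16
print(heads.shape)       # torch.Size([8, 16])

# For a batch of sequences the same trick is usually (b, t, k) -> (b, t, h, k // h):
b, t, k, h = 2, 5, 128, 8
X = torch.randn(b, t, k)
X_heads = X.view(b, t, h, k // h)
print(X_heads.shape)     # torch.Size([2, 5, 8, 16])
```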

  • @soumilbinhani8803
    8 months ago

    Hello, can someone explain this to me: won't the key and the values be the same for each iteration, comparing to 5:29? Please help me with this.

  • @turingmachine4122
    3 years ago

    Hi, thank you for this nice explanation. However, there is one thing that I don't get. How can the self-attention model, for instance in the sentence "John likes his new shoes", compute a high weight between "his" and "John"? I mean, we know that they are related, but the embeddings for these words can be very different. Hope you can help me out :)

  • @recessiv3
    2 years ago

    Great video, I just have a question: When we compute the weights that are then multiplied by the value, are these vectors or just a single scalar value? I know we used the dot product to get w so it should be just a single scalar value, but just wanted to confirm. As an example, at 5:33 are the values for w a single value or vectors?

  • @TubeConscious
    2 years ago

    Yes, it is a single scalar: the result of the dot product, further normalized by the softmax so that the sum of all weights equals one.
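
A tiny sketch of that point, with arbitrary sizes: for one query position i, the attention weights are t scalars that sum to one.

```python
import torch
import torch.nn.functional as F

# For one query position i, each raw weight is a scalar dot product;
# the softmax over the t positions makes the weights sum to one.
t, k = 4, 8
x = torch.randn(t, k)

i = 0
w_i = F.softmax(x[i] @ x.t(), dim=-1)   # shape (t,): one scalar weight per input position
print(w_i.shape, float(w_i.sum()))      # torch.Size([4]) ~1.0
y_i = w_i @ x                           # the i-th output: a weighted sum of the inputs
```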

  • @mohammadyahya78
    a year ago

    Question: at 8:46, may I ask why, since Y is defined as a multiplication, it is purely linear and thus has a non-vanishing gradient (the gradient of a linear operation), while W = softmax(XX^T) is non-linear and thus can cause vanishing gradients? Second, what is the relationship between linearity/non-linearity and vanishing/non-vanishing gradients?

  • @balasubramanyamevani7752
    a year ago

    The presentation on self-attention was very well put together. Thank you for uploading this. I had a doubt @15:56 about how it would suffer from vanishing gradients without the normalization: as the dimensionality increases, the overall dot product gets larger. Wouldn't this be a case of exploding gradients? I'd really love some insight on this. EDIT: Listened more carefully again. The vanishing gradient is in the "softmax" operation. Got it now. Great video 🙂
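
A short sketch of the saturation issue discussed above; the 1/sqrt(k) factor is the standard scaled dot-product normalization, and the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# In high dimensions the raw dot products grow with sqrt(k), so the softmax saturates:
# one weight goes to ~1, the others to ~0, and their gradients become tiny.
k = 512
q = torch.randn(k)
keys = torch.randn(4, k)

raw = keys @ q               # large-magnitude logits
scaled = raw / k ** 0.5      # scaled dot-product attention divides by sqrt(k)

print(F.softmax(raw, dim=-1))     # typically close to a one-hot vector
print(F.softmax(scaled, dim=-1))  # a much smoother distribution
```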

  • @abdot604
    2 years ago

    Fine, I will subscribe.

  • @ecehatipoglu209
    a year ago

    Hi, extremely helpful video, I really appreciate it, but I have a question: I don't understand how multi-head self-attention works if we are not generating extra parameters for each stack of self-attention layers. What is the difference in each stack, so that we can grasp the different relations of the same word in each layer?

  • @ecehatipoglu209
    a year ago

    Yeah, after 9 days and re-watching this video, I think I grasped why we are not using extra parameters. Let's say you have an embedding dimension of 768 and you want to make 3 attention heads, meaning you somehow divide the 768 vector so you have a 256x1 vector for each attention head. (This splitting is actually a linear transformation, so there are no weights to be learned here, right?) After that, for each of these 3 attention heads we have 3 parameter matrices [K, Q, V] (superscripted for each attention head). For each attention head our K will have dimension 256 x whatever, Q will have dimension 256 x whatever and V will have dimension 256 x whatever. And this is for one head. Concatenating all the learned K, Q and V outputs ends up at 768 x whatever for each of them, the exact size we would have with single-head attention. Voila.
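
A sketch of the "cut up" multi-head variant described in that comment, with the dimensions the comment uses (768 split into 3 heads of 256); the scaling factor and the absence of biases are illustrative choices, not taken from the lecture code.

```python
import torch
import torch.nn as nn

# "Cut up" multi-head self-attention: the 768-dim embedding is split into
# 3 chunks of 256, and each head gets its own 256x256 Q/K/V projections.
emb, heads = 768, 3
head_dim = emb // heads   # 256

to_q = nn.ModuleList([nn.Linear(head_dim, head_dim, bias=False) for _ in range(heads)])
to_k = nn.ModuleList([nn.Linear(head_dim, head_dim, bias=False) for _ in range(heads)])
to_v = nn.ModuleList([nn.Linear(head_dim, head_dim, bias=False) for _ in range(heads)])

t = 5
x = torch.randn(t, emb)
chunks = x.view(t, heads, head_dim)   # each token vector cut into 3 sub-vectors

outputs = []
for h in range(heads):
    xh = chunks[:, h, :]                                     # (t, 256) input for this head
    q, k, v = to_q[h](xh), to_k[h](xh), to_v[h](xh)
    w = torch.softmax(q @ k.t() / head_dim ** 0.5, dim=-1)   # (t, t) attention weights
    outputs.append(w @ v)                                    # (t, 256) head output

y = torch.cat(outputs, dim=-1)   # (t, 768): concatenated heads, same size as the input
print(y.shape)
```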

  • @jrlandau
    a year ago

    At 16:43, why is d['b'] = 3 rather than 2?

  • @dlvu6202
    a year ago

    This was a mistake, apologies. We'll fix this in the slides.

  • @donnap6253
    3 years ago

    On page 23, should it not be ki qj rather than kj qi?

  • @superhorstful
    3 years ago

    I totally agree with your opinion.

  • @dlvu6202
    3 years ago

    You're right. Thanks for the pointer. We'll fix this in any future versions of the video.

  • @joehsiao6224
    2 years ago

    @@dlvu6202 Why the change? I think we are querying with current ith input against every other jth input, and the figure looks right to me.

  • @dlvu6202
    2 years ago

    @@joehsiao6224 It's semantics really. Since the key and query are derived from the same vector it's up to you which you call the key and which the query, so the figure is fine in the sense that it would technically work without problems. However, given the analogy with the discrete key-value store, it makes most sense to say that the key and value come from the same input vector (i.e. have the same index) and that the query comes from a (potentially) different input.
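
In code form, the convention described in that reply looks roughly like this (names and sizes are illustrative): the key and value are taken from the same input x[j], while the query comes from x[i].

```python
import torch

# Index convention: the key and value share the index j of the input they are
# derived from; the query carries the index i of the output being computed.
k_dim, t = 8, 4
W_q, W_k, W_v = torch.randn(k_dim, k_dim), torch.randn(k_dim, k_dim), torch.randn(k_dim, k_dim)
x = torch.randn(t, k_dim)

i, j = 0, 2
q_i = W_q @ x[i]     # query from input i
k_j = W_k @ x[j]     # key from input j ...
v_j = W_v @ x[j]     # ... and value from the same input j
w_ij = q_i @ k_j     # raw attention weight of output i on input j (a scalar)

# Output i is the softmax-weighted sum over all inputs j:
w_i = torch.softmax(torch.stack([q_i @ (W_k @ x[jj]) for jj in range(t)]), dim=0)
y_i = sum(w_i[jj] * (W_v @ x[jj]) for jj in range(t))
```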

  • @joehsiao6224
    2 years ago

    @@dlvu6202 it makes sense. Thanks for the reply!

  • @cicik57
    a year ago

    How does self-attention make sense on word embeddings, where each word is represented by a random vector, so that this self-correlation seems to carry no meaning?

  • @edphi
    a year ago

    Everything was clear until the query, key and value... Does anyone have a slower video or resource for understanding it?

  • @vukrosic6180
    a year ago

    I finally understand it jesus christ

  • @Isomorphist
    10 months ago

    Is this ASMR?

  • @tizianofranza2956
    3 years ago

    Saved lots of hours with this simple but awesome explanation of self-attention, thanks a lot!