I learned a lot as an Azerbaijani student. Thanks a lot <3
@ramilsabirov6591 5 days ago
Really great explanations. I also really like your calm way of explaining things. I get the feeling that you distill everything important before recording the video. Keep up the great work!
@kamperh 4 days ago
Thanks a ton for this!! I enjoy making the videos, but it definitely takes a bit of time :)
@liyingyeo5920 6 days ago
Thank you
@rahilnecefov2018 7 days ago
bro just keep teaching, that is great!
@josephengelmeier9856 11 days ago
These videos are sorely underrated. Your explanations are concise and clear, thank you for making this topic so easy to understand and implement. Cheers from Pittsburgh.
@kamperh 10 days ago
Thanks so much for the massive encouragement!!
@Aruuuq 13 days ago
Working in NLP myself, I very much enjoy your videos as a refresher on current developments. Continuing from your epilogue: will you cover the DPO process in detail?
@kamperh 12 days ago
Thanks for the encouragement @Aruuuq! Yep, I still have one more video in this series to make (hopefully next week). It won't explain every little detail of the RL part, but hopefully the big stuff.
@OussemaGuerriche 21 days ago
Your way of explaining things is very good
@shylilak 21 days ago
Thomas 🤣
@MuhammadSqlain 25 days ago
good sir
@TechRevolutionNow 25 days ago
Thank you very much, professor.
@ozysjahputera7669 27 days ago
One of the best explanations on PCA relationship with SVD!
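A minimal numpy sketch of the PCA/SVD relationship the comment refers to (standard linear algebra, not code from the video): for mean-centred data with SVD Xc = U S Vᵀ, the rows of Vᵀ are the principal directions and U S are the PCA scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy data: 100 samples, 5 features

Xc = X - X.mean(axis=0)              # PCA needs mean-centred data

# SVD: Xc = U @ diag(S) @ Vt. The rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                   # projections onto the principal components
explained_var = S**2 / (len(X) - 1)  # eigenvalues of the covariance matrix

print(np.allclose(scores, U * S))    # True: the scores are just U scaled by S
```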
@martinpareegol5263 a month ago
Why is it preferred to frame the problem as minimizing the cross entropy rather than minimizing the NLL? Are there properties that make that more efficient?
@chetterhummin1482 a month ago
Thank you, really great explanation, I think I can understand it now.
@zephyrus1333 a month ago
Thanks for the lecture.
@adosar7261 a month ago
With regards to the clock analogy (0:48): "If you know where you are on the clock then you will know where you are in the input". Why not just a single clock with very small frequency? A very small frequency will guarantee that even for large sentences there will be no "overlap" at the same position in the clock for different positions in the input.
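On the single-clock question, one common answer (an editorial note, not from the video): a single very slow clock would indeed avoid overlap, but adjacent positions would then differ by only a tiny amount, which is hard for the network to exploit. Stacking fast and slow clocks, like second, minute and hour hands, makes positions distinguishable at every scale. A minimal numpy sketch of the standard sinusoidal encoding from "Attention Is All You Need":

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    positions = np.arange(num_positions)[:, None]     # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # one "clock" per dim pair
    angles = positions / (10000 ** (dims / d_model))  # frequency falls with dim
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # low dims: fast clocks, resolve nearby positions
    pe[:, 1::2] = np.cos(angles)  # high dims: slow clocks, resolve distant ones
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d_model=128)
```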
@ex-pwian1190 a month ago
The best explanation!
@frogvonneumann9761 a month ago
Great explanation!! Thank you so much for uploading!
@Le_Parrikar a month ago
Great video. That meow from the cat though
@kobi981 a month ago
Thanks! Great video
@harshadsaykhedkar1515 2 months ago
This is one of the better explanations of how the heck we go from maximum likelihood to using NLL loss to log of softmax. Thanks!
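To verify the chain this comment describes (and the cross-entropy versus NLL question above): with hard one-hot labels, minimizing the cross entropy and minimizing the NLL of the softmax outputs are the same computation. A minimal PyTorch sketch, assuming nothing beyond the standard torch.nn.functional API:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)           # batch of 4 examples, 10 classes
targets = torch.tensor([3, 0, 9, 2])  # correct class indices

# Cross entropy applied directly to the logits...
ce = F.cross_entropy(logits, targets)

# ...equals the NLL of the log-softmax: -log p(correct class), averaged.
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(torch.allclose(ce, nll))  # True
```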
@shahulrahman2516 2 months ago
Great Explanation
@shahulrahman2516 2 months ago
Thank you
@yaghiyahbrenner8902 2 months ago
Sticking to a simple Git workflow is beneficial, particularly using feature branches. However, adopting a 'Gitflow' working model should be avoided as it can become a cargo cult practice within an organization or team. As you mentioned, the author of this model has reconsidered its effectiveness. Gitflow can be cognitively taxing, promote silos, and delay merge conflicts until the end of sprint work cycles. Instead, using a trunk-based development approach is preferable. While this method requires more frequent pulls and daily merging, it ensures that everyone stays up-to-date with the main branch.
@kamperh 2 months ago
Thanks a ton for this, very useful. I think we ended up doing this type of model anyway. But good to know the actual words to use to describe it!
@basiaostaszewska7775 2 months ago
A very clear explanation, thank you very much!
@bleusorcoc1080 2 months ago
Does this algorithm work with negative instances? I mean, can I use vectors with both negative and positive values?
@kundanyalangi2922 2 months ago
Good explanation. Thank you Herman
@niklasfischer3146 2 months ago
Hello Herman, first of all a very informative video! I have a question: How are the weight matrices defined? Are the matrices simply randomized in each layer? Do you have any literature on this? Thank you very much!
@kamperh 2 months ago
This is a good question! These matrices will start out being randomly initialised, but then -- crucially -- they will be updated through gradient descent. Stated informally, each parameter in each of the matrices will be wiggled so that the loss goes down. Hope that makes sense!
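To make the reply concrete, a minimal PyTorch sketch (an illustration of what the reply describes, not code from the videos): the weight matrix starts out random, and each gradient-descent step "wiggles" its entries so the loss goes down.

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 4, requires_grad=True)  # randomly initialised weights

x = torch.randn(2, 8)                      # a tiny batch of inputs
target = torch.randn(2, 4)                 # what the outputs should be

for step in range(3):
    loss = ((x @ W - target) ** 2).mean()  # squared-error loss
    loss.backward()                        # compute d(loss)/dW
    with torch.no_grad():
        W -= 0.1 * W.grad                  # nudge W against the gradient
        W.grad.zero_()
    print(step, loss.item())               # the loss decreases step by step
```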
@anthonytafoya3451 2 months ago
Great vid!
@electric_sand 2 months ago
6:23 Your face need not be excused :)
@kamperh 2 months ago
:)
@ChrisNorulak 2 months ago
Had to basically learn git in 10 minutes and cook it down to 5 minutes for a group project at school - glad to see something so visual and well explained (and code included!)
@kamperh 2 months ago
Wasn't sure this video was worth posting, so very happy this helped someone! :)
@delbarton314159 2 months ago
so in Q = XW, every single entry on the right side of this calculation needs to be learned?
@delbarton314159 2 months ago
Q, K and V are all populated with parameters all of which need to be learned?
@delbarton314159 2 months ago
D sub k is the dimensionality of the embeddings?
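A sketch that may help with the three questions above, assuming the standard transformer formulation (so treat the details as an assumption about what the video intends): only the weight matrices W_Q, W_K and W_V hold learned parameters; Q, K and V are computed from the input X rather than learned directly; and d_k is the dimensionality of the queries and keys (not necessarily of the input embeddings), which is why it appears in the scaling factor.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))  # input embeddings (given, not learned)

# Only these matrices contain learned parameters, updated by gradient descent.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each (seq_len, d_k), computed

# d_k (the query/key dimensionality) sets the scaling factor.
output = softmax(Q @ K.T / np.sqrt(d_k)) @ V
```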
@delbarton314159 2 months ago
also, at 10:36 you refer to a relevant google ai blog post but I also cannot find that reference in the notes below this video. Could you post?
@kamperh 2 months ago
Happy to help! On p. 4 of the notes, you can just click the link in blue.
@delbarton314159 2 months ago
at the very beginning of this video, you mention "watch the videos on RNNs". I have been unable to find them....
@darh78 2 months ago
What a great explanation of DTW!
@delbarton314159 2 months ago
great stuff! would have liked to see the RNN lectures as well, but they don't seem to be in your channel.
@kamperh 2 months ago
Really happy that the videos are helping! The RNN videos are the last videos on my list; they have been recorded, but I still need to edit them substantially. I need to have it released before the middle of July, in case that helps. Sorry for delays!
@vivi412a8nl 3 months ago
I have a question regarding the u and v vectors. If I understand correctly (hopefully), then a word will have 2 embeddings: one for when it is a center word (which is v), and one for when it is a context word (which is u)? If so, which embedding will be used to represent the word after we've trained the network? Let's say we initialize the matrices V and U at random; then we'd train the network to update both V and U? Then which matrix do we use for our embeddings? Sorry if the question doesn't make sense; I'm very new to NLP.
@kamperh 3 months ago
Have a look at my other videos in the playlist (kzread.info/head/PLmZlBIcArwhPN5aRBaB_yTA0Yz5RQe5A_). I believe it is answered in one of them. Hope that helps!
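For readers with the same question: a common convention in word2vec-style models (a general note, not necessarily what the videos prescribe) is to keep the center-word matrix V as the final embeddings, or to average the two matrices. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100

# Two embedding matrices, both randomly initialised and both updated in training.
V = rng.normal(size=(vocab_size, dim))  # center-word ("input") vectors
U = rng.normal(size=(vocab_size, dim))  # context-word ("output") vectors

# ... train, updating both V and U ...

# Common choices for the final word embeddings:
embeddings = V                # use the center-word vectors, or
embeddings = (V + U) / 2      # average the two sets
```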
@sw_2421 3 months ago
Thanks for the explanation
@guestvil 3 months ago
Thanks! Best explanation on this that I've seen so far, and I've seen a lot.
@equationalmc9862 3 months ago
I am learning and completely fascinated... but the cat interrupting was hilarious as well.
@richsajdak 3 months ago
Fantastic job! This is one of the best explanations of DTW I've seen
@adrianjohn8111 3 months ago
Wow. Thank you
@sauravgahlawat9077 3 months ago
GOATed explanation!
@delbarton314159 3 months ago
K is ~5,000? (stated around 6:00) I thought K was the number of "states", which, in turn, I thought were the POS tags. The number of parts of speech does not seem to be anywhere near 5,000. More like a handful... 7? 10? 20? What am I missing?
@manoharmishra8172 3 months ago
Thanks a ton HK, I followed this whole NLP series and it's truly great. Google and the references helped as well, and your explanations are fresh and easily graspable; the classroom talks were the best part. I did struggle with the HMM a bit, but eventually I got better there as well. Thanks for the great course. Any chance I could get a question paper or something to test myself on the course?
@delbarton314159 3 months ago
best explanation of positional encoding that I've seen. TY
@MarcoColangelo-mu6de 3 months ago
Thank you very much, I found your explanation one of the clearest ones on the web, very useful
@EzraSchroeder 3 months ago
4:49 if anyone asks what you're doing: watching cat videos on the Internet
@kamperh 3 months ago
🤣
@Charles_Reid 3 months ago
Thanks, this is a very helpful video. One question: in the video you mentioned that since probabilities are between 0 and 1 and sum to 1, you need to raise e to the power of each score and divide by the sum to obtain a probability. Is there a reason you choose e as the base of the exponent? Why not another number? My confusion is that if I chose a number like 10 as the base, I'm pretty sure my softmax model would classify everything the same as if I had chosen e, but the calculated probabilities would be different. I'm wondering if softmax is actually returning the real probability, or just a number between 0 and 1 that behaves like the real probability. Thanks!
@kamperh 3 months ago
This is a really good question that I hadn't thought about before. First, using base 10 will probably work fine because of all the reasons you say. If you were training a neural network, you could probably use any number and the network would just adjust the logits to do what it must do. I see there are some practical reasons to use e: forums.fast.ai/t/why-does-softmax-use-e/78118 And finally I want to ask tongue-in-cheek: What does it mean when you say "real probability"? : ) No one knows the real probability except the Creator, and all we're doing is trying to model it ;)
@Charles_Reid 3 months ago
@@kamperh Yeah, maybe the "real probability" can only be 0 or 1, as the data point either does belong to the class or does not. But we don't know which class it belongs to, so softmax gives us a probability that is different from the so-called "real probability" but that helps us make a guess. Thank you for your help!
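As a footnote to this thread, a small numpy sketch (not from the video) making the base question concrete: since 10^z = e^(z ln 10), a base-10 softmax is just the base-e softmax with rescaled logits, so the probabilities change but the ranking, and hence the classification, does not.

```python
import numpy as np

def softmax_base(z, base=np.e):
    p = base ** (z - z.max())  # subtract max for numerical stability
    return p / p.sum()

z = np.array([2.0, 1.0, 0.5])

p_e = softmax_base(z)            # standard softmax
p_10 = softmax_base(z, base=10)  # different probabilities, same ordering

# Base 10 is base e with the logits scaled by ln(10).
print(np.allclose(p_10, softmax_base(z * np.log(10))))  # True
print(p_e.argmax() == p_10.argmax())                    # True: same prediction
```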