I learned a lot as an Azerbaijani student. Thanks a lot <3
@ramilsabirov6591 5 days ago
Really great explanations. I also really like your calm way of explaining things. I get the feeling that you distill everything important before recording the video. Keep up the great work!
@kamperh 4 days ago
Thanks a ton for this!! I enjoy making the videos, but it definitely takes a bit of time :)
@liyingyeo5920 6 days ago
Thank you
@rahilnecefov2018 7 days ago
bro just keep teaching, that is great!
@josephengelmeier9856 11 days ago
These videos are sorely underrated. Your explanations are concise and clear, thank you for making this topic so easy to understand and implement. Cheers from Pittsburgh.
@kamperh 10 days ago
Thanks so much for the massive encouragement!!
@Aruuuq 13 days ago
Working in NLP myself, I very much enjoy your videos as a refresher on current developments. Continuing from your epilogue: will you cover the DPO process in detail?
@kamperh 12 days ago
Thanks for the encouragement @Aruuuq! Yep, I still have one more video in this series to make (hopefully next week). It won't explain every little detail of the RL part, but hopefully the big stuff.
@OussemaGuerriche 21 days ago
Your way of explaining things is very good
@shylilak 21 days ago
Thomas 🤣
@MuhammadSqlain 25 days ago
good sir
@TechRevolutionNow 25 days ago
Thank you very much, professor.
@ozysjahputera7669 27 days ago
One of the best explanations on PCA relationship with SVD!
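A minimal numpy sketch of the PCA/SVD relationship the comment refers to (standard linear algebra, not code from the video): for mean-centred data with SVD Xc = U S Vᵀ, the rows of Vᵀ are the principal directions and U S are the PCA scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy data: 100 samples, 5 features

Xc = X - X.mean(axis=0)              # PCA needs mean-centred data

# SVD: Xc = U @ diag(S) @ Vt. The rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                   # projections onto the principal components
explained_var = S**2 / (len(X) - 1)  # eigenvalues of the covariance matrix

print(np.allclose(scores, U * S))    # True: the scores are just U scaled by S
```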
@martinpareegol5263 a month ago
Why is it preferred to frame the problem as minimizing the cross entropy rather than minimizing the NLL? Are there properties that make that more efficient?
@chetterhummin1482 a month ago
Thank you, really great explanation, I think I can understand it now.
@zephyrus1333 a month ago
Thanks for the lecture.
@adosar7261 a month ago
With regards to the clock analogy (0:48): "If you know where you are on the clock then you will know where you are in the input". Why not just a single clock with very small frequency? A very small frequency will guarantee that even for large sentences there will be no "overlap" at the same position in the clock for different positions in the input.
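On the single-clock question, one common answer (an editorial note, not from the video): a single very slow clock would indeed avoid overlap, but adjacent positions would then differ by only a tiny amount, which is hard for the network to exploit. Stacking fast and slow clocks, like second, minute and hour hands, makes positions distinguishable at every scale. A minimal numpy sketch of the standard sinusoidal encoding from "Attention Is All You Need":

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    positions = np.arange(num_positions)[:, None]     # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # one "clock" per dim pair
    angles = positions / (10000 ** (dims / d_model))  # frequency falls with dim
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # low dims: fast clocks, resolve nearby positions
    pe[:, 1::2] = np.cos(angles)  # high dims: slow clocks, resolve distant ones
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d_model=128)
```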
@ex-pwian1190 a month ago
The best explanation!
@frogvonneumann9761 a month ago
Great explanation!! Thank you so much for uploading!
@Le_Parrikar a month ago
Great video. That meow from the cat though
@kobi981 a month ago
Thanks! Great video
@harshadsaykhedkar1515 2 months ago
This is one of the better explanations of how the heck we go from maximum likelihood to using NLL loss to log of softmax. Thanks!
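To verify the chain this comment describes (and the cross-entropy versus NLL question above): with hard one-hot labels, minimizing the cross entropy and minimizing the NLL of the softmax outputs are the same computation. A minimal PyTorch sketch, assuming nothing beyond the standard torch.nn.functional API:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)           # batch of 4 examples, 10 classes
targets = torch.tensor([3, 0, 9, 2])  # correct class indices

# Cross entropy applied directly to the logits...
ce = F.cross_entropy(logits, targets)

# ...equals the NLL of the log-softmax: -log p(correct class), averaged.
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(torch.allclose(ce, nll))  # True
```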
@shahulrahman2516 2 months ago
Great Explanation
@shahulrahman2516 2 months ago
Thank you
@yaghiyahbrenner8902 2 months ago
Sticking to a simple Git workflow is beneficial, particularly using feature branches. However, adopting a 'Gitflow' working model should be avoided as it can become a cargo cult practice within an organization or team. As you mentioned, the author of this model has reconsidered its effectiveness. Gitflow can be cognitively taxing, promote silos, and delay merge conflicts until the end of sprint work cycles. Instead, using a trunk-based development approach is preferable. While this method requires more frequent pulls and daily merging, it ensures that everyone stays up-to-date with the main branch.
@kamperh 2 months ago
Thanks a ton for this, very useful. I think we ended up doing this type of model anyway. But good to know the actual words to use to describe it!
@basiaostaszewska7775 2 months ago
A very clear explanation, thank you very much!
@bleusorcoc1080 2 months ago
Does this algorithm work with negative instances? I mean, can I use vectors with both negative and positive values?
@kundanyalangi2922 2 months ago
Good explanation. Thank you Herman
@niklasfischer3146 2 months ago
Hello Herman, first of all a very informative video! I have a question: How are the weight matrices defined? Are the matrices simply randomized in each layer? Do you have any literature on this? Thank you very much!
@kamperh 2 months ago
This is a good question! These matrices will start out being randomly initialised, but then -- crucially -- they will be updated through gradient descent. Stated informally, each parameter in each of the matrices will be wiggled so that the loss goes down. Hope that makes sense!
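To make the reply concrete, a minimal PyTorch sketch (an illustration of what the reply describes, not code from the videos): the weight matrix starts out random, and each gradient-descent step "wiggles" its entries so the loss goes down.

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 4, requires_grad=True)  # randomly initialised weights

x = torch.randn(2, 8)                      # a tiny batch of inputs
target = torch.randn(2, 4)                 # what the outputs should be

for step in range(3):
    loss = ((x @ W - target) ** 2).mean()  # squared-error loss
    loss.backward()                        # compute d(loss)/dW
    with torch.no_grad():
        W -= 0.1 * W.grad                  # nudge W against the gradient
        W.grad.zero_()
    print(step, loss.item())               # the loss decreases step by step
```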
@anthonytafoya3451 2 months ago
Great vid!
@electric_sand 2 months ago
6:23 Your face need not be excused :)
@kamperh 2 months ago
:)
@ChrisNorulak 2 months ago
Had to basically learn git in 10 minutes and cook it down to 5 minutes for a group project at school - glad to see something so visual and well explained (and code included!)
@kamperh 2 months ago
Wasn't sure this video was worth posting, so very happy this helped someone! :)
@delbarton314159 2 months ago
so in Q = XW, every single entry on the right side of this calculation needs to be learned?
@delbarton314159 2 months ago
Q, K and V are all populated with parameters all of which need to be learned?
@delbarton314159 2 months ago
D sub k is the dimensionality of the embeddings?
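A sketch that may help with the three questions above, assuming the standard transformer formulation (so treat the details as an assumption about what the video intends): only the weight matrices W_Q, W_K and W_V hold learned parameters; Q, K and V are computed from the input X rather than learned directly; and d_k is the dimensionality of the queries and keys (not necessarily of the input embeddings), which is why it appears in the scaling factor.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))  # input embeddings (given, not learned)

# Only these matrices contain learned parameters, updated by gradient descent.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each (seq_len, d_k), computed

# d_k (the query/key dimensionality) sets the scaling factor.
output = softmax(Q @ K.T / np.sqrt(d_k)) @ V
```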
@delbarton314159 2 months ago
also, at 10:36 you refer to a relevant google ai blog post but I also cannot find that reference in the notes below this video. Could you post?
@kamperh 2 months ago
Happy to help! On p. 4 of the notes, you can just click the link in blue.
@delbarton314159 2 months ago
at the very beginning of this video, you mention "watch the videos on RNNs". I have been unable to find them....
@darh78 2 months ago
What a great explanation of DTW!
@delbarton314159 2 months ago
great stuff! would have liked to see the RNN lectures as well, but they don't seem to be in your channel.
@kamperh 2 months ago
Really happy that the videos are helping! The RNN videos are the last videos on my list; they have been recorded, but I still need to edit them substantially. I need to have it released before the middle of July, in case that helps. Sorry for delays!
@vivi412a8nl 3 months ago
I have a question regarding the u and v vectors. If I understand correctly (hopefully), then a word will have 2 embeddings: one for when it is a center word (which is v), and one for when it is a context word (which is u)? If so, which embedding will be used to represent the word after we've trained the network? Let's say we initialize the matrices V and U at random; then we'd train the network to update both V and U? Then which matrix do we use for our embeddings? Sorry if the question doesn't make sense; I'm very new to NLP.
@kamperh 3 months ago
Have a look at my other videos in the playlist (kzread.info/head/PLmZlBIcArwhPN5aRBaB_yTA0Yz5RQe5A_). I believe it is answered in one of them. Hope that helps!
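For readers with the same question: a common convention in word2vec-style models (a general note, not necessarily what the videos prescribe) is to keep the center-word matrix V as the final embeddings, or to average the two matrices. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100

# Two embedding matrices, both randomly initialised and both updated in training.
V = rng.normal(size=(vocab_size, dim))  # center-word ("input") vectors
U = rng.normal(size=(vocab_size, dim))  # context-word ("output") vectors

# ... train, updating both V and U ...

# Common choices for the final word embeddings:
embeddings = V                # use the center-word vectors, or
embeddings = (V + U) / 2      # average the two sets
```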
@sw_2421 3 months ago
Thanks for the explanation
@guestvil 3 months ago
Thanks! Best explanation on this that I've seen so far, and I've seen a lot.
@equationalmc9862 3 months ago
I am learning and completely fascinated... but the cat interrupting was hilarious as well.
@richsajdak 3 months ago
Fantastic job! This is one of the best explanations of DTW I've seen
@adrianjohn8111 3 months ago
Wow. Thank you
@sauravgahlawat9077 3 months ago
GOATed explanation!
@delbarton314159 3 months ago
K is ~5,000? (stated around 6:00) I thought K was the number of "states", which, in turn, I thought were the POS tags. The number of parts of speech does not seem to be anywhere near 5,000. More like a handful... 7? 10? 20? What am I missing?
@manoharmishra8172 3 months ago
Thanks a ton HK, I followed this whole NLP series and it's truly great. Google and the references helped as well, and your explanations are fresh and easily graspable; the classroom talks were the best part. I did struggle with the HMM a bit, but eventually I got better there as well. Thanks for the great course. Any chance I could get a question paper or something to test myself on the course?
@delbarton314159 3 months ago
best explanation of positional encoding that I've seen. TY
@MarcoColangelo-mu6de 3 months ago
Thank you very much, I found your explanation one of the clearest ones on the web, very useful
@EzraSchroeder 3 months ago
4:49 if anyone asks what you're doing: watching cat videos on the Internet
@kamperh 3 months ago
🤣
@Charles_Reid 3 months ago
Thanks, this is a very helpful video. One question: in the video you mentioned that since probabilities are between 0 and 1 and sum to 1, you need to raise e to the power of each score and divide by the sum to obtain a probability. Is there a reason you choose e as the base of the exponent? Why not another number? My confusion is that if I chose a number like 10 as the base, I'm pretty sure my softmax model would classify everything the same as if I had chosen e, but the calculated probabilities would be different. I'm wondering if softmax is actually returning the real probability, or just a number between 0 and 1 that behaves like the real probability. Thanks!
@kamperh 3 months ago
This is a really good question that I hadn't thought about before. First, using base 10 will probably work fine because of all the reasons you say. If you were training a neural network, you could probably use any number and the network would just adjust the logits to do what it must do. I see there are some practical reasons to use e: forums.fast.ai/t/why-does-softmax-use-e/78118 And finally I want to ask tongue-in-cheek: What does it mean when you say "real probability"? : ) No one knows the real probability except the Creator, and all we're doing is trying to model it ;)
@Charles_Reid 3 months ago
@@kamperh Yeah, maybe the "real probability" can only be 0 or 1, as the data point either does belong to the class or does not. But we don't know which class it belongs to, so softmax gives us a probability that is different from the so-called "real probability" but that helps us make a guess. Thank you for your help!
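As a footnote to this thread, a small numpy sketch (not from the video) making the base question concrete: since 10^z = e^(z ln 10), a base-10 softmax is just the base-e softmax with rescaled logits, so the probabilities change but the ranking, and hence the classification, does not.

```python
import numpy as np

def softmax_base(z, base=np.e):
    p = base ** (z - z.max())  # subtract max for numerical stability
    return p / p.sum()

z = np.array([2.0, 1.0, 0.5])

p_e = softmax_base(z)            # standard softmax
p_10 = softmax_base(z, base=10)  # different probabilities, same ordering

# Base 10 is base e with the logits scaled by ln(10).
print(np.allclose(p_10, softmax_base(z * np.log(10))))  # True
print(p_e.argmax() == p_10.argmax())                    # True: same prediction
```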