Deep Learning Lecture 12: Recurrent Neural Nets and LSTMs
Slides available at: www.cs.ox.ac.uk/people/nando....
Course taught in 2015 at the University of Oxford by Nando de Freitas with great help from Brendan Shillingford.
Comments: 48
Key bookmarks, LSTM explanation starts at 25:30 LSTM implementation at 31:49 Torch code at 34:15
@nikolatanev3293
8 years ago
thank you :)
@rajupowers
7 years ago
46:40 image captioning
@WahranRai
6 years ago
I totally disagree with your approach! All these things are related, and we have to understand how we get from one concept to another!
Thanks for publishing those videos!
I am really curious and looking forward to the next parts.
Thank you for your great explanations!
Great explanations! Thank you!
Love it ! Just genius !
can somebody use a neural network to filter out those low frequencies?
@ganujha6586
7 years ago
Alexander Bollbach
@yannisran7312
6 years ago
You can speed up the playback to trim out the high-frequency noise
Thank you for such a useful video
23:54 Why explode? You have an upper bound; if the upper bound goes to zero the gradient vanishes, but if the upper bound goes to infinity, it bounds nothing.
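For what it's worth, the two regimes are easy to see numerically: backpropagating through time multiplies the gradient by the same Jacobian at every step, so its norm grows or shrinks geometrically with the largest singular value. A minimal sketch (NumPy, with illustrative values not taken from the lecture):

```python
import numpy as np

def grad_norm_after(T, scale):
    # Jacobian of a linear recurrence h_t = W h_{t-1}; its largest
    # singular value is `scale`.
    W = scale * np.eye(3)
    g = np.ones(3)                # gradient arriving from the loss
    for _ in range(T):
        g = W.T @ g               # one step of backpropagation through time
    return np.linalg.norm(g)

print(grad_norm_after(50, 1.1))   # largest singular value > 1: explodes
print(grad_norm_after(50, 0.9))   # largest singular value < 1: vanishes
```

So the bound behaves both ways: when it shrinks it forces the gradient to vanish, and when it grows it no longer constrains anything, which is exactly the case where explosion is possible.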
The lecture explains, at a high conceptual level, the use of RNNs and LSTMs and their applications. Thanks very much for that. I would appreciate it even more if it went into a little more detail about how an LSTM solves the vanishing/exploding gradient problem in an RNN by using gates, and about how backpropagation works in this case to minimize the error. I will probably go learn more about it myself. Anyway, thanks much.
How does backpropagation work in a bidirectional LSTM?
When will lectures 13 and 14 become available?
12:48: In the RNN cartoon, shouldn't x_t and x_(t-1) be unconnected?
Thank you for making your lecture available! Why is the attribution for LSTM given as Alex Graves instead of Juergen Schmidhuber?
@sudhaannangi8143
7 years ago
Steve Rowe
@chrisanderson1513
7 years ago
It might be that he used Alex Graves' slides?
super good
At 36:57, what is the variable 'opt'?
I think your explanation at the beginning about the coolness of the convnet that Yann LeCun demoed in class is missing something. Specifically: taking a picture of something (in your talk, a picture of the crowd), pointing the camera away, and having a program signal "high" when the crowd is back in the field of view isn't that exciting; you could do this with just a dot product and a threshold. Does the convnet provide scale and rotational invariance? Based on your explanation alone, I don't see how the convnet provides advantages over much simpler methods.
@isaamthalhath4359
4 years ago
Convnets can capture features better than an RNN; that's why we mainly use convnets instead of RNNs for image processing. They can downsample or upscale images using the captured features.
At 19:40... why is there Theta^T (that is, Theta transpose) in the derivative and not just Theta?
@user-pg4bq1wo7t
7 years ago
I think this is a widespread error. Typically people don't write BPTT down explicitly: instead, they define intermediate variables \delta_t and use them to express BPTT. In 2012, a paper ("On the difficulty of training recurrent neural networks") tried to write the derivatives out directly. In that paper the matrix is transposed, which I think is an error. This error doesn't compromise the paper's correctness, though: it was aimed at illustrating why gradient explosion/vanishing occurs, which is not affected by the extra transpose operation. But the transpose does affect BPTT's implementation (this might be masked by the automatic differentiation systems of modern deep learning frameworks, such as TensorFlow). Since then, many slides have cited that work, including several famous courses (e.g. Stanford's CS224d), which has had a very bad influence. If you search Google for "RNN Jacobian transpose", you will see many Stanford students asking about this! It's really strange that the course instructors don't correct this error and keep making students prove it.
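Whether the transpose belongs can be checked on a toy case with finite differences: when the gradient is accumulated backwards, each step does multiply by the transposed Jacobian. A NumPy sketch with made-up numbers (not from the lecture's code):

```python
import numpy as np

W = np.array([[0.5, 0.2],
              [0.1, 0.8]])
c = np.array([1.0, -2.0])           # dL/dh_T for the loss L = c . h_T
h0 = np.array([0.3, 0.7])
T = 5

def loss(h):
    for _ in range(T):
        h = W @ h                   # forward recurrence h_t = W h_{t-1}
    return c @ h

# BPTT: each backward step multiplies the incoming gradient by the
# *transposed* Jacobian dh_t/dh_{t-1} = W.
g = c.copy()
for _ in range(T):
    g = W.T @ g                     # without the .T this would be wrong

# Finite-difference check of dL/dh_0.
eps = 1e-6
num = np.array([(loss(h0 + eps * e) - loss(h0 - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(np.allclose(g, num, atol=1e-5))   # True
```

The disagreement in the thread largely comes down to layout conventions for the Jacobian; in the backward-accumulation form above, the transpose is exactly what makes the analytic and numerical gradients match.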
Thx
At 38:19: why is it 2000 x 4 dimensions per sentence? The hidden state is 1000 numbers and we have 4 levels of LSTMs, so it should be 4 x 1000. Where are the other 4000 coming from? I noticed the same in the original paper (arxiv.org/pdf/1409.3215.pdf), so I am surely missing something.
I just noticed that this LSTM mimics a PLC ladder logic diagram.
What does he mean when he states "(...) recurrence is essential for Turing Completeness"?
@chrisanderson1513
7 years ago
I think he's talking about requirements for something being Turing complete. This might help: cs.stackexchange.com/questions/991/are-there-minimum-criteria-for-a-programming-language-being-turing-complete
Why is it like nn.Sigmoid()(...) in the torch code?
8 years ago
+Gökçen Eraslan Aah, it's something like a = nn.Sigmoid(); c = a:forward(b)
Shouldn't the recurrent part of the RNN be h_t = φ(θ h_{t-1} + θ_x x_t)? The activation should take all the inputs, including both h and x.
@FariborzGhavamian
7 years ago
I don't think so. Think of \phi(h_{t-1}) as the output at time step t-1, which is fed back as input for time step t.
@rutapetra8795
7 years ago
That's the part that confused me too. I have checked a couple more papers on RNNs, and all of them included both h and x.
@emadwilliam45
5 years ago
I am also confused about it; did you find an explanation?
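For what it's worth, the form used in most papers does pass both the previous state and the current input through the nonlinearity together. A minimal one-step sketch in NumPy (sizes and initialisation chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_hidden, n_input = 4, 3
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_x = rng.normal(scale=0.1, size=(n_hidden, n_input))
b = np.zeros(n_hidden)

def rnn_step(h_prev, x_t):
    # h_t = phi(W_h h_{t-1} + W_x x_t + b), with phi = tanh:
    # the nonlinearity is applied to the sum of both contributions.
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):   # unroll over 5 time steps
    h = rnn_step(h, x_t)
print(h.shape)   # (4,)
```

The slide's h_t = θ φ(h_{t-1}) + θ_x x_t form moves the nonlinearity to the previous step's output instead; both describe the same kind of recurrence, just with φ placed differently.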
Hello, I'm from Brazil
Why don't you use the standard notation for the RNN recurrence? h(t) = phi( h(t-1)*W + U*x(t) ) and y(t) = psi( V*h(t) ), possibly with biases.
Somewhat confused about this. Is h a scalar or a vector? And if h is a vector, then what is a product of vectors?
@chrisanderson1513
7 years ago
I think h is a vector. It could be pointwise multiplication: en.wikipedia.org/wiki/Hadamard_product_(matrices)
There's a lot of magic there. It looks like you can't really explain how sentences are generated. I wonder how you can design systems if you can't control the parameters, because you don't know exactly what they do. Moreover, how are the parameters learned by the system? There is no minimization of a cost function here, or is there? It's not clear, at least to me.
@IgorAherne
6 years ago
As far as I know, the gates (input, forget, output) are one-layer "mini neural nets" themselves, so their weights get tweaked through backpropagation as well. This increases the processing cost by a large amount, however. I still don't see how the exploding/vanishing gradient (during training) is solved by these complex LSTM systems...
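The gates are indeed one-layer sigmoid nets whose weights are learned by backpropagation, and the part that helps with vanishing gradients is that the cell state is updated additively, gated elementwise (a Hadamard product), rather than repeatedly squashed through a nonlinearity. A minimal sketch of one LSTM step in NumPy (weight shapes and initialisation are illustrative, not taken from the lecture's Torch code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    # Each gate is a one-layer net over (x, h_prev); the four gates
    # are computed with one stacked matrix multiply, then split.
    z = Wx @ x + Wh @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input/forget/output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g      # additive, elementwise-gated cell update
    h = o * np.tanh(c)          # hidden state exposed to the next layer
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
Wh = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = c = np.zeros(n_hid)
for x in rng.normal(size=(6, n_in)):
    h, c = lstm_step(x, h, c, Wx, Wh, b)
print(h.shape, c.shape)
```

When the forget gate f is near 1, the gradient flowing through c from one step to the next is multiplied by a value near 1 rather than by a squashing Jacobian, which is why error can propagate over many more time steps than in a plain RNN.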
Guy loves to hear himself talk. The actual lecture doesn't begin until about 7:50
This kind of "condensed" lecture is only suitable for people who have solid background knowledge in NN already.