XLNet: Generalized Autoregressive Pretraining for Language Understanding
Science and technology
Abstract:
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
arxiv.org/abs/1906.08237
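As a rough illustration of the permutation objective described in the abstract, here is a minimal sketch. This is not the paper's implementation (XLNet additionally uses two-stream attention and only predicts the last tokens of each permutation); `model` is a hypothetical callable returning vocabulary logits for one target position given a position-tagged context.

```python
import torch
import torch.nn.functional as F

def permutation_lm_loss(model, tokens):
    """One Monte Carlo sample of an XLNet-style objective:
    sample a factorization order z, then predict each token
    autoregressively in that order, conditioning only on the
    tokens that precede it in z (not in the original sequence)."""
    T = tokens.size(0)
    z = torch.randperm(T)          # sampled factorization order
    loss = 0.0
    for t in range(T):
        context = tokens[z[:t]]    # tokens earlier in the permutation
        target_pos = z[t]
        # `model` is a hypothetical callable: given the context tokens,
        # their original positions, and the target position, it returns
        # logits over the vocabulary for the target token.
        logits = model(context, z[:t], target_pos)
        loss = loss + F.cross_entropy(logits.unsqueeze(0),
                                      tokens[target_pos].unsqueeze(0))
    return loss / T
```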
Comments: 43
Please keep making these videos. Your work is amazing:))
Really cool! The "New York is a city" example helped a lot with my understanding of this!
I didn't really understand the random permutation idea from other sources, but this video made it clear how the shuffled permutations allow combining the AR and BERT's AE ideas. Thanks!
After a point, searching on the internet gives you nothing; this channel is the only place where I find explanations for very complex things in a way a newbie can understand. Please don't stop.
I was not getting the core idea behind XLNet and you made it look like a piece of cake. Subscribed!! Thank you.
You are genuinely changing the way I read and understand papers. Your work is amazing; please do more NLP papers.
I was eagerly waiting for it... Thanks, Yannic :)
Was actually waiting for you to post this, thanks
Thanks Yannic, this explanation is super helpful!!
Great video. The explanation made it very simple to understand and was very helpful!!
I liked the quick digression into language modeling before getting into the meat of the paper. Awesome video!
@hemichael2111
5 years ago
So did I
This is a really nice rundown compared to my half-reading, half-sleeping through the long paper. Thank you so much.
Excellent explanation, easy to understand and to the point 👌👌
Seven minutes in and I finally get the part I didn't understand! Thank you!
very clear explanation, thanks for the video
Yannic is the best guy on the internet
Thank you So Much!
This is so enlightening!!!
It took me 2 hours 20 minutes to understand this, but it was worth it; I won't forget it anymore.
Thank you
Thank you for a very clear explanation. I wonder how many permutations they sample for each sentence; I couldn't find it in the paper.
Thank you man
Thanks, you are doing God's work!
Great effort.
Thanks for the video - it is very helpful. Could you please make a video on Cross-lingual Language Model Pretraining (XLM)?
"Hmmmm " :P Great video.
Thanks
Hi, could you also clarify why the embeddings are multiplied with the representation produced by the network in the Equation 1-2 formulation? My understanding was that you could directly apply a softmax to the representation to train.
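For anyone puzzled by the same question: in the paper's Equations 1-2 the softmax weights are the word embeddings themselves (a tied softmax), so the probability of word x is proportional to exp(e(x)^T h). A minimal sketch, assuming a tied embedding matrix E:

```python
import torch

vocab_size, d_model = 32000, 768
E = torch.randn(vocab_size, d_model)  # input embedding matrix, reused as output weights
h = torch.randn(d_model)              # context representation h(x_{<t}) from the network

logits = E @ h                        # e(x)^T h for every word x in the vocabulary
probs = torch.softmax(logits, dim=0)  # p(X_t = x | x_{<t})
```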
Thank you
@12:40 is what the model is listening to! :D
Language modelling: autoregressive models predict the next word using a window of previous words, while autoencoding models predict the missing words within a window of words. Aren't these two techniques the same ones we used to train word embeddings for Word2Vec, where CBOW (continuous bag of words) predicted the next word from the previous window of words and the n-gram method predicted a missing word using the previous and next words? What's the difference? Am I missing something?
@YannicKilcher
4 years ago
The difference is that in autoregressive decoding you do it again and again in a sequence.
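To make that difference concrete, a toy sketch (the uniform `score` function is a placeholder, not a real model): CBOW makes one independent prediction per window, while an autoregressive model chains its predictions so each step's context includes every previous token.

```python
import math

VOCAB = 10
def score(context, target):  # log p(target | context); placeholder uniform model
    return math.log(1.0 / VOCAB)

# CBOW-style (Word2Vec): one independent prediction per (window, target) pair;
# windows never feed into each other.
def cbow_logprob(window, target):
    return score(window, target)

# Autoregressive LM: the sequence probability chains step by step,
# log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1}),
# so each prediction's context grows with every previous step.
def ar_logprob(tokens):
    return sum(score(tokens[:t], tokens[t]) for t in range(1, len(tokens)))

print(ar_logprob([3, 1, 4, 1, 5]))  # chained over the whole sequence
```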
Cool video!! Thanks for it. However, the voice quality was not that great, and there is clearly scope for improvement there.
Now all my mind is like New Hmm is a Hmm, New York is a Hmm Hmm and Hmm~ Hmm~ Hmm~ Hmm~~~
2 out of 5 words is closer to 40%
18:23 😳😂
In this AI journey, I find some people explain the papers but leave behind the code, and some explain the code (hopelessly, though) but leave out the theory. Can't we have a paper explanation followed by an explanation of the code in TensorFlow or PyTorch? Or maybe everyone just knows the high-level overview and thus ignores that part, although it is greatly needed. Please upvote, guys.
@YannicKilcher
5 years ago
If I were to also review the code, the videos would be 2+ hours 😁 but thanks for the feedback, will consider doing separate code reviews
@robinranabhat3125
5 years ago
@@YannicKilcher If you do code reviews as well, trust me, your channel will be one of its kind. Anyone sturdy enough to work through these papers would want to see the implementation details.
@abcdxx1059
4 years ago
@@YannicKilcher damn you would do that for us 🤗🤗🤗
@tanny411
4 years ago
I swear to sit through the 2+ hour videos. This channel is life!
Hmm..hmm...hmm...hmmm