BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Science & Technology

arxiv.org/abs/1810.04805
Abstract:
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.
Authors:
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Comments: 93

  • @kevinnejad1072 · 4 years ago

    “the man went to MASK store” makes a lot of sense these days.

  • @YannicKilcher · 4 years ago

    Holy crap I had to laugh at this 😁

  • @LouisChiaki · 4 years ago

    LOL

  • @zhangc5723 · 5 years ago

    It's so kind of you to introduce these papers to us in such a decent way. Thanks a lot.

  • @ramiyer3841 · 5 years ago

    Fantastic overview. Really appreciate your patient and detailed walk-through of the paper.

  • @teodorflorianafrim4220 · 3 years ago

    I watch the ads entirely just to show my support for this amazing channel

  • @PhucLe-qs7nx · 3 years ago

    It's better if you just click it 10 times :))

  • @marcobuiani2628 · 3 years ago

    This is one of your best NLP videos to me: a very quick but clear recap of language models, RNNs, word vectors, and attention, all to explain the BERT revolution. This is awesome! And I would love a series of recap videos like this. Kudos!

  • @niduttbhuptani9301 · 4 years ago

    I like the way he knows he isn't the best at explaining stuff but still tried 110% to explain! Thanks man for the amazing papers.

  • @shrikanthsingh8243 · 4 years ago

    This is my second comment on your videos. I am really thankful to you for creating such an informative video on BERT. Now I can go through the paper with some confidence.

  • @YannicKilcher · 4 years ago

    Thanks for the feedback. Glad it helped

  • @StevenWernerCS · 5 years ago

    Thanks for doing your part :)

  • @ahmedbahaaeldin750 · 5 years ago

    Thank you so much for your efforts.

  • @Konstantin-qk6hv · 3 years ago

    Great explanation! Thank you!

  • @Alex-ms1yd · 1 year ago

    special thanks for tokenization detour and deeper dive into finetuning/evaluation tasks!

  • @sasna8800 · 3 years ago

    Thank you so much. I searched a lot and read the paper, but I had difficulty understanding it until I watched your video; you make everything easy.

  • @asifalhye5062 · 3 years ago

    Absolutely LOVED the video.

  • @sofia.eris.bauhaus · 3 years ago

    "the problem is that a character in itself doesn't really have a meaning" f

  • @paulntalo1425 · 3 years ago

    Thanks for the illustrations

  • @TechVizTheDataScienceGuy · 3 years ago

    Nicely done!

  • @user-bj5bb7rl9f · 6 months ago

    Nicely! Thanks a lot.

  • @panoss4149 · 5 years ago

    a big thank you

  • @tae898 · 3 years ago

    Bert is so cool!

  • @swarajshinde3950 · 4 years ago

    Loved It :)

  • @HelloPython · 5 years ago

    Thanks a lot !

  • @goelnikhils · 1 year ago

    Amazing Explanation

  • @tempvariable · 3 years ago

    Thank you. At 10:51, I think that although in ELMo they concatenate the left and right sides, when making a prediction, if there is a softmax, the back-propagated error to the left side should be affected by the right side and vice versa. However, I understand what you mean by them not being that coupled.

  • @csam11100 · 2 years ago

    Thank you so much for the amazing paper explanation! Does the explanation at 16:28 mean they pre-train on the two tasks at the same time (predict the mask "and" the isNext label), or train them in order (pre-train task 1, then task 2)?

  • @saurabhgoel203 · 4 years ago

    Very nice explanation. Can you please elaborate on the token embeddings used in BERT? Are these the same 300-dimensional vectors from GloVe, or are these embeddings trained from scratch in BERT? How we get the base embeddings is something I am not able to understand. Thanks in advance for clarifying.

  • @dr.deepayogish5398 · 4 years ago

    Sir, wonderful and clear explanation. I have a doubt: is a QA system built with the BERT technique supervised or unsupervised? And is BERT a pre-training model?

  • @tuhinmukherjee8141 · 3 years ago

    Hey, what BERT claims is in fact very similar to the workings of a Transformer encoder layer as described in the "Attention Is All You Need" paper. The encoder sub-model is allowed to peek at future tokens as well.

  • @gorgolyt · 3 years ago

    That's not a secret; indeed, they describe the architecture in the paper as a Transformer encoder. The novelty is in using this Transformer encoder for language-model pre-training.
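
    The point above can be seen directly in a standard framework. Below is a minimal sketch (not from the video; it assumes PyTorch's nn.TransformerEncoderLayer API) in which the same encoder layer is bidirectional by default and becomes left-to-right only when given a causal mask:

        import torch
        import torch.nn as nn

        layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        x = torch.randn(1, 5, 64)  # one toy sequence of 5 token vectors

        bert_style = layer(x)      # no mask: every position attends to left and right context

        causal_mask = nn.Transformer.generate_square_subsequent_mask(5)
        gpt_style = layer(x, src_mask=causal_mask)  # causal mask: only left context is visible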

  • @antonispolykratis3283 · 4 years ago

    I cannot say that I understood Bert from this video.

  • @bryanye4490 · 3 years ago

    To understand a technical paper, a basic level of technical foundation is required. There are explanation videos out there targeted at laymen, but this video is for an audience who either can already read the paper and want a summary instead, or who know what is going on but get thrown off by academic language and jargon in papers.

  • @marybaxart8998 · 3 years ago

    Hi! Thanks a lot for this video! I was searching for information about out-of-vocabulary words, and I found it in your talk :) However, one moment remains unclear: how do we tokenize out-of-vocabulary words? I mean, how do we divide words into characters or word pieces? What algorithm is used to divide "subscribe" into "sub + s + c + r + i + b + e" and not "sub + scribe"? I understand that it depends on the vocabulary, but how exactly is it performed? Thanks a lot again) (BTW I subscribed :)) )

  • @YannicKilcher · 3 years ago

    That's usually determined by a heuristic. It tries to split it into as few tokens as possible, given some vocabulary.
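
    A rough sketch of that heuristic (an illustration only, with a made-up toy vocabulary, in the spirit of WordPiece's greedy longest-match-first splitting):

        def wordpiece(word, vocab):
            """Greedily take the longest vocabulary piece at each position."""
            pieces, start = [], 0
            while start < len(word):
                end, match = len(word), None
                while end > start:
                    piece = word[start:end]
                    if start > 0:
                        piece = "##" + piece   # continuation pieces carry the ## prefix
                    if piece in vocab:
                        match = piece
                        break
                    end -= 1
                if match is None:
                    return ["[UNK]"]           # no piece matched at all
                pieces.append(match)
                start = end
            return pieces

        toy_vocab = {"sub", "##scribe", "##s", "##c", "##r", "##i", "##b", "##e"}
        print(wordpiece("subscribe", toy_vocab))   # ['sub', '##scribe'] with this toy vocabulary

    With this vocabulary, "sub + ##scribe" wins over the character-by-character split simply because longer pieces are tried first.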

  • @vinayreddy8683 · 4 years ago

    At 25:10 you're talking about character-level tokens. Does that refer to the "Enriching Word Vectors with Subword Information" paper?

  • @YannicKilcher · 4 years ago

    I'm referring to wordpieces, which refers to the sub-words, yes.

  • @seanspicer516 · 5 years ago

    HYPE!

  • @thak456 · 4 years ago

    Does BERT take in fixed-length sentences for the question-and-paragraph task? If not, how is the variable-length input handled? Basically, what is the size of the data fed into the network?

  • @YannicKilcher · 4 years ago

    The total state is fixed length, with padding or cropping if needed.
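
    For concreteness, a sketch of that padding/cropping using the HuggingFace tokenizer (an assumption for illustration; the video predates this exact API):

        from transformers import BertTokenizer

        tok = BertTokenizer.from_pretrained("bert-base-uncased")
        batch = tok(
            ["a short question", "a much longer paragraph that gets cropped if it exceeds the limit"],
            padding="max_length", truncation=True, max_length=16, return_tensors="pt",
        )
        print(batch["input_ids"].shape)     # torch.Size([2, 16]): every sequence is exactly 16 tokens
        print(batch["attention_mask"][0])   # 1 for real tokens, 0 for the padding positions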

  • @elnazsn · 3 years ago

    Somehow the Figure 1 comparison image is different in the arXiv 2019 paper?

  • @gorgolyt · 3 years ago

    arXiv papers are pre-publication, not necessarily final versions. The text is a bit different too.

  • @xiquandong1183 · 4 years ago

    Nice video. This is an excerpt from the paper which I am not able to understand: "Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context". Can you please help me? I am not able to understand how a word can see itself after incorporating bidirectionality. Thanks.

  • @YannicKilcher · 4 years ago

    Consider the sequence "A B C" and try to reconstruct the tokens with bidirectional contexts and two hidden layers. The embedding of C in hidden layer 1 will have attention to the input B and the embedding of B in hidden layer 2 will have attention to the layer 1 embedding of C. So the embedding of B has direct access to the input token B, which makes the reconstruction task trivial.
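
    The same argument as a tiny reachability computation (just boolean bookkeeping, not an actual model): even if each position is forbidden from attending to itself, two layers of bidirectional attention already route a token's own input back to its embedding.

        import numpy as np

        n = 3  # toy sequence "A B C"
        # one layer of bidirectional attention: each position sees every *other* position
        sees = np.ones((n, n), dtype=int) - np.eye(n, dtype=int)

        # compose two layers to find what each position sees indirectly
        sees_after_two_layers = (sees @ sees) > 0
        print(sees_after_two_layers[1, 1])  # True: B's layer-2 embedding depends on the input token B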

  • @xiquandong1183 · 4 years ago

    @@YannicKilcher Oh, I get it now. Thanks.

  • @fahds2583 · 3 years ago

    I have a question: when you train a BERT model, let's say for a named-entity recognition task like "Subscribe to Pewdiepie", does the BERT model automatically map the words 'Subscribe', 'to', 'Pewdiepie' to its already-trained word embeddings read off the corpus? If it does, it means the BERT model comes with its huge bag of word embeddings.

  • @tankimwai1885 · 3 years ago

    If you are using PyTorch, it comes with a BERT tokenizer! I am not sure if TensorFlow has this.

  • @YannicKilcher · 3 years ago

    It splits into word pieces, and worst case into characters
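
    Both answers can be checked directly with the HuggingFace tokenizer (an assumed API here; the exact pieces depend on the vocabulary that ships with the model):

        from transformers import BertTokenizer

        tok = BertTokenizer.from_pretrained("bert-base-uncased")
        print(tok.tokenize("Subscribe to Pewdiepie"))
        # roughly ['subscribe', 'to', 'pe', '##wd', '##ie', '##pie']: in-vocabulary words map to
        # single entries, unknown words are split into word pieces and, in the worst case, characters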

  • @susmitislam1910 · 3 years ago

    BERT was ready for the pandemic way before it even started.

  • @tamvominh3272 · 4 years ago

    Dear Yannic, could you please share with me how to fine-tune BERT for a regression task? My data looks like this: the input is a sentence of about 30 words, and the output is a score in [0, 5]. Is it good to use BERT for a dataset like this? I found some documents saying that transfer learning is effective when the new dataset is the same as the source task/dataset. Thank you!

  • @YannicKilcher · 4 years ago

    It depends. If your sentences are natural language (and preferably English), then it can make sense. Take a pre-trained BERT and feed the CLS vector into a regression head. Maybe huggingface has pre-built modules for exactly that already.

  • @tamvominh3272 · 4 years ago

    @@YannicKilcher Thank you for your prompt response! I still have some questions. 1) Do I need to put [SEP] at the end of my sentence, or only [CLS] at the beginning? In some tutorials they put [SEP] at the end and in some they don't, for a classification task (here I think we don't need [SEP]). 2) I did not see any pre-built module for regression on huggingface, only classification, question answering, etc. Do you mean I should use the 'general' BertModel as you used in your tutorial and modify it for the regression task? I am sorry if it is a silly question; I am just taking my first steps into DL and I do not usually use Pytorch. Could you please show me more detail for this part? Thank you so much!

  • @YannicKilcher · 4 years ago

    @@tamvominh3272 1) you just need to try out these things. 2) in that case, just take the standard bert encoder, take the CLS output and run it through a linear layer with a regression loss.
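
    A minimal sketch of point 2 (illustration only; the class name BertRegressor is made up, and the HuggingFace transformers API is assumed):

        import torch
        import torch.nn as nn
        from transformers import BertModel, BertTokenizer

        class BertRegressor(nn.Module):
            def __init__(self, encoder):
                super().__init__()
                self.encoder = encoder
                self.head = nn.Linear(encoder.config.hidden_size, 1)  # linear regression head

            def forward(self, **inputs):
                out = self.encoder(**inputs)
                cls = out.last_hidden_state[:, 0]   # the [CLS] token representation
                return self.head(cls).squeeze(-1)

        tok = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertRegressor(BertModel.from_pretrained("bert-base-uncased"))

        batch = tok(["an example sentence of roughly thirty words"], return_tensors="pt",
                    padding=True, truncation=True)
        pred = model(**batch)   # fine-tune against the [0, 5] scores with e.g. nn.MSELoss()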

  • @tamvominh3272 · 4 years ago

    @@YannicKilcher Thank you so much for your help!

  • @snippletrap · 4 years ago

    @@tamvominh3272 HuggingFace has a model with a classification head built-in. Follow their tutorials and examples. Let the tokenizer do the work. Very handy

  • @1animorph · 5 years ago

    Lol loved the explanation and the pewdiepie reference. Hope to learn a lot more from your paper explanations.

  • @sagaradoshi · 2 years ago

    Hello Sir, thanks for the video. I have a question. What confuses me when I see BERT or GPT in the picture at 14:16 is why transformers are shown so many times in a layer format. When I read about the Transformer, it takes all the words of a sequence at once and passes them through layers of encoders (attention + fully connected layer). In BERT we are also passing all the words to the transformer, right? Then why are we showing so many transformers (in circles)? Is BERT a collection of many transformers (a combination of encoders + decoders)?

  • @YannicKilcher · 2 years ago

    A transformer is multiple layers of attention and feed forward stacked on top of each other
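
    In code terms (a sketch assuming PyTorch), each circle in the figure is one such layer, and BERT-base stacks 12 of them:

        import torch.nn as nn

        # one "circle" in the figure: self-attention + feed-forward
        layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)

        # a BERT-base-sized encoder: 12 such layers stacked on top of each other
        encoder = nn.TransformerEncoder(layer, num_layers=12)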

  • @sagaradoshi · 2 years ago

    @@YannicKilcher Thanks for your kind reply. But why do we show a series of transformers in the picture? Shouldn't it be one transformer within which we have a series of encoder layers (attention + feed-forward)?

  • @purviprajapati8413 · 5 years ago

    Sir, is it possible to apply the BERT model to Wikipedia tagging? And could we combine BERT with another classifier?

  • @YannicKilcher · 4 years ago

    What do you mean by tagging?

  • @vg9311 · 4 years ago

    Can someone please explain how the language modeling task used to train OpenAI GPT is unsupervised, as mentioned at 12:43? Thanks

  • @YannicKilcher · 4 years ago

    All training signal comes from the input data itself, there are no external labels.
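
    A tiny sketch of what that means for a left-to-right model like GPT: the targets are just the input shifted by one token, so no human-provided labels are needed (BERT does the analogous thing by masking tokens and predicting them).

        text = ["the", "man", "went", "to", "the", "store"]

        # self-supervised targets: predict each next token from its left context
        inputs, targets = text[:-1], text[1:]
        for context_end, target in zip(inputs, targets):
            print(f"context ending in {context_end!r} -> predict {target!r}")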

  • @kevintoner6068 · 3 years ago

    Elmo and Bert... What next? Kermit?

  • @Kerrosene · 4 years ago

    ELMo does left and right also. Why does it not do as well as BERT? Because BERT uses attention, maybe... any thoughts?

  • @zxzhaixiang · 4 years ago

    He explained that very well in the video. To me, BERT is actually not traditionally bidirectional like ELMo; it is more like omnidirectional!

  • @NoName-iz8td · 4 years ago

    Yeah, but it's not at the same time: they go from left to right and then from right to left and concatenate, so it's a shallow form of bidirectionality.

  • @honglu679 · 3 years ago

    Why are the inputs (word, segment, and position embeddings) summed together instead of concatenated into a vector? Doesn't the summation lead to ambiguity/information loss?
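
    For reference, this is the summation being asked about (a minimal sketch of the BERT-base input representation; the token ids are toy values):

        import torch
        import torch.nn as nn

        vocab_size, max_len, hidden = 30522, 512, 768        # BERT-base sizes
        tok_emb = nn.Embedding(vocab_size, hidden)
        seg_emb = nn.Embedding(2, hidden)                     # segment A / segment B
        pos_emb = nn.Embedding(max_len, hidden)

        token_ids = torch.tensor([[101, 1996, 2158, 103, 102]])  # toy ids ([CLS] ... [MASK] ... [SEP])
        segment_ids = torch.zeros_like(token_ids)
        positions = torch.arange(token_ids.size(1)).unsqueeze(0)

        x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)  # summed, not concatenated
        print(x.shape)  # torch.Size([1, 5, 768]): the hidden size stays fixed

    Since all three tables are learned, summing keeps the model width fixed while still letting the network separate the signals, which is presumably why the paper did not find concatenation necessary.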

  • @purneshdasari5667 · 5 years ago

    Can we download the pretrained BERT model and use it on our GPU machines?

  • @pr3st0n2 · 5 years ago

    github.com/google-research/bert has what you need

  • @iedmrc99 · 5 years ago

    you can also have sentence vectors with github.com/hanxiao/bert-as-service

  • @snippletrap · 4 years ago

    HuggingFace has what you want

  • @mustafasuve3109 · 3 years ago

    How does BERT handle variable-sized inputs?

  • @YannicKilcher · 3 years ago

    Usually you pad them

  • @monart4210 · 3 years ago

    Could I extract word embeddings from BERT and use them for unsupervised learning, e.g. topic modeling? :)

  • @YannicKilcher · 3 years ago

    Sure

  • @bryanye4490 · 3 years ago

    But why not take the encoding of the full sentence for topic modeling? Why stop at word embeddings? You will lose all the context in the sentence/paragraph.

  • @ximingdong503 · 3 years ago

    Hi, firstly thanks for posting this one. I have a question: say I want to pre-train BERT, so I have some text, but how do I generate the word embeddings for the input (the token embedding part)? Are they generated randomly at first? For example, if we only have two words, "yes" and "no", then after one-hot encoding we can say yes -> 10 and no -> 01. If we then have a sentence "yes no" that enters the BERT model, how do we initialize the word embeddings of those two words? Also, if we want to fine-tune the model, does that mean we have pre-trained embeddings (for example, the same way as GloVe word embeddings), or random ones as in pre-training? In other words, does fine-tuning BERT use pre-trained embeddings?

  • @dARKf3n1Xx · 2 years ago

    A pre-trained model learns the embeddings of the vocabulary provided to it. These learnt embeddings are then used as the initial embeddings in the fine-tuning phase, along with the classification-layer weights.

  • @SAINIVEDH · 3 years ago

    You explained it well, brother.

  • @JoshFlorii · 5 years ago

    It would be good if you ran your audio through a compressor; it's a little hard to understand.

  • @YannicKilcher · 5 years ago

    Thanks, will try

  • @vslaykovsky · 3 years ago

    I've subscribed to PewDiePie!

  • @traindiesel7005 · 3 years ago

    I just hope Amazon doesn't come out with ERNIE...

  • @Tuguldur · 5 years ago

    The best part was when you explained the word table on "subscribe to pewdiepie".

  • @SirJonyG · 5 years ago

    21:16

  • @kristjan2838 · 5 years ago

    9-year-olds teaching NLP

  • @pixel7038 · 5 years ago

    Kobe Bryant said the best type of success is to have a child’s heart of passion. Constantly asking questions.

  • @blanamaxima · 3 years ago

    I lost interest after 10 minutes, as you explain useless parts and not the core. It might make sense to focus on the topic and assume some things are known.
