Multi-Label Classification on Unhealthy Comments - Finetuning RoBERTa with PyTorch - Coding Tutorial

Science & Technology

A practical Python coding guide - In this guide I fine-tune RoBERTa with PyTorch Lightning on a multi-label classification task, in particular the Unhealthy Comment Corpus. The result is a language model that can classify whether an online comment contains attributes such as sarcasm, hostility, or dismissiveness.
---- TUTORIAL NOTEBOOK
colab.research.google.com/dri...
Remember to press "Copy to Drive" to save a copy of the notebook for yourself
Intro: 00:00:00
Video / project outline: 00:00:27
Getting Google Colab set up: 00:02:00
Imports: 00:03:23
Inspect data: 00:07:05
PyTorch dataset: 00:11:15
PyTorch Lightning data module: 00:27:08
Creating the model / classifier: 00:35:45
Training and evaluating model: 01:07:30
This series attempts to offer a casual guide to Hugging Face and Transformer models, focused on implementation rather than theory. Let me know if you enjoy them! I'll be doing future videos on computer vision if that's something people are interested in; let me know in the comments :)
----- Research material for theory
RoBERTa paper: arxiv.org/abs/1907.11692
HuggingFace: huggingface.co/
Unhealthy Comment Corpus paper: arxiv.org/abs/2010.07410
Check out my website: www.rupert.digital

Comments: 82

  • @vincentcoulombe6864 · 2 years ago

    Great video! Never heard of PyTorch Lightning before. It looks really useful!

  • @mytabby6463 · 9 months ago

    Hi Rupert! Thanks for putting this together.

  • @HeadshotComing · 2 years ago

    This guide is really helpful! Your explanations are very easy to understand and everything flows very smoothly. In future videos, could you please include a little pop-up recording of yourself at the side of the screen like in the previous one? It makes it easier to maintain focus and listen more carefully. It would also be great if the volume was higher. Other than that, phenomenal man!

  • @rupert_ai · 2 years ago

Thanks for the feedback! I'll keep it in mind :) hopefully I'll get a better mic/camera set up at some point soon! Appreciate the feedback

  • @danialhayati2779 · 2 years ago

    Awesome! Keep up the work!

  • @rupert_ai · 2 years ago

    Thanks a lot Danial!

  • @SuirouNoJutsu · 2 years ago

    Great tutorial, thanks for this. Like your style and setup. Future videos could be made better by sharing a viewable link to the Colab notebook. I kept having to rewind to find mistakes in my code compared to yours, and having a notebook I could pull up would go a long way.

  • @rupert_ai · 2 years ago

    Hey, thanks so much for the feedback! You're totally right, I have added a link to the notebook in the description - I think this is the correct one :) let me know if you have any issues!

  • @MattRosinski · 2 years ago

    Fantastic tutorial Rupert! Thank you for putting this together. I was wondering if you might spend some time demonstrating how to save and load the state of the model for inference and if possible recover the model state that had the lowest loss on the validation set before overfitting started to creep in?

  • @rupert_ai · 2 years ago

Hi Matt - sorry for the late reply! I can certainly do a tutorial on saving/loading models in PyTorch, and I will also discuss saving the best model (w.r.t. validation loss). I will add that to my new computer vision series on image classification - although it is a different task, it will be exactly the same code!
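
    A minimal sketch of that idea (not from the video): PyTorch Lightning can keep the checkpoint with the lowest validation loss via a ModelCheckpoint callback, assuming the LightningModule logs a "val_loss" metric with self.log(). The classifier class name and max_epochs below are illustrative.

        import pytorch_lightning as pl
        from pytorch_lightning.callbacks import ModelCheckpoint

        # Keep only the checkpoint with the lowest logged validation loss
        checkpoint_callback = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
        trainer = pl.Trainer(max_epochs=10, callbacks=[checkpoint_callback])

        # After trainer.fit(model, datamodule=data_module), reload the best weights:
        # best_model = UCCClassifier.load_from_checkpoint(checkpoint_callback.best_model_path)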

  • @Khushpich · 2 years ago

    Great tutorial

  • @ppeng08 · 2 years ago

    Can you add the datasets like 'ucc_train.csv', 'ucc_val.csv', and 'ucc_test.csv' to the repo? The colab notebook cannot copy the datasets from your drive. Thanks.

  • @HikiMaxSimo · 10 months ago

    Hey Rupert, thank you so much for sharing this amazing use-case. Where can I find the 2 datasets (ucc_train.csv and ucc_val.csv) you used in your notebook? Thanks, Massimo

  • @user-jn8ht9ww4q · 10 months ago

    Excellent video. Please make a video on CodeBERT fine-tuning

  • @megacapojuan · 2 years ago

    🤩🤩

  • @user-zo2sc5uf9z · a year ago

    There was some new information I didn't know, especially the differences between the Hugging Face version and RoBERTa! I was stuck making a multi-label classification model. I learned a lot from your video!!! THANKS

  • @dimabear · 11 months ago

    @1:12:33 How does classify_raw_comments know to create predictions on the validation data (and not the training data) if ucc_data_module contains both the train_dataset and val_dataset? You're passing "ucc_data_module" to the datamodule parameter, but both train_dataset and val_dataset were created in the setup() method.

  • @alessiogarau7948 · a year ago

    The video is great, thanks. I have a question: can this example be generalized to a problem with many classes, let's say 400?

  • @ShubhamSharma-qb1bw · a year ago

    Can you please share a code snippet for finding the F1, recall, and precision scores after prediction, like in your tutorial?

  • @sune8823 · a year ago

    Hey, something that would really be helpful is to illustrate how to use a pre-trained model on a single use case, i.e. utilize the model to classify a single comment according to the attributes. Thanks again for a very helpful video! Best wishes from across the Atlantic

  • @rupert_ai · a year ago

    Hey Sun E! It is exactly the same way you would run a batch of sentences through (except with a batch size of 1). You can simply call the model on the tokenised sentence! I may make a video at some point - thanks for the suggestion :)
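
    A minimal sketch of that single-comment case, assuming the trained `model` and Hugging Face `tokenizer` from the tutorial (the exact forward() signature depends on how the classifier was written):

        import torch

        model.eval()
        enc = tokenizer(
            "example comment to classify",
            return_tensors="pt",  # returns tensors with a batch dimension of 1
            truncation=True,
        )
        with torch.no_grad():
            logits = model(enc["input_ids"], enc["attention_mask"])  # adjust to your forward() signature
        probs = torch.sigmoid(logits)  # one probability per label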

  • @sune8823 · a year ago

@rupert_ai Totally makes sense when you lay it out like that. Thank you again for the work you're doing!

  • @nekro9t2 · a year ago

    I would like to adapt this process for a multi-language identification task. I think I can just use the language labels instead of comments. I also need the word start and end points; do you know how I can predict these as well?

  • @nadavge · 11 months ago

    Did you let the whole model train? What were the considerations between training only the last layers vs letting everything train?

  • @mahmudhasan3093 · 2 years ago

    Thank you for the amazing tutorial. For my use case I have three target labels '0', '1' and '2', where 0 is neutral, 1 is positive and 2 is negative. I won't be able to use a BCE loss, can I? What might be the alternative for it?

  • @rupert_ai · 2 years ago

Hi Mahmud, no, you won't be able to use binary cross entropy, since it is for binary targets (hence the name!) - however, you could still use the cross entropy loss, which can take as many classes as you like! I would look up categorical cross entropy loss if I were you :)
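
    A minimal sketch of that swap, assuming integer class targets 0/1/2. With mutually exclusive classes the model head outputs one logit per class (here 3), no sigmoid is applied, and nn.CrossEntropyLoss handles the softmax internally:

        import torch
        import torch.nn as nn

        criterion = nn.CrossEntropyLoss()
        logits = torch.randn(4, 3)            # batch of 4 raw model outputs, 3 classes
        targets = torch.tensor([0, 2, 1, 1])  # class indices, not one-hot vectors
        loss = criterion(logits, targets)     # log-softmax + negative log-likelihood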

  • @user-gl7dl8uu6m · a year ago

    Can you please add the test dataset code to the UCC_Data_Module and make it more usable?

  • @majidafra · 8 months ago

    Hi Rupert, thanks for this nice training video. I have been trying to use your work on my data but I get an error. Is it possible to share the error with you?

  • @gauravdev1872 · 3 months ago

    Great tutorial Rupert. Can we make a multi-label model with around 0.1M labels, like skill extraction from job postings? More clearly, can we create a BERT model which can classify a job posting based on the skills it contains? Rows: job postings, cols: skill labels?

  • @vanesamena3974 · a year ago

    Hi Rupert! Fantastic tutorial. Your videos are excellent and really helpful! I would like to know how to export the trained model. Do you know how it can be done? Thanks!!

  • @rupert_ai · a year ago

Hi Vanesa, I will cover this in a video soon, but the model you have trained is a regular PyTorch model and can be saved/loaded as normal - see this link pytorch.org/tutorials/beginner/saving_loading_models.html for a guide
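
    A minimal sketch following that linked guide; `model` is the trained classifier (a regular nn.Module) and the file path is illustrative:

        import torch

        torch.save(model.state_dict(), "ucc_roberta.pt")  # save the weights only

        # Later, or in another script: rebuild the same architecture, then load the weights
        # model = UCCClassifier(config)  # same class and config used for training (illustrative name)
        model.load_state_dict(torch.load("ucc_roberta.pt"))
        model.eval()  # switch to inference mode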

  • @vanesamena3974 · a year ago

    @rupert_ai Thank you so much for your reply!! I'll look at the documentation you sent me and wait for your video. Your material has been very useful to me. Thank you!!

  • @mjc0090 · 2 years ago

    This is a great tutorial! I have been able to replicate your idea using my own data. The question I have now is how to go from an input string like “I love this movie!” to a set of predicted labels and their confidence scores (i.e. probabilities)?

  • @rupert_ai · 2 years ago

Hi, glad you like the tutorial. The answer to your question is quite simple: take the output of your model and pass it through a sigmoid function (this squishes everything between 0 and 1), just like I do when I make my predictions. These can be interpreted as your probabilities from 0 to 1. To retrieve your actual yes/no predictions per label, you simply threshold: any probability over 0.5 is a positive prediction and anything less than 0.5 is a negative prediction. Hope that helps!
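
    A minimal sketch of that thresholding, with illustrative raw outputs for a single comment:

        import torch

        logits = torch.tensor([2.3, -1.1, 0.2, -3.0])  # raw model outputs, one per label
        probs = torch.sigmoid(logits)                  # squash each to (0, 1)
        preds = (probs > 0.5).int()                    # 1 = label present, 0 = absent
        # probs ~ [0.91, 0.25, 0.55, 0.05] -> preds = [1, 0, 1, 0]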

  • @soccihighdigger287 · 2 years ago

    Great work! Do you have a Colab/GitLab with the code?

  • @rupert_ai · 2 years ago

    Just updated the description to include this :)

  • @runjhunsingh2348 · 3 months ago

    Tried just about everything but I'm getting a 38% Hamming score on my multi-label classification of a 24,000-example dataset into 26 labels; please suggest something

  • @TanveerAhmed-kn8eh · 2 years ago

    Fantastic tutorial! I have one question: how do I predict on a single text after training?

  • @rupert_ai · 2 years ago

    Hi Tanveer! All you need to do is make a batch with one piece of text and then pass it through your model!

  • @SatyamSingh-om7rc · 5 months ago

    Where can I get the dataset you use here?

  • @sniffersmc2827 · 2 years ago

    Incredibly good tutorial, but I'm using my own dataset and it's been hanging at the [00:00

  • @rupert_ai · 2 years ago

You should see if something is wrong; however, what I would do if I were you is try a much smaller subset first (e.g. 500 samples), check that it is working as expected, and go from there : )
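
    A minimal sketch of that debugging step, assuming a pandas DataFrame feeds the Dataset as in the tutorial (the names train_df and train_dataset are illustrative):

        # Slice out a small debug subset before building the Dataset / DataModule
        small_df = train_df.sample(n=500, random_state=42)

        # Or, with an already-built torch Dataset:
        # from torch.utils.data import Subset
        # small_ds = Subset(train_dataset, range(500))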

  • @sniffersmc2827 · 2 years ago

    Yeah, I did. PL doesn't like my dataloader for some reason. Ended up writing the model as a PyTorch NN and writing my own trainer

  • @sagarpanda750 · a year ago

    @sniffersmc2827 Can you help me with what changes you implemented, as my training is also stuck at 00:00 irrespective of the number of samples.

  • @sniffersmc2827 · a year ago

    @sagarpanda750 I stopped using the PyTorch Lightning trainer and wrote my own training loop. I spent way too much time trying to debug it and at some point it just became a waste of time

  • @shahed8762 · 2 years ago

    Will this model work on Chinese text if I use AutoTokenizer? Model: Chinese RoBERTa

  • @rupert_ai · 2 years ago

    Hi! Yes, it will work on Chinese if you use a Chinese-based model, tokenizer and dataset - I am unable to tell you how well it will perform as I have no experience with it, but good luck!

  • @balamuruganm2019 · 2 years ago

    Hi bro, nice video… how do I pretrain and fine-tune for languages other than English (e.g. Indian languages)? Could you suggest any code?

  • @rupert_ai · 2 years ago

Hey bro, thanks. I would look at a BERT model that has been pre-trained on multilingual data, ideally with a focus on Indian languages. Here is a model that is fine-tuned on 12 Indian languages - huggingface.co/ai4bharat/indic-bert - I'd suggest you start with this model, and maybe also look at this one - huggingface.co/bert-base-multilingual-cased - which isn't Indian-specific but is trained on 104 different languages. In both cases you'd need to swap out the 'base' model that I use and instead copy in the model card name, e.g. ai4bharat/indic-bert. I would then try to fine-tune your model again on any data you have available (if you do) and go from there! Let me know how you get on :)
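
    A minimal sketch of that swap, replacing the model card name wherever the tutorial loads the base RoBERTa model (the rest of the pipeline stays the same):

        from transformers import AutoModel, AutoTokenizer

        model_name = "ai4bharat/indic-bert"  # or "bert-base-multilingual-cased"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        base_model = AutoModel.from_pretrained(model_name)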

  • @LJ-ci3wz · 2 years ago

    I have trained this model on my dataset, but how do I deploy it on a document containing text?

  • @rupert_ai · 2 years ago

Hi LJ, you'll need to extract the text from your document and pass it through the same process as during training (e.g. converting to input IDs, etc.). The maximum size for this model is around 512 input tokens if I remember correctly, which might be a good deal shorter than your whole document. If that is the case, then you'll either have to predict on different sections of your document or take an average prediction over the whole document!
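
    A minimal sketch of the chunk-and-average idea, assuming the tutorial's `model`, a fast Hugging Face `tokenizer`, and a long string `document` of extracted text (return_overflowing_tokens splits long text into fixed-size windows):

        import torch

        enc = tokenizer(
            document,
            max_length=512,
            truncation=True,
            padding="max_length",
            return_overflowing_tokens=True,  # one row per 512-token window
            return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(enc["input_ids"], enc["attention_mask"])  # adjust to your forward() signature
        doc_probs = torch.sigmoid(logits).mean(dim=0)  # average the per-window predictions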

  • @holthuizenoemoet591 · a year ago

    Great video, but I have two questions. How long did it take to train this model? And secondly, I get a warning: "The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)". Any clue on how to fix this? Because setting TOKENIZERS_PARALLELISM to false will probably hurt performance?
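
    A note on the second question (not an answer from the video): the warning comes from the Hugging Face tokenizers library when a tokenizer runs before the DataLoader forks its worker processes. Silencing it only disables the tokenizer's internal Rust-level parallelism; the DataLoader workers still tokenize in parallel processes, so the impact is usually negligible:

        import os

        # Set before the tokenizer is first used (e.g. at the top of the notebook)
        os.environ["TOKENIZERS_PARALLELISM"] = "false"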

  • @user-tx3mo1ez2n · a year ago

    Why does no one use the concept of class_weight when the dataset is found to be unbalanced?

  • @rupert_ai · a year ago

    Definitely worth trying! It can be a useful technique but I didn't include it for simplicity

  • @olgierd245 · 2 years ago

    Okay, it may be a stupid question but I'm new to these libraries. How do you predict one comment with the trained model? Did someone try to do that?

  • @rupert_ai · 2 years ago

    Hi Olgierd, yes, you can use a model to predict one comment or lots of comments at the same time if you like! Remember the model is expecting a 'batch' of data, so one comment would be a batch size of 1. You'll need to add this extra dimension (indicating how many items are in the batch) before passing it to your model! Hope this helps

  • @emoyano58 · a year ago

    @rupert_ai I'm trying to predict one comment with the trained model but I don't know how. I tried using the "classify_raw_comments" function with a CSV of the comments I want to predict (I'm recording the comments I enter and their predictions), but I get back values between 0 and 1 for the labels, and I would like to get 1's or 0's. BTW, thank you for the very useful video, greetings from Chile

  • @rupert_ai · a year ago

    @emoyano58 The values are between 0 and 1 because that is the model's prediction. You can treat this as a pseudo-confidence value, e.g. a value of 1 means it is very confident the comment contains the label, a value of 0.5 means it is not confident, and a value of 0 means it is confident the comment doesn't contain the label. Simply apply > 0.5 to find comments it thinks contain the label, and vice versa for the comments that don't. Let me know if you have any other questions!

  • @elahehsalehirizi618 · a year ago

    Hi, how could we save the model after training and use it in another file (like a test file)? Thanks

  • @rupert_ai · a year ago

    Hi Elaheh! Yes, of course. The model is a standard PyTorch model - have a look here at how to save/load PyTorch models: pytorch.org/tutorials/beginner/saving_loading_models.html. You can then run this model on any batch of test input you like :) good luck!

  • @elahehsalehirizi618 · a year ago

    @rupert_ai Great, thanks a lot for your answer. And thanks again for the tutorial

  • @leckiPn · a year ago

    Hi, I'm trying to run the code but it fails in both Jupyter Notebook and Google Colab with the error "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn". Any idea why this is happening?

  • @nghiaducduong · a year ago

    I have the same problem, have you found a solution?

  • @Sukant98 · a year ago

    I have exactly the same problem. I used MuRIL BERT instead of RoBERTa but everything else is the same.

  • @nghiaducduong · a year ago

    @Sukant98 I found a solution: the trainer somehow locks requires_grad during the forward pass when training. Just implement your own training loop; don't use the PL trainer.

  • @Sukant98 · a year ago

    @nghiaducduong Thanks!!! I also found a solution: I imported the AdamW optimizer from torch.optim instead of the transformers library. Now at least the model is being trained.
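
    A minimal sketch of that fix (transformers.AdamW is deprecated; the hyperparameter values here are illustrative):

        from torch.optim import AdamW  # instead of: from transformers import AdamW

        # Inside the LightningModule:
        def configure_optimizers(self):
            # Same optimizer family, now sourced from PyTorch rather than transformers
            return AdamW(self.parameters(), lr=2e-5, weight_decay=0.01)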

  • @rupert_ai · a year ago

    Glad you found solutions! Good job everyone

  • @XX-vu5jo · 2 years ago

    Do it in TensorFlow!!

  • @rupert_ai · 2 years ago

    Although I have used both TF and PT, I definitely prefer PyTorch! So my suggestion would be to learn PT ;)

  • @RuloGames1 · 2 years ago

    ⚠⚠⚠🛐🛐🛐

  • @shahed8762 · 2 years ago

    Hi, I have requested access to the notebook, please allow it

  • @rupert_ai · 2 years ago

    Hi, anyone with the link can copy the notebook - make sure to press "Copy to Drive" and save a version of the notebook for yourself.

  • @shahed8762 · 2 years ago

    @rupert_ai OK, thanks

  • @not_amanullah · 11 months ago

    Zoom in, I can't see it clearly

  • @jorgeromero4680 · a year ago

    Your sound is so low; even at 100% I can barely hear you

  • @rupert_ai · a year ago

    Hey Jorge, you are totally right! In my recent videos I have a more professional grade microphone :)

  • @gk4539 · a year ago

    🙇‍♀️
