SMOTE (Synthetic Minority Oversampling Technique) for Handling Imbalanced Datasets

Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, since we have enough examples for all the different cases. For example, when training a spam filter, we should have a good amount of data for both spam and non-spam emails.
SMOTE synthesises new minority instances between existing (real) minority instances.
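A minimal sketch of that interpolation idea, in plain Python. This is not the code from the video or from any library; the function name `smote_sketch`, the toy points, and the seed are made up for illustration, and a real implementation handles many more cases.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=12):
    """Create n_new synthetic points between real minority points (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a real minority sample at random.
        base = rng.choice(minority)
        # Find its k nearest minority neighbours, excluding the sample itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=3)
```

Because every synthetic point lies on a line segment between two real minority points, it never falls outside the region the real minority samples already span.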
You can find me on:
GitHub - github.com/bhattbhavesh91
Medium - / bhattbhavesh91

Comments: 151

  • @bhattbhavesh91
    @bhattbhavesh91 5 years ago

    Something went wrong while using pd.crosstab! The updated confusion matrices are as follows. At 7:50, the correct confusion matrix is [[92303, 14], [1535, 135]]. At 10:30, the correct confusion matrix is [[93798, 41], [40, 108]]. Sorry for the mistake :)
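As a rough check on these corrected numbers, the sketch below computes precision, recall and F1 from them. The cell layout is an assumption: the four values are read as TN, FN, FP, TP, because that is the only reading under which both matrices have the same number of actual positives (149) and actual negatives (93838). The helper name `metrics_from_confusion` is made up for illustration.

```python
def metrics_from_confusion(tn, fn, fp, tp):
    """Precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Corrected matrix at 7:50, assumed order: TN, FN, FP, TP.
p1, r1, f1_1 = metrics_from_confusion(tn=92303, fn=14, fp=1535, tp=135)
print(f"7:50  precision={p1:.3f} recall={r1:.3f} f1={f1_1:.3f}")
# 7:50  precision=0.081 recall=0.906 f1=0.148

# Corrected matrix at 10:30, same assumed order.
p2, r2, f1_2 = metrics_from_confusion(tn=93798, fn=41, fp=40, tp=108)
print(f"10:30 precision={p2:.3f} recall={r2:.3f} f1={f1_2:.3f}")
# 10:30 precision=0.730 recall=0.725 f1=0.727
```

If the off-diagonal cells are actually the other way round, precision and recall swap but F1 stays the same.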

  • @sahubiswajit1996

    @sahubiswajit1996

    5 years ago

    Why are we using "random_state=12"?

  • @chrislam1341

    @chrislam1341

    4 years ago

    @@sahubiswajit1996 it is just his preference, for being able to get the same result from the randomness.
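A small sketch of what this reply describes, using Python's standard `random` module as a stand-in for the random number generators inside scikit-learn style estimators. The helper `sample_indices` is hypothetical and only illustrates the idea: the same seed gives the same "random" choices on every run.

```python
import random


def sample_indices(n_rows, n_pick, random_state=None):
    """Pick n_pick row indices; a fixed random_state makes the pick repeatable."""
    rng = random.Random(random_state)
    return rng.sample(range(n_rows), n_pick)


first = sample_indices(100, 5, random_state=12)
second = sample_indices(100, 5, random_state=12)
print(first == second)  # True: same seed, same indices
```

The value 12 itself has no special meaning; any fixed integer works, as long as it stays the same between runs you want to compare.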

  • @sumitshukla3689

    @sumitshukla3689

    4 years ago

    When we apply SMOTE, the number of samples doesn't change. But as you explained, if we are adding some synthetic samples, the number of training examples should also increase, right?

  • @KumarHemjeet

    @KumarHemjeet

    3 years ago

    @@sahubiswajit1996 you can take any number


  • @prathameshmohite3008
    @prathameshmohite3008 4 years ago

    Hi Bhavesh, Very good explanation. I was particularly confused about implementing SMOTE on the main data. But I guess you're correct that we must implement SMOTE on training data. Thank You

  • @SurajSingh-pw9ew
    @SurajSingh-pw9ew 4 years ago

    Thank you Bhavesh ❣️❣️. You taught this without making it boring 👏🏻👏🏻👏🏻 Excellent

  • @ddxccc
    @ddxccc 3 years ago

    Most helpful and professional video I found on SMOTE. Thanks a lot!

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    I'm glad you like it

  • @carl2143
    @carl2143 10 months ago

    I'll come back to this video. Seems helpful!

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago

    Your handwriting is pretty. Thanks for the explanation once again. Cheers!

  • @srikrshnap6036
    @srikrshnap6036 a year ago

    Lovely Explanation! Thank you!

  • @bishalmohari8748
    @bishalmohari8748 3 years ago

    I started watching the undersampling video for a problem and ended up watching the full series because of how well explained they are. Glad I discovered your channel! Wish I had sooner xD

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Glad it was helpful!

  • @siddharthkenia9089
    @siddharthkenia9089 2 years ago

    Not only did you explain it really well, but the illustrations were also perfect for a beginner to understand what oversampling means. Thank you :)

  • @bhattbhavesh91

    @bhattbhavesh91

    2 years ago

    Glad it was helpful!

  • @sparshdutta
    @sparshdutta 5 years ago

    Thanks for teaching new stuff.☺

  • @AizirekTolonova-od1ks
    @AizirekTolonova-od1ks 2 months ago

    Thank you so much for the great explanation!

  • @bhattbhavesh91

    @bhattbhavesh91

    2 months ago

    Glad it was helpful!

  • @harshparikh7060
    @harshparikh7060 3 years ago

    Thanks, Bhavesh!

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Glad you enjoyed it

  • @kokl123ify
    @kokl123ify 2 years ago

    Hi Bhavesh, could you please confirm: to ensure the oversampling method doesn't reduce the accuracy of the model, should we always use hyperparameter tuning, or is there some other method to undo the damage of oversampling in logistic regression for attrition prediction?

  • @karndeepsingh
    @karndeepsingh 4 years ago

    Very well explained sir!!!

  • @MY_PARIDE
    @MY_PARIDE 28 days ago

    Great Explanation....👏

  • @princeok12
    @princeok12 5 years ago

    Very well explained Thank you. Especially appreciated the explanation of nearest neighbor

  • @7810
    @7810 4 years ago

    Quite interesting! Thanks for the lesson.

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    Glad you liked it!

  • @shishirdixit5996
    @shishirdixit5996 4 years ago

    Here, while fitting the training dataset after tuning hyperparameters using GridSearchCV, why have you used X_train and y_train and not the X_train_res and y_train_res datasets?

  • @shandou5276
    @shandou5276 3 years ago

    This is very well done :) Nothing overly flashy and yet very clear.

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Glad you enjoyed it

  • @bhargav7476
    @bhargav7476 4 years ago

    You have no idea how helpful that was

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    Thank you so much :)

  • @nesrinehadjamar2197
    @nesrinehadjamar2197 a year ago

    Thank you ! Simple and clear explanation

  • @bhattbhavesh91

    @bhattbhavesh91

    a year ago

    Glad it was helpful!

  • @bintehawa7712
    @bintehawa7712 9 months ago

    Thanks for explaining with notes; it helped me a lot.

  • @KaushikJasced
    @KaushikJasced 2 years ago

    Thank you sir for giving a wonderful lecture. Can you tell me how I can put the sampling ratio as per my choice instead of 1:1 using SMOTE?

  • @bhuvneshsaini93
    @bhuvneshsaini93 5 years ago

    Hi, you used only two targets, 0 and 1. How do we do this with more than two? Suppose target 1 has around 2000 samples, target 2 around 200, target 3 around 11, and so on.

  • @TejaDuggirala
    @TejaDuggirala 5 years ago

    Good work bro.. thank you

  • @shishirdixit5996
    @shishirdixit5996 4 years ago

    I have a categorical dependent variable with 3400 records, in which the counts of 0s and 1s are 2677 and 723 respectively. Will this be considered an imbalanced dataset? Or would it be considered imbalanced only if the 1s were less than 5% of the total records? Kindly clarify the doubt.

  • @EcommerceAdvices
    @EcommerceAdvices 3 years ago

    Thanks a lot. You make it so simple :) Liked and subscribed, bro.

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Thanks and welcome

  • @Nirja3
    @Nirja3 3 years ago

    When I tried to set the SMOTE ratio, I got an invalid ratio parameter error for SMOTE. Can you help?

  • @sadiaafrin7143
    @sadiaafrin7143 3 years ago

    Good work man! Thanks

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Glad it helped!

  • @danielniels22
    @danielniels22 2 years ago

    6:20 Which library did you import before declaring the SMOTE() class?

  • @jampavy6446
    @jampavy6446 2 years ago

    Nice explanation

  • @ganeshreddypuli3101
    @ganeshreddypuli3101 2 years ago

    If we want to normalize the data as well, should we do it before applying SMOTE?

  • @adityaraikwar6069
    @adityaraikwar6069 a year ago

    very informative video, simple and to the point keep it up

  • @bhattbhavesh91

    @bhattbhavesh91

    a year ago

    Glad you liked it!

  • @abhishekwagh8246
    @abhishekwagh8246 4 years ago

    I have a sample of only 28. Unfortunately I don't have more sample. Will SMOTE work? Secondly, which logistic regression should be used? Sklearn or statsmodels? Both give different results. Please help.

  • @charmilam920
    @charmilam920 3 years ago

    Thank you for this video. Understood SMOTE very well. Please make videos more often and How do you explain things so effortlessly with such clarity ? Where is this clarity coming from ? Great job

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Thank you! Will do!

  • @WordofSpirit
    @WordofSpirit 2 years ago

    Looks like the weights are also not working with SMOTE. Is there any alternative way to test different weights?

  • @bhagyashreeln1304
    @bhagyashreeln1304 2 years ago

    Hi, what do we do if we have a balanced dataset but still want to increase the number of rows

  • @0SIGMA
    @0SIGMA 3 years ago

    You are some DOPE shit brother and by that i mean youre really good ! explained the important stuffs like only on train set beautifully ! really great !

  • @dipankarrahuldey6249
    @dipankarrahuldey6249 3 years ago

    With SMOTE, can we achieve higher f1 in practice? I saw that f1 was around 0.72

  • @alanblitzer744
    @alanblitzer744 4 years ago

    You are great bro

  • @thomasayele5389
    @thomasayele5389 a year ago

    Excellent explanation!

  • @bhattbhavesh91

    @bhattbhavesh91

    a year ago

    I'm glad you liked it

  • @sridhar6358
    @sridhar6358 3 years ago

    So the idea of treating the ratio parameter in SMOTE as a hyperparameter is to ensure we get better results, is that correct? In general, is it a good option to make the SMOTE ratio a hyperparameter rather than fixing it to 1?

  • @rishisolanki554
    @rishisolanki554 a month ago

    Really helpful

  • @hosseinroosta5154
    @hosseinroosta5154 a year ago

    Really, thanks ♥️

  • @bhattbhavesh91

    @bhattbhavesh91

    a year ago

    You're welcome 😊

  • @MrFcapri
    @MrFcapri 2 years ago

    Kindly tell me: I have an imbalanced dataset with 5 classes. Will SMOTE work for a multi-class dataset?

  • @VINODKUMARIYA
    @VINODKUMARIYA a year ago

    Thank you sir !

  • @bhattbhavesh91

    @bhattbhavesh91

    a year ago

    Most welcome!

  • @AnupKumar-nz2qq
    @AnupKumar-nz2qq 4 years ago

    After generating the synthetic data, in which kinds of situations can this data be useful? Are there any limitations of this type of data?

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago

    Shouldn't it be generate_auc_roc_curve(pipe, X_test)? If not, Bhaveshbhai, could you or anyone explain please?

  • @makhboulame9654
    @makhboulame9654 3 years ago

    Can SMOTE be used for Multi label classification dataset ? Thank you

  • @priyas8871
    @priyas8871 2 years ago

    Can u please tell how this SMOTE can be applied for streaming data- In Test then Train Framework??

  • @Asma-cx8uc
    @Asma-cx8uc 2 years ago

    Hello Sir ! Could you please describe how SMOTE technique can be used to balance data images

  • @spadbob24
    @spadbob24 3 years ago

    thank you so much - very informative video

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Glad it was helpful!

  • @ankushjamthikar9780
    @ankushjamthikar9780 3 years ago

    Very good explanation. But can we use this method for a multiclass problem? Also, does SMOTE lead to overfitting issues?

  • @sirvachjumani7215
    @sirvachjumani7215 3 years ago

    Hi Bhavesh, very nicely explained can you please tell me the literature of the following examples. thanks

  • @hieunguyenvan6590
    @hieunguyenvan6590 a year ago

    Do you need to remove outliers of dataset if you SMOTE?

  • @elaf8256
    @elaf8256 3 years ago

    How can we overcome the problem of overlapping when using SMOTE?

  • @syedshaulhameed
    @syedshaulhameed 3 years ago

    How do I split my data into training and testing if my data is imbalanced?

  • @JT2751257
    @JT2751257 4 years ago

    Cello Pointec - reminded me of my childhood :)

  • @debatradas1597
    @debatradas1597 2 years ago

    Thank you so much Sir

  • @bhattbhavesh91

    @bhattbhavesh91

    2 years ago

    Most welcome

  • @travelsome
    @travelsome 4 years ago

    Perfection

  • @mirroring_2035
    @mirroring_2035 2 years ago

    in your crosstab function you have y_test[target]. What is that? why is target used to index the y_test object?

  • @channel-lk6xz
    @channel-lk6xz 7 months ago

    I don't understand how we infer from auc roc. What are we seeing there and what are the values plotted here.

  • @achyuthvishwamithra
    @achyuthvishwamithra 2 years ago

    When the final ratio came out to be 0.005, doesn't it imply that we are going to be generating a very small number (0.005 * majority) of samples for the minority class? How will the number of minority class samples ever be equal to that of the majority class?
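A rough sketch of the arithmetic behind this question. It assumes the meaning documented for a float `sampling_strategy` in current imbalanced-learn (the parameter that replaced `ratio`): the desired minority-to-majority ratio after resampling. The class counts and the helper name `synthetic_needed` are hypothetical, and this is not the library's actual code.

```python
def synthetic_needed(n_minority, n_majority, ratio):
    """Synthetic minority samples needed so that minority / majority == ratio."""
    target = round(ratio * n_majority)
    if target < n_minority:
        # Oversampling can only add samples, so a ratio this small is unreachable.
        raise ValueError("ratio would require removing minority samples")
    return target - n_minority


# Hypothetical class counts, for illustration only.
print(synthetic_needed(n_minority=300, n_majority=190000, ratio=0.005))  # 650
print(synthetic_needed(n_minority=300, n_majority=190000, ratio=1.0))    # 189700
```

Under this reading, a small ratio such as 0.005 means only a small top-up of the minority class, and the two classes end up equal in size only when the ratio is 1.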

  • @harshavardhansvlkkb2290
    @harshavardhansvlkkb2290 2 years ago

    Can we use smote to target column in data set

  • @jgubash100
    @jgubash100 3 years ago

    Well explained

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Thank you!

  • @MarsLanding91
    @MarsLanding91 3 years ago

    Thank you for this video! 2 thumbs up! Question - at 4:06 you selected KNN = 3 but I didn't see you applying that concept in the code section. Can you please elaborate on where you set KNN as 3 in the code section? Did I misunderstand something?

  • @IykeDx

    @IykeDx

    4 months ago

    When KNN is not stated, the default is 5.

  • @clintpaul6653
    @clintpaul6653 2 years ago

    Can I apply sampling to the test set too, because it's also very imbalanced? Please reply.

  • @mramesh7085
    @mramesh7085 2 years ago

    Nice explanation

  • @randomforrest9251
    @randomforrest9251 3 years ago

    how does smote work with categorical data?

  • @powellmenezes584
    @powellmenezes584 5 years ago

    I have this doubt too: you used only two targets, 0 and 1. How do we do this with more than two? Suppose target 1 has around 2000 samples, target 2 around 200, target 3 around 11, and so on.

  • @TheRaviraaja

    @TheRaviraaja

    3 years ago

    arxiv.org/pdf/1106.1813.pdf - check out the algorithm; the neighbours do matter.

  • @shwetasharma1996
    @shwetasharma1996 4 years ago

    Nice content! I would like to compare some techniques of oversampling.. Can you pl help me out to get the hard code of SMOTE not the packaged one..thanks

  • @advaitshirvaikar4751
    @advaitshirvaikar4751 3 years ago

    Hey, when I try using make_pipeline(SMOTE(), SVC()) it gives me an error: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1, out_step='deprecated', random_state=None, ratio=None, sampling_strategy='auto', svm_estimator='deprecated')' (type ) doesn't. What's going wrong here?

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    The SMOTE function has changed after I created this video! Please refer to the documentation!

  • @DanielWeikert
    @DanielWeikert 4 years ago

    if we use smote in the pipeline, is it only upsampling on training or also on testing when we call predict? Thanks

  • @Eny11111
    @Eny11111 3 years ago

    Thanks 👍

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Welcome 👍

  • @bhagwatchate7511
    @bhagwatchate7511 4 years ago

    Nice

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago

    Lovelyyyyyyy

  • @ashishraj5882
    @ashishraj5882 3 years ago

    again ROC auc curve is used ??

  • @atwinemugume
    @atwinemugume 5 years ago

    Thanks

  • @saptarshibhattacharya1253
    @saptarshibhattacharya1253 a year ago

    can u elaborate with a random forest algorithm in google colab?

  • @helll5894
    @helll5894 3 years ago

    What if there are more than 2 classes? In your video Sir, there are only 2 classes.. For example, I want to make 3 classes.. How can I implemented 3 classes on python use SMOTE?? Thank you, Sir

  • @sourishmukherjee2404
    @sourishmukherjee2404 3 years ago

    The final ratio for the final model after GridSearchCV for SMOTE was 0.0005. Does that imply that the ratio (minority class / majority class) = 0.005? Then how is the minority class getting oversampled to equal proportion as the majority class?

  • @akhilyeduresi8145
    @akhilyeduresi8145 2 years ago

    Getting errors: __init__() got an unexpected keyword argument 'ratio' and AttributeError: 'SMOTE' object has no attribute 'fit_sample'

  • @burhanrashidhussein6037
    @burhanrashidhussein6037 5 years ago

    Does smote guarantee to improve classifier performance ?

  • @bhattbhavesh91

    @bhattbhavesh91

    5 years ago

    Nope! It doesn't, it only upsamples your data by generating artificial samples! How good the model performs depends on how well your classes are apart!

  • @hamzaraouia8975
    @hamzaraouia8975 4 years ago

    I have got this error when trying to run the smote: __init__() got an unexpected keyword argument 'ratio' any clues ? Thank you

  • @GurunathHari

    @GurunathHari

    4 years ago

    You must have figured it out by now. I am only a student. The parameter has been deprecated, as the video is 1 year old. Try using this: sm = SMOTE(random_state=42, sampling_strategy='minority')

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Thanks Gurunath for sharing this!

  • @deepikadusane9051
    @deepikadusane9051 4 years ago

    Hi Bhavesh, I used your SMOTE code but I am getting an error about ratio, i.e. invalid parameter ratio for estimator SMOTE. How do I resolve this?

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    I guess the function has changed! Do have a look at the documentation to learn more about it!

  • @OriginalBernieBro
    @OriginalBernieBro 4 years ago

    The SMOTE ratio parameter is deprecated. For my imbalanced dataset, the sklearn classification_report is still imbalanced in the support column even after applying SMOTE.

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    The SMOTE function has changed after I created this video! Please refer to the official documentation!

  • @wenhongzhu8637
    @wenhongzhu8637 4 years ago

    Hi~can you share the data set

  • @anshumanagrahri7816
    @anshumanagrahri7816 4 years ago

    Hiii, can you please tell how to use SMOTE on time series and sequential data

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    you are a google search away for an answer!

  • @soumyadeeparinda1692
    @soumyadeeparinda1692 3 years ago

    Can you please share the notebook with us using google colab?

  • @kavanalipanahi3505
    @kavanalipanahi3505 3 years ago

    True positive is 0 in the confusion matrix (by the formula, precision and recall should be equal to zero). So how did you get such a high number (over 70%)?

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    Please read the pinned comment!

  • @kavanalipanahi3505

    @kavanalipanahi3505

    3 years ago

    @@bhattbhavesh91 I like your videos. :)))

  • @dastola8330
    @dastola8330 4 years ago

    what is the use of defining random_state ?

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    kzread.info/dash/bejne/lWZom7Ftl8zInLA.html

  • @akhilthekkedath1850
    @akhilthekkedath1850 5 years ago

    Sir, could you please make a video on outlier detection?

  • @bhattbhavesh91

    @bhattbhavesh91

    5 years ago

    I have already created a video on outlier detection. Link - kzread.info/dash/bejne/ZIWm0dWtZJqanLQ.html

  • @deeptigupta518
    @deeptigupta518 4 years ago

    Can SMOTE only be used with logistic regression, or with any classification model?

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    any classification algorithm!

  • @AnkitGupta-ec4pi
    @AnkitGupta-ec4pi 4 years ago

    very well explained sir thank you

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    You are welcome

  • @bintehawa7712
    @bintehawa7712 9 months ago

    Please start a playlist for beginners to learn AI ,ML please

  • @bhattbhavesh91

    @bhattbhavesh91

    9 months ago

    Sure!

  • @niyazahmad9133
    @niyazahmad9133 3 years ago

    Smote__ratio is not a parameter of SMOTE. Please help me out...

  • @bhattbhavesh91

    @bhattbhavesh91

    3 years ago

    The SMOTE function has changed after I created this video! Please refer to the official documentation!

  • @sanyajain2127
    @sanyajain2127 4 years ago

    Getting an error: ValueError: Unknown label type: 'continuous-multioutput'

  • @bhattbhavesh91

    @bhattbhavesh91

    4 years ago

    you are a google search away for an answer!

  • @harishshanmugamdhanasekar311

    @harishshanmugamdhanasekar311

    3 years ago

    @@bhattbhavesh91 lol that's right 😂

  • @The_Option_Seller_Room
    @The_Option_Seller_Room 4 years ago

    How to handle extremely imbalanced data for a regression problem?

  • @guico3lho
    @guico3lho a year ago

    At the end of the video, how did all 4 metrics score above 70% if the model did not correctly predict any of the samples labelled 1? There were 0 true positives and 63 false negatives!
