Tutorial 2 - Feature Selection - How To Drop Features Using Pearson Correlation

In this video I am starting a new playlist on Feature Selection, and in this video we will discuss how to drop features using Pearson correlation.
GitHub: github.com/krishnaik06/Comple...
Feature Selection playlist: • Feature Selection
All Playlists On My Channel
Complete ML Playlist : • Complete Machine Learn...
Complete DL Playlist : • Complete Deep Learning
Complete NLP Playlist: • Natural Language Proce...
Docker End To End Implementation: • Docker End to End Impl...
Live stream Playlist: • Pytorch
Machine Learning Pipelines: • Docker End to End Impl...
Pytorch Playlist: • Pytorch
Feature Engineering : • Feature Engineering
Live Projects : • Live Projects
Kaggle competition : • Kaggle Competitions
Mongodb with Python : • MongoDb with Python
MySQL With Python : • MYSQL Database With Py...
Deployment Architectures: • Deployment Architectur...
Amazon sagemaker : • Amazon SageMaker
Please donate if you want to support the channel through GPay UPI ID,
GPay: krishnaik06@okicici
Discord Server Link: / discord
Telegram link: t.me/joinchat/N77M7xRvYUd403D...
Please join as a member of my channel to get additional benefits like Data Science materials, live streaming for members, and more
/ @krishnaik06
Please do subscribe to my other channel too
/ @krishnaikhindi
Connect with me here:
Twitter: / krishnaik06
Facebook: / krishnaik06
Instagram: / krishnaik06
#featureselection
#correlation
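
For reference, the whole technique fits in a few lines. A minimal sketch of the idea covered in the video (the toy data here is hypothetical; the actual notebook is in the GitHub repo linked above):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def correlation(dataset, threshold):
        # Collect column names whose absolute Pearson correlation with an
        # earlier column exceeds the threshold (lower triangle of the matrix).
        col_corr = set()
        corr_matrix = dataset.corr()
        for i in range(len(corr_matrix.columns)):
            for j in range(i):
                if abs(corr_matrix.iloc[i, j]) > threshold:
                    col_corr.add(corr_matrix.columns[i])
        return col_corr

    df = pd.DataFrame({
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
        "c": [5, 3, 8, 1, 7],
        "y": [0, 0, 1, 0, 1],
    })
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("y", axis=1), df["y"], test_size=0.3, random_state=0)
    corr_features = correlation(X_train, threshold=0.9)  # fit on train only
    X_train = X_train.drop(corr_features, axis=1)
    X_test = X_test.drop(corr_features, axis=1)          # same drop on test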

Comments: 211

  • @prakash564 · 3 years ago

    Sir, your channel is a perfect combination of sentdex and StatQuest. You are doing great work 🙌 more power to you!!

  • @nurnasuhamohddaud728 · 2 years ago

    Very comprehensive explanation for someone from a non-AI background. Thanks Sir, keep up the good work!

  • @ashishkulkarni8140 · 3 years ago

    Sir, could you please upload more videos on feature selection to this playlist? It is very amazing. I followed all the videos from feature engineering playlist. You are doing a great work. Thank you.🙏🏻

  • @sukanyabag6134 · 3 years ago

    Sir, the videos you uploaded on feature selection helped a lot! Please upload the rest of the tutorials and methods too! Eagerly waiting for them!

  • @rhevathivijay2913 · 3 years ago

    Being in the teaching profession, I can assure you this is the best explanation of Pearson correlation. Please give it more likes.

  • @shubhambhardwaj3643 · 3 years ago

    No words are sufficient to thank you for your work, sir ....🙏🙏

  • @waytolegacy · 2 years ago

    I think instead of dropping either of 2 highly correlated features, we should check how each of them correlates with the target as well, and then drop the one less correlated with the target variable. That might increase accuracy a bit, instead of dropping whichever comes first. Again, I think so.
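
    A minimal sketch of this tie-break (hypothetical helper name, assuming pandas): for each highly correlated pair, flag whichever feature has the weaker absolute correlation with the target.

        import pandas as pd

        def drop_weaker_of_pairs(X, y, threshold=0.9):
            # For every pair above the threshold, drop the feature whose
            # absolute correlation with the target y is weaker.
            corr = X.corr().abs()
            target_corr = X.corrwith(y).abs()
            to_drop = set()
            cols = corr.columns
            for i in range(len(cols)):
                for j in range(i):
                    if corr.iloc[i, j] > threshold:
                        a, b = cols[i], cols[j]
                        to_drop.add(a if target_corr[a] < target_corr[b] else b)
            return X.drop(columns=to_drop)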

  • @djlivestreem4039 · 2 years ago

    good point

  • @beautyisinmind2163 · 2 years ago

    You can check the importance value of each using RF, and the one with the lower importance value can be dropped.

  • @niveditawagh8171 · 2 years ago

    Good point

  • @niveditawagh8171 · 2 years ago

    Can you please tell me how to drop the less correlated variable with the target variable?

  • @beautyisinmind2163 · 2 years ago

    @@niveditawagh8171 you only drop when two feature variables are highly correlated with each other; you don't drop a feature just because it is less correlated with the target variable, because a feature weakly correlated with the target could still be a good predictor in combination with other features.

  • @ActionBackers · 3 years ago

    This was incredibly helpful; thank you for the great content!

  • @rukmanisaptharishi6638 · 3 years ago

    If you are transporting ice cream in a vehicle, the number of ice cream sticks that reach the destination is inversely proportional to temperature: the higher the temperature, the fewer the sticks. If you want to effectively model the temperature of the vehicle's cooler and make it optimal, you need to consider these negatively correlated features: outside air temperature and the number of ice cream sticks at the destination.

  • @suhailabessa9901 · 1 year ago

    thank you sOOo much, perfect explanation :) good luck with your channel, it is recommended

  • @andyn6053 · 1 year ago

    In which order should you do the feature selection steps?
    0. Clean the dataset: get rid of NaN and junk values, check datatypes in the test set, etc.
    1. Use the z-score method to eliminate outliers.
    2. Normalize the X_train data.
    3. Check correlation between the X_train variables and y_train; drop variables that have a low correlation with the target variable.
    4. Use Pearson correlation to drop highly correlated variables from X_train.
    5. Use the variance threshold method to drop X_train variables with low variance. All variables removed from X_train should be removed from X_test as well.
    6. Fit X_train and y_train to a classification model.
    7. Predict y(X_test).
    8. Compare the predicted y(X_test) output with y_test to calculate accuracy.
    9. Try different classification models and see which one performs best (has the highest accuracy).
    Is this the right order? Have I missed something?
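
    A rough sketch of that ordering (assumed synthetic data; step details vary by problem). The key point is to split first and fit every scaling/selection step on the training fold only, which a scikit-learn Pipeline enforces automatically:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.feature_selection import VarianceThreshold
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        X, y = make_classification(n_samples=200, n_features=10, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=0)

        pipe = Pipeline([
            ("scale", StandardScaler()),           # fit on train only
            ("variance", VarianceThreshold(0.0)),  # drop constant features
            ("model", LogisticRegression(max_iter=1000)),
        ])
        pipe.fit(X_train, y_train)
        print(accuracy_score(y_test, pipe.predict(X_test)))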

  • @neelammishra5622 · 2 years ago

    Your knowledge is really invaluable. Thanks

  • @RandevMars4 · 3 years ago

    Well explained. Really great work sir. Thank you very much

  • @yashkhant5874 · 3 years ago

    GREAT CONTRIBUTION SIR.... THIS CHANNEL SHOULD HAVE 20M SUBSCRIBERS🤘🤘

  • @tigjuli · 3 years ago

    Nice! please upload more on this topic!! thank you!

  • @JithendraKumarumadisingu · 2 years ago

    Great tutorial, it helps a lot, thanks @Krish Sir

  • @ireneashamoses4209 · 3 years ago

    Great video!! Thank you!👍👍💖

  • @gabrielegbenya7479 · 2 years ago

    great video. very informative and educative. Thank you

  • @suhailsnmsnm5397 · 4 months ago

    amazing teaching skills you have bhaai ... THNX

  • @nahidzeinali1991 · 4 months ago

    Thanks so much! very useful. you are so good

  • @elvykamunyokomanunebo1441 · 1 year ago

    Thanks Krish, you've earned a rocket point from me :) It would have been nice if the function also printed which feature each one was strongly correlated with, because from the code you dropped all the features that met the threshold; not one of them was kept.
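
    A sketch of that variant (assumed pandas DataFrame): return the pairs instead of a flat set, so you can see what each flagged feature was correlated with before deciding which one to keep.

        def correlated_pairs(dataset, threshold):
            # Like the original function, but records (feature, partner, r).
            pairs = []
            corr_matrix = dataset.corr()
            for i in range(len(corr_matrix.columns)):
                for j in range(i):
                    if abs(corr_matrix.iloc[i, j]) > threshold:
                        pairs.append((corr_matrix.columns[i],
                                      corr_matrix.columns[j],
                                      round(corr_matrix.iloc[i, j], 3)))
            return pairs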

  • @dinushachathuranga7657 · 5 months ago

    Thanks a lot for very clear explanation.❤

  • @kalvinwei19 · 3 years ago

    Thank you man, good for my assignment

  • @chineduezeofor2481 · 3 years ago

    Another great video!!!

  • @alphoncemutabuzi6949 · 3 years ago

    I think the abs is important, since it's like having two rows, one being the opposite of the other

  • @MrKaviraj75 · 3 years ago

    Yes, I think so too. If changes to one feature affect another feature, they are dependent; in other words, they are correlated.

  • @pankajkumarbarman765 · 1 year ago

    Very helpful. Thank you sir.

  • @antoniodefalco6179 · 2 years ago

    thank you, so useful, good teacher

  • @gurdeepsinghbhatia2875 · 3 years ago

    I think it all depends on the domain whether to involve the negative correlation or not; or we can train two different models and compare their scores. Thanks Sir

  • @KnowledgeAmplifier1 · 3 years ago

    I want to point out a veryyy important concept missing from this video's discussion: if 2 input features are highly correlated, it's not the case that I can drop either of the 2; I have to check which of the 2 has the weaker correlation with the output variable, and that one is the one to drop.

  • @siddharthdedhia11 · 3 years ago

    what do you mean by weaker? do you mean the most negative?

  • @KnowledgeAmplifier1 · 3 years ago

    @@siddharthdedhia11, here, weaker means a smaller correlation with the output feature.

  • @siddharthdedhia11 · 3 years ago

    @@KnowledgeAmplifier1 so for example, between -0.005 and -0.5, -0.005 is the one with the weaker correlation, right?

  • @KnowledgeAmplifier1 · 3 years ago

    @@siddharthdedhia11 yes, correct: a correlation value towards 0 is considered weak, and towards 1 or -1 means a strong relationship :-)

  • @amankothari5508 · 2 years ago

    @jayesh naidu

  • @hirakaimkhani3338 · 2 years ago

    wonderful tutorial sir!!

  • @TejusVignesh · 1 year ago

    You are a legend!!🤘🤘

  • @raghavkhandelwal1094 · 3 years ago

    waiting for more videos in the playlist

  • @shivarajnavalba5042 · 2 years ago

    Thank you, Krish.

  • @niveditawagh8171 · 2 years ago

    Nice explanation.

  • @mariatachi8398 · 1 month ago

    Amazing content!~

  • @omi_naik · 3 years ago

    Great explanation :)

  • @drshahidqamar · 2 years ago

    LOL, you are just amazing Boss

  • @amitmodi7882 · 3 years ago

    Wonderful explanation. Krish, as mentioned in the video, you said you would upload 5-6 videos on feature selection. Can you please share the links to the rest of them?

  • @salihsartepe2614 · 8 months ago

    Thanks Krish 😊

  • @parms1191 · 3 years ago

    I write the threshold code simply like [df.corr() > 0.7 or df.corr() < -0.7] ...

  • @codertypist · 3 years ago

    Let's say variables x, y and z are all strongly correlated to each other. You would only need to use one of them as a feature. By saying [df.corr() > 0.7 or df.corr() < -0.7] ...

  • @abinsharaf8305 · 2 years ago

    since we are giving only one positive value as the threshold, abs() lets the check cover both negative and positive correlations, so I feel it's better if it stays

  • @nkechiesomonu8764 · 1 year ago

    Thanks sir for the good job you have been doing. God bless you. Please sir, my question is: can we use correlation on image data? Thanks

  • @user-fb3dp4bz8v · 1 year ago

    Again, I wish you would explain how to handle the test set... but the explanation is excellent, I am really grateful.

  • @waatchit · 3 years ago

    Thank you for such a nice explanation. Does having 'abs' preserve the negative correlation?
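
    A quick check of what abs() does here (hypothetical values): one positive threshold then catches strong negative correlations too.

        # abs() makes a single positive threshold catch strong negative correlations.
        for r in (0.95, -0.95, 0.3):
            print(r, abs(r) > 0.9)   # True, True, False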

  • @thecitizen9747 · 1 year ago

    You are doing a great job, but can you please do a similar series on categorical features in a regression problem?

  • @marcastro8052 · 2 years ago

    Thanks, Sir.

  • @abebebelew2056 · 10 months ago

    Best!

  • @siddhantpathak6289 · 3 years ago

    Hi Krish, I read somewhere that if the dataset has perfectly positively or negatively correlated attributes, then in either case there is a high chance that the performance of the model will be impacted by multicollinearity.

  • @marijatosic217 · 3 years ago

    What do you think about feature reduction using PCA: looking at the correlation between each feature and the principal components, and then keeping those features that have the most correlations greater than 50% (or any other threshold)?

  • @deepanknautiyal5725 · 3 years ago

    Hi Krish, please make a video on complete logistic regression for interview preparation

  • @josephmart7528 · 2 years ago

    The abs takes care of both positive and negative numbers. Without it, the function will only take care of positively correlated features.

  • @pratikjadhav1242 · 3 years ago

    We check the correlation between inputs and the output, so why do you drop the output column and then check correlation? We use VIF (variance inflation factor) to check the relationship between inputs, and the preferred threshold value is 4.
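
    A sketch of the VIF computation mentioned here (statsmodels API; X is an assumed DataFrame of numeric features, with a constant column added first, as is usual for VIF):

        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        def vif_table(X):
            # VIF_i = 1 / (1 - R^2) from regressing feature i on the others;
            # values above ~4-5 are often read as problematic multicollinearity.
            Xc = sm.add_constant(X)
            return pd.DataFrame({
                "feature": X.columns,
                "VIF": [variance_inflation_factor(Xc.values, i + 1)  # skip the constant
                        for i in range(X.shape[1])],
            })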

  • @JenryLuis · 1 year ago

    Hi friend, I think the correlation function is removing more than expected, because as the for loops iterate they don't check whether, for a value > threshold, the column or index was already removed before. I corrected the function, and in this case the features removed are these: {'DIS', 'NOX', 'TAX'}. I also tested creating the correlation matrix again and verified that there are no values > threshold. Please can you check it.

        def correlation(dataset, threshold):
            col_corr = set()
            corr_matrix = dataset.corr()
            for i in range(len(corr_matrix.columns)):
                for j in range(i):
                    if abs(corr_matrix.iloc[i, j]) > threshold:
                        if (corr_matrix.columns[i] not in col_corr) and (corr_matrix.index.tolist()[j] not in col_corr):
                            colname = corr_matrix.columns[i]
                            col_corr.add(colname)
            return col_corr

  • @abhishekd1012 · 2 years ago

    In this video it's said that negatively correlated features are both important. Let's take an example: when we have both percentage and rank in a dataset, for 100% we have rank 1, and for 60%, let's say, rank 45 (last). Both convey the same information in the dataset. So what I think is that we can remove one of those 2 features; otherwise we will be giving double weightage to that particular attribute. Hope someone can correct this if I am wrong.

  • @suneel8480 · 3 years ago

    Sir, please make a video on how to select features for clustering.

  • @Eric-bq1jo · 2 years ago

    Is there any way to apply this approach to a classification problem where the target variable is 1 or 0?

  • @souvikghosh6509 · 3 years ago

    Sir, kindly make a video on embedded methods of feature selection..

  • @bishwasarkarbishwaranjansarkar · 3 years ago

    Hello Krishna, thanks for your video, but please also explain real-life use as well. Where can we use this in real life?

  • @Egor-sm4bl · 3 years ago

    Perfect defence on 3rd place!

  • @kjrimer · 2 years ago

    Hello, nice video. How do we do feature selection if we have more than one target variable, i.e. in case of a multi-output regression problem? Do we have to perform the Pearson correlation individually on each target variable, or is there another convenient way to solve the problem?

  • @phyuphyuthwe670 · 3 years ago

    Dear teacher, may I ask a question? In my case, I want to predict sales of 4 products using weather forecast information, season, and public holidays one week ahead. So, do I need to organize weekly-based data? When we use SPSS, we need to organize weekly data; how about machine learning? I feel confused about that. In my understanding, ML will train on the data with respect to the weather information, so we don't need to organize weekly data because we don't use time-series data. Is that correct? Please kindly give me a comment.

  • @Learn-Islam-in-Telugu · 2 years ago

    The function used in the example will not deliver high correlation with the dependent variable, because at the end you dropped the columns without checking their correlation with the dependent variable.

  • @jannatunferdous103 · 10 months ago

    Sir, regarding what you've shown at the end of this video, in that big data project, after deleting those 193 features, how can I deploy the model? Please share a video (or a link if you have one in your playlist) on the deployment phase after deleting features. Thanks. ❤

  • @yasharthsingh805 · 3 years ago

    Sir, can you please tell me which website I should refer to if I want to start reading white papers... Please do reply... I follow all your videos!!

  • @SuperNayaab · 2 years ago

    watching this video from Boston (BU student)

  • @moizk8223 · 2 years ago

    I have a doubt. Suppose A and B have correlation greater than the threshold and the loop includes column A from the pair. Further, B and C are highly correlated (although C is not highly correlated with A) and the loop includes B in the list. Now if we drop both A and B, wouldn't that affect the model, since both A and B will be dropped?

  • @killerdrama5521 · 2 years ago

    What if some features are numerical and some are categorical, against a categorical output? Which feature selection method will be helpful?

  • @nmuralikrishna4599 · 2 years ago

    General question: what if we drop a few of the important features from the data and train again? Will the accuracy drop? Or precision?

  • @conceptsamplified · 3 years ago

    Of the highly correlated columns, should we not keep one of them in our X_train dataset?

  • @ajaykushwaha-je6mw · 2 years ago

    Hi everyone, I need one help. This technique is for selecting numerical features only. Suppose we have done one-hot encoding on categorical data and converted it into numerical; can we apply this technique to those features as well (the entire dataset with numerical columns and categorical columns converted to numerical with some encoding technique)? Kindly help me understand.

  • @user-zo5ky6dk4r · 1 year ago

    Should features with negative correlation values such as -0.95 be deleted, or are they good for training our model and should stay in the data frame?

  • @venkatk1591 · 3 years ago

    Do we need to use the entire dataset for correlation testing? Are we not missing something by considering the train set only?

  • @oladosuoladimeji370 · 3 years ago

    How can correlated features be selected for a multi-label learning task, especially with images?

  • @levon9 · 3 years ago

    Two quick questions: (1) Why not remove redundant features, i.e. highly correlated variables, from X before splitting it into training and test? What would be wrong with this approach? (2) If one feature is correlated with a value of 1 and another with a value of -1 with regard to a given feature, are these also considered redundant?

  • @World_Exploror · 1 year ago

    Multicollinearity has been checked, but what about the correlation of the dependent vs independent variables?

  • @mrunalsrivastava2015 · 3 years ago

    can we compute correlation between two rows of a single matrix??

  • @World_Exploror · 1 year ago

    Can we drop features by comparing the correlation of the dependent variable with the independent variables, taking some threshold?

  • @MominSaadAltafnab · 1 year ago

    I didn't understand why we are only considering X_train for finding correlation. You said we do that to avoid overfitting, but I am still not getting how it would be overfitted if we used all the data. Can someone please tell me why we do that?

  • @amarkumar-ox7gj · 3 years ago

    If the idea is to remove highly correlated features, then both highly positive and highly negative correlations should be considered!!

  • @HumaidAhmadKidwai17 · 1 month ago

    How to check correlation between a numerical column (input) and a categorical output (in the form of 0s and 1s)?

  • @laveylukose5350 · 3 years ago

    How will u do feature selection for categorical input data?

  • @erneelgupta · 2 years ago

    What is the importance of random_state in train_test_split? How do the values of random_state (0, 42, 100, etc.) affect the estimation???

  • @laxmanbisht2638 · 2 years ago

    Hi, thanks for the lecture. What if we have a dataset in which both categorical and numeric features are present? Will Pearson's correlation be applicable?

  • @Jnalytics · 1 year ago

    Pearson's correlation only works with numeric features. However, if you want to explore the categorical features, you can use Pearson's chi-square test. You can use SelectKBest from scikit-learn with chi2. Hope it helps!
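
    A minimal example of that chi-square route (scikit-learn; chi2 requires non-negative feature values, e.g. counts or one-hot encoded categories):

        from sklearn.datasets import load_iris
        from sklearn.feature_selection import SelectKBest, chi2

        X, y = load_iris(return_X_y=True)  # all features non-negative
        X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
        print(X_new.shape)                 # (150, 2)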

  • @aritratalapatra8452 · 3 years ago

    If I have 3 correlated columns, I should drop 2 out of 3, right? Why do you drop all correlated features from the training and testing set?

  • @keshavsharma-pq4vc · 3 years ago

    Sir, when will you upload the next video of this playlist (Feature Selection)?

  • @asha4545 · 2 years ago

    Hello Sir, my dataset contains 17,000 features. When I execute corr() it takes more than 5 minutes, and generating the heatmap raises a memory-related error. Can you help solve the issue?

  • @perumalelancgoan9839 · 2 years ago

    Please clear up the below: if any independent variables are highly correlated, we shouldn't remove them, right, because they give a very positive outcome?

  • @nishadseeraj7034 · 2 years ago

    Can someone explain how the 2nd for loop works? I am not getting it. For instance, with "for j in range(i)", wouldn't that give an error on the first iteration of the outer loop, when i=0? Unless I am missing something.

  • @hideweapon1361 · 11 months ago

    nested loop

  • @arunkrishna1036 · 2 years ago

    Hi Krish.. how about using VIF to find the correlated features?

  • @sanketargade3685 · 1 year ago

    Why are we dropping highly correlated features after splitting train and test? Wouldn't it be easier to drop features from the original dataset and then simply split it?❓😕🤔

  • @aayushdadhich4840 · 3 years ago

    Should I practice by writing my own full code, including the hypothesis functions, cost functions, and gradient descent, or fully use sklearn?

  • @YS-nc4xu · 3 years ago

    If you're a student and have time to explore, please go ahead and implement it from scratch. It'll really help you to not only understand the basic working but also the software development aspect of creating any model (refer sklearn documentation and source code) and get to know more about industry level coding practices.

  • @youcefyahiaoui1465 · 2 months ago

    Great tutorial, but I think you're mistaken about the abs(). You're actually considering both with abs(). If you removed abs() and kept the > inequality, then 0.95 would be > thresh=0.9, but -0.99 would not satisfy this condition! If you want to remove abs(), then you need to test 2 conditions, like corr_matrix.iloc[i,j] > +1*thresh (assuming thresh is always +ve) or corr_matrix.iloc[i,j] < -1*thresh.

  • @meshmeso · 3 months ago

    These are for numeric features; what about correlation between categorical features?

  • @vivekkumargoel2676 · 3 years ago

    Sir, when are the remaining videos on feature selection coming?

  • @teenamadhu7883 · 3 years ago

    How do I get the name of the column which is highly correlated with a given column? Please help.

  • @StanleySI · 2 years ago

    Hi sir, there's an obvious flaw in this approach: you can't drop all correlated features, only some of them. E.g. perimeter_mean & area_se are highly correlated (0.986507), and they both appear in your corr_features. However, you can't drop both of them, because from the pairplot you can see perimeter_mean has a clear impact on the test result.

  • @vijaytogla2614 · 3 years ago

    1st comment..

  • @lwasinamdilli · 1 year ago

    How do you handle correlation for Categorical variables?

  • @sivadevil4845 · 3 months ago

    Hi @krish naik, I want to know about data cleaning, model selection, and model performance, and how we can do all that. I hope you will explain if you find this comment.

  • @miteshrajpurohit9033 · 2 months ago

    Could you please make a video on "how an autoencoder can be used to extract important features"?