Using Column Transformer and Pipeline to handle data with missing values | Machine Learning

In this tutorial, we'll build upon what we learnt in Column Transformer (Part 1), but here we'll look at an example dataset that has missing values, and we'll figure out how to apply pre-processing steps to such datasets using scikit-learn.
Using Column Transformer and Pipeline for data pre-processing helps us in more ways than one: first, easy interpretability; second, it helps prevent data leakage; and third, it lets us do hyperparameter tuning with tools like GridSearchCV, among others.
In the tutorial, we'll go through all the nitty-gritty of when, where, and how to use them.
I've uploaded all the relevant code and datasets used here (and in all other tutorials, for that matter) to my GitHub page, which is accessible here:
Link:
github.com/rachittoshniwal/ma...
If you like my content, please do not forget to upvote this video and subscribe to my channel.
If you have any questions regarding any of the content here, please feel free to comment below and I'll be happy to assist you in whatever capacity possible.
Thank you!

Comments: 45

  • @crazypankaj4u
    3 years ago

    Really great content and well explained! Thank you. Hope you grow quickly!

  • @rachittoshniwal
    3 years ago

    Thank you for the kind words Pankaj! Appreciate it 😁

  • @tarun94060sharma
    1 year ago

    Sometimes we come across videos with a really great explanation. This is one of them.

  • @meenatyagi9740
    10 months ago

    Very good explanation. I was struggling to get clarity on it.

  • @research__7644
    1 year ago

    One word... amazing. Very well explained. Instead of sticking to the simple case of only imputing and scaling the numerical columns, you went the extra mile and did it for both categorical and numerical columns, with multiple different imputation processes. Just amazing, thank you!

  • @simonalazarevska4243
    3 years ago

    Great job! The video helped a lot, thank you!

  • @christoskourouklis8190
    1 year ago

    Sir, your videos are super-extremely educative. The value you are adding to our knowledge is beyond expectations! And all this for free... I would like to ask you a couple of questions, though. Why do we need to add the indicator when one-hot encoding? Also, how can we obtain the column names of our training and test sets after the whole preprocessing is done, when using ColumnTransformer and pipelines? Wouldn't it be a problem to continue our ML project without the column names? Thank you so much in advance, I am super excited to have discovered your channel!

  • @martinngobye3574
    9 months ago

    Great explanation of ColumnTransformer and Pipeline. However, how do you get the dataframe column names back instead of numbers? Thank you!!

  • @khirodsahoo5537
    3 years ago

    Thanks for this tutorial. I wish to use feature engineering techniques like mean encoding or frequency encoding in a pipeline. Could you please help me out with this?

  • @modhua4497
    9 months ago

    Thanks! Do you have an example of how to incorporate a log or sqrt transformation of features before modeling?

  • @ashabhumza3394
    2 years ago

    How would we put SelectKBest feature selection and a function to treat outliers in this pipeline?

  • @vish183
    3 years ago

    Hey Rachit, thanks for the video. Quick question on the one-hot encoding: I noticed that you used handle_unknown="ignore". How would the model handle values in the test set that were not seen during training? Would it be wise to just ignore them?

  • @rachittoshniwal
    3 years ago

    Hey Vishwantha, thanks! The default behavior is to raise an error, telling us that an unknown category has been encountered in the test set. When we set it to "ignore", all the one-hot encoded columns for that row are set to zero and the code won't break.
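A tiny sketch of that behavior, using made-up categories:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories, then transform a row with an unseen one
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["red"], ["blue"]]))

# "green" was never seen during fit, so with handle_unknown="ignore"
# both encoded columns come back as zero instead of raising an error
row = enc.transform(np.array([["green"]])).toarray()
```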

  • @kamilshaikh1602
    1 year ago

    In cases where we have custom conditions for missing value imputation, how do we do that in the pipeline and ColumnTransformer fashion?

  • @ritugujela8345
    1 year ago

    The dataframe ends up with its columns converted into indexes. How do we convert that dataframe back into one with column names? It's too difficult to judge a column just by its index, especially when we have to do EDA afterwards.

  • @sodiqrafiu9072
    3 years ago

    Thanks for the tutorial, boss. Please, could you come up with an end-to-end project and deployment?

  • @rachittoshniwal
    3 years ago

    Hi Sodiq! First off, thank you for liking the video! Sure, I'll come up with a few of those!

  • @martinbielke8301
    3 years ago

    Really well explained! I'm still not sure how to know when values are on different scales...

  • @rachittoshniwal
    3 years ago

    Thanks Martin! "How to know when values are in different scale" as in?

  • @martinbielke8301
    3 years ago

    @@rachittoshniwal At 5:48 you say you'll use a RobustScaler because the values in the dataset are on different scales, but I'm not sure how I could determine this by observing the values in this (or any other) dataset. Thanks!

  • @rachittoshniwal
    3 years ago

    @@martinbielke8301 Oh, okay. We can compare the units, and also look at the scales. For example, age and hours-per-week are mostly going to be two-digit numbers, while capital gain/loss can be zero or even huge five-digit numbers. So a linear model would overstate the importance of a column with a high range of values when compared with a column that doesn't vary much (like age). Hope it helps!

  • @rachittoshniwal
    3 years ago

    @@martinbielke8301 You could run a quick little df.describe() to get the min/max values in each numerical column.

  • @martinbielke8301
    3 years ago

    @@rachittoshniwal Alright!! Thank you

  • @mehul4mak
    2 years ago

    Why are you applying OHE on Indicator of cat_columns?

  • @thepresistence5935
    2 years ago

    Hi bro, wonderful explanations. I want to use a label encoder, because label encoding helps get rid of the extra columns. Can you please help me with that?

  • @rachittoshniwal
    2 years ago

    Hey, hope this helps: kzread.info/dash/bejne/h6ib1Mp7opbRhNo.html

  • @atiaspire
    3 years ago

    What is the use of the indicator? Why is it required?

  • @rachittoshniwal
    3 years ago

    Sometimes having indicator columns makes model performance better, because there might be a particular reason why that value was missing. For example, if a house doesn't have a garage, then the garage area column will be NaN for that house, and maybe it could help to have an indicator column signifying that "missing-ness"? We can only try and see; it is really more trial and error. Hope it helps.
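A minimal sketch of what the indicator adds, using SimpleImputer's add_indicator option on a made-up garage-area column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One house has no garage, so its garage area is missing
X = np.array([[400.0], [np.nan], [550.0]])

imp = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imp.fit_transform(X)
# Xt has two columns: the imputed values, then a 0/1 missing indicator,
# so the model can still "see" which rows originally had no value
```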

  • @martinbielke8301
    3 years ago

    I get this message every time, no matter what I try: "Number of features of the model must match the input".

  • @rachittoshniwal
    3 years ago

    At what stage of the code do you get this error?

  • @martinbielke8301
    3 years ago

    @@rachittoshniwal When I try to instantiate a model. If I do the train-test split before the imputation and the OHE, I get negative scoring values...

  • @rachittoshniwal
    3 years ago

    @@martinbielke8301 If you're doing regression analysis, then the negative values are fine. That is the scikit-learn convention of reporting the negative of the MSE, RMSE, etc.
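A quick illustration of that convention on synthetic regression data (the data itself is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(20, 2)
y = X @ np.array([1.0, 2.0]) + rng.rand(20) * 0.1

# "neg_mean_squared_error" returns the NEGATIVE of the MSE, so that
# scikit-learn can always treat "greater is better"; flip the sign
# to read the scores as plain MSE values
scores = cross_val_score(LinearRegression(), X, y, cv=3,
                         scoring="neg_mean_squared_error")
mse = -scores
```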

  • @martinbielke8301
    3 years ago

    @@rachittoshniwal Yes, I'm doing regression... Thank you for taking the time to answer! I'm just starting with all this and I very much appreciate the content of your channel. Cheers

  • @rachittoshniwal
    3 years ago

    @@martinbielke8301 cheers! :)

  • @FindMultiBagger
    3 years ago

    👍 Great tutorial, man. Can we pass a new unseen dataframe to .predict(unseen_data)? The aim is to apply the same pipeline to new data and get a predicted result for a single dataframe row, say .predict(new_df.sample(1)). Is it possible? Thanks

  • @rachittoshniwal
    3 years ago

    Thanks Alex! I'm glad it helped. Yes, absolutely. Just make sure the new unseen data row has the same shape, size, and order of columns as the original dataframe, so that it goes through all the transformation steps, and make sure the last step of the pipeline is an estimator, to get the predictions.

  • @FindMultiBagger
    3 years ago

    @@rachittoshniwal Got it 👍, the same mechanism is used in pycaret. In their current release they have added an extra parameter called custom pipeline. So I'm thinking that if we could add something to a pre-existing pipeline, it would be even more powerful. Can you make a tutorial on it? ✅ How to add a new transformation or operation to an existing pipeline. ✅ How to use a custom scikit-learn pipeline in pycaret (regression/classification). It's a really nice content idea for this current series 😉 Thanks!

  • @rachittoshniwal
    3 years ago

    @@FindMultiBagger Hi Alex, tbh I'm not very well versed with pycaret yet, but I'll definitely look into your suggestions! :) Thanks!

  • @sandeshkharat7020
    3 years ago

    How do we use a Pipeline to make predictions on new values? pipe_final.predict(new_arr) throws an error: ValueError: Specifying the columns using strings is only supported for pandas DataFrames

  • @rachittoshniwal
    3 years ago

    You need to pass a pandas dataframe to the predict method. I assume the pipeline is making use of a column transformer in it? Since the column transformer needs column names to do its work, you need to pass a dataframe of the unseen data to predict. Just make it pd.DataFrame(new_arr, columns=cols), where cols is the list of column names of X you passed to the pipeline during fit. Lemme know
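A sketch of that fix end to end. The column names, toy data, and model here are illustrative stand-ins, not the questioner's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cols = ["age", "workclass"]  # column names used when fitting
X = pd.DataFrame({"age": [25, 40, 31, 58],
                  "workclass": ["a", "b", "a", "b"]})
y = [0, 1, 0, 1]

ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
pipe = Pipeline([("prep", ct), ("clf", LogisticRegression())])
pipe.fit(X, y)

# A bare NumPy array has no column names, so the ColumnTransformer
# cannot route its columns; wrap it in a DataFrame with the same
# column names that were used during fit
new_arr = np.array([[36, "a"]], dtype=object)
pred = pipe.predict(pd.DataFrame(new_arr, columns=cols))
```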

  • @sandeshkharat7020
    3 years ago

    @@rachittoshniwal Thanks brother, it worked... super useful!

  • @subtlehyperbole4362
    1 year ago

    (Note: this is not an issue specific to your video, but something I have been confused by for a long time; this is just the first time I decided to stop and ask about it in the comment section.) It seems like it should be necessary (or, if not necessary, at least useful) to tell the model which column each imputed-data indicator is indicating for, right? But in the final dataset that you produce, the imputed-data indicators are all bunched up as the first four columns. How does the model know A) that these features are imputed-data indicators and, more importantly, B) which of the remaining 93 columns in the dataset each one is supposed to be for? They could be indicating for columns 5, 6, 7, and 8, or for columns 45, 72, 8, and 92, or any other combination of the remaining 93 feature columns. How does this not affect how the model trains? My brain is thinking that the algorithm somehow susses that out on its own... but I don't understand how or why it could do that. Am I making much ado about nothing here?

  • @rachittoshniwal
    1 year ago

    At the very core of things, the computer only understands 0s and 1s, haha. For human interpretability, yes, it might be necessary to label the columns to see what a specific missing indicator column is for, but for the computer it doesn't matter. The column headers are just for us; the model only cares about the data being in a 2D array format. For: A) the model doesn't know that a particular column indicates missing values in some other column; it only cares about the values in it. B) Again, to reiterate, the model doesn't care what each of the other 93-odd columns stands for; it only looks at their values. You can shuffle the column ordering, or pass X.values instead of a dataframe X to the model, and it will not affect performance.

  • @subtlehyperbole4362
    1 year ago

    @@rachittoshniwal Yeah, I understand that, but the 93 columns are all features in and of themselves; the 4 imputation indicators aren't really features of the data in the same underlying way, right? They are more like features about other features, not features of the event whose label the model is trying to train on. It seems like the imputation indicator's entire utility is essentially to point at a single data point in another column and say "don't take this data point too seriously, because it was made up". Wouldn't the entire weighting system need to treat that type of column differently? It feels like it would be problematic (or at least not useful) to treat those columns as if they were just additional feature columns, no different from any of the existing 93. (I guess it depends on the particular algorithm; I imagine decision-tree-based algos would probably be able to handle that kind of thing, but other algorithms don't seem well served by treating those columns like any other feature column.)

  • @rachittoshniwal
    1 year ago

    @@subtlehyperbole4362 Although the missing indicator column is based off some column X, it is essentially a brand new column holding the information that "column X has missing values for these rows", so the model will test whether NOT having values in column X is indicative of something or not.
