CatBoost Part 1: Ordered Target Encoding

One of the defining features of CatBoost is its concerted effort to avoid data leakage at all costs. In this video, we'll see how it eliminates a potential threat in Target Encoding by ordering the data and encoding it sequentially. This ordered approach is central to everything CatBoost does and we'll see it again in Part 2 when we talk about how it builds trees.
NOTE: This StatQuest is based on the original CatBoost manuscript... arxiv.org/abs/1706.09516
...and an example provided in the CatBoost documentation...
catboost.ai/en/docs/concepts/...
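As a rough illustration of the ordered approach described above, the following sketch walks through the rows from top to bottom and encodes each category using only the rows that came before it. It assumes the formula (OptionCount + prior) / (n + 1) with a prior of 0.05, as in the CatBoost documentation example. The function name `ordered_target_encode` and the small dataset are made up for illustration and are not taken from the video or from CatBoost's own code.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row using only the rows that come before it (hypothetical helper)."""
    option_count = defaultdict(int)  # earlier rows with this category and target == 1
    n = defaultdict(int)             # earlier rows with this category
    encoded = []
    for category, target in zip(categories, targets):
        # The current row's own target is NOT used for its own encoding.
        encoded.append((option_count[category] + prior) / (n[category] + 1))
        # Only now does the current row become visible to later rows.
        option_count[category] += int(target == 1)
        n[category] += 1
    return encoded


# Made-up example data: Favorite Color and Loves Troll 2.
colors = ["Blue", "Red", "Green", "Blue", "Green", "Green", "Blue"]
loves_troll_2 = [1, 1, 1, 0, 1, 0, 0]
encoded = ordered_target_encode(colors, loves_troll_2)

for color, target, value in zip(colors, loves_troll_2, encoded):
    print(f"{color:<6} target={target} encoded={value:.4f}")
```

Note that the same color can receive different encoded values depending on where the row sits in the ordering, and that the first occurrence of any color is always encoded as the prior itself. CatBoost applies this idea over random orderings of the rows; the sketch uses the given order only.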
English
This video has been dubbed using an artificial voice via aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.
0:00 Awesome song and introduction
1:56 A slight problem with k-fold target encoding
3:42 Ordered Target Encoding
Corrections:
4:09 It is also worth noting that if there were more than 2 target values, for example, if Loves Troll 2 could be 0, 1 and 2, then, when calculating the OptionCount for a sample with Loves Troll 2 = 1, we would include rows that had Loves Troll 2 = 1 and 2.

Comments: 84


  • @mrcoet
    @mrcoet 1 year ago

    Thank you! I'm doing my master's thesis and I'm checking your channel every day, waiting for Transformers. Thank you again!

  • @statquest

    @statquest

    1 year ago

    I'm still working on it.

  • @dihancheng952

    @dihancheng952

    1 year ago

    @@statquest same eagerly waiting here

  • @firstkaransingh
    @firstkaransingh 1 year ago

    Finally, a video on CatBoost. I was waiting for a proper explanation.

  • @statquest

    @statquest

    1 year ago

    Bam! :)

  • @xaviernogueira
    @xaviernogueira 1 year ago

    Glad to see CatBoost! Would love to hear more about data leakage mitigation.

  • @statquest

    @statquest

    1 year ago

    Thanks! Yes, I think at some point I need to do a video just on all the types of leakage.

  • @aghazi94
    @aghazi94 1 year ago

    I have been waiting for this for so long. Thanks a lot.

  • @statquest

    @statquest

    1 year ago

    BAM!

  • @AllNightNightwish
    @AllNightNightwish 1 year ago

    Hi Josh, I agree with your point here about it being unnecessary (also having seen the previous longer explanation you posted a while back). However, I think their main point and contribution was not the mitigation in a single tree, but throughout the ensemble. If I understand it correctly, by using ordered boosting and randomization over each tree, they guarantee that there is no leakage between the separate trees, because none of the samples have ever seen the original value. They use multiple models trained on different fractions of the dataset for each tree, just so they can make predictions that don't have any leakage at all. I'm still not sure that it wouldn't just work with leave-one-out encoding, but given that context it seems to be more useful at least.

  • @statquest

    @statquest

    1 year ago

    Part 2 in this series (which comes out in less than 24 hours) shows how the trees are built using the same approach that limits leakage. I guess one of my issues with CatBoost making such a big deal about leakage is that, even though other methods (XGBoost, LightGBM, Random Forests, etc.) might result in leakage, they still perform well, and the whole point of avoiding leakage is simply to have a model perform well.

  • @joy5636
    @joy5636 1 year ago

    Wow, I am so excited to see the CatBoost topic! Thank you!

  • @statquest

    @statquest

    1 year ago

    BAM! :)

  • @TJ-hs1qm
    @TJ-hs1qm 1 year ago

    Hey Josh, I was wondering if you could do a series on graph theory and NLP? exploring this stuff would be really helpful. Thanks!

  • @statquest

    @statquest

    1 year ago

    I'll keep that in mind.

  • @davidguo1267
    @davidguo1267 1 year ago

    Thanks for the explanation. By the way, have you talked about backpropagation through time in recurrent neural networks? If not, are you planning to talk about it?

  • @statquest

    @statquest

    1 year ago

    Backpropagation through time is just "unroll the RNN and then do normal backpropagation". I have thought about doing a video on it and have notes, but it's not a super high priority right now. Instead I want to get to transformers.

  • @heteromodal
    @heteromodal 11 months ago

    Hey Josh, is there a mathematical justification for the prior in the numerator being defined as 0.05? Regardless of whether a justification exists :), is it always the case, or is it just what you saw in their examples, so it's not certain that it's a fixed value? Thank you as always for a great video!

  • @statquest

    @statquest

    11 months ago

    I saw 0.05 used as the prior here: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic and, on that page, it says you can set the prior. But I've looked in the documentation and I can't find where it is set, so I really don't know if it is always the case or not.

  • @heteromodal

    @heteromodal

    11 months ago

    @@statquest Thank you!

  • @frischidn3869
    @frischidn3869 1 year ago

    Hello, thanks for the video. I want to ask: what if the target variable (Loves Troll 2) is multiclass (Like, Dislike, So-so)? How will the encoding work then for Favorite Color? And should we first encode the target variable, such as 0 = Dislike, 1 = So-so, 2 = Like, before we proceed to CatBoost encoding the feature (Favorite Color)?

  • @statquest

    @statquest

    1 year ago

    When there are more than 2 classes, the equation changes, but just a little bit. You can find it in the documentation: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic

  • @frischidn3869

    @frischidn3869

    1 year ago

    @@statquest It says there, "The label values are integer identifiers of target classes (starting from "0")". So do I have to encode the target variable as 0, 1, 2 outside the CatBoost algorithm first if there are 3 classes?

  • @statquest

    @statquest

    1 year ago

    @@frischidn3869 Sounds like it.

  • @beautyisinmind2163
    @beautyisinmind2163 10 months ago

    Is CatBoost only suitable for data with categorical features, or can we use it even if our data has no categorical features? When it is used on continuous features, does it require any conversion?

  • @statquest

    @statquest

    10 months ago

    You can certainly use CatBoost on a dataset that doesn't have any categorical features. And it wouldn't require conversion.

  • @luiscarlospallaresascanio2374
    @luiscarlospallaresascanio2374 1 year ago

    What did you use to translate the text into Spanish? :0 I had already seen other translated videos, but I didn't think the change would happen so quickly.

  • @statquest

    @statquest

    1 year ago

    I use "Aloud" from Google.

  • @daniellaicheukpan
    @daniellaicheukpan 1 year ago

    Hi Josh, thanks for your videos. I have one question. In your example, the color blue can be encoded as several different numerical values. Assume that I trained and deployed this model. When new data arrives with color = blue and no "Loves Troll 2" column, how does the model know which value to encode the color as? Thanks so much.

  • @statquest

    @statquest

    1 year ago

    You use all of the color blue samples in the original training dataset.

  • @daniellaicheukpan

    @daniellaicheukpan

    1 year ago

    @@statquest Does that mean taking the average?

  • @statquest

    @statquest

    1 year ago

    @@daniellaicheukpan I was thinking more along the lines of plugging all of the blue rows into the equation. That might be the same as taking the average, but I haven't worked that out.
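One way to read the reply above is sketched below, assuming the same (OptionCount + prior) / (n + 1) formula is applied to every training row of the requested color. The helper `encode_new_value` and the training data are hypothetical, and whether CatBoost performs exactly this computation internally is an assumption here.

```python
def encode_new_value(category, train_categories, train_targets, prior=0.05):
    """Encode a category for new data from ALL matching training rows (hypothetical helper)."""
    rows = [t for c, t in zip(train_categories, train_targets) if c == category]
    option_count = sum(1 for t in rows if t == 1)  # matching rows with target == 1
    return (option_count + prior) / (len(rows) + 1)


# Made-up training data: Favorite Color and Loves Troll 2.
train_colors = ["Blue", "Red", "Green", "Blue", "Green", "Green", "Blue"]
train_loves_troll_2 = [1, 1, 1, 0, 1, 0, 0]

blue_encoding = encode_new_value("Blue", train_colors, train_loves_troll_2)

# Plain average of the target over the Blue training rows, for comparison.
blue_targets = [t for c, t in zip(train_colors, train_loves_troll_2) if c == "Blue"]
blue_mean = sum(blue_targets) / len(blue_targets)

print(blue_encoding, blue_mean)
```

Under these assumptions the result is close to, but not identical to, the plain average: the prior in the numerator and the extra 1 in the denominator shift it slightly, and the difference shrinks as the number of rows with that color grows. A color that never appeared in training would simply be encoded as the prior.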

  • @johndavid5907
    @johndavid5907 9 months ago

    Hi there, sir. Can you tell me whether the value that the prior variable holds is the significance level?

  • @statquest

    @statquest

    9 months ago

    0.05 is often used as a threshold for statistical significance, but in this case, that concept has nothing to do with how we assign a value to the prior. In theory, the prior could be anything, like 12, and that's not even an option for the threshold for statistical significance.

  • @matteomorellini5974
    @matteomorellini5974 1 year ago

    Hi Josh, first of all, thanks for your amazing work and passion. I'd like to suggest a video about Optuna, which, at least in my case, would be extremely helpful.

  • @statquest

    @statquest

    1 year ago

    I'll keep that in mind.

  • @matteomorellini5974

    @matteomorellini5974

    1 year ago

    @@statquest thanks Josh❤️

  • @junaidbutt3000
    @junaidbutt3000 1 year ago

    Clear and concise as always Josh! I was wondering if there was a natural way to extend the OptionCount metric for multiclass problems? It makes sense in the binary classification, we count the observations where a category class c co-occurs with the positive class of the target variable (1 in this case). If this was adapted for multiclass problems, how would we adapt this encoding equation?

  • @statquest

    @statquest

    1 year ago

    Great question - and the CatBoost documentation has a good description of how it works for more classes: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic

  • @texla-kh9qx

    @texla-kh9qx

    1 year ago

    @@statquest From the documentation, "Multiclassification The label values are integer identifiers of target classes (starting from "0").", it seems that they simply integer-encode the classes? Doesn't this introduce an artificial ordering in the target classes?

  • @statquest

    @statquest

    1 year ago

    @@texla-kh9qx You have to remember that we don't split the data based on the target value, so using integer values for the target isn't a problem.

  • @texla-kh9qx

    @texla-kh9qx

    1 year ago

    @@statquest The categorical features among the independent variables are encoded by target statistics, i.e., the transformation from categories to numerical values. If there is an artificial ordering in the target variable y, it propagates to that categorical feature of X. So integer-encoding the classes does not seem like a good choice.

  • @statquest

    @statquest

    1 year ago

    @@texla-kh9qx If you look at the equations for target encoding independent variables, you'll see that they don't include the target value, just the number of rows with the same category. So I don't believe that the target values propagate to the independent variables.

  • @shubhamgupta6551
    @shubhamgupta6551 1 year ago

    How is the ordered target encoding applied at scoring time? There will not be any target variable, and we don't have a single value for a category, i.e., the color Blue is encoded multiple times with different values.

  • @statquest

    @statquest

    1 year ago

    We use the entire training dataset to encode new data.

  • @tapiotanskanen3494
    @tapiotanskanen3494 1 year ago

    1:57 - Is this correct? In section 3.2, "Greedy TS", they talk about a problem, "This estimate is noisy for low-frequency categories...", but your example has a (maximally) high-frequency category. Later they stipulate, "Assume i-th feature is categorical, all its values are unique, ...". To me this means that there is only a single row for each category. In other words, each category (label) is unique, i.e., we have exactly one example per category (label).

  • @statquest

    @statquest

    1 year ago

    The video is correct. If you keep reading the manuscript, just a few more paragraphs, you'll get to the section titled "Leave-one-out TS", and you'll see what I'm talking about in this video.

  • @texla-kh9qx

    @texla-kh9qx

    1 year ago

    The video is talking about the example with a constant categorical feature introduced in the "Leave-one-out TS" section of their paper. However, I think the formula for target statistics in this video is different from the one in the paper, though the conclusion is still the same. Put another way, a categorical feature with a single uniform value originally carries no information at all. After target statistics encoding, that categorical feature is transformed into a numerical feature with two values that exactly distinguish the binary target classes. This is clearly target leakage, as you can make perfect predictions relying on a single feature.
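The constant-feature argument in the comment above can be checked with a small sketch. It assumes a plain leave-one-out mean with no prior, which is a simplification and not necessarily the exact formula used in the video or the paper. The function name and the target values are made up for illustration.

```python
def leave_one_out_encode(targets):
    """Leave-one-out target mean when every row shares the same category (hypothetical helper)."""
    total, n = sum(targets), len(targets)
    # Each row is encoded with the mean target of all OTHER rows.
    return [(total - y) / (n - 1) for y in targets]


# Made-up binary targets; the categorical feature is constant, so it is omitted.
targets = [1, 0, 1, 1, 0, 0, 1]
encoded = leave_one_out_encode(targets)

positives = {e for e, y in zip(encoded, targets) if y == 1}
negatives = {e for e, y in zip(encoded, targets) if y == 0}
print(positives, negatives)
```

In this sketch the encoded column takes exactly two values, one for every row with target 1 and a different, larger one for every row with target 0, so a single threshold on the encoded feature separates the classes perfectly even though the original feature carried no information.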

  • @murilopalomosebilla2999
    @murilopalomosebilla2999 1 year ago

    It may be silly, but having a boosting method with cat in its name is really cool haha

  • @statquest

    @statquest

    1 year ago

    Bam! :)

  • @ravi122133
    @ravi122133 3 months ago

    @statquest, I think in the paper they use the case where each sample has a unique category to show that it leads to leakage, not the case where all samples have the same category. See Section 3.2, Greedy TS, of the CatBoost paper.

  • @statquest

    @statquest

    3 months ago

    Yes, but in either case, you could just remove that column.

  • @ericchang927
    @ericchang927 1 year ago

    Great video!!! Could you please also cover LightGBM? Thanks!

  • @statquest

    @statquest

    1 year ago

    I'll keep that in mind. I have some notes on it already so hopefully I can do it soon.

  • @EvanZamir
    @EvanZamir 9 months ago

    Can Lightning be used with CatBoost?

  • @statquest

    @statquest

    8 months ago

    Lightning AI provides a platform to do things easily in the cloud. So, anytime you have a ton of data or a large model, Lightning can help.

  • @tessa10001
    @tessa10001 1 year ago

    Where was this when I wrote my master's thesis with CatBoost? :(

  • @statquest

    @statquest

    1 year ago

    Late bam?

  • @dl569
    @dl569 1 year ago

    can't wait to see Transformer, PLEASE!!!!!!

  • @statquest

    @statquest

    1 year ago

    Working on it! :)

  • @BlueRS123
    @BlueRS123 11 months ago

    Will you cover LightGBM?

  • @statquest

    @statquest

    11 months ago

    I've got notes on it and when I have time I will.

  • @BlueRS123

    @BlueRS123

    11 months ago

    @@statquest Cool! Are videos of gradient descent optimizers planned, too? (Momentum, Adam, etc.)

  • @statquest

    @statquest

    11 months ago

    @@BlueRS123 I've got notes for Adam as well, so it's just a function of finding some time.

  • @c.nbhaskar4718
    @c.nbhaskar4718 1 year ago

    Great tutorial, but I am eagerly waiting for a StatQuest on Transformers.

  • @statquest

    @statquest

    1 year ago

    Working on it!

  • @guimaraesalysson
    @guimaraesalysson 1 year ago

    In this simple example of people's favorite colors and whether or not they liked the movie, wouldn't "leakage" make sense? After all, if, for example, 90% of people who like blue liked the movie, wouldn't knowing that the next person likes blue already provide information? Why is the leak a leak in this case?

  • @statquest

    @statquest

    1 year ago

    Leakage comes from using the same row's target value to modify its value for Favorite Color. This is typically dealt with by using k-fold target encoding - kzread.info/dash/bejne/Z2xt0KWAlbqtYdo.html
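To make the point in the reply above concrete, here is a minimal sketch of naive ("greedy") target encoding, in which every row's own target value contributes to its own encoded feature. The function name and the data are made up for illustration, and the plain per-category mean is an assumption about what the naive approach looks like.

```python
from collections import defaultdict


def greedy_target_encode(categories, targets):
    """Replace each category with the mean target over ALL rows of that category (hypothetical helper)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Every row is encoded with a mean that includes its own target value.
    return [sums[c] / counts[c] for c in categories]


# Made-up example data: Favorite Color and Loves Troll 2.
colors = ["Blue", "Red", "Green", "Blue", "Green", "Green", "Blue"]
loves_troll_2 = [1, 1, 1, 0, 1, 0, 0]
greedy = greedy_target_encode(colors, loves_troll_2)

print(list(zip(colors, loves_troll_2, greedy)))
```

In this made-up data, Red appears only once, so its encoded value is exactly that row's own target. That is the leak: the feature hands the model the answer for that row, which is a pattern it cannot rely on for new data where the target is unknown.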

  • @EvanZamir
    @EvanZamir 9 months ago

    My guess is the ordered target encoding acts like a form of regularization.

  • @statquest

    @statquest

    8 months ago

    Yes, that makes sense to me.

  • @Joaopedro_
    @Joaopedro_ 6 months ago

    Give a shout-out to Caio Ducati.

  • @statquest

    @statquest

    6 months ago

    :)

  • @nitinsiwach1989
    @nitinsiwach1989 6 months ago

    Not only is the motivation unjustifiable, but the way target encoding is done by CatBoost also makes no sense. Even in your toy example, the different categories are numerically exactly the same when encoded, and there is absolutely no reason that should be the case.

  • @statquest

    @statquest

    6 months ago

    Noted