CatBoost Part 1: Ordered Target Encoding

One of the defining features of CatBoost is its concerted effort to avoid data leakage at all costs. In this video, we'll see how it eliminates a potential threat in Target Encoding by ordering the data and encoding it sequentially. This ordered approach is central to everything CatBoost does and we'll see it again in Part 2 when we talk about how it builds trees.
NOTE: This StatQuest is based on the original CatBoost manuscript... arxiv.org/abs/1706.09516
...and an example provided in the CatBoost documentation...
catboost.ai/en/docs/concepts/...
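To make the description above concrete, here is a minimal sketch of ordered target encoding for a binary target. It assumes the formula from the CatBoost documentation example, (OptionCount + prior) / (n + 1), with prior = 0.05. The function name and the toy data are hypothetical, and this is an illustration, not CatBoost's actual implementation.

```python
from collections import defaultdict

def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row using only the rows that come before it.

    Assumed formula: (OptionCount + prior) / (n + 1), where, among the
    earlier rows with the same category, OptionCount is the number with
    target == 1 and n is the total number of rows.
    """
    option_count = defaultdict(int)  # earlier rows with this category and target == 1
    n = defaultdict(int)             # earlier rows with this category
    encoded = []
    for cat, y in zip(categories, targets):
        encoded.append((option_count[cat] + prior) / (n[cat] + 1))
        # Update the running counts only after the row has been encoded,
        # so a row never sees its own target value.
        option_count[cat] += int(y == 1)
        n[cat] += 1
    return encoded

# Made-up toy data: Favorite Color and Loves Troll 2.
colors = ["Blue", "Red", "Green", "Blue", "Green", "Green", "Blue"]
loves_troll2 = [1, 1, 1, 0, 1, 0, 0]
encoded = ordered_target_encode(colors, loves_troll2)
print(encoded)
```

Because the counts are updated only after each row is encoded, the same category (for example, Blue) receives different values depending on where the row sits in the ordering.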
This video has been dubbed into Spanish and Portuguese using an artificial voice via aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
KZread Membership: / @statquest
...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
0:00 Awesome song and introduction
1:56 A slight problem with k-fold target encoding
3:42 Ordered Target Encoding
Corrections:
4:09 It is also worth noting that if there were more than 2 target values, for example, if Loves Troll 2 could be 0, 1 and 2, then, when calculating the OptionCount for a sample with Loves Troll 2 = 1, we would include rows that had Loves Troll 2 = 1 and 2.
#StatQuest #CatBoost #dubbedwithaloud

Comments: 84

@statquest 1 year ago
Corrections: 4:09 It is also worth noting that if there were more than 2 target values, for example, if Loves Troll 2 could be 0, 1 and 2, then, when calculating the OptionCount for a sample with Loves Troll 2 = 1, we would include rows that had Loves Troll 2 = 1 and 2. To learn more about Lightning: lightning.ai/ Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@mrcoet 1 year ago
Thank you! I'm doing my master's thesis and I'm checking your channel every day waiting for Transformers. Thank you again!
@statquest
1 year ago
I'm still working on it.
@dihancheng952
1 year ago
@@statquest Same, eagerly waiting here.
@firstkaransingh 1 year ago
Finally a video on CatBoost. I was waiting for a proper explanation.
@statquest
1 year ago
Bam! :)
@xaviernogueira 1 year ago
Glad to see CatBoost! Would love to hear more about data leakage mitigation.
@statquest
1 year ago
Thanks! Yes, I think at some point I need to do a video just on all the types of leakage.
@aghazi94 1 year ago
I have been waiting for this for so long. Thanks a lot.
@statquest
1 year ago
BAM!
@AllNightNightwish 1 year ago
Hi Josh, I agree with your point here about it being unnecessary (also having seen the previous longer explanation you posted a while back). However, I think their main point and contribution was not the mitigation in a single tree, but throughout the ensemble. If I understand it correctly, by using ordered boosting and randomization over each tree, they guarantee that there is no leakage between the separate trees, because none of the samples have ever seen the original value. They use multiple models trained on different fractions of the dataset for each tree, just so they can make predictions that don't have any leakage at all. I'm still not sure that it wouldn't just work with leave-one-out encoding, but given that context it seems to be more useful at least.
@statquest
1 year ago
Part 2 in this series (which comes out in less than 24 hours) shows how the trees are built using the same approach that limits leakage. I guess one of my issues with CatBoost making such a big deal about leakage is that, even though other methods (XGBoost, LightGBM, Random Forests, etc.) might result in leakage, they still perform well, and the whole point of avoiding leakage is simply to have a model perform well.
@joy5636 1 year ago
Wow, I am so excited to see the CatBoost topic! Thank you!
@statquest
1 year ago
BAM! :)
@TJ-hs1qm 1 year ago
Hey Josh, I was wondering if you could do a series on graph theory and NLP? Exploring this stuff would be really helpful. Thanks!
@statquest
1 year ago
I'll keep that in mind.
@davidguo1267 1 year ago
Thanks for the explanation. By the way, have you talked about backpropagation through time in recurrent neural networks? If not, are you planning to talk about it?
@statquest
1 year ago
Backpropagation through time is just "unroll the RNN and then do normal backpropagation". I have thought about doing a video on it and have notes, but it's not a super high priority right now. Instead I want to get to transformers.
@heteromodal 11 months ago
Hey Josh, is there a mathematical justification for the prior in the numerator being defined as 0.05? Regardless of whether a justification exists :), is it always the case, or is it just what you saw in their examples, so it's not certain that it's a fixed value? Thank you as always for a great video!
@statquest
11 months ago
I saw 0.05 used as the prior here: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic and, on that page, it says you can set the prior. But I've looked in the documentation and I can't find where it is set, so I really don't know if it is always the case or not.
@heteromodal
11 months ago
@@statquest Thank you!
@frischidn3869 1 year ago
Hello, thanks for the video. I want to ask: what if the target variable (Loves Troll 2) is multiclass (Like, Dislike, So-so)? How will the encoding work then for Favorite Color? And should we encode the target variable first, such as 0 = Dislike, 1 = So-so, 2 = Like, before we proceed to CatBoost encoding the feature (Favorite Color)?
@statquest
1 year ago
When there are more than 2 classes, the equation changes, but just a little bit. You can find it in the documentation: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
@frischidn3869
1 year ago
@@statquest It says there: "The label values are integer identifiers of target classes (starting from "0")". So if there are 3 classes, I have to encode the target variable as 0, 1, 2 first, outside the CatBoost algorithm?
@statquest
1 year ago
@@frischidn3869 Sounds like it.
@beautyisinmind2163 10 months ago
Is categorical boosting only suitable for data with categorical features, or can we use it even if our data has no categorical features? When using it on continuous features, does it require any conversion?
@statquest
10 months ago
You can certainly use CatBoost on a dataset that doesn't have any categorical features. And it wouldn't require conversion.
@luiscarlospallaresascanio2374 1 year ago
What did you use to translate the text into Spanish? :0 I had already seen other translated videos, but I didn't think they would make the switch so quickly.
@statquest
1 year ago
I use Google's "Aloud".
@daniellaicheukpan 1 year ago
Hi Josh, thanks for your videos. I have one question. In your example, the color blue can be encoded to several numerical values. Assume that I trained and deployed this model. When new data comes in with color = blue, and it has no "Loves Troll 2" column, how does the model know which value to encode the color as? Thanks so much.
@statquest
1 year ago
You use all of the color blue samples in the original training dataset.
@daniellaicheukpan
1 year ago
@@statquest Does that mean taking the average?
@statquest
1 year ago
@@daniellaicheukpan I was thinking more along the lines of plugging all of the blue rows into the equation. That might be the same as taking the average, but I haven't worked that out.
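The question in this thread can be checked with a small sketch. It assumes the same encoding formula as the CatBoost documentation example, (OptionCount + prior) / (n + 1) with prior = 0.05, and plugs every training row with the new sample's category into it. The function name and toy data are hypothetical, and this is not necessarily how CatBoost handles new data internally.

```python
def encode_new_value(category, train_categories, train_targets, prior=0.05):
    """Encode a category for a new, unlabeled row using all training rows.

    Assumed formula: (OptionCount + prior) / (n + 1), computed over every
    training row whose category matches the new row's category.
    """
    n = 0             # training rows with this category
    option_count = 0  # of those, rows with target == 1
    for cat, y in zip(train_categories, train_targets):
        if cat == category:
            n += 1
            option_count += int(y == 1)
    return (option_count + prior) / (n + 1)

# Made-up training data: Favorite Color and Loves Troll 2.
train_colors = ["Blue", "Red", "Green", "Blue", "Green", "Green", "Blue"]
train_loves_troll2 = [1, 1, 1, 0, 1, 0, 0]

blue_encoding = encode_new_value("Blue", train_colors, train_loves_troll2)

# Plain average of the target over the Blue training rows, for comparison.
blue_targets = [y for c, y in zip(train_colors, train_loves_troll2) if c == "Blue"]
blue_mean = sum(blue_targets) / len(blue_targets)

print(blue_encoding, blue_mean)
```

Under this assumed formula the result is close to, but not the same as, the plain average: the prior in the numerator and the + 1 in the denominator shift it, and the gap shrinks as the number of rows in the category grows.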
@johndavid5907 9 months ago
Hi there sir, can you tell me whether the value that the prior holds is the significance level?
@statquest
9 months ago
0.05 is often used as a threshold for statistical significance, but in this case, that concept has nothing to do with how we assign a value to the prior. In theory, the prior could be anything, like 12, and that's not even an option for the threshold for statistical significance.
@matteomorellini5974 1 year ago
Hi Josh, first of all, thanks for your amazing work and passion. I'd like to suggest a video about Optuna, which, at least in my case, would be extremely helpful.
@statquest
1 year ago
I'll keep that in mind.
@matteomorellini5974
1 year ago
@@statquest Thanks Josh ❤️
@junaidbutt3000 1 year ago
Clear and concise as always, Josh! I was wondering if there is a natural way to extend the OptionCount metric to multiclass problems. It makes sense in binary classification: we count the observations where a category class c co-occurs with the positive class of the target variable (1 in this case). If this were adapted for multiclass problems, how would we adapt this encoding equation?
@statquest
1 year ago
Great question! The CatBoost documentation has a good description of how it works for more classes: catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic
@texla-kh9qx
1 year ago
@@statquest From the documentation: "Multiclassification The label values are integer identifiers of target classes (starting from "0")." It seems that they simply integer-encode the classes? Doesn't this introduce an artificial ordering in the target classes?
@statquest
1 year ago
@@texla-kh9qx You have to remember that we don't split the data based on the target value, so using integer values for the target isn't a problem.
@texla-kh9qx
1 year ago
@@statquest The categorical features among the independent variables are encoded by target statistics, i.e. the transformation from categories to numerical values. If there is an artificial ordering in the target variable y, it propagates to that categorical feature of X. So integer-encoding multiple classes does not seem like a good choice.
@statquest
1 year ago
@@texla-kh9qx If you look at the equations for target encoding independent variables, you'll see that they don't include the target value, just the number of rows with the same category. So I don't believe that the target values propagate to the independent variables.
@shubhamgupta6551 1 year ago
How is ordered target encoding applied at scoring time? There will not be any target variable, and we don't have a single value for a category, i.e. the color Blue is encoded multiple times with different values.
@statquest
1 year ago
We use the entire training dataset to encode new data.
@tapiotanskanen3494 1 year ago
1:57 - Is this correct? In section 3.2 (Greedy TS), they talk about a problem: "This estimate is noisy for low-frequency categories...", but your example has a (maximally) high-frequency category. Later they stipulate: "Assume i-th feature is categorical, all its values are unique, ...". To me this means that there is only a single row for each category. In other words, each category (label) is unique, i.e. we have exactly one example per category (label).
@statquest
1 year ago
The video is correct. If you keep reading the manuscript, just a few more paragraphs, you'll get to the section titled "Leave-one-out TS", and you'll see what I'm talking about in this video.
@texla-kh9qx
1 year ago
The video is talking about the example with a constant categorical feature introduced in the "Leave-one-out TS" section of their paper. However, I think the formula for target statistics in this video is different from the one in the paper, though the conclusion is still the same. Put another way, a categorical feature that has a single uniform value originally carries no information at all. After target statistics encoding, that categorical feature is transformed into a numerical feature with two values that exactly distinguish the binary target classes. This is clearly target leakage, as you can make a perfect prediction relying on a single feature.
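The leakage described in this thread can be reproduced with a small sketch. It assumes a plain leave-one-out mean with no prior or smoothing, which is a simplification and may differ from the exact formulas in the video and in the paper. The function name and toy data are hypothetical.

```python
def leave_one_out_encode(categories, targets):
    """Leave-one-out target encoding: the mean target of all OTHER rows
    that share the row's category (assumes each category appears at
    least twice, otherwise the denominator would be zero)."""
    totals, counts = {}, {}
    for cat, y in zip(categories, targets):
        totals[cat] = totals.get(cat, 0) + y
        counts[cat] = counts.get(cat, 0) + 1
    return [
        (totals[cat] - y) / (counts[cat] - 1)
        for cat, y in zip(categories, targets)
    ]

# A constant feature: every row has the same category, so on its own
# it carries no information about the target.
colors = ["Blue"] * 6
loves_troll2 = [1, 0, 1, 1, 0, 0]

encoded = leave_one_out_encode(colors, loves_troll2)
print(encoded)

# A single threshold on the encoded feature now recovers the target exactly.
predictions = [int(value < 0.5) for value in encoded]
print(predictions == loves_troll2)
```

Rows with target 1 have their own 1 removed from the sum, so they get a lower value than rows with target 0. A single split on the encoded column then separates the classes perfectly, even though the original column was constant.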
@murilopalomosebilla2999 1 year ago
It may be silly, but having a boosting method with cat in its name is really cool haha
@statquest
1 year ago
Bam! :)
@ravi122133 3 months ago
@statquest, I think that in the paper they use the case where each sample has a unique category to show that it leads to leakage, not the case where all samples have the same category. See Section 3.2, Greedy TS, of the CatBoost paper.
@statquest
3 months ago
Yes, but in either case, you could just remove that column.
@ericchang927 1 year ago
Great video!!! Could you please also introduce LightGBM? Thanks!
@statquest
1 year ago
I'll keep that in mind. I have some notes on it already so hopefully I can do it soon.
@EvanZamir 9 months ago
Lightning can be used with CatBoost?
@statquest
8 months ago
Lightning AI provides a platform to do things easily in the cloud. So, anytime you have a ton of data or a large model, Lightning can help.
@tessa10001 1 year ago
Where was this when I wrote my master's thesis with CatBoost :(
@statquest
1 year ago
Late bam?
@dl569 1 year ago
Can't wait to see Transformers, PLEASE!!!!!!
@statquest
1 year ago
Working on it! :)
@BlueRS123 11 months ago
Will you cover LightGBM?
@statquest
11 months ago
I've got notes on it and when I have time I will.
@BlueRS123
11 months ago
@@statquest Cool! Are videos on gradient descent optimizers planned, too? (Momentum, Adam, etc.)
@statquest
11 months ago
@@BlueRS123 I've got notes for Adam as well, so it's just a function of finding some time.
@c.nbhaskar4718 1 year ago
Great tutorial, but I am eagerly waiting for the StatQuest on Transformers.
@statquest
1 year ago
Working on it!
@guimaraesalysson 1 year ago
In this simple example of favorite colors and whether or not people liked the movie, wouldn't "leakage" make sense? After all, if, for example, 90% of people who like blue liked the movie, wouldn't knowing that the next person's favorite color is blue already provide information? Why is the leak a leak in this case?
@statquest
1 year ago
Leakage comes from using the same row's target value to modify its value for Favorite Color. This is typically dealt with by using k-fold target encoding - kzread.info/dash/bejne/Z2xt0KWAlbqtYdo.html
@EvanZamir 9 months ago
My guess is the ordered target encoding acts like a form of regularization.
@statquest
8 months ago
Yes, that makes sense to me.
@Joaopedro_ 6 months ago
Send a shout-out to Caio Ducati.
@statquest
6 months ago
:)
@nitinsiwach1989 6 months ago
Not only is the motivation unjustifiable, the way target encoding is done by CatBoost also makes no sense. Even in your toy example, the different categories are numerically exactly the same when encoded, and there is absolutely no reason that should be the case.
@statquest
6 months ago
Noted
