CatBoost Part 2: Building and Using Trees

Just like we saw in CatBoost Part 1, Ordered Target Encoding, we're going to use the training data one row at a time to build trees and calculate their output values. This is part of CatBoost's determined effort to avoid leakage like there is no tomorrow. We'll also learn how CatBoost makes predictions once the trees are made.
NOTE: This StatQuest is based on the original CatBoost manuscript... arxiv.org/abs/1706.09516
...and an example provided in the CatBoost documentation...
catboost.ai/en/docs/concepts/...
English / Spanish / Portuguese
This video has been dubbed using an artificial voice via aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
KZread Membership: / @statquest
...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
/ joshuastarmer
0:00 Awesome song and introduction
1:10 Building the first tree
6:05 Quantifying the effectiveness of the first threshold
6:56 Testing a second threshold
9:05 Building the second tree
10:21 The main idea of how CatBoost works
12:15 Making predictions
13:02 Symmetric Decision Trees
14:56 Summary of the main ideas
Corrections:
2:05 Red should have gone into bin 0 instead of bin 1.
7:23 I should have said that the cosine similarity was 0.71.
#StatQuest #CatBoost #DubbedWithAloud

Comments: 85

  • @statquest · 1 year ago

    NOTE: At 7:23 I should have said that the cosine similarity was 0.71. To learn more about Lightning: lightning.ai/ Support StatQuest by buying my book, The StatQuest Illustrated Guide to Machine Learning, or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @sahilpalsaniya724 · 1 year ago

    "BAM" and its variants are stuck in my head. Every time I solve a problem, my head plays your voice.

  • @statquest · 1 year ago

    bam! :)

  • @Monkey_uho · 1 year ago

    Awesome work! I've been watching a lot of your videos to understand the basic ML algorithms, keep it up! Thank you for taking the time and the energy to spread knowledge with others. Also, I would like to say that, like others, I would also love a video explaining the concepts behind LightGBM.

  • @statquest · 1 year ago

    Thank you! And one day I hope to do LightGBM

  • @weipenghu4463 · 11 months ago

    looking forward to it❤

  • @aakashdusane · 1 month ago

    Not gonna lie, CatBoost's nuances were significantly more difficult to understand than any other ensemble model to date, although the basic intuition is pretty straightforward.

  • @statquest · 1 month ago

    It's a weird one for sure.

  • @razielamadorrios7284 · 1 year ago

    Such a great video, Josh! I really enjoyed it. Any chance you could do an explanation of LightGBM? Thanks in advance. Additionally, I'm a huge fan of your work :)

  • @statquest · 1 year ago

    I'll keep that in mind.

  • @Quami111 · 1 year ago

    At 2:09 and 12:40, you assigned the row with height=1.32 to bin=1, even though you said that rows with smaller heights would have bin=0. This doesn't happen at 11:24, where the row with height=1.32 has bin=0, so I guess it is a mistake.

  • @statquest · 1 year ago

    Oops! That was a mistake. 1.32 was supposed to be in bin 0 the whole time.

  • @OscarMartinez-gg5du · 1 year ago

    @@statquest At 1:15, when you create the randomized order for the first tree, the heights also seem to be shuffled relative to their corresponding Favorite Colors, and that changes the examples for the creation of the stumps. However, the explanation is very clear. I love your videos!!

  • @LL-hj8yh · 8 months ago

    Hey Josh, thanks as always! Are you planning to roll out LightGBM videos as well?

  • @statquest · 8 months ago

  • @TheDataScienceChannel · 1 year ago

    As always, a great video. I was wondering if you intend to add a code tutorial as well?

  • @statquest · 1 year ago

    I'll keep it in mind!

  • @asmaain5856 · 1 year ago

    @@statquest Please make it soon, I reaaaaally need it!

  • @rikki146 · 1 year ago

    APIs for shallow models are mostly similar :\

  • @user-lu5ds2qp2f · 1 year ago

    Big Fan !! 🙌

  • @statquest · 1 year ago

    Thanks!

  • @near_. · 1 year ago

    Awesome. I'm your new subscriber 🙂

  • @statquest · 1 year ago

    Thank you! :)

  • @rishabhsoni · 6 months ago

    Great video. One question: is the intuition behind using a high cosine similarity to pick a threshold that, since we add the scaled leaf outputs to create predictions, leaf outputs that are closer to the residuals move us in the right direction (the residuals represent how far away we are from the actual target)? Usually we minimize the residuals, which more or less means maximizing similarity with the target.

  • @statquest · 6 months ago

    I think that is correct. A high similarity means the output value is close to the residuals, so we're moving in the right direction.
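
    To make that intuition concrete, here is a minimal sketch (not CatBoost's actual code; the feature values, residuals, and threshold are all made up) that scores a candidate threshold by the cosine similarity between the residuals and the leaf outputs each row would receive:

    ```python
    import numpy as np

    def cosine_similarity(a, b):
        # Cosine similarity: a . b / (|a| * |b|)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def score_split(feature, residuals, threshold):
        # Send each row to a leaf, then use the mean residual in each
        # leaf as that leaf's output value.
        left = feature < threshold
        outputs = np.empty_like(residuals)
        outputs[left] = residuals[left].mean()
        outputs[~left] = residuals[~left].mean()
        # A split scores well when its leaf outputs point in the same
        # direction as the residuals they are meant to approximate.
        return cosine_similarity(outputs, residuals)

    # Hypothetical encoded feature values and their residuals
    feature = np.array([0.05, 0.25, 0.52, 0.78])
    residuals = np.array([-1.2, -0.4, 0.9, 1.8])
    print(score_split(feature, residuals, threshold=0.4))
    ```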

  • @rishabhsoni · 6 months ago

    But one question that comes to mind: cosine similarity is based on the L2 norm, i.e., Euclidean distance. Wouldn't the number of rows of data act as the dimension in this case and cause weird output due to the curse of dimensionality?

  • @nitinsiwach1989 · 5 months ago

    What do bins have to do with the Ordered Target Encoding computation you mentioned at 11:26? In the video, you mentioned one use case for the bins, which is to reduce the number of thresholds tested, like other gradient boosting methods do.

  • @statquest · 5 months ago

    The bins are used to give us a discrete target value for Ordered Target Encoding (since it doesn't work directly with a continuous target). For details, see: kzread.info/dash/bejne/fYyDtrWkgK-YiJc.html
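
    For example, here is a minimal sketch of that discretization, assuming a simple two-bin split at the median (the heights are made up, and CatBoost's actual border selection is configurable):

    ```python
    import numpy as np

    # Hypothetical continuous target values (heights)
    heights = np.array([1.32, 1.56, 1.72, 1.83, 1.91])

    # Discretize the target into two bins around its median so that
    # Ordered Target Encoding has a discrete 0/1 value to count.
    bins = (heights > np.median(heights)).astype(int)
    print(bins)  # [0 0 0 1 1]
    ```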

  • @nitinsiwach1989 · 4 months ago

    Hello Josh, thank you for your amazing channel! In the catboost package, why do we have both 'depth' and 'max_leaves' as parameters? One would think that, since the trees here are oblivious, the two are deterministically related. Can you shed some light on this?

  • @statquest · 4 months ago

    That's a good question. Unfortunately, there have been a lot of changes to CatBoost since it was originally published, and it's hard to get answers about what's going on.

  • @reynardryanda245 · 1 year ago

    12:41 How did you get the OptionCount for the prediction? I thought it was the number of times that color appears in that bin as we go through the rows sequentially. But for a prediction, we don't know the actual bin, right?

  • @statquest · 1 year ago

    At 12:41, we are trying to predict the height of someone who likes the color blue. So, in order to change "blue" into a number, we look at the training data on the left, which has two rows with the color blue in it. One of those rows is in Bin 0 and the other is in Bin 1. Thus, to get the option count for "blue", we add 0 + 1 = 1. In other words, the option count for the new observation is derived entirely from the training dataset.
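
    A minimal sketch of that encoding for new data (the training rows are illustrative, and the prior of 0.05 matches the value used in the StatQuest examples; CatBoost's prior is configurable):

    ```python
    # Training data: (color, bin) pairs, from the training set only
    training = [("red", 1), ("blue", 0), ("green", 1), ("blue", 1)]

    def encode_new(color, training, prior=0.05):
        # For new data, we use ALL training rows with the same color.
        bins = [b for c, b in training if c == color]
        option_count = sum(bins)  # matching rows that are in bin 1
        n = len(bins)             # matching rows overall
        return (option_count + prior) / (n + 1)

    # "blue" appears twice in training: once in bin 0, once in bin 1,
    # so OptionCount = 0 + 1 = 1 and n = 2.
    print(encode_new("blue", training))  # (1 + 0.05) / (2 + 1) = 0.35
    ```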

  • @user-fi2vi9lo2c · 8 months ago

    Dear Josh, I have a question about using CatBoost for Classification. In this video, which tells us about using CatBoost for Regression, we calculated the output values for a leaf as the average of the residuals in the leaf. How do we calculate the output value for Classification? Do we use the same formula as for Gradient Boosting? I mean, (Sum of residuals) in the numerator and Sum of (Previous Probability(i) * (1 - Previous Probability(i))) in the denominator.

  • @statquest · 8 months ago

    CatBoost is, fundamentally, based on Gradient Boost, which does classification by converting the target into a log(odds) value and then treating it like a regression problem. For details, see: kzread.info/dash/bejne/nKypsK6BZce-c9Y.html
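
    A minimal sketch of that conversion (the 0/1 targets are hypothetical, and this mirrors standard Gradient Boost rather than CatBoost-specific code):

    ```python
    import math

    # Hypothetical binary targets
    targets = [0, 1, 1, 1]

    # The initial prediction is the log(odds) of the positive class.
    p = sum(targets) / len(targets)   # 0.75
    log_odds = math.log(p / (1 - p))  # ~1.10

    # Residuals are computed on the probability scale, and trees are
    # then fit to them just like in regression.
    residuals = [t - p for t in targets]
    print(log_odds, residuals)
    ```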

  • @serdargundogdu7899 · 9 months ago

    I wish you could replay this part again :)

  • @statquest · 9 months ago

    :)

  • @aryanshrajsaxena6961 · 1 month ago

    Will we use k-fold target encoding for the case of more than 2 bins?

  • @statquest · 1 month ago

    I believe that is correct.

  • @alphatyad8131 · 8 months ago

    Excuse me again, Dr. Starmer. Do you know how CatBoost decides on the final set of trees (I mean, out of the many gradient boosted trees that CatBoost builds) before they become a rule that can predict new data? I haven't found a source that gives an explicit explanation of how CatBoost builds its decision trees until they can be used to predict. Thanks in advance, Dr. (Or, for anyone who knows, I would appreciate your help.)

  • @statquest · 8 months ago

    You build a bunch of trees and see if the predictions have stopped improving. If so, then you are done. If not, and it looks like the general trend is to continue improving, then build more trees.
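
    In practice, this is typically handled with early stopping on a validation set. A minimal sketch using the catboost package (the data and parameter values here are hypothetical):

    ```python
    from catboost import CatBoostRegressor

    # Hypothetical tiny dataset, split into training and validation
    X_train = [[1.2], [0.7], [1.9], [0.4]]
    y_train = [1.1, 0.6, 1.8, 0.3]
    X_val = [[1.0], [0.5]]
    y_val = [0.9, 0.4]

    model = CatBoostRegressor(iterations=1000, learning_rate=0.1, verbose=False)

    # Stop building trees once the validation score has not improved
    # for 50 consecutive iterations.
    model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
    print(model.tree_count_)  # the number of trees actually kept
    ```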

  • @alphatyad8131 · 8 months ago

    I got it and really appreciate it, Dr. If I could ask again: is it safe to say that CatBoost is similar to XGBoost in the way it chooses features for building the trees and, in this case, determining the classification class for the given data?

  • @statquest · 8 months ago

    @@alphatyad8131 They're pretty different. To learn more about XGBoost, see: kzread.info/dash/bejne/gah4mtmPkanTZqg.html and kzread.info/dash/bejne/apZlrKd9psjUgbg.html

  • @alphatyad8131 · 8 months ago

    @@statquest Great explanation, Dr. Josh Starmer. Actually, I am still learning by watching your videos on Machine Learning. I appreciate it; I no longer feel stuck in the same place as before, thanks to your help. Have a nice day, Dr.

  • @user-hv2lq3yt4w · 8 months ago

    Thanks a lot~ I'm looking for an answer! For the new data whose "Favorite Color" is blue, why does it belong to bin 0 instead of bin 1?

  • @statquest · 8 months ago

    The new data is not assigned to a bin at all. We just use the old bin numbers associated with the Training Data (and only the training data) to convert the color, "blue", to a number. The bin numbers in the training data are used for the sum of the 1's for the numerator.

  • @user-hv2lq3yt4w · 8 months ago

    @@statquest I misunderstood, sorry~ For new data whose "Favorite Color" is blue, we use all the training rows with the same color, "blue", which is where OptionCount and n come from.

  • @statquest · 8 months ago

    @@user-hv2lq3yt4w yep

  • @danieleboch3224 · 3 months ago

    I have a question about leaf outputs. Don't gradient boosting algorithms on trees build a new tree all the way down and, after that, assign values to its leaves? You did it iteratively instead, calculating outputs before the tree was fully built.

  • @statquest · 2 months ago

    As you can see in this video, not all gradient boosting algorithms with trees do things the same way. In this case, the trees are built differently, and this is done to avoid leakage.

  • @danieleboch3224 · 2 months ago

    @@statquest Thanks, I got it now! But I have another question: in the CatBoost documentation there is a leaf estimation parameter (set to "Newton"), which is odd, since Newton's method is exactly the method used to find leaf values in XGBoost; it uses the second derivative of the loss function and builds the tree according to an information criterion based on that method. Why would we need that if we already build trees in the ordered way, finding the best split with the cosine similarity function?

  • @statquest · 2 months ago

    @@danieleboch3224 To be honest, I can only speculate about this. My guess is that they started to play around with different leaf estimation methods and found that the one XGBoost uses works better than the one they originally came up with. Unfortunately, the "theory" of CatBoost seems to be quite different from how it works in practice, and this is very disappointing to me.

  • @alexpowell-perry2233 · 8 months ago

    At 11:48, when you are calculating the output values of the second tree, the residual for the 3rd record, with a Favorite Color value of 0.525 and a Residual of 1.81, gets sent down the LHS leaf, even though the LHS leaf contains Residuals that are

  • @statquest · 8 months ago

    Oops! That's a mistake. Sorry for the confusion!

  • @frischidn3869 · 11 months ago

    What will the residuals and leaf output be when it is a multiclass classification?

  • @statquest · 11 months ago

    Presumably it's log likelihoods from cross entropy. I don't show how this works with CatBoost, but I show how it works with Neural Networks here: kzread.info/dash/bejne/aHWmtdusZdSucbg.html

  • @alexpowell-perry2233 · 8 months ago

    How does CatBoost decide on the best split at level 2 in the tree if it has to be symmetric? What if the best threshold for the LHS node is different from the best threshold for the RHS node?

  • @statquest · 8 months ago

    It finds the best threshold given that it has to be the same for all nodes at that level. Compared to how a normal tree is created, this is not optimal. However, the point is not to make an optimal tree, but instead to create a "weak learner" so that we can combine a lot of them to build something that is good at making predictions. Pretty much all "boosting" methods do something to make the trees a little worse at predicting on their own because trees are notorious for overfitting the training data. By making the trees a little worse, they prevent overfitting.
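
    As an aside, the symmetry is also what makes these trees so fast to evaluate: because every node at a level shares one threshold, a prediction is just a bit index into the leaf array. A minimal sketch (the features, thresholds, and leaf values are made up):

    ```python
    # One (feature index, threshold) pair per level of the tree
    levels = [(0, 0.5), (1, 0.87)]

    # 2**depth leaf values, indexed by the bits from each level's test
    leaf_values = [-1.4, -0.2, 0.3, 1.7]

    def predict(row):
        # Each level contributes one bit; together the bits index a leaf.
        index = 0
        for bit, (feature, threshold) in enumerate(levels):
            if row[feature] > threshold:
                index |= 1 << bit
        return leaf_values[index]

    print(predict([0.7, 0.4]))  # bits: 1, 0 -> index 1 -> -0.2
    ```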

  • @alexpowell-perry2233 · 8 months ago

    @@statquest Thanks so much for the reply, but I still don't quite understand this. So does each LEVEL get a similarity score? I don't understand how you can quantify a threshold when it is being applied to more than one node in the tree. In your example you showed us how to calculate the cosine similarity for a split that is applied to just one node; how do we calculate this when it's applied to two nodes simultaneously (in the case of a level-2 split)? I also have one more question: since the tree must be symmetric, I am assuming that a characteristic (in the case of your example, "Favourite Film") can only ever appear in a tree once?

  • @statquest · 8 months ago

    @@alexpowell-perry2233 In the video I show how the cosine similarity is calculated using 2 leaves. Adding more leaves doesn't change the process. Regardless of how many leaves are on a level, we calculate the cosine similarity between the residuals and the predictions for all of the data. And yes, a feature will not be used if it can no longer split the data into smaller groups.

  • @sanukurien2752 · 2 months ago

    What happens at inference time, when the target is not available? How are the categorical variables encoded then?

  • @statquest · 2 months ago

    You use the full training dataset to encode the new data.

  • @Mark_mochi · 6 months ago

    In 8:25, why does the threshold change to 0.87 all of a sudden?

  • @statquest · 6 months ago

    Oops. That looks like a typo.

  • @yufuzhang1187 · 1 year ago

    Dr. Starmer, when you have a chance, can you please make videos on LightGBM, which is quite popular these days? Also, can you do ChatGPT or GPT or Transformers, clearly explained? Thank you so much!

  • @statquest · 1 year ago

    I'm working on Transformers right now.

  • @yufuzhang1187 · 1 year ago

    @@statquest Thank you so much! Looking forward!

  • @xaviernogueira · 1 year ago

    @@statquest Excited for that!

  • @user-zq4cv6yn8u · 8 months ago

    Thank you for your content! It's very nice, everything is clear. I hope you won't stop producing your content :)

  • @statquest · 8 months ago

    Thank you!

  • @serdargundogdu7899 · 9 months ago

    How was "favorite color < 29" changed into "favorite color < 0.87" at 8:28? Could you please explain?

  • @statquest · 9 months ago

    That's just a horrible and embarrassing typo. :( It should be 0.29.

  • @alphatyad8131 · 1 year ago

    Dr. Starmer, I tried to calculate it manually, and with a calculator, several times, but my result was different from the one at 7:23. I get 0.7368, but the video shows 0.79. Am I missing something? Does anyone get the same result as me?

  • @statquest · 1 year ago

    That's just a typo in the video. Sorry for the confusion.

  • @alphatyad8131 · 1 year ago

    Okay. Thank you for your attention and the great explanation, Dr. Josh Starmer. Such an honor and my pleasure to contribute to this video. Have a great day, Dr.

  • @recklesspanda8669 · 11 months ago

    Does it still work like that if I use classification?

  • @statquest · 11 months ago

    I believe classification is just like classification for standard Gradient Boost: kzread.info/dash/bejne/nKypsK6BZce-c9Y.html

  • @recklesspanda8669 · 11 months ago

    @@statquest thank you🤗

  • @YUWANG-du4pv · 1 year ago

    Dr. Starmer, could you explain LightGBM? 🤩

  • @statquest · 1 year ago

    I'll keep that in mind.

  • @TrusePkay · 1 year ago

    Do a video on LightGBM

  • @statquest · 1 year ago

    I'll keep that in mind.

  • @nilaymandal2408 · 5 months ago

    5:28

  • @statquest · 5 months ago

    A good moment.

  • @TheDankGoat · 8 months ago

    obnoxious, arrogant, has mistakes, but useful....

  • @statquest · 8 months ago

    What parts do you think are obnoxious? What parts are arrogant? And what time points, in minutes and seconds, are mistakes? (The mistakes I might be able to correct, or at least have a note that mentions them.)