Julia Silge

Tuning XGBoost using tidymodels

Science and Technology

Learn how to tune hyperparameters for an XGBoost model, using #TidyTuesday data on beach volleyball matches.
You can check out the code here on my blog: juliasilge.com/blog/xgboost-t...

Comments: 84

@kemikao 4 years ago
Your videos are very informative! I love that you take the time to show the data first and explain what the variables are. And the fact that you explain the tidy functions and even repeat a bit of what you said in earlier videos is great! You use just the right amount of detail for me at least. Thank you.
@mygeorgyboy 2 years ago
Very nice example. You show the whole process, which is very illustrative. Thank you, Julia.
@talitabac 3 years ago
Amazing video, super clear! Thank you, Julia!
@mathteacher1729 4 years ago
Thank you so much for this video (and for all your videos). I've been using R for about two or three years and this was just the right amount of detail and exposition for me. Your workflow is clean and easy to follow, I like how you used the help function, and your overall layout is nice too (console in the top right). I look forward to trying XGBoost on some data sets now! :)
@BonifaceMakone 4 years ago
These videos are super informative. Keep them coming. Thanks
@mehdi1270 3 years ago
Thank you so much Julia for all your tutorial videos. They are easy to follow and very informative.......just great! Please keep posting them. I hope you can find some time to post a video on neural network optimization with Keras in R. I can even start a petition for that. LOL
@flachboard84 2 years ago
Very helpful video! I look forward to following this example in a future project!
@davidjackson7675 4 years ago
I always learn something from your videos.
@haraldurkarlsson1147 3 years ago
Very nice presentation of xgboost by the way.
@badrGamer11 2 years ago
Always amazing content, thank you.
@luisfernandocuestasanchez4343 3 years ago
You are the most amazing person I've ever come across. Thanks a lot. Blessings =)
@edneideramalho2363 11 months ago
You are the best!
@erickcohen1876 4 years ago
Hi Julia, this video was amazing and very informative! Would you be able to help us find resources for (or post a video about :) ) the math behind these models, i.e. gradient descent for XGBoost models? Thank you very much for posting these videos! I am learning a ton!
@haraldurkarlsson1147 3 years ago
Julia, I ran this model on a new Mac mini and it produced results in about 7 minutes. Much faster than my old Mac desktop, which I did not dare run it on.
@PA_hunter
2 years ago
Similar time here
@alanjiang2930 3 years ago
Watched more than half of your videos within one week. Don't even want to blink! Saw you plotted XGB importance - wonder if there is a tidymodels way to plot SHAP values from XGB. Thanks, Julia!
@JuliaSilge
3 years ago
If you are only doing xgboost, you might try the SHAPforxgboost package: cran.r-project.org/package=SHAPforxgboost (it takes a bit of munging the model to get it to work with that package) For modeling in general, I like DALEX for explainability, which also supports tidymodels: modeloriented.github.io/DALEXtra/reference/explain_tidymodels.html We have a chapter in process on explainability in our upcoming book, so keep your eyes out for that: www.tmwr.org/
@alanjiang2930
3 years ago
@@JuliaSilge Got it. Thanks for the direction! Again, amazing video series! Really really tidy.
@JoseAyerdis 4 years ago
If RStudio crashes with an error like "Initializing libomp.dylib, but found libomp.dylib already initialized" when you fit the final workflow, you can use this workaround on OSX: Sys.setenv(KMP_DUPLICATE_LIB_OK = TRUE)
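A minimal sketch of that workaround, using only base R. The assumption here is that the variable is set at the top of the session, before xgboost is loaded or any model is fit. The OpenMP runtime's own error message describes this switch as an unsafe, unsupported workaround, so treat it as a stopgap rather than a fix.

```r
# Workaround from the comment above: tolerate a duplicate OpenMP runtime.
# Run this early in the session, before loading xgboost or fitting a model.
Sys.setenv(KMP_DUPLICATE_LIB_OK = "TRUE")

# Check that the variable is visible to the current R session.
omp_flag <- Sys.getenv("KMP_DUPLICATE_LIB_OK")
print(omp_flag)
```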
@geilin2394 4 years ago
These vids are great. Can we see a classification model with calibration curves, and then recalibrate it, within the tidymodels framework? How long did the hyperparameter tuning take here?
@haraldurkarlsson1147 3 years ago
I should mention that the mini ran this quietly and I heard no noise from an overworked fan. The unit is also cool to the touch.
@raminziaei6411 4 years ago
Thanks a lot Julia. I really love your videos. Do you have any plans for making a video on neural network and tuning it in tidymodels? That would be awesome if possible. Please continue these videos. They are really great. Cheers
@faiazrummankhan5589 3 years ago
All your videos are such a great learning resource for real-world EDA and modelling. I was just wondering what theme you are using in RStudio?
@JuliaSilge
3 years ago
It's one of the themes available via rsthemes: www.garrickadenbuie.com/project/rsthemes/
@lucaskramer438 4 years ago
Great explanation, but I have one question: when you call last_fit() you make use of your split object. In my particular case I was only provided with the train and test sets initially, so I don't have a split object. Is there any way to call last_fit() nevertheless? Thanks!
@JuliaSilge
4 years ago
You can't call last_fit() directly if you don't have the split, but you *can* manually do what it is a wrapper for, which is train one last time on the training data and then evaluate one last time on the testing data.
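The manual route described in that reply can be sketched as follows. Everything in the block is an illustrative assumption: the simulated `sim` data, the column names `x1`, `x2`, and `result`, and the use of a plain logistic regression in place of the tuned XGBoost workflow so that the sketch runs without xgboost installed.

```r
library(tidymodels)

# Hypothetical stand-ins for a training set and a test set that arrive as
# two separate data frames, with no rsplit object available.
set.seed(123)
n <- 400
sim <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  result = factor(
    ifelse(x1 + x2 + rnorm(n) > 0, "win", "lose"),
    levels = c("win", "lose")
  )
)
train_df <- sim[1:300, ]
test_df <- sim[301:400, ]

# Placeholder workflow; in the video this would be the finalized xgboost workflow.
wf <- workflow() %>%
  add_formula(result ~ x1 + x2) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Step 1: train one last time on the training data.
final_fit <- fit(wf, data = train_df)

# Step 2: predict once on the testing data (classes and class probabilities).
test_preds <- bind_cols(
  test_df,
  predict(final_fit, new_data = test_df),
  predict(final_fit, new_data = test_df, type = "prob")
)

# Step 3: compute the metrics that last_fit() would otherwise collect.
test_metrics <- bind_rows(
  accuracy(test_preds, truth = result, estimate = .pred_class),
  roc_auc(test_preds, truth = result, .pred_win)
)
test_metrics
```

Because "win" is the first factor level in this simulated data, yardstick's default event level lines up with the `.pred_win` column.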
@amahoela730 3 years ago
Does anyone know how you can save the workflow for later use? I have problems with it since it is not of format 'xgb.booster', whereas using the function saveRDS might result in compatibility issues in case of future package versions.
@gkuleck 11 months ago
Hi Julia! Great video. Have you done a video on multiclass classification? I am struggling to find guidance for this type with text classification. Thanks!!
@JuliaSilge
11 months ago
Check out these two: - juliasilge.com/blog/nber-papers/ - juliasilge.com/blog/multinomial-volcano-eruptions/
@gkuleck
11 months ago
Thank you!
@Matthew-px9nu 4 years ago
Julia, thank you for these great videos, keep it up! Quick question: once using last_fit, if I want to predict on NEW data, what are the workflow steps? last_fit doesn’t really work on new data that wasn’t in the original split. Thank you!
@JuliaSilge
4 years ago
Once you get to last_fit(), check out the objects that are inside of it. One of the columns contains a *fitted model* that can be used on new data. In fact, that fitted model is used on the testing data to compute the metrics!
@Matthew-px9nu
4 years ago
@@JuliaSilge Thank you Julia! Last quick Q, noticed you always process the commands in console from the notebook Rmd, what button do you click to run in console instead of in the notebook?
@JuliaSilge
4 years ago
@@Matthew-px9nu That's probably my most used keyboard shortcut! Ctrl+Shift+Enter for a chunk, Cmd+Enter for a line. In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.
@vincentpepe1064
4 years ago
@@JuliaSilge Hi Julia! Where exactly do I find this? The columns I have are splits, id, .metrics, .notes, .predictions, .workflow. I can't find the fitted model in .workflow either, so I'm not sure where it is. Thanks!
@JuliaSilge
4 years ago
@@vincentpepe1064 The .workflow is a *fitted* workflow at this point. For example, try tidying it or predicting on it. I show how to tidy it here: juliasilge.com/blog/palmer-penguins/
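As a rough illustration of the point made in this thread, the sketch below pulls the fitted workflow out of the `.workflow` column of a `last_fit()` result and predicts on rows that were never part of the split. The simulated data, the column names, the `new_matches` tibble, and the logistic regression standing in for the tuned XGBoost model are all assumptions made so the example is self-contained.

```r
library(tidymodels)

# Hypothetical data standing in for the volleyball matches.
set.seed(123)
n <- 400
sim <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  result = factor(
    ifelse(x1 + x2 + rnorm(n) > 0, "win", "lose"),
    levels = c("win", "lose")
  )
)
sim_split <- initial_split(sim, strata = result)

# Placeholder workflow; the video uses a finalized xgboost workflow here.
wf <- workflow() %>%
  add_formula(result ~ x1 + x2) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# last_fit() trains on the training set and evaluates on the testing set.
final_res <- last_fit(wf, sim_split)

# The .workflow list-column holds the workflow fitted on the training data.
fitted_wf <- final_res$.workflow[[1]]

# Brand-new observations that were not in the original split.
new_matches <- tibble(x1 = c(-1, 0, 2), x2 = c(0.5, 0, 1))

class_preds <- predict(fitted_wf, new_data = new_matches)
prob_preds <- predict(fitted_wf, new_data = new_matches, type = "prob")
bind_cols(new_matches, class_preds, prob_preds)
```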
@briancostello939 4 years ago
Great video! Is there any difference between “pivot_longer” and “gather”? They look identical to me, just with the arguments having different names, but want to make sure I’m not missing something.
@JuliaSilge
4 years ago
You can read this blog post that introduced the pivot verbs: www.tidyverse.org/blog/2019/09/tidyr-1-0-0/
@briancostello939
4 years ago
Julia Silge oh awesome thanks!
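For a quick side-by-side of the two verbs discussed in this thread, here is a small sketch on a made-up `wide` tibble (the column names are hypothetical). It suggests that both calls produce the same rows, with `pivot_longer()` using more explicit argument names and a different default row order. The tidyr documentation describes `gather()` as superseded and recommends `pivot_longer()` for new code.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per match, one column per statistic.
wide <- tibble(
  match_id = 1:3,
  attacks = c(20, 31, 25),
  kills = c(10, 18, 12)
)

# Older verb: key/value naming, columns selected through `...`.
long_gather <- gather(wide, key = "stat", value = "value", -match_id)

# Newer verb: the same reshaping with names_to/values_to.
long_pivot <- pivot_longer(
  wide,
  cols = -match_id,
  names_to = "stat",
  values_to = "value"
)

# gather() stacks column by column and pivot_longer() goes row by row,
# so sort both results before comparing them.
same_rows <- all.equal(
  as.data.frame(arrange(long_gather, match_id, stat)),
  as.data.frame(arrange(long_pivot, match_id, stat))
)
same_rows
```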
@angvl8793 1 year ago
Hi Julia! Great video as always :)! Can I ask you something, please? At around 34:08, if we don't want to use the xgb_grid you are using and instead pass something else to the grid parameter of tune_grid(), let's say grid = 50, is this OK? I mean, is it generally OK to set grid equal to a number? Thank you very much!
@JuliaSilge
1 year ago
Yes, that argument can take a couple of different kinds of values, either a dataframe or an integer value: tune.tidymodels.org/reference/tune_grid.html You can read a bit more about this here: www.tmwr.org/grid-search.html#evaluating-grid
@angvl8793
1 year ago
@@JuliaSilge Thank you again ! :) .
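The two kinds of `grid` value mentioned in this thread can be sketched like this. To keep the example small and runnable without xgboost, it tunes a decision tree (rpart engine) on simulated data; the data, the column names, and the object names `tree_wf`, `folds`, and `tree_grid` are all assumptions, not the objects from the video.

```r
library(tidymodels)

# Hypothetical data and resamples standing in for the volleyball folds.
set.seed(123)
n <- 400
sim <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  result = factor(
    ifelse(x1 + x2 + rnorm(n) > 0, "win", "lose"),
    levels = c("win", "lose")
  )
)
folds <- vfold_cv(sim, v = 3, strata = result)

# A small tunable model used in place of the xgboost specification.
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_formula(result ~ x1 + x2) %>%
  add_model(tree_spec)

# Option 1: an integer. tune_grid() builds a grid with up to that many
# candidate parameter sets on its own.
res_int <- tune_grid(tree_wf, resamples = folds, grid = 5)

# Option 2: a data frame of candidates that you construct yourself.
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 2)
res_df <- tune_grid(tree_wf, resamples = folds, grid = tree_grid)

show_best(res_int, metric = "roc_auc")
```

An integer is convenient for a first pass; a hand-built data frame gives control over which regions of the parameter space are explored.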
@JerryWho49 9 months ago
Great video, thanks. But I’ve got a question. Say, my local computer is too small to fit a model fast enough. How would I train a model in the cloud? Do you have any best practices?
@JuliaSilge
9 months ago
One of the easiest ways to go is to use RStudio on SageMaker: posit.co/blog/getting-started-rstudio-sagemaker/
@Simonsayztaga 4 years ago
Do you have a course on tidymodels?? Video Course or Tutorials?
@JuliaSilge
4 years ago
You can check out this interactive course on tidymodels: supervised-ml-course.netlify.app/
@artathearta
3 years ago
@@JuliaSilge Amazing resource, thank you
@haraldurkarlsson1147 3 years ago
Julia, I was able to follow along and everything looked fine until the final ROC curve. I get a mirror image of your curve. I have combed through the code and found nothing wrong. The confusion matrix outcome is similar to yours, etc. It seems like a systematic error. When I looked at the data that generate the curve, I noticed that my numbers for specificity are somehow switched. While your table starts with a specificity of 1, mine starts at zero, so my values seem more like 1 - specificity. I am puzzled.
@JuliaSilge
3 years ago
You can look at the first comment at the relevant blog post here: juliasilge.com/blog/xgboost-tune-volleyball/ Since I published this blog post, there was a change in yardstick in version 0.0.7: github.com/tidymodels/yardstick/blob/master/NEWS.md#yardstick-007 that changed how to choose which level (win or lose) is the "event". You can change this by using the `event_level` argument for functions like `roc_curve()`: yardstick.tidymodels.org/reference/roc_curve.html
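To see the effect of `event_level` in isolation, here is a small sketch using the `two_class_example` data that ships with yardstick (so the column names `truth`, `Class1`, and `Class2` come from that dataset, not from the volleyball model). It assumes yardstick 0.0.7 or later, where the argument exists.

```r
library(yardstick)

# Example predictions bundled with yardstick: `truth` has levels Class1 and
# Class2, and the columns Class1 / Class2 hold the predicted probabilities.
data(two_class_example, package = "yardstick")

# Default: the first factor level ("Class1") is treated as the event,
# so its probability column is passed.
auc_first <- roc_auc(two_class_example, truth, Class1)

# If the event of interest is the second level, say so explicitly and pass
# the probability column for that level.
auc_second <- roc_auc(two_class_example, truth, Class2, event_level = "second")

# Mismatch: Class1 probabilities scored as if Class2 were the event.
# This is the situation that produces a mirrored ROC curve.
auc_mismatch <- roc_auc(two_class_example, truth, Class1, event_level = "second")

c(
  first = auc_first$.estimate,
  second = auc_second$.estimate,
  mismatch = auc_mismatch$.estimate
)
```

The same `event_level` argument is accepted by `roc_curve()`, so a flipped plot can usually be corrected by matching the probability column to the event level in that call.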
@artathearta 3 years ago
48:44 my autoplot was flipped along the X = Y axis, I wonder why.
@JuliaSilge
3 years ago
It's because of a global change in how yardstick finds the "first" or base level event: juliasilge.com/blog/xgboost-tune-volleyball/#comment-5015180544
@shamsulhoquekhan933 1 year ago
Can someone tell me why we used sample_prop inside the search grid?
@JuliaSilge
1 year ago
It's what proportion of the total available sample is used for modeling within one boosting iteration: dials.tidymodels.org/reference/trees.html#details
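A short sketch of how a proportion-valued parameter can be placed in a grid under the `sample_size` name. The choice of `grid_random()` and of `min_n()` as the companion parameter is an assumption made for brevity; the grid in the video is built differently, and only the `sample_size = sample_prop()` idea is taken from the question above.

```r
library(dials)

# sample_prop() is a dials parameter object for a proportion of the data.
sample_prop()

# Naming the argument `sample_size` makes the grid column match the
# `sample_size` argument of parsnip::boost_tree(), while the values
# themselves are proportions rather than row counts.
set.seed(123)
prop_grid <- grid_random(
  min_n(),
  sample_size = sample_prop(),
  size = 5
)
prop_grid
```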
@deltax7159 3 months ago
What appearance theme are you using here?
@JuliaSilge
3 months ago
I use one of the themes from rsthemes: www.garrickadenbuie.com/project/rsthemes/ I think Oceanic Plus? There are lots of nice ones available in that package.
@dudeadulto 4 years ago
Hi, I'm getting a warning/error: ! Fold01: model 1/20: The `x` argument of `as_tibble.matrix()` must have colum... when the tune_grid function runs. I found in a GitHub issue that it's related to "name repairing". Do you have any idea whether it really affects the results of the tuning process, or whether there's an update/solution for it?
@JuliaSilge
4 years ago
Hmmmm, do you want to make sure all your packages are updated? That sounds like a message from an older version of the packages. If you are still getting that warning, I recommend creating a reprex and posting on RStudio Community: community.rstudio.com/c/ml/15
@dudeadulto
4 years ago
@@JuliaSilge After reading your response, I updated all my packages, and the error still occurs, but the process seems to keep running. I will let it finish and see if it affects the results of tune_grid.
@wecsleyprates3205 3 years ago
Hey Julia, congrats again. This error shows up: xgb_res
@JuliaSilge
3 years ago
You need to *install* xgboost, actually; you don't have the package installed: install.packages("xgboost")
@wecsleyprates3205
3 years ago
@@JuliaSilge Yeah... but I don't know what is happening. When I try to install the xgboost package, it gives an error telling me that xgboost is not available for my R version. My RStudio is the current version.
@JuliaSilge
3 years ago
@@wecsleyprates3205 Ah, a classic problem that folks run into when things get borked! Check out this SO question + answers: stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-for-r-version-x-y-z-wa
@wecsleyprates3205
3 years ago
Thanks @@JuliaSilge... Do you know what the error below means? Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘predict’ for signature ‘"xgb.Booster"’
@JuliaSilge
3 years ago
@@wecsleyprates3205 That sounds like xgboost still isn't getting loaded correctly to me. Could you try creating a reprex showing your problem and posting on RStudio Community? rstd.io/tidymodels-community
@haraldurkarlsson1147 3 years ago
Julia, I do like Markdown but for testing out code I prefer R script simply because I make a lot of mistakes. So I am curious to know why you work in Markdown. Is it so because you have already written and debugged your code and would like to save the lesson in a nicer format?
@JuliaSilge
3 years ago
No, I work in R Markdown regularly. In R I basically am either building package code or I am working in R Markdown. I'm a huge believer in the idea of "literate programming" as a real way to work. I make a lot of mistakes too, but I don't think that reduces the value of combining narrative and code in one document.
@haraldurkarlsson1147
3 years ago
I am working on setting up a class for students in my department and am quite torn on whether to go the Markdown or R script route. Since most of the class work will be around coding and simply learning how to R I am inclined to start with the regular setup (script) and then move on to Markdown later. Thanks.
@JuliaSilge
3 years ago
@@haraldurkarlsson1147 The person I know who has thought the most about this is Mine Çetinkaya-Rundel; you can see one of her resources for teaching here: datasciencebox.org/ She recommends teaching R Markdown to emphasize reproducible analyses.
@haraldurkarlsson1147
3 years ago
I see. Thanks a lot for the tip.
@haraldurkarlsson1147
3 years ago
Julia, I will have a deeper dive into the datasciencebox. However, I will be teaching grad students that should have some inkling of what the basic statistics concepts are. Most have already worked with data, done some data processing, and generated tables and graphs. I would like to teach them R to simplify their lives and give them hopefully a new valuable skill for the current or future work. As grad students the science part is covered.
@tamaraabzhandadze2712 3 years ago
Thank you for the great tutorial. I have been having a problem with a confusion matrix. Namely, when I run the code "final_res_r %>% collect_predictions() %>% roc_curve(dependent_var, .pred_dependent_var) %>% autoplot()", I get the error: Can't subset columns that don't exist. x Column `.pred_dependent_var` doesn't exist. I cannot understand how to solve the problem. What am I doing wrong?
@JuliaSilge
3 years ago
Hmmmm, do you see the column with the predicted class probability in it, after you run `collect_predictions()`? You can check out the documentation for `roc_curve()` here: yardstick.tidymodels.org/reference/roc_curve.html And if you continue to have trouble, I recommend creating a reprex and posting it on RStudio Community: rstd.io/tidymodels-community It's often easier to get help with coding problems in a format like that rather than comments.
@tamaraabzhandadze2712
3 years ago
@@JuliaSilge Dear Julia! Just amazing to read your response :). I have solved that problem :). However, another problem that I could not solve is related to the variable importance. I managed to create a figure, but I cannot get the actual values per variable. I tried varImp(model_name) and xgb.importance(model = model_name), but I am just getting lovely red text, without the results :)
@JuliaSilge
3 years ago
@@tamaraabzhandadze2712 I typically use the vip package for variable importance, as I show in this blog post: juliasilge.com/blog/xgboost-tune-volleyball/
@tamaraabzhandadze2712
3 years ago
@@JuliaSilge Thank you! I actually posted the question there as well :). I read your answer and got the results :). Now I just have to decide on the cutoff for choosing some variables out of ten features. P.S. I did factor analysis as well and could identify 3 variables with good loadings, but there it was a bit easier as there are cutoffs for loadings :). For XGBoost I have no idea what to do :)
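For readers who want the importance scores as a table rather than a plot, one possible route with the vip package is `vi()`, which returns the numbers that `vip()` would draw. The sketch below is an assumption-laden illustration: it uses simulated data, hypothetical column names, and a logistic regression in place of the XGBoost model so that it runs without xgboost installed.

```r
library(tidymodels)
library(vip)

# Hypothetical data: x1 and x2 carry signal, x3 is pure noise.
set.seed(123)
n <- 400
sim <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  result = factor(
    ifelse(2 * x1 + x2 + rnorm(n) > 0, "win", "lose"),
    levels = c("win", "lose")
  )
)

# Placeholder fitted workflow; the video fits a finalized xgboost workflow.
fitted_wf <- workflow() %>%
  add_formula(result ~ x1 + x2 + x3) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = sim)

# Pull out the underlying engine fit and ask vip for a table of scores.
importance_tbl <- fitted_wf %>%
  extract_fit_engine() %>%
  vi()

importance_tbl
```

The resulting tibble can be sorted or filtered like any other data frame. How to pick a cutoff is a modeling judgment that the scores alone do not settle.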
@hansmeiser6078 3 years ago
Is .pred_win = .pred_class ?
@JuliaSilge
3 years ago
No, .pred_win should be a class probability (like a number) and .pred_class should be the predicted class (like the factor level).
@hansmeiser6078
3 years ago
@@JuliaSilge Ah ok, thank you!