13.4.5 Sequential Feature Selection -- Code Examples (L13: Feature Selection)

Science & Technology

This final video in the "Feature Selection" series shows you how to use Sequential Feature Selection in Python using both mlxtend and scikit-learn.
Jupyter notebook: github.com/rasbt/stat451-mach...
Timestamps:
00:00 Dataset setup and KNN baseline
04:08 Selecting the best 5 features
10:18 Inspecting the results
13:40 Selecting the best subset of any size
17:29 Exhaustive search
21:12 Sequential feature selection in scikit-learn
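For reference, a minimal sketch of the workflow the timestamps outline (the dataset and hyperparameters here are illustrative assumptions, not necessarily those used in the notebook):

    # Minimal sketch of the steps above; the wine dataset and k=5 are assumptions.
    from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.datasets import load_wine
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123, stratify=y)
    knn = KNeighborsClassifier(n_neighbors=5)

    # 00:00 -- KNN baseline on all features
    print(cross_val_score(knn, X_train, y_train, cv=5).mean())

    # 04:08 -- select the best 5 features via forward selection (mlxtend)
    sfs5 = SFS(knn, k_features=5, forward=True, cv=5).fit(X_train, y_train)
    print(sfs5.k_feature_idx_, sfs5.k_score_)  # 10:18 -- inspect the results

    # 13:40 -- select the best-performing subset of any size
    sfs_best = SFS(knn, k_features="best", cv=5).fit(X_train, y_train)

    # 17:29 -- exhaustive search over all subsets of up to 5 features
    efs = EFS(knn, min_features=1, max_features=5, cv=5).fit(X_train, y_train)

    # 21:12 -- scikit-learn's implementation
    skl_sfs = SequentialFeatureSelector(knn, n_features_to_select=5, cv=5)
    skl_sfs.fit(X_train, y_train)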
-------
This video is part of my Introduction to Machine Learning course.
The complete playlist: • Intro to Machine Learn...
A handy overview page with links to the materials: sebastianraschka.com/blog/202...
-------
If you want to be notified about future videos, please consider subscribing to my channel: / sebastianraschka

Comments: 34

  • @118bone · 24 days ago

    This was a great help; I was just trying to determine the best FS method for my dataset. I've now subscribed, and I'm looking forward to checking out all the videos in the playlist. Thank you!

  • @Hefebatzen · 2 years ago

    Thank you very much for the video. An easier formula for the number of exhaustive-search runs (the sum of m choose k over k = 1, ..., m) is 2^m - 1. Cheers
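    A quick sanity check of that identity (editor's illustration, using m = 13 features as an example):

        from math import comb

        m = 13  # number of features
        # the sum of m-choose-k for k = 1..m equals 2**m - 1 nonempty subsets
        assert sum(comb(m, k) for k in range(1, m + 1)) == 2**m - 1  # 8191 runs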

  • @tririzki7025 · 1 year ago

    Thank you so much. Your videos are really helpful!

  • @abhishek-shrm · 2 years ago

    Thanks, Sebastian, for making this playlist!

  • @SebastianRaschka · 2 years ago

    Glad to hear you are liking it!

  • @abhishek-shrm · 2 years ago

    @SebastianRaschka What should be the next steps after completing this playlist?

  • @SebastianRaschka · 2 years ago

    @abhishek-shrm That's a good question. I'd say focus on a personal project and apply those techniques, or try Kaggle. While doing that, you'll automatically want to learn more specific things, look up specific papers, and so forth. I think working on a project is a really worthwhile learning experience.

  • @shivampratap8863 · 1 year ago

    Mr. Raschka, please complete the series. Lots of love from India!

  • @rawadalabboud1489 · 2 years ago

    Thank you for this playlist; it was very helpful, and everything is very well explained! Are the feature extraction videos posted yet?

  • @SebastianRaschka · 2 years ago

    Thanks! Regarding the feature extraction videos, I unfortunately ran out of time. Hopefully one day, though!

  • @sebastianlarsen6713 · 11 months ago

    Thanks for the great content. I find that the SFS method can overfit on CV, as the selected features are sensitive to the random seed when the number of features N is large (say N > 100). Is it possible to do nested CV, where you'd perform SFS on a subset of the data (train, val) and then evaluate generalisation on the test data? In that case, a 5-fold outer CV would yield 5 different selected feature sets. Are there any approaches to then select the top features?
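    One way to sketch the nested-CV idea from this question (editor's illustration, not an answer from the thread): run SFS inside each outer fold and count how often each feature is selected; features picked in most folds are the more stable choices.

        from collections import Counter

        from mlxtend.feature_selection import SequentialFeatureSelector as SFS
        from sklearn.datasets import load_wine
        from sklearn.model_selection import StratifiedKFold
        from sklearn.neighbors import KNeighborsClassifier

        X, y = load_wine(return_X_y=True)
        counts = Counter()

        outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
        for train_idx, test_idx in outer.split(X, y):
            # inner CV happens only on the outer-train fold
            sfs = SFS(KNeighborsClassifier(n_neighbors=5), k_features=5, cv=3)
            sfs.fit(X[train_idx], y[train_idx])
            counts.update(sfs.k_feature_idx_)
            # evaluate generalisation on X[test_idx], y[test_idx] here if desired

        print(counts.most_common())  # how often each feature index was selected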

  • @olivernolte1376 · 1 year ago

    Dear Sebastian, thank you so much for the video! I was wondering whether there is a way to use a scikit-learn Pipeline as the estimator for the SequentialFeatureSelector? My pipeline consists only of a ColumnTransformer (one-hot encoder) and a StandardScaler, and I can't seem to get it to work. An answer would be much appreciated!
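    A possible explanation and sketch (editor's note, not a reply from the thread): the selector passes positional NumPy subsets of the features to the estimator, so a ColumnTransformer keyed on column names can fail. A pipeline whose steps operate positionally does work, and preprocessing is then refitted inside each CV fold:

        from sklearn.datasets import load_iris
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = load_iris(return_X_y=True)

        # Scaling happens inside each CV fold, so there is no leakage.
        pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

        sfs = SequentialFeatureSelector(pipe, n_features_to_select=2, cv=5)
        sfs.fit(X, y)
        print(sfs.get_support())  # boolean mask of the selected features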

  • @bjoernaagaard · 1 year ago

    Thanks for the video. How does one determine the best estimator for the SFS object? I'm using it for a regression problem, but I have a hard time determining whether to use LASSO, a random forest, or something else without any source to support the choice. All the best!

  • @mariia.morello · 9 months ago

    Hello, Sebastian, I watched all the lectures and I wanted to thank you for the course! Would it be possible for you to share the lecture notes? The link you posted under one of the previous videos does not seem to work.

  • @kshitijshekhar1144 · 2 years ago

    Could you please finish series 14, 15, and 16? Loved the playlist.

  • @SebastianRaschka · 2 years ago

    Glad to hear you like it! I am currently juggling too many projects, but I hope I can get to it some day!

  • @nicob95 · 2 years ago

    Amazing content, Sebastian! Your explanations in all your videos are amazingly clear :) One short question, however: it seems that both presented feature selection functions base the selection on scores from the training set and do not make use of a validation/test set. Is this right, or am I overlooking something? Wouldn't it be better to base feature selection on scores from a test set, to avoid overfitting?

  • @SebastianRaschka · 2 years ago

    Yes and no. Yes, the selection is based on the training set, but it is using k-fold cross-validation. E.g., if you set cv=5, five classifiers are fitted for each feature combination, and the score is the average over the 5 validation folds. You can of course override this and use the regular holdout method, as you would in GridSearchCV etc., by providing a custom object to the cv argument.
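    A minimal sketch of that answer (editor's illustration, assuming mlxtend forwards the cv argument to scikit-learn's cross-validation utilities, which accept splitter objects):

        from mlxtend.feature_selection import SequentialFeatureSelector as SFS
        from sklearn.model_selection import ShuffleSplit
        from sklearn.neighbors import KNeighborsClassifier

        knn = KNeighborsClassifier(n_neighbors=5)

        # Default: each candidate subset is scored as the mean over 5 folds.
        sfs_cv = SFS(knn, k_features=3, cv=5)

        # Holdout instead: a single 80/20 train/validation split per subset.
        holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
        sfs_holdout = SFS(knn, k_features=3, cv=holdout)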

  • @nicob95 · 2 years ago

    @SebastianRaschka Thanks, Sebastian, this is clear now!

  • @jingyiwang5113 · 1 year ago

    Thank you so much for this video! It is really helpful! May I ask what this score is? Is it a p-value? I am grateful for your reply!

  • @1234569312 · 11 months ago

    Kindly update the playlist. There is no video on feature selection that you mentioned in this video. Thanks for your playlist; it's really helpful!

  • @user-me6rj1ut4v · 1 year ago

    Happy New Year! Thank you for the very interesting video. I have an applied question about SFS from mlxtend. KNN is a simple enough algorithm that has no stopping mechanism. I want to use XGBClassifier, which is prone to overfitting, so I usually stop it early using a validation sample. How can I stop my XGB classifiers during the SFS procedure? On the one hand, if I use a small number of trees, I will have underfitted models, which won't allow choosing the optimal set of features. On the other hand, if I use too large a number of trees, many models will be overfitted and the choice of features will again be incorrect (such a procedure is subject to large fluctuations). Thus I need some mechanism to stop each XGB classifier at the right moment independently. Can you advise something useful to resolve this issue?
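    One possible workaround (editor's sketch, not a reply from the thread; untested, and the early-stopping API differs across XGBoost versions): wrap XGBClassifier in a small estimator that carves off its own validation split inside fit() and early-stops there, so every classifier fitted during SFS stops independently.

        from sklearn.base import BaseEstimator, ClassifierMixin
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier

        class EarlyStoppingXGB(BaseEstimator, ClassifierMixin):
            """XGB classifier that early-stops on an internal validation split."""

            def __init__(self, n_estimators=1000, val_size=0.2, random_state=0):
                self.n_estimators = n_estimators
                self.val_size = val_size
                self.random_state = random_state

            def fit(self, X, y):
                X_tr, X_val, y_tr, y_val = train_test_split(
                    X, y, test_size=self.val_size, stratify=y,
                    random_state=self.random_state)
                # early_stopping_rounds is a constructor arg in XGBoost >= 1.6;
                # older versions expect it as a fit() argument instead.
                self.model_ = XGBClassifier(
                    n_estimators=self.n_estimators,
                    early_stopping_rounds=20, eval_metric="logloss")
                self.model_.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                                verbose=False)
                self.classes_ = self.model_.classes_
                return self

            def predict(self, X):
                return self.model_.predict(X)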

  • @azjargalnaranbaatar2712 · 1 year ago

    Great video! Suppose we want to choose 2 features out of, say, 60 features. Although it reduces the accuracy, does it mean that the method would choose the BEST two features, i.e., the pair that is more accurate than any other two-feature combination?

  • @SebastianRaschka · 1 year ago

    Yes, that's exactly right. Among the different 2-feature subsets, it will choose the best one (but that best one might be worse than the 60-feature subset).

  • @azjargalnaranbaatar2712 · 1 year ago

    @SebastianRaschka Okay, thank you!

  • @ntust_quangnguyen2371 · 2 years ago

    First of all, thank you very much for your understandable video! However, your case is a classification problem; I wonder how to use this for a regression problem. Can I keep KNeighborsClassifier, or should I replace it with another estimator? And is there any way to display run progress in scikit-learn (like verbose in mlxtend)? I have a very large input (tens of thousands of features), so I want to know how much time it takes.

  • @SebastianRaschka · 2 years ago

    Yes, you can replace it with any other scikit-learn estimator for classification or regression. E.g., you could use a RandomForestRegressor or anything you like, really. Btw, if you use a scikit-learn regressor, it will automatically use the R2 score for scoring, as is done in other scikit-learn methods. However, you can of course change it to something else; there is a "scoring" parameter for that, similar to GridSearchCV. Also, you can show the progress via the "verbose" parameter: verbose : int (default: 0), level of verbosity to use in logging. If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamp and CV scores at each step.
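    A minimal sketch of that answer (editor's illustration, with a synthetic dataset):

        from mlxtend.feature_selection import SequentialFeatureSelector as SFS
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor

        X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                               random_state=123)

        sfs = SFS(
            RandomForestRegressor(n_estimators=100, random_state=123),
            k_features=5,
            scoring="neg_mean_squared_error",  # default for regressors is R^2
            cv=5,
            verbose=2,  # detailed logging incl. timestamps and CV scores
        ).fit(X, y)
        print(sfs.k_feature_idx_)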

  • @rahulpaul9432 · 1 year ago

    Thanks again for such an informative and exhaustive video explanation! I had one query on SequentialFeatureSelector in sklearn.feature_selection. Unlike mlxtend, in sklearn there's no option for using "best" as the number of features to be selected. Instead, there's an "auto" option, which has to be used with another parameter, tol: n_features_to_select: "auto", int or float, default="warn". If "auto", the behaviour depends on the tol parameter: if tol is not None, features are selected until the score improvement does not exceed tol; otherwise, half of the features are selected. I am using this in a regression setting with scoring='neg_mean_squared_error'; any suggestion on how this can be used to get the best features as we get in mlxtend?

  • @SebastianRaschka · 1 year ago

    That's a good question, and I don't know the answer to that, sorry. Actually, I have only very limited experience with the sklearn implementation, since I typically use the original SFS in mlxtend.

  • @zeusgamer5860 · 1 year ago

    @SebastianRaschka No issues! I'll use mlxtend in that case 😊
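    One possible workaround for the sklearn implementation (editor's sketch, not from the thread): with n_features_to_select="auto", features keep being added only while the CV score improves by more than tol, which roughly approximates mlxtend's "best" behavior. With a negative score such as neg_mean_squared_error, the same stopping rule applies to the negated error:

        from sklearn.datasets import make_regression
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.linear_model import Ridge

        X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                               random_state=123)

        sfs = SequentialFeatureSelector(
            Ridge(),
            n_features_to_select="auto",
            tol=1e-3,  # stop once an added feature improves the score by < 1e-3
            scoring="neg_mean_squared_error",
            cv=5,
        ).fit(X, y)
        print(sfs.get_support(indices=True))  # indices of the selected features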

  • @osvaldofigo8698 · 2 years ago

    Can we use more than one scoring method? Let's say we use accuracy and F1 score.

  • @SebastianRaschka · 2 years ago

    No, that's currently not possible. What you can do is provide a custom scoring function (scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) and pass that as input. There you could compute something like the average of accuracy and F1 score.
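    A minimal sketch of that suggestion (editor's illustration):

        from mlxtend.feature_selection import SequentialFeatureSelector as SFS
        from sklearn.metrics import accuracy_score, f1_score, make_scorer
        from sklearn.neighbors import KNeighborsClassifier

        def acc_f1_mean(y_true, y_pred):
            # average of accuracy and (binary) F1 score
            return (accuracy_score(y_true, y_pred) + f1_score(y_true, y_pred)) / 2

        sfs = SFS(
            KNeighborsClassifier(n_neighbors=5),
            k_features=5,
            scoring=make_scorer(acc_f1_mean),
            cv=5,
        )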

  • @joaovitormaiacezario8116 · 1 year ago

    Does the model have to be fitted before it is set inside the SFS?

  • @SebastianRaschka · 1 year ago

    No, it is an unfitted model. It then gets fitted on the training set with the "modified" feature set in each round.
