Man, it's such a privilege being able to watch stuff like this.
@Biesterable
5 years ago
So true
@TrentTube
4 years ago
I feel the exact same way. I am constantly humbled and thrilled this is available.
@filippovannella4957
5 years ago
This man is one of the best professors I have ever seen. Thanks a lot for this lecture series.
@tarunluthrabk
3 years ago
I searched extensively for good content on machine learning, and by God's grace I found one! Thank you Prof. Weinberger.
@ebiiseo
4 years ago
Your ability to uncover the insights behind all those mathematical formulas is superb. I really like the way you teach. Thank you for uploading this.
@jorgeestebanmendozaortiz873
2 years ago
Due to the Covid crisis the professors at my university went on strike for most of the semester, so my ML class got ruined. Fortunately I found your lectures, and I've been following along over the last months. I have to say this is the most thorough introductory course on ML that I've found out there. Thank you very much, Prof. Kilian, for making your lectures available to everyone. You're working towards a freer and better world by doing so.
@juliocardenas4485
3 years ago
I'm using what I've learned here to try to improve people's lives. I'm a data scientist in healthcare and a former radiology researcher. Thank you for sharing this freely.
@vatsan16
4 years ago
Me: Machine learning is a black box, the math is too abstract, and nothing really makes sense. Professor Weinberger: Hold my beer.
@jachawkvr
4 years ago
I was familiar with these concepts before watching this lecture, but now I feel like I actually understand what bias and variance mean. Thank you so much for explaining these so well!
@yuniyunhaf5767
4 years ago
I can't believe I have reached this point. He shaped the way I think about ML. Best professor!
@deltasun
4 years ago
That's the clearest exposition of the bias-variance decomposition I've ever seen (and I've seen quite a few). By far.
@xwcao1991
3 years ago
Thank you Prof. Weinberger for bringing educational fairness to people from third-world countries like me, who cannot afford to study at one of the world-class universities like Cornell. I wish you health and happiness for your entire life.
@muratcan__22
5 years ago
this video is gold
@MohamedTarek-vt4lb
6 months ago
This is amazing! Bless you, professor Kilian, if you read this.
@psfonseka
5 years ago
This was super helpful for my own classwork. Thank you so much for posting your lectures publicly!
@rajeshs2840
4 years ago
Oh man, hats off to your efforts... It's an amazing lecture.
@kevinshen3221
2 years ago
This is absolutely gold. I was so confused reading An Introduction to Statistical Learning, because they give no explanation of how they get the bias-variance tradeoff, and then I found this!
@sans8119
4 years ago
An amazing lecture!! Makes things very clear.
@jenishah9825
2 years ago
Such videos don't generally come up in YT suggestions. But if you have found it, it is a gold mine!
@crystinaxinyuzhang3621
4 years ago
It's such an amazing lecture! I've never thought of each trained ML model itself as a random variable before, and this is really eye-opening.
@sheikhshafayat6984
3 years ago
I don't usually comment anywhere. But I can't help saying thanks to you. Such great teaching skill!
@taketaxisky
4 years ago
The way the error is decomposed reminds me of the decomposition of the sum of squares in ANOVA into within-group SS and between-group SS, via a similar calculation.
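The ANOVA analogy is exact: the same cross-term cancellation drives both decompositions. A minimal numerical sketch, not from the lecture, with made-up toy groups (the between-group part plays the role of bias squared, the within-group part the role of variance):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three groups with different means; all numbers are illustrative.
groups = [rng.normal(mu, 1.0, 200) for mu in (0.0, 1.5, 3.0)]

values = np.concatenate(groups)
grand_mean = values.mean()

# Total SS: squared deviations from the grand mean.
ss_total = ((values - grand_mean) ** 2).sum()
# Within-group SS: deviations from each group's own mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Between-group SS: group means' deviations from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Same algebra as the lecture's decomposition: the cross term cancels.
assert np.isclose(ss_total, ss_within + ss_between)
```

The identity holds because the cross term sums to zero within each group, exactly as the bias-variance cross term vanishes in expectation.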
@vishnuvardhan6625
4 months ago
Best video on the Bias-Variance Decomposition ❤
@angelocortez5185
2 years ago
These videos popped up on my feed. I didn't realize you wrote the MLKR paper as well. Seeing your videos makes me wish I had taken a formal class with you. Thank you for this content, Kilian!
@haodongzheng7045
2 years ago
Thank you, professor. I feel like I've grown up a little bit after watching your video ;)
@abhyudayasrinet17
5 years ago
A really great explanation
@noblessetech
4 years ago
Awesome video playlist, love it.
@vishchugh
4 years ago
BEST LECTURE ON THE BIAS-VARIANCE TRADEOFF!!!!!!!!!!!!!!!!
@TheAIJokes
2 years ago
You are one of my favourite teachers, sir... love you from India ❤️
@mateuszjaworski2974
7 months ago
It's like a good action movie; you can't wait to see what comes next.
@jordankuzmanovik5297
3 years ago
Wonderful!!... Bravo
@NO_REPLY_ALARM_TOWARD_ME
1 year ago
I think the lecturer always gives the students several minutes to get things clear for themselves, even when a proof step may seem trivial. It may seem difficult, but it is concise to follow. Thanks.
@ashraf736
1 year ago
What a wonderful lecture.
@xmtiaz
2 years ago
This was beautiful.
@florianwicher
3 years ago
It was a little bit slow, but I got it now. Thanks a lot!
@hanseyye1468
3 years ago
Thanks Professor Weinberger. I have one question about 23:28: why do we use the joint distribution p(x,y) here and not a conditional p(y|x), or p(y)*p(x)?
@kilianweinberger698
3 years ago
Because you are drawing x and y randomly, and your data set and algorithm depends on both. You could factor this into first drawing x, then y i.e. P(y|x)P(x), but it really wouldn't change much in the analysis. Hope this helps.
@hanseyye1468
3 years ago
@@kilianweinberger698 thank you so much
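The factoring described in the reply can be checked on a toy discrete distribution. The sketch below (probabilities made up for illustration) draws (x, y) jointly from p(x, y) and via the two-stage P(y|x)P(x) route, and confirms both give the same joint frequencies:

```python
import numpy as np

rng = np.random.default_rng(5)
# A toy joint distribution P(x, y) over x in {0, 1} and y in {0, 1}.
P = np.array([[0.1, 0.3],
              [0.4, 0.2]])  # P[x, y]; values are arbitrary
n = 200_000

# (a) Draw (x, y) jointly from P(x, y).
flat = rng.choice(4, size=n, p=P.ravel())
x_a, y_a = np.unravel_index(flat, P.shape)

# (b) Factored sampling: first x ~ P(x), then y ~ P(y | x).
Px = P.sum(axis=1)                 # marginal P(x)
x_b = rng.choice(2, size=n, p=Px)
p_y1_given_x = P[:, 1] / Px        # P(y = 1 | x)
y_b = (rng.random(n) < p_y1_given_x[x_b]).astype(int)

# Both procedures reproduce the same joint frequencies.
freq_a = np.bincount(2 * x_a + y_a, minlength=4) / n
freq_b = np.bincount(2 * x_b + y_b, minlength=4) / n
assert np.allclose(freq_a, P.ravel(), atol=0.01)
assert np.allclose(freq_b, P.ravel(), atol=0.01)
```

This is why the analysis is unchanged by the factoring: both sampling schemes define the same distribution over (x, y).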
@gauravsinghtanwar4415
4 years ago
What is the need for the probability term in the expected test error expression?
@meenakshisundaram8310
3 years ago
Thank you very much
@utkarshtrehan9128
3 years ago
Enlightenment
@Saganist420
5 years ago
My real-life dart playing skills have high bias, high variance.
@janismednieks1277
3 years ago
"My son is doing that now, he's in second grade." If you're the one teaching him, I believe you. Thanks.
@StevenSarasin
7 months ago
That means the noise also depends on the feature set, so the noise is not necessarily irreducible if you can find new features to include. In the housing price example you would appear to have a lot of noise if you left out a location variable in the feature x! Interesting. So we have reduced the generalization error to: the dependency on D (the variance; will more data improve the situation?), the dependency on the feature set (does there exist a feature set that limits the variance of y itself, averaged given x?), and the bias dependency (are we in principle flexible enough to match the true data pattern, linear vs. non-linear?).
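The point about omitted features inflating the apparent noise can be simulated. Below is a sketch with a made-up housing example (the features, coefficients, and noise level are all illustrative assumptions, not from the lecture): regressing on square footage alone makes the omitted location effect look like noise, while including it recovers the truly irreducible part.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
sqft = rng.uniform(50, 300, n)                   # feature we always observe
location = rng.integers(0, 2, n).astype(float)   # feature we might omit
# True label: the only irreducible noise has variance 1.
price = 2.0 * sqft + 50.0 * location + rng.normal(0.0, 1.0, n)

def residual_variance(X, y):
    # Least-squares fit; leftover variance is the "noise" for this feature set.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ coef)

X_without_location = np.column_stack([np.ones(n), sqft])
X_with_location = np.column_stack([np.ones(n), sqft, location])

apparent_noise = residual_variance(X_without_location, price)  # location leaks into "noise"
true_noise = residual_variance(X_with_location, price)         # close to the irreducible 1.0

assert apparent_noise > 100 * true_noise
```

So "noise" in the decomposition is always relative to the chosen representation x, matching the comment's reading.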
@immabreakaleg
4 years ago
17:48 what a boss question, wow
@VijayBhaskarSingh
2 years ago
Are {x1, x2, ..., xi} sample vectors of the X variables? Or are they functions of one variable?
@ammarkhan2611
3 years ago
Hi Professor, is there a way to get access to the assignments?
@amit_muses
3 years ago
I have a good command of Bayes' theorem and the total probability theorem, but couldn't understand the symbols the professor used. I could see that he used some concepts from expectation theory, but couldn't understand them well. Can someone suggest some material for this part that I can get through in a very short period, so that I can understand this lecture well?
@danielsiemmeister5286
2 years ago
First of all, thank you for this very intuitive explanation, Mr. Weinberger! I have some small questions or remarks which aren't 100% clear to me:
- You said that y (given x) is random, so we want to pick one statistic depending on our goals. In this case you choose the expectation E[y|x]. (One could, for example, choose the median, couldn't we?) However, some minutes later you choose the squared loss function as a "nice" choice for regression. Aren't these two sides of the same coin? If I choose the squared loss function, then I am picking E[y|x]? (When I choose the absolute-value loss function, then I am choosing the median.) So this is my first question: are my thoughts right?
- How would the proof look if I am not in the "squared loss / expectation" setting? What would the proof look like for a generic loss function or statistic of y|x? This is my second question.
- How would the proof look if we go to the regression setting? I think that is pretty much the same as question 2. Am I right in saying that if the distribution of y|x is discrete, then I am in a classification setting, and if it is continuous, then I am in a regression setting? Furthermore, if I pick the statistic of y|x (or a loss function) in a generic way, then I have a proof for both classification and regression problems?
I would be very thankful if anyone could answer or comment on my questions! Yours, Daniel
@kilianweinberger698
2 years ago
Yes, you are right. The math becomes a lot trickier if you don't use the squared loss, but ultimately the principle is the same for pretty much any loss function.
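The first of the questions above can be verified empirically: over constant predictions, the squared loss is minimized at the mean and the absolute loss at the median. A small sketch (the skewed exponential distribution is an arbitrary choice that separates mean from median):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(1.0, 10_001)   # skewed labels, so mean and median differ

# Search over constant predictions c for each loss.
cands = np.linspace(0.0, 5.0, 2001)
sq_risk = np.array([np.mean((y - c) ** 2) for c in cands])
abs_risk = np.array([np.mean(np.abs(y - c)) for c in cands])

best_sq = cands[sq_risk.argmin()]    # minimizer of the squared loss
best_abs = cands[abs_risk.argmin()]  # minimizer of the absolute loss

# Squared loss picks the mean; absolute loss picks the median.
assert abs(best_sq - y.mean()) < 0.01
assert abs(best_abs - np.median(y)) < 0.01
```

So choosing the loss and choosing the target statistic of y|x really are two sides of the same coin, as the question suggests.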
@pendekantimaheshbabu9799
4 years ago
Excellent! Can we apply the bias-variance trade-off across different models, e.g. to compare linear regression and polynomial regression? Does bold H consist of a set of hypotheses that contains only linear regressors?
@kilianweinberger698
4 years ago
Ultimately the BV trade-off exists for all models. However, as far as I know the derivation of this decomposition only falls into place so nicely in a few steps for linear regression.
@siddhanttandon6246
2 years ago
Hey Prof, I have a question. In this derivation we kind of bounded the risk for a new sample, i.e. the out-of-sample risk, which is composed of 3 parts. Is there some theory which does the same breakdown of risk on our training set, i.e. on samples the model has already seen? I am particularly interested to know if my training loss can ever go to zero.
@kilianweinberger698
2 years ago
That depends on your hypothesis class (i.e. what algorithm you are using). Maybe take a look at the lectures on Boosting. AdaBoost is an ensemble algorithm that (given some assumptions) guarantees that the training error will go to zero (if you average several classifiers together).
@macc7374
3 years ago
Hi Professor! Thank you for uploading this video. When we start the derivation by representing the expected test error in terms of hD(x) and y, how can we explain the presence of noise? Our assumption is that y is the correct label. So while there is certainly noise in real-world examples, given the starting point of the derivation here, should noise be expected to show up?
@kilianweinberger698
3 years ago
Keep in mind noise can either be a bad measurement, but it could also be part of the label that you just cannot explain by your representation of x. Imagine I am predicting house prices (y) based on features about a house (x). My features are e.g. number of bedrooms, square footage, age, ... But now the price of a house decreases because a really loud and rambunctious fraternity moves in next door - something that is not captured in my x at all. For this house the price y is now abnormally low. The price is correct, but given your limited features the only way you can explain it is as noise.
@macc7374
3 years ago
@@kilianweinberger698 thank you
@roniswar
3 years ago
Dear Prof', thank you again for posting this, very useful and interesting!! One question: in a regression setup, why do you call h (the hypothesis function) the "expected classifier"? Is this the common definition when thinking about a regression problem? Thanks!
@kilianweinberger698
3 years ago
No, it is only in the setting where you consider the training set as a random variable. Under this view, the classifier also becomes a random variable (as it is a function of the training set), and you can in theory compute its expectation. Hope this helps.
@roniswar
3 years ago
@@kilianweinberger698 Thank you! One other thing that I didn't see anyone ask: what happens to the bias-variance tradeoff, which you fully showed for MSE, when the loss function is not MSE? Does the decomposition still contain exactly those 3 quantities of bias, variance, and noise? How do we measure the tradeoff in that case? I assume we no longer have this convex parabola shape. (If you have a good source explaining this issue, please refer me to it.)
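The "training set as a random variable" view from the professor's reply can be simulated directly: draw many training sets, fit a hypothesis to each, and average the fitted hypotheses to estimate the expected classifier. A sketch with a toy regression problem (the target function, noise level, and polynomial model class are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
ybar = lambda x: np.sin(2 * x)   # expected label ybar(x); a made-up target
sigma = 0.3                      # standard deviation of the label noise

def draw_dataset(n=30):
    # The training set D is itself a random variable: each call is a fresh draw.
    x = rng.uniform(-2.0, 2.0, n)
    return x, ybar(x) + rng.normal(0.0, sigma, n)

def h_D(x_train, y_train, x_test, degree=3):
    # One trained hypothesis; its randomness comes entirely from D.
    return np.polyval(np.polyfit(x_train, y_train, degree), x_test)

x_test = np.linspace(-2.0, 2.0, 200)
preds = np.stack([h_D(*draw_dataset(), x_test) for _ in range(2000)])

h_bar = preds.mean(axis=0)                      # the "expected classifier" E_D[h_D]
variance = preds.var(axis=0).mean()             # E_x E_D[(h_D(x) - h_bar(x))^2]
bias_sq = ((h_bar - ybar(x_test)) ** 2).mean()  # E_x[(h_bar(x) - ybar(x))^2]
noise = sigma ** 2

# Direct estimate of the expected test error with fresh label noise.
y_test = ybar(x_test) + rng.normal(0.0, sigma, preds.shape)
total_error = ((preds - y_test) ** 2).mean()

assert abs(total_error - (variance + bias_sq + noise)) < 0.02
```

The final assertion checks the lecture's identity numerically: expected test error = variance + bias² + noise, with the expected classifier computed by averaging over training sets.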
@ayushmalik7093
2 years ago
Hi Prof, high variance implies overfitting, but overfitting has 2 parts: high test error and low training error. How do we infer low training error from high variance? High variance in hD(x) could also be the result of our algorithm learning gibberish, which could lead to both high test and training error. IMO low bias and high variance should mean overfitting, since in that case the model's predictions for different datasets will spread around the centre of your dart board.
@lorenzoappino9158
3 years ago
Kilian is my hero
@taketaxisky
4 years ago
How does overfitting affect the decomposed error terms? Maybe it is not relevant here.
@taketaxisky
4 years ago
Just realized a graph in the lecture notes explains this!
@sandeshhegde9143
5 years ago
Where is lecture 18? (I don't see it in the playlist.)
@Saganist420
5 years ago
lecture 18 was an exam, so it was not recorded.
@TrentTube
4 years ago
I eventually concluded it was the exam I skipped :D
@kc1299
3 years ago
disappears into some good feeling hahhaa
@adiratna96
2 years ago
I didn't understand why D and (x,y) are independent. Can anyone explain why, please? TIA.
@adiratna96
2 years ago
Damn, never mind I got it.
@gaconc1
3 years ago
This is a form of the Pythagorean theorem
@kilianweinberger698
3 years ago
Interesting observation!
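The observation is apt: the decomposition goes through because the cross term vanishes, which is exactly an orthogonality (Pythagorean) argument. A minimal numerical check, with an arbitrary made-up distribution for h_D at one fixed test point:

```python
import numpy as np

rng = np.random.default_rng(4)
# h_D evaluated at one fixed test point x, across many training sets D.
h_D = rng.normal(2.0, 0.5, 1_000_000)   # illustrative distribution
h_bar = h_D.mean()   # expected classifier at this x
y_bar = 1.2          # expected label at this x (any fixed value works)

# The cross term E_D[(h_D - h_bar)(h_bar - y_bar)] vanishes, because
# E_D[h_D - h_bar] = 0 -- this is the "orthogonality" behind the split.
cross = ((h_D - h_bar) * (h_bar - y_bar)).mean()
assert abs(cross) < 1e-8

# Hence the squared distance splits like a^2 + b^2 = c^2:
lhs = ((h_D - y_bar) ** 2).mean()                         # "hypotenuse" squared
rhs = ((h_D - h_bar) ** 2).mean() + (h_bar - y_bar) ** 2  # variance + bias^2
assert abs(lhs - rhs) < 1e-6
```

The same argument, applied once more with the noisy label y in place of y_bar, yields the full variance + bias² + noise decomposition.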
@bharatbajoria
3 years ago
Why is there no D at 37:00 in b^2?
@kilianweinberger698
3 years ago
Both terms, y-bar and y, are independent of the training data set D.
@deepfakevasmoy3477
3 years ago
24:56 please someone ask a question, I am not ready for war :)
@logicboard7746
2 years ago
Point @22:00
@logicboard7746
2 years ago
Then @41:00
@hohinng8644
1 year ago
Everything is excellent except the poor handwriting.
Comments: 79