Machine Learning Lecture 13 "Linear / Ridge Regression" - Cornell CS4780 SP17

Lecture Notes:
www.cs.cornell.edu/courses/cs4...

Comments: 50

  • @prwi87 · 1 year ago

    There is an error at 36:16 that leads to a wrong solution at 37:10. The sum should not be taken over P(w): if we write P(y|X,w)*P(w) as Π( P(y_i | x_i, w) ) * P(w), the prior P(w) is a single constant factor, so taking the log gives log(P(w)) + sum( log( P(y_i | x_i, w) ) ). Solving from there, no "n" appears in the numerator in front of "w^T w".

    Also, y is not normally distributed, but y|X is; that is why we write P(y|X) in the next step. In the MAP approach, P(D|w) is not defined either; we don't know what that pdf is. It should be P(y|X, w), which is Gaussian by our main assumption. Since D = (X, y), writing P(D|w) amounts to claiming we know P(X, y|w), which we don't. Later it is defined properly.

    Another thing: to my understanding, the concept of a "slope" is meaningless for high-dimensional data; we should speak of gradients or normal vectors. So here the vector w is not a slope but a normal to the hyperplane.

    At 37:05, "w" is a vector, so P(w) is a multivariate Gaussian distribution, but a univariate one is written. Since the entries of "w" are i.i.d., we can write it as a product of univariate Gaussians, one per entry. It won't change much, but it is rigorous, and then getting ||w||^2 (the sum over all w_i^2) under the argmin makes sense, since we treated "w" as multivariate. A moment later the professor indeed writes "w^T w", meaning that P(w) is a multivariate normal.

    I know I have all the time in the world to rewind this one lecture and pick up on little things, but I really like to be rigorous, and if my nitpicking can help someone, I will be really happy.
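
For reference, a sketch of the corrected algebra (assuming the lecture's model y_i = w^T x_i + ε_i with ε_i ~ N(0, σ²) and the prior w ~ N(0, τ²I)):

$$
\hat{w}_{\text{MAP}}
= \operatorname*{argmax}_{w}\; \log P(w) + \sum_{i=1}^{n} \log P(y_i \mid x_i, w)
= \operatorname*{argmin}_{w}\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{1}{2\tau^2}\, w^\top w,
$$

so the prior contributes a single term; rescaling the objective by 1/n only moves n into the denominator of the regularization constant, and it never multiplies w^T w.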

  • @30saransh · 1 month ago

    helpful, thanks

  • @tonychen31 · 4 years ago

    Excellent Lecture!

  • @jachawkvr · 4 years ago

    This was a fun lecture. I never knew that minimizing the squared error was equivalent to the MLE approach.
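
A minimal numerical sketch of that equivalence (synthetic data and names of my own choosing, not from the course materials): with a fixed-variance Gaussian noise model, the w that minimizes the squared error is the same w that maximizes the log-likelihood.

```python
import numpy as np

# Synthetic regression data: y = X w_true + Gaussian noise.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=n)

def squared_error(w):
    return np.sum((X @ w - y) ** 2)

def gaussian_log_likelihood(w, sigma=0.3):
    # log prod_i N(y_i; w^T x_i, sigma^2)
    #   = -(n/2) * log(2*pi*sigma^2) - squared_error(w) / (2*sigma^2)
    return -(n / 2) * np.log(2 * np.pi * sigma**2) - squared_error(w) / (2 * sigma**2)

# Least-squares solution, i.e. the MLE under the Gaussian noise assumption.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Any other w has a larger squared error and, equivalently, a smaller log-likelihood.
w_other = w_hat + 0.1
print(squared_error(w_hat) < squared_error(w_other))                      # True
print(gaussian_log_likelihood(w_hat) > gaussian_log_likelihood(w_other))  # True
```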

  • @Galois1683 · 4 years ago

    I am studying that now

  • @JoaoVitorBRgomes · 3 years ago

    It is because when you find the maximum likelihood estimator you are trying to find the most likely coefficients, which end up being the ones that also minimize the squared error.

  • @JoaoVitorBRgomes · 3 years ago

    @kilian weinberger, at circa 28:45 you ask if we have any questions about this. I have one question. You use argmin over w because you want to find the w that minimizes that loss function, right? If it were a concave function, would you write it as argmax over w?

  • @ugurkap · 5 years ago

    The MAP version is different from the lecture notes. I believe the lecture notes are correct: when we take the log, since the prior and the likelihood are multiplied with each other, we should get log(likelihood) + log(prior), and the summation over the likelihood terms should not affect log(prior), so lambda should not be multiplied by n. If we do not split the likelihood and the prior and instead leave it as log(likelihood × prior), we would get something different from the MLE version, right?

  • @echo-channel77 · 4 years ago

    I agree. Besides, that n would simply be canceled anyway and we'd be left with n times the constant.

  • @flicker1984 · 3 years ago

    Logistic regression is a regression in the sense that it predicts a probability, which can then be used to define a classifier.

  • @FlynnCz · 3 years ago

    Hi Kilian, I have a doubt. Why do we assume a fixed standard deviation for the noise? Shouldn't we estimate it directly (or them, if we allow the variance to be a function of x, like the mean) during the minimisation? Thank you!

  • @kilianweinberger698 · 3 years ago

    There is always a trade-off between what you assume and what you learn. If you make it too general and you attempt to learn the entire noise model, then you could explain the entire data set as noise (i.e. your w is just the all-zeros vector, mapping everything to the origin, and all deviations from the origin are explained by a very large variance). So here we avoid this problem by fixing the variance to something reasonable, and then learning the mean. There are however more complicated algorithms with more sophisticated noise models.

  • @FlynnCz · 3 years ago

    ​@@kilianweinberger698 Thanks a lot for your answer! Your videos are helping me a lot for my master thesis!

  • @llll-dj8rn · 8 months ago

    So, assuming that the noise in the linear regression model is Gaussian: applying MLE we derive ordinary least squares regression, and applying MAP (with a Gaussian prior on w) we derive regularized (ridge) regression.
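
In symbols, under the lecture's assumptions (y_i = w^T x_i + ε_i with ε_i ~ N(0, σ²), and for MAP a prior w ~ N(0, τ²I)), this is:

$$
\hat{w}_{\text{MLE}} = \operatorname*{argmin}_{w} \sum_{i=1}^{n} (w^\top x_i - y_i)^2,
\qquad
\hat{w}_{\text{MAP}} = \operatorname*{argmin}_{w} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{\sigma^2}{\tau^2}\, w^\top w,
$$

i.e. ordinary least squares and ridge regression with λ = σ²/τ² (up to how the objective is scaled by n).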

  • @PremKumar-vw7rt · 3 years ago

    Hi Kilian, I have a doubt. In the probabilistic perspective of linear regression, we assume that for every x_i there is a range of values for y_i, i.e. P(y_i | x_i), where x_i is a d-dimensional vector. So, while solving and arriving at the cost function, why are we using a univariate Gaussian distribution instead of a multivariate Gaussian distribution?

  • @kilianweinberger698 · 3 years ago

    Well, the model is that y = w'x + t, where t is Gaussian distributed, so y given x is just one-dimensional. You can of course make your noise model more complex, but you must make sure that, as it becomes more powerful, you don't explain the signal with the noise model.
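
A compact way to write the model in the reply above:

$$
y = w^\top x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\;\;\Longrightarrow\;\;
P(y \mid x; w) = \mathcal{N}\!\left(y;\; w^\top x,\; \sigma^2\right),
$$

so although x is d-dimensional, the label y (and hence the noise) is a scalar, which is why a univariate Gaussian is enough when deriving the cost function.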

  • @shashankshekhar7052 · 5 years ago

    For Project 4 (ERM), the data_train file is given, which consists of the bag-of-words matrix (X), but we don't have the labels (y) for it. What's the way around this? Can you please help?

  • @vivekmittal2290 · 4 years ago

    Where did you get the project files?

  • @SAINIVEDH · 3 years ago

    @@vivekmittal2290 Did you get them?

  • @bbabnik · 3 years ago

    @@SAINIVEDH Hey, can you please share them? I only found the exams and homeworks.

  • @tr8d3r · 3 years ago

    I love it "only losers maximize" ;)

  • @Aesthetic_Euclides · 5 months ago

    I was thinking about modeling the prediction of y given x with a Gaussian. Are these observations/reasoning steps correct? I understand the Gaussianness comes in because there is a true linear function that perfectly models the relationship between X and Y, but it is unknown to us. We do have data D, which we assume comes from sampling the true distribution P. Since we only have this limited sample of data, it is reasonable to model the noise as Gaussian. This means that, for a given x, our prediction y actually belongs to a Gaussian distribution; but since we only have this "single" sample D of the true data distribution, our best bet is to take the expectation of that Gaussian as the prediction (also because a good estimator of the expectation is the average, I guess).

    Now, given that in the end we just fit the model to the data and predict with it, why do we have to model the noise at all? Why not make it purely an optimization problem, more like the DL approach?
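
One compact way to state the "predict the expectation" step in this reasoning, under the lecture's model:

$$
P(y \mid x; w) = \mathcal{N}(w^\top x,\, \sigma^2), \qquad \mathbb{E}[y \mid x] = w^\top x,
$$

so once w has been estimated from the single sample D, the point prediction w^T x is exactly the mean of that Gaussian.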

  • @erenyeager4452 · 3 years ago

    OMG, the regularisation comes from the MAP!!!!!!!!!!!!!! Respect 🙇 🙇

  • @30saransh · 1 month ago

    Is there any way we can get access to the projects for this course?

  • @dmitriimedvedev6350 · 3 years ago

    36:22 is not clear: why is P(w) inside the log and thus inside the summation? Shouldn't it be a sum of logs of P(D | w) plus log P(w)? Could anyone please explain why we represent P(D | w) * P(w) = Π [ P(y_i | x_i, w) * P(w) ], including P(w) INSIDE the product for every pair (y_i, x_i) from 1 to n?

  • @dmitriimedvedev6350 · 3 years ago

    And I noticed this part is actually different from the lecture notes.

  • @cge007 · 4 months ago

    Hello, thank you for the lecture. At 17:23, why is the variance equal for all points? Is this an assumption that we are making?

  • @kilianweinberger698 · 4 months ago

    Yes, just an assumption to keep things simple.

  • @vatsan16 · 4 years ago

    Often when people derive the loss function for linear regression, they just start directly from minimizing the squared error between the regression line and the points, that is, minimize sum((y - y_i)^2). Here, you start with the assumption that the y_i have a Gaussian distribution and then arrive at the same conclusion via MLE. If we call the former method 1 and the latter method 2, where is the Gaussian distribution assumption hiding in method 1?

  • @kilianweinberger698 · 4 years ago

    The Gaussian noise model is essentially baked into the squared loss.

  • @deepfakevasmoy3477 · 3 years ago

    good question

  • @deepfakevasmoy3477 · 3 years ago

    @@kilianweinberger698 Do we end up with the same loss function if we use another distribution from the exponential family for the error noise?

  • @saketdwivedi8187 · 4 years ago

    How do we know the point at which to switch to Newton's method?

  • @kilianweinberger698 · 4 years ago

    You take a Newton step and check if the loss went down. If it ever goes up, you undo that step and switch back to gradient descent.
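
A minimal sketch of that check (the `loss`, `gradient`, and `hessian` callables here are hypothetical placeholders, not course code):

```python
import numpy as np

def hybrid_step(w, loss, gradient, hessian, lr=0.1):
    """Try a Newton step; if it increases the loss, undo it and take a gradient step instead."""
    w_newton = w - np.linalg.solve(hessian(w), gradient(w))
    if loss(w_newton) < loss(w):
        return w_newton          # Newton step improved the loss: keep it
    return w - lr * gradient(w)  # otherwise fall back to plain gradient descent
```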

  • @saketdwivedi8187 · 4 years ago

    @@kilianweinberger698 Thanks so much for the response.

  • @omalve9454 · 1 year ago

    Gold

  • @JoaoVitorBRgomes · 3 years ago

    But if in MAP I assume the prior is, say, a Poisson, is it going to give me the same results as the MLE? Are they supposed to give the same theta/w? @kilian weinberger, thank you, prof.!

  • @kilianweinberger698 · 3 years ago

    well if you don't use the conjugate prior, things can get ugly - purely from a mathematical point of view.

  • @arihantjha4746 · 3 years ago

    Hi Kilian, my doubt is with respect to how we derived the mean squared loss in the notes. We take P(x_i) to be independent of theta. Now, considering that P(X) is the marginal of P(X, Y): if the joint depends on theta, wouldn't the marginal also depend on theta? By that logic, P(X = x_i) would also depend on theta. Is it that we ASSUME P(X) to be independent of theta in the parameterized distribution, for the sake of doing discriminative learning, or is there some underlying obvious reason that I am missing? Thank you for your lectures.

  • @amarshahchowdry9727 · 3 years ago

    I have the same doubt.

  • @kilianweinberger698 · 3 years ago

    Essentially it is a modelling assumption. We assume Y depends on X (which is very reasonable, as the label is a function of X). We model this function with a distribution parameterized by theta, so P(Y|X; theta) depends on theta. We also assume that the Xs are just given to us (by mother nature). This is also very reasonable: essentially you assume someone gives you the data, and you predict the labels. So P(X) does not depend on theta. But the joint distribution *does* depend on theta, because it contains the conditional: P(Y, X; theta) = P(Y|X; theta) * P(X). Hope this helps.
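
In symbols, the factorization described in the reply above:

$$
\log P(D; \theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta) + \sum_{i=1}^{n} \log P(x_i),
$$

and since the second sum does not depend on θ, it drops out of the argmax, leaving only the conditional terms in the training objective.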

  • @arihantjha4746 · 3 years ago

    @@kilianweinberger698 Thank you for the reply. I understand.

  • @gregmakov2680 · 2 years ago

    hahahaha, exactly, statistics always try to mess things up :D:D