Cornell class CS4780. (Online version: tinyurl.com/eCornellML)
Comments: 51
@KulvinderSingh-pm7cr · 5 years ago
So informative! Thanks, Professor, for making ML so fun and intuitive!
@saikumartadi8494 · 4 years ago
Thanks a lot for the video. Please upload other courses offered by you if they are recorded. Would love to learn from teachers like you.
@vasileiosmpletsos178 · 4 years ago
Nice Lecture! Helped me a lot!
@giulianobianco6752 · 3 months ago
Great lecture, thanks Professor! It supplements the online Certificate with deeper math and concepts.
@PhilMcCrack1 · 4 years ago
Starts at 1:22
@benw4361 · 4 years ago
Great lecture!
@yukiy.4201 · 5 years ago
Magnificent!
@yanalishkova1873 · 5 years ago
Very well and intuitively explained, and very entertaining as well! Thank you!
@mehmetaliozer2403 · 2 years ago
Thanks for this amazing lecture 👍👍
@fuweili5320 · 3 years ago
Learned a lot from it!
@andresfeliperamirez7869 · 3 years ago
Amazing!
@sooriya233 · 4 years ago
Brilliant
@amitotc · 3 years ago
First of all, thanks a lot for sharing the video; it really helped me understand some of the salient differences between gradient descent and Newton's method. I have a few questions: 1) In the AdaGrad method, the denominator might become too big, which can make the update vanish pretty quickly in some situations. Isn't it better to take the average of the `s` values rather than just summing them? 2) Towards the end of the lecture you talk about Newton's method, where we might have to compute the inverse of a large matrix. In the case of very high-dimensional data, shouldn't the cost of inverting the matrix be a lot higher than taking computationally cheaper gradient descent steps, which might then overshadow any benefit we get from using Newton's method? Matrix inversion is usually O(n^3) if we use Gaussian elimination (as far as I know, matrix inversion is as hard as matrix multiplication, so at least O(n^2.8) with something like Strassen's algorithm). 3) If we want to use gradient descent to find the roots of a function, instead of finding minima and maxima, how effective will it be compared to Newton's method? Can gradient descent be used interchangeably in all applications where Newton's method is used, even though it might be less effective in some cases?
@maddai1764 · 5 years ago
Nice explanation
@rameezusmani1294 · 3 years ago
Sir, I can't explain in words how helpful your videos are and how easy they are to understand. People like you are true heroes of humanity and science. Thank you so much. I have one question: these days machine learning seems to be limited to deep learning using neural networks. Do people still use MLE or MAP or Naive Bayes?
@kilianweinberger698 · 3 years ago
Yes, other (non-deep) methods are still very much in use. You are right that one can get the impression that nowadays people only use deep networks, however I would say that's also a problem of our time. I often find that in some companies people train big multi-layer deep nets for problems where a simple gradient boosted tree or Naive Bayes classifier would have been far more efficient. Btw, MLE and MAP are really concepts to optimize model parameters, so you can use them for deep networks, too.
@rameezusmani1294 · 3 years ago
@kilianweinberger698 Thank you so much for clarifying. I'll take it as homework to figure out how I can use MLE or MAP to optimize my neural network parameters.
@rameezusmani1294 · 3 years ago
I realized that the least squares approach is a specific case of MLE :). Thank you for making me think in that direction, sir.
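A quick numerical sketch of the connection this comment mentions (made-up data; variable names are my own): under i.i.d. Gaussian noise, maximizing the likelihood of y = Xw + noise is the same as minimizing the sum of squared residuals, so the least squares solution also minimizes the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Least squares solution: argmin_w ||Xw - y||^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian negative log-likelihood: each term is
# (y_i - x_i^T w)^2 / (2 sigma^2) + const, so minimizing it
# is exactly minimizing the squared error.
def neg_log_likelihood(w, sigma=0.1):
    r = y - X @ w
    return 0.5 * np.sum(r**2) / sigma**2 + len(y) * np.log(sigma * np.sqrt(2 * np.pi))

# The least-squares w minimizes the NLL: nudging it in any
# coordinate direction makes the NLL worse.
for delta in np.eye(3) * 1e-3:
    assert neg_log_likelihood(w_ls) <= neg_log_likelihood(w_ls + delta)
```

Note that sigma only scales the objective plus a constant, so the minimizer is the same for any noise level.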
@coolarun3150 · a year ago
awesome!!
@analyticstoolsbyhooman6963 · 2 years ago
Heard a lot about Cornell, and now I see why.
@mikejason3822 · 3 years ago
Why was an approximation of l(w) (using the Taylor series) used at 13:20?
@doyourealise · 3 years ago
Sir, did you find out where the handouts went? (10:30)
@massimoc7494 · 3 years ago
Is it possible for Newton's method for optimization to get stuck at an inflection point of a function? I ask this because if I'm in the concave part of the function I'm minimizing, I'm actually finding the maximum of the parabola tangent to my point x_0, and only in the convex part am I finding its minimum.
@kilianweinberger698 · 3 years ago
hmm, unlikely ... sounds more like a (sign) error?
@sudhanshuvashisht8960 · 4 years ago
In AdaGrad, when we're updating the s vector as s = s + g.^2 (element-wise square, not g·g, as you described), what if the gradient at a point is < 1 for all features (i.e., every element of the g vector is less than one)? In that case, how does adding the square of each gradient element to the s vector serve our purpose?
@subhasdh2446 · 2 years ago
That was my question, but I think I've found a good explanation for it. Please correct me if I'm wrong. So when it is
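A minimal sketch of the update being discussed (my own toy numbers, assuming the element-wise-square version from the lecture): even when every gradient entry is below 1, the running sum s still grows monotonically, so the effective per-coordinate step size alpha/sqrt(s + eps) still shrinks over time, just more slowly.

```python
import numpy as np

def adagrad_step(w, g, s, alpha=0.1, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale the step per coordinate."""
    s = s + g**2                      # element-wise square, as in the lecture
    w = w - alpha * g / np.sqrt(s + eps)
    return w, s

# Even with tiny gradients (< 1 in every entry), s grows and the step size decays.
w = np.array([1.0, 1.0])
s = np.zeros(2)
g = np.array([0.01, 0.001])          # all entries < 1
sizes = []
for _ in range(1000):
    w, s = adagrad_step(w, g, s)
    sizes.append(0.1 / np.sqrt(s + 1e-8))   # effective learning rate per coordinate

# The effective learning rate is strictly decreasing in every coordinate.
assert np.all(sizes[-1] < sizes[0])
```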
@xwcao1991 · 3 years ago
What would be the effect of doing gradient descent (or even conjugate GD) on the Taylor expansion to find the optimum of the parabola, instead of computing its optimum directly? Would that avoid overshooting in a narrow valley? Thanks for any explanation.
@kilianweinberger698 · 3 years ago
Newton's Method is essentially doing that for the second moment. However, Taylor's expansion is only a good approximation locally. The moment you use it to take large steps, you may be very far off - leading to divergence.
@xwcao1991 · 3 years ago
@kilianweinberger698 Thanks for the explanation. Love your teaching style.
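For concreteness, a sketch of the two updates being contrasted in this thread (a toy quadratic of my own choosing, not from the lecture): gradient descent takes many small steps along the negative gradient, while Newton's method jumps directly to the minimizer of the local quadratic approximation, which for a quadratic objective is exact in one step.

```python
import numpy as np

# Minimize f(w) = 0.5 * w^T A w - b^T w, a quadratic "narrow valley".
A = np.array([[10.0, 0.0], [0.0, 0.1]])   # poorly conditioned Hessian
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)            # true minimizer

def grad(w):
    return A @ w - b

# Gradient descent: many small steps, slow along the flat direction.
w_gd = np.zeros(2)
for _ in range(100):
    w_gd = w_gd - 0.05 * grad(w_gd)

# Newton's method: w <- w - H^{-1} g; for a quadratic, one step is exact.
w_newton = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))

assert np.allclose(w_newton, w_star)
assert np.linalg.norm(w_gd - w_star) > np.linalg.norm(w_newton - w_star)
```

This also illustrates the professor's caveat: the Newton step is only this good because the quadratic model is exact here; for a general function it is only a local approximation.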
@gauravsinghtanwar4415 · 4 years ago
I didn't understand how gradient descent works in the case of local minima. Please help. Thank you!
@kilianweinberger698 · 4 years ago
Well, yes we will get there in the later lectures (neural networks). With GD you will get trapped in the local minimum. To fix this you can use Stochastic Gradient Descent in those cases. Here, the gradient is so noisy, that it is not precise enough to get trapped (unless the minimum is really wide). Hope this helps.
@gauravsinghtanwar4415 · 4 years ago
@kilianweinberger698 Thank you very much! This is a wonderful course.
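A tiny illustration of the trapping behavior discussed above (a double-well toy function of my own, not from the lecture): plain gradient descent converges into whichever basin it starts in and cannot cross the barrier between the two minima.

```python
def f_prime(w):
    # derivative of f(w) = (w**2 - 1)**2 + 0.3*w, a double-well
    # with one minimum near w = -1.03 and another near w = 0.96
    return 4 * w * (w**2 - 1) + 0.3

def gd(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w = w - lr * f_prime(w)
    return w

# Deterministic GD settles into the basin of its starting point.
w_left = gd(-0.9)
w_right = gd(0.9)
assert w_left < 0 < w_right
assert abs(f_prime(w_left)) < 1e-6 and abs(f_prime(w_right)) < 1e-6
```

The noisy gradients of SGD, as the professor notes, can kick the iterate over such a barrier unless the basin is very wide.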
@cuysaurus · 4 years ago
Is my log-likelihood function a loss function?
@kilianweinberger698 · 4 years ago
The negative log-likelihood is a loss function (in fact a very common one). The likelihood or log-likelihood of your data is something you want to maximize - so if you negate it you obtain something you want to minimize (i.e. a loss).
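A concrete instance of this sign flip (a sketch with made-up labels and scores): for binary labels under a Bernoulli model, the negative log-likelihood is exactly the familiar log loss, and better-fitting probabilities give higher likelihood, hence lower loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(y, p):
    """Log-likelihood of binary labels y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1]
scores = [2.0, -1.0, 0.5, 3.0]
p = [sigmoid(s) for s in scores]

# The loss we minimize is the likelihood with the sign flipped.
nll = -log_likelihood(y, p)
assert nll > 0

# Better-fitting probabilities -> higher likelihood -> lower loss.
better = [0.99, 0.01, 0.9, 0.99]
assert -log_likelihood(y, better) < nll
```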
@vatsan16 · 4 years ago
So if I understand you correctly, Naive Bayes is useful when we don't have as much data. But in a real-life problem, how would we know if the amount of data we have is enough?
@kilianweinberger698 · 4 years ago
TBH, often it is just the easiest to try both, Naive Bayes and logistic regression, and see which one works better.
@Carlosrv19 · a year ago
@kilianweinberger698 The form that logistic regression takes for P(y|X) can be derived from the Naive Bayes assumption when P(X|y) is Gaussian. Isn't that the same as saying that logistic regression inherits the same assumptions as NB?
@maharshiroy8471 · 3 years ago
I believe that in practical implementations, a poorly conditioned Hessian can cause huge numerical errors while converging, ultimately rendering second-order methods like Newton's very unreliable.
@meghnashankr9340 · 2 years ago
Just saved me hours of digging through books to understand these concepts...
@KaushalKishoreTiwari · 3 years ago
But how do we decide the parabola's value?
@JoaoVitorBRgomes · 3 years ago
At around 23:00, how do we check if the function is convex?
@kilianweinberger698 · 3 years ago
It is convex if and only if its second derivative is non-negative everywhere. Or, for high-dimensional functions, if the Hessian is positive semi-definite.
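A numerical version of that check (a sketch with example matrices of my own choosing): positive semi-definiteness of a symmetric Hessian can be tested via its eigenvalues.

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """A symmetric matrix is PSD iff all its eigenvalues are (numerically) non-negative."""
    H = (H + H.T) / 2                       # symmetrize against round-off
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of the convex quadratic f(w) = w1^2 + w2^2: PSD, so f is convex.
assert is_psd(np.array([[2.0, 0.0], [0.0, 2.0]]))

# Hessian of the saddle f(w) = w1^2 - w2^2: not PSD, so f is not convex.
assert not is_psd(np.array([[2.0, 0.0], [0.0, -2.0]]))
```

For a non-quadratic function, the same test would have to hold for the Hessian at every point, not just one.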
@user-me2bw6ir2i · a year ago
Has anyone found the Matrix Cookbook that the professor mentioned?
@kilianweinberger698 · a year ago
Here you go: www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
@gabrielabecerrilprado2142 · 3 years ago
❤
@gregmakov2680 · 2 years ago
I think Naive Bayes has worse performance than Gaussian Naive Bayes or logistic regression because of its weaker generalization capability.
@rakinbaten7305 · 15 days ago
I'm curious if someone was actually stealing all the notes
@sandeepraj7157 · 3 years ago
Imagine there's a tiger outside your house and you want to drive it away. Now since we have no idea to do that, we can safely assume it is a cat as it is much easier to scare away a cat. Problem solved.
@coolshoos · 2 years ago
who stole the freaking handouts???
@pnachtwey · a month ago
Everyone seems to have a different version. AdaGrad doesn't always work: the running sum of squared gradients gets too big unless one scales it down. Also, AdaGrad works best with a line search. All variations work best with a line search.