Machine Learning Lecture 12 "Gradient Descent / Newton's Method" -Cornell CS4780 SP17

Cornell class CS4780. (Online version: tinyurl.com/eCornellML)

Comments: 51

  • @KulvinderSingh-pm7cr · 5 years ago

    So informative !!! Thanks professor for making ML so fun and intuitive !!!

  • @saikumartadi8494 · 4 years ago

    Thanks a lot for the video. Please upload other courses offered by you if they are recorded. Would love to learn from teachers like you.

  • @vasileiosmpletsos178 · 4 years ago

    Nice Lecture! Helped me a lot!

  • @giulianobianco6752 · 3 months ago

    Great lecture, thanks Professor! A great complement to the online certificate, with deeper math and concepts.

  • @PhilMcCrack1 · 4 years ago

    Starts at 1:22

  • @benw4361 · 4 years ago

    Great lecture!

  • @yukiy.4201 · 5 years ago

    Magnificent!

  • @yanalishkova1873 · 5 years ago

    Very well and intuitively explained, and very entertaining as well! Thank you!

  • @mehmetaliozer2403 · 2 years ago

    Thanks for this amazing lecture 👍👍

  • @fuweili5320 · 3 years ago

    Learned a lot from it!!!

  • @andresfeliperamirez7869 · 3 years ago

    Amazing!

  • @sooriya233 · 4 years ago

    Brilliant

  • @amitotc · 3 years ago

    First off, thanks a lot for sharing the video; it really helped me understand some of the salient differences between gradient descent and Newton's method. I have a few questions:
    1) In the AdaGrad method, the denominator might become too big, which might then make the gradient step vanish pretty quickly in some situations. Isn't it better to take the average of the `s` values rather than just summing them?
    2) Towards the end of the lecture you talk about Newton's method, where we might have to compute the inverse of a large matrix. In the case of very high-dimensional data, shouldn't the cost of inverting the matrix be a lot higher than taking computationally less expensive gradient descent steps, which might then overshadow any benefit we get from using Newton's method? Matrix inversion is usually O(n^3) with Gaussian elimination (and, as far as I know, matrix inversion is as hard as matrix multiplication, so at least O(n^2.8) with something like Strassen's algorithm).
    3) If we want to use gradient descent to find the roots of a function, instead of finding minima and maxima, how effective will it be compared to Newton's method? Can gradient descent be used interchangeably in all applications where Newton's method is used, even though it might be less effective in some cases?
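
    Regarding question 1) above: replacing AdaGrad's running sum of squared gradients with a decayed running average is essentially what RMSProp does. Below is a minimal NumPy sketch of the two accumulators side by side; this is my own illustration, not code from the lecture, and the toy objective, step size, and decay factor are made up. On question 2), the per-step costs are indeed very different: a Newton step on d-dimensional parameters costs roughly O(d^3) for a naive solve versus O(d) for a gradient step, so Newton's method buys fewer iterations at a much higher price per iteration.

        import numpy as np

        def grad_f(w):
            # gradient of a made-up quadratic f(w) = 0.5 * (10*w1^2 + 0.1*w2^2)
            return np.array([10.0, 0.1]) * w

        alpha, eps, beta = 0.1, 1e-8, 0.9
        w_sum = np.array([1.0, 1.0]); s_sum = np.zeros(2)   # AdaGrad: s is a running sum
        w_avg = np.array([1.0, 1.0]); s_avg = np.zeros(2)   # RMSProp-style: s is a decayed average

        for t in range(500):
            g = grad_f(w_sum)
            s_sum += g ** 2                                  # only ever grows
            w_sum -= alpha * g / (np.sqrt(s_sum) + eps)

            g = grad_f(w_avg)
            s_avg = beta * s_avg + (1 - beta) * g ** 2       # forgets old gradients
            w_avg -= alpha * g / (np.sqrt(s_avg) + eps)

        print("AdaGrad (running sum)    :", w_sum)
        print("decayed average (RMSProp):", w_avg)

    With the running sum, sqrt(s) grows without bound and the effective step size keeps shrinking; with the decayed average it levels off.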

  • @maddai1764 · 5 years ago

    Nice explanation

  • @rameezusmani1294 · 3 years ago

    Sir, I can't explain in words how helpful your videos are and how easy they are to understand. People like you are true heroes of humanity and science. Thank you so much. I have one question: these days machine learning seems to be limited to deep learning using neural networks. Do people still use MLE or MAP or Naive Bayes?

  • @kilianweinberger698 · 3 years ago

    Yes, other (non-deep) methods are still very much in use. You are right that one can get the impression that nowadays people only use deep networks, however I would say that's also a problem of our time. I often find that in some companies people train big multi-layer deep nets for problems where a simple gradient boosted tree or Naive Bayes classifier would have been far more efficient. Btw, MLE and MAP are really concepts to optimize model parameters, so you can use them for deep networks, too.

  • @rameezusmani1294 · 3 years ago

    @kilianweinberger698 Thank you so much for clarifying. I will take it as homework to figure out how I can use MLE or MAP to optimize my neural network parameters.

  • @rameezusmani1294 · 3 years ago

    I realized that the least squares approach is a specific case of MLE :). Thank you for making me think in that direction, sir.
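
    For readers following this thread, a short sketch of that connection (standard textbook material, not taken verbatim from this lecture): assume labels are generated as y_i = w^T x_i + eps_i with Gaussian noise eps_i ~ N(0, sigma^2). Then

        p(y_i \mid \mathbf{x}_i, \mathbf{w})
            = \frac{1}{\sqrt{2\pi\sigma^2}}
              \exp\!\left( -\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2} \right),

        \hat{\mathbf{w}}_{\mathrm{MLE}}
            = \arg\max_{\mathbf{w}} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i, \mathbf{w})
            = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2,

    because the Gaussian normalization constant and the factor 1/(2*sigma^2) do not depend on w. Adding a log-prior over w to the objective gives the MAP estimate discussed earlier in the thread.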

  • @coolarun3150 · 1 year ago

    awesome!!

  • @analyticstoolsbyhooman6963 · 2 years ago

    I had heard a lot about Cornell, and now I see the reason for it.

  • @mikejason3822 · 3 years ago

    Why was an approximation of l(w) (using the Taylor series) used at 13:20?

  • @doyourealise · 3 years ago

    Sir, did you find out where the handouts went? 10:30?

  • @massimoc7494 · 3 years ago

    Is it possible for Newton's method for optimization to get stuck at an inflection point of a function? I ask this because if I'm in a concave part of the function I'm minimizing, I end up finding the maximum of the parabola tangent to my point x_0 rather than its minimum, as I would in a convex part.

  • @kilianweinberger698 · 3 years ago

    hmm, unlikely ... sounds more like a (sign) error?

  • @sudhanshuvashisht8960 · 4 years ago

    In AdaGrad, when we're updating the s vector as s = s + g .^ 2 (elementwise, as you described, not the dot product g·g), what if the gradient at a point is less than 1 for every feature (i.e., every element of the g vector is less than one)? In that case, how does adding the square of each gradient element to the s vector serve our purpose?

  • @subhasdh2446 · 2 years ago

    That was my question, but I think I've found a good explanation for it. Please correct me if I'm wrong. So when it is
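
    A minimal sketch of the per-coordinate update being discussed (my own illustration; the learning rate and gradient values are made up). Even when every gradient entry is smaller than 1, the accumulator s only ever grows, so the effective step size alpha / (sqrt(s) + eps) still shrinks over time, and squaring also makes the scaling insensitive to the gradient's sign:

        import numpy as np

        alpha, eps = 0.5, 1e-8
        s = np.zeros(2)
        w = np.array([1.0, 1.0])

        for t in range(1, 6):
            g = np.array([0.3, 0.05])              # every entry < 1 (held fixed just for illustration)
            s += g ** 2                            # s still grows monotonically
            step = alpha * g / (np.sqrt(s) + eps)  # per-coordinate effective step
            w -= step
            print(f"t={t}  sqrt(s)={np.sqrt(s).round(3)}  step={step.round(3)}")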

  • @xwcao1991 · 3 years ago

    What would be the effect of doing gradient descent (or even conjugate gradient descent) on the Taylor expansion to find the optimum of the parabola, instead of computing its optimum directly? Would that avoid overshooting in a narrow valley? Thanks for any explanation.

  • @kilianweinberger698 · 3 years ago

    Newton's Method is essentially doing that for the second moment. However, Taylor's expansion is only a good approximation locally. The moment you use it to take large steps, you may be very far off - leading to divergence.

  • @xwcao1991 · 3 years ago

    @kilianweinberger698 Thanks for the explanation. Love your teaching style.
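
    To make the comparison in this thread concrete, here is a minimal sketch (my own, not code from the lecture) of one gradient step versus one Newton step on a made-up, poorly scaled quadratic. For a quadratic the second-order Taylor expansion is exact, so the Newton step lands on the optimum in one shot; using that expansion far outside the region where it is accurate is what can cause the divergence mentioned above.

        import numpy as np

        # Poorly scaled quadratic ("narrow valley"): f(w) = 0.5 * w^T A w
        A = np.array([[100.0, 0.0],
                      [0.0,   1.0]])

        def grad(w):
            return A @ w

        def hessian(w):
            return A                                   # constant for a quadratic

        w0 = np.array([1.0, 1.0])
        alpha = 0.01                                   # small enough not to diverge along the steep axis

        w_gd = w0 - alpha * grad(w0)                         # one gradient-descent step
        w_nt = w0 - np.linalg.solve(hessian(w0), grad(w0))   # one Newton step (solve, don't invert)

        print("gradient step:", w_gd)                  # the flat direction has barely moved
        print("Newton step:  ", w_nt)                  # exactly (0, 0) for this quadratic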

  • @gauravsinghtanwar4415 · 4 years ago

    I didn't understand how gradient descent works in the case of local minima. Please help. Thank you!

  • @kilianweinberger698 · 4 years ago

    Well, yes, we will get there in the later lectures (neural networks). With GD you will get trapped in the local minimum. To fix this you can use Stochastic Gradient Descent in those cases. Here the gradient is so noisy that it is not precise enough to get trapped (unless the minimum is really wide). Hope this helps.

  • @gauravsinghtanwar4415 · 4 years ago

    @kilianweinberger698 Thank you very much! This is a wonderful course.
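
    A small toy illustration of the point about noise (my own example, not from the lecture; a 1-D function stands in for the loss surface and Gaussian noise stands in for minibatch noise):

        import numpy as np

        # f(w) = w^4 - 3*w^2 - w has a shallow local minimum near w = -1.13
        # and a deeper global minimum near w = 1.30.
        def grad(w):
            return 4 * w ** 3 - 6 * w - 1

        def run(noise_std, seed=0, steps=3000, lr=0.01, w=-1.5):
            rng = np.random.default_rng(seed)
            for _ in range(steps):
                w -= lr * (grad(w) + noise_std * rng.normal())   # noisy gradient
            return w

        print("plain GD ends at:", run(noise_std=0.0))           # stays near the shallow minimum at -1.13
        escaped = sum(run(noise_std=10.0, seed=s) > 0 for s in range(20))
        print("noisy runs ending near the deeper minimum:", escaped, "/ 20")

    With this much noise most of the runs hop over the barrier and finish in the deeper basin, while the noiseless run never leaves the shallow one.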

  • @cuysaurus · 4 years ago

    Is my log-likelihood function a loss function?

  • @kilianweinberger698 · 4 years ago

    The negative log-likelihood is a loss function (in fact a very common one). The likelihood or log-likelihood of your data is something you want to maximize - so if you negate it you obtain something you want to minimize (i.e. a loss).
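
    To make the sign flip concrete, a minimal sketch for logistic regression (my own illustration, not course code): the loss we minimize is exactly the negated log-likelihood we would otherwise maximize.

        import numpy as np

        def log_likelihood(w, X, y):
            # Bernoulli log-likelihood with p(y=1|x) = sigmoid(w^T x) and labels y in {0, 1}
            z = X @ w
            return np.sum(y * z - np.log1p(np.exp(z)))

        def loss(w, X, y):
            return -log_likelihood(w, X, y)     # the logistic loss is just the negated log-likelihood

        # tiny made-up data set
        X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
        y = np.array([1, 0, 1])
        w = np.zeros(2)
        print(log_likelihood(w, X, y), loss(w, X, y))   # same magnitude, opposite sign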

  • @vatsan16 · 4 years ago

    So if I understand what you said correctly, Naive Bayes is useful when we don't have as much data. But in a real-life problem, how would we know if the amount of data that we have is enough?

  • @kilianweinberger698 · 4 years ago

    TBH, often it is just the easiest to try both, Naive Bayes and logistic regression, and see which one works better.

  • @Carlosrv19 · 1 year ago

    @kilianweinberger698 The form that logistic regression takes for P(y|X) can be derived from the Naive Bayes assumptions when P(X|y) is Gaussian. Isn't that the same as saying that logistic regression inherits the same assumptions as NB?
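
    For what it's worth, a minimal sketch of how one might run that "try both" comparison with scikit-learn on synthetic data (my own example; the dataset and cross-validation settings are arbitrary):

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB

        # made-up binary classification problem
        X, y = make_classification(n_samples=300, n_features=10, random_state=0)

        for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
            scores = cross_val_score(model, X, y, cv=5)          # 5-fold cross-validated accuracy
            print(type(model).__name__, scores.mean().round(3))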

  • @maharshiroy8471 · 3 years ago

    I believe that in practical implementations a poorly conditioned Hessian can cause huge numerical errors while converging, ultimately rendering second-order methods like Newton's very unreliable.
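
    One standard mitigation for exactly this problem (a generic trick, not something specific to this lecture) is to damp the Hessian before solving, which interpolates between a Newton step and a scaled gradient step. A minimal sketch with a made-up near-singular Hessian:

        import numpy as np

        # Nearly singular Hessian: one huge and one tiny eigenvalue (condition number ~1e16).
        H = np.diag([1e8, 1e-8])
        g = np.array([1.0, 1.0])

        newton_step = np.linalg.solve(H, g)                      # explodes along the weak direction
        lam = 1e-2 * np.trace(H) / H.shape[0]                    # damping scaled to H's magnitude
        damped_step = np.linalg.solve(H + lam * np.eye(2), g)    # Levenberg-Marquardt-style damping

        print("condition number:", np.linalg.cond(H))
        print("pure Newton step:", newton_step)                  # roughly [1e-8, 1e8]
        print("damped step     :", damped_step)                  # bounded: roughly [1e-8, 2e-6]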

  • @meghnashankr9340 · 2 years ago

    Just saved me hours of digging into books to understand these concepts...

  • @KaushalKishoreTiwari · 3 years ago

    But how do we determine the parabola value?

  • @JoaoVitorBRgomes · 3 years ago

    At around 23:00, how do we check if the function is convex?

  • @kilianweinberger698 · 3 years ago

    It is convex if and only if its second derivative is non-negative. Or for high dimensional functions, the Hessian is positive semi-definite.
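
    A minimal NumPy sketch of checking that criterion at a point (my own example function; note this only verifies convexity locally at the points you test, whereas global convexity needs the Hessian to be PSD everywhere):

        import numpy as np

        def hessian(w):
            # Hessian of the example function f(w) = w1^4 + w1*w2 + w2^2
            return np.array([[12 * w[0] ** 2, 1.0],
                             [1.0,            2.0]])

        def is_convex_at(w, tol=1e-10):
            # positive semi-definite <=> all eigenvalues >= 0 (symmetric matrix, so eigvalsh applies)
            return bool(np.all(np.linalg.eigvalsh(hessian(w)) >= -tol))

        print(is_convex_at(np.array([1.0, 0.0])))   # True:  eigenvalues of [[12, 1], [1, 2]] are positive
        print(is_convex_at(np.array([0.0, 0.0])))   # False: [[0, 1], [1, 2]] has a negative eigenvalue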

  • @user-me2bw6ir2i · 1 year ago

    Has anyone found the Matrix Cookbook that the professor mentioned?

  • @kilianweinberger698 · 1 year ago

    Here you go: www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

  • @gregmakov2680 · 2 years ago

    I think Naive Bayes has worse performance than Gaussian Naive Bayes or logistic regression because of its generalization capability.

  • @rakinbaten7305 · 15 days ago

    I'm curious if someone was actually stealing all the notes

  • @sandeepraj7157 · 3 years ago

    Imagine there's a tiger outside your house and you want to drive it away. Now since we have no idea to do that, we can safely assume it is a cat as it is much easier to scare away a cat. Problem solved.

  • @coolshoos · 2 years ago

    who stole the freaking handouts???

  • @pnachtwey · 1 month ago

    Everyone seems to have a different version. AdaGrad doesn't always work: the sum of the squared gradients gets too big unless one scales it down. Also, AdaGrad works best with a line search. All variations work best with a line search.
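
    For reference, a minimal sketch of the kind of backtracking (Armijo) line search being suggested, applied here to plain gradient descent on a made-up, poorly scaled quadratic (my own illustration, not tied to any particular AdaGrad implementation):

        import numpy as np

        A = np.diag([100.0, 1.0])                # poorly scaled quadratic f(w) = 0.5 * w^T A w

        def f(w):
            return 0.5 * w @ A @ w

        def grad_f(w):
            return A @ w

        def backtracking_step(w, alpha0=1.0, beta=0.5, c=1e-4):
            # Armijo backtracking: shrink the step until it yields "sufficient decrease".
            g = grad_f(w)
            alpha = alpha0
            while f(w - alpha * g) > f(w) - c * alpha * g @ g:
                alpha *= beta
            return w - alpha * g, alpha

        w = np.array([1.0, 1.0])
        for i in range(5):
            w, alpha = backtracking_step(w)
            print(f"iter {i}: accepted alpha={alpha:.4g}  f(w)={f(w):.6f}")

    The same idea can be wrapped around any descent direction, including preconditioned ones, so the accepted step size adapts to the local scale instead of relying on a fixed learning rate.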