Lecture 7 "Estimating Probabilities from Data: Maximum Likelihood Estimation" -Cornell CS4780 SP17

Cornell class CS4780. (Online version: tinyurl.com/eCornellML )
Lecture Notes: www.cs.cornell.edu/courses/cs4...
Past 4780 exams are here: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
Past 4780 homeworks are here: www.dropbox.com/s/tbxnjzk5w67...
If you want to take the course for credit and obtain an official certificate, there is now a revamped version (with much higher quality videos) offered through eCornell ( tinyurl.com/eCornellML ). Note, however, that eCornell does charge tuition for this version.

Comments: 63

  • @RS-el7iu (4 years ago)

    I've just stumbled on a treasure of high-class lectures, and for free. You make me enjoy these topics after having graduated back in 2000, and believe me, it's hard to make someone in their mid-40s enjoy these when all I think about nowadays is learning things like sailing :)). I wish we had profs like you in my country; it would have been a hundred times more enjoyable. Thank you for sharing all of these.

  • @crestz1 (a year ago)

    This lecturer is amazing. As a Ph.D. candidate, I always revisit the lectures to familiarise myself with the basics.

  • @cuysaurus (4 years ago)

    48:46 He looks so happy.

  • @meenakshisarkar7529 (3 years ago)

    This is probably the best explanation I have come across regarding the difference between Bayesian and frequentist statistics. :D

  • @xiaoweidu4667 (3 years ago)

    The key to a deeper understanding of algorithms is the assumptions made about the underlying data. Thank you, and great respect.

  • @SundaraRamanR (4 years ago)

    "Bayesian statistics has nothing to do with Bayes' rule" - knowing this would have avoided a lot of confusion for me over the years. I kept trying to make the (presumably strong) connection between the two and assumed I didn't understand Bayesian reasoning because I couldn't figure out this mysterious connection

  • @WahranRai (7 months ago)

    You are totally wrong!

  • @deltasun (4 years ago)

    Impressive lecture, thanks a lot! I was also impressed to discover that, if instead of taking the MAP you take the EAP (expected a posteriori), the Bayesian approach implies smoothing even with a uniform prior (that is, alpha = beta = 1)! Beautiful.
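For reference, the arithmetic behind the observation above, using the lecture's coin-toss setup with $n_H$ heads, $n_T$ tails, and a Beta$(\alpha, \beta)$ prior (this is standard Beta-posterior algebra, not a quote from the lecture): the posterior mean is

$$
\mathbb{E}[\theta \mid D] \;=\; \frac{n_H + \alpha}{n_H + n_T + \alpha + \beta}
\;\;\overset{\alpha=\beta=1}{=}\;\;
\frac{n_H + 1}{n_H + n_T + 2},
$$

which is exactly add-one (Laplace) smoothing, whereas the posterior mode (the MAP estimate), $(n_H + \alpha - 1)/(n_H + n_T + \alpha + \beta - 2)$, collapses back to the unsmoothed MLE $n_H/(n_H + n_T)$ when $\alpha = \beta = 1$.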

  • @sandeepreddy6295 (3 years ago)

    Makes the concepts of MLE and MAP very, very clear. We also get to know that Bayesians and frequentists both trust Bayes' rule.

  • @JohnWick-xd5zu (4 years ago)

    Thank you Kilian, you are very talented!!

  • @abunapha (5 years ago)

    Starts at 2:37

  • @saitrinathdubba (5 years ago)

    Just brilliant!! Thank you, Prof. Kilian!!!

  • @brahimimohamed261 (2 years ago)

    Someone from Algeria confirms that this lecture is incredible. You have turned complex concepts into very simple ones.

  • @sumithhh9379 (4 years ago)

    Thank you, Professor Kilian.

  • @zelazo81 (4 years ago)

    I think I finally understood the difference between frequentist and Bayesian reasoning, thank you :)

  • @mohammadaminzeynali9831 (a year ago)

    Thank you, Dr. Weinberger. You are a great lecturer, and also, the KZread algorithm subtitles your "also" as "eurozone".

  • @arjunsigdel8070 (3 years ago)

    Thank you. This is great service.

  • @KulvinderSingh-pm7cr (5 years ago)

    Made my day!! Learnt a lot!!

  • @Jeirown (3 years ago)

    When he says "basically", it sounds like "Bayesly". And most of the time it still makes sense.

  • @DavesTechChannel (4 years ago)

    Amazing lecture, best explanation of MLE vs MAP

  • @StarzzLAB (3 years ago)

    I teared up at the end as well

  • @abhinavmishra9401 (3 years ago)

    Impeccable

  • @andrewstark8107 (a year ago)

    From 30:00 pure gold content. :)

  • @jijie133 (3 years ago)

    Great!

  • @dude8309 (4 years ago)

    I have a question about how MLE is formulated when using the binomial distribution (or maybe in general?): I might be overly pedantic or just plain wrong but looking at 18:01 wouldn't it be "more correct" to say P(H | D; theta) instead of just P(D;theta)? Since we're looking at the probability of H given the Data, while using theta as a parameter?

  • @marcogelsomini7655 (2 years ago)

    48:18 loop this!! Thx Professor Weinberger!

  • @yuniyunhaf5767 (4 years ago)

    thanks prof

  • @vishchugh (4 years ago)

    Hi Kilian, while calculating the likelihood function in the example, you have also taken (nH+nT) choose (nH) into consideration. It doesn't change the optimization, but I guess it shouldn't be there, because with all samples being independent, P(Data | parameter) should just be Q^nH * (1-Q)^nT. Right?

  • @SalekeenNayeem (4 years ago)

    MLE starts at 11:40

  • @hafsabenzzi3609 (2 years ago)

    Amazing

  • @JoaoVitorBRgomes (3 years ago)

    At circa 37:28, professor, you say something along the lines of 'which parameter makes our data most likely'. Could I say, in other words, 'which parameter corresponds to this distribution of data'? But not 'which parameter most probably corresponds to this distribution'? Or neither? What confuses me is reading P(D|theta). I read it as: what is the probability of this data/dataset given that I got this theta/parameters/weights. Because when I start, I start with the data, then I try to estimate the parameters, not the opposite. Suppose I somehow have weights; then I try to discover the probability that these weights/parameters/theta belong to this dataset. Weird. Am I a Bayesian? Lol. (e.g. a logistic classification task for fraud). Kind regards!

  • @kilianweinberger698 (3 years ago)

    Yes, you may be in the early stages of turning into a Bayesian. Basically, if you treat theta as a random variable and assign it a prior distribution, you can estimate P(theta|D), i.e. what is the most likely parameter given this data. If you are a frequentist, then theta is just a parameter of a distribution and you pretend that you drew the data from exactly this distribution. You then maximize P(D;theta), i.e. which parameter theta makes my data most likely. (In practice these two approaches end up being very similar ...)
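A minimal numerical sketch of that distinction (my own illustration in Python, with made-up counts and a hypothetical Beta(alpha, beta) prior, not code from the course):

```python
# Coin-toss example: the frequentist MLE treats theta as a fixed parameter and
# maximizes P(D; theta); the Bayesian MAP treats theta as a random variable
# with a Beta(alpha, beta) prior and maximizes P(theta | D).

def mle(n_heads, n_tails):
    # argmax_theta P(D; theta) for i.i.d. coin tosses
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads, n_tails, alpha, beta):
    # argmax_theta P(theta | D): mode of the Beta(n_heads + alpha, n_tails + beta) posterior
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Hypothetical data: 2 heads, 0 tails.
print(mle(2, 0))                 # 1.0  -- overconfident after only two tosses
print(map_estimate(2, 0, 2, 2))  # 0.75 -- the prior pulls the estimate toward 0.5
```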

  • @jandraor (4 years ago)

    What's the name of the last equation?

  • @thachnnguyen (5 months ago)

    I raise my hand. Why do you assume any particular type of distribution in the discussion? What if I don't know that formula? What I do see is nH and nT. Why not work with those?

  • @utkarshtrehan9128 (3 years ago)

    MVP

  • @Klisteristhashit (4 years ago)

    xkcd comic mentioned in the lecture: xkcd.com/1132/

  • @HimZhang (2 years ago)

    In the coin toss example (lecture notes, under "True" Bayesian approach), P(heads∣D)=...=E[θ|D] = (nH+α)/(nH+α+nT+β). Can anyone explain why the last equality holds?
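For anyone else wondering: the last equality is the standard conjugate-prior calculation (written here from first principles, not quoted from the notes). The Beta prior combines with the binomial likelihood as

$$
P(\theta \mid D) \;\propto\; \underbrace{\theta^{n_H}(1-\theta)^{n_T}}_{\text{likelihood}}\;\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta}(\alpha,\beta)\text{ prior}}
\;=\; \theta^{\,n_H+\alpha-1}(1-\theta)^{\,n_T+\beta-1},
$$

so the posterior is $\text{Beta}(n_H+\alpha,\, n_T+\beta)$; and since a $\text{Beta}(a,b)$ random variable has mean $a/(a+b)$,

$$
\mathbb{E}[\theta \mid D] \;=\; \frac{n_H+\alpha}{n_H+\alpha+n_T+\beta}.
$$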

  • @sushmithavemula2498 (5 years ago)

    Hey Prof, your lectures are really good. But if you could provide some real-life applications/examples while explaining a few concepts, it would let everyone understand them better!

  • @coolblue5929 (2 years ago)

    Very enjoyable. I think a Kilian is like a thousand million, right? I got confused at the end though. I need to revise.

  • @jachawkvr (4 years ago)

    I have a question. Is P(D;theta) the same as P(D|theta)? The same value seems to be used for both in the lecture, but I recall Dr. Weinberger saying earlier in the lecture that there is a difference.

  • @kilianweinberger698 (4 years ago)

    Well, for all intents and purposes it is the same. If you write P(D|theta) you imply that theta is a random variable, enabling you to impose a prior P(theta). If you write P(D;theta) you treat it as a parameter, and a prior distribution wouldn't make much sense. If you don't use a prior, the two notations are identical in practice.
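Spelled out (standard argmax algebra, not a quote from the lecture): a prior that is constant in theta drops out of the maximization,

$$
\arg\max_\theta \, P(D \mid \theta)\,P(\theta) \;=\; \arg\max_\theta \, P(D \mid \theta),
$$

which is the same optimization as $\arg\max_\theta P(D;\theta)$; with a non-flat prior the two estimates differ.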

  • @jachawkvr (4 years ago)

    Ok, I get it now. Thank you for explaining this!

  • @imnischaygowda (a year ago)

    "nH + nT choose nH": what exactly do you mean here?

  • @prwi87 (a year ago)

    Edit: After thinking, checking, finishing the lecture, and watching a bit of the lecture after this one, I have come to the conclusion that my first explanation was wrong, as I didn't have enough knowledge. The way it is calculated is fine; where I struggled was in understanding which PDF the Professor was using. What threw me off was P(D; theta), which is the joint PDF (I know it's a PMF, but for me they are all PDFs if you put a delta function in there) of obtaining exactly the data D, because D is a realization of some random vector X, so to be more precise in notation P(D; theta) should be written as P(X = D; theta). But what the Professor meant was the PMF P(H = n_h; len(D), theta), which is a binomial distribution. Then we can calculate the MLE just as it was calculated during the lecture. But this is not the probability of getting the data D; it is the probability of observing exactly n_h heads in len(D) tosses. Then in MAP we have the conditional distribution H|theta ~ Binom(len(D), theta), written as P(H = n_h | theta; len(D)): we treat theta as a random variable but len(D) as a parameter.

    There are two problems with the explanation that starts around 18:00. Let me state the notation first. Let D be the data gathered; this data is a realization of a random vector X. n_h is the number of heads tossed in D. nCr(x, y) is the number of combinations of x choose y.

    1. The Professor writes that P(D;theta) is equal to the binomial distribution of the number of heads tossed, which is not true. A binomial distribution is determined by two parameters, the number of independent Bernoulli trials (n) and the probability of the desired outcome (p), so theta = (n, p). If we have tossed the coin n times, there is nothing we don't know about n, since we chose it; n is fixed and, most importantly, known to us. Because of that, let us denote n = len(D), and then theta = p. Now let H = the number of heads tossed; then P(H = n_h; len(D), theta) = nCr(len(D), n_h) * theta ^ n_h * (1 - theta) ^ (len(D) - n_h) is precisely the distribution that the Professor wrote. I also noticed that one person in the comments asked why we cannot write P(H|D;theta), or more precisely P(H = n_h|len(D); theta). The reason is that len(D) is not a random variable; we are the ones choosing the number of tosses, and there is nothing random about it. Note that in the notation used in that particular comment, theta is treated as a parameter, as it is written after ";".

    2. To be precise, P(X = D; theta) is a joint distribution. For example, if we had tossed the coin three times, then D = (d1, d2, d3) with d_i in {0, 1} (0 for tails and 1 for heads), and P(X = D;theta) = P(d1, d2, d3;theta). P(X = D;theta) is the joint probability of observing the data D we got from the experiment. The likelihood function is then defined as L(theta|D) = P(X = D;theta), but keep in mind that the likelihood is not a conditional probability distribution, as theta is not a random variable. The correct way to interpret L(theta|D) is as a function of theta whose value also depends on the underlying measurements D. Now, if the data are i.i.d., we can write P(X = D;theta) = P(X_1 = d1;theta) * P(X_2 = d2;theta) * ... * P(X_len(D) = d_len(D);theta) = L(theta|D). In our coin-tossing example, P(X_i = d_i;theta) = theta ^ d_i * (1 - theta) ^ (1 - d_i), where d_i is in {0, 1} (0 for tails and 1 for heads). It follows that L(theta|D) = theta ^ sum(d_i) * (1 - theta) ^ (len(D) - sum(d_i)), where sum(d_i) is simply n_h, the number of heads observed.

    And now we are maximizing the likelihood of observing the data we have obtained. Note that the way it was done during the lecture was right! But we were maximizing the likelihood of observing n_h heads in len(D) tosses, not of observing exactly the data D. Also, for anyone curious, the "true Bayesian" method that the Professor described at the end is called minimum mean-squared-error (MMSE) estimation, which aims to minimize the expected squared error between the random variable theta and some estimate of theta computed from the data random vector, g(X). To support my statements, here are the sources I used: "Foundations of Statistics for Data Scientists" by Alan Agresti (Chapter 4.2) and "Introduction to Probability for Data Science" by Stanley Chan (Chapter 8.1). Sorry for any grammar mistakes, as English is not my first language. As I'm still learning all this data science stuff I may be wrong, and I'm very open to criticism and discussion. Happy learning!
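A quick numerical check of the point above (my own sketch with made-up counts, not code from the course): the binomial form and the i.i.d. product form differ only by the factor nCr(n, n_h), which does not depend on theta, so both are maximized by the same value.

```python
# Compare the two likelihood forms on a grid of theta values.
from math import comb

n_heads, n_tails = 7, 3        # hypothetical counts, not numbers from the lecture
n = n_heads + n_tails

def binomial_form(theta):
    # P(H = n_heads; n, theta): probability of seeing n_heads heads in n tosses
    return comb(n, n_heads) * theta**n_heads * (1 - theta)**n_tails

def product_form(theta):
    # P(X = D; theta) for one particular i.i.d. sequence with n_heads heads, n_tails tails
    return theta**n_heads * (1 - theta)**n_tails

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=binomial_form))  # 0.7
print(max(grid, key=product_form))   # 0.7 -- same argmax, namely n_heads / n
```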

  • @beluga.314 (10 months ago)

    You're mixing up 'distribution' and 'density'. The notation P(d1, d2, d3;theta) is correct, but P(X = D;theta) would be wrong for a density function; you can't write it like that. But since these are also (discrete) probabilities, you can write it like that here.

  • @abhishekprajapat415 (4 years ago)

    18:19 How did that expression even come about? What is this expression even called in maths? By the way, I am a B.Tech student, so I guess I might not have studied the math behind this expression.

  • @SalekeenNayeem (4 years ago)

    Just look up the binomial distribution. That's the usual way of writing the probability of an event which follows a binomial distribution. You may also want to check the Bernoulli distribution first.
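Concretely, the expression at 18:19 is the binomial probability mass function (written here in its standard form for reference):

$$
P(H = n_H) \;=\; \binom{n_H + n_T}{n_H}\,\theta^{n_H}(1-\theta)^{n_T},
$$

the probability of seeing exactly $n_H$ heads in $n_H + n_T$ independent tosses when each toss comes up heads with probability $\theta$.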

  • @pritamgouda7294 (5 months ago)

    Can someone tell me where the lecture is in which he proves the k-nearest-neighbors result he mentions at 5:09?

  • @kilianweinberger698 (4 months ago)

    kzread.info/dash/bejne/oa2h1qmld8e6Xc4.html

  • @pritamgouda7294 (4 months ago)

    @@kilianweinberger698 Sir, I watched that lecture and its notes as well, but the notes mention the Bayes optimal classifier and I don't think it's in the video lecture. Please correct me if I'm wrong. Thank you for your reply 😊

  • @Bmmhable (4 years ago)

    At 36:43 you call P(D|theta) the likelihood, the quantity we maximize in MLE, but earlier you emphasized how MLE is about maximizing P(D; theta) and noted how you made a "terrible mistake" in your notes by writing P(D|theta), which is the Bayesian approach... I'm confused.

  • @kilianweinberger698 (4 years ago)

    Actually, it is more subtle. Even if you optimize MAP, you still have a likelihood term. So it is not that Bayesian statistics doesn't have likelihoods, it is just that it allows you to treat the parameters as a random variable. So P(D|theta) is still the likelihood of the data, just here theta is a random variable, whereas in P(D;theta) it would be a hyper-parameter. Hope this makes sense.
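Written out (the standard Bayes-rule decomposition, not a quote from the lecture):

$$
P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \;\propto\; \underbrace{P(D \mid \theta)}_{\text{likelihood}}\;\underbrace{P(\theta)}_{\text{prior}},
$$

so MAP maximizes likelihood times prior, while MLE maximizes the likelihood alone, with $\theta$ treated as a fixed parameter and written $P(D;\theta)$.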

  • @Bmmhable (4 years ago)

    @@kilianweinberger698 Thanks a lot for the explanation. Highly appreciated.

  • @logicboard7746 (2 years ago)

    Bayesian @23:30, then 32:00

  • @deepfakevasmoy3477 (4 years ago)

    12:46

  • @vatsan16 (4 years ago)

    So the trick to getting past the spam filter is to use obscure words in the English language, eh? Who would have thought xD

  • @kilianweinberger698 (4 years ago)

    Not the lesson I was trying to get across, but yes :-)

  • @vatsan16 (4 years ago)

    @@kilianweinberger698 Okay, I am now having an "omg he replied!!" moment. :D Anyway, you are a really great teacher. I have searched long and hard for a course on machine learning that covers it from a mathematical perspective. I found yours on a Friday and I have now finished 9 lectures in 3 days. Danke schön! :)

  • @kartikshrivastava1500 (2 years ago)

    Wow, apt explanation. The captions were bad, at some point: "This means that theta is no longer parameter it's a random bear" 🤣

  • @subhasdh2446 (2 years ago)

    I'm on the 7th lecture. I hope I find myself commenting on the last one.

  • @kilianweinberger698 (2 years ago)

    Don’t give up!

  • @xiaoweidu4667 (3 years ago)

    Talking about logistics and taking stupid questions from students is a major waste of this great teacher's talent.
