Machine Learning Lecture 8 "Estimating Probabilities from Data: Naive Bayes" - Cornell CS4780 SP17

Cornell class CS4780. (Online version: tinyurl.com/eCornellML )
Lecture Notes: www.cs.cornell.edu/courses/cs4...

Comments: 47

  • @prashantsolanki007
    4 years ago

    Best series of lectures on ML, with the perfect combination of math and statistical reasoning for the algorithms.

  • @siyuren6830
    5 years ago

    This is a really good course on ML.

  • @venkatasaikumargadde1489
    4 years ago

    The illustration of the effect of the prior distribution is awesome.

  • @rajeshs2840
    4 years ago

    Wow!!! Human Brain is Bayesian.... Amazing Explanation.. Hats off to Prof. Kilian ...

  • @goksuyamac2908
    3 years ago

    This lecture series is such a great service to the community! The pace is perfect and the practical examples are on point, not lost in the theory. Many many thanks.

  • @aydinahmadli7005
    3 years ago

    I was speechless and extremely satisfied when I heard you telling the oven-cake story. It made complete sense 😄

  • @abhinavmishra9401
    3 years ago

    It's 12:56 in Germany and I cannot go to sleep because of this treasure I stumbled upon tonight! You, Sir, are beautiful.

  • @SohailKhan-zb5td
    2 years ago

    Thanks a lot Prof. This is indeed one of the best lectures on the topic :)

  • @user-me2bw6ir2i
    1 year ago

    Thanks for this amazing lecture! I had a course on statistics at my university, but you gave me a much better understanding.

  • @rohit2761
    1 year ago

    What a beautiful lecture series. Please upload more videos on other topics

  • @xiaoweidu4667
    3 years ago

    very insightful lecture

  • @doyourealise
    3 years ago

    21:30 is the start of Naive Bayes; I'll have to continue from there tomorrow. Good night :) Btw, great lecture, Kilian sir.

  • @DommageCollateral
    9 months ago

    He looks like a real coder to me: no food, no sleep, no break, all optimized.

  • @sdsunjay
    4 years ago

    This lecture series is great. However, it is usually impossible to hear the students' questions. If Dr. Weinberger could repeat each question before answering it, or annotate the video with the student's question, that would really help.

  • @kilianweinberger698
    4 years ago

    Yes, will definitely do so the next time I record the class.

  • @vishchugh
    4 years ago

    Hi Kilian, I have a question. I've often heard people talk about making your dataset balanced, which basically means (taking the spam classification problem as an example) that the number of spam instances should be about the same as the number of non-spam instances. But when you're applying algorithms like Naive Bayes, which take the prior into consideration, shouldn't we just leave the data unbalanced, since that actually captures how the emails arrive in the inbox in real life? Also, thank you for the amazing lectures!

  • @kilianweinberger698
    4 years ago

    The issue with highly imbalanced data is that it is just tricky to optimize the classifier. E.g., if 99% of your data is class +1 and 1% is class -1, then just predicting +1 for all samples gives you 99% accuracy, but you haven't actually learned anything. My typical approach is to re-balance the data set while still keeping all samples, by assigning smaller weights to the more common classes, so that the weights add up to be the same for all classes.
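
A minimal sketch of that re-weighting idea (my own illustration in plain NumPy; the exact scheme used in practice may differ):

    import numpy as np

    def balanced_sample_weights(y):
        # weight each sample inversely to its class frequency, so that the
        # total weight contributed by every class is identical
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
        return np.array([per_class[label] for label in y])

    # toy example: 99% of samples are class -1, 1% are class +1
    y = np.array([-1] * 99 + [+1])
    w = balanced_sample_weights(y)
    print(w[y == -1].sum(), w[y == +1].sum())   # both classes now contribute 50.0

Many learners accept per-sample weights like these directly (e.g. a sample_weight argument in scikit-learn's fit methods), so no samples have to be thrown away.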

  • @GraysNHearts
    2 years ago

    At 15:00, when you provided a prior (1H, 1T), the MLE started out all over the place and then slowly converged to 0.7, which was not the case when there was no prior. Why would having a prior impact the MLE the way it did?

  • @mavichovizana5460
    2 years ago

    Hi Prof. Kilian, thanks for the great lecture! At 6:56, for P(Y|X=x, D)=∫P(y|θ)P(θ|D)dθ, could you explain why `X=x` is omitted in the body of the integral for the true Bayesian approach? Shouldn't it be P(Y|X=x, D)=∫P(y|θ, X=x, D)P(θ|D, X=x)dθ?

  • @kilianweinberger698
    1 year ago

    Oh yes! Thanks for pointing this out.
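
For anyone following along, the fully conditioned version of that integral (under the usual assumptions that the label depends on D only through θ, and that the posterior over θ is computed from D alone, not from the test point x) would read:

    P(Y | X=x, D) = ∫ P(y | X=x, θ) P(θ | D) dθ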

  • @DavidKim2106
    3 years ago

    Any plans to teach this course again? I might just wait to take it at Cornell if you plan to in the next year or so... Until then, thank you so much! This video series is a gem :)

  • @kilianweinberger698
    3 years ago

    Fall 2021. But don't be too disappointed when you realize I re-use the same jokes every year :-)

  • @mohammaddindoost2634
    1 year ago

    Does anyone know what the projects were, or where we can find them?

  • @amarshahchowdry9727
    3 years ago

    Hi Kilian, I have a question; I don't know if this makes sense. How do we write P(D;theta) as P(y|x;theta) when we go on to derive the cost functions for linear regression and logistic regression? Shouldn't P(D;theta) = P([x1,...,xN],[y1,...,yN]) or P({x1,y1},...,{xN,yN})? Also, after we find the theta for P(X,Y), we use the same theta for P(y|x;theta). Does this mean that P(X,Y) and P(Y|X) are parameterized by the same theta, or, since they are related by Bayes' theorem, can we write it this way? This could be a very dumb question, but I am confused! Also, thank you for the amazing lectures!

  • @kilianweinberger698
    3 years ago

    Yes, good question! Maybe take a look at these notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html. Basically, the idea is that you want to model P(y,x), but you really only model the conditional distribution P(y|x;theta), and you assume P(x) is some distribution that is given (but that you don't model) and, _importantly_, is independent of theta. E.g., P(x) is the distribution over all emails that are sent, and P(y|x) is the probability that a _given_ email x is spam or not spam (y). Then P(y,x;theta) = P(y|x;theta) P(x;theta) = P(y|x;theta) P(x), because x does not depend on theta. Now if you take the log and optimize the log-likelihood with respect to theta, you will realize that P(x) is just a constant that has no influence on your choice of theta, so you can drop it. Hope this helps.
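
Written out, the step where P(x) drops from the optimization (using the assumption above that P(x) does not depend on theta) is:

    argmax_theta log P(y, x; theta)
      = argmax_theta [ log P(y | x; theta) + log P(x) ]
      = argmax_theta log P(y | x; theta),

since log P(x) is an additive constant with respect to theta.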

  • @amarshahchowdry9727
    3 years ago

    @kilianweinberger698 Thank you so much for the reply. I understand, but I would still like to clarify a few things to bolster my understanding. Can we look at it this way:
    1. If we consider a joint P(x,y;theta), and the marginal of this joint, p(x;theta), does not depend on theta, we can write it as p(x). So arg-maxing over the product of conditional probabilities will do the job. Is this right? This raises another question: if the above is correct and the joint depends on theta, won't the marginal also depend on theta? The marginal P(X), derived by integrating P(X,Y;theta) over all values of Y, should also be affected by theta. And if we have P(X), we can simply find P(X=xi).
    2. Could it be that, since we are doing discriminative learning, we just ASSUME P(X) to be given and independent of theta, which allows us to model using conditional probabilities?
    3. Or am I looking at it in a very different and WRONG way: P(X) is independent of theta since our xi's are sampled from the original probability distribution. Is this what you mean when you say that P(x) comes from the original distribution P(X,Y)? But by this logic, won't P(yi) also be independent of theta, and accordingly can we write P(x,y;theta) = p(x|y;theta)p(y)?
    I honestly feel that I have trapped myself in a ditch with these doubts, for something as trivial but as important as MLE. Help will be highly appreciated!

  • @kilianweinberger698
    3 years ago

    In generative learning, we model P(Y,X), so P(X) would also depend on theta. In discriminative learning, we only model P(y|X), and P(X) does *not* depend on theta. However, P(Y,X) still depends on theta, because P(Y,X)=P(Y|X;theta)P(X)

  • @sekfook97
    3 years ago

    @kilianweinberger698 Hi, I see theta as a parameter of a distribution, for example the mean and variance of a Gaussian. If you have time, would you explain what theta is in this case (the spam email classifier)?

  • @aibada6594
    1 year ago

    @sekfook97 Theta is the probability of the data being either spam or non-spam; this is similar to the theta in the coin-toss example.

  • @DavesTechChannel
    4 years ago

    For an image, the NB assumption should not work, since pixel values seem strongly related to each other.

  • @kilianweinberger698
    4 years ago

    Yes, but that also holds for words. Typically raw pixels are a bad representation of images, if you don't extract features (e.g. as in a convolutional neural network). If you were to extract SIFT or HOG features from the image and use them as input, the NB assumption could be more valid.

  • @compilations6358
    3 years ago

    How did we get the equation P(y|x) = ∫ p(y|theta) p(theta|D) dtheta? Shouldn't it be P(y|x, D) = ∫ p(y|x, D, theta) p(theta|D) dtheta?

  • @JoaoVitorBRgomes
    3 years ago

    P(Data) is not all my data, right? It is the data that I am interested in, e.g. getting heads (not heads and tails). Does this apply to real datasets, like a house-prediction dataset?

  • @kilianweinberger698
    3 years ago

    Yes, it is a little sloppy notation. In discriminative settings with i.i.d. data, P(Data)=\prod_i P(y_i | x_i)

  • @amarshahchowdry9727
    3 years ago

    I don't know why, but my reply under my original comment is not visible when watching this video without signing in, so I am reposting it in the main comment section as well. Thank you so much for the reply, Kilian. I understand, but I would still like to clarify a few things to bolster my understanding. Can we look at it this way:
    1. If we consider a joint P(x,y;theta), and the marginal of this joint, p(x;theta), does not depend on theta, we can write it as p(x). So arg-maxing over the product of conditional probabilities will do the job. Is this right? This raises another question: if the above is correct and the joint depends on theta, won't the marginal also depend on theta? The marginal P(X), derived by integrating P(X,Y;theta) over all values of Y, should also be affected by theta. And if we have P(X), we can simply find P(X=xi).
    2. Could it be that, since we are doing discriminative learning, we just ASSUME P(X) to be given and independent of theta, which allows us to model using conditional probabilities?
    3. Or am I looking at it in a very different and WRONG way: P(X) is independent of theta since our xi's are sampled from the original probability distribution. Is this what you mean when you say that P(x) comes from the original distribution P(X,Y)? But by this logic, won't P(yi) also be independent of theta, and accordingly can we write P(x,y;theta) = p(x|y;theta)p(y)?
    I honestly feel that I have trapped myself in a ditch with these doubts, for something as trivial but as important as MLE. Help will be highly appreciated!

  • @kilianweinberger698
    3 years ago

    Looks like YouTube flagged your comment as spam :-/ I think your confusion comes from what exactly the modeling assumptions are. Basically, your (2) is right: in discriminative learning, we assume that P(X) is given by mother nature, but we model P(Y|X;theta) with some function (which depends on theta). So when we try to estimate theta, we maximize the likelihood P(y_1,...,y_n | x_1,...,x_n; theta). This can then be factored into a product over the individual pairs, because the different (y_i,x_i) pairs are independently sampled. In generative learning, things are different: there we model P(Y,X;theta').
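
As a concrete illustration of that factorization (my own sketch, using a logistic-regression-style model for P(y|x;theta); this is not necessarily the model used in the lecture):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def conditional_log_likelihood(theta, X, y):
        # log P(y_1,...,y_n | x_1,...,x_n; theta) = sum_i log P(y_i | x_i; theta),
        # because the (x_i, y_i) pairs are sampled i.i.d.; here we use
        # P(y=+1 | x; theta) = sigmoid(theta^T x) with labels y in {-1, +1}
        return np.sum(np.log(sigmoid(y * (X @ theta))))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))              # inputs handed to us by "mother nature" P(X)
    y = np.array([1, -1, 1, 1, -1])
    print(conditional_log_likelihood(np.zeros(3), X, y))   # = 5 * log(0.5)

Maximizing this sum over theta never touches P(X), which is exactly why the discriminative objective can ignore it.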

  • @mehmetaliozer2403
    3 years ago

    Thanks for this amazing series. Could you share the script at 11:00? Regards.

  • @kilianweinberger698
    3 years ago

    I can give you python code:

  • @kilianweinberger698
    3 years ago

    import matplotlib.pyplot as plt
    import numpy as np
    from pylab import *
    from matplotlib.animation import FuncAnimation

    N=10000;
    P=0.6;  # true probability
    M=0;    # number of smoothed examples
    Q=0.5;  # prior
    XX=0

    def onclick(event):
        global P,N,Q,XX,M
        cla()
        if event.x0:
            title('MAP %1.2f' % Q)
            color='r'
        else:
            title('MLE')
            color='b'
        counts=cumsum(rand(N)   # (the comment is cut off here)
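
Since that comment is cut off, here is a minimal, non-interactive sketch of the same MLE vs. MAP coin-flip demo from around 11:00 (a reconstruction under my own assumptions, not the original script):

    import numpy as np
    import matplotlib.pyplot as plt

    N = 10000   # number of coin flips
    P = 0.6     # true probability of heads
    M = 10      # number of hallucinated (pseudo-count) flips for the MAP estimate
    Q = 0.5     # prior guess for P(heads)

    flips = (np.random.rand(N) < P).astype(float)   # 1 = heads, 0 = tails
    heads = np.cumsum(flips)                        # running number of heads
    n = np.arange(1, N + 1)                         # running number of flips

    mle = heads / n                       # maximum likelihood estimate after n flips
    map_est = (heads + M * Q) / (n + M)   # MAP estimate with M pseudo-flips of bias Q

    plt.semilogx(n, mle, 'b', label='MLE')
    plt.semilogx(n, map_est, 'r', label='MAP (M=%d, Q=%.1f)' % (M, Q))
    plt.axhline(P, color='k', linestyle='--', label='true P')
    plt.xlabel('number of coin flips')
    plt.ylabel('estimate of P(heads)')
    plt.legend()
    plt.show()

With the pseudo-flips, the early MAP estimates are pulled toward the prior Q, so the MAP curve starts out much smoother than the MLE curve before both converge to the true P.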

  • @mehmetaliozer2403
    3 years ago

    @@kilianweinberger698 awesome, thanks a lot :)

  • @massimoc7494
    3 years ago

    I have a doubt: I thought that if you have P(X=x | Y=y) and you say that X and Y are independent variables, then P(X=x | Y=y) = P(X=x) (I found this on Wikipedia too: en.wikipedia.org/wiki/Conditional_probability_distribution , search for "Relation to independence"). So my question is: why did you write P(X=x | Y=y) = ∏_{alpha=1}^{d} P(X_alpha = x_alpha | Y = y)? Maybe it's because inside P(X=x | Y=y), X and x are vectors, so if you factored it out you would write P(X_1 = x_1, X_2 = x_2, ..., X_d = x_d | Y = y). But then why can't I write ∏_{alpha=1}^{d} P(X_alpha = x_alpha) instead of ∏_{alpha=1}^{d} P(X_alpha = x_alpha | Y = y)?

  • @kilianweinberger698
    3 years ago

    Oh, the assumption is not that X and Y are independent but that the different dimensions of X are independent *given* Y. So call the first dimension of X to be X1 and the second X2. Then we get P(X|Y)=P(X1,X2|Y)=P(X1|Y)P(X2|Y)
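
A tiny numeric sketch of that conditional-independence factorization, with two binary features and made-up numbers:

    # P(X_alpha = 1 | Y) for two features and the class prior P(Y); all values invented
    p_x1_given_y = {'spam': 0.8, 'ham': 0.1}
    p_x2_given_y = {'spam': 0.3, 'ham': 0.4}
    p_y = {'spam': 0.2, 'ham': 0.8}

    def posterior(x1, x2):
        # naive Bayes assumption: P(X1, X2 | Y) = P(X1 | Y) * P(X2 | Y), then Bayes' rule
        scores = {}
        for y in p_y:
            px1 = p_x1_given_y[y] if x1 == 1 else 1 - p_x1_given_y[y]
            px2 = p_x2_given_y[y] if x2 == 1 else 1 - p_x2_given_y[y]
            scores[y] = px1 * px2 * p_y[y]
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    print(posterior(1, 0))   # approximately {'spam': 0.7, 'ham': 0.3}

Note that P(X1|Y) still differs between the two classes, so X1 and Y are not independent; the assumption is only that the features are independent of each other once the class Y is fixed.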

  • @massimoc7494
    3 years ago

    @@kilianweinberger698 Ty

  • @JoaoVitorBRgomes
    3 years ago

    But how do you know if my prior makes sense?!

  • @kilianweinberger698
    3 years ago

    You don't :-) But that's always the case with assumptions. You also assume a distribution ...

  • @erenarkangil4243
    4 years ago

    This guy definitely needs an expectorant.

  • @gwonchanyoon7748
    1 month ago

    I am wondering teenager hahaha!