Machine Learning Lecture 30 "Bagging" - Cornell CS4780 SP17

Lecture Notes:
www.cs.cornell.edu/courses/cs4...

Comments: 58

  • @aakashPotter · 2 years ago

    He is the only professor I've seen who asks, "Raise hands if this makes sense". He could have asked "Raise hands if it doesn't make sense", but he knows people who didn't understand feel under-confident and are generally hesitant to ask questions. He repeats key concepts until most of the class has understood. Great respect for you, sir! And a great teaching method!

  • @astutzorr3355 · 3 years ago

    Finding your channel has helped me SO much. I really appreciate it when there is some mathematical rigour; most of the ML guys on the internet just wave their hands every time a nuanced concept comes along. Thank you.

  • @nicolasrover8803 · 2 years ago

    Professor, you are a true genius at explaining this stuff. Your lectures are so much fun and motivating!

  • @Saganist420 · 4 years ago

    Fantastic lecture!

  • @sumitsingh-yz3vm · 3 years ago

    Thank you for the amazing lecture series.

  • @vivekupadhyay6663 · 3 years ago

    Thank you, Sir, for this clear and concise explanation of the topic.

  • @salmaabdelmonem7482 · 3 years ago

    Thanks a million, Prof. Kilian 😀

  • @jenishah9825 · 1 year ago

    Most people are viewing these lectures because they came across one video and now they want to learn everything Prof Kilian has to share. But if this were offered as a certified online course, more people would know about it and could officially take time out to complete it.

  • @kilianweinberger698 · 1 year ago

    Actually, it is: online.cornell.edu/machine-learning/

  • @varunjindal1520 · 3 years ago

    Awesome explanation..!!

  • @hdang1997 · 4 years ago

    "I ask my uncle who's a magician".

  • @rohit2761 · 2 years ago

    Hello everyone, including Prof. Kilian. First of all, thank you from the bottom of my heart for uploading these gem video lectures online; I absolutely loved them. Can anyone, including Dr. Kilian, recommend a deep learning course like this one? It has to have the same quality as this course. I'd be highly obliged.

  • @jayantpriyadarshi9266 · 2 years ago

    I can recommend one.

  • @rohit2761 · 2 years ago

    @@jayantpriyadarshi9266 Please do tell.

  • @jayantpriyadarshi9266 · 2 years ago

    @@rohit2761 Search for Deep Learning by Mitesh Khapra on YouTube.

  • @rohit2761 · 2 years ago

    @@jayantpriyadarshi9266 Thanks bro, I've already done that course. Please suggest some other one.

  • @jayantpriyadarshi9266 · 2 years ago

    @@rohit2761 There is a Deep Learning 2 course by him as well, which covers GANs and more. If you have done this much, I think you can read papers.

  • @aragasparyan8295 · 3 years ago

    Thanks for the lecture. I have a question about showing that bootstrapped data sets are drawn from the original distribution P. In your lecture notes you assume that P is discrete; is that an assumption that simplifies the proof of P(X=x) = Q(X=x)? In other words, can we show that P and Q are identical distributions when P is not discrete (i.e., in the general case)?

  • @nathannguyen2041 · 10 days ago

    What defines the split D? I might have to go back and watch the Decision Tree video if that was covered there.

  • @DommageCollateral · 9 months ago

    I love this teaching style. I would love to see TU Berlin create tutorials and lecture notes like these, rather than copying snippets from universities all over the world and just pasting them together. A lot of the time they use code from huge projects but only need a small part of it for the task; they don't explain the rest of the code, yet you need to understand the whole thing to fill the gaps. So you end up with a project that is quite advanced but has parts left out, and on top of that you need to translate everything, including the names of the math symbols, since different symbols are used in English, which you constantly have to keep in mind. They don't even create their own content for the homework, and that sucks.

  • @jorgeestebanmendozaortiz873 · 2 years ago

    For the impatient, the lecture starts at 2:55.

  • @broccolee4328 · 2 years ago

    Hi Dr. Kilian, would you recommend pruning before bagging, or does it not matter?

  • @yannickpezeu3419 · 3 years ago

    Thanks

  • @pmdl · 3 years ago

    @prof, great lecture. I especially liked how you started from k-d trees as a data structure and arrived at decision trees. By the way, how do you show that the D_i (the subsamples) come from the same distribution as D? Also, in random forests, k = sqrt(d): what would be a justification / intuition for that?

  • @kilianweinberger698 · 3 years ago

    Yes, that should be in the notes. The D_i are sampled from the same distribution as D, but they are *not* independent.

  • @pmdl · 3 years ago

    @@kilianweinberger698 Yes, I found that in the notes later. What about the intuition for the sqrt(d) feature subsample in random forests, though? Not sure if I missed that too!

  • @pmdl · 3 years ago

    By the way prof, the whole video series is awesome! I still have to finish up kernels and Gaussian processes, but the rest has been wonderful!

  • @ChiragSinghishere · 4 years ago

    The lecture notes say: "before each split randomly subsample k ≤ d features (without replacement) and only consider these for your split. (This further increases the variance of the trees.)" I think it should be "decreases the variance".

  • @kilianweinberger698 · 4 years ago

    No, it is correct. It is a little confusing, because generally fewer features would mean you decrease the variance of your classifier. However, by reducing the features for each individual split, you build very different trees - so this *increases* the variance *across* the trees, which you then reduce again by averaging.
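
    To make this concrete, here is a minimal sketch in Python/numpy (not code from the lecture; the function name and the choice of the Gini criterion are my own) of a split search that only considers a random subset of k ≤ d features, which is the mechanism that makes the individual trees of a random forest differ so much from one another:

        import numpy as np

        def best_split_with_feature_subsampling(X, y, k, rng=np.random.default_rng(0)):
            """Best (feature, threshold) found while searching only k randomly chosen features."""
            n, d = X.shape
            features = rng.choice(d, size=k, replace=False)   # subsample k of the d features, without replacement
            best_feature, best_threshold, best_score = None, None, np.inf

            def gini(labels):
                _, counts = np.unique(labels, return_counts=True)
                p = counts / counts.sum()
                return 1.0 - np.sum(p ** 2)

            for f in features:
                for t in np.unique(X[:, f])[:-1]:             # candidate thresholds along feature f
                    mask = X[:, f] <= t
                    left, right = y[mask], y[~mask]
                    score = (len(left) * gini(left) + len(right) * gini(right)) / n   # weighted impurity
                    if score < best_score:
                        best_feature, best_threshold, best_score = f, t, score
            return best_feature, best_threshold

    Because every node of every tree sees a different random feature subset, the trees end up quite different from each other; that is the variance across trees that the final averaging step then reduces.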

  • @ChiragSinghishere · 4 years ago

    @@kilianweinberger698 Thanks a lot!

  • @astutzorr3355 · 3 years ago

    Can someone help me understand why the bagging datasets are not independent? Are we not drawing them randomly? What would be an example that is independent, then?

  • @kilianweinberger698 · 3 years ago

    Well, imagine in the extreme case your original training data set only has a single element {x}, drawn from some distribution P. Now you apply bagging on this data set of size one, and you re-sample m new data sets from it. But because it only contains one element, they are all identical {x},{x},{x},...,{x}. So clearly they are not independent. If I tell you what one data set is, you know all the others. Hope this helps.

  • @roniswar · 3 years ago

    I really enjoy your lectures, Prof! Could you please elaborate on your statement that the sets D_1,..,D_m are not i.i.d.? Since you sample with replacement, I do not see any reason why knowing the previous sets D_1,..,D_(i-1) would help you predict the set D_i (except maybe for the not very interesting case where D has only a single element, in which case the whole process is doomed anyway...).

  • @kilianweinberger698 · 3 years ago

    Let’s say the elements in D are drawn i.i.d. from P. Given P, if you take the first half of the D, you gain no information about the second half. In contrast, if you now create data sets D_1, D_2,..., D_M from D (through sampling with replacement), these data sets are not i.i.d. For example, imagine P(z)>0 i.e. there is a non-zero chance that we sample a specific element z from P. But imagine that just by chance, z is not in D. If M is large enough, and you observe that z is not in D_1, not in D_2, not in D_3, ... , not in D_{M-1} you can conclude that z was almost certainly not in D, and therefore is extremely unlikely to be in D_M. In other words, the data sets D_1,..,D_{M-1} tell you something about D_M - and they are not independent. Hope this helps. Of course you can also simply look at the extreme case where D={z}. Clearly D_1 tells you something about D_2. Hope this helps.
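
    A small simulation of exactly this argument (my own sketch in Python/numpy, not from the lecture; the distribution and sizes are made up): conditioned on a value z being absent from the first M-1 bootstrap sets, it almost never shows up in the M-th one either, because all bootstraps are drawn from the same underlying D.

        import numpy as np

        rng = np.random.default_rng(0)
        support = np.arange(20)        # P: 20 equally likely values
        z, M, trials = 0, 10, 100_000

        cond_hits, cond_total, uncond_hits = 0, 0, 0
        for _ in range(trials):
            D = rng.choice(support, size=20, replace=True)                       # training set drawn from P
            boots = [rng.choice(D, size=20, replace=True) for _ in range(M)]     # bootstraps of D
            uncond_hits += z in boots[-1]
            if all(z not in b for b in boots[:-1]):                              # z missing from D_1..D_{M-1}
                cond_total += 1
                cond_hits += z in boots[-1]

        print("P(z in D_M)                         ~", uncond_hits / trials)
        print("P(z in D_M | z not in D_1..D_{M-1}) ~", cond_hits / cond_total)

    The second probability comes out far smaller than the first, so observing D_1,...,D_{M-1} really does tell you something about D_M: the bootstraps are identically distributed, but not independent, as samples from P.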

  • @roniswar · 3 years ago

    @@kilianweinberger698 Thanks a lot for the specific answer!! I understand that for outliers (usually the rarest elements, but maybe also the common ones) you can learn something about D_M given D_1, .., D_{M-1}. It is weird that even Wikipedia (perhaps not the most accurate source of truth, but still usually pretty reliable) states that: "This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling" (en.wikipedia.org/wiki/Bootstrap_aggregating).

  • @kilianweinberger698 · 3 years ago

    Oh I see the confusion. The subsets are dependent samples from the original distribution P, but they are independent bootstraps from the original data set. The bootstraps are even independent given D (e.g. given D={x}), as D_1 tells you nothing about D_2 that you couldn’t already infer from D. Hope this helps.

  • @roniswar · 3 years ago

    @@kilianweinberger698 Yes, that's exactly what the confusion was!! Thanks for clearing it up, Prof!

  • @71sephiroth · 4 years ago

    Is it a property of entropy that it's additive? I can't seem to figure out intuitively why we use the weighted sum of H(Sr) and H(Sl).

  • @kilianweinberger698 · 4 years ago

    If you didn’t use the weighted sum and instead just the sum H(Sr)+H(Sl), you would end up splitting off a single element. That would make one side completely pure (entropy 0), but it wouldn’t help you much to improve your classifier. Hope this helps.

  • @71sephiroth · 4 years ago

    @@kilianweinberger698 Ahh, thanks! But what is wrong with the 'regular average', i.e. weights = 1/2? Edit: for example, if the target variable is [1, 0, 1, 0, 0, 0, 0, 1], then a 1:(n-1) split will 'always' have maximal information gain compared to 2:(n-2), 3:(n-3), etc., whether we use the weighted average, the 'normal' average, or no average at all to calculate the information gain. I think in real life it's not very likely that we will have a feature that splits the target in a 1:(n-1) fashion, but I'm still curious.

  • @haizhoushi8287 · 4 years ago

    Both the Gini impurity and entropy are concave functions, and the weighted sum is exactly the form that appears in the defining property of a concave function.

  • @71sephiroth · 4 years ago

    @@haizhoushi8287 Can you please elaborate on that?

  • @ousam2010 · 3 years ago

    @@71sephiroth Actually, I think you are right that it is a weighted sum, i.e., weighted by the sizes of Sr and Sl. Let c(X) denote the class (c1, c2, ..., cK) a sample X belongs to. The weighted sum then approximates H(Sl)·Pr(X in Sl) + H(Sr)·Pr(X in Sr) = H(c(X) | X in Sl)·Pr(X in Sl) + H(c(X) | X in Sr)·Pr(X in Sr) = H(c(X) | s(X)), where s(X) indicates whether X is in the left or the right node, e.g. s(X)=1 if X in Sl and s(X)=0 if X in Sr. Since the conditional entropy H(c(X) | s(X)) can be viewed as the amount of uncertainty about c(X) that remains once s(X) is known, it is very natural to minimize H(c(X) | s(X)) to get the optimal split.
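
    A tiny numeric check of this thread's point (my own sketch with made-up labels, not from the lecture): for the label sequence below, the unweighted sum H(Sl)+H(Sr) scores "peel off a single element" as the better cut, while the weighted sum prefers the balanced cut that reduces the overall uncertainty more.

        import numpy as np

        def H(labels):
            """Entropy (base 2) of a label vector."""
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return float(-np.sum(p * np.log2(p)))

        y = np.array([1, 0, 0, 0, 1, 1, 1, 0])   # labels in the order induced by some feature

        for name, cut in [("peel off one element", 1), ("balanced cut", 4)]:
            Sl, Sr = y[:cut], y[cut:]
            unweighted = H(Sl) + H(Sr)
            weighted = (len(Sl) * H(Sl) + len(Sr) * H(Sr)) / len(y)
            print(f"{name:22s} unweighted={unweighted:.3f}  weighted={weighted:.3f}")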

  • @ugurkap · 4 years ago

    I think the running time of the naive implementation may be O(dn^2) instead of O(dkn^2). We have to try n thresholds in each of the d dimensions, so a factor of dn^2 is unavoidable, but I don't believe we have to repeat the rest of the work k times. Using a loop such as (with is_class_1 / is_class_2 as 0/1 indicators of the label of point i):

        left_pt = 0; right_pt = 0
        left_class_1 = 0; left_class_2 = 0
        right_class_1 = 0; right_class_2 = 0
        for i = 1 to n:
            is_class_1 = 1 if y_i == class_1 else 0
            is_class_2 = 1 if y_i == class_2 else 0
            if x_i[d] < some_threshold:
                left_pt += 1
                left_class_1 += is_class_1
                left_class_2 += is_class_2
            else:
                right_pt += 1
                right_class_1 += is_class_1
                right_class_2 += is_class_2

    Effectively, this calculates how many data points lie to the left and to the right of the threshold. It also gives the class fractions p_k, since p_left_class_1 = left_class_1 / left_pt. Is there any reason we should also go over the labels with another loop?

  • @aayushchhabra7465 · 4 years ago

    I was thinking the same. We can make one pass through the entire dataset and keep running counts: one for the left side, one for the right side, and one per class on each side. As soon as we encounter a point, we increase either the left or the right counter and simultaneously update the corresponding class counter on that side. I believe that removes k from the algorithmic complexity, as you suggested.
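
    A runnable version (my own sketch, with hypothetical toy data) of the single-pass counting idea described in the two comments above: one sweep over the points fills the per-class counts for the left and right child, and the class fractions p_k then follow directly, with no extra loop over the k labels per threshold.

        import numpy as np

        def split_counts(x, y, threshold, num_classes):
            """One pass over feature values x and integer labels y for a fixed threshold."""
            left_counts = np.zeros(num_classes, dtype=int)
            right_counts = np.zeros(num_classes, dtype=int)
            for xi, yi in zip(x, y):
                if xi <= threshold:
                    left_counts[yi] += 1      # update the class counter on the left side
                else:
                    right_counts[yi] += 1     # ... or on the right side
            p_left = left_counts / max(left_counts.sum(), 1)    # class fractions p_k (guard empty child)
            p_right = right_counts / max(right_counts.sum(), 1)
            return left_counts, right_counts, p_left, p_right

        x = np.array([0.2, 0.5, 0.7, 0.9])    # hypothetical feature values
        y = np.array([0, 0, 1, 1])            # hypothetical labels
        print(split_counts(x, y, threshold=0.6, num_classes=2))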

  • @alimamdouh1764 · 3 years ago

    Why don't we solve the problem of variance by just not insisting on every point being correctly classified? Why don't we just keep branching until a certain threshold and then do a majority vote?

  • @kilianweinberger698 · 3 years ago

    Indeed, you can do that. If you enforce a maximum leaf-node size (e.g. stop splitting once fewer than a certain number of points fall into a node), you also limit the variance, but it introduces another parameter you have to fiddle with.
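
    A minimal self-contained sketch (my own code, not Prof. Weinberger's; the exhaustive split search and the min_leaf default are assumptions) of that stopping criterion: stop splitting once a node holds fewer than min_leaf points or is pure, and predict by majority vote at that leaf.

        import numpy as np
        from collections import Counter

        def gini(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p ** 2)

        def best_split(X, y):
            """Exhaustive split search over all features and thresholds (for illustration only)."""
            n, d = X.shape
            best = (None, None, np.inf)
            for f in range(d):
                for t in np.unique(X[:, f])[:-1]:
                    m = X[:, f] <= t
                    score = (m.sum() * gini(y[m]) + (~m).sum() * gini(y[~m])) / n
                    if score < best[2]:
                        best = (f, t, score)
            return best[0], best[1]

        def build_tree(X, y, min_leaf=5):
            if len(y) < min_leaf or len(np.unique(y)) == 1:          # early-stopping criterion
                return {"leaf": Counter(y.tolist()).most_common(1)[0][0]}
            f, t = best_split(X, y)
            if f is None:                                            # no usable split left
                return {"leaf": Counter(y.tolist()).most_common(1)[0][0]}
            m = X[:, f] <= t
            return {"feature": f, "threshold": t,
                    "left":  build_tree(X[m], y[m], min_leaf),
                    "right": build_tree(X[~m], y[~m], min_leaf)}

    The price, as the reply above notes, is that min_leaf becomes one more parameter to fiddle with.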

  • @alimamdouh1764 · 3 years ago

    @@kilianweinberger698 It is the first time I have heard about "fiddling parameters". Could you please recommend a resource where I can read about it?

  • @sandeshhegde9143 · 5 years ago

    Bagging starts at 33:52: kzread.info/dash/bejne/YoB2k8WyYtapiMY.html

  • @marcogelsomini7655 · 2 years ago

    43:38 haha, you are number one

  • @rajatpatel4657 · 3 years ago

    Why is the complexity of the first approach O(d*k*N^2)? We can compute all p_k first and then evaluate H(SL) by summing over all K class labels. Therefore we can do this in O(d*N*(N+K)), can't we?

  • @StarzzLAB · 3 years ago

    I am thinking the same.

  • @husamalsayed8036 · 3 years ago

    Exactly, that's what I was thinking about.

  • @shrishtrivedi2652 · 3 years ago

    Regression 21:40

  • @umangsharma6908 · 3 years ago

    The person who gave the thumbs down (dislike) to this video is currently jobless, wasting his money on data science diplomas/courses, and consuming bullshit content :P