Lecture 05 - Training Versus Testing
Science & Technology
Training versus Testing - The difference between training and testing in mathematical terms. What makes a learning model able to generalize? Lecture 5 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - itunes.apple.com/us/course/ma... and on the course website - work.caltech.edu/telecourse.html
Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, creativecommons.org/licenses/b...
This lecture was recorded on April 17, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.
Comments: 116
I finally understood concepts that were "taught" in my course. Prof. Abu-Mostafa - you're a star!
Video is from April 17, 2012…10 years ago. These videos are still great. I would love to see either current ones, or even better, another advanced class.
@sendhan6454
2 years ago
same.
It clearly shows from his teaching that Prof. Abu Mostafa is in love with the subject. The way he teaches it, having fun in the process lets the student have fun too and in turn fall in love with the subject !
@amodamatya
2 years ago
True that
This Prof is the best in ML and Data Science. I have watched all the touted lectures from fancy schools, and none matches this.
Till now I have not seen machine learning lectures better than these. In fact he is very good at choosing his words.
@user-os9zl5gu5h
9 years ago
Wazih Ahmad I totally agree with you
He explains the topics so passionately. I have a smile on my face through the courses 😊 Thank you so much for this amazing course.
I have not seen a better professor than Prof. Yaser! He explains concepts with such ease and with so much depth! Simply awesome :) Thanks to Caltech for this course!
Puzzle: without any constraint, the 3-tuple (x1, x2, x3) could have at most 2^3 = 8 dichotomies. Now add a constraint: no pair, e.g. (x1, x2), can have more than 3 dichotomies. Say the 3 allowed dichotomies on a pair are (0, 0), (0, 1), (1, 0), dropping (1, 1). Applying this to every pair, the 3-tuple (x1, x2, x3) can have at most 4 dichotomies:
(0, 0, 0) - valid
(0, 0, 1) - valid
(0, 1, 0) - valid
(0, 1, 1) - invalid: it would give (x2, x3) 4 dichotomies
(1, 0, 0) - valid
(1, 0, 1) - invalid: it would give (x1, x3) 4 dichotomies
(1, 1, 0) - invalid: it would give (x1, x2) 4 dichotomies
(1, 1, 1) - invalid: it would give (x1, x2) 4 dichotomies
Best lectures in the Milky Way Galaxy...! Love Him.
"We are in business!!" Love this expression!
@kerolesmonsef4179
3 years ago
this is a famous Egyptian phrase
@diegompinn
3 years ago
Keroles Monsef thank you so much for your comment! I really love this expression, and I find I really, really like it when he says it!
Wow! This is the first time that I see tough concepts like VC dimension exposed to undergraduates in such a comprehensible way. Thanks so much Prof. Abu-Mostafa! BTW, to understand the puzzle at the end of class, I strongly suggest you watch twice. Then you will really appreciate what Prof. Abu-Mostafa is trying to explain. His key idea was to show how the existence of a finite break point brings down the growth function (which could have been exponential, 2^N) to polynomial order.
Dear Prof. Abu-Mostafa, thank you. Thank you for such a clear and lucid explanation of these abstractions, so trumped up in other texts. Bless you, sir, for finally enabling me to really understand what I had simply given up hope on ever understanding. "It will kill the heck out of any polynomial..." Quips such as these often make me grin. Combined with how neatly I understand what you're saying, viewing these videos is simply a pleasurable experience.
Note that M was the total number of hypotheses. Mostafa argued next that in practice many of the hypotheses are extremely overlapping; for example, Hypothesis 1 would give almost exactly the same performance as Hypothesis 2, so we don't have to consider both of them. This is where he defined the concept of 'dichotomies' at 21:55, that is: "hypotheses that give significantly different results, i.e. hypotheses which are not very overlapping in performance".
@karanpatel1906
4 years ago
jjepsuomi nice explanation
@abhijeetsharma5715
3 years ago
To be mathematically precise, I think he meant that the "Bad-Events" for different hypotheses are overlapping because many hypotheses are infinitesimally close to each other. And since these events are very overlapping, we can say that the "union bound" is rather a very loose bound in this situation. And hence we try to find a stricter bound for the probability of the union of bad-events.
By far the best machine learning course I've ever seen, kudos to Prof. Yaser Abu-Mostafa.
Prof. Mostafa, I really really love your lectures. The coursework is really interesting and you’ve delivered it so unequivocally.
You’re the best! You explain concepts in a clear way! Thanks, Dr. Mostafa
At 52:50, if you look at the case of positive rays: when the number of inputs is 1, the growth function and the number of all possible dichotomies are equal, g(1) = 1 + 1 = 2 = 2^1. Therefore 1 is NOT a break point, because you CAN get all possible dichotomies; the cases are (red), (blue), and you can do this with positive rays. The break point is 2 for positive rays, because you can't get all possible dichotomies, that is, you can't get the case (blue, red): g(2) = 2 + 1 = 3 != 2^2 = 4.
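To make this concrete, here is a small Python sketch (the function name and sample points are my own) that enumerates every dichotomy a positive ray can produce on a set of points:

```python
def positive_ray_dichotomies(points):
    """All dichotomies h(x) = sign(x - a) can generate on the given points."""
    xs = sorted(points)
    # Candidate thresholds: left of all points, between neighbors, right of all.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > t else -1 for x in xs) for t in thresholds}

print(len(positive_ray_dichotomies([0.5, 1.7])))      # 3, while 2^2 = 4
print(len(positive_ray_dichotomies(list(range(7)))))  # 8 = N + 1 for N = 7
```

For N = 2 the set has only 3 dichotomies, and (+1, -1), a positive point to the left of a negative one, never appears, which is exactly why 2 is the break point.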
Finally, a tie and a shirt that match the suit !
@yqueweaaculiao
9 years ago
now we can sleep peacefully
@tobalaba
4 years ago
fuck that, u shallow one, focus on the knowledge.
The cases are: (red, red), (blue, blue), (red, blue), (blue, red), so we have 4 of them =) But with positive rays we cannot get all the possible dichotomies. We can get N + 1 of them, which is 2 + 1 = 3. We can get the cases (red, red), (blue, blue), (red, blue), but WE CANNOT get the case (blue, red) with positive rays. I suggest you look at 52:50 a few times =). The 'break point' k is the minimum number of inputs (x1, x2, ..., xk) for which you CANNOT get all the possible dichotomies.
OMG ... You are awesome... Such a great lecture... Thank you Caltech and Prof. Yaser Abu-Mostafa..
Thank you Caltech, and Prof. Yaser Abu-Mostafa in particular. It's good to be here.
There are (0, 0), (0, 1), (0, 0), (1, 0), (1, 1). Again the constraint is violated, so we must discard the dichotomy. I suggest you look at the example again when Mostafa explains it. The key thing is (as I understood it) that the break point will discard MOST of ALL POSSIBLE DICHOTOMIES, which makes the growth function polynomial in N, and therefore we get a good bound for the Hoeffding inequality and we can generalize, that is, learn =) I hope I helped you... hope I didn't confuse you even more x)
damn, this is the best course I've ever seen
It's called a "growth function" because the maximum number of dichotomies (in the case of 2 classes only, i.e. binary classification) is 2 to the Nth (where N is the number of input points), and 2 to the Nth is an exponential function. It is well known that exponential functions grow very quickly (with each small change in x, a vast change in y happens), so this is where the term "growth function" comes from.
Absolutely well done and definitely keep it up!!! 👍👍👍👍👍👍👍👍👍👍👍
This was a great lecture :) A nontrivial piece of the theory presented clearly and concisely, kudos for that :)
He is really awesome, I totally get the concept. Is there any way to get more examples that I can work through by myself for more understanding? Understanding a concept is one thing and applying it is something else.
best course for understanding linear ML models.
The question asked about the binary hypothesis used is an interesting one...
finally got it... thanks professor.
If you have difficulties understanding the puzzle, keep in mind that there is a PARTICULAR constraint in that puzzle (break point = 2); with no such constraint, the break point is 4, as in the previous example.
You are an honor to Egypt.
I don't understand the puzzle at the end even though it was explained twice. OK, so for x1, x2, x3, if the break point is 2, it means that we will not be able to get all dichotomies for the given x1, x2, x3, right? So why does a point like x1(white), x2(black), x3(black) fail? For this point why can't we get all dichotomies? Or why doesn't a point like x1(white), x2(white), x3(white) fail? I think I have some trouble understanding the concept of the puzzle itself, so it would be great if someone could help me.
@lucaswolf9445
7 years ago
Yeah, I found it difficult to get as well. As I understand it, each column (i.e. x1, x2, x3) represents one data point in our sample set. Each row represents a dichotomy, e.g. (black, black, white) classifies x1 and x2 as "YES" and x3 as "NO".

Since the break point is two, we know that for any pair of two points, e.g. (x1, x2) or (x2, x3), we cannot get every dichotomy on these points. As an example, consider x1 and x2. There are four possible dichotomies: (x1="YES", x2="YES"), (x1="YES", x2="NO"), (x1="NO", x2="YES") and (x1="NO", x2="NO"). But since the break point is two, we cannot get all of these dichotomies from our learner. And of course we also cannot get all four of these if we add another point, e.g. x3.

For the puzzle, we want to find a largest possible set of dichotomies on three points. Professor Abu-Mostafa therefore simply enumerates all the possible dichotomies and whenever he adds one, he checks every pair of two points. If for any pair all four dichotomies can be generated, we have a contradiction with our assumption that 2 was the break point, and hence the dichotomy could not have been generated. Sorry for the long read, hope it helped :)
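The checking procedure described in this thread can be run exhaustively on the 3-point puzzle. This sketch (my own illustration, not code from the lecture) searches every subset of the 8 dichotomies for the largest one in which no pair of points is shattered:

```python
from itertools import combinations, product

all_dichos = list(product([0, 1], repeat=3))  # the 2^3 = 8 dichotomies on 3 points

def respects_breakpoint_2(dicho_set):
    # Break point 2: no pair of points may exhibit all 2^2 = 4 label patterns.
    for i, j in combinations(range(3), 2):
        if len({(d[i], d[j]) for d in dicho_set}) == 4:
            return False
    return True

best = max((s for r in range(len(all_dichos) + 1)
            for s in combinations(all_dichos, r)
            if respects_breakpoint_2(s)), key=len)
print(len(best))  # 4 - the same maximum the lecture's puzzle arrives at
```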
He's giving solid answers in the Q&A-session. Nice job!
In case anyone wonders why there are a lot of different Hoeffding inequalities: I think it depends on the bound on the Xi. Theorem 1 of Wassily Hoeffding's paper states that for 0 <= Xi <= 1, P[nu - mu > eps] <= e^(-2*eps^2*N); the two-sided version used in the lecture is P[|nu - mu| > eps] <= 2*e^(-2*eps^2*N).
No probs =) Glad if it helped
Really great lecture.
Can someone recommend a source to quickly go through all the probability concepts used in this course?
Thanks for your explanations. helpful.
The growth function for the perceptron is N*(N-1)+2
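That closed form can be cross-checked against Cover's classical counting result for linear separators of N points in general position (background beyond this lecture; the function name is my own):

```python
from math import comb

def perceptron_growth(N, d=2):
    # Cover's count of linearly separable dichotomies of N points
    # in general position in d dimensions: 2 * sum_{i=0}^{d} C(N-1, i).
    return 2 * sum(comb(N - 1, i) for i in range(d + 1))

for N in range(1, 8):
    assert perceptron_growth(N) == N * (N - 1) + 2  # matches N*(N-1)+2 for d = 2
print(perceptron_growth(3), perceptron_growth(4))   # 8 14, as in the lecture
```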
Great lecture again!!
This means that we can learn, that is, generalize well =) He also says that he will prove this result in later lectures. The puzzle is about "how does the break point restrict the number of all possible inputs". Here you should watch 56:35. If your break point is 3 but you have 100 inputs (x1, x2, ..., x100), the break point constraint will eliminate most of the 2^100 possible dichotomies. In the puzzle at 57:25, he tells us that the break point constraint is 2, but we have 3 points.
@solsticetwo3476
5 years ago
jjepsuomi Great! But the break point was arbitrarily set to 2. If it were 50, the reduction in the number of possible dichotomies would be smaller. So the question is: what is the break point? It seems important to know whether one exists, and even its actual value.
Hi! I think I got the ideas, so I will try to explain them to you as I understood them =) The whole point of this lecture, as I understood it, was to get as small a probability bound as possible for the Hoeffding inequality in the training part at 8:55. That is, we wanted to replace the big capital M with a smaller number, because M could be infinitely big, since the hypothesis set can be infinite. The Hoeffding bound gives us a guarantee that we can generalize, that is, get a small out-of-sample error.
This is one of the best lectures! But I thought the puzzle was rushed; I didn't really follow it.
How do we decide on N or the number of points to be taken for calculating dichotomies?
In case you need more comprehensive lecture slides with extra examples and supplementary materials, you can have a look at the corresponding ML foundation course by Prof. Hsuan-Tien Lin (co-author of the book Learning from Data by Abu-Mostafa) from NTU. Link: www.csie.ntu.edu.tw/~htlin/mooc/ It has the same slides but with more information.
Now freeze the image at 58:57 and look at it =) The break point was 2, correct? This means we CANNOT get all 2^2 = 4 possible different dichotomies on any pair of points. Look at the columns x2, x3. There are (0,0), (0,1), (1,0), (1,1), so there are four of them! But this wasn't allowed! So we must discard the (1,1) case. Next look at 59:58. There are the cases (0,0), (0,0), (0,1), (1,0), (1,1). Again we have four DIFFERENT dichotomies there, so the constraint is violated. Lastly, look at 59:31, columns x1, x3.
Consider the "triangle" learning model, where h : R^2 -> {-1, +1} and h(x) = +1 if x lies within an arbitrarily chosen triangle in the plane and -1 otherwise. What is the largest number of points in R^2 (among the given choices) that can be shattered by this hypothesis set? Give me the solution.
The growth function is the maximum number of dichotomies the hypotheses can produce using N samples.
I am not a native English speaker and can say that the lecture is hard but very useful.
Tell me if I am right so far: 1. One dichotomy maps to one or more hypotheses; or equivalently, several hypotheses map to one dichotomy. That is, a configuration of size N has several equivalent solutions. Thus, for the maximum size of the set of dichotomies m, we can say that m
As the growth function counts the most number of dichotomies, why isn't mH(4) = 16? And considering the same explanation that you gave for mH(4), why isn't mH(3) = 7 then?
He next argued that the number of all possible dichotomies can be AT MOST 2^N. The growth function was defined as the maximum number of dichotomies with given inputs (x1, x2, ..., xN); note that this is not the same as the number of all possible dichotomies = 2^N. This might be a bit unclear, so I will use Mostafa's examples at 36:15 and 52:50. Notice that with the positive rays, when N = 2, that is we have inputs (x1, x2), the number of all possible dichotomies is 2^2 = 4, correct?
Hi, if you look at the page again I tried to explain the puzzle for user montintinmontintin in my comments (the last ones). Hope it helps =)
What do you mean by the negative exponential of e "completely killing off" the Hoeffding inequality's RHS?
@uditapatel778
5 years ago
Watch lecture 2. It means the value (probability) on the RHS becomes 1, and any probabilistic inequality of the form x
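To see numerically what "killing the heck out of any polynomial" means, here is a quick sketch (the epsilon value and polynomial degree are arbitrary choices of mine):

```python
from math import exp

def bound(N, eps=0.1, degree=10):
    # Shape of the bound: a polynomial in N times a negative exponential in N.
    return N**degree * exp(-2 * eps**2 * N)

for N in (100, 1000, 10000, 20000):
    print(N, bound(N))
# The product grows at first, then the exponential takes over and drives it
# toward 0, so the probability bound eventually becomes meaningful.
```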
I have a question: for positive rays, why is the break point 2? If you consider positive rays, we can get 3 dichotomies.
@movax20h
8 years ago
+Chitrang Talaviya Exactly, and 3 is less than 2^2 = 4, so k = 2 is a break point.
@nairouzmrabah7716
8 years ago
This is called a positive ray (red marbles (negative) to the left of blue marbles (positive)). Blue before red is an impossible case, and we only need two marbles to produce it, so the break point is 2. Unless I'm in error. Am I wrong?
I didn't understand the puzzle at all. Anyone who can help me get what's going on? Please!!!
I did not understand the break point concept. It is confusing.
Excellent lecture. How the puzzle relates to the growth function and break point concept is not clear at all.
27:23. How do we know for certain that the number of dichotomies is 2^N? I think it's because we have 2 outputs (hence the 2) and we have N inputs, hence we raise 2 to the power of N? That's the number of output combinations we could possibly have.
@saidameftah7545
5 years ago
Actually, it's because each of the N input points can be labeled +1 or -1 independently, so the number of possible dichotomies is the number of binary labelings of N points, which is exactly 2^N.
I understood the growth function upper limit (2^N)... but I have some doubt about how to define the growth function (the number of max possible dichotomies) for the perceptron when N = 4. Why is it not 2^4, if I must get the growth function over all possible configurations (the max)? Is there some link where I can read something about it?
@kingpirate5218
8 years ago
+Francesco D'Amore "At most 14 of the 16 possible dichotomies on ANY 4 points can be generated" (from the book Learning from Data). You can place the points any way you like, but there are always at least two missing. If you consider an unwise point distribution, you can miss even more than two. So the best you can obtain is missing 2 possibilities. Try to draw the possibilities to figure it out.
@fradamor
8 years ago
+King Pirate thank you. I have that book. I will see. Thank you for your suggestion.
1:13:00 "for 2 hypotheses that have the same dichotomy, the in-sample error is the same." Is it true in the case of MSE (Mean Squared Error)? I think not. It is true here because the selected metric for in-sample error was the "fraction of incorrect predictions". Am I understanding this correctly? Thank you
@chanle5989
4 years ago
I think he limited the sense of dichotomies and the following theory of generalization to only the binary classification problem. So MSE in this case would not be a good error measure :)
Considering the two-dimensional input space: if the data set has 2 points, doesn't the perceptron behave like a positive ray? In that way the break point for the 2D perceptron would be 2 rather than 4. But this is obviously wrong, because for 3 data points all 8 possible dichotomies exist, thereby breaking the rule "if there is a break point for a smaller data set, the break point is still there for the bigger data set". What am I missing here?
@heathjohn62
5 years ago
The positive ray can only point in one direction. Thus, the situation of (+ -) "breaks" the positive ray hypothesis set, whereas the perceptron can simply reorient itself to sort the two points. While the perceptron appears to be just a line, for every line that divides an input space there are actually two separate hypotheses: one where the data to the right of the line is +1 and that to the left is -1, and vice versa.
57:25 We have only 3 (actually 2) points.
Could someone explain what merging P(x) and P(y|x) means? What does P(x, y) try to capture?
@bibekgautam512
5 years ago
P(x) is the probability that, out of all the points in the input space, you pick x for training. P(y|x) is the distribution of the target y for a given x, which is what you'd like to learn to predict. P(x, y) is the joint probability that a data point has a particular x together with a particular y as its target value: P(x, y) = P(x) * P(y|x). At least that's what I understand; someone correct me if I'm wrong.
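A tiny sampling sketch may make the merge concrete; both distributions here are invented purely for illustration (a Gaussian input and a sigmoidal noisy target):

```python
import random
from math import exp

def sample_xy(rng):
    # Joint P(x, y) = P(x) * P(y|x): draw the input, then its noisy target.
    x = rng.gauss(0.0, 1.0)                  # x ~ P(x)
    p_plus = 1.0 / (1.0 + exp(-x))           # P(y = +1 | x)
    y = +1 if rng.random() < p_plus else -1  # y ~ P(y|x)
    return x, y

rng = random.Random(0)
pairs = [sample_xy(rng) for _ in range(1000)]
print(sum(1 for _, y in pairs if y == +1))   # roughly half are +1, by symmetry
```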
Yes
35:02 important point to grasp the logic imo
Sorry, I made a mistake in one of my comments. It was: "The puzzle is about 'how does the break point restrict the number of all possible INPUTS'". IT SHOULD BE: "The puzzle is about 'how does the break point restrict the number of all possible DICHOTOMIES'".
I have a problem with the definition of break point.
A genius at sharing joyfully.
awesome
What are the prerequisites of this course? I'm starting to think I'm ill-equipped to understand what he is saying.
@scwubbywubby
6 years ago
basic understanding of statistics and matrices. Computer science would be really helpful too but maybe not absolutely necessary
@theamazingjonad9716
6 years ago
algorithmic complexity, linear algebra, and multivariable calculus will also help you understand it.
@heathjohn62
5 years ago
Prerequisites are math 2 and computer science 2 at Caltech. Math 2 is a proof-heavy differential equations class, and requires multivariable calculus and linear algebra. CS2 is a programming class taught in c++ that covers topics like algorithmic complexity, dynamic programming, and data structures. Realistically, you want to have an understanding of matrix math and an intuition for the types of problems you can approach with a computer. I've been doing the homework in python using the numpy package.
Related to positive rays: won't that growth function be equal to 2(N+1)?
@solsticetwo3476
5 years ago
omkar chavan No because of the direction of the arrow.
Sorry by the way about my notation: by g(N) I meant the growth function, which Mostafa labeled as small m. Another way of defining the break point is as the smallest number k for which the growth function does not equal the number of all possible dichotomies, that is g(k) != 2^k. Lastly, he concluded that if we have a break point with the set of hypotheses, the growth function is POLYNOMIAL IN N, which is good news, because now we can bound the Hoeffding inequality by a small probability.
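The polynomial referred to here is the combinatorial cap proved later in the course; a minimal sketch of it (the function name is my own):

```python
from math import comb

def growth_bound(N, k):
    # If k is a break point, then m_H(N) <= sum_{i=0}^{k-1} C(N, i),
    # a polynomial of degree k - 1 in N rather than the exponential 2^N.
    return sum(comb(N, i) for i in range(k))

print([growth_bound(N, 2) for N in range(1, 6)])  # [2, 3, 4, 5, 6] = N + 1
print(growth_bound(3, 2))                         # 4: the puzzle's maximum
```

For break point 2 the bound is 1 + N, matching the positive rays' m_H(N) = N + 1, and for 3 points it gives 4, the maximum found in the puzzle.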
Greetings from ugr
Nice !!!
I did the puzzle a different way, more globally (I think), rather than the instructor's approach of making sure that at max only 3 cases for any 2 X's were ever admitted. Each unique set of counts is a dichotomy, and with a break point of 2, that means only 2 unique sets of counts for the circles and dots are allowed. Therefore:

      |  X's  | Counts  | Dichotomy
Case  | 1 2 3 | O's ●'s | Count
------+-------+---------+----------
  1   | O O O |  3   0  |    1
  2   | O O ● |  2   1  |    2
"Come to 256 and you'll be teaching engineering at caltech" Hunt for the Red October 1986
When the growth function of a hypothesis set is polynomial, how can we declare that learning is feasible using that hypothesis set?
@poojanpujara8643
4 years ago
See lecture-6 for that.
@vinayaditya6440
3 years ago
We obtain a reasonable bound (RHS of Hoeffding's Inequality), with the Polynomial multiplied to a negative exponential. This ensures that the out-of-sample error and the in-sample error track each other reasonably well.
17:05 that's where E_out changed
brilliance
In slide 7: why at most 2 to the power N?!!
@yakyuck
10 years ago
2^N is the size of what's known as the power set: the set of all possible subsets. For example, if N = {1, 2, 3}, then N has 3 elements, so 2^3 = 8. We can see this by listing all the possible subsets: Power_Set = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}, {empty_set}}. Each dichotomy corresponds to one subset (the points labeled +1), so 2^N is just the count of all possible labelings of the set given. Hope that helps.
This guy is so much more intuitive than Andrew Ng. Not to compare though; Andrew is a god amongst humans.
1:05:19 "So I'm getting it for free" lol
Wouldn't another solution for the puzzle be:
0 0 0
0 0 1
0 1 1
1 1 1
@Hasan-bz1go
7 years ago
No, because our goal was to find the max dichotomies we could get using the given constraint.
@timothykennelljr.3725
7 years ago
I disagree with @Hasan Hameed and agree with @tompake. The proposed solution

0 0 0
0 0 1
0 1 1
1 1 1

would be an answer, and I believe it is the solution one would get with the positive rays example. I think the key of the puzzle is just to show that a break point limits the total number of dichotomies. When going through the puzzle, the validity of the next pattern of three circles depends only on the previous patterns chosen. Provided that the next pattern does not violate the premise that choosing any 2 circles will be missing one of the possibilities from (0, 0), (0, 1), (1, 0), (1, 1), then it is a valid pattern. Therefore, based on the starting point and future choices from the starting point, there are many possible solutions. The result of having fewer dichotomies (the same count for a given break point and number of data points) will still be the same. Another solution is below for viewing pleasure :D

1 1 1
0 0 0
1 0 1
1 0 0
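Both 4-row proposals in this thread can be checked mechanically; this sketch (the variable names are my own labels for the two proposals) verifies that neither shatters any pair of points:

```python
from itertools import combinations

def respects_breakpoint_2(rows):
    # No pair of columns may show all four patterns (0,0), (0,1), (1,0), (1,1).
    n = len(rows[0])
    return all(len({(r[i], r[j]) for r in rows}) < 4
               for i, j in combinations(range(n), 2))

proposal_a = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]  # the positive-ray-like set
proposal_b = [(1, 1, 1), (0, 0, 0), (1, 0, 1), (1, 0, 0)]  # the extra set above
print(respects_breakpoint_2(proposal_a), respects_breakpoint_2(proposal_b))  # True True
```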
@theamazingjonad9716
6 years ago
@Timothy is right, your answer can also be a solution to the puzzle.
@-long-
4 years ago
it is but you cannot get more than 4 rows, that's the point