Lecture 05 - Training Versus Testing
Science & Technology
Training versus Testing - The difference between training and testing in mathematical terms. What makes a learning model able to generalize? Lecture 5 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - itunes.apple.com/us/course/ma... and on the course website - work.caltech.edu/telecourse.html
Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, creativecommons.org/licenses/b...
This lecture was recorded on April 17, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.
Comments: 116
I finally understood concepts that were "taught" in my course. Prof. Abu-Mostafa - you're a star!
Video is from April 17, 2012…10 years ago. These videos are still great. I would love to see either current ones, or even better, another advanced class.
@sendhan6454
2 years ago
same.
It clearly shows from his teaching that Prof. Abu Mostafa is in love with the subject. The way he teaches it, having fun in the process lets the student have fun too and in turn fall in love with the subject !
@amodamatya
2 years ago
True that
This Prof is the best in ML and Data Science. I have watched all the touted lectures from fancy schools, and none matches this.
Till now I have not seen machine learning lectures better than these. In fact he is very good at choosing his words.
@user-os9zl5gu5h
9 years ago
Wazih Ahmad I totally agree with you
He explains the topics so passionately. I have a smile on my face through the courses 😊 Thank you so much for this amazing course.
I have not seen a better professor than Prof. Yaser! He explains concepts with such ease and with so much depth! Simply awesome :) Thanks to Caltech for this course!
Puzzle: without any constraint, the 3-tuple (x1, x2, x3) could have at most 2^3 = 8 dichotomies. Now add a constraint: no pair, e.g. (x1, x2), can have more than 3 dichotomies. Say the 3 allowed dichotomies on a pair are (0, 0), (0, 1), (1, 0), dropping (1, 1). Applying this to every pair, the 3-tuple (x1, x2, x3) can have at most 4 dichotomies:
(0, 0, 0) - valid
(0, 0, 1) - valid
(0, 1, 0) - valid
(0, 1, 1) - invalid: it would give (x2, x3) 4 dichotomies
(1, 0, 0) - valid
(1, 0, 1) - invalid: it would give (x1, x3) 4 dichotomies
(1, 1, 0) - invalid: it would give (x1, x2) 4 dichotomies
(1, 1, 1) - invalid: it would give (x1, x2) 4 dichotomies
Best lectures in the Milky Way Galaxy...! Love Him.
"We are in business!!" Love this expression!
@kerolesmonsef4179
3 years ago
this is a famous Egyptian phrase
@diegompinn
3 years ago
Keroles Monsef thank you so much for your comment! I really love this expression, and I find I really, really like it when he says it!
Wow! This is the first time that I see tough concepts like VC dimension exposed to undergraduates in such a comprehensible way. Thanks so much Prof. Abu-Mostafa! BTW, to understand the puzzle at the end of class, I strongly suggest you watch twice. Then you will really appreciate what Prof. Abu-Mostafa is trying to explain. His key idea was to show how the existence of a finite break point brings down the growth function (which could have been exponential, 2^N) to polynomial order.
Dear Prof. Abu-Mostafa, thank you. Thank you for such a clear and lucid explanation of these abstractions, so trumped up in other texts. Bless you, sir, for finally enabling me to really understand what I had simply given up hope on ever understanding. "It will kill the heck out of any polynomial..." Quips such as these often make me grin. Combined with how neatly I understand what you're saying, viewing these videos is simply a pleasurable experience.
Note that M was the total number of hypotheses. Mostafa argued next that in practice many of the hypotheses are extremely overlapping; for example, Hypothesis 1 would give almost exactly the same performance as Hypothesis 2, so we don't have to consider both of them. This is where he defined the concept of 'dichotomies' at 21:55, that is: "hypotheses that give significantly different results, i.e. hypotheses which are not very overlapping in performance".
@karanpatel1906
4 years ago
jjepsuomi nice explanation
@abhijeetsharma5715
3 years ago
To be mathematically precise, I think he meant that the "Bad-Events" for different hypotheses are overlapping because many hypotheses are infinitesimally close to each other. And since these events are very overlapping, we can say that the "union bound" is rather a very loose bound in this situation. And hence we try to find a stricter bound for the probability of the union of bad-events.
By far the best machine learning course I've ever seen, kudos to Prof. Yaser Abu-Mostafa.
Prof. Mostafa, I really really love your lectures. The coursework is really interesting and you’ve delivered it so unequivocally.
You’re the best! You explain concepts in a clear way! Thanks, Dr. Mostafa
At 52:50, if you look at the case of positive rays: when the number of inputs is 1, the growth function and the number of all possible dichotomies are equal, g(1) = 1 + 1 = 2 = 2^1. Therefore 1 is NOT a break point, because you CAN get all possible dichotomies; the cases are (red), (blue), and you can do this with positive rays. The break point is 2 for positive rays, because you can't get all possible dichotomies, that is, you can't get the case (blue, red): g(2) = 2 + 1 = 3 != 2^2 = 4.
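To make this concrete, here is a small Python sketch (the function name and sample points are my own) that enumerates every dichotomy a positive ray can produce on a set of points:

```python
def positive_ray_dichotomies(points):
    """All dichotomies h(x) = sign(x - a) can generate on the given points."""
    xs = sorted(points)
    # Candidate thresholds: left of all points, between neighbors, right of all.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > t else -1 for x in xs) for t in thresholds}

print(len(positive_ray_dichotomies([0.5, 1.7])))      # 3, while 2^2 = 4
print(len(positive_ray_dichotomies(list(range(7)))))  # 8 = N + 1 for N = 7
```

For N = 2 the set has only 3 dichotomies, and (+1, -1), a positive point to the left of a negative one, never appears, which is exactly why 2 is the break point.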
Finally, a tie and a shirt that match the suit !
@yqueweaaculiao
9 years ago
now we can sleep peacefully
@tobalaba
4 years ago
fuck that, u shallow one, focus on the knowledge.
The cases are: (red, red), (blue, blue), (red, blue), (blue, red), so we have 4 of them =) But with positive rays we cannot get all the possible dichotomies. We can get N + 1 of them, which is 2 + 1 = 3. We can get the cases (red, red), (blue, blue), (red, blue), but WE CANNOT get the case (blue, red) with positive rays. I suggest you look at 52:50 a few times =). The 'break point' k is the minimum number of inputs (x1, x2, ..., xk) for which you CANNOT get all the possible dichotomies.
OMG ... You are awesome... Such a great lecture... Thank you Caltech and Prof. Yaser Abu-Mostafa..
Thank you Caltech, and Prof. Yaser Abu-Mostafa in particular. It's good to be here.
There are (0, 0), (0, 1), (0, 0), (1, 0), (1, 1). Again the constraint is violated, so we must discard the dichotomy. I suggest you look at the example again when Mostafa explains it. The key thing is (as I understood it) that the break point will discard MOST of ALL POSSIBLE DICHOTOMIES, which makes the growth function polynomial in N, and therefore we get a good bound for the Hoeffding inequality and we can generalize, that is, learn =) I hope I helped you... hope I didn't confuse you even more x)
damn, this is the best course I've ever seen
It's called a "growth function" because the maximum number of dichotomies (in the case of 2 classes only, i.e. binary classification) is 2 to the Nth (where N is the number of input points), and 2 to the Nth is an exponential function. It is well known that exponential functions grow very quickly (with each small change in x, a vast change in y happens), so this is where the term "growth function" comes from.
Absolutely well done and definitely keep it up!!! 👍👍👍👍👍👍👍👍👍👍👍
This was a great lecture :) A nontrivial piece of the theory presented clearly and concisely, kudos for that :)
He is really awesome, I totally get the concept. Is there any way to get more examples that I can work through by myself for more understanding? Understanding a concept is one thing and applying it is something else.
best course for understanding linear ML models.
The question asked about the binary hypothesis used is an interesting one...
finally got it... thanks professor.
If you have difficulties understanding the puzzle, keep in mind that there is a PARTICULAR constraint in that puzzle (break point = 2); with no such constraint, the break point is 4, as in the previous example.
You are an honor to Egypt.
I don't understand the puzzle at the end even though it was explained twice. OK, so for x1, x2, x3, if the break point is 2, it means that we will not be able to get all dichotomies for the given x1, x2, x3, right? So why does a point like x1(white), x2(black), x3(black) fail? For this point why can't we get all dichotomies? Or why doesn't a point like x1(white), x2(white), x3(white) fail? I think I have some trouble understanding the concept of the puzzle itself, so it would be great if someone could help me.
@lucaswolf9445
7 years ago
Yeah, I found it difficult to get as well. As I understand it, each column (i.e. x1, x2, x3) represents one data point in our sample set. Each row represents a dichotomy, e.g. (black, black, white) classifies x1 and x2 as "YES" and x3 as "NO".

Since the break point is two, we know that for any pair of two points, e.g. (x1, x2) or (x2, x3), we cannot get every dichotomy on these points. As an example, consider x1 and x2. There are four possible dichotomies: (x1="YES", x2="YES"), (x1="YES", x2="NO"), (x1="NO", x2="YES") and (x1="NO", x2="NO"). But since the break point is two, we cannot get all of these dichotomies from our learner. And of course we also cannot get all four of these if we add another point, e.g. x3.

For the puzzle, we want to find a largest possible set of dichotomies on three points. Professor Abu-Mostafa therefore simply enumerates all the possible dichotomies and whenever he adds one, he checks every pair of two points. If for any pair all four dichotomies can be generated, we have a contradiction with our assumption that 2 was the break point, and hence the dichotomy could not have been generated. Sorry for the long read, hope it helped :)
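The checking procedure described in this thread can be run exhaustively on the 3-point puzzle. This sketch (my own illustration, not code from the lecture) searches every subset of the 8 dichotomies for the largest one in which no pair of points is shattered:

```python
from itertools import combinations, product

all_dichos = list(product([0, 1], repeat=3))  # the 2^3 = 8 dichotomies on 3 points

def respects_breakpoint_2(dicho_set):
    # Break point 2: no pair of points may exhibit all 2^2 = 4 label patterns.
    for i, j in combinations(range(3), 2):
        if len({(d[i], d[j]) for d in dicho_set}) == 4:
            return False
    return True

best = max((s for r in range(len(all_dichos) + 1)
            for s in combinations(all_dichos, r)
            if respects_breakpoint_2(s)), key=len)
print(len(best))  # 4 - the same maximum the lecture's puzzle arrives at
```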
He's giving solid answers in the Q&A-session. Nice job!
In case anyone wonders why there are a lot of different Hoeffding inequalities: I think it depends on the bound on the Xi. Theorem 1 of Wassily Hoeffding's paper states that for 0 <= Xi <= 1, P[nu - mu > eps] <= e^(-2*eps^2*N); the two-sided version used in the lecture is P[|nu - mu| > eps] <= 2*e^(-2*eps^2*N).
No probs =) Glad if it helped
Really great lecture.
Can someone recommend a source to quickly go through all the probability concepts used in this course?
Thanks for your explanations. helpful.
The growth function for the perceptron is N*(N-1)+2
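That closed form can be cross-checked against Cover's classical counting result for linear separators of N points in general position (background beyond this lecture; the function name is my own):

```python
from math import comb

def perceptron_growth(N, d=2):
    # Cover's count of linearly separable dichotomies of N points
    # in general position in d dimensions: 2 * sum_{i=0}^{d} C(N-1, i).
    return 2 * sum(comb(N - 1, i) for i in range(d + 1))

for N in range(1, 8):
    assert perceptron_growth(N) == N * (N - 1) + 2  # matches N*(N-1)+2 for d = 2
print(perceptron_growth(3), perceptron_growth(4))   # 8 14, as in the lecture
```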
Great lecture again!!
This means that we can learn, that is, generalize well =) He also says that he will prove this result in later lectures. The puzzle is about "how does the break point restrict the number of all possible inputs". Here you should watch 56:35. If your break point is 3 but you have 100 inputs (x1, x2, ..., x100), the break point constraint will eliminate most of the 2^100 possible dichotomies. In the puzzle at 57:25, he tells us that the break point constraint is 2, but we have 3 points.
@solsticetwo3476
5 years ago
jjepsuomi Great! But the break point was arbitrarily set to 2. If it were 50, the reduction in the number of possible dichotomies would be smaller. So the question is: what is the break point? It seems important to know whether one exists, and even its actual value.
Hi! I think I got the ideas, so I will try to explain them to you as I understood them =) The whole point of this lecture, as I understood it, was to get as small a probability bound as possible for the Hoeffding inequality in the training part at 8:55. That is, we wanted to replace the big capital M with a smaller number, because M could be infinitely big, since the hypothesis set can be infinite. The Hoeffding bound gives us a guarantee that we can generalize, that is, get a small out-of-sample error.
This is one of the best lectures! But I thought the puzzle was rushed; I didn't really follow it.
How do we decide on N or the number of points to be taken for calculating dichotomies?
In case you need more comprehensive lecture slides with extra examples and supplementary materials, you can have a look at the corresponding ML foundation course by Prof. Hsuan-Tien Lin (co-author of the book Learning from Data by Abu-Mostafa) from NTU. Link: www.csie.ntu.edu.tw/~htlin/mooc/ It has the same slides but with more information.
Now freeze the image at 58:57 and look at it =) The break point was 2, correct? This means we CANNOT get all 2^2 = 4 possible different dichotomies on any pair of points. Look at the columns x2, x3. There are (0,0), (0,1), (1,0), (1,1), so there are four of them! But this wasn't allowed! So we must discard the (1,1) case. Next look at 59:58. There are the cases (0,0), (0,0), (0,1), (1,0), (1,1). Again we have four DIFFERENT dichotomies there, so the constraint is violated. Lastly, look at 59:31, columns x1, x3.
Consider the "triangle" learning model, where h : R^2 -> {-1, +1} and h(x) = +1 if x lies within an arbitrarily chosen triangle in the plane and -1 otherwise. What is the largest number of points in R^2 (among the given choices) that can be shattered by this hypothesis set? Give me the solution.
The growth function is the maximum number of dichotomies the hypotheses can produce using N samples.
I am not a native English speaker and can say that the lecture is hard but very useful.
Tell me if I am right so far: 1. One dichotomy maps to one or more hypotheses; or equivalently, several hypotheses map to one dichotomy. That is, a configuration of size N has several equivalent solutions. Thus, for the maximum size of the set of dichotomies m, we can say that m
As the growth function counts the most number of dichotomies, why isn't mH(4) = 16? And considering the same explanation that you gave for mH(4), why isn't mH(3) = 7 then?
He next argued that the number of all possible dichotomies can be AT MOST 2^N. The growth function was defined as the maximum number of dichotomies with given inputs (x1, x2, ..., xN); note that this is not the same as the number of all possible dichotomies = 2^N. This might be a bit unclear, so I will use Mostafa's examples at 36:15 and 52:50. Notice that with the positive rays, when N = 2, that is we have inputs (x1, x2), the number of all possible dichotomies is 2^2 = 4, correct?
Hi, if you look at the page again I tried to explain the puzzle for user montintinmontintin in my comments (the last ones). Hope it helps =)
What do you mean by the negative exponential of e "completely killing off" the Hoeffding inequality's RHS?
@uditapatel778
5 years ago
Watch lecture 2. It means the value (probability) on the RHS becomes 1, and any probabilistic inequality of the form x
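To see numerically what "killing the heck out of any polynomial" means, here is a quick sketch (the epsilon value and polynomial degree are arbitrary choices of mine):

```python
from math import exp

def bound(N, eps=0.1, degree=10):
    # Shape of the bound: a polynomial in N times a negative exponential in N.
    return N**degree * exp(-2 * eps**2 * N)

for N in (100, 1000, 10000, 20000):
    print(N, bound(N))
# The product grows at first, then the exponential takes over and drives it
# toward 0, so the probability bound eventually becomes meaningful.
```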
I have a question: for positive rays, why is the break point 2? If you consider positive rays, we can get 3 dichotomies.
@movax20h
8 years ago
+Chitrang Talaviya Exactly, and 3 is less than 2^2 = 4, so k = 2 is a break point.
@nairouzmrabah7716
8 years ago
This is called a positive ray (red marbles (negative) to the left of blue marbles (positive)). Blue before red is an impossible case, and we only need two marbles to produce it, so the break point is 2. Unless I'm in error. Am I wrong?
I didn't understand the puzzle at all. Anyone who can help me get what's going on? Please!!!
I did not understand the break point concept. It is confusing.
Excellent lecture. How the puzzle relates to the growth function and break point concept is not clear at all.
27:23. How do we know for certain that the number of dichotomies is 2^N? I think it's because we have 2 outputs (hence the 2) and we have N inputs, hence we raise 2 to the power of N? That's the number of output combinations we could possibly have.
@saidameftah7545
5 years ago
Actually, it's because each of the N input points can be labeled +1 or -1 independently, so the number of possible dichotomies is the number of binary labelings of N points, which is exactly 2^N.
I understood the growth function upper limit (2^N)... but I have some doubt about how to define the growth function (the number of max possible dichotomies) for the perceptron when N = 4. Why is it not 2^4, if I must get the growth function over all possible configurations (the max)? Is there some link where I can read something about it?
@kingpirate5218
8 years ago
+Francesco D'Amore "At most 14 of the 16 possible dichotomies on ANY 4 points can be generated" (from the book Learning from Data). You can place the points any way you like, but there are always at least two missing. If you consider an unwise point distribution, you can miss even more than two. So the best you can obtain is missing 2 possibilities. Try to draw the possibilities to figure it out.
@fradamor
8 years ago
+King Pirate thank you. I have that book. I will see. Thank you for your suggestion.
1:13:00 "for 2 hypotheses that have the same dichotomy, the in-sample error is the same." Is it true in the case of MSE (Mean Squared Error)? I think not. It is true here because the selected metric for in-sample error was the "fraction of incorrect predictions". Am I understanding this correctly? Thank you
@chanle5989
4 years ago
I think he limited the sense of dichotomies and the following theory of generalization to only the binary classification problem. So MSE in this case would not be a good error measure :)
Considering the two-dimensional input space: if the data set has 2 points, doesn't the perceptron behave like a positive ray? In that way the break point for the 2D perceptron would be 2 rather than 4. But this is obviously wrong, because for 3 data points all 8 possible dichotomies exist, thereby breaking the rule "if there is a break point for a smaller data set, the break point is still there for the bigger data set". What am I missing here?
@heathjohn62
5 years ago
The positive ray can only point in one direction. Thus, the situation of (+ -) "breaks" the positive ray hypothesis set, whereas the perceptron can simply reorient itself to sort the two points. While the perceptron appears to be just a line, for every line that divides an input space there are actually two separate hypotheses: one where the data to the right of the line is +1 and that to the left is -1, and vice versa.
57:25 We have only 3 (actually 2) points.
Could someone explain what merging P(x) and P(y|x) means? What does P(x, y) try to capture?
@bibekgautam512
5 years ago
P(x) is the probability that, out of all the points in the input space, you pick x for training. P(y|x) is the distribution of the target y for a given x, which is what you'd like to learn to predict. P(x, y) is the joint probability that a data point has a particular x together with a particular y as its target value: P(x, y) = P(x) * P(y|x). At least that's what I understand; someone correct me if I'm wrong.
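A tiny sampling sketch may make the merge concrete; both distributions here are invented purely for illustration (a Gaussian input and a sigmoidal noisy target):

```python
import random
from math import exp

def sample_xy(rng):
    # Joint P(x, y) = P(x) * P(y|x): draw the input, then its noisy target.
    x = rng.gauss(0.0, 1.0)                  # x ~ P(x)
    p_plus = 1.0 / (1.0 + exp(-x))           # P(y = +1 | x)
    y = +1 if rng.random() < p_plus else -1  # y ~ P(y|x)
    return x, y

rng = random.Random(0)
pairs = [sample_xy(rng) for _ in range(1000)]
print(sum(1 for _, y in pairs if y == +1))   # roughly half are +1, by symmetry
```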
Yes
35:02 important point to grasp the logic imo
Sorry, I made a mistake in one of my comments. It was: "The puzzle is about 'how does the break point restrict the number of all possible INPUTS'". IT SHOULD BE: "The puzzle is about 'how does the break point restrict the number of all possible DICHOTOMIES'".
I have a problem with the definition of break point.
A genius at sharing joyfully.
awesome
What are the prerequisites of this course? I'm starting to think I'm ill-equipped to understand what he is saying.
@scwubbywubby
6 years ago
basic understanding of statistics and matrices. Computer science would be really helpful too but maybe not absolutely necessary
@theamazingjonad9716
6 years ago
algorithmic complexity, linear algebra, and multivariable calculus will also help you understand it.
@heathjohn62
5 years ago
Prerequisites are math 2 and computer science 2 at Caltech. Math 2 is a proof-heavy differential equations class, and requires multivariable calculus and linear algebra. CS2 is a programming class taught in c++ that covers topics like algorithmic complexity, dynamic programming, and data structures. Realistically, you want to have an understanding of matrix math and an intuition for the types of problems you can approach with a computer. I've been doing the homework in python using the numpy package.
Related to positive rays: won't that growth function be equal to 2(N+1)?
@solsticetwo3476
5 years ago
omkar chavan No because of the direction of the arrow.
Sorry by the way about my notation: by g(N) I meant the growth function, which Mostafa labeled as small m. Another way of defining the break point is as the smallest number k for which the growth function does not equal the number of all possible dichotomies, that is g(k) != 2^k. Lastly, he concluded that if we have a break point with the set of hypotheses, the growth function is POLYNOMIAL IN N, which is good news, because now we can bound the Hoeffding inequality by a small probability.
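The polynomial referred to here is the combinatorial cap proved later in the course; a minimal sketch of it (the function name is my own):

```python
from math import comb

def growth_bound(N, k):
    # If k is a break point, then m_H(N) <= sum_{i=0}^{k-1} C(N, i),
    # a polynomial of degree k - 1 in N rather than the exponential 2^N.
    return sum(comb(N, i) for i in range(k))

print([growth_bound(N, 2) for N in range(1, 6)])  # [2, 3, 4, 5, 6] = N + 1
print(growth_bound(3, 2))                         # 4: the puzzle's maximum
```

For break point 2 the bound is 1 + N, matching the positive rays' m_H(N) = N + 1, and for 3 points it gives 4, the maximum found in the puzzle.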
Greetings from ugr
Nice !!!
I did the puzzle a different way, more globally (I think), rather than the instructor's approach of making sure that at max only 3 cases for any 2 X's were ever admitted. Each unique set of counts is a dichotomy, and with a break point of 2, that means only 2 unique sets of counts for the circles and dots are allowed. Therefore:

      |  X's  | Counts  | Dichotomy
Case  | 1 2 3 | O's ●'s | Count
------+-------+---------+----------
  1   | O O O |  3   0  |    1
  2   | O O ● |  2   1  |    2
"Come to 256 and you'll be teaching engineering at caltech" Hunt for the Red October 1986
When the growth function of a hypothesis set is polynomial, how can we declare that learning is feasible using that hypothesis set?
@poojanpujara8643
4 years ago
See lecture-6 for that.
@vinayaditya6440
3 years ago
We obtain a reasonable bound (RHS of Hoeffding's Inequality), with the Polynomial multiplied to a negative exponential. This ensures that the out-of-sample error and the in-sample error track each other reasonably well.
17:05 that's where E_out changed
brilliance
In slide 7: why at most 2 to the power N?!!
@yakyuck
10 years ago
2^N is the size of what's known as the power set: the set of all possible subsets. For example, if N = {1, 2, 3}, then N has 3 elements, so 2^3 = 8. We can see this by listing all the possible subsets: Power_Set = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}, {empty_set}}. Each dichotomy corresponds to one subset (the points labeled +1), so 2^N is just the count of all possible labelings of the set given. Hope that helps.
This guy is so much more intuitive than Andrew Ng. Not to compare though; Andrew is a god amongst humans.
1:05:19 "So I'm getting it for free" lol
Wouldn't another solution for the puzzle be:
0 0 0
0 0 1
0 1 1
1 1 1
@Hasan-bz1go
7 years ago
No, because our goal was to find the max dichotomies we could get using the given constraint.
@timothykennelljr.3725
7 years ago
I disagree with @Hasan Hameed and agree with @tompake. The proposed solution

0 0 0
0 0 1
0 1 1
1 1 1

would be an answer, and I believe it is the solution one would get with the positive rays example. I think the key of the puzzle is just to show that a break point limits the total number of dichotomies. When going through the puzzle, the validity of the next pattern of three circles depends only on the previous patterns chosen. Provided that the next pattern does not violate the premise that choosing any 2 circles will be missing one of the possibilities from (0, 0), (0, 1), (1, 0), (1, 1), then it is a valid pattern. Therefore, based on the starting point and future choices from the starting point, there are many possible solutions. The result of having fewer dichotomies (the same count for a given break point and number of data points) will still be the same. Another solution is below for viewing pleasure :D

1 1 1
0 0 0
1 0 1
1 0 0
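Both 4-row proposals in this thread can be checked mechanically; this sketch (the variable names are my own labels for the two proposals) verifies that neither shatters any pair of points:

```python
from itertools import combinations

def respects_breakpoint_2(rows):
    # No pair of columns may show all four patterns (0,0), (0,1), (1,0), (1,1).
    n = len(rows[0])
    return all(len({(r[i], r[j]) for r in rows}) < 4
               for i, j in combinations(range(n), 2))

proposal_a = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]  # the positive-ray-like set
proposal_b = [(1, 1, 1), (0, 0, 0), (1, 0, 1), (1, 0, 0)]  # the extra set above
print(respects_breakpoint_2(proposal_a), respects_breakpoint_2(proposal_b))  # True True
```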
@theamazingjonad9716
6 years ago
@Timothy is right, your answer can also be a solution to the puzzle.
@-long-
4 years ago
it is but you cannot get more than 4 rows, that's the point