Machine Learning Tutorial at Imperial College London: Gaussian Processes. Richard Turner (University of Cambridge), November 23, 2016
Comments: 69
@zhou7yuan · 3 years ago
Motivation: non-linear regression [1:00]
Gaussian distribution [3:09]
conditioning [5:55]
sampling [7:28]
New visualization [8:51]
New visualization: dimension ×5 [10:54], dimension ×20 [13:06]
Regression using Gaussians [15:08] (conditioning on 4 non-contiguous points) [16:17]
Regression: probabilistic inference in function space [19:09]
Non-parametric (∞-parametric) vs. parametric models [20:08] (hyper-parameters explained) [23:02]
Mathematical Foundations: Definition [24:08]
Mathematical Foundations: Regression [30:48]
Mathematical Foundations: Marginalisation [34:02]
Mathematical Foundations: Prediction [36:29]
What effect do the hyper-parameters have? [41:40]
- short horizontal length-scale [41:58][42:21]
- long horizontal length-scale [42:30][42:41]
- [42:58] l sets the horizontal length-scale; \sigma^2 controls the vertical scale of the data
Higher-dimensional input spaces [44:06]
What effect does the form of the covariance function have? [45:20]
- Laplacian covariance function |x1-x2| [46:16]
- Rational Quadratic [46:32]
- Periodic [46:55]
The covariance function has a large effect [48:12]
Bayesian model comparison (too sensitive to priors) [48:49]
Scaling Gaussian Processes to Large Datasets [56:04]
Motivation: Gaussian Process Regression [56:08]; O(N^3) cost [57:15]
Idea: summarise the dataset by a small number (M) of pseudo-data [58:38]
A Brief History of Gaussian Process Approximations [1:02:01]
- approximate generative model, exact inference (simpler model) [1:02:20]
- pseudo-data [1:03:11]: FITC, PITC, DTC (generate pseudo-data; elsewhere data are independent, i.e. connections broken)
- A Unifying View of Sparse Approximate Gaussian Process Regression (2005) [1:04:12] (problem with this approach) [1:04:31]
- exact generative model, approximate inference [1:05:59]: VFE, EP, PP [1:06:27]
- A Unifying View for Sparse Gaussian Process Approximation using ... (2016) [1:07:10]
EP pseudo-point approximation [1:07:45]
EP algorithm [1:15:27]
Fixed points of EP = FITC approximation [1:23:33]
Power EP algorithm (as tractable as EP) [1:25:05]
Power EP: a unifying framework [1:25:56]
How should I set the power parameter α? [1:27:19]
Deep Gaussian Processes for Regression [1:34:34]
Pros and cons of Gaussian Process Regression [1:34:35]
From Gaussian Processes to Deep Gaussian Processes [1:38:26]
Deep Gaussian Processes [1:41:53]
Approximate inference for (Deep) Gaussian Processes [1:42:09]
Experiment: value function of the mountain car problem [1:42:31]
Experiment: comparison to Bayesian neural networks [1:44:15]
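The regression recipe outlined above (define a joint Gaussian over function values, then condition on the observed points, 15:08-36:29) can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the lecture's own code; the data, the `se_kernel` helper name, and the hyper-parameter values are all made up:

```python
import numpy as np

def se_kernel(x1, x2, l=1.0, sigma2=1.0):
    # Squared-exponential covariance between two sets of scalar inputs.
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (d / l) ** 2)

# Observed (training) points and query (test) points.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.sin(x)
xs = np.linspace(-3.0, 3.0, 50)

noise = 1e-6  # small jitter on the diagonal for numerical stability
K = se_kernel(x, x) + noise * np.eye(len(x))
Ks = se_kernel(xs, x)
Kss = se_kernel(xs, xs)

# Posterior mean and covariance of the GP conditioned on (x, y).
mean = Ks @ np.linalg.solve(K, y)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
```

The posterior mean passes (almost exactly, up to the jitter) through the observed points, and the posterior variance collapses there and grows away from the data, which is exactly what the visualisations in the talk show.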
@dewinmoonl · 5 years ago
one of the best GP explanations. People have gotten me lost horribly with "too much math" without properly motivating the problems to begin with. This explanation is to the point, and the math is exactly the same in the end, just presented in a much better way.
@priyamdey3298
3 years ago
absolutely! The motivation couldn't have been any better, to say the least.
@ncsquirll · 6 years ago
really great video. one of the best GP explanations on the web.
@Benedetissimo · 5 years ago
The inherent beauty of Gaussian Processes, as well as the clarity of the explanation left me utterly impressed. Thank you so much for uploading!
@Tobaman111 · 3 years ago
I've come back to this for years. The visualization in the beginning is always a ray of light. Excellent.
@Vikram-wx4hg · 1 year ago
Super tutorial! Only wish: I wish I could see what Richard is pointing to when he is discussing a slide.
@balalaika678 · 3 years ago
Best source I could find on YouTube, very clear and precise explanations! After this, the equations in a book are much easier to understand!
@heyjianjing · 1 year ago
By far the best introduction to GP, thank you Prof. Turner!
@IslamEldifrawi · 2 years ago
This is the best GP explanation I have seen till now. Great job!!!
@ponyta7 · 5 years ago
Wonderful video, deeply thank you for this. From Seoul.
@michaelwangCH · 2 years ago
I listened to lots of explanations of Gaussian processes in lecture halls during my studies; your demo is the best one I ever saw. Thanks, Richard.
@johnkrumm9653 · 4 years ago
Wow, that was a great explanation of GPs! Thank you for making it so clear. You should tour around giving this lecture in huge stadiums. I'd buy the t-shirt! :-)
@airindutta1094 · 2 years ago
Best GP visualization and explanation I have ever seen.
@0929zhurong · 2 years ago
The best GP explanation, amazingly done
@Ivan-td7kb · 5 years ago
Incredible explanation!
@saikabhagat · 4 years ago
absolutely amazing! Thank you!
@tumitran · 4 years ago
So nice that they give credits to the earlier paper.
@ethantao9249 · 4 years ago
super clear explanation. Thank you so much!
@julianocamargo6674 · 2 years ago
Brilliant presentation, thanks!
@niveyoga3242 · 4 years ago
Awesome explanation!
@mario7501 · 3 years ago
I wish I had found this video earlier. It took me coding up an example similar to yours, using the equations myself, to get an intuition of what's going on.
@yode8
3 years ago
Any advice, resources, or papers? I feel like I generally understood what was happening in the video, but not everything, for example some of the covariance function equations, and also the EP example where he mentioned KL divergence. I am beginning to understand GPs for my dissertation, but some of the notation and literature is hard to understand. Thanks
@sathya_official3843 · 2 years ago
Awesome! Totally worth the time
@BassJournal · 3 years ago
HOLY SHIET! This was an amazing lesson. Mindblowing
@vmt4gator · 4 years ago
great class. Thank you very much
@TheAIEpiphany · 3 years ago
It'd be nice to hear about some real-world applications of (deep) GPs. We saw their performance on toy datasets compared to similarly-sized NNs. If you threw in bigger NNs, I'd assume they'd improve quite trivially; I'm not sure whether that's the case with deep GPs (I might be wrong, I'm no expert on GPs). So far I've seen GPs used only obscurely: somebody uses a GP to figure out a small set of hyperparams. One prominent example is the AlphaGo Zero paper; there's a single sentence in the "Methods" section where they mention using one to tune MCTS's hyperparams, and whether that was even necessary is not at all clear from the paper. So I'm still looking for a use-case where GPs are definitely the right thing to do. I'd love to hear some examples if you know of them! Thanks for the lecture! I found the first part especially useful!
@sakcee · 1 year ago
Excellent !!! very clear explanation
@GGasparis7 · 4 years ago
amazing video, thank you very much
@GauravJoshi-te6fc · 1 year ago
Woah! Amazing explanation.
@norkamal7697 · 2 years ago
The best GP explanation evaaa
@redberries8039 · 3 years ago
this is nicely done
@Nunocesarsa · 4 years ago
epic class!
@jinyunghong · 4 years ago
Great video :)
@cexploreful · 1 year ago
WOOOOOOOOOOOOOOOW, you blew my mind! 🤯
@zakreynolds5472 · 1 year ago
Thanks, this presentation has been really useful, but I am a little stuck and have a question. In the first portion of the presentation, the covariance function is used to show correlation between random variables (x-axis = variable index), but from there on it seems to be used to compare values within the same variable (from X in bold on the axis to lower-case x). I appreciate that this is the difference between multivariate and univariate (I think?), but could you please elaborate?
@yeshuip · 1 year ago
I understood that the variable index corresponds to the variable and we are plotting its values; then somehow you start talking about the variable index taking real values and forget about the distances. I didn't understand this concept. Can anyone explain this to me?
@zitafang7888 · 8 months ago
Thanks for your explanation. May I ask where I can download the slide?
@bernamdc · 3 years ago
At 14:29, why is the 3rd point above the 2nd point? I would expect it to be slightly below, as it is very correlated with point 2 and a bit correlated with point 1
@ardeshirmoinian · 4 years ago
Does anyone know of a good description on learning the hyperparameters using k-fold cv?
@Jononor · 2 years ago
Does anyone have some insights on how this relates to the Radial Basis Function (RBF) kernel, as used in for example SVM?
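For what it's worth, the squared-exponential covariance used throughout the talk has the same functional form as the SVM-style RBF kernel; with the common parameterisations they coincide when gamma = 1/(2·l²). A quick sketch (the function names here are illustrative, not from the talk or any library):

```python
import numpy as np

def se(x1, x2, l=1.0):
    # squared-exponential covariance with length-scale l
    return np.exp(-0.5 * ((x1 - x2) / l) ** 2)

def rbf(x1, x2, gamma=0.5):
    # SVM-style RBF kernel; identical to se() when gamma = 1 / (2 * l**2)
    return np.exp(-gamma * (x1 - x2) ** 2)
```

The difference is in how the two communities use it: the SVM treats the kernel as a fixed similarity measure, while the GP treats it as a prior covariance whose hyper-parameters are learned from the marginal likelihood.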
@7andromeda · 3 years ago
Not sure how he goes from the variable index on the x-axis to data points on the x-axis in the visualizations. What is X at 20:20? Is each point in X a data instance, or a single feature value? I guess this X is just one dimension.
@mathewspeter1274 · 5 years ago
Great explanation, thank you. Is the PPT or PDF file that is presented available for download? Which tool/script is used to generate the contour plots and blue-coloured prediction plots? Is it the scikit-learn Python library?
@ret2666
5 years ago
Slides for this and similar presentations are here: cbl.eng.cam.ac.uk/Public/Turner/Presentations
@chenxin4741
5 years ago
Perfect slides for GP
@pr749
4 years ago
@@ret2666 Hello Richard, first off, amazing explanation of the Gaussian process origins and motivations. I was wondering whether there might have been some notation mix-up on the slide at 22:10 (s. 15). Since K(x1,x2) with scalar x is also a scalar, in the final covariance Sigma(x1,x2) = K(x1,x2) + I*sigma^2_y, maybe you originally differentiated between element-wise covariances such as k(x1,x2) and the matrix collection of element-wise covariance functions K(x1,x2), so that element K_12 = k(x1,x2) = exp(...)?
@ret2666
4 years ago
@@pr749 Thanks for the comment. You're right that I should have written this as: Sigma(x1,x2) = K(x1,x2) + I(x1,x2) sigma^2_y, and explained that I(x1,x2) is a function that is 1 when x1=x2 and zero otherwise. Hope that clarifies things.
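The corrected expression from the reply above can be written down directly. This is a minimal sketch with hypothetical helper names, assuming the squared-exponential form for the element-wise covariance:

```python
import numpy as np

def se_cov(x1, x2, l=1.0, sigma2_f=1.0):
    # element-wise squared-exponential covariance k(x1, x2)
    return sigma2_f * np.exp(-0.5 * ((x1 - x2) / l) ** 2)

def noisy_cov(x1, x2, sigma2_y=0.1, **kw):
    # Sigma(x1, x2) = k(x1, x2) + I(x1, x2) * sigma2_y,
    # where I(x1, x2) is 1 when x1 == x2 and 0 otherwise
    return se_cov(x1, x2, **kw) + (sigma2_y if x1 == x2 else 0.0)
```

Observation noise is added only on the diagonal (when x1 equals x2), which is exactly the role of the indicator function I(x1, x2) in the corrected slide notation.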
@saikabhagat
4 years ago
@@ret2666 The best explanation on the web by far. Thanks for the link. Somehow it seems unavailable. Is there an alternative location? Truly appreciate your attention.
@parthasarathimukherjee7020 · 4 years ago
How are they assuming that the covariance matrix (similarity between dimensions) is the same as the kernel matrix (similarity between data points)?
@ganeshsk106
4 years ago
Hi Partha, I have the same confusion. Were you able to understand this? Also, from 56:10 in the video, he starts saying that they have a collection of inputs (X) and respective ground truth (Y). So the prior assumption is that the data were generated using the *Squared Exponential Kernel*. If my understanding is right, the data are 1-D, and with N data points the kernel matrix will be N×N. Is that right?
@zakreynolds5472
1 year ago
@@ganeshsk106 I am having same confusion. If anyone could explain this it would really help me out!
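On the N×N question in this thread: yes, for N scalar inputs the Gram matrix has one row and one column per data point, with entry (i, j) equal to k(x_i, x_j). A minimal sketch (illustrative data, unit hyper-parameters):

```python
import numpy as np

# N = 10 scalar inputs drawn at random for illustration
x = np.random.default_rng(0).uniform(-3.0, 3.0, size=10)

# N x N Gram matrix for the squared-exponential kernel:
# entry (i, j) is k(x_i, x_j) = exp(-0.5 * (x_i - x_j)**2)
d = x[:, None] - x[None, :]
K = np.exp(-0.5 * d ** 2)
```

The matrix is symmetric with ones on the diagonal (every point has unit covariance with itself); the "similarity between dimensions" and "similarity between data points" views coincide because each data point's function value is treated as one dimension of a single big Gaussian.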
@lahaale5840 · 2 years ago
Does GP only work on super simple data like y = sin(x) + N()? In my experience, even a simple model like linear regression can beat GP on real-world data.
@zacharythatcher7328 · 4 years ago
Can someone explain what is actually being done at 43:30? I understand that you are maximizing the likelihood of getting your outputs, y, given some inputs by varying sigma and l. But what is the output that you are optimizing for? The function at every point other than the known?
@ianmoore957
4 years ago
Spatially, I like to think of it like a 3D curve (with L, sigma2, and log p(y|theta) as the axis, and theta being your parameter set [L, sigma2]) with a peak (ie, peak -> maximum point of log p(y|theta)); if you take that peak, and project down onto a point on the L,sigma2 plane (ie, [L*,sigma2*]); you have the estimates of your parameters L and sigma2
@MayankGoel447
1 year ago
I guess over all the possible outputs y. Whichever y has the highest probability, you take the corresponding l, sigma^2
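To make the objective in this thread concrete: the quantity being maximised at 43:30 is the log marginal likelihood log p(y|θ) of the observed outputs y, which has a closed form for a GP. A crude grid search over the two hyper-parameters might look like this (illustrative data and grid, not the lecture's code; a real implementation would use gradients):

```python
import numpy as np

def log_marginal_likelihood(x, y, l, sigma2_f, sigma2_y=1e-2):
    # log p(y | theta) for a zero-mean GP with squared-exponential covariance
    d = x[:, None] - x[None, :]
    K = sigma2_f * np.exp(-0.5 * (d / l) ** 2) + sigma2_y * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(x) * np.log(2 * np.pi))

x = np.linspace(-3.0, 3.0, 25)
y = np.sin(x)

# crude grid search over (length-scale l, signal variance sigma2_f)
grid = [(l, s2) for l in (0.3, 1.0, 3.0) for s2 in (0.5, 1.0, 2.0)]
best = max(grid, key=lambda th: log_marginal_likelihood(x, y, *th))
```

Note there is no prediction at new points here: the objective only involves the observed y, trading data fit (the quadratic term) against model complexity (the log-determinant term), which is why it gives sensible hyper-parameters without a validation set.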
@appliedstatistics2043 · 7 months ago
Does anyone know where to download the slides?
@maddoo23 · 2 years ago
At 45:30, the covariance of Brownian motion is cov(B_s, B_t) = min(s, t), right? And not what's given on the slide...
@ret2666
1 year ago
See here for the sense this is Brownian motion: en.wikipedia.org/wiki/Ornstein-Uhlenbeck_process
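The distinction in this exchange: the slide's exponential ("Laplacian") covariance is that of a stationary Ornstein-Uhlenbeck process, while standard Brownian motion has cov(B_s, B_t) = min(s, t). A sketch of the two, with hypothetical function names:

```python
import numpy as np

def ou_cov(s, t, l=1.0):
    # stationary Ornstein-Uhlenbeck covariance:
    # the exponential ("Laplacian") kernel exp(-|s - t| / l)
    return np.exp(-abs(s - t) / l)

def bm_cov(s, t):
    # standard Brownian motion covariance, for comparison
    return min(s, t)
```

The OU covariance depends only on |s - t| (stationary), whereas the Brownian motion covariance grows with the absolute position of the earlier time point (non-stationary), which is the sense in which the two are different processes with locally similar rough sample paths.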
@kianacademy7853 · 7 months ago
The rational quadratic kernel has an |x1-x2|^2 term, not |x1-x2|.
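For reference, the usual rational quadratic form uses the squared distance, as the comment notes; a sketch with hypothetical parameter defaults:

```python
import numpy as np

def rational_quadratic(x1, x2, l=1.0, alpha=1.0, sigma2=1.0):
    # k(x1, x2) = sigma^2 * (1 + |x1 - x2|^2 / (2 * alpha * l^2)) ** (-alpha)
    r2 = (x1 - x2) ** 2
    return sigma2 * (1.0 + r2 / (2.0 * alpha * l ** 2)) ** (-alpha)
```

As alpha grows large this approaches the squared-exponential kernel; it can be viewed as a scale mixture of squared exponentials with different length-scales.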
@apbosh1 · 3 years ago
What practical use have you made of this, apart from teaching it? My head exploded about 1 minute in. Clever stuff!
@ryankortvelesy9402 · 3 years ago
51:20 yo dawg I heard you like gaussians so I put an infinite gaussian in your infinite gaussian
@stevepoper8073
3 years ago
Actually ;D
@yeshuip · 1 year ago
Hello, can anyone provide the code, please?
@o0BluMenTopfErde0o · 3 years ago
Now it's becoming a shoe draus!
@forheuristiclifeksh7836 · 11 months ago
52:33
@pattiknuth4822 · 3 years ago
This video was in many cases INCREDIBLY annoying. Students would ask questions that were not loud enough to understand. Turner didn't repeat the questions, so you have no idea what was asked. Sometimes these questions were long, so you get long gaps in the audio. Pro tip: if you're going to allow questions during a lecture, repeat each question so everyone else knows what was asked and the answer then means something.