You can gain access to a world of education through Stanford Online, the Stanford School of Engineering’s portal for academic and professional education offered by schools and units throughout Stanford University. online.stanford.edu/
Our robust catalog of degree programs, credit-bearing education, professional certificate programs, and free and open content is developed by Stanford faculty, enabling you to expand your knowledge, advance your career, and enhance your life.
Stanford Online is operated and managed by the Stanford Center for Professional Development (SCPD), the global and online education unit within Stanford Engineering. SCPD works closely with Engineering departments, programs, and centers to design and deliver engaging, high-quality online, in-person, and blended learning experiences to both matriculated students and a worldwide audience of learners. SCPD collaborates with many Stanford schools and centers to expand the university-wide offerings available online.
Comments
12:41 I wonder why CFGs are not used in computational linguistics... Is it maybe because giving a word its category is unnecessary, since categories are meaning-based, which matters to humans but not to machines?
Congratulations on the high-level class! Greetings from Brazil.
0:00:55 Deep neural network (DNN): hᶿ(x) = Wᵣσ(Wᵣ₋₁σ(…W₂σ(W₁x)…)) with r layers; layer fᵢ is composed of a matrix multiplication (MM) by Wᵢ and a Lipschitz nonlinear activation σ
0:03:00 Theorem: Rademacher complexity (RC) of DNNs
0:04:15 Corollary: generalization error of DNNs
0:07:25 Spectral/operator norm of a matrix <=> largest singular value <=> Lipschitzness of the matrix
0:09:35 Fundamental idea: cover 𝓕 iteratively (with more and more layers); use Lipschitzness; control how the error propagates from layer to layer
0:14:40 Proof in two steps: ➀ control the covering number of each layer ➁ combine single-layer covering numbers into a multiple-layer covering number
0:16:05 Lemma: log covering numbers sum. The target cover radius ε is a linear combination of the layer cover radii weighted by the layer Lipschitz constants; log N (the log covering number) is the sum of the layer covering-number bounds
0:22:10 Proof of lemma
0:36:20 fᵣ∘fᵣ₋₁∘…∘f₂∘f₁ ∈ 𝓕ᵣ∘𝓕ᵣ₋₁∘…∘𝓕₂∘𝓕₁
0:49:52 Proof of the 0:03:00 theorem
1:01:45 Proof done in 3 lines using Hölder's inequality
1:22:15 Next time: L(θ) ≤ fun(Lip of fᶿ on x₁,…,xₙ, norms of θ), where fun is a polynomial (as opposed to exponential)
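The 0:07:25 point (spectral norm <=> Lipschitzness) can be checked numerically. A minimal NumPy sketch with made-up layer sizes: ReLU is 1-Lipschitz, so the product of the layers' spectral norms bounds the whole network's Lipschitz constant.

```python
import numpy as np

# Layer sizes and weights here are made up for illustration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32))
W2 = rng.normal(size=(16, 64))

relu = lambda z: np.maximum(z, 0.0)       # 1-Lipschitz activation sigma
f = lambda x: W2 @ relu(W1 @ x)           # 2-layer network

# Spectral norm = largest singular value; ord=2 gives exactly that.
lip_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)

x, y = rng.normal(size=32), rng.normal(size=32)
ratio = np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
print(ratio <= lip_bound)  # the empirical slope never exceeds the bound
```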
I have no doubt that he was an excellent teacher. Now I am studying one of his books (The art of computer programming, 1997). Greetings from Ecuador.
Love this guy's lecture
0:03:00 How to handle infinite 𝓕 function or Q output spaces: ε-cover, 𝗰𝗼𝘃𝗲𝗿𝗶𝗻𝗴 𝗻𝘂𝗺𝗯𝗲𝗿 N(ε,Q,ρ); empirical probability distribution Pₙ on the n fixed data points (uniform); L₂(Pₙ) metric; N(ε,𝓕,L₂(Pₙ)) = N(ε,Q,(1/√n)‖·‖₂)
0:11:40 𝗧𝗵𝗲𝗼𝗿𝗲𝗺: 𝘁𝗿𝗶𝘃𝗶𝗮𝗹 𝗱𝗶𝘀𝗰𝗿𝗲𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗼𝗳 𝘁𝗵𝗲 𝗤 𝗼𝘂𝘁𝗽𝘂𝘁 𝘀𝗽𝗮𝗰𝗲. For 𝓕 = { f: Z → {−1,+1} }: Rₛ(𝓕) ≤ ε + √{2 log N(ε,𝓕,L₂(Pₙ))/n} = (discretization error) + (RC of the finite ε-cover)
0:13:25 Proof
0:19:00 𝗗𝘂𝗱𝗹𝗲𝘆'𝘀 𝗰𝗵𝗮𝗶𝗻𝗶𝗻𝗴 𝘁𝗵𝗲𝗼𝗿𝗲𝗺, 𝗵𝗶𝗲𝗿𝗮𝗿𝗰𝗵𝗶𝗰𝗮𝗹 𝗱𝗶𝘀𝗰𝗿𝗲𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗼𝗳 𝘁𝗵𝗲 𝗤 𝘀𝗽𝗮𝗰𝗲 (a stronger discretization theorem). For 𝓕 = { f: Z → ℝ }: Rₛ(𝓕) ≤ 12 ∫₀^∞ √{log N(ε,𝓕,L₂(Pₙ))/n} dε
0:21:55 Intuition of the proof
0:30:00 Proof of Dudley
0:55:50 Interpretation of Dudley: 𝘁𝗵𝗿𝗲𝗲 𝗰𝗮𝘀𝗲𝘀 𝗼𝗳 ε-𝗱𝗲𝗽𝗲𝗻𝗱𝗲𝗻𝗰𝗲 𝗼𝗳 𝗗𝘂𝗱𝗹𝗲𝘆'𝘀
(a) if N(ε,𝓕,L₂(Pₙ)) is of the form (1/ε)^R, then ∫₀¹ √{log N(ε,𝓕,L₂(Pₙ))/n} dε = (1/√n) ∫₀¹ √{R log(1/ε)} dε ≍ √{R/n}
(b) if of the form a^(R/ε), then ∫₀¹ · dε = Õ(√{R/n})
(c) if of the form a^(R/ε²) (the most frequent case!), then ∫₀¹ · dε = ∞. Fixed by 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝗱 𝗗𝘂𝗱𝗹𝗲𝘆'𝘀: Rₛ(𝓕) ≤ 4α + 12 ∫_α^∞ √{2 log N(ε,𝓕,L₂(Pₙ))/n} dε = (discretization) + (RC of the finite ε-cover), with α the lowest bound of the ε discretization; if α ~ 1/poly(n), then Rₛ(𝓕) ≤ O(√{R/n})
1:09:40 𝗧𝗵𝗲𝗼𝗿𝗲𝗺: 𝗰𝗼𝘃𝗲𝗿𝗶𝗻𝗴-𝗻𝘂𝗺𝗯𝗲𝗿 𝗯𝗼𝘂𝗻𝗱𝘀 𝗳𝗼𝗿 𝗹𝗶𝗻𝗲𝗮𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 (Theorem 3 of Zhang 2002): log N(ε,𝓕_q,ρ) = [B²C²/ε²] log₂(2d+1); see case (c) of Dudley's
1:14:15 Use the 𝗰𝗼𝗻𝘃𝗲𝗿𝘀𝗶𝗼𝗻 R = BC, B = norm of the classifier, C = norm of the data: RC ≤ Õ(√{R/n}) = Õ(BC/√n)
1:15:35 𝗖𝗼𝘃𝗲𝗿𝗶𝗻𝗴 𝗻𝘂𝗺𝗯𝗲𝗿 𝗳𝗼𝗿 𝗺𝘂𝗹𝘁𝗶𝘃𝗮𝗿𝗶𝗮𝘁𝗲 𝗹𝗶𝗻𝗲𝗮𝗿 𝗳𝘂𝗻𝗰𝘁𝗶𝗼𝗻𝘀 (𝗺𝗮𝘁𝗿𝗶𝗰𝗲𝘀), used as a building block: 𝓕 = {x ↦ Wx : W ∈ ℝᵐˣᵈ, ‖W‖₂,₁ ≤ B}, where ‖W‖₂,₁ = 𝚺ᵢ₌₁ᵐ ‖wᵢ‖₂ with wᵢᵀ the rows of W; log N(ε,𝓕,L₂(Pₙ)) ≤ [B²C²/ε²] log(2dm); see case (c) of Dudley's
1:19:30 ‖W‖₂,₁ = 𝚺ᵢ₌₁ᵐ ‖wᵢ‖₂ is a sum of the complexity measures of m linear models
1:20:30 𝗟𝗶𝗽𝘀𝗰𝗵𝗶𝘁𝘇 𝗰𝗼𝗺𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻 (see Talagrand's lemma from an earlier lecture): for κ-Lipschitz φ, an ε/κ-cover of 𝓕 yields an ε-cover of φ∘𝓕, so log N(ε,φ∘𝓕,ρ) ≤ log N(ε/κ,𝓕,ρ)
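Case (a) can be worked out in closed form, which makes the √{R/n} rate explicit. A sketch, using the substitution ε = e^{−t}:

```latex
\int_0^1 \sqrt{\frac{\log N(\varepsilon,\mathcal{F},L_2(P_n))}{n}}\,d\varepsilon
= \sqrt{\frac{R}{n}}\int_0^1 \sqrt{\log(1/\varepsilon)}\,d\varepsilon
= \sqrt{\frac{R}{n}}\int_0^\infty \sqrt{t}\,e^{-t}\,dt
= \Gamma\!\left(\tfrac{3}{2}\right)\sqrt{\frac{R}{n}}
= \frac{\sqrt{\pi}}{2}\sqrt{\frac{R}{n}}
\asymp \sqrt{\frac{R}{n}}.
```

So for polynomial covering numbers the Dudley integral converges with no lower cutoff, unlike case (c), where the α cutoff of improved Dudley's is needed.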
THANK YOU! this is what I've been looking for.
Great duality lecture!
This guy is not rigorous.
what a cliffhanger!
thank you!
This talk and the slides are gold! Love the whirlwind tour of the BC (Before ChatGPT) and AD (After Deployment of ChatGPT). How fast time went by ... I barely remember Vicuna, and it was just a year ago :D
I actually used the statistical methods to extract the correct spelling of the function for the linear discriminant analysis. Spoiler: it is "lda" Great book btw
Thanks, great explanation!
@27:40 Chris forgets he's a professor and jumps straight to being Chainsaw Man 😂😂
Section three exists twice. RIP
Awesome
26:30 Parameters / time complexity without reuse
34:20? Invert x
35:15 Parameters / time complexity with reuse of weights w
43:40? A Bayesian network probability table trained on infinite data would, in principle, be able to capture any relationship
44:00? Difference between lhs and rhs
49:00 Parameterise a continuous rv with K Gaussians
53:40 Autoencoders as non-linear PCA (unsupervised)
56:30 Enforce an ordering so that autoencoders can generate samples
1:01:30? Mask weights to get parameters in one pass
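The masking idea at 1:01:30 can be sketched in a few lines: zero out weights above the diagonal so that output i depends only on inputs before it, which lets one forward pass produce all the conditionals of an autoregressive model (MADE-style masking). The dimension d and the weights below are made up for illustration.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
mask = np.tril(np.ones((d, d)), k=-1)   # strictly lower-triangular mask

x = rng.normal(size=d)
out = (mask * W) @ x
# out[0] depends on nothing; out[i] depends only on x[:i],
# so changing x[-1] leaves every output unchanged.
x2 = x.copy()
x2[-1] += 1.0
out2 = (mask * W) @ x2
print(np.allclose(out, out2))  # True: the last input feeds no output
```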
🤨
I am grateful to Stanford University and wish these open courses to be up-to-date and continuous.
For cyber security?
BRILLIANT TEACHER
Thank you, Chris <3 What an amazing course, made amazing by the amazing professor
Where can I get the psets?
He has his slides in his head! Loved the content.
0:01:10 𝗖𝗹𝗮𝘀𝘀𝗶𝗰𝗮𝗹 𝗺𝗮𝗰𝗵𝗶𝗻𝗲 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 (ML) 𝘁𝗵𝗲𝗼𝗿𝘆
➀ 𝗔𝗽𝗽𝗿𝗼𝘅𝗶𝗺𝗮𝘁𝗶𝗼𝗻 𝘁𝗵𝗲𝗼𝗿𝘆 / expressivity / representation power: best model in class, bound L(θ⃰) = min_{θ∈Θ} L(θ); but there may be a better class of functions... Approximation theory asks why the hypothesis class is expressive enough to contain the hypothesis function of concern
➁ 𝗚𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝘁𝗵𝗲𝗼𝗿𝘆 / statistical aspect: excess risk is the central notion (last few weeks of lectures); bound L(θ̂)−L(θ⃰) [excess risk] ≤ L(θ̂)−L̂(θ̂) [generalization loss] + |L(θ⃰)−L̂(θ⃰)| [uninteresting term, always ~1/√n]; L(θ̂)−L̂(θ̂) ≤ √{complexity/n}. Occam's razor: the simplest, most parsimonious explanation is the best explanation; bias-variance trade-off. L̂_reg(θ) = L̂(θ) + λR(θ) with R the regularizer; then the statistical claim: if θ̂_λ is the global min of L̂_reg, then L(θ̂)−L(θ⃰) or L(θ̂)−L̂(θ̂) are bounded
➂ 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻: how to numerically find θ̂ = argmin_{θ∈Θ} L̂(θ) or L̂_reg(θ); convex optimization / gradient descent (GD) / stochastic gradient descent (SGD)
0:14:55 In 𝗗𝗲𝗲𝗽 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 (DL) 𝘁𝗵𝗲𝗼𝗿𝘆 two fundamental things change: ➀ non-linear model -> non-convex loss; ➁ overparametrized models: more parameters generally help -> even p >> n is good, even after you reach zero training error -> 𝗺𝘆𝘀𝘁𝗲𝗿𝘆
0:19:45 ➀➁➂ become intertwined in DL
➀ 𝗔𝗽𝗽𝗿𝗼𝘅𝗶𝗺𝗮𝘁𝗶𝗼𝗻 𝘁𝗵𝗲𝗼𝗿𝘆 doesn't change much in DL: large models are expressive (universal approximation theory; though if models need to be exponentially large, they are not really implementable); min_{θ∈Θ} L̂(θ) small or zero training loss, perfectly memorizing the training data if there are > n neurons; it may not generalize, though
➁ 𝗚𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻: only weak regularization is used
➂ 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻: optimization is not only about finding any global minimizer (different ones might have different test performance); a practical optimizer should find a θ̂ that ➀ is a global minimum of L̂(θ) and ➁ has a special property, e.g. low complexity -> can generalize well; SGD with certain properties, e.g. batch sizes
0:37:50 The tasks are:
Task 1 (optimization): prove that the optimizer converges to an approximate global/local min of L̂(θ)
Task 2 (regularization): the minimizer θ = θ̂ from Task 1 is of low enough complexity to generalize well
Task 3 (generalization bound): ∀θ of small enough complexity with L̂(θ) ≈ 0, the test error L(θ) is also small
0:56:45 Generalization bounds for neural networks. Setup for two-layer networks: θ = (w,U), U ∈ ℝᵐˣᵈ, fᶿ(x) = wᵀφ(Ux) ∈ ℝ, φ elementwise ReLU. Goals: ➀ show Rademacher complexity bounds ➁ how useful are these bounds in practice
0:59:50 𝗥𝗮𝗱𝗲𝗺𝗮𝗰𝗵𝗲𝗿 𝗰𝗼𝗺𝗽𝗹𝗲𝘅𝗶𝘁𝘆 𝗯𝗼𝘂𝗻𝗱 𝘁𝗵𝗲𝗼𝗿𝗲𝗺 𝗳𝗼𝗿 𝟮-𝗹𝗮𝘆𝗲𝗿 𝗡𝗡 #𝟭
1:12:50 Proof: ~remove the sup from the definition of RC
1:21:23 Lipschitz composition: Talagrand's lemma on the ReLU φ
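The 0:56:45 two-layer setup is tiny to instantiate, which helps keep the notation straight: fᶿ(x) = wᵀφ(Ux) with U ∈ ℝᵐˣᵈ, w ∈ ℝᵐ, and φ an elementwise ReLU. A sketch with made-up dimensions and random parameters:

```python
import numpy as np

m, d = 8, 3                      # hidden width m, input dimension d (made up)
rng = np.random.default_rng(1)
U = rng.normal(size=(m, d))      # first-layer weights
w = rng.normal(size=m)           # second-layer weights

f = lambda x: w @ np.maximum(U @ x, 0.0)   # phi = elementwise ReLU
y = f(rng.normal(size=d))                  # a single real-valued output
```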
IIT JEE 1984 top ten ranker. Gold medalist from IIT Kanpur, batch of 1988.
28:00 A unique view on attention. In this image all 6 nodes are related to all 6 nodes in the self-attention case. And in cross-attention it would be like set A sending messages to the nodes in set B. And voilà, it's a fully-connected layer! But with tokens passed instead of values
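That fully-connected message-passing view can be sketched directly: in scaled dot-product attention every query attends to every key, so self-attention is all 6 tokens talking to all 6, and cross-attention is set A's tokens sending messages to set B's queries. Shapes below are made up for illustration.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # all-pairs affinities
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over senders
    return w @ V                              # weighted message aggregation

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # 6 tokens, 4-dim embeddings
A = rng.normal(size=(5, 4))      # sender set A
B = rng.normal(size=(3, 4))      # receiver set B
self_out = attention(X, X, X)    # shape (6, 4): 6 nodes talk to 6 nodes
cross_out = attention(B, A, A)   # shape (3, 4): A sends messages to B
```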
Andrew Ng also used the same kind of example to explain LMs.
I think at 37:10 the professor did not make it quite clear for probability = 0. The student confused probability with possibility. It is totally fine for an event A with p(A) = 0 to still occur. Am I right?
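The probability-zero point can be seen with a continuous distribution: for X ~ Uniform(0, 1), P(X = 0.5) = 0, yet values near 0.5 occur all the time; the probability of landing within ε of 0.5 is about 2ε and shrinks to 0 with ε. A quick simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=1_000_000)
for eps in (0.1, 0.01, 0.001):
    frac = np.mean(np.abs(samples - 0.5) < eps)
    print(eps, frac)   # frac is roughly 2*eps and shrinks toward 0
```

Every single draw is an event that had probability exactly zero, yet one of them always happens.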
Please put the subject of the talk in the title. You can then market the OpenAI speakers
Hi. Can anyone recommend any textbook that can help in further study of this course. Thank you
How do we know what is small vs large? For example, with emergent tasks, it highlights that more data could lead to more accuracy with enough compute. The small LM would not have seen accuracy improvements, but the large LM did. For the tasks currently shown as flat, couldn't it just be that we don't have enough compute yet to know whether these tasks would get more accurate?
I could be wrong... But as I understand what Mr. Lamport is saying... This is just digital design... Combinational... Sequential circuits... I could also be wrong but... Clocks are more like combinational circuits... On the other hand, sequential circuits have clock circuits in them... 🤷
The students were asking some great questions, no wonder I don't go to Stanford
im the dude at the end (dont go to Stanford xd)
Surprised by the amount of hair an AI scholar may have retained.
100x😊
Thank youuuu
Great introduction on deep generative models!
In the poker question, shouldn't there be 42 options for A', since one of the 7 cards already on the table is the ace of clubs?
we have 52 cards in total
Thanks for sharing this
This video is crazy 🔥🔥 It's interesting to see how much DeFi has grown, and it's great to be able to watch this seminar on YouTube. It was an interesting chat to listen to. DeFi is growing globally now, not just in the US but in other countries too (keep in mind I'm commenting from outside the U.S.). Crazy stuff, excellent content 👍🏻
Shouldn't there be a separate YouTube channel for AI from Stanford?
Strange world. This dude is almost a kid and gives a lecture
I am happy to learn from any kid :)
His intuition is older than me
Love the section on "kale divergence"! Thanks, YouTube auto-captioning! 😂
Great lecture. But at times a little faster-paced than Christopher Manning.
I am addicted to Prof. Jure's accent now😂!!!
Very clear and interesting lecture