Stanford Online

You can gain access to a world of education through Stanford Online, the Stanford School of Engineering’s portal for academic and professional education offered by schools and units throughout Stanford University. online.stanford.edu/

Our robust catalog of degree programs, credit-bearing education, professional certificate programs, and free and open content is developed by Stanford faculty, enabling you to expand your knowledge, advance your career, and enhance your life.

Stanford Online is operated and managed by the Stanford Center for Professional Development (SCPD), the global and online education unit within Stanford Engineering. SCPD works closely with Engineering departments, programs, and centers to design and deliver engaging, high-quality online, in-person, and blended learning experiences to both matriculated students and a worldwide audience of learners. SCPD collaborates with many Stanford schools and centers to expand the university-wide offerings available online.

Comments

  • @yeachanchoi449 · 17 minutes ago

    12:41 I wonder why CFG is not used in computational linguistics... Is it maybe because assigning a word its category is unnecessary, since categories are meaning-based distinctions that matter to humans but not to machines?

  • @eng.edersilva · 5 hours ago

    Congratulations on the high-level class! Greetings from Brazil.

  • @ferencszalma7094 · 10 hours ago

    0:00:55 Deep neural network (DNN): hᶿ(x) = Wᵣσ(Wᵣ₋₁σ(...W₂σ(W₁x)...)) with r layers; layer fᵢ is composed of a matrix multiplication (MM) by Wᵢ and a Lipschitz nonlinear activation σ
    0:03:00 Theorem: Rademacher complexity (RC) of DNNs
    0:04:15 Corollary: generalization error of DNNs
    0:07:25 Spectral/operator norm of a matrix <=> largest singular value <=> Lipschitzness of the matrix
    0:09:35 Fundamental idea: cover 𝓕 iteratively (with more and more layers), use Lipschitzness, and control how the error propagates from layer to layer
    0:14:40 Proof in two steps: ➀ control the covering number of each layer, ➁ combine the single-layer covering numbers into a multi-layer covering number
    0:16:05 Lemma (sum of log covering numbers): the target cover radius ε is a linear combination of the layer cover radii weighted by the layer Lipschitz constants; the log covering number log N is the sum of the per-layer covering number bounds
    0:22:10 Proof of the lemma
    0:36:20 fᵣ∘fᵣ₋₁∘...∘f₂∘f₁ ∈ 𝓕ᵣ∘𝓕ᵣ₋₁∘...∘𝓕₂∘𝓕₁
    0:49:52 Proof of the 0:03:00 theorem
    1:01:45 Proof done in 3 lines using Hölder's inequality
    1:22:15 Next time: L(θ) ≤ fun(Lip of fᶿ on x₁,...,xₙ, norms of θ), where fun is a polynomial (as opposed to exponential)
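
    A minimal NumPy sketch of the 0:07:25–0:09:35 idea (not from the lecture; the toy layer shapes are made up): with a 1-Lipschitz activation σ such as ReLU, the product of the layers' spectral norms upper-bounds the Lipschitz constant of hᶿ, which is what controls how covering error propagates from layer to layer.

        import numpy as np

        def spectral_norm(W):
            # Operator (spectral) norm = largest singular value = Lipschitz constant of x -> Wx.
            return np.linalg.svd(W, compute_uv=False)[0]

        def lipschitz_upper_bound(weights):
            # For h(x) = W_r s(W_{r-1} s(... s(W_1 x)...)) with 1-Lipschitz s,
            # Lip(h) <= prod_i ||W_i||_op.
            return float(np.prod([spectral_norm(W) for W in weights]))

        rng = np.random.default_rng(0)
        Ws = [rng.standard_normal((20, 10)),   # W_1: R^10 -> R^20
              rng.standard_normal((20, 20)),   # W_2
              rng.standard_normal((1, 20))]    # W_3: output layer
        print(lipschitz_upper_bound(Ws))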

  • @duduromeroa · 11 hours ago

    I have no doubt that he was an excellent teacher. Now I am studying one of his books (The Art of Computer Programming, 1997). Greetings from Ecuador.

  • @chongsun7872 · 11 hours ago

    Love this guy's lecture

  • @ferencszalma7094 · 15 hours ago

    0:03:00 How to handle infinite function spaces 𝓕 or output spaces Q: ε-cover, covering number N(ε,Q,ρ); empirical probability distribution Pₙ on the n fixed data points (uniform); L₂(Pₙ) metric; N(ε,𝓕,L₂(Pₙ)) = N(ε,Q,(1/√n)‖·‖₂)
    0:11:40 Theorem (trivial discretization of the output space Q): for 𝓕 = { f: Z → {-1,+1} }, Rₛ(𝓕) ≤ ε + √(2 log N(ε,𝓕,L₂(Pₙ))/n) = (discretization) + (RC of the finite ε-cover)
    0:13:25 Proof
    0:19:00 Dudley's chaining theorem, hierarchical discretization of the space Q; stronger discretization theorem: for 𝓕 = { f: Z → ℝ }, Rₛ(𝓕) ≤ 12 ∫₀^∞ √(log N(ε,𝓕,L₂(Pₙ))/n) dε
    0:21:55 Intuition of the proof
    0:30:00 Proof of Dudley
    0:55:50 Interpretation of Dudley; three cases of the ε-dependence:
    (a) if N(ε,𝓕,L₂(Pₙ)) is of the form (1/ε)^R, then ∫₀¹ √(log N(ε,𝓕,L₂(Pₙ))/n) dε = (1/√n) ∫₀¹ √(R log(1/ε)) dε ≍ √(R/n)
    (b) if of the form a^(R/ε), then ∫₀¹ · dε = Õ(√(R/n))
    (c) if of the form a^(R/ε²) — the most frequent case! — then ∫₀¹ · dε = ∞; fixed by the improved Dudley bound Rₛ(𝓕) ≤ 4α + 12 ∫_α^∞ √(2 log N(ε,𝓕,L₂(Pₙ))/n) dε = (discretization) + (RC of the finite ε-cover), with α the lower end of the ε discretization; if α ~ 1/poly(n), then Rₛ(𝓕) ≤ Õ(√(R/n))
    1:09:40 Theorem: covering number bounds for linear models (Theorem 3 of Zhang 2002): log N(ε,𝓕_q,ρ) ≤ [B²C²/ε²] log₂(2d+1) — see case (c) of Dudley's
    1:14:15 Use the conversion R = B²C², with B = norm of the classifier and C = norm of the data: RC ≤ Õ(√(R/n)) = Õ(BC/√n)
    1:15:35 Covering number for multivariate linear functions (matrices), used as a building block: 𝓕 = {x ↦ Wx : W ∈ ℝᵐˣᵈ, ‖W‖₂,₁ < B}, where ‖W‖₂,₁ = Σᵢ₌₁ᵐ ‖wᵢᵀ‖₂ (sum of row norms); log N(ε,𝓕,L₂(Pₙ)) ≤ [B²C²/ε²] log(2dm) — see case (c) of Dudley's
    1:19:30 Sum of the complexity measures of the m linear models: ‖W‖₂,₁ = Σᵢ₌₁ᵐ ‖wᵢᵀ‖₂
    1:20:30 Lipschitz composition (see the Talagrand lemma from an earlier lecture): an ε/κ-cover of 𝓕 yields an ε-cover of φ∘𝓕 for κ-Lipschitz φ, so log N(ε,φ∘𝓕,ρ) ≤ log N(ε/κ,𝓕,ρ)
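
    A small numerical illustration of the improved-Dudley fix for case (c), as a sketch under the assumption log N(ε) = R/ε² for ε ≤ 1 (and 0 beyond): truncating the entropy integral at α ~ 1/√n trades a 4α term for a finite integral and gives an Õ(√(R/n)) bound.

        import numpy as np

        def improved_dudley_case_c(R, n, alpha, eps_max=1.0):
            # 4*alpha + 12 * int_alpha^eps_max sqrt(2*log N(eps)/n) d eps with log N(eps) = R/eps^2.
            eps = np.linspace(alpha, eps_max, 100_000)
            integral = np.trapz(np.sqrt(2.0 * R / (n * eps**2)), eps)
            return 4.0 * alpha + 12.0 * integral

        n, R = 10_000, 5.0
        alpha = 1.0 / np.sqrt(n)
        print(improved_dudley_case_c(R, n, alpha))
        # Closed form of the same quantity: 4*alpha + 12*sqrt(2R/n)*log(eps_max/alpha) = O(sqrt(R/n) log n).
        print(4.0 * alpha + 12.0 * np.sqrt(2.0 * R / n) * np.log(1.0 / alpha))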

  • @friendlyskiespodcast · 16 hours ago

    THANK YOU! this is what I've been looking for.

  • @zhiyuanli6072 · 21 hours ago

    Great duality lecture!

  • @chenhari3066 · 22 hours ago

    This guy is not rigorous.

  • @DemianUsul · 23 hours ago

    what a cliffhanger!

  • @quishzhu · a day ago

    thank you!

  • @SebastianRaschka · a day ago

    This talk and the slides are gold! Love the whirlwind tour of the BC (Before ChatGPT) and AD (After Deployment of ChatGPT). How fast time went by ... I barely remember Vicuna, and it was just a year ago :D

  • @user-kh7sx2vz2j · a day ago

    I actually used the statistical methods to extract the correct spelling of the function for linear discriminant analysis. Spoiler: it is "lda". Great book, btw.

  • @miguelmassiris9091 · a day ago

    Thanks, great explanation!

  • @rudraprasaddash3809 · a day ago

    @27:40 Chris forgets being a professor and jumps straight to being Chainsaw Man 😂😂

  • @zfarahx · a day ago

    Section three exists twice. RIP

  • @joeybasile1572 · a day ago

    Awesome

  • @CPTSMONSTER · 2 days ago

    26:30 Parameters / time complexity without reuse
    34:20? Invert x
    35:15 Parameters / time complexity with reuse of the weights w
    43:40? A Bayesian network probability table trained on infinite data would, in principle, be able to capture any relationship
    44:00? Difference between the lhs and the rhs
    49:00 Parameterise a continuous rv with K Gaussians
    53:40 Autoencoders as non-linear PCA (unsupervised)
    56:30 Enforce an ordering so that autoencoders can generate samples
    1:01:30? Mask weights to get the parameters in one pass
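
    A minimal NumPy sketch of the 56:30 / 1:01:30 idea (not the lecture's code; the toy sizes are made up): a strictly lower-triangular mask on the weights enforces an ordering, so output i depends only on x_<i and all conditional parameters come out of a single pass.

        import numpy as np

        rng = np.random.default_rng(0)
        d = 5
        W = rng.standard_normal((d, d))
        mask = np.tril(np.ones((d, d)), k=-1)    # output i may only see inputs j < i

        x = rng.integers(0, 2, size=d).astype(float)
        logits = (W * mask) @ x                  # one masked matrix multiply = one pass
        probs = 1.0 / (1.0 + np.exp(-logits))    # e.g. Bernoulli parameter for each p(x_i | x_<i)
        print(probs)                             # probs[0] uses no inputs; probs[i] depends only on x[:i]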

  • @grumio3863 · 2 days ago

    🤨

  • @VICTORHUGO-ll8je · 2 days ago

    I am grateful to Stanford University and hope these open courses stay up to date and keep coming.

  • @lenscraft921 · 2 days ago

    For cyber security?

  • @Nardosmelaku-ph6sb · 2 days ago

    BRILLIANT TEACHER

  • @ivaninkorea · 2 days ago

    Thank you, Chris <3 What an amazing course, made amazing by the amazing professor

  • @asma_shakeel · 3 days ago

    Where can I get the psets?

  • @ariG23498 · 3 days ago

    He has his slides in his head! Loved the content.

  • @ferencszalma7094 · 3 days ago

    0:01:10 Classical machine learning (ML) theory
    ➀ Approximation theory / expressivity / representation power: best model in the class; bound L(θ*) = min_{θ∈Θ} L(θ), but there may be a better class of functions... Approximation theory = why the hypothesis class is expressive enough to contain the hypothesis function of concern
    ➁ Generalization theory / statistical aspect: excess risk is the central notion (last few weeks of lectures); bound L(θ̂)-L(θ*) [excess risk] ≤ L(θ̂)-L̂(θ̂) [generalization loss] + |L(θ*)-L̂(θ*)| [uninteresting term, always ~1/√n]; L(θ̂)-L̂(θ̂) ≤ √(complexity/n); Occam's razor: the simplest, most parsimonious explanation is the best explanation; bias-variance trade-off. With L̂_reg(θ) = L̂(θ) + λR(θ), R the regularizer, the statistical claim is: if θ̂_λ is the global min of L̂_reg, then L(θ̂)-L(θ*) or L(θ̂)-L̂(θ̂) are bounded
    ➂ Optimization: how to numerically find θ̂ = argmin_{θ∈Θ} L̂(θ) or L̂_reg(θ); convex optimization / gradient descent (GD) / stochastic gradient descent (SGD)
    0:14:55 In deep learning (DL) theory two fundamental things change: ➀ non-linear models -> non-convex loss; ➁ overparametrized models: more parameters generally help -> even p >> n is good, even after you reach zero training error -> mystery
    0:19:45 ➀➁➂ become intertwined in DL. ➀ Approximation theory doesn't change much in DL: large models are expressive (universal approximation; if models need to be exponentially large, they are not really implementable); min_{θ∈Θ} L̂(θ) gives small or zero training loss, perfectly memorizing the training data with > n neurons, though it may not generalize. ➁ Generalization: only weak regularization is used. ➂ Optimization: not only about finding any global minimizer (they might have different test performance); a practical optimizer should find a θ̂ that ➀ is a global minimum of L̂(θ) and ➁ has a special property, e.g. low complexity -> generalizes well; SGD with certain properties, e.g. batch sizes
    0:37:50 Tasks: Task 1 (optimization): prove that the optimizer converges to an approximate global/local min of L̂(θ). Task 2 (regularization): the minimizer θ̂ from Task 1 has low enough complexity to generalize well. Task 3 (generalization bound): ∀θ of small enough complexity with L̂(θ)≈0, the test error L(θ) is also small
    0:56:45 Generalization bounds for neural networks. Setup for two-layer networks: θ=(w,U), U∈ℝᵐˣᵈ, fᶿ(x)=wᵀφ(Ux) ∈ ℝ, φ elementwise ReLU. Goals: ➀ show Rademacher complexity bounds ➁ see how useful these bounds are in practice
    0:59:50 Rademacher complexity bound theorem for 2-layer NN #1
    1:12:50 Proof: ~remove the sup from the definition of RC
    1:21:23 Lipschitz composition (Talagrand lemma) on the ReLU φ
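
    As a companion to the Rademacher-complexity material above, a minimal Monte Carlo sketch of the empirical Rademacher complexity R̂ₛ(𝓕) = E_σ[sup_{f∈𝓕} (1/n) Σᵢ σᵢ f(zᵢ)] for a small finite class (an illustration with made-up toy data, not the two-layer-NN bound itself):

        import numpy as np

        def empirical_rademacher(preds, n_draws=2000, seed=0):
            # preds: array of shape (|F|, n) holding f(z_1..z_n) for each f in a finite class F.
            rng = np.random.default_rng(seed)
            n = preds.shape[1]
            sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
            correlations = sigmas @ preds.T / n                   # (n_draws, |F|)
            return float(np.mean(np.max(correlations, axis=1)))   # average of the sup over F

        rng = np.random.default_rng(1)
        preds = np.sign(rng.standard_normal((3, 50)))             # 3 "functions" on n = 50 points
        print(empirical_rademacher(preds))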

  • @Roshan-tb3iz · 3 days ago

    IIT JEE 1984 top ten ranker. Gold medalist from IIT Kanpur, batch of 1988.

  • @nerouchih3529 · 3 days ago

    28:00 A unique view of attention. In this image, all 6 nodes are related to all 6 nodes in the self-attention case, and in cross-attention it would be like set A sending messages to the nodes in set B. And voilà, it's a fully connected layer, but with tokens passed instead of values.
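
    A minimal NumPy sketch of that view (toy sizes made up, not the lecture's code): scaled dot-product attention is all-pairs message passing over a fully connected graph; in cross-attention the nodes of one set send messages to the nodes of the other.

        import numpy as np

        def attention(Q, K, V):
            # Each query node receives a softmax-weighted mix ("message") of all value vectors.
            scores = Q @ K.T / np.sqrt(Q.shape[-1])          # all-pairs affinities (fully connected)
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            return w @ V

        rng = np.random.default_rng(0)
        A = rng.standard_normal((6, 8))    # 6 tokens in set A
        B = rng.standard_normal((6, 8))    # 6 tokens in set B
        print(attention(A, A, A).shape)    # self-attention: every node talks to every node in A
        print(attention(B, A, A).shape)    # cross-attention: set A sends messages to the nodes in set B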

  • @Lalala_1701 · 3 days ago

    Andrew Ng also used the same kind of example to explain LMs.

  • @Lee-zo3dy · 3 days ago

    I think at 37:10 the professor did not make the probability = 0 case quite clear. The student confused probability with possibility: it is totally fine for an event A with P(A) = 0 to still occur (e.g., a continuous random variable always takes some exact value, yet each exact value has probability 0). Am I right?

  • @DanBillings · 3 days ago

    Please put the subject of the talk in the title. You can then market the OpenAI speakers

  • @ChidinmaOnyeri · 3 days ago

    Hi. Can anyone recommend a textbook that would help in further study of this course? Thank you.

  • @heyitsjoshd · 4 days ago

    How do we know what counts as small vs. large? For example, with emergent tasks, the talk highlights that more data could lead to more accuracy given enough compute. The small LM would not have seen accuracy improvements, but the large LM did. For the tasks currently shown as flat, couldn't it be that we just don't have enough compute yet to know whether those tasks would get more accurate?

  • @rucellegarciano4105 · 4 days ago

    I could be wrong... But as I understand what Mr. Lamport is saying... This is just digital design... Combinational... Sequential circuits... I could also be wrong but... Clocks are more of combinational circuits... On the other hand, sequential circuits have clock circuits in them... 🤷

  • @zacharykosove9048 · 4 days ago

    The students were asking some great questions, no wonder I don't go to Stanford

  • @roro5179 · 2 hours ago

    I'm the dude at the end (don't go to Stanford xd)

  • @dodowoh3683 · 4 days ago

    Surprised by the amount of hair an AI scholar may have retained.

  • @AbdeeAwol · 4 days ago

    100x😊

  • @hajerjm · 4 days ago

    Thank youuuu

  • @brashcrab · 4 days ago

    7217 1:07

  • @yuxingben399 · 4 days ago

    Great introduction to deep generative models!

  • @arpitkumar592 · 4 days ago

    In the poker question, the probability for A' should be over 42 options, right? Since one of the 7 cards already on the table is the ace of clubs?

  • @Lee-zo3dy · 3 days ago

    We have 52 cards in total.

  • @ahmad1239112 · 4 days ago

    Thanks for sharing this

  • @annalvarez3247 · 5 days ago

    This video is crazy 🔥🔥 It's interesting to see how much DeFi has grown, and it's great to be able to watch this seminar through KZread. It was an interesting chat to listen to. DeFi is growing globally; it's not just in the US now, it's in other countries too (keep in mind I am commenting from outside the U.S.). Crazy stuff, excellent content 👍🏻

  • @user-wr4yl7tx3w · 5 days ago

    Shouldn't there be a separate KZread channel for AI from Stanford?

  • @hedu5303 · 5 days ago

    Strange world. This dude is almost a kid and gives a lecture

  • 3 days ago

    I am happy to learn from any kid :)

  • @chaidaro · a day ago

    His intuition is older than me

  • @joebobthe13th · 5 days ago

    Love the section on "kale divergence"! Thanks KZread auto-captioning! 😂

  • @chongsun7872 · 5 days ago

    Great lecture, but at times a bit faster-paced than Christopher Manning.

  • @erichsiung9704 · 5 days ago

    I am addicted to Prof. Jure's accent now😂!!!

  • @nivcohen5371 · 5 days ago

    Very clear and interesting lecture