The Principle of Maximum Entropy

The machine learning consultancy: truetheta.io
Want to work together? See here: truetheta.io/about/#want-to-w...
What's the safest distribution to pick in the absence of information? What about when you have some, though only partial, information? The Principle of Maximum Entropy answers these questions well and, as a result, is a frequent guiding rule for selecting distributions in the wild.
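
For a quick feel of the claim, here is a minimal sketch (not from the video; the ten-digit setup and the skewed example are arbitrary choices): among distributions over ten digits, the uniform one attains the largest entropy, and any added structure lowers it.

import numpy as np

def entropy(p):
    """Shannon entropy in nats: -sum p log p, with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

uniform = np.full(10, 0.1)              # no information: all ten digits equally likely
skewed = np.array([0.55] + [0.05] * 9)  # extra structure: one digit favored
print(entropy(uniform))  # log(10) ~ 2.303, the maximum possible
print(entropy(skewed))   # strictly smaller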
SOCIAL MEDIA
LinkedIn : / dj-rich-90b91753
Twitter : / duanejrich
Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
Sources
Chapters 11-12 of [2] were the primary sources - this is where I ironed out most of my intuition on this subject. Chapter 12 of [1] was helpful for understanding the relationship between the maximum entropy criterion and the form of the distribution that meets it. [3] was useful for a high-level perspective and [4] was helpful for determining the list of maximum entropy distributions.
Also, thank you to Dr. Hanspeter Schmid of the University of Applied Sciences and Arts, Northwestern Switzerland. He helped me interpret some of the more technical details of [2] and prevented me from attaching an incorrect intuition to the continuous case - much appreciated!
[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. 2nd edition. John Wiley, 2006.
[2] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[3] Principle of Maximum Entropy, Wikipedia, en.wikipedia.org/wiki/Princip...
[4] Maximum Entropy Distribution, Wikipedia, en.wikipedia.org/wiki/Maximum...
Timestamps :
0:00 Intro
00:41 Guessing a Distribution and Maximum Entropy
04:16 Adding Information
06:40 An Example
08:00 The Continuous Case
10:26 The Shaky Continuous Foundation

Comments: 109

  • @pedrobianchi1929 (2 years ago)

    The principles explained here appear everywhere: thermodynamics, machine learning, information theory. Very fundamental.

  • @maniam5460 (2 years ago)

    You know that feeling when you find a criminally overlooked channel and you’re about to get in on the ground level of something that’s gonna blow up? This is you now

  • @Mutual_Information (2 years ago)

    That is quite nice of you - thank you! I hope you're right, but for now, I'm working on my patience. It can take quite a while to get noticed on YouTube. I'm trying to keep my expectations realistic.

  • @ilyboc (2 years ago)

    It blew my mind that those famous distributions come naturally as the ones that give maximum entropy when we set the domain and constraints in a general way. Now I kind of know why they are special.

  • @MP-if2kf (2 years ago)

    Definitely very cool. In many cases there are also other fascinating characterizations. For example: assume a continuous distribution on positive support with the no-memory property --> solve the resulting differential equation --> find that it MUST be the exponential.
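
    A sketch of that characterization (a standard argument, not from the video), writing G(t) = P(X > t) for the survival function:

    \[
    P(X > s + t \mid X > s) = P(X > t) \;\;\forall\, s, t \ge 0
    \;\Longrightarrow\; G(s + t) = G(s)\,G(t),
    \]

    and the only right-continuous solutions with G(0) = 1 that decay to zero are G(t) = e^{-\lambda t} for some \lambda > 0, i.e. the exponential distribution.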

  • @kaishang6406 (10 months ago)

    Just in the recent 3b1b video about the normal distribution, there is a mention that the normal distribution maximizes entropy. Then immediately I saw it here in your video, displaying the normal distribution as the one that maximizes entropy while constraining the mean and variance, which are the normal distribution's only two parameters. That is very nice.

  • @arongil (10 months ago)

    I can't get over how much fun you make learning about stats, ML, and information theory---not to mention that you teach it with skill like Feynman's and a style that is all your own.

  • @Mutual_Information (10 months ago)

    That's quite a compliment - Feynman is a total inspiration for many, myself included. His energy about the topics makes you *want* to learn about them.

  • @equanimity26 (1 year ago)

    An amazing video. Proving once again why the internet is a blessing to humanity.

  • @NowInAus (1 year ago)

    Really stimulating. Your last example looked to be heading towards information in the variables. Got me hooked

  • @HarrysKavan (1 year ago)

    I didn't expect much, but I wasn't disappointed. What a great video. I wish you lots more followers!

  • @Mutual_Information (1 year ago)

    Thank you so much! More to come :)

  • @alec-lewiswang5213 (1 year ago)

    I found this video very helpful! Thanks for making it! The animated visuals especially are great :)

  • @dobb2106 (2 years ago)

    I’m glad I clicked on your comment, this channel is very well presented and I look forward to your future content.

  • @nandanshettigar873 (1 year ago)

    Great video, love the level of complexity and fundamentals. I feel this just gave me some fresh inspiration for my research.

  • @mattiascardecchia799 (2 years ago)

    Brilliant explanation!

  • @murilopalomosebilla2999 (2 years ago)

    The quality of your content is amazing!

  • @Mutual_Information (2 years ago)

    Thanks a lot! I try 😅

  • @antoinestevan5310 (2 years ago)

    I do not think I would produce any interesting analysis today. I simply... appreciated it a lot! :-)

  • @NoNTr1v1aL (2 years ago)

    Amazing video!

  • @sivanschwartz3813 (2 years ago)

    Thanks a lot for this great and informative video!! One of the best explanations I have come across.

  • @Mutual_Information (2 years ago)

    Thanks! Glad you enjoyed it, more to come

  • @tylernardone3788 (2 years ago)

    Great video! Great channel! I'm working my way through that Jaynes book [2] and absolutely love it.

  • @Mutual_Information (2 years ago)

    That is a heroic move! He has some wild insights on probability theory. Guy was a complete beast.

  • @ckq (11 months ago)

    This channel is so amazing. I had a fuzzy understanding of a lot of these concepts, but this clarifies it. For example, my intuition suggested that for a given mean and variance, the maximum entropy estimate would be a beta-binomial distribution, but I wasn't really able to prove it to myself. 7:00

  • @Mutual_Information (11 months ago)

    Glad this is helping!

  • @nerdsofgotham (2 years ago)

    Been 20 years since I last did information theory. This seems closely related to the asymptotic equipartition principle. Excellent video.

  • @Mutual_Information (2 years ago)

    Oh I’m sure they’re related in some mysterious and deep way I don’t yet understand, if only because that’s a big topic in source [1] :)

  • @garvitprashar3671 (2 years ago)

    I have a feeling you will become famous someday because the video quality is really good...

  • @arnold-pdev (1 year ago)

    Great video!

  • @sirelegant2002 (4 months ago)

    Incredible lecture, thank you so much

  • @outtaspacetime (1 year ago)

    This one saved my life!

  • @kenzilamberto1981 (2 years ago)

    your video is easy to understand, I like it

  • @mCoding (2 years ago)

    Another fantastic video! I would love to improve my knowledge about the Jeffreys prior for a parameter space.

  • @Mutual_Information (2 years ago)

    Thank you! Always means a lot. And yea, now that I've covered the Fisher Information, I can hit that one soon. Appreciate the suggestion - it's on the list!

  • @derickd6150 (1 year ago)

    @@Mutual_Information I do believe (and I hope this is the case) that we are going through a boom in science channels right now. It seems that the youtube algorithm is identifying that sub-audience that loves this content and recommending these types of channels to them sooner and sooner. So I really hope it happens to you very soon!

  • @YiqianWu-dh8nr (23 days ago)

    I roughly followed the overall line of reasoning - using plenty of plain, easy-to-understand descriptions in place of a lot of complicated mathematical formulas let me at least understand the principle behind it. Thank you!

  • @praveenfuntoo (2 years ago)

    I was able to apply this equation in my work. Thanks for making it plausible.

  • @albertoderfisch1580 (2 years ago)

    woah this is such a good explanation. I just randomly discovered this channel but I'm sure it's bound to blow up. Just a bit of critique: Idk if this is only meant for college students but if you want to get a slightly broader audience you could focus a bit more on giving intuition for the concepts.

  • @Mutual_Information (2 years ago)

    Thank you very much! Yea the level of technical background I expect of the audience is an important question. I’m partial to keeping it technical. I think it’s OK not to appeal to everyone. My audience will just be technical and small :)

  • @zorro77777 (2 years ago)

    @Mutual_Information ++ with Alberto le Fisch: Prof. Feynman - "If you cannot explain something in simple terms, you don't understand it." And I am sure you understand, so please explain it to us! :)

  • @diegofcm6201 (1 year ago)

    His channel already puts lots of effort into doing exactly that. The final bit of the video, explaining the FUNDAMENTAL difference between the discrete and continuous cases - the breakage of "label invariance" - just blew my mind. Seriously some of the best intuition I've ever received about something.

  • @mino99m14 (1 year ago)

    @zorro77777 Well, he also said something like "If I could summarise my work in a sentence, it wouldn't be worth a Nobel Prize." Which means that although something can be simplified, this doesn't mean it won't take a long time to explain. It's not that easy, and he is not your employee, you know. Also, in that quote he meant being able to explain to physics undergrads, whom you would expect to have some knowledge already.

  • @kylebowles9820 (2 months ago)

    I think it would be up to the parametrization to care about area or side length depending on the problem case in that example. I'd like my tools to do their own distilled thing in small, predictable, usable pieces.

  • @TuemmlerTanne11 (2 years ago)

    Btw I don't know if Mutual Information is a good channel name. The term is pretty stacked and I can't just say "do you know Mutual Information" like I can say "do you know 3blue1brown"... It also makes it harder to find your channel, because if someone looks up mutual information on youtube you won't show up at the top. Maybe that's your strategy though, to have people find your channel when they search for Mutual Information on youtube ;) Anyways, I'm sure you have thought about this, but that's my take.

  • @Mutual_Information (2 years ago)

    I fear you may be correct. I've heard a few people say they tried to find my channel but couldn't when they searched. But part of me thinks I've gone too far. There's actually quite a bit of work I'd have to do to make a title change, and if the cost is my channel is a little bit more hidden, I think that's OK. Weirdly, I'm kinda enjoying the small channel stage (being a bit presumptuous that I'll eventually not be in this stage :) ). It's less pressure, gives me time to really nail the feedback and it's easier to have 2-way communication with the viewers. Don't get me wrong, I'd like to grow the channel, but I'm OK with leaving some growth hacks on the table. That said, I'm not totally set on "Mutual Information." I'd like to feel it out a bit more. As always, appreciate the feedback!

  • @dermitdembrot3091 (2 years ago)

    Your videos are great! I am curious about the connection between maximum entropy and Bayesian inference. They seem related. Let's think about Bayesian inference in a variational way, where you minimize a KL divergence between the approximate and true posterior, KL(q(z)||p(z|x)), where z is e.g. the vector of all our unknown digits and x the digit mean. Minimizing this KL divergence is equivalent to maximizing the sum of (1) H(q(z)), an entropy maximization objective, (2) -CE(q(z),p(z)), a negative cross-entropy term with the prior distribution p(z), which is constant for a uniform prior, and (3) E_q(z) log p(x|z), a likelihood term that produces constraints; in our digits case p(x|z) is deterministically 1 if the condition x=mean(z) is fulfilled and 0 otherwise. All z with log p(x|z) = log 0 = -infty must be given a probability of 0 by q(z) to avoid the objective dropping to negative infinity. On the other hand, once this constraint is fulfilled, all remaining choices of q attain E_q(z) log p(x|z) = E_q(z) log 1 = 0, so the entropy term gets to decide among them.
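
    For reference, the decomposition being used above is (standard variational-inference algebra, not from the video):

    \[
    \mathrm{KL}\big(q(z)\,\|\,p(z\mid x)\big)
    = -H(q) + \mathrm{CE}\big(q(z), p(z)\big) - \mathbb{E}_{q(z)}\big[\log p(x\mid z)\big] + \log p(x),
    \]

    so, since log p(x) does not depend on q, minimizing the KL is the same as maximizing H(q) - CE(q, p) + E_q log p(x|z), which is the three-term objective described in the comment.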

  • @dermitdembrot3091 (2 years ago)

    Further, if we choose the digits to be i.i.d. ~ q(z_1) (z_1 being the first digit), then as the number of digits N goes to infinity, the empirical mean, mean(z), will converge almost surely to the mean of q(z_1), so in the limit we can put the constraint on the mean of q(z_1) instead of the empirical mean, as done by the maximum entropy principle. Digits being i.i.d. should be an unproblematic restriction due to symmetry (and due to entropy maximization).

  • @Mutual_Information (2 years ago)

    Wow, yes, you dove right into a big topic. Variational inference is a big way we get around some of the intractability naive Bayesian stats can yield. You seem to know that well - thanks for all the details.

  • @NoNTr1v1aL (2 years ago)

    When will you make a video on Mutual Information to honor your channel's name?

  • @Mutual_Information (2 years ago)

    Haha it's coming! But I got a few things in queue ahead of it :)

  • @regrefree (2 years ago)

    Very informative. I had to stop and go back many times because you are speaking and explaining things very fast :-p

  • @Mutual_Information (2 years ago)

    I’ve gotten this feedback a few times now. I’ll be working on it for the next vids, though I still talk fast on the vids I’ve already shot.

  • @ckq (11 months ago)

    12:45, I think taking the log would be useful in the squares scenario since then the squaring would become a linear transformation rather than non-linear

  • @johnbicknell8424 (2 years ago)

    Great video. Are you aware of a way to represent the entropy as a single number, not a distribution? Thanks!

  • @Mutual_Information (2 years ago)

    Thanks! And to answer your question, the entropy *is* a single number which measures a distribution.

  • @kristoferkrus (6 months ago)

    Hm, I tried to use this method to find the maximum entropy distribution when you know all of the first three moments of the distribution, that is, the mean, the variance and the skewness, but I end up with an expression that either leads to a distribution completely without skewness or one with a PDF that goes to infinity, either as x approaches infinity or as x approaches minus infinity (I have an x^3 term in the exponent), and which therefore can't be normalized. Is that a case this method doesn't work for? Is there some other way to find the maximum entropy distribution when you know all of the first three moments in that case?

  • @kristoferkrus (6 months ago)

    Okay, I think I found the answer to my question. According to Wikipedia, this method works for the continuous case if the support is a closed subset S of the real numbers (which I guess means that S has a minimum and a maximum value?), and it doesn't mention the case where S = R. But presume that S is the interval [-a, +a], where a is very large; then this method works. And I realized that the solution you get when you use this method is a distribution that is very similar to a normal distribution, except for a tiny increase in density just by one of the two endpoints to make the distribution skewed, which is not really the type of distribution I imagined. I believe the reason this doesn't work if S = R is because there is no maximum entropy distribution that satisfies those constraints, in the sense that if you have a distribution that does satisfy those constraints, you can always find another distribution that also satisfies the constraints but has higher entropy. Similarly, if you let S = [-a, a] again, you can use this method to find a solution, but if you let a → ∞, the limit of the solution you get by using this method is a normal distribution. But as you let a → ∞, the kurtosis of the solution will also approach infinity, which may be undesired. So if you want to prevent that, you may also constrain the kurtosis, maybe by putting an upper limit on it or by choosing it to take on a specific value. When you do this, all of a sudden the method works again for S = R.
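
    For what it's worth, here is a minimal numerical sketch of that bounded-support case (not from the video; the grid, targets and optimizer settings are arbitrary choices): entropy is maximized over probabilities on a finite grid subject to mean, variance and skewness constraints, the setting in which a maximizer exists.

    import numpy as np
    from scipy.optimize import minimize

    x = np.linspace(-5, 5, 81)  # bounded support, standing in for [-a, a]
    target_mean, target_var, target_skew = 0.0, 1.0, 0.5

    def neg_entropy(p):
        p = np.clip(p, 1e-12, None)
        return np.sum(p * np.log(p))  # minimizing this maximizes the entropy

    def moments(p):
        m = np.sum(p * x)
        v = np.sum(p * (x - m) ** 2)
        s = np.sum(p * (x - m) ** 3) / v ** 1.5
        return m, v, s

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1},
        {"type": "eq", "fun": lambda p: moments(p)[0] - target_mean},
        {"type": "eq", "fun": lambda p: moments(p)[1] - target_var},
        {"type": "eq", "fun": lambda p: moments(p)[2] - target_skew},
    ]

    p0 = np.full(x.size, 1 / x.size)  # start from the uniform distribution
    res = minimize(neg_entropy, p0, method="SLSQP", bounds=[(0, 1)] * x.size,
                   constraints=constraints, options={"maxiter": 500})
    print(res.success, moments(res.x))  # check the moment constraints are met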

  • @alixpetit2285 (2 years ago)

    Nice video, do you think that set shaping theory can change the approach to information theory?

  • @Mutual_Information (2 years ago)

    I don't know anything about set shaping theory, so.. maybe! Whatever it is, I think it could only *extend* information theory. I believe the core of information theory is very much settled.

  • @informationtheoryvideodata2126 (2 years ago)

    Set shaping theory is a new theory, but the results are incredible, it can really change information theory.

  • @SuperGanga2010 (1 year ago)

    Is the shaky continuous foundation related to the Bertrand paradox?

  • @Mutual_Information (1 year ago)

    I am not aware of that connection. When researching it, I just discovered that these ideas weren't intended for the continuous domain. People extended them into the continuous domain, but then certain properties were lost.

  • @boar6615 (7 months ago)

    Thank you so much! the graphs were especially helpful, and the concise language helped me finally understand this concept better

  • @Mutual_Information (7 months ago)

    Exactly what I'm trying to do

  • @kabolat (1 year ago)

    Great video! Thanks a lot. A little feedback: The example you give in 12:00-13:00 is a bit hard to follow without visualization. The blackboard and simulations you use are very helpful in general. It would be great if you do not leave that section still and only talk. Even some bullet points would be nice.

  • @Mutual_Information (1 year ago)

    Thanks - useful, specific feedback is in short supply, so this is very much appreciated. I count yours as a "keep things motivated and visual"-type of feedback, which is something I'm actively working on (but not always great about). Anyway, it's a work in progress and hopefully you'll see the differences in upcoming videos. Thanks again!

  • @sukursukur3617 (2 years ago)

    Imagine you have a raw set. You want to build a histogram. You don't know the bin range, the bin start and end locations, or the number of bins. Can an ideal histogram be built by using the max entropy law?

  • @Mutual_Information (2 years ago)

    I've heard about this and I've actually seen it used as an effective feature-engineering preprocessing step in a serious production model. Unfortunately, I looked and couldn't find the exact method and I forget the details. But there seems to be a good amount of material on the internet for "entropy based discretization." I'd give those a look
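
    Not necessarily the method referenced above, but one simple entropy-motivated recipe is equal-frequency binning: for a fixed number of bins, equal occupancy maximizes the entropy of the bin-count distribution. A minimal sketch (the lognormal data is just an example):

    import numpy as np

    def equal_frequency_edges(data, n_bins):
        """Bin edges chosen so each bin holds roughly the same number of points."""
        return np.quantile(data, np.linspace(0, 1, n_bins + 1))

    rng = np.random.default_rng(0)
    data = rng.lognormal(size=10_000)  # skewed example data
    edges = equal_frequency_edges(data, n_bins=10)
    counts, _ = np.histogram(data, bins=edges)
    probs = counts / counts.sum()
    print(-np.sum(probs * np.log(probs)))  # close to log(10), the maximum possible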

  • @MP-if2kf (2 years ago)

    One thing is bothering me... The justification of using entropy seems circular. In the first case, where no information is added, we are implicitly assuming that the distribution of the digits is discrete uniform, because we are choosing the distribution based on the number of possible sequences corresponding to a distribution. This is only valid if any sequence is just as likely. But this is only true if we assume the distribution is uniform. Things are a bit more interesting when we add the moment conditions. I guess what we are doing is conditioning on distributions satisfying the moment conditions, and choosing among these the distribution with the most possible sequences. We seem to be using a uniform prior (distribution for the data), in essence. My question is: why would this be a good idea? What actually is the justification of using entropy? Which right now in my mind is: why should we be using the prior assumption that the distribution is uniform when we want to choose a 'most likely' distribution? Don't feel obliged to respond to my rambling. Just wanted to write it down. Thank you for your video!

  • @Mutual_Information (2 years ago)

    lol, doesn't sound like rambling to me. I see your point about it being circular. But I don't think that's the case in fact. Let's say it wasn't uniformly distributed.. maybe odd numbers are more likely. Now make a table of all sequences and their respective probabilities. Still, you'll find that sequences with uniform counts have a relative advantage.. it may not be as strong, depending on what the actual distribution is.. but the effect of "there are more sequences with nearly even counts" is always there.. even if the distribution of each digit isn't uniform. It's that effect we lean on.. and in the absence of assuming anything about the digit distribution.. that leads you to the uniform distribution. In other words, the uniform distribution is a consequence, not an assumption.
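
    A quick way to see that counting effect concretely (a sketch, not from the video): the number of length-N digit strings with a given count vector is the multinomial coefficient, and near-uniform count vectors dominate.

    from math import factorial

    def num_strings(counts):
        """Number of digit strings with the given per-digit counts:
        the multinomial coefficient N! / (n_0! * n_1! * ... * n_9!)."""
        total = factorial(sum(counts))
        for c in counts:
            total //= factorial(c)
        return total

    # Length-20 strings over the digits 0-9.
    uniform = [2] * 10       # every digit appears twice
    skewed = [11] + [1] * 9  # one digit dominates
    print(num_strings(uniform))  # ~2.4e15 strings
    print(num_strings(skewed))   # ~6.1e10 strings, vastly fewer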

  • @MP-if2kf (2 years ago)

    @@Mutual_Information I have to think about it a bit more. In any case, thank you for your careful reply! Really appreciate it.

  • @Septumsempra8818 (1 year ago)

    WOW!

  • @MP-if2kf (2 years ago)

    Cool video! You lost me at the lambdas though... They are chosen to meet the equations... what do they solve exactly?

  • @MP-if2kf (2 years ago)

    Are they the Lagrange multipliers?

  • @MP-if2kf (2 years ago)

    I guess I get it, the lambda is just chosen to get the maximal entropy distribution given the moment condition...
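
    A minimal numerical sketch of the role the lambdas play (not from the video; the target mean of 3.0 is an arbitrary example): with a mean constraint over the digits 0-9, the max entropy form is p(d) proportional to exp(-lambda*d), and lambda is the number you tune until the constraint actually holds.

    import numpy as np
    from scipy.optimize import brentq

    digits = np.arange(10)

    def maxent_probs(lam):
        """p(d) proportional to exp(-lam * d): the max entropy form under a mean constraint."""
        w = np.exp(-lam * digits)
        return w / w.sum()

    def mean_gap(lam, target):
        return maxent_probs(lam) @ digits - target

    # lambda is the Lagrange multiplier: tuned so the mean constraint is satisfied.
    lam = brentq(mean_gap, -5, 5, args=(3.0,))  # constrain the mean to 3 instead of 4.5
    p = maxent_probs(lam)
    print(lam, p, p @ digits)  # lambda > 0, probabilities decay with d, mean = 3.0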

  • @MP-if2kf (2 years ago)

    Amazing video, I will have to revisit it some times though

  • @MP-if2kf (2 years ago)

    only didn't understand the invariance bit...

  • @Mutual_Information (2 years ago)

    The invariance bit is something that I really didn't explore well. It's something I only realized while I was researching the video. The way I would think about it is.. the motivating argument for max entropy doesn't apply over the continuous domain b/c you can't enumerate "all possible sequences of random samples".. so if you use the max entropy approach in the continuous domain anyway.. you are doing something which imports a hidden assumption you don't realize. Something like.. minimize the KL-divergence from some reference distribution.. idk.. something weird. As you can tell, I think it's OK to not understand the invariance bit :)

  • @abdjahdoiahdoai (2 years ago)

    do you plan to make a video on expectation maximization? lol, funny you put an information theory textbook on the desk for this video

  • @Mutual_Information (2 years ago)

    Glad you noticed :) Yes EM is on the list! I have a few things in front of it but it's definitely coming.

  • @abdjahdoiahdoai (2 years ago)

    @@Mutual_Information nice

  • @Gggggggggg1545.7 (2 years ago)

    Another great video. My only comment would be to slow down slightly to give more time to digest the words and graphics.

  • @Mutual_Information (2 years ago)

    Thank you and I appreciate the feedback. I’ve already shot 2 more vids so I won’t be rolling into those, but I will for the one I’m writing right now. Also working on avoiding the uninteresting details that don’t add to the big picture.

  • @janerikbellingrath820 (1 year ago)

    nice

  • @desir.ivanova2625 (2 years ago)

    Nice video! I think there's an error in your list at 10:42 - the Cauchy distribution is not in the exponential family.

  • @Mutual_Information (2 years ago)

    Thank you! Looking into it, I don't believe it's an error. I'm not claiming here that these are within the exponential family. I'm saying these are max entropy distributions under certain constraints, which is a different set. You can see the Cauchy distribution listed here: en.wikipedia.org/wiki/Maximum_entropy_probability_distribution But thank you for keeping an eye out for errors. They are inevitable, but extra eyes are my best chance at a good defense against them.

  • @desir.ivanova2625 (2 years ago)

    @@Mutual_Information Thanks for your quick reply! And thanks for the link - I can see that indeed there's a constraint (albeit a very strange one) for which Cauchy is the max entropy distribution. I guess then, I was confused by the examples in the table + those that you then list -- all distributions were exponential family and Cauchy was the odd one out. Also, please correct me if I'm wrong, but I think if you do moment matching for the mean (i.e. you look at all possible distributions that realise a mean parameter \mu), then the max entropy distribution is an exponential family one. And the table was doing exactly that. Now, we can't do moment matching for the Cauchy distribution as none of its moments are defined. So that was the second reason for my confusion.

  • @Mutual_Information (2 years ago)

    Thanks, that makes a lot of sense. To be honest, I don't understand the max entropy exponential family connection all that well. There seem to be these bizarre distributions that are max entropy but aren't exponential family. I'm not sure why they're there, so I join you in your confusion!

  • @kristoferkrus (1 year ago)

    What do you mean that the entropy only depends on the variable's probabilities and not its values? You also said that the variance does depend on its values, but I don't see why the variance would while the entropy would not. You say that you can define the entropy as a measure of a bar graph, but so can the variance.

  • @Mutual_Information (1 year ago)

    entropy = - sum p(x) log p(x).. notice only p(x) appears in the equation - you never see just "x" in that expression. For (discrete) variance.. it's sum of p(x)(x-E[x])^2.. notice x does appear on its own. When I say the bar graph, I'm only referring to the vertical heights of the bars (which are the p(x)'s).. you can use just that set of numbers to compute the entropy. For the variance, you'd need to know something in addition to those probabilities (the values those probabilities correspond to).
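
    A tiny numerical illustration of that point (a sketch, not from the video): shuffling which values the bars sit over leaves the entropy unchanged, because entropy only sees the bar heights, while the variance changes.

    import numpy as np

    values = np.array([0.0, 1.0, 2.0, 3.0])
    probs = np.array([0.1, 0.2, 0.3, 0.4])

    def entropy(p):
        return -np.sum(p * np.log(p))  # uses only the probabilities

    def variance(x, p):
        mean = np.sum(p * x)
        return np.sum(p * (x - mean) ** 2)  # needs the values too

    shuffled = probs[[3, 0, 2, 1]]  # same bar heights, assigned to different values
    print(entropy(probs), entropy(shuffled))                    # identical
    print(variance(values, probs), variance(values, shuffled))  # different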

  • @kristoferkrus (1 year ago)

    @@Mutual_Information Ah, I see! I don't know what I was thinking. For some reason, I thought probability when you said value. It makes total sense now. Great video by the way! Really insightful!

  • @manueltiburtini6528 (11 months ago)

    Then why is logistic regression also called Maximum Entropy? Am I wrong?

  • @Mutual_Information (11 months ago)

    You're not wrong. It's the same reason. If you optimize the NLL and you leave the function which maps from W'x (coefficients-times-features) open, and then maximize entropy.. the function you'd get is the softmax! So logistic regression comes from max-ing entropy.
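
    A compact sketch of that connection (the standard maximum entropy derivation of the softmax, not a quote from the video): maximize the conditional entropy of p(y|x) subject to matching the empirical feature expectations,

    \[
    \max_{p}\; -\sum_{x,y} \tilde{p}(x)\, p(y\mid x)\,\log p(y\mid x)
    \quad \text{s.t.}\quad
    \mathbb{E}_{\tilde{p}(x)\,p(y\mid x)}[f(x,y)] = \mathbb{E}_{\tilde{p}(x,y)}[f(x,y)],\;\;
    \sum_{y} p(y\mid x) = 1,
    \]

    and the Lagrangian solution is p(y|x) = exp(w . f(x,y)) / sum_{y'} exp(w . f(x,y')), i.e. the softmax of a linear score, which is exactly the logistic regression form.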

  • @mino99m14 (1 year ago)

    Great video! Maybe it’s just me, but the explanation of this equation is a bit misleading 3:10. Specifically the part where you say to transform the counts into probabilities. For a moment I thought you meant that nd/N is the probability of having a string with nd copies of d, and I was very confused. What it is actually saying is that if we have a string of N digits in which there are n0 copies of 0, n1 copies of 1, and so on for all digits (this means that n0+n1+…+n9 = N), then the probability of getting the digit d is nd/N. I got confused because the main problem was about strings of size N, and these probabilities just consider a single string of size N with nd copies of each digit d.

  • @Mutual_Information (1 year ago)

    Yes, there's a change of perspective on the problem. I tried to communicate that with the table, but I see how it's still confusing. You seem to have gotten through it with just a good think on the matter

  • @mino99m14 (1 year ago)

    @@Mutual_Information it's alright. Having the derivation of the expression helped me a lot. I appreciate you take part of your time to add details like these in your videos 🙂...

  • @TuemmlerTanne11 (2 years ago)

    Honestly your videos get me excited for a topic like nothing else. Reminder to myself not to watch your videos if I need to do anything else that day... Jokes aside, awesome video again!

  • @Mutual_Information (2 years ago)

    Thank you very much! I'm glad you like it and I'm happy to hear there are others like you who get excited about these topics like I do. I'll keep the content coming!

  • @SystemScientist (8 months ago)

    Supercool

  • @zeio-nara (2 years ago)

    It's too hard, too many equations, I didn't understand anything. Can you explain it in simple terms?

  • @Mutual_Information (2 years ago)

    I appreciate the honesty! I'd say.. go through the video slowly. The moment you find something.. something specific!.. ask it here and I'll answer :)

  • @whozz (11 months ago)

    6:13 In this case, the Gods have nothing to do with 'e' showing up there haha. Actually, we could reformulate this result in any other valid base b, and the lambdas would just get rescaled by the factor ln(b).

  • @piero8284 (8 months ago)

    The math gods work in mysterious ways 🤣

  • @bscutajar (11 months ago)

    Great video, but one pet peeve is that I found your repetitive hand gestures somewhat distracting.

  • @Mutual_Information (11 months ago)

    Yea they're terrible. I took some shit advice of "learn to talk with your hands" and it produced some cringe. It makes me want to reshoot everything, but it's hard to justify how long that would take. So, here we are.

  • @bscutajar (11 months ago)

    @@Mutual_Information 😂😂 don't worry about it man, the videos are great. I think there's no reason for any hand gestures since the visuals are focused on the animations.

  • @bscutajar (10 months ago)

    @@Mutual_Information Just watched 'How to Learn Probability Distributions' and in that video I didn't find the hand gestures distracting at all since they were mostly related with the ideas you were conveying. The issue in this video is that they were a bit mechanical and repetitive. This is a minor detail though I love your videos so far!

  • @ebrahimfeghhi1777 (2 years ago)

    Great video!