The Principle of Maximum Entropy
The machine learning consultancy: truetheta.io
Want to work together? See here: truetheta.io/about/#want-to-w...
What's the safest distribution to pick in the absence of information? What about in the case where you have some, though only partial, information? The Principle of Maximum Entropy answers these questions well and, as a result, is a frequent guiding rule for selecting distributions in the wild.
SOCIAL MEDIA
LinkedIn : / dj-rich-90b91753
Twitter : / duanejrich
Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
Sources
Chapters 11-12 of [2] were primary sources - this is where I ironed out most of my intuition on this subject. Chapter 12 of [1] was helpful for understanding the relationship between the maximum entropy criterion and the form of the distribution that meets it. [3] was useful for a high-level perspective and [4] was helpful for determining the list of maximum entropy distributions.
Also, thank you to Dr. Hanspeter Schmid of the University of Applied Sciences and Arts, Northwestern Switzerland. He helped me interpret some of the more technical details of [2] and prevented me from attaching an incorrect intuition to the continuous case - much appreciated!
[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. 2nd edition. John Wiley, 2006.
[2] E. T. Jaynes. Probability theory: the logic of science. Cambridge university press, 2003.
[3] Principle of Maximum Entropy, Wikipedia, en.wikipedia.org/wiki/Princip...
[4] Maximum Entropy Distribution, Wikipedia, en.wikipedia.org/wiki/Maximum...
Timestamps:
0:00 Intro
00:41 Guessing a Distribution and Maximum Entropy
04:16 Adding Information
06:40 An Example
08:00 The Continuous Case
10:26 The Shaky Continuous Foundation
Comments: 109
The principles explained here appear everywhere: thermodynamics, machine learning, information theory. Very fundamental
You know that feeling when you find a criminally overlooked channel and you’re about to get in on the ground level of something that’s gonna blow up? This is you now
@Mutual_Information
2 years ago
That is quite nice of you - thank you! I hope you're right, but for now, I'm working on my patience. It can take quite a while to get noticed on YouTube. I'm trying to keep my expectations realistic.
It blew my mind that those famous distributions come naturally as the ones that give maximum entropy when we set the domain and constraints in a general way. Now I kind of know why they are special.
@MP-if2kf
2 years ago
Definitely very cool. In many cases there are also other fascinating characterizations. For example: assume a continuous distribution on positive support with the no-memory property --> solve the resulting differential equation --> find that it MUST be the exponential
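That characterization is easy to sanity-check numerically. A minimal sketch (the rate 0.7 and the test points are arbitrary choices):

```python
import math

lam = 0.7  # arbitrary rate parameter

def survival(x):
    """P(X > x) for an Exponential(lam) random variable."""
    return math.exp(-lam * x)

# Memorylessness: P(X > s + t | X > s) = P(X > t) for all s, t >= 0 --
# the property that singles out the exponential among continuous
# distributions on positive support.
for s in [0.5, 1.0, 3.0]:
    for t in [0.2, 1.5, 4.0]:
        assert math.isclose(survival(s + t) / survival(s), survival(t))
```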
In the recent 3b1b video about the normal distribution, there is a mention that the normal distribution maximizes entropy. Then immediately I saw it here in your video, displaying the normal distribution as the one that maximizes entropy while constraining the mean and variance, which are the only two parameters of the normal distribution. That is very nice.
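That claim can be spot-checked with scipy: among a few common distributions matched to the same mean and variance, the normal has the largest differential entropy. (The comparison set here is just an illustrative sample, not a proof.)

```python
import numpy as np
from scipy import stats

sigma = 1.0  # shared standard deviation; all three have mean 0

h_norm = stats.norm(scale=sigma).entropy()                     # 0.5*ln(2*pi*e*sigma^2)
h_laplace = stats.laplace(scale=sigma / np.sqrt(2)).entropy()  # var of Laplace(b) is 2b^2
a = sigma * np.sqrt(3)                                         # var of U[-a, a] is a^2/3
h_uniform = stats.uniform(loc=-a, scale=2 * a).entropy()       # ln(2a)

assert h_norm > h_laplace > h_uniform  # the normal wins among these three
```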
I can't get over how much fun you make learning about stats, ML, and information theory---not to mention that you teach it with skill like Feynman's and a style that is all your own.
@Mutual_Information
10 months ago
That's quite a compliment - Feynman is a total inspiration for many, myself included. His energy about the topics makes you *want* to learn about them.
An amazing video. Proving once again why the internet is a blessing to humanity.
Really stimulating. Your last example looked to be heading towards information in the variables. Got me hooked
I didn't expect much and wasn't disappointed. What a great video. I wish you lots more followers!
@Mutual_Information
A year ago
Thank you so much! More to come :)
I found this video very helpful! Thanks for making it! The animated visuals especially are great :)
I’m glad I clicked on your comment, this channel is very well presented and I look forward to your future content.
Great video, love the level of complexity and fundamentals. I feel this just gave me some fresh inspo for my research
Brilliant explanation!
The quality of your content is amazing!
@Mutual_Information
2 years ago
Thanks a lot! I try 😅
I do not think I would produce any interesting analysis today. I simply... appreciated it a lot! :-)
Amazing video!
Thanks a lot for this great and informative video!! One of the best explanations I have come across
@Mutual_Information
2 years ago
Thanks! Glad you enjoyed it, more to come
Great video! Great channel! I'm working my way through that Jaynes book [2] and absolutely love it.
@Mutual_Information
2 years ago
That is a heroic move! He has some wild insights on probability theory. Guy was a complete beast.
This channel is so amazing. I had a fuzzy understanding of a lot of these concepts, but this clarifies it. For example, my intuition suggested that for a given mean and variance, the maximum entropy estimate would be a beta-binomial distribution, but I wasn't really able to prove it to myself. 7:00
@Mutual_Information
11 months ago
Glad this is helping!
Been 20 years since I last did information theory. This seems closely related to the asymptotic equipartition principle. Excellent video.
@Mutual_Information
2 years ago
Oh I’m sure they’re related in some mysterious and deep way I don’t yet understand, just because that’s a big topic in source [1] :)
I have a feeling you will become famous someday because the video quality is really good...
Great video!
Incredible lecture, thank you so much
This one saved my life!
Your video is easy to understand, I like it
Another fantastic video! I would love to improve my knowledge about the Jeffreys prior for a parameter space.
@Mutual_Information
2 years ago
Thank you! Always means a lot. And yea, now that I've covered the Fisher Information, I can hit that one soon. Appreciate the suggestion - it's on the list!
@derickd6150
A year ago
@@Mutual_Information I do believe (and I hope this is the case) that we are going through a boom in science channels right now. It seems that the youtube algorithm is identifying that sub-audience that loves this content and recommending these types of channels to them sooner and sooner. So I really hope it happens to you very soon!
I went through the overall idea. It replaces a lot of complicated mathematical formulas with plain, easy-to-understand descriptions, which let me at least understand the principle behind it. Thanks!
I was able to apply this equation in my work; thanks for making it plausible.
woah this is such a good explanation. I just randomly discovered this channel but I'm sure it's bound to blow up. Just a bit of critique: Idk if this is only meant for college students but if you want to get a slightly broader audience you could focus a bit more on giving intuition for the concepts.
@Mutual_Information
2 years ago
Thank you very much! Yea the level of technical background I expect of the audience is an important question. I’m partial to keeping it technical. I think it’s OK not to appeal to everyone. My audience will just be technical and small :)
@zorro77777
2 years ago
@@Mutual_Information ++ with Alberto le Fisch: Prof. Feynman: "If you cannot explain something in simple terms, you don't understand it." And I am sure you understand, so please explain it to us! :)
@diegofcm6201
A year ago
His channel already puts lots of effort into doing exactly that. The final bit of the video, explaining the FUNDAMENTAL difference in the discrete-to-continuous breakage of "label invariance", just blew my mind. Seriously some of the best intuition I've ever received about something.
@mino99m14
A year ago
@@zorro77777 well, he also said something like "If I could summarise my work in a sentence, it wouldn't be worth a Nobel prize". Which means that although something can be simplified, this doesn't mean it won't take a long time to explain. It's not that easy, and he is not your employee, you know. Also, in that quote he meant being able to explain to physics undergrads, whom you would expect to have some knowledge already.
I think it would be up to the parametrization to care about area or side length depending on the problem case in that example. I'd like my tools to do their own distilled thing in small, predictable, usable pieces.
Btw I don't know if Mutual Information is a good channel name. The term is pretty stacked and I can't just say "do you know Mutual Information" like I can say "do you know 3blue1brown"... It also makes it harder to find your channel, because if someone looks up mutual information on youtube you won't show up at the top. Maybe that's your strategy though, to have people find your channel when they search for Mutual Information on youtube ;) Anyways, I'm sure you have thought about this, but that's my take.
@Mutual_Information
2 years ago
I fear you may be correct. I've heard a few people say they tried to find my channel but couldn't when they searched. But part of me thinks I've gone too far. There's actually quite a bit of work I'd have to do to make a title change, and if the cost is my channel is a little bit more hidden, I think that's OK. Weirdly, I'm kinda enjoying the small channel stage (being a bit presumptuous that I'll eventually not be in this stage :) ). It's less pressure, gives me time to really nail the feedback and it's easier to have 2-way communication with the viewers. Don't get me wrong, I'd like to grow the channel, but I'm OK with leaving some growth hacks on the table. That said, I'm not totally set on "Mutual Information." I'd like to feel it out a bit more. As always, appreciate the feedback!
Your videos are great! I am curious about the connection between maximum entropy and Bayesian inference. They seem related. Let's think about Bayesian inference in a variational way, where you minimize a KL divergence between the approximate and true posterior, KL(q(z)||p(z|x)), where z is e.g. the vector of all our unknown digits and x the digit mean. Minimizing this KL divergence is equivalent to maximizing the sum of (1) H(q(z)), an entropy maximization objective, (2) -CE(q(z),p(z)), a negative cross-entropy term with the prior distribution p(z), which is constant for a uniform prior, and (3) E_q(z) log p(x|z), a likelihood term that produces constraints. In our digits case p(x|z) is deterministically 1 if the condition x=mean(z) is fulfilled and 0 otherwise. All z with log p(x|z) = log 0 = -infty must be given a probability of 0 by q(z) to avoid the objective reaching negative infinity. On the other hand, once this constraint is fulfilled, all remaining choices of q attain E_q(z) log p(x|z) = E_q(z) log 1 = 0, so the entropy term gets to decide among them.
@dermitdembrot3091
2 years ago
further, if we choose the digits to be i.i.d. ~ q(z_1) (z_1 being the first digit), as the number of digits N goes to infinity, the empirical mean, mean(z), will converge almost surely to the mean of q(z_1), so in the limit, we can put the constraint on the mean of q(z_1) instead of the empirical mean, as done by the maximum entropy principle. Digits being i.i.d. should be an unproblematic restriction due to symmetry (and due to entropy maximization).
@Mutual_Information
2 years ago
Wow, yes, you dived right into a big topic. Variational inference is a big way we get around some of the intractability naive Bayesian stats can yield. You seem to know it well - thanks for all the details
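The decomposition in the thread above can be checked numerically for a small discrete toy case (the prior, likelihood, and q below are arbitrary random choices):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
p_z = np.ones(K) / K                   # uniform prior over z
lik = rng.random(K)                    # p(x|z) for one fixed observation x
q = rng.random(K); q /= q.sum()        # an arbitrary approximate posterior

post = p_z * lik
p_x = post.sum()                       # evidence p(x)
post /= p_x                            # true posterior p(z|x)

kl = np.sum(q * np.log(q / post))
H = -np.sum(q * np.log(q))             # entropy of q
CE = -np.sum(q * np.log(p_z))          # cross-entropy with the prior
E_lik = np.sum(q * np.log(lik))        # expected log-likelihood under q

# KL(q || p(z|x)) = -H(q) + CE(q, p) - E_q[log p(x|z)] + log p(x),
# so minimizing the KL maximizes H(q) - CE(q, p) + E_q[log p(x|z)].
assert np.isclose(kl, -H + CE - E_lik + np.log(p_x))
```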
When will you make a video on Mutual Information to honor your channel's name?
@Mutual_Information
2 years ago
Haha it's coming! But I got a few things in queue ahead of it :)
Very informative. I had to stop and go back many times because you are speaking and explaining things very fast :-p
@Mutual_Information
2 years ago
I’ve gotten this feedback a few times now. I’ll be working on it for the next vids, though I still talk fast in the vids I’ve already shot.
12:45, I think taking the log would be useful in the squares scenario since then the squaring would become a linear transformation rather than non-linear
Great video. Are you aware of a way to represent the entropy as a single number, not a distribution? Thanks!
@Mutual_Information
2 years ago
Thanks! And to answer your question, the entropy *is* a single number which measures a distribution.
Hm, I tried to use this method to find the maximum entropy distribution when you know the first three moments of the distribution, that is, the mean, the variance and the skewness, but I end up with an expression that either leads to a distribution completely without skewness or one with a PDF that goes to infinity, either as x approaches infinity or as x approaches minus infinity (I have an x^3 term in the exponent), and which therefore can't be normalized. Is that a case which this method doesn't work for? Is there some other way to find the maximum entropy distribution when you know the first three moments in that case?
@kristoferkrus
6 months ago
Okay, I think I found the answer to my question. According to Wikipedia, this method works for the continuous case if the support is a closed subset S of the real numbers (which I guess means that S has a minimum and a maximum value?), and it doesn't mention the case where S = R. But presume that S is the interval [-a, +a], where a is very large; then this method works. And I realized that the solution you get when you use this method is a distribution that is very similar to a normal distribution, except for a tiny increase in density just by one of the two endpoints to make the distribution skewed, which is not really the type of distribution I imagined. I believe the reason this doesn't work if S = R is because there is no maximum entropy distribution that satisfies those constraints, in the sense that if you have a distribution that does satisfy those constraints, you can always find another distribution that also satisfies the constraints, but with higher entropy. Similarly, if you let S = [-a, a] again, you can use this method to find a solution, but if you let a → ∞, the limit of the solution you will get by using this method is a normal distribution. But as you let a → ∞, the kurtosis of the solution will also approach infinity, which may be undesired. So if you want to prevent that, you may also constrain the kurtosis, maybe by putting an upper limit to it or by choosing it to take on a specific value. When you do this, all of a sudden the method works again for S = R.
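For what it's worth, the bounded-support version is easy to explore numerically. A rough sketch on a small grid (the grid size, moment targets, and tolerances are all arbitrary choices), showing that the solution's log-probabilities follow the cubic-in-the-exponent form:

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(-1.0, 1.0, 9)      # discretized bounded support [-a, a]
targets = (0.0, 0.3, 0.05)         # chosen values for E[x], E[x^2], E[x^3]

def neg_entropy(p):
    return np.sum(p * np.log(p))

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1}]
for k, m in enumerate(targets, start=1):
    cons.append({"type": "eq", "fun": lambda p, k=k, m=m: p @ x**k - m})

res = minimize(neg_entropy, np.ones_like(x) / x.size,
               constraints=cons, bounds=[(1e-9, 1.0)] * x.size)
p = res.x

# At the optimum, log p(x) = l0 + l1*x + l2*x^2 + l3*x^3, i.e. cubic in x.
coeffs = np.polyfit(x, np.log(p), 3)
assert np.allclose(np.polyval(coeffs, x), np.log(p), atol=1e-2)
```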
Nice video, do you think that set shaping theory can change the approach to information theory?
@Mutual_Information
2 years ago
I don't know anything about set shaping theory, so.. maybe! Whatever it is, I think it could only *extend* information theory. I believe the core of information theory is very much settled.
@informationtheoryvideodata2126
2 years ago
Set shaping theory is a new theory, but the results are incredible, it can really change information theory.
Is the shaky continuous foundation related to the Bertrand paradox?
@Mutual_Information
A year ago
I am not aware of that connection. When researching it, I just discovered that these ideas weren't intended for the continuous domain. People extended them into the continuous domain, but then certain properties were lost.
Thank you so much! the graphs were especially helpful, and the concise language helped me finally understand this concept better
@Mutual_Information
7 months ago
Exactly what I'm trying to do
Great video! Thanks a lot. A little feedback: the example you give in 12:00-13:00 is a bit hard to follow without visualization. The blackboard and simulations you use are very helpful in general. It would be great if that section weren't left still while you only talk. Even some bullet points would be nice.
@Mutual_Information
A year ago
Thanks - useful, specific feedback is in short supply, so this is very much appreciated. I count yours as a "keep things motivated and visual"-type of feedback, which is something I'm actively working on (but not always great about). Anyway, it's a work in progress and hopefully you'll see the differences in upcoming videos. Thanks again!
Imagine you have a raw set. You want to build a histogram. You don't know the bin range, the bin start and end locations, or the number of bins. Can an ideal histogram be built by using the max entropy law?
@Mutual_Information
2 years ago
I've heard about this and I've actually seen it used as an effective feature-engineering preprocessing step in a serious production model. Unfortunately, I looked and couldn't find the exact method and I forget the details. But there seems to be a good amount of material on the internet for "entropy based discretization." I'd give those a look
One thing is bothering me... The justification of using entropy seems circular. In the first case, where no information is added, we are implicitly assuming that the distribution of the digits is discrete uniform. Because we are choosing the distribution based on the number of possible sequences corresponding to a distribution. This is only valid if any sequence is just as likely. But this is only true if we assume the distribution is uniform. Things are a bit more interesting when we add the moment conditions. I guess what we are doing, is conditioning on distributions satisfying the moment conditions, and choosing among these the distribution with the most possible sequences. We seem to be using a uniform prior (distribution for the data), in essence. My question is: why would this be a good idea? What actually is the justification of using entropy? Which right now in my mind is: why should we be using the prior assumption that the distribution is uniform when we want to choose a 'most likely' distribution? Don't feel obliged to respond to my rambling. Just wanted to write it down. Thank you for your video!
@Mutual_Information
2 years ago
lol doesn't sound like rambling to me. I see your point about it being circular. But I don't think that's the case in fact. Let's say it wasn't uniformly distributed.. Maybe odd numbers are more likely. Now make a table of all sequences and their respective probabilities. Still, you'll find that sequences with uniform counts have a relative advantage.. it may not be as strong due to whatever the actual distribution is.. but the effect of "there are more sequences with nearly even counts" is always there.. even if the distribution of each digit isn't uniform. It's that effect we lean on.. and in the absence of assuming anything about the digit distribution.. that leads you to the uniform distribution. In other words, the uniform distribution is a consequence, not an assumption.
@MP-if2kf
2 years ago
@@Mutual_Information I have to think about it a bit more. In any case, thank you for your careful reply! Really appreciate it.
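The "more sequences with nearly even counts" effect in the reply above can be made concrete in a two-symbol toy case: the multiplicity C(N, k) peaks at perfectly balanced counts, and even under a tilted digit distribution it pulls the most likely count profile back toward balance. (N and the tilt 0.7 are arbitrary choices.)

```python
from math import comb

N = 20
# Number of binary strings of length N with exactly k ones:
counts = [comb(N, k) for k in range(N + 1)]
assert max(range(N + 1), key=lambda k: counts[k]) == N // 2  # peak at even counts

# With a tilted digit distribution p(1) = 0.7, the single most likely STRING
# is all ones, but the most likely COUNT is pulled back toward balance by
# the multiplicity factor:
p1 = 0.7
probs = [comb(N, k) * p1**k * (1 - p1)**(N - k) for k in range(N + 1)]
mode = max(range(N + 1), key=lambda k: probs[k])
assert N // 2 < mode < N
```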
WOW!
Cool video! You lost me at the lambdas though... They are chosen to meet the equations... what do they solve exactly?
@MP-if2kf
2 years ago
Are they the Lagrange multipliers?
@MP-if2kf
2 years ago
I guess I get it, the lambda is just chosen to get the maximal entropy distribution given the moment condition...
@MP-if2kf
2 years ago
Amazing video, I will have to revisit it a few times though
@MP-if2kf
2 years ago
only didn't understand the invariance bit...
@Mutual_Information
2 years ago
The invariance bit is something that I really didn't explore well. It's something I only realized while I was researching the video. The way I would think about it is.. the motivating argument for max entropy doesn't apply over the continuous domain b/c you can't enumerate "all possible sequences of random samples".. so if you use the max entropy approach in the continuous domain anyway.. you are doing something which imports a hidden assumption you don't realize. Something like.. minimize the KL-divergence from some reference distribution.. idk.. something weird. As you can tell, I think it's OK to not understand the invariance bit :)
do you plan to make a video on expectation maximization? lol, funny you put an information theory textbook on the desk for this video
@Mutual_Information
2 years ago
Glad you noticed :) Yes EM is on the list! I have a few things in front of it but it's definitely coming.
@abdjahdoiahdoai
2 years ago
@@Mutual_Information nice
Another great video. My only comment would be to slow down slightly to give more time to digest the words and graphics.
@Mutual_Information
2 years ago
Thank you and I appreciate the feedback. I’ve already shot 2 more vids so I won’t be rolling into those, but I will for the one I’m writing right now. Also working on avoiding the uninteresting details that don’t add to the big picture.
nice
Nice video! I think there's an error in your list at 10:42 - the Cauchy distribution is not exponential family.
@Mutual_Information
2 years ago
Thank you! Looking into it, I don't believe it's an error. I'm not claiming here that these are within the exponential family. I'm saying these are max entropy distributions under certain constraints, which is a different set. You can see the Cauchy distribution listed here: en.wikipedia.org/wiki/Maximum_entropy_probability_distribution But thank you for keeping an eye out for errors. They are inevitable, but extra eyes are my best chance at a good defense against them.
@desir.ivanova2625
2 years ago
@@Mutual_Information Thanks for your quick reply! And thanks for the link - I can see that indeed there's a constraint (albeit a very strange one) for which Cauchy is the max entropy distribution. I guess then, I was confused by the examples in the table + those that you then list -- all distributions were exponential family and Cauchy was the odd one out. Also, please correct me if I'm wrong, but I think if you do moment matching for the mean (i.e. you look at all possible distributions that realise a mean parameter \mu), then the max entropy distribution is an exponential family one. And the table was doing exactly that. Now, we can't do moment matching for the Cauchy distribution as none of its moments are defined. So that was the second reason for my confusion.
@Mutual_Information
2 years ago
That makes a lot of sense. To be honest, I don’t understand the max entropy exponential family connection all that well. There seem to be these bizarre distributions that are max entropy but aren’t exponential fam. I’m not sure why they’re there, so I join you in your confusion!
What do you mean that the entropy only depends on the variable's probabilities and not its values? You also said that the variance does depend on its values, but I don't see why the variance would while the entropy would not. You say that you can define the entropy as a measure of a bar graph, but so can the variance.
@Mutual_Information
A year ago
entropy = - sum p(x) log p(x).. notice only p(x) appears in the equation - you never see just "x" in that expression. For (discrete) variance.. it's sum of p(x)(x-E[x])^2.. notice x does appear on its own. When I say the bar graph, I'm only referring to the vertical heights of the bars (which are the p(x)'s).. you can use just that set of numbers to compute the entropy. For the variance, you'd need to know something in addition to those probabilities (the values those probabilities correspond to).
@kristoferkrus
A year ago
@@Mutual_Information Ah, I see! I don't know what I was thinking. For some reason, I thought probability when you said value. It makes total sense now. Great video by the way! Really insightful!
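The reply above fits in a few lines of code: relabeling the outcome values changes the variance but cannot touch the entropy, since entropy never sees the values (the particular probabilities and values are arbitrary):

```python
import math

def entropy(probs):
    # Uses only the bar heights p(x); the outcome values never appear.
    return -sum(p * math.log(p) for p in probs)

def variance(values, probs):
    # Needs the values x in addition to the probabilities.
    mean = sum(p * x for x, p in zip(values, probs))
    return sum(p * (x - mean) ** 2 for x, p in zip(values, probs))

probs = [0.5, 0.3, 0.2]
h = entropy(probs)                    # computable with no values at all
v1 = variance([0, 1, 2], probs)
v2 = variance([0, 10, 20], probs)     # relabeled outcomes, same bar graph
assert v1 != v2                       # variance moved with the relabeling
```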
Then why is logistic regression also called Maximum Entropy? Am I wrong?
@Mutual_Information
11 months ago
You're not wrong. It's the same reason. If you optimize the NLL and leave open the function which maps from W'x (coefficients times features) to probabilities, and then maximize entropy, the function you get is the softmax! So logistic regression comes from maxing entropy.
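That connection shows up in a tiny numerical experiment. This is a sketch, not the actual logistic-regression derivation: maximize entropy over three classes subject to a fixed expected score, and the optimizer lands on a Gibbs/softmax-shaped distribution (the scores and target are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

scores = np.array([0.0, 1.0, 2.0])   # stand-ins for the per-class W'x values
target = 1.3                          # required expected score

def neg_entropy(p):
    return np.sum(p * np.log(p))

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1},
        {"type": "eq", "fun": lambda p: p @ scores - target}]
res = minimize(neg_entropy, np.ones(3) / 3, constraints=cons,
               bounds=[(1e-9, 1.0)] * 3)
p = res.x

# A Gibbs/softmax solution p_i = exp(l * s_i) / Z makes log p linear in the
# scores, so consecutive differences of log p should be equal.
d = np.diff(np.log(p))
assert np.isclose(d[0], d[1], atol=1e-3)
```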
Great video! Maybe it’s just me, but the explanation of this equation is a bit misleading at 3:10, specifically the part where you say to transform the counts into probabilities. For a moment I thought you meant that nd/N is the probability of having a string with nd copies of d, and I was very confused. What it is actually saying is that if we have a string of N digits in which there are n0 copies of 0, n1 copies of 1, and so on for all digits (this means that n0+n1+…+n9 = N), then the probability of getting the digit d is nd/N. I got confused because the main problem was about strings of size N, and these probabilities just consider a single string with nd copies of each digit d.
@Mutual_Information
A year ago
Yes, there's a change of perspective on the problem. I tried to communicate that with the table, but I see how it's still confusing. You seem to have gotten through it with just a good think on the matter
@mino99m14
A year ago
@@Mutual_Information it's alright. Having the derivation of the expression helped me a lot. I appreciate that you take part of your time to add details like these in your videos 🙂...
Honestly your videos get me excited for a topic like nothing else. Reminder to myself not to watch your videos if I need to do anything else that day... Jokes aside, awesome video again!
@Mutual_Information
2 years ago
Thank you very much! I'm glad you like it and I'm happy to hear there are others like you who get excited about these topics like I do. I'll keep the content coming!
Supercool
It's too hard, too many equations, I didn't understand anything. Can you explain it in simple terms?
@Mutual_Information
2 years ago
I appreciate the honesty! I'd say.. go through the video slowly. The moment you find something.. something specific!.. ask it here and I'll answer :)
6:13 In this case, the gods have nothing to do with 'e' showing up there haha. Actually, we could reformulate this result in any other proper base b, and the lambdas would just get shrunk by the factor ln(b).
The math gods work in mysterious ways 🤣
Great video, but one pet peeve is that I found your repetitive hand gestures somewhat distracting.
@Mutual_Information
11 months ago
Yea they're terrible. I took some shit advice of "learn to talk with your hands" and it produced some cringe. It makes me want to reshoot everything, but it's hard to justify how long that would take. So, here we are.
@bscutajar
11 months ago
@@Mutual_Information 😂😂 don't worry about it man, the videos are great. I think there's no reason for any hand gestures since the visuals are focused on the animations.
@bscutajar
10 months ago
@@Mutual_Information Just watched 'How to Learn Probability Distributions' and in that video I didn't find the hand gestures distracting at all since they were mostly related with the ideas you were conveying. The issue in this video is that they were a bit mechanical and repetitive. This is a minor detail though I love your videos so far!
Great video!