Softmax Function Explained In Depth with 3D Visuals

The softmax function is often used in machine learning to transform the outputs of the last layer of a neural network (the logits) into probabilities. In this video, I explain how the softmax function works and provide some intuition for thinking about it in higher dimensions. Beyond classification, the softmax function is also used in models that use attention, such as transformers. The softmax function is very similar to the sigmoid function, except that it's generalized to higher dimensions, so if you're also interested in learning more about the sigmoid function, check out my previous video about it, linked below.
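As a quick illustration of what that transformation looks like (a minimal NumPy sketch, not the exact code from the repo linked below):

```python
import numpy as np

def softmax(logits):
    """Map a vector of logits to probabilities that sum to 1."""
    # Subtracting the max logit first is the standard numerical-stability
    # trick; it doesn't change the output (softmax is shift invariant).
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([2.0, 3.0, 5.0])
print(softmax(logits))        # [0.042 0.114 0.844]
print(softmax(logits).sum())  # 1.0
```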
My previous video, "Why We Use the Sigmoid Function in Neural Networks for Binary Classification":
📼 • Why Do We Use the Sigm...
My other video, "Derivative of Sigmoid and Softmax Explained Visually":
📼 • Derivative of Sigmoid ...
GitHub code for visualizing how the logit values that are passed into the softmax function change over time as the model is trained with SGD (stochastic gradient descent) or the Adam optimizer:
💻 github.com/elliotwaite/softma...
Desmos 2D graph of softmax for 4 classes:
📈 www.desmos.com/calculator/drq...
GeoGebra 3D graph of softmax for 2 classes (with derivatives and Gaussians):
📈 www.geogebra.org/classic/qhdd...
GeoGebra 3D graph of softmax for 3 classes (with derivatives):
📈 www.geogebra.org/classic/ps9g...
GeoGebra 3D graph of softmax with Gaussians for 3 classes:
📈 www.geogebra.org/classic/vgwa...
GeoGebra 3D graph of the shape of the softmax input space for 4 and 5 classes:
📈 www.geogebra.org/classic/emjn...
Join our Discord community:
💬 / discord
Connect with me:
🐦 Twitter - / elliotwaite
📷 Instagram - / elliotwaite
👱 Facebook - / elliotwaite
💼 LinkedIn - / elliotwaite
🎵 Kazukii - Return
→ / ohthatkazuki
→ open.spotify.com/artist/5d07M...
→ / officialkazuki

Comments: 94

  • @elliotwaite (3 years ago)

    CORRECTIONS:
    • At 5:22, instead of "more random" I probably should have said "less predictable," and instead of "less random, more deterministic," I probably should have said "more predictable," since the output is only deterministic if one class has a value of 1 and all the others 0; otherwise it still represents randomness, just with different distributions. Also, this is relevant in the context of sampling from the distribution, for example when choosing the next character or word in a sentence during text generation.
    • At 16:51 I said, "It's basically the idea that as our networks are training, they're trying to push these logit values towards those Gaussian distributions," but this should only be taken loosely, since the gradients don't point towards any single point. It would be more accurate to describe the direction of the gradient as a weighted combination of two directions: towards the subspace occupied by that Gaussian and towards the intersection of all the other subspaces, as can be seen from the slope of the gradient shown at 10:04.

  • @apophenic_ (a year ago)

    Thank you!

  • @rembautimes8808 (2 years ago)

    After 30s you know that this lecture is worth its weight in gold.

  • @elliotwaite (2 years ago)

    Thanks!

  • @catthrowvandisc5633 (3 years ago)

    thank you for taking the effort to produce this, it is super helpful!

  • @elliotwaite (3 years ago)

    Thanks, catthrowvandisc! Glad you found it helpful.

  • @EBCEu4 (3 years ago)

    Awesome visuals! Thank you for the great work!

  • @elliotwaite (3 years ago)

    Thanks, Yuriy! Glad you liked it.

  • @chrisogonas (a year ago)

    Awesome illustrations! Thanks

  • @buffmypickles (3 years ago)

    Well thought out and extremely informative. Thank you for making this.

  • @elliotwaite (3 years ago)

    Thanks!

  • @harry8 (a year ago)

    This content is amazing, and the visualizations are so intuitive and easy to understand. Awesome video!

  • @elliotwaite (a year ago)

    Thanks, Harry! I appreciate it.

  • @patite3103 (3 years ago)

    What you've done is just amazing!!

  • @elliotwaite (3 years ago)

    Thanks!

  • @bArda26 (3 years ago)

    Wow, you make Desmos sing! I love Desmos as well. Such an amazing thing, an incredible visualization tool. Great video, please keep making more!

  • @elliotwaite (3 years ago)

    Agreed, Desmos is great! Fun to use and helpful in gaining insights. Thanks, bArda26! I'll keep the videos coming.

  • @wencesvm (3 years ago)

    Amazing video Elliot!!! Great insights, keep it up.

  • @elliotwaite (3 years ago)

    Thanks, Wenceslao! Comments like this make my day.

  • @mohammadkashif750 (3 years ago)

    Very helpful, the content was amazing 😍

  • @ryzurrin (2 years ago)

    Amazing video, by far one of the best and most visual explanations I have seen yet. Thanks for making such a great video. Can't wait to watch more of your videos.

  • @elliotwaite (2 years ago)

    Thanks!

  • @toniiicarbonelll287 (3 years ago)

    So useful! Thanks a lot

  • @nikhilshingadiya7798 (3 years ago)

    I have no words for you!!! Awesome.

  • @shashankdhananjaya9923 (2 years ago)

    Ultimate explanation. Thank you a million.

  • @elliotwaite (2 years ago)

    Thanks!

  • @RomainPuech (10 months ago)

    Best, most thorough explanation possible!

  • @elliotwaite (10 months ago)

    Thank you!

  • @EpicGamer-ux1tu (10 months ago)

    I love you man, this video is what I needed. Much love and best of luck mate.

  • @elliotwaite (10 months ago)

    Thanks!

  • @Nico-rl4bo (3 years ago)

    I started a neural nets from scratch tutorial and your videos are amazing for support/understanding.

  • @elliotwaite (3 years ago)

    Thanks!

  • @mic9657 (4 months ago)

    This is great! You have put so much effort into this to make it easy to understand.

  • @elliotwaite (4 months ago)

    Thanks!

  • @boutiquemaths (7 months ago)

    Thanks so much for creating this thorough, amazing explainer exposing the hidden details of softmax visually. Amazing!

  • @elliotwaite (7 months ago)

    :) I'm glad you liked it.

  • @eugene63218 (3 years ago)

    Man. This is the best video I've seen on this topic.

  • @elliotwaite (3 years ago)

    Thanks!

  • @avinrajan4815 (a year ago)

    Very helpful! The visualisation was very good!! Thank you...

  • @elliotwaite (a year ago)

    Thanks! Glad you liked it.

  • @andreykorneychuk8053 (3 years ago)

    The best video I have ever seen about softmax

  • @elliotwaite (3 years ago)

    Thanks!

  • @LifeKiT-i (10 months ago)

    I particularly love the way you explain it in a graphing calculator, which most YouTubers won't dare to use. The whole of ML is math; love your videos!! Please update us by uploading more videos like this!

  • @elliotwaite (10 months ago)

    Thanks, I appreciate the comment. I hope to make more videos eventually.

  • @vatsal_gamit (3 years ago)

    Very well explained 👍

  • @elliotwaite (3 years ago)

    Thanks!

  • @alidanish6303 (9 months ago)

    I was searching for an intuitive explanation of sigmoid and softmax, because they both have something in common, and materials on that are rare. And to my surprise, Elliot, you explained most of the nuances of the concept, if not all. I was hoping to find intuition on some other aspects of activation functions, but sadly you haven't made any videos for the last many years. I understand that, due to lack of time, and probably because nothing comes of this hobby of yours, you stopped making videos on technical content. But the way you have perceived these ideas has produced gold.

  • @elliotwaite (9 months ago)

    Thanks. I'm glad you liked the explanation. I may make more videos in the future. The main reason I stopped is because I want to build an AI company, and I felt like making these videos was slowing down my progress towards that goal. But I think once I get my business going a little more, it will actually be beneficial to get back into making videos. But we'll see. Thanks again for the kind comment.

  • @stanshunpike369 (2 years ago)

    I just did a presentation with Desmos and I thought mine was good, but WOW props to you. Ur Desmos skills are insanely good man

  • @elliotwaite (2 years ago)

    Thanks! Yeah, I'm a big Desmos fan. It's a great learning tool.

  • @yoggi1222 (3 years ago)

    Excellent!

  • @adityaghosh8601 (3 years ago)

    Thank you so much.

  • @tato_good (2 years ago)

    very good explanation

  • @anton_98 (2 years ago)

    Adding a shift to all classes changes the scores. For example, say we have 3 classes with values 5, 3, and 2. So we can say: 5/10 = 3/10 + 2/10. After adding 1 to the value of all classes: 6/13 != 4/13 + 3/13. So an addition must change the scores. It is also easy to see if we add 1000, for example: the values become 1005, 1003, and 1002, and the score of each roughly becomes 0.3333....

  • @anton_98 (2 years ago)

    This was a comment on this part of the video: kzread.info/dash/bejne/q6iWu7SCfpO0ZMo.html.

  • @elliotwaite (2 years ago)

    The reason an addition to all inputs doesn't change their outputs is that instead of using the raw input values in the fractions, you exponentiate the inputs first before using them in the fraction: e^2 / (e^2 + e^3 + e^5) = e^1002 / (e^1002 + e^1003 + e^1005). Thanks for the question.
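    To make that concrete, here's a quick numeric check (a minimal NumPy sketch, not code from the video; subtracting the max logit before exponentiating is the standard stability trick, and it relies on exactly this shift invariance):

    ```python
    import numpy as np

    def softmax(x):
        x = np.asarray(x, dtype=float)
        exps = np.exp(x - np.max(x))  # shift by the max for numerical stability
        return exps / exps.sum()

    print(softmax([2, 3, 5]))           # [0.042 0.114 0.844]
    print(softmax([1002, 1003, 1005]))  # identical output after a +1000 shift
    ```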

  • @anton_98 (2 years ago)

    @elliotwaite Thank you for the explanation.

  • @zyzhang1130 (3 years ago)

    very helpful!

  • @scotth.hawley1560 (3 years ago)

    Great video. The point where you take the log of all these functions to make the loss seems like it could use a bit more motivation: The log makes sense for exponential-like sigmoids and if you're requiring a probabilistic interpretation, but perhaps there is some other function more appropriate for algebraic sigmoids? (Although I can't think of any that would produce the "nice" properties that you get from cross-entropy loss & logistic sigmoid, apart from maybe a piecewise scheme like "Generalized smooth hinge loss".)

  • @elliotwaite (3 years ago)

    Good point. I didn't go into the reasoning of why I chose the negative log-likelihood loss, other than mentioning that it's the standard used for softmax outputs that are interpreted as probabilities. I wasn't sure of a concise way to explain it, so I thought I would just take it as a given. But maybe I could have at least pointed viewers to another resource they could read if they wanted to better understand why it's typically used and how other loss functions could also potentially be used. Thanks for the feedback.
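    For anyone curious, this is roughly what that standard pairing looks like in code (a minimal NumPy sketch of softmax followed by the negative log-likelihood loss; my own illustration, not the video's code):

    ```python
    import numpy as np

    def softmax(x):
        exps = np.exp(x - np.max(x))
        return exps / exps.sum()

    def nll_loss(logits, target):
        # Cross-entropy with a one-hot target reduces to the negative log
        # of the probability assigned to the correct class.
        return -np.log(softmax(logits)[target])

    logits = np.array([2.0, 3.0, 5.0])
    print(nll_loss(logits, target=2))  # ~0.17: confident and correct, small loss
    print(nll_loss(logits, target=0))  # ~3.17: confident and wrong, large loss
    ```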

  • @melkenhoning158 (2 years ago)

    THANK YOU

  • @TylerMatthewHarris (3 years ago)

    Thanks!

  • @user-iu7qg9lo7y (2 years ago)

    great stuff

  • @HeitorPastorelli (9 months ago)

    awesome video, thank you

  • @elliotwaite (9 months ago)

    :) thanks for the comment. I'm glad you liked it.

  • @snehotoshbanerjee1938 (3 years ago)

    Superb, excellent, Elliot!!! BTW, which software did you use?

  • @elliotwaite (3 years ago)

    Thanks, Snehotosh! For the 2D graph, I used Desmos, and for the 3D graphs, I used GeoGebra. They both have web-based versions and GeoGebra also has a native version. I posted links in the description to all the interactive graphs shown in the video if you want to get some ideas for how to use them.

  • @horizon9863 (3 years ago)

    Please keep doing it!

  • @elliotwaite (3 years ago)

    Thanks, Cankun! I plan to make more videos soon.

  • @tomgreene4329 (a year ago)

    this is probably the best math video I have ever watched

  • @elliotwaite (a year ago)

    Thanks!

  • @alroy01g (2 years ago)

    Great video, many thanks! One question: around 5:00, you mentioned that bringing the probabilities together makes them more random, and separating them more deterministic. I can see that graphically, but I can't reconcile it with the definition of variance, which asserts that the more spread out the points are, the higher the variance. Maybe I'm mixing things up here??

  • @elliotwaite (2 years ago)

    Thanks, glad you liked the video. About your question, it is a bit counterintuitive at first, but I think the key to understanding how having more similar probabilities can lead to a more random outcome is remembering that the probabilities we are talking about are the probabilities of different outcomes. And when I say "random," I'm using it a bit loosely and not in the same way as variance, which I explain below.

    For example, if we were rolling a six-sided die, we could plot the probabilities of getting each of the six sides as bar heights. If it were a weighted die that almost always rolled a 5, then the bar for 5 would be large and the others would be small, and we might say the die was not very random. In this case, the distribution of the value we'd get from rolling the die would also have a low variance, with most of the distribution's mass being on 5. On the other hand, for a fair die, the probabilities of the sides would all be equal, with their bar heights all about the same, and we might say the die was more random. In this case, the distribution of the value we'd get from rolling the die would also have a higher variance, with the probability mass distributed evenly across the outcomes.

    To see the difference between my usage of the word "random" and the precise term "variance," consider a weighted die that rolls a 1 half the time and a 6 the other half. We might say that this die was less random than a fair die, but the distribution of the outcome values would actually have a higher variance than a fair die's. In fact, that is the weighting that gives the highest variance possible.

    So when I say "random," I just mean that we are less certain about which class will be the outcome. But if those classes correspond to specific values in a field (a 1D, 2D, 3D, or higher-dimensional space), then the variance of the distribution of those values could be higher or lower depending on which values the classes correspond to. I hope that helps.
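    To put rough numbers on that distinction (a small NumPy sketch; here I'm quantifying the loose sense of "random" as Shannon entropy, which is one reasonable choice rather than the only one):

    ```python
    import numpy as np

    faces = np.arange(1, 7)

    def variance(probs):
        mean = (faces * probs).sum()
        return (probs * (faces - mean) ** 2).sum()

    def entropy(probs):  # unpredictability of the outcome, in bits
        p = probs[probs > 0]
        return -(p * np.log2(p)).sum()

    fair    = np.full(6, 1 / 6)
    mostly5 = np.array([0.02, 0.02, 0.02, 0.02, 0.90, 0.02])
    ones6s  = np.array([0.5, 0, 0, 0, 0, 0.5])  # half 1s, half 6s

    for name, p in [("fair", fair), ("mostly 5s", mostly5), ("1s and 6s", ones6s)]:
        print(f"{name}: entropy={entropy(p):.2f} bits, variance={variance(p):.2f}")
    # The fair die is the most "random" (highest entropy, ~2.58 bits),
    # but the 1s-and-6s die has the highest variance (6.25 vs 2.92).
    ```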

  • @alroy01g (2 years ago)

    @elliotwaite Ah yes, you're right, those are different outcomes. Appreciate the detailed explanation, thanks for your time 👍

  • @my_master55 (a year ago)

    Thanks Elliot, great video! 👍 Could you please clarify why the output space is evenly distributed among the classes? Shouldn't there be more space for the dominant class? Or do you consider all the classes to be equivalently dominant? Thanks 🙌

  • @elliotwaite (a year ago)

    So the visuals are actually showing all possible cases in one graph: both what the output would be when one class is dominant, and what the output would be when the classes are equivalent. The regions where one color is above the others are where that class is dominant, and the place in the middle where the curves cross and are all equal height is what the output would be if the classes were equivalent. And the regions are all the same size because the softmax function doesn't give any preference to any of the inputs before knowing their values. Let me know if it is still unclear.
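    One way to check the "no preference" part numerically: softmax is symmetric under permutations of its inputs, so shuffling the inputs just shuffles the outputs (a small sketch, my own illustration):

    ```python
    import numpy as np

    def softmax(x):
        exps = np.exp(x - np.max(x))
        return exps / exps.sum()

    x = np.array([1.0, -0.5, 2.0])
    perm = np.array([2, 0, 1])
    print(softmax(x)[perm])  # permuting the outputs...
    print(softmax(x[perm]))  # ...matches permuting the inputs: no class is special
    ```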

  • @my_master55 (a year ago)

    @elliotwaite Oh, okay, thank you 👍 So what we see is just "the middle of the decision," and therefore it only looks like the classes are equal. Which means this doesn't imply that the classes are 50/50 everywhere in the case of 2 classes, but at the intersection it indeed looks like they are 50/50. Hope I got it right 😊

  • @elliotwaite (a year ago)

    @my_master55 I'm not sure what you mean by "the middle of the decision." But yes, at the intersection they are 50/50 in the 2-class example and 1/3, 1/3, 1/3 in the 3-class example.

  • @rutvikjaiswal4986 (3 years ago)

    I liked this video even before watching it, because I knew it was really going to change how I think about the softmax function, and I learned a lot from it.

  • @Mahesha999 (2 years ago)

    I didn't get exactly why the smaller curve itself changes shape when you decrease/increase the green value at 2:25. I guess going through the Desmos graph will bring more clarity. Can you open-source it? Or is it already shared?

  • @elliotwaite (2 years ago)

    The link to the publicly accessible Desmos graph is in the description. The bottom graph is a copy of the top graph, scaled down vertically so that the heights of all the dots add up to 1. So when you increase the height of the green dot, that increases the total height of all the dots, and the bottom graph has to be scaled down even more to keep that total at 1, which is why the bottom graph squashes down more as the green value increases.
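    The rescaling itself is just a division by the total (illustrative numbers below, not the actual values from the graph):

    ```python
    import numpy as np

    heights = np.array([0.5, 2.0, 1.5])  # dot heights in the top graph
    print(heights / heights.sum())       # bottom graph: [0.125 0.5 0.375], sums to 1

    heights[1] = 4.0                     # raise the green dot
    print(heights / heights.sum())       # everything rescales: [0.083 0.667 0.25]
    ```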

  • @arseniymaryin738 (3 years ago)

    Hi, can I see your neural network training (at 12:12) in some environment such as a Jupyter notebook?

  • @elliotwaite (3 years ago)

    I don't have a notebook of it, but you can get the code here: github.com/elliotwaite/softmax-logit-paths

  • @lenam8048 (10 months ago)

    Why do you have so few subscribers?

  • @elliotwaite (10 months ago)

    I know 😥. I think I need to make more videos.

  • @zhiyingwang1234 (a year ago)

    The comment (5:06) about the temperature parameter is not accurate, when the temp parameter

  • @elliotwaite (a year ago)

    I'm not sure I understand your comment. Can you clarify what you mean by "theta values"? In the video I was trying to say that increasing the temperature flattens the distribution (making it more uniform), and that lowering the temperature pushes the arg max up and the others down (making it more like a spike at only the arg max probability). Are you saying that you think it's the other way around, or was there something else you thought was inaccurate?
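    For reference, this is the usual way the temperature enters the computation (a minimal NumPy sketch, my own illustration rather than anything from the video):

    ```python
    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Divide the logits by the temperature before the softmax:
        # high T flattens the distribution, low T sharpens it toward the arg max.
        scaled = np.asarray(logits, dtype=float) / temperature
        exps = np.exp(scaled - np.max(scaled))
        return exps / exps.sum()

    logits = [2.0, 3.0, 5.0]
    print(softmax_with_temperature(logits, 0.5))   # [0.002 0.018 0.980]  spiky
    print(softmax_with_temperature(logits, 1.0))   # [0.042 0.114 0.844]  standard
    print(softmax_with_temperature(logits, 10.0))  # [0.289 0.320 0.391]  near uniform
    ```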

  • @fibonacci112358s (11 months ago)

    "To imagine 23-dimensional space, I first consider n-dimensional space, and then I set n = 23"

  • @elliotwaite (11 months ago)

    Haha. For real though, that probably is the best way to do it.

  • @DiogoSanti (3 years ago)

    Lol "I am not confortable with geometry"... If you aren't, imagine marginal people... Haha

  • @elliotwaite (3 years ago)

    😁 Thanks, I try. The higher-dimensional stuff is tough, but I've slowly been getting better at finding ways to reason about it. But I'd imagine there are people, perhaps mathematicians who work in higher dimensions regularly, who have much more effective mental tools than I do.

  • @DiogoSanti (3 years ago)

    @elliotwaite Maybe, but higher dimensions are hypothetical since we are trapped in 3 dimensions... the rest is just imagination, hehe

  • @joeybasile1572 (2 months ago)

    Great video. Thank you