BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper Explained)

Science & Technology

Self-supervised representation learning relies on negative samples to keep the encoder from collapsing to trivial solutions. However, this paper shows that negative samples, which are a nuisance to implement, are not necessary for learning good representations: the proposed algorithm, BYOL, outperforms other baselines using just positive samples.
OUTLINE:
0:00 - Intro & Overview
1:10 - Image Representation Learning
3:55 - Self-Supervised Learning
5:35 - Negative Samples
10:50 - BYOL
23:20 - Experiments
30:10 - Conclusion & Broader Impact
Paper: arxiv.org/abs/2006.07733
Abstract:
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the-art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using the standard linear evaluation protocol with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.
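
In code, the training procedure described in the abstract looks roughly like the following PyTorch-style sketch. This is a minimal reconstruction for illustration, not the authors' released pseudocode: the optimizer, learning rate, and the augment() helper are placeholder assumptions, while the encoder/projector/predictor split, the EMA target update, and the symmetrized normalized-MSE loss follow the description above.

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    def mlp(in_dim, hidden_dim=4096, out_dim=256):
        # projector / predictor head on top of the encoder (dimensions are assumptions)
        return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim),
                             nn.ReLU(inplace=True), nn.Linear(hidden_dim, out_dim))

    encoder = models.resnet50()
    encoder.fc = nn.Identity()                   # y = f_theta(x): the 2048-d pooled features
    projector = mlp(2048)                        # g_theta
    predictor = mlp(256)                         # q_theta, exists only on the online side
    target_encoder = copy.deepcopy(encoder)      # f_xi
    target_projector = copy.deepcopy(projector)  # g_xi
    for p in list(target_encoder.parameters()) + list(target_projector.parameters()):
        p.requires_grad = False                  # the target gets no gradients, only EMA updates

    opt = torch.optim.SGD(
        list(encoder.parameters()) + list(projector.parameters()) + list(predictor.parameters()),
        lr=0.2, momentum=0.9)

    def regression_loss(p, z):
        # normalized MSE, equivalent to 2 - 2 * cosine similarity
        return (2 - 2 * F.cosine_similarity(p, z, dim=-1)).mean()

    def byol_step(x, augment, tau=0.996):
        v1, v2 = augment(x), augment(x)          # two augmented views of the same batch
        p1 = predictor(projector(encoder(v1)))   # online prediction of the target's projection
        p2 = predictor(projector(encoder(v2)))
        with torch.no_grad():
            z1 = target_projector(target_encoder(v1))
            z2 = target_projector(target_encoder(v2))
        loss = regression_loss(p1, z2) + regression_loss(p2, z1)   # symmetrized loss
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                    # slow-moving average update of the target
            for q, k in zip(list(encoder.parameters()) + list(projector.parameters()),
                            list(target_encoder.parameters()) + list(target_projector.parameters())):
                k.mul_(tau).add_((1 - tau) * q)
        return loss.item()

After pre-training, only the online encoder is kept and evaluated, e.g. with a linear classifier on top of its frozen features.
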
Authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 134

  • @amitchaudhary9869 (4 years ago)

    I have an intuition on why they add an MLP after the ResNet-50. Say a student has an exam tomorrow on a subject. Their studying in the few days before would mostly be focused on the questions/topics that might come up the next day, instead of on getting a holistic understanding of the subject itself. Similarly, if we used the ResNet-50 representations directly for the task, they would be updated in a way that gets better at the contrastive-loss task instead of learning to generate useful generic representations. The SimCLR paper also sheds light on this through its ablation studies. In learning to separate two things, the network could discard properties of the representation that are useful for downstream tasks but don't help it separate things. Adding an MLP projects into a space where the MLP representations can specialize for the contrastive task, while the ResNet-50 produces generic representations. In analogy to NLP, this reminds me of the query/key/value division in the transformer model.

  • @YannicKilcher (4 years ago)

    Yes that makes sense

  • @tengotooborn (3 years ago)

    Why would adding an MLP help that?

  • @3663johnny (1 year ago)

    I actually think this is to reduce the storage for image search engines🧐🧐

  • @tanmay8639 (4 years ago)

    This guy is epic.. never stop making videos bro

  • @Denetony (4 years ago)

    Don't say "never stop"! He has been posting every single day! He deserves a holiday, or at least a day off, haha.

  • @BradNeuberg (3 years ago)

    Just a small note on how much I appreciate these in-depth paper reviews you've been doing! Great work, and it's a solid contribution to the broader community.

  • @1Kapachow1 (3 years ago)

    There is actual motivation behind the projection step. The idea is that in our self-supervised task, we try to reach a representation which doesn't change with augmentations of the image. However, we don't know the actual downstream task, and we might remove TOO MUCH information from the representation. Therefore, adding a final "projection head" and taking the representation from before that projection increases the chance of keeping "a bit more" information that is useful for other tasks as well. It is mentioned/described a bit in SimCLR.
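
    To make this concrete, here is a minimal sketch of the standard linear evaluation setup (an illustration with assumed names and dimensions, not the paper's exact protocol): the self-supervised loss is computed on the projected vectors during pre-training, but downstream evaluation trains a linear classifier on the 2048-d representation y taken before the projection head.

        import torch
        import torch.nn as nn
        from torchvision import models

        # stand-in for the pre-trained online encoder f; in practice you would load its weights
        encoder = models.resnet50()
        encoder.fc = nn.Identity()              # encoder(x) is the 2048-d representation y

        linear_probe = nn.Linear(2048, 1000)    # 1000 ImageNet classes (assumption)
        criterion = nn.CrossEntropyLoss()
        opt = torch.optim.SGD(linear_probe.parameters(), lr=0.1, momentum=0.9)

        def probe_step(images, labels):
            encoder.eval()
            with torch.no_grad():               # the encoder stays frozen; only the probe trains
                y = encoder(images)             # pre-projection features, not z = g(f(x))
            logits = linear_probe(y)
            loss = criterion(logits, labels)
            opt.zero_grad(); loss.backward(); opt.step()
            return loss.item()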

  • @sushilkhadka8069 (3 months ago)

    So the vector representation before the projection layer would contain fine-grained information about the input, and the post-projection vector would contain more semantic meaning?

  • @AdamRaudonis (4 years ago)

    These are my favorite videos! Love your humor as well. Keep it up!

  • @PhucLe-qs7nx (4 years ago)

    This can be seen, following Yann LeCun's talk, as a method in the "Regularized Latent Variable Energy-Based Model" family instead of the "Contrastive Energy-Based Model" family. In the contrastive setting, you need positive and negative samples to know where to push your energy function up and down. This method doesn't need negative samples because it's implicitly constrained to have the same amount of energy as the "target network", only gradually shifted around by the positive samples. So I agree with Yannic, this must rely on some kind of balance in the initialization for it to work.

  • @YannicKilcher (4 years ago)

    Very good observation

  • @RaviAnnaswamy (3 years ago)

    This is a deeper principle: it is much easier to learn from what is working than from what is NOT correct, but it is slower.

  • @actualBIAS (1 year ago)

    Incredible. Your channel is a true goldmine. Thanks for the content!

  • @clara4338 (4 years ago)

    Awesome explanations, thank you for the video!!

  • @alifarrokh9863 (1 year ago)

    Hi Yannic, I really appreciate what you do. Thanks for the great videos.

  • @bensas42 (2 years ago)

    This video was amazing. Thank you so much!!

  • @ayankashyap5379 (4 years ago)

    Thank you for all your videos

  • @ensabinha (3 months ago)

    16:20 Using the projection at the end allows the encoder to learn and capture more general and robust features that are beneficial for various tasks, while the projection, closely tied to the specific contrastive learning task, focuses on more task-specific features.

  • @florianhonicke5448 (4 years ago)

    Great summary!

  • @dorbarber1 (1 year ago)

    Hi Yannic, thanks for your great surveys of the best DL papers, I really enjoy them. Regarding your comments on the necessity of the projection: it is better explained in the SimCLR paper (by Hinton's group), which was published around the same time as this paper. In short, they showed empirically that the projection does help. The reason behind the idea is that you would like the "z" representation to be agnostic to the augmentation, but you don't need the "y" representation to be agnostic to it. You would want the "y" representation to still separate the augmentations, so that it is easy to "get rid of" this information in the projection stage. The difference between y and z is not only the dimensionality: z is used for the loss calculation, and y is the representation you will eventually use in the downstream task.

  • @thak456 (4 years ago)

    love your honesty....

  • @RohitKumarSingh25 (4 years ago)

    I think the idea behind the projection layer might be to avoid the curse of dimensionality in the loss function. In a low-dimensional space, distances between points are more discriminative, so that might help the L2 norm. For why there is no collapse, no idea. 😅

  • @LidoList (2 years ago)

    Super simple, all credit goes to the speaker. Thanks bro ))))

  • @StefanausBC (4 years ago)

    Hey Yannic, thank you for all these great videos. Always enjoying them. Yesterday, ImageGPT got published; I would be interested to get your feedback on that one too!

  • @siyn007 (4 years ago)

    The broader impact rant is hilarious

  • @veedrac (4 years ago)

    What's wrong with it though? The paper has no clear broader ethical impacts beyond those generally applicable to the field, so it makes sense that they've not said anything concrete specific to their work.

  • @YannicKilcher (4 years ago)

    Nothing is wrong. It's just that it's pointless, which was my entire point about broader impact statements.

  • @veedrac (4 years ago)

    ​@@YannicKilcher If we only had papers like these, sure, but not every paper is neutral-neutral alignment.

  • @theodorosgalanos9663 (4 years ago)

    The 'broader impact Turing test': if a paper requires a broader impact statement for readers to see its impact outside of the context of the paper, then it doesn't really have one. The broader impact of your work should be immediately evident after the abstract and introduction, where the real problem you're trying to solve is described. Most papers in ML nowadays don't really do that; real problems are tough, and they don't let you set a new SOTA on ImageNet. P.S. As a practitioner in another domain, I admit my bias towards real-life applications in this rant; I'm not voting for practice over research, I'm just voting against separating them.

  • @softerseltzer (3 years ago)

    Probably it's a field required by a journal.

  • @ai_station_fa (2 years ago)

    Thanks again!

  • @sushilkhadka8069 (3 months ago)

    Okay, for those of you wondering how it avoids model collapse (learning a constant representation for all inputs) when there is no contrastive setting in the approach: it's because of the batch normalization layers and the EMA update of the target network. Both BN and EMA encourage variability in the model's representations.
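
    As a toy illustration of the batch-norm part of this argument (my own sketch, not from the paper): in training mode, BatchNorm normalizes each feature with statistics computed over the whole batch, so the output for a given sample depends on the other samples in the batch, which acts as an implicit comparison between samples. The EMA part is the slow target update discussed elsewhere in the thread.

        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        bn = nn.BatchNorm1d(4)                       # toy 4-dimensional feature
        bn.train()                                   # training mode: normalize with batch statistics

        x = torch.randn(1, 4)                        # one fixed sample
        batch_a = torch.cat([x, torch.randn(7, 4)])  # same sample with one set of "neighbors"
        batch_b = torch.cat([x, torch.randn(7, 4)])  # same sample with different "neighbors"

        print(bn(batch_a)[0])                        # output for x in batch A
        print(bn(batch_b)[0])                        # differs: x's output depends on the rest of the batch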

  • @NewTume123 (3 years ago)

    Regarding the projection head: the authors of the SimCLR/SupCon papers showed that you get better results by using a projection head (you just need to remove it for later fine-tuning). That's why in MoCo v2 they also started applying a projection head.

  • @teslaonly2136 (4 years ago)

    Nice!

  • @Phobos11 (4 years ago)

    Self-supervised week 👏

  • @jwstolk (4 years ago)

    8 hours on 512 TPUs and it still doesn't end up with a constant output. I'd say it works.

  • @daniekpo (3 years ago)

    When I first saw the authors, I wondered what DeepMind was doing with computer vision (this is maybe the first vision paper that I've seen from them), but after reading the paper it totally made sense coz they use the same idea as DQN and some other reinforcement learning algorithms.

  • @Marcos10PT (4 years ago)

    Damn Yannick you are fast! Soon these videos will start being released before the papers 😂

  • @revimfadli4666 (4 years ago)

    Perhaps there will be a double event?

  • @youngjin8300 (4 years ago)

    You rolling my friend rolling!

  • @JTMoustache (4 years ago)

    For the projection, the parameter count goes from O(Y × N) to O(Y × Z + Z × N)... maybe that is the reason? Of course, that only makes sense if YN > ZY + ZN, i.e. N(Y − Z) > ZY, i.e. N > ZY / (Y − Z), with Z < Y.

  • @behzadbozorgtabar9413 (4 years ago)

    It is very similar to MoCo v2; the only differences are using a deeper network (the prediction network, which I guess the gain comes from) and an MSE loss instead of the InfoNCE loss.

  • @cameron4814 (4 years ago)

    This video is better than the paper. I'm going to try to reimplement this based only on the video.

  • @YannicKilcher (4 years ago)

    Famous last words

  • @franklinyao7597 (3 years ago)

    Yannic, at 18:01 you said there is no need for a projection. My answer is that maybe the intermediate representation is better than the final representation, because earlier layers learn universal representations while final layers give you more specific representations for specific tasks.

  • @Phobos11 (4 years ago)

    This configuration of the two encoders sounds exactly like stochastic weight averaging, only that the online and sliding-window parameters are both being used actively 🤔. From SWA, the second encoder should have a wider minimum, helping it generalize better.

  • @Schrammletts (3 years ago)

    I strongly suspect that normalization is a key part of making this work. If you had a batchnorm on the last layer, then you couldn't map everything to 0, because the output must be mean 0 variance 1. So you have to find the mean 0 variance 1 representation that's invariant to the data augmentations, which starts to sound much more like a good representation

  • @daisukelab (3 years ago)

    Here we have one mystery partly confirmed: this nice article shows that the BN in the MLP has an effect like a contrastive loss: untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html And the paper: arxiv.org/abs/2010.00578

  • @asadek100 (4 years ago)

    Hi Yannic, can you make a video explaining Graph Neural Networks?

  • @roman6575 (3 years ago)

    Around the 17:30 mark: I think the idea behind using the projection MLP 'g' was to make the encoder 'f' learn more semantic features by using higher-level features for the contrastive loss.

  • @WaldoCampos1 (1 year ago)

    The idea of using a projection comes from SimCLR's architecture. In their paper they proved that it improves the quality of the representation.

  • @adamrak7560 (3 years ago)

    You are right that there are no strong guarantees stopping it from reaching the global minimum, which is degenerate. But the way they train it, with the averaged version, creates a _massive_ mountain around the degenerate solution. Gradients will point away from the degenerate case unless you are already too close, because of the "target" model. The degenerate case is only a _local_ attractor (and the good solutions are local attractors too, but more numerous). This means that unless the step size is too big, this network should robustly converge to an adequate solution instead of the degenerate one. In the original setup, where you differentiate through both paths, you always get a gradient which points somewhat towards the degenerate solution, because you use the same network in the two paths and sum the gradients: in a local linear approximation this will tend towards the degenerate case. There, the degenerate case is a _global_ attractor (and the good solutions are only local attractors).

  • @moormanjean5636 (1 year ago)

    Very well explained!!

  • @d13gg0 (2 years ago)

    @Yannic I think the projections are important because otherwise the representations are too sparse to calculate a useful loss.

  • @herp_derpingson (4 years ago)

    00:00 I wonder if there is a correlation between the number of authors and the GPU/TPU hours used. I don't really see the novelty in this paper. They are just flexing their GPU budget on us.

  • @YannicKilcher (4 years ago)

    No, I think it's more like how many people took 5 minutes to make a spurious comment on the design doc at the beginning of the project.

  • @anthonybell8512 (3 years ago)

    Couldn't you predict the distance between two patches to prevent the network from degenerating to the constant representation c? This seems like it would help the model even more for learning representations than a yes/no label because it would also encode a notion of distance - i.e. I expect patches at opposite corners of the image to be less similar than patches right next to each other.

  • @dshlai (4 years ago)

    I just saw that leaderboard on Papers with Code.

  • @MH-km9gd (4 years ago)

    So it's basically iterative mean teacher with a symmetrical loss? But the projection layer is a neat thing. It would have been nice if they had shown it working on a single GPU.

  • @maxraphael1 (4 years ago)

    The idea of averaging weights in a teacher-student architecture was also used in this other paper: arxiv.org/abs/1703.01780

  • @jonatan01i (4 years ago)

    The target network's representation of an original image is a planet (moving slowly) that pulls the augmented versions' representations to itself. The other original images' planet representations are far away and only if there are many common things in them do they have a strong attraction towards each other.

  • @dermitdembrot3091 (3 years ago)

    Agreed! These representations form a cloud where there is a tiny pull of gravity toward the center. This pull is however insignificant compared to the forces acting locally. It would probably take a whole lot more training for a collapse to materialize.

  • @dermitdembrot3091 (3 years ago)

    I suspect that collapse would first happen locally so that it would be interesting to test whether this method has problems encoding fine differences between similar images.

  • @marat61 (4 years ago)

    If you compare against the representation y', which is obtained using an exponential mean of the past parameters, the new parameters will stay as close to the past ones as possible in spite of all possible augmentations. So the model is indeed forced to stay in the same spot and to move only in the direction of ignoring the augmentations.

  • @darkmythos4457 (4 years ago)

    Using stable targets from an EMA model is not really new; it is used a lot in semi-supervised learning, for example in the Mean Teacher paper. As for the projection, SimCLR explains why it is important.

  • @DanielHesslow (4 years ago)

    So if I understand it correctly, the projection is useful because we train the representations to be invariant to data augmentations; if we instead use the layer before the projection as the representation, we can also keep information that does vary under data augmentation, which may be useful in downstream tasks? In SimCLR they also show that the output size of the projection is largely irrelevant. However, for this paper I wonder if there is not a point to projecting down to a lower dimension, in that it would increase the likelihood of getting stuck in a local minimum and not degenerating to constant representations? Although 256 dimensions is still fairly large, so maybe that doesn't play a role.

  • @gopietz (4 years ago)

    I still see his argument because to my understanding the q should be able to replace the projection step and do the same thing.

  • @DanielHesslow (4 years ago)

    @@gopietz The prediction network, q, is only applied to the online network, so if we were to yank out the projection network our embeddings would need to be invariant to data augmentations, so I do think there is a point to have it there.

  • @tsunamidestructor (4 years ago)

    Hi Yannic, just dropping in (as usual) to say that I love your content! I was just wondering - what h/w and/or s/w do you use to draw around the PDFs?

  • @YannicKilcher (4 years ago)

    Thanks :) I use OneNote

  • @citiblocsMaster (4 years ago)

    33:40 Yeah let me just spin up my 512 TPUv3 cluster real quick

  • @mohammadyahya78 (1 year ago)

    Thank you very much. Can this be used with NLP instead of images?

  • @krystalp9856 (3 years ago)

    That broader impact section seems like an undergrad wrote it lol. I would know cuz that's exactly how I write reports for projects in class.

  • @bzqp2 (3 years ago)

    29:36 All the performance gains are from the seed=1337

  • @jasdeepsinghgrover2470 (4 years ago)

    I guess it's because completely ignoring the input is kind of hard when there is a separate architecture that is tracking the parameters. Forgetting the input would be a huge change from the initial parameters, so it should increase the loss between the architectures and thus could be stopped.

  • @Ma2rten (3 years ago)

    I'm not sure if I can put a link on YouTube, but if you google "Understanding self-supervised and contrastive learning with Bootstrap Your Own Latent", someone figured out why it works and doesn't produce the degenerate solution.

  • @lamdang1032 (4 years ago)

    My 2 cents on why collapse doesn't happen: for a ResNet, collapse means one of the final projection layers must have its weights all zeroed. This is very difficult to reach, since it is only one global minimum vs. an infinite number of local minima. Also, the moving average helps, because for the weights to be completely zeroed, the moving average must also be zeroed, which means the weights would have to be zero for a lot of iterations. Mode collapse may have happened partially in the experiment where they remove the moving average.

  • @sushilkhadka8069 (2 months ago)

    The model can collapse when it spits out the same vector representation for all inputs, i.e. a constant vector, not only a vector of zeros. In the video @yannic simply gives the zero vector as an example.

  • @lamdang1032 (1 month ago)

    @@sushilkhadka8069 Please read my comment again, I am talking about zero weights, not activations. For the activation to be constant, W must be 0, and therefore the activation equals the bias term.

  • @daisukelab (4 years ago)

    "if you construct augmentations smartly" ;)

  • @citiblocsMaster (4 years ago)

    31:12 I disagree slightly. I think the power of the representations comes from the fact that they throw an insane amount of compute at the problem. Approaches such as Adversarial AutoAugment (arxiv.org/abs/1912.11188) or AutoAugment more broadly show that it's possible to learn such augmentations.

  • @YannicKilcher (4 years ago)

    Yes in part, but I think a lot of papers show just how important the particulars of the augmentation procedure are

  • @korydonati8926 (4 years ago)

    I think you pronounce this "B-Y-O-L". It's a play on "B-Y-O-B" for "Bring your own booze". Basically this means to bring your own drinks to a party or gathering.

  • @qcpwned07 (4 years ago)

    kzread.info/dash/bejne/i4Sat8uIfby1dag.html I don't think the projection can be ignored here. The ResNet has been trained for a long while, so its weights are fairly stable; it would probably take a long time to nudge them in the right direction, and you would lose some of its capacity to represent the input set. By adding the projection, you have a new naive network to which you can apply a greater learning rate without fear of disrupting the structure of the pre-learned model. Basically, it serves somewhat the same function as the classification layer one would usually add for fine-tuning.

  • @3dsakr (4 years ago)

    Very interesting!! Thanks a lot for your videos; I just found your channel yesterday and I already watched like 5 videos :) I want your help in understanding just one thing that I feel lost on, which is: learning what a single neuron does. For example: you have an input JPG image of 800x600, passed to Conv2D layer 1, then Conv2D layer 2. The question is: how do you get the result of layer 1 (as an image), and how do you find out what each neuron is doing in that layer? (Same for layer 2.) Keep up the good work.

  • @YannicKilcher (4 years ago)

    That's a complicated question, especially in a convolutional network. Have a look at OpenAI Microscope.

  • @3dsakr (4 years ago)

    @@YannicKilcher Thanks, after some search I found this, github.com/tensorflow/lucid and this microscope.openai.com/models/inceptionv1/mixed5a_3x3_0/56 and this distill.pub/2020/circuits/early-vision/#conv2d1

  • @3dsakr (4 years ago)

    @@YannicKilcher and this, interpretablevision.github.io/

  • @eelcohoogendoorn8044 (4 years ago)

    One question that occurs to me is about the batch size. They show it matters less than for methods that mine negatives within the batch, obviously. But why does it matter at all? Just because of batch-norm-related concerns? There are some good solutions to this available nowadays, are there not? If I used GroupNorm, for instance, and my dataset was not too big / my patience sufficed (kinda the situation I am in), could I make this work on a single GPU with tiny batches? I don't see any reason why not. Just trying to get a sense of how important the bazillion TPUs are in the big picture.

  • @eelcohoogendoorn8044 (4 years ago)

    Ah right, they explicitly claim it's just the batch norm. Interesting; I would not expect it to matter that much with triple-digit batch sizes.

  • @YannicKilcher (4 years ago)

    It's also the fact that there are better negatives in larger batches

  • @eelcohoogendoorn8044 (4 years ago)

    Yeah, it matters a lot if you are mining within the batch; been there, done that. I'm actually surprised at how little it seems to matter for the negatives in their example. My situation is a very big, memory-hungry model in a hardware-constrained setting; I'm having to deal with batches of size 4 a lot of the time. Sadly they don't go that low in this paper, but if it's true that batch norm is the only bottleneck, that's very promising to me.

  • @eelcohoogendoorn8044 (4 years ago)

    ...I'm not sampling negatives from the batch if the batch size is 4, obviously. Memory banks work decently, but I also find them very tuning-sensitive; how tolerant is one supposed to be of 'staleness' of entries in the memory bank, etc.? Actually finding negatives of the right hardness consistently isn't easy. It's kind of interesting that the staleness of the memory bank entries can be a feature, if this paper is to be believed.

  • @AIology2022 (3 years ago)

    How is this different from a Siamese neural network? I don't see anything new.

  • (2 years ago)

    Wait, at 18:26, shouldn't it be q_theta(f_theta(v)) instead of q_theta(f_theta(z))? Anyway, great video!

  • @snippletrap (3 years ago)

    Yannic, it's probably pronounced bee-why-oh-ell, after BYOB, which stands for Bring Your Own Beer.

  • @004307ec (4 years ago)

    Off-policy design from RL?

  • @tuanphamanh7919 (4 years ago)

    Nice, wishing you a good day!

  • @Gabend (3 years ago)

    "Why is this exactly here? Probably simply because it works." This sums up the whole deep learning paradigm.

  • @wagbagsag (4 years ago)

    Why wouldn't the q(z) network (prediction of the target embedding) just be the identity? q(z) doesn't know which transformations were applied, so the only thing that differs consistently between the two embeddings is that one was produced by a target network whose weights are filtered by an exponential moving average. I would think that the expected difference between a signal and its EMA-filtered target should be zero, since EMA is an unbiased filter...?
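
    A tiny numeric sketch of why that intuition can fail (my own illustration): while the online parameters keep drifting in one direction during training, their exponential moving average stays systematically behind them, so the target is not an unbiased estimate of the current parameters.

        # toy check: the EMA of a drifting signal lags behind it
        tau = 0.99
        theta, ema = 0.0, 0.0
        for step in range(1000):
            theta += 0.01                # online parameters keep moving in one direction
            ema = tau * ema + (1 - tau) * theta
        print(theta, ema)                # ema trails theta by roughly 0.01 * tau / (1 - tau)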

  • @YannicKilcher (4 years ago)

    I don't think EMA is an unbiased filter. I mean, I'm happy to be convinced otherwise, but it's just my first intuition.

  • @wagbagsag (4 years ago)

    @@YannicKilcher I guess the EMA-filtered params lag the current params, and in general that means they should have higher loss than the current params. So the q network is basically learning which direction is uphill on the loss surface? I agree that it's not at all clear why this avoids mode collapse.

  • @YannicKilcher (4 years ago)

    @@wagbagsag yes that sounds right. And yes, it's a mystery to me too 😁

  • @mlworks (2 years ago)

    It does sound a little like GANs, except instead of random noise and a ground-truth image, we have two variants of the same image learning from each other. Am I misinterpreting it completely?

  • @sushilkhadka8069 (2 months ago)

    @mlworks Found you in a comment section!! lol

  • @bzqp2 (3 years ago)

    10:30 The local minimum idea doesn't make too much sense with so many dimensions. Without any preventive mechanism it would always collapse to a constant. It's really hard for me to believe it's just an initialization balance.

  • @user-rh8hi4ph4b (2 years ago)

    I think the different "target parameters" are crucial in preventing collapse here. As long as the target parameters are different from the online parameters, it would be almost impossible for both online and target parameters to produce the same constant output, _unless_ the model is fully converged (as in the parameters no longer change, causing the online and target parameters to become the same over time). So one might argue that the global minimum of "it's just the same constant output regardless of input" doesn't exist, since that behavior could never yield a minimal loss because of the difference in online and target parameters. If that difference is large enough, such as with a very high momentum in the update of the target parameters, the loss of the trivial/collapsed behavior might even be worse than that of non-trivial behaviors, preventing collapse that way.

  • @sushilkhadka8069 (2 months ago)

    @@user-rh8hi4ph4b The same idea hit me when I asked myself why this would prevent model collapse. Very good observation, thanks for sharing!

  • @grafzhl (4 years ago)

    16:41 4092 is my favorite power of 2 😏

  • @YannicKilcher (4 years ago)

    Haha I realized while recording 😁

  • @MariuszWoloszyn (4 years ago)

    Actually it's the other way around. Transformers are inspired by transfer learning on image recognition models.

  • @Agrover112 (4 years ago)

    By the time I enter a Master's, DL will be so fucking different.

  • @tshev (3 years ago)

    It would be interesting to see what numbers you can get with a 3090 in one week.

  • @CristianGarcia (4 years ago)

    I asked the question about the mode collapse, and a researcher friend pointed me to Mean Teacher (awesome name), where they also do exponential averaging; they might have some insights into why it works: arxiv.org/abs/1703.01780

  • @theodorosgalanos9663 (4 years ago)

    Alternate name idea: BYOTPUs

  • @haizhoushi8287 (3 years ago)

    short for Buy Your Own TPUs

  • @dippatel1739 (4 years ago)

    As I said earlier. Label exists. Augmentation: I am about to end this man‘s career.

  • @BPTScalpel (4 years ago)

    I respectfully disagree about the pseudocode release. With it, it took me half a day to implement the paper, while it took me way more time to replicate SimCLR, because the codebase the SimCLR authors published was disgusting, to say the least. One thing that really grinds my gears though is that they (deliberately?) ignored the semi-supervised literature.

  • @ulm287 (4 years ago)

    Are you able to reproduce the results? I would imagine 512 TPUs must cost a lot of money, or do you just let it run for days? My main concern, like in the video, is why it doesn't collapse to a constant representation... From a theoretical perspective, you are literally exploiting the problem of not being able to optimize a NN, which is weird. If they use their loss, for example, as validation, then that would mean they are not cross-validating over the init, as a constant zero init would give 0 loss... Did you find any hacks they used to avoid this? Like "anti-"weight decay or so lol

  • @BPTScalpel (4 years ago)

    @@ulm287 I have access to a lot of compute for free thanks to the French government. Right now, we have replicated MoCo, MoCo v2 and SimCLR with our codebase, and BYOL is currently running. I think the main reason it works is that, because of the EMA and the different views, the collapse is very unlikely: while it seems obvious to us that outputting constant features minimises the loss, this solution is very hard to find because of the EMA updates. To the network, it's just one of the solutions, and not even an easy one. The reason it learns anything at all is that there is already some very weak signal in a randomly initialised network, signal that they refine over time. Some patterns are so obvious that even a randomly initialised target network with a modern architecture will output features that contain information related to them (1.4% accuracy on ImageNet with random features). The online network will pick this signal up and refine it to make it robust to the different views (18.4% accuracy on ImageNet with BYOL with a random target network). Now you can replace the target network with this trained online network and start again. The idea is that this new target network will be able to pick up signals from more patterns than the original random one. You iterate this process many times until convergence. That's basically what they are doing, but instead of waiting for the online network to converge before updating the target network, they maintain an EMA of it.

  • @eelcohoogendoorn8044 (4 years ago)

    @@BPTScalpel Yeah, it sort of clicks, but I have to think about this a little longer before I convince myself I really understand it. Curious to hear if your replication code works; if it does, I'd love to see a GitHub link. Not sure which is better: really clean pseudocode or trash real code. But why not both? Having messed around with quite a lot of negative mining schemes, I know how fragile and prone to collapse that is for the type of problems I've worked on, and what I like about this method is not so much the 1 or 2% this or that, but that it might (emphasis on might) do away with all that tuning. So yeah, pretty cool stuff; interesting applications, interesting conceptually... Perhaps the only downside is that it may not follow a super efficient path to its end state and may require quite some training time. But I can live with that for my applications.

  • @YannicKilcher (4 years ago)

    I'm not against releasing pseudocode. But why can't they just release the actual code?

  • @BPTScalpel (4 years ago)

    @@YannicKilcher My best guess is that the actual code contains a lot of boilerplate code to make it work on their amazing infra and they could not be bothered to clean it =P

  • @larrybird3729 (4 years ago)

    DeepMind's next paper: We solved general intelligence === Here is the code === def general_intelligence(information): return "intelligence"

  • @RickeyBowers (4 years ago)

    Augmentations work to focus the network - it's too great an oversimplification to say the algorithm learns to ignore the augmentations, imho.

  • @sui-chan.wa.kyou.mo.chiisai (4 years ago)

    self-supervised is hot

  • @sphereron (4 years ago)

    First

  • @amitsinghrawat8483 (2 years ago)

    What the hell is this, why am I here?
