Exploring Simple Siamese Representation Learning

Science & Technology

What makes contrastive learning work so well? This paper highlights the contribution of the Siamese architecture as a complement to data augmentation, and shows that a Siamese network plus a stop-gradient operation on the target branch of the encoder is all you need for strong self-supervised representation learning. The paper also presents an interesting k-means-style explanation of the optimization problem this kind of self-supervised learning implicitly solves. Thanks for watching! Please subscribe!
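For a concrete picture of the stop-gradient idea, here is a minimal PyTorch sketch of the SimSiam objective, loosely following the pseudocode in the paper; the encoder f (backbone + projection MLP) and predictor h are assumed to be defined elsewhere:

```python
import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity with a stop-gradient on the target:
    # z.detach() treats z as a constant, so no gradient flows through that branch.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    # x1, x2: two augmented views of the same batch of images.
    z1, z2 = f(x1), f(x2)   # projections
    p1, p2 = h(z1), h(z2)   # predictions
    # Symmetrized loss; gradients reach the encoder only through the prediction branch.
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```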
Paper Links:
SimSiam: arxiv.org/pdf/2011.10566.pdf
SimCLR: arxiv.org/pdf/2002.05709.pdf
MoCo: arxiv.org/pdf/1911.05722.pdf
SwAV: arxiv.org/pdf/2006.09882.pdf
BYOL: arxiv.org/pdf/2006.07733.pdf
kMeans: stanford.edu/~cpiech/cs221/ha...
(Not mentioned in video, but here's an interesting tutorial on coding siamese nets: www.pyimagesearch.com/2020/11...)
Chapters
0:00 Introduction
1:16 SimSiam Architecture
2:42 Representation Collapse
3:41 SimCLR
4:50 MoCo
5:54 BYOL
7:00 SwAV (Clustering)
8:20 Unifying View
9:33 k-Means optimization problem
11:06 Results
12:20 Key Takeaways

Comments: 23

  • @freddiekalaitzis5708 · 3 years ago

    Here's my handwavy take on this: the stop-grad prevents any information about the raw x-aug-2 from being accounted for when updating the encoder. The encoder update only accounts for x-aug-1 and the distance of z-aug-1 (its projected encoding) to the encodings of other augmentations (of the same x). This means the burden now falls on the prediction head to accommodate all the ways augmentations can manifest in the latent space, and to stay as close as possible to all of them. To do this, the prediction head will converge to predicting the mean of a cluster, where the cluster centroid is (hopefully) invariant to the set of augmentations. If you remove the stop-grad, that would be like allowing the data in k-means to be shifted, which would collapse all the data to a single point. In summary, the prediction head plays the role of predicting the cluster centroid the image belongs to. Hence, the encoder is learned through cluster centroids, not through the "raw" encodings themselves. This is the bottleneck. This is why no collapse occurs. Under this scheme, there is no need for an explicit memory of past Siamese-copy encodings through momentum, because the memory is implicitly stored in the cluster centroid.
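    To make the k-means analogy above concrete, here is a small, purely illustrative NumPy sketch (not from the paper or the video): each sub-step is solved with the other variable held fixed, which is roughly the role the stop-gradient plays.

```python
import numpy as np

def kmeans_step(X, centroids):
    # Assignment step: attach each point to its nearest centroid,
    # holding the centroids fixed (the "stop-gradient" side of the analogy).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assignments = dists.argmin(axis=1)
    # Update step: recompute each centroid from its assigned points,
    # now holding the assignments fixed.
    new_centroids = np.array([
        X[assignments == k].mean(axis=0) if np.any(assignments == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return assignments, new_centroids
```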

  • @connorshorten6311 · 3 years ago

    Thank you for the insight!

  • @channel_panel193 · 3 years ago

    This comment is the best insight into this model that I've found anywhere. I'm floored. Thanks.

  • @raphaels2103 · 1 year ago

    I still do not understand. It seems to me that collapse is still possible by choosing a constant encoder and an identity predictor h.

  • @anirudhkoul1659 · 3 years ago

    +1 Excellent and timely explanation as usual. I have shared this with several students. Keep up the mission of making AI literature accessible!

  • @lucidraisin · 3 years ago

    Great video as always Henry!

  • @connorshorten6311 · 3 years ago

    Haha, thank you so much!

  • @LukosPC · 3 years ago

    Neat, the recap of previous work was very useful!

  • @Omar-jn9zf · 3 years ago

    Thanks VERY much!! Clear and concise as always.

  • @connorshorten6311 · 3 years ago

    Glad it was helpful, thank you so much!

  • @masakikozuki3214 · 3 years ago

    Awesome video, as always. Thank you.

  • @connorshorten6311 · 3 years ago

    Thank you so much!

  • @giholste123 · 3 years ago

    The audio is really quiet for me. Otherwise, nice video!

  • @connorshorten6311 · 3 years ago

    Sorry about that, I'll try to fix it going forward. Thank you!

  • @arjunmajumdar6874 · 6 months ago

    Excellent video. Just a comment: the ImageNet images are (224, 224, 3), not (128, 128, 3) 😊

  • @megadero8407 · 3 years ago

    Simple and powerful architecture

  • @megadero8407 · 3 years ago

    You have really cool videos.

  • @connorshorten6311 · 3 years ago

    Thank you so much!

  • @user-bi1gf9kg7u · 2 years ago

    I need the PowerPoint, please.

  • @Faith-td8vf · 1 year ago

    Could you please share the PPT with me?
