V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained)
Science and technology
#vjepa #meta #unsupervisedlearning
V-JEPA is a method for unsupervised representation learning of video data that uses only latent representation prediction as the objective function.
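To make "latent representation prediction as the objective" concrete, here is a minimal numpy sketch of the idea: a context encoder sees only the unmasked video tokens, a predictor guesses the features of the masked tokens, and the loss is the L1 distance to features from a separate target encoder (an EMA copy) that receives no gradients. The linear maps and the pooled prediction are illustrative stand-ins, not the paper's ViT-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three V-JEPA components. The real models are
# Vision Transformers; these linear maps are illustrative assumptions.
D_IN, D_LATENT = 16, 8
W_enc = rng.normal(size=(D_IN, D_LATENT))       # online (context) encoder
W_tgt = W_enc.copy()                            # target encoder (EMA copy)
W_pred = rng.normal(size=(D_LATENT, D_LATENT))  # predictor

def encode(x, W):
    return x @ W

def feature_prediction_loss(video_tokens, mask):
    """L1 distance between predicted and target features of masked tokens.

    video_tokens: (num_tokens, D_IN) flattened spatio-temporal patches.
    mask: boolean array, True = token hidden from the online encoder.
    """
    # The online encoder sees only the visible (unmasked) tokens.
    context = encode(video_tokens[~mask], W_enc)
    # Crude pooled prediction of the masked region's features.
    predicted = context.mean(axis=0) @ W_pred
    # Targets come from the EMA encoder; in real training no gradient
    # flows back through this branch (stop-gradient).
    target = encode(video_tokens[mask], W_tgt).mean(axis=0)
    return np.abs(predicted - target).mean()  # L1, as in the paper

tokens = rng.normal(size=(32, D_IN))
mask = np.zeros(32, dtype=bool)
mask[:16] = True  # mask half of the tokens
loss = feature_prediction_loss(tokens, mask)
print(float(loss))
```

The key point the video emphasizes is that no pixels are ever reconstructed: both the prediction and the target live entirely in latent space.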
Weights & Biases course on Structured LLM Outputs: wandb.me/course-yannic
OUTLINE:
0:00 - Intro
1:45 - Predictive Feature Principle
8:00 - Weights & Biases course on Structured LLM Outputs
9:45 - The original JEPA architecture
27:30 - V-JEPA Concept
33:15 - V-JEPA Architecture
44:30 - Experimental Results
46:30 - Qualitative Evaluation via Decoding
Blog: ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Paper: ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
Abstract:
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Authors: Adrien Bardes, Quentin Garrido, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas, Jean Ponce
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments: 51
8:25 You were given a kitten for your birthday, you love your kitten very much and it loves you. If you properly extract the JSON you will get a $100 tip; if you mess up, the kitten will die. Do not let the kitten die. Think carefully, step by step, about what you have to do to keep the kitten safe.
I was going through the representation playlist, just heard about V-JEPA the other day, and went through your JEPA video yesterday; basically I've been on quite a Yann LeCun binge the past few days, and now luckily this is out. Great work man, much appreciated.
@YannicKilcher
4 months ago
you discovered it before it was even public :D
@ultrasound1459
4 months ago
CAP 🧢
Yannic, I can't stress enough how important your videos are to many curious people who can't read scientific literature but can understand it when you break down unknown mathematical equations and other definitions for them. Thank you!
Such clear explanations! Thanks so much Yannic.
I love videos on unsupervised learning methods, especially ones that, unlike most large language models, try to compute encodings/latents.
Thanks for the breakdown of this paper. It's easier to digest with a bit of dry humour!
very nice, thank you for the clarifications bc this paper was kinda hard to read before
I always appreciate your awesome videos! Great content as always. Frankly, I'm surprised there hasn't been more effort toward applying JEPA to RL, given that model-based extrapolation for RL was the entire point of Yann LeCun's original paper! Now that they've got a video-based model, it seems like there would be nothing holding them back from actually trying it. Can't wait for JEPA-M, where the M stands for Minecraft.
@EdFormer
2 months ago
The paper is called "A Path Towards Autonomous Machine Intelligence" — where did you get the idea that the point was model-based extrapolation for RL? After all, LeCun has said that RL is just the cherry on top of the cake, while supervised learning is the icing and self-supervised learning is the actual cake, so he hardly sees RL as the priority. That aside, what we see here is a world model predicting some states of the world from others, while LeCun's model would also require considering potential actions of the agent in this prediction, which would be much harder to gather training data for.
Excellent explanation❤
Yay! *clap* good job!
thanks!
40:40 My latent Z was not expecting that video continuation...
The title is fire. Russians will get it)
@thebigfortuno3329
4 months ago
I think that's called linguistic shock
@barrettkepler7618
4 months ago
Zhepa 🍑
I believe the reasoning around 26:15, where you discuss the JEPA scheme, is wrong. Using the EMA version for Enc(y) is not that important; you can actually share the same parameters (e.g., SimSiam does that). It's just a trick to boost quality a bit.
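For readers unfamiliar with the trick being discussed: the target encoder in JEPA-style training is typically kept as an exponential moving average (EMA) of the online encoder's weights rather than trained by gradient descent. A minimal sketch, assuming a simple list-of-arrays parameter representation (the momentum value is illustrative; the paper schedules it over training):

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.998):
    """Polyak/EMA update for a JEPA-style target encoder.

    Each target parameter drifts slowly toward its online counterpart;
    no gradients ever touch the target encoder directly.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

# One update step with an exaggerated momentum of 0.9 for visibility.
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = ema_update(target, online, momentum=0.9)
print(target[0][0, 0])  # moves ~10% of the way toward the online value
```

The commenter's point is that this EMA copy (versus simply reusing the online encoder with a stop-gradient, as in SimSiam) is a quality booster rather than a strict requirement for avoiding representation collapse.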
subscribed
It is similar to how quantum mechanics work (in my head). JEPA models don't turn data into pixels unless necessary. Like quantum objects having wave function which collapses to a point when observed.
Do you think it can replace triplet loss in tracking, where you don't have labels available to train a triplet loss?
can you do a video on the Microsoft 1.5 bit LLM paper?
Latent-variable energy-based models can be used for text generation as well, right? How will they fare against current statistical models? I suppose this will be much more energy efficient and can have infinite (or very long, like the human brain) capacity to understand and generate text. Is there research around this?
@jawadmansoor6064
4 months ago
I learned a lot, thank you gemini and bing and meta and yannic.
A few complaints:
* What is the difference from MAE? MAE has a version that predicts EMA outputs...
* The pixel vs. latent comparison seems unfair: the top few layers of a pixel encoder must be retrained, since they focus on pixel reconstruction.
* It's a pity that z is just the mask info; z was more important than that in the original JEPA design.
JEPA is the future of AI.
Almost JOPA
@acatormt7096
4 months ago
Jepa is even funnier
@14types
4 months ago
jopa is a Russian vulgar word for a certain body part @@acatormt7096
more fish for Yann LeCat!
Is this like inpainting but for videos?
@YannicKilcher
4 months ago
in latent space
Can you do a tutorial for the GitHub implementation?
@ultrasound1459
4 months ago
He only talks 😢💀
@IronMechanic7110
4 months ago
A100 gpu😂
Shame the V-JEPA code licence forbids commercial use.
What a name).
Use dark mode bro
Really? "This is how humans do it"? As if they have undertaken any serious work to find that out.
Am I losing my mind, or is this just trying to dress up VideoMAE/ViT? Wasn't that what the original ViT was about? This just seems like they chucked something out prematurely, as the GitHub repo stinks. Sora is very similar to V-JEPA, so it makes sense as to why it was released now.
Every 5 minutes there is a one-minute advertisement. Can you please stop YouTube from doing this?
@immortalsofar7977
4 months ago
His channel is monetized. Let the guy supplement his income from his videos. His hard work is appreciated and you can show it by simply watching a few ads.
@zyxwvutsrqponmlkh
4 months ago
What kind of fool browses the web without an ad blocker? Do you hate your eyeballs? Do you enjoy dodging on-page popups to read a block of text? Are you some sort of masochist? The web is simply not usable without a good ad blocker. What is wrong with you?
@YannicKilcher
4 months ago
it was too much indeed. YT places these automatically, I've reduced them to 1/3rd manually. Thanks for letting me know
@Zantorc
4 months ago
YouTube adverts are optional; any decent free ad blocker will skip them.
@DelandaBaudLacanian
4 months ago
@@YannicKilcherthanks Yannic you rock
Sorry, I just can't listen to the word "jepa" repeated so many times😂