
Autoregressive Diffusion Models (Machine Learning Research Paper Explained)

#machinelearning #ardm #generativemodels
Diffusion models have made large advances in recent months as a new type of generative model. This paper introduces Autoregressive Diffusion Models (ARDMs), a mix of autoregressive generative models and diffusion models. ARDMs are trained to be agnostic to the order of autoregressive decoding and give the user a dynamic tradeoff between speed and performance at decoding time. The paper applies ARDMs to both text and image data, and as an extension, the models can also be used to perform lossless compression.
OUTLINE:
0:00 - Intro & Overview
3:15 - Decoding Order in Autoregressive Models
6:15 - Autoregressive Diffusion Models
8:35 - Dependent and Independent Sampling
14:25 - Application to Character-Level Language Models
18:15 - How Sampling & Training Works
26:05 - Extension 1: Parallel Sampling
29:20 - Extension 2: Depth Upscaling
33:10 - Conclusion & Comments
Paper: arxiv.org/abs/...
Abstract:
We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to highly-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.
Authors: Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
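
As a concrete illustration of the order-agnostic training objective described in the abstract, here is a minimal sketch in PyTorch-style Python. This is not the authors' code; the model(inp, revealed) interface returning per-position logits is a hypothetical stand-in. The idea: sample a step t and a random generation order, replace all not-yet-revealed positions with an absorbing (mask) token, and reweight the cross-entropy on the masked positions so the loss estimates the order-agnostic likelihood bound.

    import torch
    import torch.nn.functional as F

    def ardm_training_step(model, x, mask_token):
        # x: (batch, D) tensor of discrete tokens / pixel values.
        # model(inp, revealed): returns logits of shape (batch, D, num_classes).
        B, D = x.shape
        # Sample the generation step t uniformly from 1..D for each example.
        t = torch.randint(1, D + 1, (B,), device=x.device)
        # A random permutation per example decides which t-1 positions are already revealed.
        ranks = torch.rand(B, D, device=x.device).argsort(dim=1).argsort(dim=1)
        revealed = ranks < (t - 1).unsqueeze(1)
        # Unknown positions are replaced by the absorbing/mask token.
        inp = torch.where(revealed, x, torch.full_like(x, mask_token))
        logits = model(inp, revealed)
        # Cross-entropy only on the masked positions, averaged over them ...
        ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")
        masked = ~revealed
        ce = (ce * masked).sum(dim=1) / masked.sum(dim=1).clamp(min=1)
        # ... and scaled by D, giving an estimate of the full log-likelihood bound.
        return (D * ce).mean()
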
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-...
KZread: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.c...
Minds: www.minds.com/...
Parler: parler.com/pro...
LinkedIn: / ykilcher
BiliBili: space.bilibili...
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribes...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 31

  • @YannicKilcher (2 years ago)

    Discord link: discord.gg/4H8xxDF

  • @ryanbaten (2 years ago)

    Very similar to XLNet. If I remember correctly, it was also trained autoregressively and in a permuted order, similar to this. There were extra tricks that made it train in parallel more efficiently. The paper's authors claimed that the autoregressive training results in a better model and that they would have a V2 soon, but I haven't seen it. It seemed super impressive when it came out, but the idea also seems not to have stood the test of time, since just training the MLM models longer and on comparable amounts of data beat it performance-wise.

  • @Kram1032 (2 years ago)

    Oh I like this idea! Maybe the part where even the stuff that's already there is being predicted could be exploited to allow the generator to change its mind somehow, deleting/replacing some pixels to converge to something better overall. Could even be done on an already complete image. In fact that might be especially helpful for the text variant, so it can delete stuff that didn't work out after all.

  • @ChuanChihChou (2 years ago)

    8:50 BERT is actually also trained to correct some of the input tokens (15% of the token positions chosen * 10% of the time replaced with a random token = 1.5%). I suspect they can get better generation quality if they also allow such token correction.
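
For context on those numbers: in the standard BERT recipe, 15% of positions are selected as prediction targets; of those, 80% become [MASK], 10% are replaced with a random token, and 10% are left unchanged, so roughly 0.15 * 0.10 = 1.5% of all tokens are random tokens the model must correct. A minimal illustrative sketch (not BERT's actual code):

    import random

    def bert_corrupt(tokens, vocab, mask_token="[MASK]",
                     select_p=0.15, mask_p=0.8, random_p=0.1):
        # Returns the corrupted sequence and the positions the model must predict.
        corrupted, targets = list(tokens), []
        for i in range(len(tokens)):
            if random.random() < select_p:      # 15% of positions are targets
                targets.append(i)
                r = random.random()
                if r < mask_p:                  # 80% of targets -> [MASK]
                    corrupted[i] = mask_token
                elif r < mask_p + random_p:     # 10% of targets -> random token (~1.5% overall)
                    corrupted[i] = random.choice(vocab)
                # remaining 10% of targets: keep the original token unchanged
        return corrupted, targets
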

  • @SLAM2977 (2 years ago)

    Love these straight to the point honest opinions :)

  • @sujovian (5 months ago)

    The out of order discernment of ARDM seems really useful in efficient retrieval augmentation.

  • @nauman.mustafa (2 years ago)

    It is a really powerful model, and IMO we can specialize it to a much larger number of tasks compared to GPT or GANs, etc.

  • @CristianGarcia (2 years ago)

    Not sure if it's mentioned, but there is a tradeoff during training: autoregressive models like GPT can train over a complete sample all at once, whereas here you need to pass all possible masks for it to "learn" the sample, i.e. training could be slower.
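
For comparison, here is a minimal sketch of the standard left-to-right objective the comment refers to (PyTorch-style Python; the causally masked model is a hypothetical stand-in). One forward pass yields a prediction target at every position, whereas the order-agnostic objective sketched above supervises only the positions masked at the sampled step.

    import torch.nn.functional as F

    def causal_lm_loss(model, x):
        # x: (batch, D) token ids; model applies causal masking internally.
        logits = model(x[:, :-1])             # predict token i+1 from tokens <= i
        return F.cross_entropy(logits.transpose(1, 2), x[:, 1:])
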

  • @AlbertZeyer (2 years ago)

    Just a random idea on splitting the vocabulary (32:40): you could cluster the vocab. This has been done before for hierarchical softmax. So you could still use the same idea as is used for the discretized pixel-value classes.

  • @priancho (2 years ago)

    Watched twice and understood it ;-) Thanks for the video!

  • @sacramentofwilderness6656 (2 years ago)

    Thanks as always for the great content! I wonder whether it is possible to predict some optimal order of decoding. Like generating the important details of the image, sentence, or any other kind of data first (cats, dogs), and then refining the less important parts (the background). The important parts could serve as anchors for the generation.

  • @user-cp8uy9om7o (a year ago)

    Yannic you're a life saver

  • @SuperJg007 (2 years ago)

    best channel ever.

  • @AlbertZeyer (2 years ago)

    Why do you think that a model which is not restricted to left-to-right sampling would always be beaten by an auto-regressive model which is strictly left-to-right? Your argument was that the latter would focus very much on this specific task. But I also see multiple arguments the other way around: The arbitrary order could generalize better. And also, there are probably always better orders than left-to-right, and when the model can automatically use a better order, it could beat the strict left-to-right model.

  • @ssssssstssssssss (2 years ago)

    I did some research on this type of machine four years ago or so. Perhaps I should have stuck with it. The purpose was much better suited for this type of machine. I believe it is still being used in the software I integrated it into.

  • @herp_derpingson (2 years ago)

    10:40 This is like a regular transformer, but we are predicting more than one token at once and out of order. Or a BERT, but with multiple iterations.

    29:42 I wonder what would happen if, at each step, each generated output pixel had a probability of being overwritten. The model would then have the option to reconsider its own previous predictions now that it has more input.

    I would like to see how much the output quality degrades as you decrease the number of steps.

  • @YannicKilcher (2 years ago)

    Yes, I've seen multiple people already wondering about the possibility of the model being able to refine its outputs; very interesting idea!

  • @thegistofcalculus (2 years ago)

    Yes, overwriting is clearly intriguing, although stability becomes a concern again, and I wonder if the naive approaches would be incentivized to output something very close to the training samples.

  • @G12GilbertProduction (2 years ago)

    I bet it's a coincidence with the Bayesian autoencoder technique with multi-layer simultaneous differentials, something like zero-shot but in reverse.

  • @marouanemaachou7875 (2 years ago)

    It does remind me of denoising diffusion models, as BERT-like models are denoising autoencoders. Am I wrong?

  • @sarvagyagupta1744 (2 years ago)

    Why are we using a categorical distribution? We are trying to predict pixel values, which in this case are RGB values. So what categories are being used to get the pixel values?

  • @tripzero0 (2 years ago)

    Now can we make the diffusion model predict a codebook for a VQGAN?

  • @bluel1ng (2 years ago)

    Yes, it should definitely be possible to model the discrete latent code of a VQ-VAE with an ARDM. I guess the main advantage compared to VQ-GAN (which uses a classic ARM) would be the possibility of parallel decoding. Also, depending on the architecture, decoding of larger images might be possible (e.g. diffusion models frequently use a U-Net architecture with attention at its core).

  • @thomashirtz (2 years ago)

    TTP 13:25 .. Just kidding, nice video :)

  • @matthieulin335 (2 years ago)

    damn looks cool

  • @patf9770 (2 years ago)

    Been working on a similar idea for the greater part of the last year. Gotta be faster! See the wavefunctioncollapse procedural generation algorithm; it's simple yet incredibly powerful and works off the principle of generating the pixel that the "model" is the most "certain" about at each step: kzread.info/dash/bejne/ZIep2LFtd8ydpbw.html
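
A minimal sketch of that "most certain first" decoding order applied to an ARDM-style model (PyTorch-style Python; the model(x, revealed) interface returning per-position logits is a hypothetical stand-in, and this ordering is not what the paper uses by default):

    import torch

    @torch.no_grad()
    def decode_most_certain_first(model, D, mask_token):
        # Start from a fully masked sequence and, at each step, fill in the
        # still-masked position whose predictive distribution has the lowest entropy.
        x = torch.full((1, D), mask_token, dtype=torch.long)
        revealed = torch.zeros(1, D, dtype=torch.bool)
        for _ in range(D):
            probs = model(x, revealed).softmax(dim=-1)                 # (1, D, num_classes)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (1, D)
            entropy[revealed] = float("inf")                           # skip decoded positions
            i = int(entropy.argmin())                                  # most certain masked position
            x[0, i] = torch.multinomial(probs[0, i], 1).item()
            revealed[0, i] = True
        return x
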

  • @Gogargoat (2 years ago)

    It kind of works similarly to how, when the universe decides that a particle exists in one position (when it is observed), it's as if that sucks 1.0 of the mass from the probability density cloud. In the back of my mind I always kind of wondered how that worked and how that consistency is achieved, and I guess this decoding method is one way.

  • @billykotsos4642 (2 years ago)

    Not all languages are read from left to right

  • @herp_derpingson (2 years ago)

    You can just reverse it before feeding into the model and then reverse it back after generation.

  • @arturprzybysz6614 (2 years ago)

    @herp_derpingson Is it legal?