SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow

Science & Technology

The Lottery Ticket Hypothesis has shown that it's theoretically possible to prune a neural network at the beginning of training and still achieve good performance, if only we knew which weights to prune away. This paper not only explains where other attempts at pruning fail, but also provides an algorithm that provably reaches maximal critical compression, all without looking at any data!
OUTLINE:
0:00 - Intro & Overview
1:00 - Pruning Neural Networks
3:40 - Lottery Ticket Hypothesis
6:00 - Paper Story Overview
9:45 - Layer Collapse
18:15 - Synaptic Saliency Conservation
23:25 - Connecting Layer Collapse & Saliency Conservation
28:30 - Iterative Pruning avoids Layer Collapse
33:20 - The SynFlow Algorithm
40:45 - Experiments
43:35 - Conclusion & Comments
Paper: arxiv.org/abs/2006.05467
Code: github.com/ganguli-lab/Synapt...
My Video on the Lottery Ticket Hypothesis: • The Lottery Ticket Hyp...
Street Talk about LTH: • The Lottery Ticket Hyp...
Abstract:
Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy both during training and at test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design. We first mathematically formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a novel pruning algorithm Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization subject to a sparsity constraint. Notably, this algorithm makes no reference to the training data and consistently outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to 99.9 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm that data must be used to quantify which synapses are important.
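For readers who want the gist in code, here is a minimal PyTorch sketch of the SynFlow idea as described above: linearize the ReLU network by taking the absolute value of its weights, feed an all-ones input, score every parameter by |θ ⊙ ∂R/∂θ|, and prune iteratively under an exponential compression schedule. The function names, the schedule, and the choice to score all parameters are my own simplifications, not the authors' implementation (see the repository linked above for that), and BatchNorm details are glossed over.

    import torch

    @torch.no_grad()
    def _abs_params(model):
        # Replace every parameter by its absolute value, remembering the signs.
        signs = {name: torch.sign(p) for name, p in model.named_parameters()}
        for p in model.parameters():
            p.abs_()
        return signs

    @torch.no_grad()
    def _restore_params(model, signs):
        for name, p in model.named_parameters():
            p.mul_(signs[name])

    def synflow_scores(model, input_shape):
        # Data-free saliency: feed an all-ones input through the |W| network and
        # score each parameter as |theta * dR/dtheta|.
        signs = _abs_params(model)
        model.eval()                            # sketch: avoid BatchNorm running-stat updates
        model.zero_grad()
        ones = torch.ones(1, *input_shape)      # no training data involved
        R = model(ones).sum()                   # R = 1^T (prod_l |W_l|) 1 for a ReLU network
        R.backward()
        scores = {name: (p.grad * p).detach().abs()
                  for name, p in model.named_parameters()}
        _restore_params(model, signs)
        return scores

    def synflow_prune(model, input_shape, compression=100, n_iters=100):
        # Iteratively prune so that roughly 1/compression of the parameters survive,
        # following an exponential compression schedule over n_iters rounds.
        masks = {name: torch.ones_like(p) for name, p in model.named_parameters()}
        for k in range(1, n_iters + 1):
            with torch.no_grad():               # zero out already-pruned parameters
                for name, p in model.named_parameters():
                    p.mul_(masks[name])
            scores = synflow_scores(model, input_shape)
            flat = torch.cat([s.flatten() for s in scores.values()])
            keep = max(1, int(flat.numel() * compression ** (-k / n_iters)))
            threshold = torch.topk(flat, keep).values.min()
            masks = {name: (s >= threshold).float() for name, s in scores.items()}
        return masks

The returned masks would then be applied to the weights before training the sparse network as usual.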
Authors: Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 65

  • @srivatsabhargavajagarlapud2274
    4 years ago

Wow!! I mean, how do you keep up and sustain this rate of reading and producing these videos, man! I came across your channel just about a week and a half ago (while searching for explanations of DETR) and have since had a tough time just sifting through your very interesting videos, and picked up quite a bit along the way! You are indeed a role model! :) Thanks a lot for what you are doing! _/\_

  • @ProfessionalTycoons
    4 years ago

Amazing! The secret is slowly revealing itself.

  • @raghavjajodia1
    4 years ago

Wow, a video on this paper is out already? That was quick! Really well explained, keep it up 👍

  • @alabrrmrbmmr
    4 years ago

    Exceptional work once again!

  • @visualizedata6691
    4 years ago

    Simply Superb. Great work!

  • @mrityunjoypanday227
    4 years ago

This could help identify motifs for the Synthetic Petri Dish.

  • @herp_derpingson
    4 years ago

12:18 It's interesting how the graphs are sigmoid-shaped. I would expect the graph to start out flat-ish because of redundant connections and then fall linearly, but it seems to flatten out near the end; basically, it retains accuracy even as most of the parameters get pruned. It would be interesting to see what happens if we start adding parameters back and training this near-critical model again. Would it trace the same path back upward? Or would it do better? Or worse? . 38:38 Intuitively, this is equivalent to asking a node, "How much of the final output do you contribute to?". However, since we are taking absolute values: say there is a node computing 1x + 9999 that passes an activation of +9999 to the next node, which computes 1x - 9999. This score would rate both nodes highly, yet in reality they always negate each other and never contribute to the final output. Then again, checking interactions between neurons is practically intractable. . I really liked the data-less approach in this paper. I think it will inspire more papers to try similar things. Good stuff. . 43:30 IDK. A network initialized with all weights equal to 1 would make this algorithm go crazy with excitement. Let's see if we get a new family of network initialization policies.

  • @YannicKilcher
    4 years ago

- Yea, indeed. I think there are dozens of questions like this that could all turn out either way, and nobody knows. - Maybe here one can count on the Gaussian initializations to basically never make this happen, because it would be quite the coincidence. - I guess we're both waiting for the first generation of "adversarial initializations" to screw with the pruning methods :D

  • @Phobos11
    4 years ago

    Awesome! I was waiting for today’s paper 🤣

  • @stephenhough5763
    4 years ago

Amazing video presenting an amazing paper. AFAIK sparsity currently doesn't have much benefit in terms of most real-world performance gains (GPUs and TPUs), but that should start changing soon. 10x more layers with 90% of the weights pruned using SynFlow should greatly outperform while having a similar (final) parameter count.

  • @vijayabhaskarj3095
    4 years ago

Yannic, you need to tell us how you manage your time and how you manage to put out a video daily. For me, reading a paper takes 2-3 days to understand fully, let alone explain to others.

  • @rbain16
    4 years ago

    He's been publishing papers in ML since at least 2016, so he's had a bunch of practice. Plus he's lightning Yannic :D

  • @YannicKilcher
    4 years ago

    Once you get a bit of practice, it gets easier :)

  • @herp_derpingson
    4 years ago

    @@YannicKilcher Building the cache, so to speak :)

  • @Tehom1
    4 years ago

The problem looks so much like a max-flow or min-cost flow problem.

  • @EditorsCanPlay
    4 years ago

    duude, Do you ever take a break? haha, love it though!

  • @DeepGamingAI
    4 years ago

    SNIP SNAP SNIP SNAP SNIP SNAP! Do you know the toll that iterative pruning has on a neural network?

  • @YannicKilcher
    4 years ago

    Must be horrible :D

  • @shubhvachher4833
    3 years ago

    Priceless comment.

  • @siyn007
    4 years ago

What if you add a constraint so that pruning keeps at least some percentage (say 10%) of the connections in each layer, to prevent layer collapse? Edit: here's the answer at 33:50

  • @RohitKumarSingh25
    4 years ago

Yannic, thanks for making such videos, they really help a lot. :D I wanted to know: these pruning techniques are not going to improve the FLOPs of my model, right, because we are just masking the weights in order to prune? Or is there another way to reduce FLOPs?

  • @YannicKilcher
    4 years ago

    Yes, for now that's the case. But you could apply the same methods to induce block-sparsity, which you could then leverage to get to faster neural networks.

  • @RohitKumarSingh25
    4 years ago

@@YannicKilcher I'll look into it.

  • @wesleyolis
    4 years ago

I fully agree with your statement about models being inflated; compression will still have its place. An unavoidable thought: why not inject additional layers of weights into the weight matrix as required, so that you increase the spatial resolution/search space to capture the extra detail needed for a more accurate model? The other question is the best initialization pattern for the weight-matrix perturbations. I don't believe purely random weights are right; they work, but I don't think they are the best. One would want an even distribution of perturbations, to give backpropagation the best possible chance of enhancing the weights according to the forward propagation. Clearly, if we go to inflationary models, then the inflation perturbations should also be chosen more intelligently, not at random. The idea would be an iterative algorithm that builds the structure more intelligently by swapping out sections of weights for different weight structures, which would allow different types of mathematical relationships to be represented. A^b, or something like X^2 + Y^2 via a linear expansion formula, could be captured here. At one point I searched the internet for how to construct a set of weights to model X^B in a neural network and didn't find it, though I more or less worked it out a few days back from my understanding of a probability book. Strangely enough, if we inflate models with different sections of weights that resemble different mathematical relationships, then we would have more insight into the mathematics going on, as we could hold a parallel mathematical construction for the weight matrix (for however long it remains one). The next step, I guess, is improving how these matrices are abstracted for hardware computation: lots of empty spaces and missing weights, and the ability to restructure the weight matrix without accuracy loss, such that it stays equivalent, merging and splitting layers, so that one has dense weight matrices for hardware computation. With regards to the perturbation pattern, I think a matrix with incremental symmetry around the F0/FN upper and lower bounds and F(n/2), so that values increase incrementally by small equal amounts, would give better patterns in my mind than weights jumping randomly all around the matrix.

  • @YannicKilcher
    4 years ago

    That all seems predicated on these mathematical functions actually being the best kind of representations for these kinds of problems, which is a very strong hypothesis

  • @LouisChiaki
    4 years ago

Can we just add the new loss function to the original loss function of the model and train the original network with this pruning cost (and the one-shot data) included? Like pruning the model while training it.

  • @YannicKilcher
    4 years ago

    Yea I guess that's a possibility

  • @Kerrosene
    4 years ago

I was wondering why they use the Hadamard product with the parameters instead of the dR/d(theta) score alone as a metric for a parameter's effect on the loss. I understand that this alternative score won't obey the conservation theorem, but if the prime issue is avoiding layer collapse, could we just drop the conservation part and use this score in a way that prevents layer collapse (e.g., a contingency in the algorithm that avoids it, maybe using a local masking technique, which is subpar in performance, I know)? Has this been done? Any thoughts?

  • @YannicKilcher
    4 years ago

    True. I think the parameter magnitude itself might carry some meaningful information. Like, when it's close to 0, a large gradient on it shouldn't "mean" as much. But that's just my intuition

  • @AsmageddonPrince
    a year ago

I wonder if, instead of the all-ones datapoint, you could use a normalized average of all your training datapoints.

  • @zhaotan6163
    4 years ago

It works for images with a CNN as the first layer. How about an MLP with different features as inputs? It would be problematic for the first layer, i.e. feature selection, since the method never sees the data and has no idea which features are more important.

  • @YannicKilcher
    4 years ago

    I guess we'll have to try. But you'd leave all neurons there, just prune the connections.

  • @billymonday8388
    a year ago

    the algorithm essentially improves the gradient of a network to make it train better. It does not solve everything.

  • @Zantorc
    4 years ago

I wonder what pruning method the human brain uses. At birth, the number of synapses per neuron is 2,500 and grows to 15,000 by about age 2. From then on they get pruned, mostly between ages 2-10 but continuing at a slower rate until the late 20s. The adult brain only retains 50% of the synapses it had as a 2-year-old.

  • @NicheAsQuiche
    4 years ago

    This has been the most interesting part of the lottery ticket thing to me - it's amazing how many parallels there are between biological neurons and artificial ones. I think the lottery ticket hypothesis paper found good performance between 50% and 70% pruning

  • @bluel1ng
    4 years ago

I guess "what fires together wires together" is also a good intuition in the reverse sense for pruning. Like muscles, the body will likely also try to optimize the brain based on usage/functional relevance. But there is definitely some stability in the system, otherwise we would quickly lose all memories that are not recalled frequently. ;-)

  • @kpbrl
    4 years ago

Great video once again! Just one question: do you have a goal of making at least one video a day? I found this channel while searching whether anyone had had the idea of making videos out of "reading a paper". Now I have another idea. Will implement it soon and share it here. :)

  • @YannicKilcher
    4 years ago

    I try to make one every day, but I'll probably fail at some point

  • @jonassekamane
    4 years ago

So -- if I understood this correctly -- you would in principle be able to 1) take a huge model (which normally requires an entire datacenter to train), 2) prune it down to some reasonable size -- and presumably prune it on a relatively small computer, since the method does not use any data in the pruning process -- and 3) finally train the smaller pruned model to high accuracy (or SOTA given the network size), presumably also on a relatively small computer.

  • @jwstolk
    4 years ago

I think that would be correct, if training on a CPU. I don't know how current GPUs handle pruned networks or how much it benefits them. GPUs may need some additional hardware features to really benefit from using a pruned network.

  • @jonassekamane
    4 years ago

This method applied in reverse could also be quite interesting, i.e. for model search. Assuming the accuracy of a pruned network is reflective of the accuracy of the full network, you could use SynFlow to train and test various pruned models before scaling up the best-performing model and training that... but yes, new hardware might need to be developed.

  • @bluel1ng
    4 years ago

Nearly identical accuracy with 1% or even 0.1% of the weights at initialization? That is fascinating. What seems a bit mind-bending to me is the fact that this pruning can be done DATA-independently, only by feeding 1s through the network. Crazy. Maybe the future is poised to be sparse, and fully-connected initialization will become a thing of the past. ;-) If layer collapse (a.k.a. layer-dependent average synaptic saliency magnitude) is the problem: why not perform pruning layer-wise in general? How would the baseline methods perform if the pruning selection were done for each layer individually instead of sorting the scores for the network globally?

  • @YannicKilcher
    4 years ago

    in the paper they claim that layer-wise pruning gives much worse results

  • @bluel1ng
    4 years ago

@@YannicKilcher I see, they reference "What is the State of Neural Network Pruning?" arxiv.org/abs/2003.03033 ... maybe layer-wise (or fan-in/fan-out dependent) normalization of the saliency scores might be a way to compensate for the magnitude differences. ;-) Btw, the "linearization" trick they use for ReLUs (W.abs() and then passing a 1-vector) is nice ... for other activation functions this will probably require a bit more work.
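To make the linearization trick mentioned above concrete, here is a tiny toy check (an illustrative example, not taken from the paper or its repository): for a two-layer ReLU network with weights replaced by their absolute values and an all-ones input, R reduces to 1^T |W2| |W1| 1, and the gradients recover the in-flow/out-flow structure of each weight without touching any data.

    import torch

    W1 = torch.tensor([[0.5, -2.0],
                       [-1.0, 3.0]], requires_grad=True)
    W2 = torch.tensor([[4.0, -0.5]], requires_grad=True)

    x = torch.ones(2, 1)            # data-free all-ones input
    h = torch.relu(W1.abs() @ x)    # with |W|, every pre-activation is >= 0, so the ReLU passes everything through
    R = (W2.abs() @ h).sum()        # R = 1^T |W2| |W1| 1
    R.backward()

    # dR/dW1[i,j] = sign(W1[i,j]) * |W2[0,i]|, dR/dW2[0,i] = sign(W2[0,i]) * sum_j |W1[i,j]|
    print(W1.grad)                          # [[ 4.0, -4.0], [-0.5,  0.5]]
    print(W2.grad)                          # [[ 2.5, -4.0]]
    print((W1 * W1.grad).abs().detach())    # SynFlow scores for W1: [[2.0, 8.0], [0.5, 1.5]]

The score |θ ⊙ ∂R/∂θ| for, say, W1[0,1] is 2.0 x 4.0 = 8.0: the weight's own magnitude times the total absolute "flow" downstream of it.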

  • @MarkusBreitenbach
    4 years ago

How can this work for architectures like ResNet, which have bypass connections around layers, without looking at the data? They show results for ResNet in the paper, but somehow that doesn't make sense to me. Anybody know what I am missing?

  • @bluel1ng
    4 years ago

    You might take a look at their code at github.com/ganguli-lab/Synaptic-Flow . Why do you think that using the training-data would be required for dealing with the shortcut connections?

  • @YannicKilcher
    4 years ago

    I think the same analysis still applies, though you're right the interesting part in ResNets is their skip connection, so technically they never have to deal with layer collapse.

  • @robbiero368
    3 years ago

    Is it possible to iteratively grow the network rather than pruning it, or does that collapse to be essentially the same thing?

  • @robbiero368
    3 years ago

    Oh just heard your similar comments right at the end of the video. Cool.

  • @hungryskelly
    4 years ago

    Phenomenal. Would you be able to step through the code of one of these papers?

  • @YannicKilcher
    4 years ago

    sure, but this one is just like 3 lines, have a look

  • @hungryskelly
    4 years ago

    @@YannicKilcher Fair point. Would look forward to that kind of thing on other papers. Thanks for the incredibly insightful content!

  • @blanamaxima
    4 years ago

Not sure what this thing learns, the dataset or the architectures...

  • @sansdomicileconnu
    4 years ago

This is the Pareto law.

  • @kDrewAn
    4 years ago

    Do you have a PayPal? I don't have much but I at least want to buy you a cup of coffee.

  • @YannicKilcher
    4 years ago

Thank you very much :) but I'm horribly over-caffeinated already :D

  • @kDrewAn
    4 years ago

    Nice

  • @jerryb2735
    4 years ago

    This paper contains no new or deep idea. They do use data when pruning the network. It is the data on which the network was trained. Moreover, the lottery ticket hypothesis is trivial. Once stated rigorously, it takes less than four lines to prove it.

  • @YannicKilcher
    4 years ago

    Enlighten us, please and prove it in four lines :D

  • @jerryb2735
    4 years ago

    @@YannicKilcher Sure, send me the statement of the hypothesis with the definitions of all technical terms used in it.

  • @YannicKilcher
    4 years ago

    @@jerryb2735 no, you define and prove it. You claim to be able to do both, so go ahead

  • @jerryb2735
    4 years ago

    @@YannicKilcher False, read my claim carefully.

  • @GoriIIaTactics
    4 years ago

    this sounds like it's trying to solve a minor problem in a really convoluted way

  • @MrAmirhossein1
    4 years ago

    First :D

  • @StefanReich
    4 years ago

    They are sooo on the wrong track...
