Next-Gen AI: RecurrentGemma (Long Context Length)

Science & Technology

A brand-new language model architecture: a recurrent LLM built on Griffin. Moving past transformers.
Google developed RecurrentGemma-2B and compares this new LM architecture with the classical transformer-based Gemma-2B and its quadratically scaling self-attention. The new throughput: about 6,000 tokens per second. Behind it sit two recent architectures, Griffin and Hawk, where Hawk alone already outperforms state-space models (like Mamba - S6).
Introduction and Model Architecture:
The original paper by Google introduces "RecurrentGemma-2B," leveraging the Griffin architecture, which moves away from traditional global attention mechanisms in favor of a combination of linear recurrences and local attention. This design enables the model to maintain performance while significantly reducing memory requirements during operations on long sequences. The Griffin architecture supports a fixed-size state irrespective of sequence length, contrasting sharply with transformer models where the key-value (KV) cache grows linearly with the sequence length, thereby affecting memory efficiency and speed.
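To make the fixed-state idea concrete, here is a minimal sketch of the RG-LRU recurrence described in the Griffin paper (illustrative only: the hidden width, weights, and initialisation below are made up, not the released model code):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    d = 8                                  # hidden width (hypothetical)
    c = 8.0                                # fixed constant from the paper
    rng = np.random.default_rng(0)
    W_a = rng.normal(size=(d, d))          # recurrence-gate weights
    W_x = rng.normal(size=(d, d))          # input-gate weights
    Lam = rng.normal(size=d)               # learnable; sigmoid(Lam) lies in (0, 1)

    def rg_lru_step(h_prev, x_t):
        r_t = sigmoid(W_a @ x_t)           # recurrence gate (depends on input only)
        i_t = sigmoid(W_x @ x_t)           # input gate
        a_t = sigmoid(Lam) ** (c * r_t)    # gated per-channel decay
        return a_t * h_prev + np.sqrt(1.0 - a_t**2) * (i_t * x_t)

    h = np.zeros(d)                        # state never grows beyond d values
    for x_t in rng.normal(size=(16, d)):   # works for any sequence length
        h = rg_lru_step(h, x_t)

Note that the gates depend only on the current input x_t, not on the previous state h_{t-1}, and the state stays the same size no matter how long the sequence is.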
Performance and Evaluation:
RecurrentGemma-2B demonstrates comparable performance to the traditional transformer-based Gemma-2B, despite the former being trained on 33% fewer tokens. It achieves similar or slightly reduced performance across various automated benchmarks, with a detailed evaluation revealing only a marginal average performance drop (from 45.0% to 44.6%). However, the model shines in inference speed and efficiency, maintaining high throughput irrespective of sequence length, which is a notable improvement over transformers.
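As a back-of-the-envelope illustration of why throughput stays flat, compare the memory a transformer's KV cache needs against a fixed recurrent state (the layer counts and widths below are hypothetical stand-ins, not the exact Gemma / RecurrentGemma configurations):

    # Hypothetical 2B-scale numbers, chosen only to show the scaling behaviour.
    n_layers, n_kv_heads, head_dim, bytes_per_val = 18, 1, 256, 2
    state_width = 2560

    def kv_cache_bytes(seq_len):
        # K and V tensors per layer, each of shape [seq_len, n_kv_heads, head_dim]
        return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

    def recurrent_state_bytes():
        # one fixed-size state per layer, independent of sequence length
        return n_layers * state_width * bytes_per_val

    for seq_len in (2_048, 32_768, 262_144):
        print(seq_len, kv_cache_bytes(seq_len), recurrent_state_bytes())

The KV cache grows linearly with sequence length while the recurrent state stays constant, which is exactly the property that keeps sampling speed independent of how long the prompt or generation gets.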
Technological Advancements and Deployment:
The introduction of a model with such architectural efficiencies suggests potential applications in scenarios where computational resources are limited or where long-sequence handling is critical. The team provides tools and code (GitHub repo, open source) for community engagement, and compares the simpler Hawk architecture to state-space models (like S4) as well as to classical Llama models.
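For anyone who wants to try the model locally, one possible route is the Hugging Face transformers integration; this is a sketch under the assumption that the checkpoint is published as "google/recurrentgemma-2b" and that your installed transformers version already ships RecurrentGemma support:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed model id; check the official GitHub repo / model card for the
    # exact checkpoint names and licence terms.
    model_id = "google/recurrentgemma-2b"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tok("Linear recurrences plus local attention", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))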
All rights remain with the authors of the paper:
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
arxiv.org/pdf/2404.07839.pdf
00:00 Llama 3 inference and finetuning
00:23 New Language Model Dev
01:39 Local Attention
04:22 Linear complexity of RNN
06:05 Gated Recurrent Unit - GRU
07:56 Linear Recurrent Unit - LRU
14:25 Griffin architecture
15:50 Real-Gated Linear Recurrent Unit - RG-LRU
21:20 Griffin Key Features
25:15 RecurrentGemma
26:24 GitHub code
27:13 Performance benchmark
#ai
#airesearch

Comments: 6

  • @miikalewandowski7765 (a month ago)

    Finally! It’s happening. The Combination of all your beautiful findings.

  • @BradleyKieser (a month ago)

    Absolutely brilliant, thank you. Exciting. Well explained.

  • @po-yupaulchen166 (a month ago)

    Thank you. In RG-LRU, h_{t-1} should not be inside the gates (inside the sigmoid function) in the original paper, right? Otherwise it would slow down training. I am so surprised that finite memory can match the performance of transformers with their effectively unbounded memory. Also, it seems traditional RNNs like LSTMs will soon be replaced by RG-LRU. I'm curious whether someone can compare those RNNs and show what is wrong with the old design.

  • @codylane2104 (a month ago)

    How can we use it locally? Can we at all? LM Studio can't download it. 😞

  • @Charles-Darwin (a month ago)

    Surely this provides, or could provide, massive efficiency gains. If I touch a hot plate and feel the heat, the state is sent to my relevant limbs to retract... but then shortly thereafter this state fades and I can proceed to focus on other states. What are neurons if not a response network to environmental factors? Google will probably be the first to an organic/chemical computer.

  • @MattJonesYT (a month ago)

    It's made by Google, which means it will have all the comical corporate biases that the rest of their models have. It will produce useless output really fast. When someone makes a de-biased version, this tech will be much more interesting.
