Snowflake Arctic 480B LLM as 128x4B MoE? WHY?

Science & Technology

Snowflake offers unique technical insights into new LLM architecture designs, in particular its new 128x3.66B Mixture-of-Experts (MoE) system. A set of dedicated, highly specialized LLMs for enterprise tasks?
As of May 1, 2024, Snowflake-Arctic-Instruct 480B (MoE 128x3.66B), with more than 28,600 votes, officially sits at position 37 (behind the MoE Mixtral-8x7B-Instruct at position 31) in the latest benchmark data by the AI community, as published on the reference LMsys.org leaderboard (see chat.lmsys.org/?leaderboard).
This official ranking (37) of Snowflake-Arctic-Instruct by the AI community supports my live causal-reasoning performance testing and the findings in this video.
A short introduction to MoE, then a comparison of different model architectures, followed by a causal-reasoning test (following a test suite published by Stanford University).
Can a relatively small LLM, with fewer than e.g. 5 billion free trainable parameters, solve complex reasoning tasks? We evaluated this in my last video on PHI-3 MINI.
Where is the sweet spot of performance vs. cost, of cost efficiency vs. performance efficiency, for AI systems based on MoE architectures?
Can an extended 128x3.66B MoE solve complex reasoning tasks across a multitude of data? Does it outperform a dense transformer or dynamic routing to more than two active experts?
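For orientation on the gating mechanism discussed later in the video, below is a minimal sketch of a top-k MoE layer in PyTorch. The class name, layer sizes, and the softmax-over-selected-experts router are illustrative assumptions for this sketch, not Arctic's actual code or hyperparameters (Arctic pairs its 128-expert MoE MLP with a dense residual transformer and routes each token to only a small number of experts).

```python
# Minimal sketch of top-k expert routing in an MoE layer (illustrative sizes,
# not Snowflake Arctic's real hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router: one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.gate(x)                   # (num_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the k selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # each token is sent to k experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 256)                    # 8 token embeddings
print(TopKMoE()(tokens).shape)                  # torch.Size([8, 256])
```

Because only top_k of the 128 experts run per token, the active parameter count per forward pass stays far below the 480B total, which is the cost-efficiency argument examined in the video.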
All rights w/ authors:
www.snowflake.com/blog/arctic...
00:00 Snowflake New LLM 480B
00:52 Mixture of Experts - MoE
02:14 my background research
02:48 Benefits of a MoE over a dense Transformer
05:06 Why a new LLM as MoE?
08:30 Architecture and Gating Mechanism
11:04 Focus on reasoning MoE efficiency
12:35 Official benchmark data
15:24 Snowflake AI research cookbook
16:30 Real time testing of Snowflake Arctic
#airesearch
#ai
#newtech

Comments: 9

  • @iham1313
    24 days ago

    I like the concept of basic reasoning plus specialization, and I guess this could lead to generally smaller models that are trained in specific abilities (like coding in a language, SQL, summarization of text, generating text) while understanding what is needed (reasoning) based on text, images, audio, … Those can be coordinated in teams (think of CrewAI, LangGraph or other agentic setups). There is not a huge benefit in ultimate knowledge bases, where one has to know it all (every coding language, every text type, …).

  • @yuriborroni5490
    25 days ago

    I'm excited for when you put the new Chinese SenseNova 5.0 model to the test :)

  • @ericsabbath
    25 days ago

    What about Qwen? The 14B quantized version runs smoothly on free Colab (T4).

  • @gileneusz
    24 days ago

    The effectiveness of AI models often correlates with their size: larger models typically exhibit better reasoning capabilities. Consequently, simply increasing the number of smaller, less capable models does not necessarily compensate for their individual limitations in performance.

  • @MattJonesYT
    25 days ago

    I wish they would do benchmarks that compare the performance of Mixture-of-Experts LLMs vs. regular old agent systems. An agent system is more easily distributed across many computers and can be more easily debugged and steered by looking at the conversation of prompts. The agent system also doesn't take a lot of new training horsepower. Overall, in the future I expect agent systems to generally dominate the MoE approach, except for the fact that they are rarely directly compared in the benchmarks.

  • @coldlyanalytical1351
    25 days ago

    This uses 128 retail GPUs, one per model. Not cheap - but maybe great value for money? TBH this only makes sense in a multi-user system, in order to keep all those GPUs busy.

  • @joserobles11
    22 days ago

    1:58 correction: there are 132 (i)Phones in there. Counted them 😂

  • @propeacemindfortress
    25 days ago

    hmmmm... 🤔

  • @brandon1902
    @brandon190225 күн бұрын

    Filtering for "Enterprise Intelligence" is grossly misleading. When you factor in all basic capabilities, including MMLU, ARC, and other tests measuring basic intelligence, knowledge and language skills, Snowflake performs horribly at most major LLM tasks and is easily beaten by much smaller and about equally fast LLMs like Mixtral 8x7B. And it really doesn't take all that long to train an 8x7B MoE, so the added training time is much cheaper overall than the added memory resources needed during inference with a large number of client machines. This 128-expert design is basically worthless unless you're able to make it run with a much smaller memory footprint, perhaps by loading only a handful of experts into memory at a time.
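To make the commenter's closing suggestion concrete, here is a hypothetical sketch of keeping only recently routed experts resident in memory via a small LRU cache. The cache size, loader, and layer shapes are illustrative assumptions for this sketch, not Snowflake's serving code.

```python
# Hypothetical sketch: keep at most CACHE_SIZE experts in RAM, evicting the
# least recently used one when a newly routed expert must be loaded.
from collections import OrderedDict
import torch
import torch.nn as nn

D_MODEL, D_FF, CACHE_SIZE = 256, 512, 4

def load_expert(e: int) -> nn.Module:
    # Stand-in for deserializing expert e's weights from disk or object storage.
    torch.manual_seed(e)  # deterministic fake weights so the demo is reproducible
    return nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.GELU(), nn.Linear(D_FF, D_MODEL))

class ExpertCache:
    def __init__(self):
        self.cache = OrderedDict()                 # expert id -> nn.Module

    def get(self, e: int) -> nn.Module:
        if e in self.cache:
            self.cache.move_to_end(e)              # mark as recently used
        else:
            if len(self.cache) >= CACHE_SIZE:
                self.cache.popitem(last=False)     # evict the coldest expert
            self.cache[e] = load_expert(e)
        return self.cache[e]

cache = ExpertCache()
x = torch.randn(1, D_MODEL)
for e in [3, 17, 42, 3, 99, 7]:                    # expert ids chosen by the router
    y = cache.get(e)(x)
print(sorted(cache.cache))                         # only the most recently used experts remain
```

In a real serving stack the loader would deserialize expert weights rather than construct them on the fly, and the cache hit rate would depend heavily on how the router distributes tokens across the 128 experts.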
