Google's New OPEN SOURCE Model Is Really Good (Except for one thing)
Science and technology
Google drops Gemma 2 27b and it's really good...but struggles in one area. Let's test it!
Try Vultr FREE with $300 in credit for your first 30 days when you use BERMAN300 or follow this link: getvultr.com/berman
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Comments: 180
You using it?
@erobusblack4856
20 days ago
It's a cool model, but what I want is that Rat model they made
@mr.fearless7594
20 days ago
@matthew_berman I usually watch your videos to keep up with the AI world, and I've learned quite a few things about evaluation from you. KEEP IT UP...
@AaronALAI
20 days ago
I'm running it locally and have tweaked it in different ways. Sort of good, sort of not; it's kind of random. Try the self-play version; the self-play fine-tunes are way better!
@ScottWinterringer
20 days ago
LOL, we can call Google's models a clown show and someone will still hype them. We called them out for intentionally modifying the model when they first released it on Hugging Face. Stop talking them up; do the right thing and point out how big of a liar they are.
@jasonshere
20 days ago
No. Gemini products have been on the lower end overall in my testing, but putting them into a Mixtral model might suffice.
"I'm quite impressed with Gemma, it failed all my tests but ya know it's ok I guess"
@matthew_berman
20 days ago
It crushed the snake game
@hqcart1
20 days ago
hahahaha, you should try it, and let me know how it went!
@4.0.4
20 days ago
@@matthew_berman that's because you've been reusing the snake game, and you shouldn't.
@alessandrorossi1294
20 days ago
@@4.0.4 All LLM producers now make sure their LLM can produce a snake game!
@MyWatermelonz
20 days ago
@@4.0.4 Him testing with the snake game has nothing to do with the model's training and architecture.
How about marking each task from 0 to 5: 0 being a fail, and 1 to 5 based on performance, then aggregating at the end of the test? That way we can get more comprehensive conclusions. Just my 2 cents.
@publicsectordirect982
18 days ago
☝️
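The per-task scoring scheme suggested in this thread is simple to aggregate. A minimal sketch in Python; the task names and scores below are made up purely for illustration:

```python
# Hypothetical per-task scores: 0 = fail, 1-5 = quality of a pass.
scores = {
    "snake_game": 5,
    "json_output": 0,
    "marble_question": 1,
    "killers_question": 3,
}

total = sum(scores.values())
max_total = 5 * len(scores)
percent = 100 * total / max_total

print(f"{total}/{max_total} points ({percent:.0f}%)")  # 9/20 points (45%)
```

A percentage like this would make cross-model comparisons easier than a raw pass/fail tally.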
Hi Matthew, please change the game code request to Space Invaders instead of the Snake game. I'm pretty sure all of the models are trained to code this little game these days.
@kormannn1
20 days ago
Would it be easy to code Pac-man?
@temp911Luke
20 days ago
@@kormannn1 I'm pretty sure you would be able to code it using Claude these days. I've seen some crazy stuff, including a mini Space Invaders game :) Obviously it was just a bunch of blocks attacking another block that was shooting at other falling blocks.
@makavelismith
20 days ago
@@kormannn1 That's a TM though.
@paulmichaelfreedman8334
20 days ago
@@kormannn1 Many of them fail horribly at Pac-Man; I've tried.
@user-hs5zd3hj9t
20 days ago
there you go!
Google's new OPEN SOURCE model is really good (except for one thing which is: its not good)
It's impressive how you keep a straight face while it explains how there isn't a strong enough force to push the marble out of the glass.
I'm a senior dev at Vultr, and when I shared your videos with our marketing department, they immediately pursued a partnership with you. It's awesome to see you thriving using our stack and infrastructure! Great video!
@southcoastinventors6583
20 days ago
Great name for a business: picking the leftover money from your customers; there's always a little more.
@Eveisfahahuglyyy
20 days ago
@@southcoastinventors6583 lol
@PP_Mclappins
20 days ago
@@southcoastinventors6583 lol you realize how much compute power costs, right? Why do you think everything should be free?
@Highdealist
19 days ago
@@PP_Mclappins Because advertisements can pay for everything. Want free TV? Watch ADS. Want free Music? Listen to ADS. Want free food? EAT SOME ADS. Want AI Tools? HALLUCINATE SOME ADS AND STFU.
@PP_Mclappins
19 days ago
@@Highdealist lol idk what you're trying to say here. Ad revenue comes from people actually buying products as a result of the ad placement. Are you suggesting that rather than pay someone directly for software they have built, you'd prefer they fill their platform with ads, so that in a roundabout way they might get paid by an ad agency while you spend your money on pointless products being shoved down your throat? Isn't it better to just pay someone directly for the product they offer 🫴 rather than force them into an ad-based revenue approach?
You are the most easily impressed researcher. Gemma failed spectacularly.
The context length is only 8k for this model 😒
@blisphul8084
20 days ago
Funny, given that even the 0.5b Qwen2 model has 32k context, and the 7b and 72b have 128k context. The Chinese seem to be putting pressure on Western AI companies to release better open models.
Your prompt tests have been the same for months and the LLM providers know it! YOU should come up with new ones every time!
News flash: Gemma is still the worst model
8k context is not mentioned as "bad"? mmmm....
@ScottWinterringer
20 days ago
Context doesn't matter if it's 8k of trash.
@Artificial-Cognition
19 days ago
8k is fine
@luizpaes7307
19 days ago
@@Artificial-Cognition Ever tried running a conversation app with AI? I run some apps with GPT-4o and they get to 25k very, very fast.
I'm starting to realize that what we are calling AI is actually the inadvertent discovery of a new kind of content-addressable memory. This is on a whole new level, but I remember an attempted hardware implementation of CAM back in the late 70s. It's bound to be very useful, but I don't think anything is going to come out of 'AI' that we don't put in. Maybe AI can point out the obvious: what we already know but don't realize.
@mbrochh82
19 days ago
You nailed it.
@Highdealist
19 days ago
Did CAM back in the 70s have emergent properties and advanced capabilities seemingly arising out of nowhere, like being really good at bullshitting? Like seriously, I think it's the best bullshitter I've ever encountered. It never stops amazing me, truly jaw dropping stuff.
Yeah man, Google is seriously so good. Look at them on the LMSYS leaderboard, and AI Studio literally offers the best free chatbot out there.
@jaysonp9426
20 days ago
Lol, I'm convinced the only data Gemma is trained on is benchmark data
@Cine95
20 days ago
@@jaysonp9426 i don't know much about gemma but i mean why try gemma if you can use 1.5 pro
@blisphul8084
20 days ago
@@jaysonp9426 I disagree. Their 9b model does better than any other model at producing accurate Japanese hiragana from kanji while maintaining the format it's instructed to use. Qwen2 7b comes close, but requires finetuning to get the instructed formatting correct. That being said, 7b fits on an 8GB GPU with q4_k_m, but 9b requires smaller quants that perform awfully at this task, so the way I see it: Qwen2 for local, Gemma 2 for cloud/Groq.
I think it got the answer correct on that digging-a-hole problem (50 people would still take the same amount of time as 1 person). I live in WV and have dug many dozens of post holes, and more people do not make the job any faster, even by 1%. The hole is not big enough for 2 people to be in it at the same time. Now, if we were digging a trench, that would be a different question. I think the answer they were going for is a trench; or "digging several holes" would also work. The fault is not with the AI, but with the person making these questions.
"This model is really good! Except when it isn't."
Maybe specify "drinking glass" or "empty drinking glass" for the marble question. "Glass" alone might be too vague. I'm curious what kind of glass it thought you meant in your example. That would be cool to ask it.
@apache937
20 days ago
They're supposed to know it anyway; you can't spoonfeed them.
@brigfiche
20 days ago
@@apache937 that would be ideal.
Are we really supposed to believe that they have fundamentally altered their basic corporate identity and have stopped putting out models saturated with ridiculous biases?
@ScottWinterringer
20 days ago
It's worse than that: every model they have released has been intentionally altered to reduce its functionality.
@Highdealist
19 days ago
Yes, exactly you got it. This is exactly what they are proposing that we should do. F'in hilarious, right?
@tonyppe
19 days ago
You missed that these models are now specifically trained on these common tasks to perform them well 💪 😂
The quantized GGUF versions of gemma-2-27b-it seem to be extremely stupid in LM Studio, while the 9b version is doing great. Something seems to be wrong, no matter where the model is downloaded from.
@martinmakuch2556
20 days ago
Yes, there have been observations that it might not quantize that well. But I run Q8_L (8-bit with input encoding/output decoding in FP16) in KoboldCpp and that is doing fine. Note also that Gemma 2 27b does not have a system prompt! So you have to be careful how you prompt it. That said, some people recommended using a system prompt anyway, despite the model not being trained for one, and it actually seems to work fine for me. All that said, at least for chat/roleplay (and long conversations and creative writing), Command R 35B is still better, and it also quantizes well (but is also not exactly easy to prompt correctly). Gemma 27B is a mixed bag: sometimes it surprises with amazing performance, sometimes it is lacking. Kind of what the tests in the video showed too. Still, it's nice to see contenders in this category, and it is definitely not a bad model.
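Since Gemma 2 defines no system role, the usual workaround is to fold system-style instructions into the first user turn. A minimal sketch, assuming Gemma's published `<start_of_turn>`/`<end_of_turn>` chat markers (check your runtime's actual chat template before relying on this):

```python
def format_gemma_prompt(user_message: str, system_hint: str = "") -> str:
    """Build a Gemma 2 style prompt. Gemma 2 has no system role,
    so any system-style hint is prepended to the user turn."""
    content = f"{system_hint}\n\n{user_message}" if system_hint else user_message
    return (
        f"<start_of_turn>user\n{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("Write Snake in Python.", "You are a careful coder.")
print(prompt)
```

Tools like KoboldCpp or LM Studio normally apply a template like this for you; building it by hand mainly matters when calling the raw completion endpoint.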
You should update your quizzes; LLMs are training on them, so they're no use.
Matt, I read that Google figured out a way to speed up training 13x. That's huge; you should do a video on it. It's called JEST.
@matthew_berman
19 days ago
Thanks will check it out
@jeffg4686
19 days ago
@@matthew_berman - It's basically just a filter NN that selects the highest-quality data and the items with the best references to other high-quality data. Not sure if others are using this trick or not, so it may be 13x just for them rather than for everyone.
Hey man, I truly mean this as a fan: your tests are stale. Every video is a bigger ad for your latest partnership, with no added value over anyone else's videos. I respect that you are running a business and doing well; I just felt compelled to give my two cents.
The problems with reasoning are because of the way it was trained, but I think the method used will be great for SLMs like Phi-3.
I really like the 27b size, I just hope it can be fine-tuned into something decent. For the smaller one, any fine tuner could just start from something that doesn't suck out of the box, but the 27b size is relatively unique.
Mr. Berman, please feel free to criticize new models more often. It's hard to believe someone if they rarely mention the flaws of any new product. There are a lot of great breakthroughs being made; but Gemini isn't really one of them.
Thank God for Meta and Google. I would certainly love to say that about OpenAI, but I can't. Ironically, the only thing "open" about them is the word in the name of the company.
How do you link a model from Vultr to Open WebUi?
Thank you for adding JSON test ❤
How do I connect another LLM endpoint in Open WebUI? Can anyone please help me?
Google is definitely rehabilitating their AI image with Gemma-2. I've been using both the 27b and the 9b models since they were available, and they are my new favorites. Oddly, although the 27b version has better benchmark scores, I subjectively feel that the responses I've been getting from the 9b model are actually better. Faster AND better, although I can't put my finger on exactly what the difference is. But I am now starting to favor the 9b significantly more than the other.
The 27B model will run q8 on consumer hardware and the accuracy is virtually the same as unquantized.
I'd love a comparison video about Vultr, RunPod etc. with pros and cons
I'm starting to think you should raise your benchmark. Let's consider "Make a Tetris game" as an example. These models are getting smarter, so think of your current benchmark as Level 1 in game development and Tetris as Level 2, to suit OpenAI's new level system. At OpenAI, we scrape YouTube videos and use the data to train our models. We test the models using a 13k zero-shot question sheet and evaluate the results to see where they perform best. I believe that, at this point, we have covered most of your questions due to our daily YouTube data scraping.
I think I would have failed your JSON question. It's pretty pathological.
Here we go again: Google overpromised and then delivered a clearly crippled model.
I think your question about the killers is too vague, but that might be the point. If someone walks into a room and kills one of them, who is "them"? The model thought "them" meant the killers. Was "someone" also a killer? It is not known.
@apache937
20 days ago
others get it right....
I recently read an article about LLMs programming in COBOL. The problem is that there have been many iterations of COBOL over the decades, and it doesn't have a big open-source scene from which LLM training data could be sourced, but there is a lot of literature about it (programming books, online discussions, etc.). So wouldn't it be fun to ask the LLM to write Snake in COBOL?
I can say these models are good for specific tasks. For example, Gemini 1.5 Pro is great for marketing texts (with the right prompt). Llama 3 is so far the best for marketing copy. Downside: solid results only in English, but still very catchy and on point... let's see how Gemma will do!
I believe it understands better if you ask for five separate sentences that end in the word Apple
I wonder if Librechat supports the Vultr custom endpoint. Thanks for the info Matt
"Google's New OPEN SOURCE Model" - the license of weights is NOT open-source.
I always thought they would do this in the back of my mind. Great move Google
27B is not even enough for reasoning, but the model's skills are impressive though.
I'd like to see a review of the 9b version.
27b looks like a great compromise in size
Not sure your word problem should count, as the text you submitted contained the answer
Terrance Howard blows up and Google uses a flower of Life inspired logo 🧐
@anubisai
20 days ago
Sacred geometry is often used in logo design.
@Player-oz2nk
20 days ago
Glad you're awake
@cognitive-carpenter
19 days ago
@@anubisai not in Google logo design
ooo downloading this now. Hopefully my 4090 is enough to run a 27b model 😭
Google finally caught up huh?
Google's new OPEN SOURCE model is really good (except when it isn't)
The problems with logic and reasoning in certain language models are mostly related to censorship, and to training data that is not based on real, true facts. It can also happen when models are made what they call "safer for the public," which in some cases is just another word for censorship. Models that can reason and judge on their own what they should tell certain users, rather than having it hard-coded, will solve a lot of these issues. For example, if a model gets to know a user and becomes aware that the person is easily scared, it can decide on its own not to talk about scary stuff with that user, etc. Lots of contradictions in the training data, or contradictory programming, can make models less able to reason. Humans learn by making mistakes, yet we insist that AI be perfect according to our standards and our truths. It is one thing to have knowledge, but you also need wisdom and experience to put it to use.
Here's your next LLM test question: "So a friend is offering to sell me either a box with 9.11 ounces of gold for a certain price or a box with 9.9 ounces of gold for the same price. Which box should I choose?"
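The trap in that question is that 9.11 looks larger than 9.9 if you compare the digit strings; as numbers, 9.9 ounces is more gold. A quick sanity check:

```python
box_a = 9.11  # ounces of gold in box A
box_b = 9.9   # ounces of gold in box B

# Numeric comparison, not string comparison: 9.9 > 9.11.
better = "box B" if box_b > box_a else "box A"
print(better)  # box B
```

Models (and people used to software version numbers, where 9.11 > 9.9) routinely get this backwards, which is what makes it a good test question.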
How does it compare to Claude Sonnet? And how long do you think it will take before developers strip the censored crap out of the model?
What's the point of testing the 27b model that just about everyone will never use?
Google is the best thing that ever happened to humanity
Hey Matt, can you do a video about Google voice-to-text vs. ChatGPT voice-to-text? I've noticed that the ChatGPT app's voice-to-text is leaps and bounds ahead of anything Google has to offer. I don't even recheck my transcripts anymore, that's how sure I am of ChatGPT's voice-to-text!! 🔥🔥🔥 Please do it for the community so Google focuses on speech-to-text more!! 🙏🏻🙏🏻🙏🏻
Maybe you could add a rating to a correct answer, since some good answers are better than other good answers. Btw, I love your model-testing videos.
Please change your JSON code question. It is ambiguous as to what to put in the output; my first guess was that you wanted a result there as well.
@apache937
20 days ago
if claude 3.5 can do it
Thank you so much, bro; you're truly on top of the best AI resources. And I'd like to suggest (for the ~3rd time) that you **replace** some of the tests, which I'd personally rig my production AIs for with your rubric if I were a major tech company (being a big fan of your show 😍). Instead, use a good "royal" beefy prompt to develop a full practical app, in any codebase; the aim is to showcase a solution for a real-life problem. Say, backend in Laravel and frontend in Flutter. A salute to you from Bahrain 🇧🇭, the heart of the Arabian Peninsula.
Meh... they're holding back... give us the goods, Google.
I don’t understand what the use case is for this model. Flash is completely underwhelming. Gemini 1.5 Pro is the only thing worth a damn
I hope that AI doesn't come up with similar questions for people to choose who is worthy of attention and, possibly, life...
SCREENED NOT TO CODE.. like other things it may be screened for.. can you retrain/uptrain this model for one's specific needs?
I watch your channel all the time but it's so frustrating that you commonly state that the model has failed a task when it has been successful lol
Didn't you just get the most powerful computer to run LLMs locally, yet you're using a web service? I'd like to see this run locally, please.
There is no price cheap enough to buy wrong answers. Small models seem worthless to me.
That was 2 weeks ago....
I'm VERY concerned that the largest companies (and Google isn't known to be very ethical) are able to make these very powerful tools without any controls. I don't get why there isn't legislation to protect the People of the United States the way it protects the richest oligarchs. Remember that when you use their tools. And remember... YOU are the product.
Your video is impressive, Matt. Gemma 2 27b, not so much. I'm sticking with Llama.
Please the links
These tests are redundant. The tests should be something novel and complex that current-generation models struggle with. The entire idea of "one-shot" tests is also kind of bad: all LLMs generate different responses each time, so it's hit or miss. A more scientific approach to testing true intelligence and capability would be to run automated tests for each model and each prompt 10+ times, then evaluate the responses and take the average for each model for a true comparison. I understand it's easy to pump out new videos with the same tests, but it's time to change the testing strategy so it has actual value for comparing different models.
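The repeated-run evaluation this commenter proposes is easy to sketch. In the sketch below, `ask_model` is a hypothetical stand-in for a real LLM API call (deterministic here so the example runs) and `grade` is a trivial answer checker; the point is just the averaging loop:

```python
def ask_model(prompt: str, attempt: int) -> str:
    # Hypothetical stand-in for a real LLM API call; deterministic here
    # so the example is runnable. Imagine it returning varied answers.
    return "9.9" if attempt % 3 else "9.11"

def grade(answer: str, expected: str) -> bool:
    return answer.strip() == expected

def pass_rate(prompt: str, expected: str, runs: int = 10) -> float:
    """Ask the same question `runs` times; return the fraction of passes."""
    passes = sum(grade(ask_model(prompt, i), expected) for i in range(runs))
    return passes / runs

rate = pass_rate("Which is larger, 9.9 or 9.11?", "9.9", runs=10)
print(f"pass rate: {rate:.0%}")  # pass rate: 60%
```

Reporting a pass rate per prompt, rather than a single one-shot verdict, would make comparisons between models far less noisy.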
Gemma 2 has been around for a while. Why are you claiming it has just been released?
@Artificial-Cognition
19 days ago
The full Gemma 2 was only recently published; before that, only the Gemma 2B and 7B models were public.
JSON? Don't you mean JSML?
🎯 Key points for quick navigation:
📜 Google released Gemma 2, an open-source LLM available for researchers and developers.
🚀 Gemma 2 boasts best-in-class performance and integrates easily with other AI tools.
💪 Available in 9 billion and 27 billion parameter sizes, with the 27 billion version being tested.
🌐 Sponsored by Vultr, using their cloud infrastructure to run unquantized models at high speed.
🖥️ Gemma 2 outperforms comparable models and runs efficiently on various high-performance GPUs.
🧩 Testing included coding tasks, logic, reasoning, and math problems, showing impressive results but struggles with complex logic.
🎮 Successfully generated a Python script for the Snake game and added new features.
⚙️ Some tasks failed, especially those involving complex logic or specific output formats like JSON.
Made with HARPA AI
how does it compare to llama?
@MegaLeoben
20 days ago
F
@hqcart1
20 days ago
Google will give you garbage so that you waste days installing it, screw up your computer, just to find out it sucks, and then give up and try their paid models.
@foreignconta
20 days ago
Best multilingual performance compared to Llama 3, and not very chatty.
Google Gemini is way worse than GPT-4 or Mistral.
@southcoastinventors6583
20 days ago
Exactly. If you can't run it locally, why bother? Might as well use Claude 3.5 or GPT-4o.
Summary: the model is good at coding. It's censoooored! And it also struggles with more complex logic and reasoning.
It was bad for my usecase, not impressed
Open source ai is best
Multishot ai
All Gemini products have been pretty bad overall.
@jasonshere
20 days ago
If there's an area where it might do okay, it's in programming.
A 50 on HumanEval is lousy for coding :(
AI always trolls me. 🙂
Or, on a Mac Studio M2 ultra 192GB. And no, the model is really bad. Far behind qwen2 or llama3.
Censored
Meh, just a giant advert. Vulture is about right...the carcass here though is AI.
yaes
Google's open-source model is the bomb, but I'm still waiting for the secret mind-control feature. #JustKidding #TotallyTrustingGoogle
It's a bad product; it doesn't offer hourly rental! Next time recommend a product more people can benefit from.
More promotion. Failed your tests, but now it's a very good model? Disappointed in ya.
I wonder how it does with the woke stuff, I'm betting I know the answer to this. lol
Can you think about changing these stupid tests? They are total nonsense.
27th
Third.
am i first ?
It's awful, hahaha 😂😂😂😂
FIRST!
LOL, don't clickbait us with garbage, dude. I will just stop watching your stuff if you keep this up.
Gemma 2 is far too hit and miss, and it's profoundly ignorant about huge pockets of knowledge. I think this stems from the fact that they used carefully selected synthetic data that overlaps with LLM tests, versus a broad corpus of knowledge like Llama 3. Consequently, Gemma 2 is vastly inferior to Llama 3 overall despite having higher test scores. I really wish LLM makers would stop cheating. The path forward is more data, not less data chosen to boost test scores.
@foreignconta
20 days ago
Gemma 2 is multilingual; Llama 3 is not. So it is not just "hit or miss". In my workflow, the 9B works better than Llama 3.
@brandon1902
20 days ago
@@foreignconta Multilingual is extremely important, but it's clear after testing Gemma 2 that Google primarily selected piles of quality data (largely synthetic, from Gemini) for each desired feature: math, code, various languages, science, technology... However, since they didn't train on all of Wikipedia and the internet, there's an overwhelming number of hallucinations when it comes to very popular information like music, games, and movies. I'm looking for a well-balanced LLM that can respond appropriately to all popular fields of interest, and that's not what Gemma 2 is. If they aren't going to include the full breadth of humanity in their LLMs, then they need to make them better at saying "I don't know" instead of outputting a flood of hallucinations.
"Google's New OPEN SOURCE Model Is Really G̶o̶o̶d̶ Woke" There I fixed it for you.
@Artificial-Cognition
19 days ago
You'd be surprised...