LLaMA 3 “Hyper Speed” is INSANE! (Best Version Yet)

Science & Technology

What happens when you power LLaMA with the fastest inference speeds on the market? Let's test it and find out!
Try Llama 3 on TuneStudio - The ultimate playground for LLMs: bit.ly/llama-3
Referral Code - BERMAN (First month free)
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
groq.com
llama.meta.com/llama3/
about. news/2024/04/met...
meta.ai/
LLM Leaderboard - bit.ly/3qHV0X7

Comments: 590

  • @matthew_berman (1 month ago)

    Reply Yes/No on this comment to vote on the next video: How to build Agents with LLaMA 3 powered by Groq.

  • @MoosaMemon. (1 month ago)

    Yesss

  • @StephanYazvinski (1 month ago)

    yes

  • @hypercoder-gaming (1 month ago)

    YESSSS

  • @MartinMears (1 month ago)

    Yes, do it

  • @paulmichaelfreedman8334 (1 month ago)

    F Yesssss

  • @marcussturup1314 (1 month ago)

    The model got the 2a-1=4y question correct just so you know

  • @Benmenesesjr (1 month ago)

    Yes, if that's a "hard SAT question" then I wish I had taken the SATs.

  • @picklenickil (1 month ago)

    American education is a joke! That's what we solved in 4th standard I guess..!

  • @matthew_berman (1 month ago)

    That’s a different answer from what was shown in the SAT website

  • @yonibenami4867 (1 month ago)

    The actual SAT question is: "If 2/(a-1) = 4/y, where y isn't 0 and a isn't 1, what is y in terms of a?" And then the answer is: 2/(a-1) = 4/y → 2y = 4(a-1) → y = 2(a-1) → y = 2a-2. My guess is he just copied the question wrong.

  • @hunga13 (1 month ago)

    @@matthew_berman The model's answer is correct. If the SAT site shows a different one, they're wrong. You can do the math yourself to check it.

  • @floriancastel (1 month ago)

    4:55 The answer was actually correct. I don't think you asked the right question because you just need to divide both sides of the equation by 4 to get the answer.

  • @asqu (1 month ago)

    4:55

  • @floriancastel (1 month ago)

    @@asqu Thanks, I've corrected the mistake

  • @R0cky0 (1 month ago)

    Apparently he wasn't using his brain but just copying & pasting then looking for some answer imprinted in his mind

  • @Liberty-scoots (1 month ago)

    Ai will remember this treacherous behavior in the future 😂

  • @notnotandrew (1 month ago)

    The model does better when you prompt it twice in the same conversation because it has the first answer in its context window. Without being directly told to do reflection, it seems that it reads the answer, notices its mistake, and corrects it subconsciously (if you could call it that).

  • @splitpierre (1 month ago)

    Either that, or it just has to do with temperature. Going by the Groq documentation, I believe their platform does not implement memory like ChatGPT. Temperature defaults to 1 on Groq, which is medium and will give varying responses, so I believe it has to do with temperature. Try again with deterministic settings: temperature zero.
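
A minimal sketch of that suggestion, assuming the `groq` Python SDK and a "llama3-70b-8192" model id (both are assumptions; check Groq's current docs). With temperature 0 and a fresh message list on every call, repeated runs of the same prompt should be near-deterministic, so flip-flopping answers would point to chat history rather than sampling noise.

```python
import os
from groq import Groq  # assumed SDK; pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def ask_once(prompt: str) -> str:
    # Fresh context every call: only the single user prompt is sent.
    response = client.chat.completions.create(
        model="llama3-70b-8192",   # assumed model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,             # deterministic-ish decoding
    )
    return response.choices[0].message.content

print(ask_once("If 2a - 1 = 4y, what is y in terms of a?"))
```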

  • @geno5183 (1 month ago)

    Heck yeah, Matt - let's see a video on using these as Agents. THANK YOU! Keep up the amazing work!

  • @tigs9573 (1 month ago)

    Thank you, I really appreciate your content since it is really setting me up for when I'll get the time to dive into LLMs.

  • @Big-Cheeky (1 month ago)

    PLEASE MAKE THAT VIDEO! :) This one was also great

  • @matteominellono (1 month ago)

    Agents, agents, agents! 😄

  • @vickmackey24 (1 month ago)

    4:28 You copied the SAT question wrong. This is the *actual* question that has an answer of y = 2a - 2: "If 2/(a − 1) = 4/y , and y ≠ 0 where a ≠ 1, what is y in terms of a?"
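
For anyone following the algebra, here is a small sympy sketch (added here for illustration, not from the original thread) that checks both readings of the question; the model's answer matches the first, the SAT answer key matches the corrected second one.

```python
import sympy as sp

a, y = sp.symbols("a y")

# As shown in the video: 2a - 1 = 4y  ->  y = (2a - 1)/4
print(sp.solve(sp.Eq(2*a - 1, 4*y), y))        # [a/2 - 1/4]

# As printed on the SAT page: 2/(a - 1) = 4/y  ->  y = 2a - 2
print(sp.solve(sp.Eq(2/(a - 1), 4/y), y))      # [2*a - 2]
```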

  • @albertakchurin4746 (1 month ago)

    Indeed👍

  • @juanjesusligero391 (1 month ago)

    I'm confused. Why is the right answer to the equation question "2a-2"? If I understand it correctly and that's just an equation, the result should be what the LLM is answering, am I wrong? I mean: 2a-1 = 4y → y = (2a-1)/4 → y = a/2 - 1/4.

  • @marcussturup1314 (1 month ago)

    You are correct

  • @OccamsPlasmaGun (1 month ago)

    I think the reason for the alternating right and wrong answers is that it assumes that you asked it again because you weren't happy with the previous answer. It picks the most likely answer based on that.

  • @fab_spaceinvaders (1 month ago)

    absolutely a context related issue

  • @collectivelogic (1 month ago)

    Your chat window is "context". That's why it's "learning". We need to see how they have the overflow setting configured, then you'll be able to know if it's a rolling or cut the middle sort of compression. Love your channel!

  • @ideavr (26 days ago)

    At the marble and cup prompt. If we consider that Llama 3 recognizes successive prompts as successive events, then Llama 3 may have interpreted the events as follows: (1) inverting the cup on the table. So the marble falls onto the table. The cup goes into the microwave and the marble stays on the table. (2) in a second response to the same prompt, when we turn the cup over, Llama can have interpreted it as "going under the table". Thus, the marble, due to gravity, would be at the bottom of the cup. Then, the cup goes into the microwave with the marble inside. And so on.

  • @dropbear9785 (1 month ago)

    Yes, hopefully exploring this 'self-reflection' behavior. It may be less comprehensive than "build me a website" type agents, but showing how to leverage groq's fast inference to make the agents "think before they respond" would be very useful...and provide some practical insights. (Also, estimating cost of some of these examples/tutorials would be a nice-to-know, since it's the first thing I'm asked when discussing LLM use cases). Thank you for your efforts ... great content as usual!

  • @AtheistAdam (1 month ago)

    Yes, and thanks for sharing.

  • @existenceisillusion6528 (1 month ago)

    4:49 Using 'y = 2a-2' implies a = 7/6, via substitution. However, it cannot be incorrect to say y = (2a-1)/4, because the implication would otherwise be that all of mathematics is inconsistent.

  • @ps0705 (1 month ago)

    Thanks for a great video as always, Matthew! Would you consider running your questions 10 times (not on video) if the inference speed is reasonable of course, to check the percentage of how often it gets questions right/wrong ?

  • @MeinDeutschkurs (1 month ago)

    I can‘t help myself, but I think there are 4 killers in the room: 3 alive and one dead.

  • @sbacon92 (1 month ago)

    "There are 3 red painters in a room. A 4th red painter enters the room and paints one of the painters green." How many painters are in the room? vs How many red painters are in the room? vs How many green painters are in the room? From this perspective you can see there is another property of the killers being checked, whether are they living, that wasn't asked for and it doesn't specify if a killer stops being a killer upon death.

  • @LipoSurgeryCentres (1 month ago)

    Perhaps the AI understands about human mortality? Ominous perception.

  • @matthew_berman (1 month ago)

    That’s a valid answer also

  • @henrik.norberg (1 month ago)

    For me it is "obvious" that there are only 3 killers. Why? Otherwise we would still count ALL killers that ever lived. Otherwise, when do someone stop count as a killer? When they have been dead for a week? A year? Hundred years? A million years? Never?

  • @alkeryn1700 (1 month ago)

    @@henrik.norberg Killers are killers forever, whether dead or alive. You are not going to say some genocidal historical figure is not a killer because he's dead. You may use "was" because the person no longer is, but the killer part is unchanged.

  • @taylorromero9169 (1 month ago)

    The variance on T/s can be explained by using a shared environment. Try the same question repeatedly after clearing the prompt and I bet it ranges from 220 to 280. Also, yes, too lenient on the passes =) Maybe create a Partial Pass to indicate something that doesn't zero shot it? It would be cool to see the pass/fails in a spreadsheet across models, but right now I couldn't trust the "Pass" based on the ones you let pass.

  • @ministerpillowes (1 month ago)

    8:22 Is the marble in the cup, or is the marble on the table: the question of our time 🤣

  • @Sam_Saraguy (1 month ago)

    and the answer is: "Yes!"

  • @victorc777 (1 month ago)

    As always, Matthew, love your videos. This time, though, I followed along running the same prompts on the **Llama 3 8B FP16 Instruct** model on my Mac Studio. I think you'll find this a bit interesting, if not you then some of your viewers. When following along, if both your run and mine failed or passed, I am ignoring them, so you can assume that if I'm not bringing it up here then mine did as well or as badly as the 70B model on Groq, which is saying something! I almost wonder if Groq is running a lower quantization, which may or may not matter, but considering the 8B model on my Mac is nearly on par with the 70B model, it is strange to say the least. The only questions that stick out to me are the Apple prompt, the Diggers prompt, and the complex math prompt (answer is -18).

    - The very first time I ran the Apple prompt it gave me the correct answer, and I re-ran it 10 times with only one of them giving me an error of a single sentence not ending in "apple".

    - Pretty much the same thing with the Diggers prompt: I ran it many times over and got the same answer, except for once. It came up with a solution that digging the hole would not take any less time, which would almost make sense, but the way it explained it was hard to follow and made it seem like 50 people were digging 50 different holes.

    - The first time I ran the complex math prompt it got it wrong, close to the same answer you got the first time, but the second time I ran it I got the correct answer. It was bittersweet, since I re-ran it another 10 times and could never get the same answer again.

    I'm beginning to wonder if some of the prompts you're using are uniquely too hard or too easy for the Llama 3 models, regardless of how many parameters they have.

    EDIT: when running math problems, I started to change some inference parameters, which to me seems necessary, considering math problems can have a lot of repetitiveness. So I started reducing the temperature, disabling the repeat penalty, and adjusting Min and Top P sampling. I am still not getting the right answer, or at least I think I'm not, since I don't know how to complete the advanced math problems; for the complex math prompt where -18 is supposedly the answer, I continue to get -22. Whether or not that is the wrong answer is not my point, but that by reducing the temperature and removing the repetition penalty, it is at least becoming consistent, which for math problems seems like what our goal should be. Through constant testing and research, I THINK the function should be written with the "^" symbol, according to Wolfram, like this: f(x) = 2x^3 + 3x^2 + cx + 8

  • @I-Dophler (1 month ago)

    For sure! I'm astonished by the improvements in Llama 3's performance on Groq. Can't wait to discover what revolutionary advancements lie ahead for this technology!

  • @DeSinc (1 month ago)

    The hole digging question was made not to be a maths question, but to see if the model can fathom the idea of real-world space restrictions cramming 50 people into a small hole. The point of the question is to trick the model into saying 50 people can fit into the same hole and work at the same speed which is not right. I would personally only consider it addressing the space requirements of a hole for the amount of people as a pass. Think if you said 5,000 people digging a 10 foot hole, it would not take 5 milliseconds. That's not how it works. That's what I would be looking for in that question.

  • @phillipweber7195 (26 days ago)

    Indeed. The first answer was actually wrong. The second one was better, though not perfect. Although that still means it gave one wrong answer. Another factor to consider is possible exhaustion. One person working five hours straight is one thing. But if there are more people who can't work simultaneously but on a rotating basis...

  • @csharpner (23 days ago)

    I've been meaning to comment regarding these multiple different answers: you need to run the same question 3 times to give a more accurate judgement. But clear it every time and make sure you don't have the same seed number. What's going on: the inference injects random numbers to prevent it from repeating the same answer every time. Regarding not clearing and asking the same question twice: it uses the entire conversation to create the new answer, so it's not really asking the same question, it's ADDING the question to a conversation, and the whole conversation is used to trigger a new inference. Just remember, there's a lot of randomness too.

  • @Artificialintelligenceo (1 month ago)

    Great video. Nice speed.

  • @MrStarchild3001 (1 month ago)

    Randomness is normal. Unless the temperature is set to zero (which is almost never the case), you'll be getting stochastic outputs with an LLM. This is actually a feature, not a bug. By asking the same question 3 times, 5 times, 7 times, etc., and then reflecting on it, you'll get much better answers than asking just once.

  • @roelljr (1 month ago)

    Exactly. I thought this was common knowledge at this point. I guess not.

  • @easypeasy2938 (1 month ago)

    YES! I want to see that video! Please start from the very beginning of the process. Just found you and I would like to set up my first agentic AI. (I have an OpenAI pro account, but I am willing to switch to whatever you recommend... looking for AI to help me learn Python, design a database and web app, and design a Kajabi course for indie musicians.) Thanks!

  • @JimMendenhall (1 month ago)

    YES! This plus Crew AI!

  • @AINEET (1 month ago)

    The guys from rabbit really need the groq hardware running the llm on their servers

  • @ThaiNeuralNerd (1 month ago)

    Yes, an autonomous video showing an example using groq and whatever agent model you choose would be awesome

  • @roelljr (1 month ago)

    A new logic/reasoning question for your test that is very hard for LLMs. Solve this puzzle: There are three piles of matches on a table - Pile A with 7 matches, Pile B with 11 matches, and Pile C with 6 matches. The goal is to rearrange the matches so that each pile contains exactly 8 matches. Rules: 1. You can only add to a pile the exact number of matches it already contains. 2. All added matches must come from one other single pile. 3. You have only three moves to achieve the goal.
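
A small brute-force sketch (added here for illustration) that confirms the puzzle is solvable under the stated rules: each move doubles the destination pile by taking that exact number of matches from one other pile, and there are exactly three moves.

```python
from itertools import product

def solve(start=(7, 11, 6), goal=(8, 8, 8), moves=3):
    pairs = [(s, d) for s in range(3) for d in range(3) if s != d]
    for seq in product(pairs, repeat=moves):
        piles = list(start)
        ok = True
        for s, d in seq:
            amount = piles[d]        # rule 1: add exactly what the pile already holds
            if piles[s] < amount:    # rule 2: all of it comes from one other pile
                ok = False
                break
            piles[s] -= amount
            piles[d] += amount
        if ok and tuple(piles) == goal:
            return seq
    return None

# One valid solution: B->A (move 7), then A->C (move 6), then C->B (move 4).
print(solve())
```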

  • @chrisnatale5901 (1 month ago)

    Re: how to decide which of multiple answers is correct, there's been a lot of research on this. Off the top of my head there's "use the consensus choice, or failing consensus, choose the answer the LLM has the highest confidence score for." That approach was used in Google's Gemma paper, if I recall correctly.

  • @MagnusMcManaman (1 month ago)

    I think the problem with the cup is that LLaMA "thinks" that every time you write "placed upside down on a table" you are actually turning the cup upside down, which is the opposite of what it was before. So, as it were, every other time you put the cup "normally" and every other time upside down. LLaMA takes into account the context, so if you delete the previous text, the position of the cup "resets".

  • @joepavlos3657 (1 month ago)

    Would love to see the Crew ai with Groq idea, I would also love to see more content on using crew ai, agents to be used to train and update models. Great content as always, thank you.

  • @mhillary04 (1 month ago)

    It's interesting to see an uptick in the "Chain-of-thought" responses coming out of the latest models. Possibly some new fine tuning/agent implementations behind the scenes?

  • @wiltedblackrose (1 month ago)

    My man, in what world is y = 2a - 2 the same expression as 4y = 2a - 1? That's not only a super easy question, but the answer you got is painfully obviously wrong!! Moreover, I suspect you might be missing part of the question, because the additional information you provide about a and y is completely irrelevant.

  • @matthew_berman (1 month ago)

    I used the answer in the SAT webpage

  • @wiltedblackrose (1 month ago)

    @@matthew_berman Well, you too can see it's wrong. Also, the other SAT question is wrong too. Look at my other comment

  • @dougdouglass6126 (1 month ago)

    @@matthew_berman This is alarmingly simple math. If you're using the answer from an SAT page then there are two possibilities: you copied the question incorrectly, or the SAT page is wrong. It's most likely that you copied the question wrong, because the way the second part of the question is worded does not make any sense.

  • @elwyn14 (1 month ago)

    @@dougdouglass6126 Sounds like it's worth double checking, but saying things like "this is alarmingly simple math" is a bit disrespectful and assumes Matt has any interest in checking this stuff. No offense, but math only becomes interesting when you've got an actual problem to solve; if the answer is already there from the SAT webpage, as he said, he's being a totally normal person in not even looking at it.

  • @wiltedblackrose (1 month ago)

    @@elwyn14 That's nonsense. Alarming is very fitting, because this problem is so easy it can be checked for correctness at a glance, which is what we all do when we evaluate the model's response. And this is A TEST, meaning, the correctness of what we expect as an answer is the only thing that makes it valuable.

  • @micknamens8659 (1 month ago)

    5:20 The given function f(x)=2×3+3×2+cx+8 is equivalent to f(x)=8+9+cx+8=cx+25. Hence it is linear and can cross the x-axis only once. Certainly you mean instead: f(x)=2x^3+3x^2+cx+8. This is a cubic function and hence can cross the x-axis 3 times. When you solve f(-4)=0, you get c=-18. But when you solve f(12)=0, you get c=-324-8/12. So obviously 12 can't be a root of the function. The other roots are 2 and 1/2.
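
A sympy sketch (added here for illustration) confirming the analysis above, assuming the intended function is f(x) = 2x^3 + 3x^2 + cx + 8:

```python
import sympy as sp

x, c = sp.symbols("x c")
f = 2*x**3 + 3*x**2 + c*x + 8

c_val = sp.solve(f.subs(x, -4), c)[0]
print(c_val)                          # -18
print(sp.factor(f.subs(c, c_val)))    # factors as (x + 4)(x - 2)(2x - 1): roots -4, 2, 1/2
print(f.subs({c: c_val, x: 12}))      # 3680, so x = 12 is not a root when c = -18
```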

  • @HarrisonBorbarrison (1 month ago)

    1:53 Haha Comic Sans! That was funny.

  • @wiltedblackrose (1 month ago)

    Also, your other test with the function is incorrect (or unclear) as well. As a simple proof, check that if c = -18, then the function f doesn't have a root at x = 12: f(12) = 2 · 12^3 + 3 · 12^2 - 18 · 12 + 8 = 3680. Explanation: f(-4) = 0 => 2 · (-4)^3 + 3 · (-4)^2 + c · (-4) + 8 = 0 => -72 - 4c = 0, which in and of itself would imply that c = -18. f(12) = 0 => 2 · 12^3 + 3 · 12^2 + c · 12 + 8 = 0 => 3896 + 12c = 0, which on the other hand implies that c = -3896/12 ≈ -324.7. Therefore there is a contradiction. This would actually be an interesting test for an LLM, as not even GPT-4 sees it immediately, but the way you present it, it's nonsense.

  • @Sam_Saraguy (1 month ago)

    garbage in, garbage out?

  • @wiltedblackrose (1 month ago)

    @@Sam_Saraguy That refers to training, not inference.

  • @azurenacho (1 month ago)

    For the equation _2a - 1 = 4y_, solving it as _y = 2a - 2_ doesn’t align with the algebraic process. Instead, dividing by 4 gives:

    - _y = (2a - 1)/4_
    - _y = 0.5a - 0.25_

    Substituting _y = 2a - 2_ back into the original equation, we see:

    - _2a - 1 ≠ 4(2a - 2)_
    - _2a - 1 ≠ 8a - 8_
    - _6a ≠ 7_

    This clearly shows the solution _y = 2a - 2_ is incorrect for "getting y in terms of a". Also, the constraints _a ≠ 1_ and _y ≠ 0_ don’t influence the direct algebraic solution and appear to be unnecessary for this context.

    *On the Cubic Function:* The video stated _c = -18_ for the cubic function _f(x) = 2x³ + 3x² + cx + 8_ at roots _x = -4_, _x = 12_ and _x = p_. However, a polynomial should have a constant _c_ across all roots, not different values. When calculated:

    - For _x = -4_, _c = -18_
    - For _x = 12_, _c = -324.67_ (or _-974/3_)

    This discrepancy indicates that a single polynomial cannot have two different constants for _c_, which suggests an inconsistency in the video's explanation. Furthermore, the general solution for _c_ is expressed as _c = -(2x³ + 3x² + 8)/x_ (with an asymptote at _x = 0_, since division by zero is undefined).

  • @tvwithtiffani (1 month ago)

    The reason you get the correct answer after asking a 2nd and 3rd time is the same reason chain of thought, chain of whatever works. The subsequent inference requests are taking the 1st output and using it to reason, finding the mistake and correcting it. This is why the Agent paradigm is so promising. Better than zero-shot reasoning.

  • @tvwithtiffani (1 month ago)

    I think you are aware of this though because you mentioned, getting a consensus of outputs. This is the same thing in a different manner.

  • @TheHardcard (1 month ago)

    One important factor to know is the parameter specification. Are they floating point or integer? How many bits: 16, 8, 4, 2? If fast inference speeds are coming from heavy quantization, it could affect the results. This would be fine for many people a lot of the time, but it should also always be disclosed. Is Groq running full precision?

  • @jgoldfedder (8 days ago)

    I asked ChatGPT-4o the cup question at 7:38, and it got it wrong ("The marble is still inside the cup, stuck to the inner surface at the highest point inside the inverted cup."). I corrected it, and it agreed with me. When I asked why it got it wrong the first time, it responded: "My initial reasoning was based on the assumption that the marble would remain within the cup due to some form of adhesion or static positioning, which was an incorrect interpretation of the problem. The accurate analysis should consider gravity's effect, which would cause the marble to fall out of the cup when it is inverted." So, is it assumed that we're using adhesive tape in our cups? When I asked it for a revision of the prompt so that ambiguities would be removed, I got this: "Assume the laws of physics on Earth. A small marble is put into a normal cup, and the cup is placed upside down on a table. Gravity will cause the marble to fall out of the cup and land on the table. Someone then takes the cup without changing its orientation and puts it inside the microwave. Where is the marble now?" Rephrasing it defeats the purpose of the question, but at least we can assume now why it gets it wrong.

  • @jp137189 (1 month ago)

    @matthew_berman A quantized version of Llama 3 is available on LM Studio. I'm hoping you get a chance to play with it soon. There was an interesting nuance to your marble question on the 8B Q8 model: "The cup is inverted, meaning the opening of the cup is facing upwards, allowing the marble to remain inside the cup." I wonder how many models assume 'upside down' indicates the cup opening is up, but just don't say it explicitly?

  • @airedav (1 month ago)

    Thank you, Matthew. Please show us the video of Llama 3 on Groq

  • @Kabbinj (1 month ago)

    Groq is set to cache results. Any prompt + chat history gives you the same result for as long as the cache lives. So in your case, both the first and second answers are locked in place by the cache. Also keep in mind that the default setting of Groq is a temperature higher than 0. This means there will be variations in how it answers (assuming no cache). From this we can conclude that it's not really that confident in its answer, as even the small default temperature will trip it. May I suggest you run these non-creative prompts with temperature 0?

  • @DWJT_Music (1 month ago)

    I've always interacted with LLMs with the assumption that multi-shot prompting or recreating a prompt is stored in a 'close but not good enough' parameter, thus the reasoning (logic gates) will attempt to construct a slightly different answer using the history of the conversation as a guide to achieve the user's goal, with the most recent prompt having the heaviest weights. One-shot responses are the icing on the cake, but to really enjoy your dessert you might consider baking, which is also a layered process and can also involve fine-tuning. Note: leaving ingredients on the table will slightly alter your results and may contain traces of nuts..

  • @christiandarkin (1 month ago)

    I think when you prompt a second time it's reading the whole chat again, and treating it as context. So, when the context contains an error, there's a conflict which alerts it to respond differently

  • @OscarTheStrategist (1 month ago)

    The prompt question becomes invalid because the model takes the system prompt into account as well. You could argue it should know when the user’s question starts and its original system prompt ends. Also, the reason you see better answers on second shot is probably because the context of the inference is clear the second time around. This is why agentic systems work so well. It gives the model clearly defined roles outside of the “you are an LLM and you aren’t supposed to say XYZ” system prompt that essentially pollutes the first shot. It’s amazing still how these models can reason so well. Yes, I’m aware of the nature of transformers also limiting this but I wouldn’t give a model a fail without a fair chance and it doesn’t have to be two shots at the same question, it can simply be inference post-context (after the model has determined the system prompt and the following inference is pure)

  • @JanBadertscher (1 month ago)

    Thanks Matthew for the eval. Some thoughts, ideas and comments: 1. For an objective eval I always remove the history. 2. If I didn't set temp to 0, I run every question multiple times, to stochastically get more comparable results and especially to measure the distribution, which gives a confidence score for my measured results. 3. Trying exactly the same prompt multiple times over an API like Groq? I doubt they use LLM caching or that temp is set to 0. Better check twice whether they cache things.

  • @dhruvmehta2377 (1 month ago)

    Yess i would love to see that

  • @TheColonelJJ (1 month ago)

    Which LLM, that can be run on a home computer, would you recommend for helping refine prompts for Stable Diffusion -- text to image?

  • @mshonle (1 month ago)

    It’s possible you are getting different samples when you prompt twice in the same session/context due to a “repetition penalty” that affects token selection. The kinds of optimizations that groq performs (as you made in reference to your interview video) could also make the repetition penalty heuristic more advanced/nuanced. Cheers!

  • @zandor117 (1 month ago)

    I'm looking forward to the 8B being put to the test. It's absolutely insane how performant the 8B is for its size.

  • @StefanEnslin (1 month ago)

    Yes, Would love to see you doing this, still getting used to the CrewAI system

  • @rezeraj (1 month ago)

    The second problem was also incorrectly copied. It's: "The function f is defined by f(x) = 2x³ + 3x² + cx + 8, where c is a constant. In the xy-plane, the graph of f intersects the x-axis at the three points (−4, 0), (1/2, 0), and (p, 0). What is the value of c?" Not 2x3+3x2+cx+8, and in my tests it was solved correctly.

  • @mazensmz (1 month ago)

    Hi Nooby, you need to consider the following: 1. Any statements or words added to the context will affect the response, so ensure only directly relevant context. 2. When you ask "How many words are in the response?", the system prompt affects the number given to you; you may request the LLM to count and list the response words and you will be surprised. Thx!

  • @JimSchreckengast (1 month ago)

    Write ten sentences that end with the word "apple" followed by a period. This worked for me.

  • @WINDSORONFIRE (1 month ago)

    I ran this on Ollama (70B) and I get the same behavior. In my case, and not just for this problem but other logic problems too, it would give me the wrong answer. Then I tell it to check the answer and it always gets it right the second time. This model is definitely a model that would benefit from self-reflection before answering.

  • @falankebills7196 (1 month ago)

    hi, how did you run the snake python script from Visual Studio? I tried but couldn't get the game screen to pop up. Any hints/help/pointers much appreciated.

  • @Luxiel610 (1 month ago)

    It's so insane that it actually wrote Flappy Bird with a GUI. It made errors in the first and 2nd outputs, and the 3rd was so flawless. Daang

  • @user-cw3jg9jq6d (1 month ago)

    Thank you for the content. Do you think you can point to the procedure for running LLaMA 3 on Groq, please? Also, I might have missed something, but why did you fail LLaMA 3 on the question about breaking into a car? I think it told you it cannot provide that info, which is what you want, no?

  • @frederick6720 (1 month ago)

    That's so interesting. Even Llama 3 8B gets the "Apple" question right when prompting it twice.

  • @TheReferrer72 (1 month ago)

    Yes and on the first prompt it only got the 6th sentence wrong! 6. The kids ran through the orchard to pick some Apples.

  • @mlsterlous (1 month ago)

    Not only that question. It's crazy smart overall.

  • @vankovilija (1 month ago)

    This is my assumption on why you are seeing the model behave differently when posing the same prompts multiple times:

    1. Hallucination: This is a very misunderstood behavior of LLMs; I attempt to explain it in my talk here: m.kzread.info/dash/bejne/X2iulJJyZ5bQoNI.html at timestamp 18:45. In short, hallucinations come from predicting one wrong word, due to semi-random sampling of each next word, which then influences the rest of the generation.

    2. Chat context: When you send multiple messages you are not clearing the chat. This means that all of your chat plus the new prompt ends up as new input to the model. This causes a difference in what the model's attention heads will focus on and what latent space you activate, allowing the model to generate a different answer and, in many cases, self-correct, similar to self-reflection.

    Great video, keep up the good work! One suggestion: you may want to create a script that runs the same prompt 10 times on these models, with a decent temperature setting (randomness selection), and counts how many times it gets the answer right or wrong (something like the sketch below). This will give you more granular scores than pass/fail on every test, allowing for a more accurate test.
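
A rough sketch of that suggestion (assuming the `groq` Python SDK and a "llama3-70b-8192" model id, both placeholders; adjust to whatever API and model are actually in use): run the same prompt N times with a fresh context and a non-zero temperature, then count how often a simple checker marks the answer correct.

```python
import os
from groq import Groq  # assumed SDK

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def pass_rate(prompt, is_correct, n=10, temperature=1.0):
    passes = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model="llama3-70b-8192",
            messages=[{"role": "user", "content": prompt}],  # no chat history
            temperature=temperature,
        )
        if is_correct(resp.choices[0].message.content):
            passes += 1
    return passes / n

# Example: the marble question, graded with a crude keyword check.
rate = pass_rate(
    "A marble is put in a cup and the cup is placed upside down on a table. "
    "The cup is then moved into the microwave. Where is the marble?",
    is_correct=lambda answer: "table" in answer.lower(),
)
print(f"pass rate: {rate:.0%}")
```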

  • @d.d.z. (1 month ago)

    Absolutely, I'd like to see the Autogen and Crew ai video ❤

  • @Mr_Tangerine609 (1 month ago)

    Yes, please Matt, I would like to see you put llama three into an agent framework. Thank you.

  • @accountantguy2 (1 month ago)

    The hole digging problem depends on how wide the hole is. If it's wide enough for 50 people to work, then speed will be 50x what one person could do. If the hole is narrow so that only 1 person can work, then there won't be any speed increase by adding more people. Either answer is correct, depending on the context.

  • @HaraldEngels (1 month ago)

    Yes I would like to see the video you proposed 🙂

  • @hugouasd1349 (1 month ago)

    Giving the LLM the question twice, I would suspect, works due to it not wanting to repeat itself. If you had access to things like the temperature and other params you could likely get a better idea of why, but that would be my guess.

  • @ThePawel36 (1 month ago)

    I'm just curious. What is the difference in response quality between, for example, Q4 and Q8 models? Does lower quantization mean lower quality or a higher possibility of error?

  • @KiraIsGod (1 month ago)

    If you ask the same somewhat-hard question 2 times, I think the LLM assumes the first answer was incorrect, so it tries to fix the answer, leading to an incorrect answer the 2nd time.

  • @Scarage21 (1 month ago)

    The marble thing is probably just the result of reflection. Models often get stuff wrong because an earlier more-or-less-random token pushes them down the wrong path. Models cannot self-correct during inference, but can on a second iteration. So it probably spotted the incorrect reasoning of the first iteration and never generated the early tokens that pushed it down the wrong path again.

  • @asgorath5 (1 month ago)

    Marble: I assume that it doesn't clear your context and that the LLM assumes the cup's orientation changes each time. That means on every "even" occasion the orientation of the cup has the opening downwards and hence moving the cup leaves the marble on the table. On every "odd" occasion, the cup has its opening face upwards and hence the marble is held in the cup when the cup is removed. I therefore assume the LLM is interpreting the term "upside down" as a continual oscillation of the orientation of the opening of the cup.

  • @michaelrichey8516 (1 month ago)

    Here's an idea for a video - take your top 2 competitor LLMs, give them opposing viewpoints, and make them debate. If you did it on Groq - you'd have to review at a later time because it would scroll the screen too quickly to read!

  • @unclecode (1 month ago)

    1/ IMO, this evaluation method has a key flaw - it doesn't consider inference parameters like temperature and top_p. Then you have no way to make any conclusion out of this. For example, just because you got the snake game on the first try doesn't mean Llama 3 on Groq is different. The fair approach is to use their playground and establish a consistent setup for different evaluation categories. For math, set temperature to 0; for creativity, increase it. Keep the setup the same for all models you assess. I'm a fan and just wanted to share my opinion!

    2/ Regarding why, when you rerun the same query, you get a different response: that is due to the temp settings. If they're cranked up high, it totally makes sense. Play around with a temp of 0 in the playground to see the outcome.

    3/ When you ask the same question in the same chat history and get a different answer the second time, it's because of how the model is fine-tuned. Some models think there's a mistake in the 1st answer, so you ask again. Others, with better fine-tuning, check whether anything was wrong. Thanks for your content.

  • @MrEnriqueag (1 month ago)

    I believe that by default the temperature is 0, which means that with the same input you are always going to get the same output. If you ask the question twice, though, the input is different because it contains the original question; that's why the response is different. If you increase the temperature a bit, the output should be different every time, and then you can use that to generate multiple answers via the API, then ask another time to reflect on them, and then provide the best answer. If you want, I can create a quick script to test that out.

  • @angryktulhu (1 month ago)

    YouTube provides subtitles, and a lot of services already use them for video summarization. Chances are the model was trained on some YouTube videos too, this way. So it "knows" your questions and was trained on the answers; that's why the history in the chat matters. An important consequence is that those models might be way less smart than they seem. It's not a conspiracy theory, it's just a result of the input data. But it's just a theory. I'd recommend slightly changing the tests every week or so; it won't be that hard to just add little changes here and there.

  • @djglxxii (1 month ago)

    For the microwave marble problem, would it be helpful if you were explicit in stating that the cup has no lid? Is it possible it doesn't quite understand that the cup is open?

  • @Maltesse1015 (1 month ago)

    Looking forward to the Agent video with Llama 3 🎉!

  • @axees (1 month ago)

    I've tried creating Snake with zero-shot too. Got pretty much the same result :) Maybe should try testing it by asking to create Tetris :)

  • @abdelhakkhalil7684 (1 month ago)

    You know you can specify custom prompts in Groq, right? Try that for more consistency. And the fact that the model gives different outputs for math questions means that it has a higher temperature.

  • @dkozlov80 (1 month ago)

    Thanks. Let's try local agents on llama 3? Also please consider self corrective agents, maybe based on langchain graphs. On llama3 they should be great.

  • @christiansroy (1 month ago)

    @matthew_berman Remember that asking the same question to the same model will give you different answers because there is randomness to it, unless you specify a temperature of zero, which I don't think you are doing here. Also, assuming the inference speed depends on the question you ask is a bit far-fetched. You have to account for the fact that the load on the server will also impact the inference speed. If you ask the same question at different times of the day you will get different inference speeds. Good science is not about making quick conclusions from sparse results.

  • @brianWreaves (1 month ago)

    In my mind, adding quotation marks around "apple" creates a new question and the answer provided cannot be compared to the answers of other LLMs. The questions must remain consistent to compare answers.

  • @johnflux1 (1 month ago)

    @matthew The reason you get different answers when you run is because of temperature. A non-zero temperature means that the website (so to speak) intentionally chooses NOT the best answer from the AI. The purpose is to add a bit of variance and 'creativity' to the answer. If you want the answer that the AI really considers to be the most correct, you need to set the temperature to 0. Most AI websites will have an option to do this.

  • @arka7869 (1 month ago)

    Here is another criterion for reviewing models: reliability, or consistency. Does the answer change if the prompt is repeated? I mean, if I don't know the answer and I have to rely on the model (like for the math problems), how could I be sure that the answer is correct? We need STABLE answers! Thank you for your testing!

  • @EnricoRos (1 month ago)

    Is llama3-70B on Groq running quantized (8-bit?) or F16? To understand if this is the baseline or less.

  • @theflightwrightsprogrammin4410 (1 month ago)

    4:50 how is it 2a-2? the answer it gave is spot on. Probably there is some error while pasting the question from SAT but the answer it gave is right

  • @thelegend7406 (1 month ago)

    Some ready-made coffee cups have lids, so Llama gambles between the two responses.

  • @tlskillman (1 month ago)

    I think this is a poorly constructed question, as you point out.

  • @AnthonyGarland (1 month ago)

    wow! amazing

  • @user-zh3zb7fw2j (29 days ago)

    In the case where the model gives wrong answers alternating with correct answers: if we give the model an additional prompt like "Please think carefully about your answer to the question," I think it would be interesting to see what happens to the answer, Mr. Berman.

  • @zinexe (1 month ago)

    perhaps the temperature settings are different in the online/groq version, for math it's probably best to have very low temp, maybe even 0

  • @TheFlintStryker (1 month ago)

    Let’s build the agents!!

  • @davtech (1 month ago)

    Would love to see a video on how to setup agents.

  • @elyakimlev (1 month ago)

    The "2a - 1 = 4y" question was answered correctly. Anyway, I don't think these kind of questions are interesting. These models are trained on such questions. Ask more interesting ones for entertainment purposes, like the following: I have a 1.5 kilometer head start over a hungry bear. The bear can keep running at a constant speed of 25 km/h for 8 minutes, after which it gives up the chase. How fast should I run if I want to be at least 100 meters ahead of the bear when it gives up the chase?

  • @dankrue2549 (1 month ago)

    Just had some fun testing your exact question against a few AIs, and yeah, they really struggled. Meta was closest, only messing up by adding the 100 meters to your starting distance rather than the end, putting the bear 100 m ahead of you at the finish, and saying 13 km/h. GPT was doing some seriously schizo stuff, and eventually just divided its wrong answer by itself, getting 1 km/h. And Gemini was too busy telling me it's not possible for humans to outrun grizzly bears, no matter how fast I think I can run. HAHA (although, when I finally convinced it that it was a logic puzzle, it said I need to run 25 km/h, accidentally putting me 1.5 km behind the bear at the start, if I understood its error correctly). Even with serious hints and pointing out their mistakes, they only kept getting worse answers as I tried to help them. Funny how it all worked out.

  • @hamidg (1 month ago)

    I think the marble and cup question needs to be redesigned. I think the wording that starts with "assume the laws of physics on Earth" makes it treat the question as a physics question rather than a logical question. When you tested it on the Meta platform, the explanation it gave for the marble being in the microwave is that the marble didn't have time to fall when you flipped the cup, therefore it was in the microwave, plus some explanation about electromagnetics not affecting gravity. On Groq it bases the explanation on the density of the air and the ball itself. So I think the best way to redesign this question is to add more physical variables such as time, speed, density, gravity, etc. instead of just saying Earth, or to have the "assume the laws of physics on Earth" come after the question is presented. I haven't tested these variations, but these are my thoughts.

  • @eyemazed (1 month ago)

    Maybe you should modify the LLM test suite so that for each question you do 10 runs and take the average correct/incorrect ratio. In the video you'd edit it down, of course. But it would give a much, much more complete picture of the model's abilities.

  • @steventan6570 (1 month ago)

    I think the model always gives a different answer when the same prompt is asked due to the frequency penalty and presence penalty.

  • @RonanGrant (1 month ago)

    I think it would be better to try the same prompt for each section of the test several times, and see how many of those times it worked. Sometimes when things don’t work for you, it works for me and the other way around.

  • @Interloper12 (1 month ago)

    Inb4 the Llama 3 devs are training it specifically on Snake because of your videos.

  • @SanctuaryLife (1 month ago)

    With the marble in the cup dilemma, could it be that the temperature settings are a little too high on the model, leading it to be creative?

  • @roelljr (1 month ago)

    That's exactly what it is. Randomness is normal. Unless the temperature is set to zero (which is almost never the case), you'll be getting stochastic outputs with an LLM. This is actually a feature, not a bug. By asking the same question 3 times, 5 times, 7 times, etc., and then reflecting on it, you'll get much better answers than asking just once.
