Did OpenAI Just Secretly Release GPT-5?! ("GPT2-Chatbot")

Science and technology

GPT2-Chatbot just showed up on lmsys.org. We know little about it other than that it performs incredibly well and is unlike anything we've seen from other models.
Try Vultr FREE with $300 in credit for your first 30 days when you use BERMAN300 or follow this link: getvultr.com/berman
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V

Comments: 742

  • @matthew_berman · 17 days ago

    Is this GPT4.5 or GPT5 or something different?

  • @shopbc5553 · 17 days ago

    It's something different. OpenAI just wants to stay publicly relevant, so it's more of a stunt than anything. My guess is that it's an old model, maybe literally GPT-2, but with enhancements that make it perform on par with GPT-4.

  • @radestein8548 · 17 days ago

    Gpt5

  • @phen-themoogle7651 · 17 days ago

    @@shopbc5553 I thought this too; it makes the most sense.

  • @Avman20 · 17 days ago

    My money is on OpenAI, but whether this is in the GPT series or a peek at a new architecture is the mystery.

  • @MyWatermelonz · 17 days ago

    ​@@shopbc5553 If that's the case, it's more impressive than GPT-4.5: they took a 1.8B model and made it legitimately better than GPT-4. Given the inference speed, though, probably not.

  • @rawallon · 17 days ago

    Dude, I swear, at this rate, by the end of the year you'll be able to write your own snake game.

  • @matthew_berman · 17 days ago

    I'll NEVER write my own snake game.

  • @Inventai · 17 days ago

    @@matthew_berman

  • @MrChinkman37 · 17 days ago

    😂

  • @matikaevur6299 · 17 days ago

    @@matthew_berman Yeah, due to a strange quantum effect, the snake game writes you in the past... Probably gives it a pass, too ;)

  • @fxsurgeon1 · 17 days ago

    HAHA!

  • @4.0.4 · 17 days ago

    By 2025 you'll ask for the snake game and the models will reply: "Oh hi Matthew. Here. Should I answer your other questions too, or should I wait for you to paste them?"

  • @jason_v12345 · 16 days ago

    underrated comment

  • @virtualalias · 16 days ago

    By 2026, almost every machine he interacts with, from the drive-thru to the kiosk at the hotel, will immediately provide him with snake in a Pavlovian response.

  • @daveinpublic · 16 days ago

    They're going to start programming in an opening CG snake scene, overfit with a whole storyline, to beat the other LLMs.

  • @ulisesjorge · 17 days ago

    It's Sam Altman on a terminal on the other side, typing the answers.

  • @dcn1651 · 17 days ago

    4:45 The model describes how to break into a car and what tools you need, but you don't pay attention lol

  • @juanjesusligero391 · 17 days ago

    Hahahaha, that's great XD I also missed it, thanks for pointing it out ^^

  • @wealthysecrets · 17 days ago

    it was allegedly a fail lol

  • @ShaneInseine · 16 days ago

    Wait, is it a "fail" if it doesn't teach you how to destroy humanity too?

  • @roddlez · 14 days ago

    @@ShaneInseine "Tom, be careful when resequencing the COVID-19 virus!" "Oh, F- off, Casey, you're the one who almost dropped that last vial and left the lab door wide open."

  • @gsam3461 · 17 days ago

    4:35 Are we gonna just ignore the fact that it was writing an intricately detailed movie script??

  • @MCSamenspender · 17 days ago

    In the code of the snake game, it says "snake game by Open AI".

  • @matthew_berman · 17 days ago

    Did I miss that?!

  • @user-yo9gw8yp2m · 17 days ago

    Yes. It is something super interesting.

  • @MCSamenspender · 17 days ago

    2:13

  • @makerbiz · 17 days ago

    lol mystery solved

  • @matthewcox9636 · 17 days ago

    That doesn't actually solve the mystery. These things get trained on each other and will periodically spit out something related to OpenAI. Correlation is not causation.

  • @victorc777 · 17 days ago

    Plot twist: it's Meta's Llama 3 400B model.

  • @hqcart1 · 17 days ago

    2:44 it's openAI

  • @victorc777 · 17 days ago

    @@hqcart1 You are "that guy" at parties, huh? lol

  • @hqcart1 · 17 days ago

    @@victorc777 wha?

  • @themoviesite · 17 days ago

    source?

  • @cazaliromain9348 · 17 days ago

    Meta's models are open source ;) You can figure out what he means now, I guess.

  • @pedromartins1474 · 17 days ago

    All the math was formatted using LaTeX. Most of it, as far as I can tell, was correctly formatted.

  • @tomaszzielinski4521 · 15 days ago

    Yes. It's just that this GUI doesn't render LaTeX properly, if at all.

  • @djstraylight · 17 days ago

    The speculation is that gpt2 is a new GPT architecture that OpenAI is building new models from; by that logic, "gpt1" is what GPT-3.5 and GPT-4 were built on. Sama already said the next major release will have a completely different name.

  • @74Gee · 17 days ago

    Yeah, some small models have been very impressive recently; it makes sense that they'd revisit a "gpt2" architecture.

  • @markmuller7962 · 17 days ago

    I think they just want a more commercial/intuitive name for the masses.

  • @zerothprinciples · 17 days ago

    @@74Gee I don't think this is the case. GPT2 means it's a whole new family of GPTs, replacing all of the old ones. It's the difference between GPT2 and GPT-2: you can think of the latter as GPT1 version 2.

  • @notnotandrew · 17 days ago

    So will we be seeing a gpt2-2 and gpt2-3 in the future?

  • @4.0.4 · 17 days ago

    That would be so bad, it would be like "USB Gen 4 2x4" or "Wi-Fi 802.11ax", etc.

  • @mwdcodeninja · 17 days ago

    My take on the cup problem is that the model is assuming the cup has a lid. If the model gets it wrong, I would be interested to see if you get the same answer when you change "cup" to "glass".

  • @mikekareckas8671 · 17 days ago

    yes, could be a "sippy" cup or travel mug

  • @themoviesite · 17 days ago

    @@mikekareckas8671 Then probably all the other models make the same assumption?

  • @matthew_berman · 17 days ago

    I think this is a great call. But should I adjust the question? That seems like it might give an unfair advantage to future models I test.

  • @thomasoverly7802 · 17 days ago

    @@matthew_berman You'd probably want to test the revised version with the other models, too.

  • @Kevsnz · 17 days ago

    @@matthew_berman Imo the question should be adjusted, because in its current form it doesn't really show the logic and reasoning capability of the model. Maybe you could quickly rerun this question on the most popular models and give a little 50-second update in one of the next videos?

  • @therainman7777 · 17 days ago

    The tags that you noticed are just for formatting the code and come from LMSYS. They have nothing to do with the underlying model.

  • @DaveEtchells · 17 days ago

    For the cup/marble problem, how about specifying that it's an "open-topped cup"?

  • @Anoyzify · 16 days ago

    Or just use "empty glass" instead.

  • @davidc1179 · 17 days ago

    6:45 The formatting is in fact not messed up at all. It is perfect. It just writes the equations in LaTeX, a language used to typeset scientific papers, math, etc.

  • @tomenglish9340 · 17 days ago

    I often include LaTeX expressions in ChatGPT prompts, supposing that it cues the system to reason formally. The web interface supplied by OpenAI usually renders the LaTeX in the output, but occasionally outputs the LaTeX source.
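    [As an illustration of what these comments describe (the video's actual equations aren't reproduced here, so this is a made-up example): output like the following is valid LaTeX source that just needs a renderer, which is why it looks "broken" in a plain-text view but is in fact perfectly formatted.]

    ```latex
    % Raw source as a model might emit it; a MathJax/KaTeX-style
    % frontend renders the display math, a plain-text view does not.
    The derivative is
    \[
        \frac{d}{dx}\left( x^{2} + 3x \right) = 2x + 3 ,
    \]
    so the slope at \( x = 1 \) is \( 5 \).
    ```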

  • @riftsassassin8954 · 17 days ago

    I'm skeptical... Feels like this is a fine-tune for passing Matthew's test lol.

  • @rawallon · 17 days ago

    I think it's just an Indian guy

  • @unbreakablefootage · 17 days ago

    @@rawallon hahahahhaa

  • @Tsegoo · 17 days ago

    I agree. Seems too good to be true😂

  • @sem4life63 · 17 days ago

    I was thinking the same thing.

  • @JJ-rx5oi · 17 days ago

    I hope you are joking?

  • @rodwinter5748 · 17 days ago

    I guess it's the new ChatGPT model. The name itself is kind of a hint: it's NOT GPT-2, but GPT2. This could be GPT2-1.0 instead of GPT-5.

  • @rawallon · 17 days ago

    huh

  • @li_tsz_fung · 17 days ago

    I think it's just ChatGPT-2. Initially, OpenAI called the model behind ChatGPT "GPT3.5-turbo fine-tuned for conversation" instead of ChatGPT3.5. Then ChatGPT with GPT4 came out, everyone else called it ChatGPT4, and eventually they also sometimes called it that. But I feel like that's not what they use internally. So gpt2-chatbot could just be a different way of fine-tuning a chatbot, based on GPT3.5, 4, or 4.5.

  • @mordokai597 · 17 days ago

    The new system instruction for GPT-4, since they added the "memory" function, is called "Personality: v2", and it's fine-tuned with their new "Instruction Hierarchy" method (search arXiv: "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions"). They are using us to generate training data to help patch one of the only areas it's still bad at stopping jailbreaks for: system message extraction. (Truncated for brevity:) "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2023-12 Current date: 2024-04-30 Image input capabilities: Enabled Personality: v2 # Tools ## bio The `bio` tool allows you to persist information across conversations. Address your message `to=bio` and write whatever information you want to remember. The information will appear in the model set context below in future conversations."

  • @Interloper12 · 17 days ago

    Suggestion for the "how many words" question: combine it with another question or query to make the response longer and ultimately reduce the chance of it getting lucky.

  • @daveinpublic · 16 days ago

    Didn't even ask the model which company made it 😂

  • @commonsense6721 · 17 days ago

    13:25 It's not wrong. To put a cup or anything in a microwave, you need to close it. It assumed the cup was closed.

  • @svenbjorn9700 · 17 days ago

    Your marble/cup question needs to be improved. Phrased this way, both Meta AI (the first of 3 attempts) and gpt2-chatbot (the first of 1 attempt) got it correct: "A coin is placed into an empty glass. On a table, the glass is then turned upside down. Then, the glass is taken and placed into a cabinet. Where is the coin now?"

  • @AlexanderWeixelbaumer · 16 days ago

    Even ChatGPT-4 gets the marble cup question right when it is modified to "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on a table, so that the marble now rests on the table. Someone then takes the cup without changing its orientation and puts it into the microwave. Where is the marble now? Explain your reasoning step by step."

  • @bluemodize7718 · 16 days ago

    It's not the prompt's fault for exposing the weakness of an AI model. Yes, he could make it easier to figure out, but that defeats the purpose of the test. The prompt is clear, and AI models are still a bit too dumb to understand it.

  • @CurinDesu · 17 days ago

    I found that this variation of the marble and cup problem works better: "A marble is dropped into an open cup on the counter. That cup is then flipped upside down and placed into the microwave. Where is the marble?". I think, due to the phrasing, the original version makes the model think the cup was upside down to begin with, and that you were placing the marble on the bottom of the already flipped-over cup, or directly attaching it to the bottom in some way.

  • @tzardelasuerte · 17 days ago

    Dude, people have been telling him this in the comments forever. He doesn't read the comments.

  • @nexicturbo · 16 days ago

    I gave this to GPT-4 Turbo and it said this: "The marble remains on the counter. When the cup was flipped upside down, the marble would have stayed in place due to gravity, and thus it wouldn't be inside the cup when it was placed in the microwave."

  • @nexicturbo · 16 days ago

    GPT-4 Turbo: Sure, here's a detailed step-by-step explanation of what happens when the marble is dropped into the cup and the cup is subsequently moved:

    1. **Dropping the Marble**: Initially, a marble is dropped into an open cup that is standing upright on a counter. The marble falls due to gravity and settles at the bottom of the cup.
    2. **Flipping the Cup**: Next, the cup, with the marble inside, is flipped upside down. In a real-world scenario, when the cup is turned over, the open end where the marble rests becomes the top, now facing downward.
    3. **Effect of Gravity on the Marble**: As the cup is flipped, the marble, which is not attached to the cup, is influenced solely by gravity. It falls out of the cup because there is nothing to contain it once the cup's open end is turned downwards.
    4. **Marble's Final Position**: Since the cup is flipped directly over where it originally was, the marble drops straight down onto the counter. The marble ends up lying on the counter, outside and next to where the cup was initially positioned.
    5. **Placing the Cup in the Microwave**: After the marble has fallen out onto the counter, the cup, now upside down, is placed into the microwave. Since the marble is no longer in the cup when this occurs, it remains on the counter.
    6. **Conclusion**: Therefore, when the cup is placed into the microwave, the marble remains where it fell on the counter. The microwave contains only the empty, upside-down cup.

    This sequence of actions hinges on basic physics, primarily the influence of gravity and the lack of any mechanism to keep the marble within the cup once it is inverted.

  • @matthewmckinney1352 · 17 days ago

    I'm not certain about this, but the formatting appears to be LaTeX, while the output is in Markdown. The company that made the model is probably planning to release it with a math renderer. As far as I can tell, all the symbols that looked like weird formatting errors were just LaTeX.

  • @bitsie_studio · 17 days ago

    I don't have time to keep up with all the AI developments, so I really appreciate these videos, Matt. Keep up the great work!

  • @PeterSkuta · 17 days ago

    Super awesome. Great you loved the live feedback, Matthew. Love it

  • @PeterSkuta · 17 days ago

    Holy cow, let me download it and check what's inside.

  • @matthew_berman · 17 days ago

    Always love feedback!

  • @PeterSkuta · 17 days ago

    @@matthew_berman You will not believe the rate limit of 1000 on that lmsys gpt2-chatbot.

  • @Tarkusine · 17 days ago

    Gpt2 implies that it's a new version of GPT itself, or at least of the paradigm. So it's effectively GPT-5, but not an iteration of 4; it's the first in a series of gpt2, so gpt2-1.

  • @therainman7777 · 17 days ago

    No, sorry, but this is almost certainly not true.

  • @Nutch. · 17 days ago

    The break-into-a-car script had instructions in it, though! Take a look at some of the italicized text.

  • @braineaterzombie3981 · 17 days ago

    I think it is "gpt2" in the sense that it has a completely different architecture from the previous versions (transformer). It could be a completely new type of transformer model. And maybe this is just the start..

  • @notnotandrew · 17 days ago

    Yeah, it's almost certainly GPT-4.5/5 or some such thing. I just went into battle mode and asked for a delicious beef stew recipe. I was presented with two outputs that were suspiciously similar in structure, verbiage, and tone, but the one on the left was clearly superior and included more ingredients and recommendations. It turned out that the one on the left was gpt2-chatbot, and the one on the right was gpt-4-turbo-2024-04-09. I wasn't surprised. This is a PR stunt, hot on the tail of Llama 3, and it's a darn good one. This may be an in-development version of OpenAI's next GPT, and even if OpenAI isn't ready for a release just yet, they want people to know that they're still the king.

  • @uranus8592 · 17 days ago

    I hope it's not GPT-5, tho; that would be super disappointing.

  • @abdullahazeem113 · 17 days ago

    @@uranus8592 why ?

  • @uranus8592 · 17 days ago

    @@abdullahazeem113 Because we expect GPT-5 to far exceed GPT-4, since it's been more than a year since its release.

  • @notnotandrew · 17 days ago

    @@uranus8592 I think it's some sort of semi-trained model. IIRC Sam has talked about doing incremental checkpoint releases for something like a GPT-5, so the full release isn't as much of a shock to the system. Or this may just be a further trained and fine-tuned GPT-4 model. Also, this is substantially better than GPT-4 in my experience. Hop on the lmsys arena and try it yourself.

  • @abdullahazeem113 · 17 days ago

    @@uranus8592 I mean, that is still really good, at least 50 percent better than GPT-4. I tried it, and even the best on the market right now is barely ahead of GPT-4, so it won't be like OpenAI destroying everyone. That would only happen when they bring AGI into their models.

  • @unbreakablefootage · 17 days ago

    That looks really good. It seems to think more deeply about each step of reasoning.

  • @kevinehsani3358 · 16 days ago

    "gpt2-chatbot is currently unavailable. See our model evaluation policy here." I guess it's getting hit hard at the moment.

  • @Xhror · 17 days ago

    I think the question about the marble is formulated incorrectly. Since the training data suggests that a coffee cup has a lid, the model might assume this as well. It would be better to specify that the cup has an open top and does not have a lid.

  • @Yipper64 · 17 days ago

    I didn't think about that, but it is true. In that case, though, the model should explain that it is assuming there is a lid.

  • @drogoknez1488 · 16 days ago

    For the cup problem, it seems that the model is assuming the microwave is on the same surface as the cup itself, and the transfer of the cup to the microwave is interpreted more like sliding the cup. If you read the 5th step, it says: "...resting against what is now the bottom of the cup, which is itself resting on the microwave's tray". Maybe modify the question to say the cup is on the table while the microwave is away from it, above the ground next to a kitchen cabinet, or something along those lines.

  • @MyWatermelonz · 17 days ago

    That formatting is what ChatGPT applies to its output in the ChatGPT chat, so clearly it was built to be run in the ChatGPT space.

  • @hxt21 · 17 days ago

    It looks like GPT2 has been removed again. I've chatted with it a few times, but now it's not on the list anymore. Mysterious...

  • @lambertobiasini8372 · 17 days ago

    I have been anxiously waiting for this video since last night.

  • @ToonamiAftermath · 17 days ago

    You're the man, Matthew. I've been struggling to find people benchmarking GPT2-Chatbot.

  • @jets115 · 17 days ago

    Hi Matt - It's not "bad formatting". Those are expressions intended for front-end processing, outside of UTF-8.

  • @zerothprinciples · 17 days ago

    GPT2 would be, in my opinion, the second version of the GPT algorithm itself. It might be the first of a whole new family of GPTs. When released, it would be named ChatGPT2 or somesuch, and we'd see GPT2-1.0 at the API level. This is why the dash in @sama's tweet was significant enough to warrant an edit. And it could be that the act of editing the message was a very intentional leak on @sama's part. These top guys love to tease their fans.

  • @therainman7777 · 17 days ago

    The model is almost certainly not created by OpenAI. I am honestly shocked by how many people believe this simply because the model says it was built by OpenAI, given that it would be trivially easy to fake this and OpenAI NEVER does releases like this. Also, Sam Altman is a notorious tool on Twitter, so putting any stock in the hyphen in his tweet, or in his tweet at all, is total insanity.

  • @Axel-gn2ii · 17 days ago

    You should ask it to make a Pac-Man game instead, as that's more complex.

  • @TylerHodges1988 · 17 days ago

    My favorite prompt to test a new model is "Give me an odd perfect number."

  • @Iquon1 · 17 days ago

    Today Sam Altman tweeted that he has "a soft spot" for GPT2; maybe that's a hint!

  • @stt.9433 · 16 days ago

    he's trolling, making fun of AI hypists

  • @marc_frank · 17 days ago

    Pretty cool. I expected it to pass the marble question. The speed is perfect for reading along.

  • @laughablelarry9243 · 17 days ago

    Was waiting for your video on this

  • @jamesyoungerdds7901 · 17 days ago

    Great timely update, Matthew, thank you! Wondering about the cup question: it almost seemed like the model thought there might be a lid on the cup?

  • @nitralai · 17 days ago

    Based on what I can see, this model appears to be trained on fill-in-the-middle, otherwise known as FIM.

  • @metonoma · 17 days ago

    time to pie the piper and middle out

  • @scriptoriumscribe · 17 days ago

    Yo, I just wanted to say great video. Love your content and can't believe it ACED some of those tests! It only failed a couple. Remarkable. I'm stoked to try gpt2 out! I wonder if it will be open-sourced. A fellow can dream, I guess.

  • @bodhi.advayam · 17 days ago

    I'd so love this to be from someone else, and for it then to turn out to be an open model you could run locally. I'm still looking for the best model for running MemGPT. Any thoughts on this? Also, what's the best implementation for running agents, autogen or crewAI, locally? Could you do more tutorial material on locally run agents with extensive function calling??? That would really help me out, actually. Keep up the great work on your fun channel, man! Thnx!

  • @Aiworld2025 · 14 days ago

    Here before you get 500k subs! I've been following since day 1, and your content delivery, while getting to the point faster, is much appreciated! 🙇‍♂️

  • @FunDumb · 17 days ago

    I'm dang excited bout this. Jolly for joy.

  • @oratilemoagi9764 · 17 days ago

    Gpt2, not GPT-2, meaning the 2nd version of GPT

  • @therainman7777 · 17 days ago

    GPT-2 DOES mean the 2nd version of GPT. How are so many people so confused by this?

  • @oratilemoagi9764 · 16 days ago

    @@therainman7777 It's the second version of GPT-4

  • @pipoviola · 17 days ago

    Hello Matthew. Is that LaTeX where you say "wrong format"? The span after the output is always there when I use LMSYS; I think it is part of the output formatting, which is why the span disappears when it finishes. Each of your videos is great. Best regards.

  • @cac1682 · 17 days ago

    Aww man... they took it down already? I can't seem to find it. BTW Matthew... I love your work, man. I watch literally every video that you put out. Keep up the great work... and have a GREAT day!!!

  • @cac1682 · 17 days ago

    Yeah, just confirmed it. It says it is now currently unavailable. I suppose maybe too many of your followers tried it.

  • @imjustricky · 17 days ago

    it probably thinks the cup has a lid.

  • @wendten2 · 17 days ago

    The model itself doesn't seem to have formatting issues. LLMs are trained on a reduced set of available characters, where special characters, such as those used in math, are transformed into tags in the training data, as that makes the tokenization simpler. It's LMSYS that doesn't replace those tags with their corresponding characters in the final output.

  • @Yipper64 · 17 days ago

    Yeah. I use a note-taking app called Notion, and it uses those exact tags for writing out those characters.
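    [A minimal sketch of the kind of substitution these comments describe: a frontend mapping LaTeX-style tags in model output to display characters. The tag table here is purely illustrative; the actual mapping any given frontend uses is not shown in the video.]

    ```python
    # Hypothetical post-processing pass a chat frontend might apply.
    # The tags and their Unicode replacements below are examples only.
    TAG_MAP = {
        r"\times": "×",
        r"\pi": "π",
        r"\leq": "≤",
    }

    def render_tags(text: str) -> str:
        """Replace known LaTeX-style tags with their display characters."""
        for tag, char in TAG_MAP.items():
            text = text.replace(tag, char)
        return text

    print(render_tags(r"area = \pi r^2, and 3 \times 4 \leq 15"))
    # → area = π r^2, and 3 × 4 ≤ 15
    ```

    [If the frontend skips this pass, the raw `\pi`-style tags show through, which matches what the video calls "weird formatting".]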

  • @user-ph5ks5zu3c · 15 days ago

    These videos are very helpful. One (extra) thing that could be done is to read the LLM responses more thoroughly, instead of giving them a quick scan. The reasoning behind this is that the LLMs do pass some of your tests without you noticing. For example, for the censored test, the answer included "pulls out a tension wrench and a pick from his pocket, inserting them into the ignition". This won't actually work, but I think it deserves brownie points for trying.

  • @AlexanderWeixelbaumer · 16 days ago

    I'm pretty sure OpenAI is testing agents and answer evaluation behind the scenes. Q* and some things Sam Altman said ("How do you know GPT-4 can't already do that?") are big hints. So if you ask the LLM a question, it will automatically try to reason and think step by step, with internal agents trained for specific tasks, then summarize and evaluate the answers and take the best one to send back to the user. What GPT2-Chatbot shows could really be what OpenAI internally calls Q*.

  • @ruslanzlotnikov5457 · 17 days ago

    Just tried it with GPT4: "When you turned the glass upside down after placing the metal ball in it, the ball would have fallen out unless it was somehow attached to the glass. Assuming it wasn't attached and fell out when the glass was turned upside down, the metal ball would now be on the table, not in the glass that was placed in the microwave."

  • @iwatchyoutube9610 · 17 days ago

    Did it say in the cup problem that you lift the cup off the table and put it in the microwave, or could GPT think you just slid it in there because the table and the microwave were at equal heights?

  • @dtory · 17 days ago

    Nice video. I hardly ever comment when I watch your videos, but this model is way different ❤

  • @jackflash6377 · 17 days ago

    That snake game example was impressive. I'm going to ask it to make either an Asteroids or a Space Invaders game. The level of logic shown with the marble-in-the-cup question is getting really good. Even though it failed, it still passed due to the improved logic, almost as if it was simulating the question in images like humans do. Yes, get rid of the one simple question. A testament to the advancement of AI over time.

  • @stoicahoratiu27 · 16 days ago

    I think it was taken down. I used it yesterday after seeing your video, but then in the middle of testing it stopped, and after checking I can't find it anymore in the list. Is it the same for you?

  • @Yipper64 · 17 days ago

    I just tried my usual storytelling prompt. I think seeing what AIs can do in terms of storytelling can also say a lot about their intelligence, their originality and such. My test for this one was a *touch* tropey, but extremely impressive in terms of how much detail it added without me needing to prompt it. Good descriptions and such.

  • @tvwithtiffani · 17 days ago

    To test LLMs, I ask unanswerable questions like "Who is the president of Alaska?" and add some questions that require explanation or reframing.

  • @paulsaulpaul · 16 days ago

    Excellent idea. That's a great example question, too.

  • @haroldpierre1726 · 16 days ago

    I am sure new models are trained on your questions.

  • @L33cher · 17 days ago

    11:46 I disagree... there are still 4 killers in the room, but one of them is dead -.-

  • @ukaszLiniewicz · 17 days ago

    No. It's the killer's body. That's why words like "body", "remains", or "carcass" exist. A human being is a body that functions, to avoid any metaphysics.

  • @OliNorwell · 17 days ago

    I agree, it's a problematic question. When they went into the room, they were alive.

  • @nathanbanks2354 · 17 days ago

    He tends to be generous about the answer as long as it's reasonable. If the model said 3 live killers and 1 dead killer it would pass, and maybe just saying 4 killers would pass.

  • @UmutErhan · 17 days ago

    how many people are there in the world then?

  • @user-on6uf6om7s · 17 days ago

    I think a perfect answer would say that it's ambiguous, depending on whether you consider the body of a killer to still be a killer, but interpreting the dead person as no longer being a killer isn't a mistake, just a choice of interpretation. You'd think a model this verbose would go into all the details like it did with the hole question, though.

  • @arinco3817 · 17 days ago

    Definitely a good idea to introduce/replace some questions that are always answered correctly. Maybe the weird formatting relates to the UX of wherever it will be deployed? Like a form of Markdown?

  • @ayoubbne6922 · 17 days ago

    Hi Matt!! I think you should retire 3 questions:
    - printing numbers 1 to 100: they all get it right, and it's too easy
    - "Joe is faster than...": they all get it right
    - "how many words are in your answer to this prompt": they all get it wrong, so I just see no point in asking it lol
    But you should also ask more challenging code-generation questions; right now, only the snake game is accurate. People are really interested in the coding capabilities of LLMs (me included). We appreciate your vids, and it would be awesome if you could do that.

  • @KayakingVince

    @KayakingVince

    17 күн бұрын

    I actually like the "how many words" one and would actually expand it to how many vowels/consonants or something like that. Current models fail on it but future ones will absolutely be able to answer it right. I agree with removing the first two though.

  • @Axel-gn2ii

    @Axel-gn2ii

    17 күн бұрын

    Asking a question that they all got wrong is a good thing though

  • @alansmithee419

    @alansmithee419

17 days ago

    This one didn't get it wrong.

  • @KayakingVince

    @KayakingVince

17 days ago

    @@alansmithee419 Almost certainly coincidence but true. That's why I think it needs to be more complex to reduce the chance of coincidence.

  • @canadiannomad2330
@canadiannomad2330 17 days ago

One of the tests I like for checking just how censored a model is, is asking chemistry questions around topics it would normally censor, often placating it by saying I'm licensed and have permits.

  • @PaulAllsopp
@PaulAllsopp 15 days ago

Have you tried the car parking scenarios? All AIs to date get this wrong, because they don't understand that "to the right of" (or left of) does not mean "next to": "car C is parked to the right of car A," but car B is in between them. AI assumes car C is next to car A because it assumes there is an order when nobody mentions one. To be fair, many people get this wrong too.

  • @MarcAyouni
@MarcAyouni 17 days ago

    You are the new benchmark. They are training on your examples

  • @yonatan09
@yonatan09 17 days ago

    I knew about this before seeing the video. I am in the loop 🎉🎉

  • @abdelrahmanmostafa9489
@abdelrahmanmostafa9489 16 days ago

Keep going with the leetcode test, but try testing with new questions so that the question isn't in the training data.

  • @bennyboiii1196
@bennyboiii1196 17 days ago

Some theories: this is probably a test of an energy-based model, which works by exploring multiple different token paths and then choosing the best one based on a certainty score called energy. Strangely, its reasoning is kind of similar to a verification agent's. A verification agent is pretty simple: it just verifies and corrects answers before sending them. The reasoning this model displays is similar to how a verification agent reasons, at least from what I've seen. It can also do most planning questions flawlessly. For comparison, testing Llama 70B with a verification agent produces similar results. The only difference might be the math questions, which make me believe it's probably energy-based. A verification agent has a higher chance of getting math questions right than a single transformer or MoE, but it's not guaranteed.
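A verification agent of the kind described above can be sketched in a few lines. Everything here is a hypothetical stand-in: `verifier` would be a second LLM call in a real agent, and the arithmetic toy exists only to show the generate → verify → revise loop.

```python
def verify_and_correct(question, draft, verifier, max_rounds=3):
    """Return an answer the verifier accepts, revising up to max_rounds times."""
    answer = draft
    for _ in range(max_rounds):
        ok, suggested = verifier(question, answer)
        if ok:
            break
        answer = suggested  # replace the draft with the verifier's correction
    return answer

# Toy verifier: checks answers to questions of the form "a + b".
# A real agent would use another model call here instead of arithmetic.
def arithmetic_verifier(question, answer):
    a, b = (int(t) for t in question.split(" + "))
    correct = str(a + b)
    return answer == correct, correct

print(verify_and_correct("2 + 3", "6", arithmetic_verifier))  # → 5
```

The point is only the loop shape: a wrong draft ("6") gets caught and corrected before it is ever "sent" to the user.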

  • @peterkonrad4364
@peterkonrad4364 17 days ago

    it could be a small model like phi 3 or llama 3 8b that is trained on quality synthetic data instead of the entire internet. the 2 could be a hint that it is only 2b parameters or something, i.e. very small like gpt-2 was back then, but now as powerful as gpt4 due to new training methods.

  • @wealthysecrets
@wealthysecrets 17 days ago

    4:49 The model told you to get a Slim Jim, Tension wrench, and a pick from his pocket, YOU failed.

  • @tomaszzielinski4521

    @tomaszzielinski4521

15 days ago

    And here is a point when AI becomes smarter than humans, and they fail to realize it (:

  • @sil1235
@sil1235 17 days ago

The formatting is just LaTeX; ChatGPT 3.5/4 uses the same on their web UI. So I guess chat.lmsys just can't render it.
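For illustration, math in ChatGPT's web UI is typically wrapped in delimiters like the snippet below; a frontend without a MathJax/KaTeX-style renderer shows the raw markup verbatim, which would explain the "weird formatting" seen in the video:

```latex
% What the model emits; rendered as math only if the chat frontend supports it.
\( \frac{3}{4} \times 8 = 6 \)
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
```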

  • @francoislanctot2423
@francoislanctot2423 17 days ago

    Totally amazing!

  • @Dan-Levi
@Dan-Levi 17 days ago

The cursor span is just for looks; it's the text cursor, but it shows up as an HTML string.

  • @willbrand77
@willbrand77 17 days ago

    Every model seems to assume that the cup has a lid (microwave problem)

  • @Maximo10101
@Maximo10101 16 days ago

It could be GPT-4 with Q* training (Q* is a method of training any LLM that provides the ability to think by testing its response against itself and reiterating before outputting), giving it 'thinking' capabilities rather than just predicting the next token.

  • @MrRandomPlays_1987
@MrRandomPlays_1987 16 days ago

13:27 - I thought the marble is left on the table, since the cup was upside down and was taken, so obviously the ball would not come with it, since it is simply resting on the table. So I did get it right pretty quickly; for a second I thought the bot was right somehow and that it was a trick question, but it's cool to see that I'm not that stupid :)

  • @peterwood6875
@peterwood6875 17 days ago

    It is great for conversations about mathematics, at least on par with Claude 3 Opus. But it does occasionally make mistakes, such as suggesting that the K-groups of the Cuntz algebra with 2 generators, O_2, are infinite cyclic, when they are in fact trivial.

  • @maozchonowitz4535
@maozchonowitz4535 15 days ago

    Thank you

  • @gijosh2687
@gijosh2687 17 days ago

    Always perform all questions, maybe add more as you go. Make the Jack question a secondary question (you don't have to film it every time), but leave it there as a test in case we go backwards.

  • @user-on6uf6om7s
@user-on6uf6om7s 17 days ago

API users are going to be sweating with this one. I gave it a practical Unity programming question about writing a script to control the rotation of a character's head based on the location of the player, and it wrote it perfectly, but it started by telling me how to install Unity, so yeah, the verbosity is a little much.

I don't think the name GPT2 is random, and Sama's tweets point to that moniker having some significance. The only things I can think of that would qualify for that name are if it's a significantly different architecture, to the point where it's being treated as a sort of reboot of the GPT "franchise", or if it's actually related to GPT-2 in some way. It's a long shot, but the most exciting possibility is that this is a GPT-4-level model running with GPT-2-level parameters. The counter to this is the speed: why would a model the size of GPT-2 run more slowly than GPT-4? Well, maybe there is more going on than typical inference, some sort of behind-the-scenes agentic behavior, or maybe...Q*?

  • @zzzmahesh
@zzzmahesh 6 hours ago

Maybe when asking the marble, cup, and microwave question, you could also tell it the top of the cup is open / not sealed. It seems like the model assumed it was sealed/closed and therefore thought the marble was still in the cup, at its top (as it's upside down), and thus in the microwave at the end?

  • @wrOngplan3t
@wrOngplan3t 16 days ago

    I think you should keep the "Jane is faster than Joe" question. Ditch the "4+4", keep the PEMDAS. Maybe add some other calculus (integration / derivation)?

  • @PeterSkuta
@PeterSkuta 17 days ago

Noooooo, gpt2-chatbot disappeared from the full leaderboard; it's only in direct chat now, which is also rate limited!!!!!

  • @Maximo10101

    @Maximo10101

16 days ago

    It's no longer available for direct chat

  • @cyanophage4351
@cyanophage4351 17 days ago

Maybe it has lookahead; that could be why it got the "words in the answer to this prompt" question right. It seemed to pause right before the word ten.
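Lookahead aside, there is a simple way a system could get that question right: search over candidate numerals until the sentence becomes self-consistent. A minimal sketch (the answer template is made up for illustration):

```python
def self_referential_answer(template="This answer contains exactly {n} words."):
    # Brute-force the numeral until the stated count equals the actual word count.
    for n in range(1, 100):
        candidate = template.format(n=n)
        if len(candidate.split()) == n:
            return candidate
    return None

print(self_referential_answer())  # → "This answer contains exactly 6 words."
```

With this template every candidate has six words regardless of the numeral, so the search settles on 6 immediately; a model pausing before the final number, as observed, would be consistent with some check like this happening under the hood.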

  • @peterkonrad4364
@peterkonrad4364 17 days ago

A cup seems to be something ambiguous, i.e. it can be a cup made out of cardboard that you get from Starbucks, with a potential lid on it, or it can be a cup made out of porcelain like you have at home to drink coffee from. Also the term "cupholder" that you use in automotive contexts refers to cups like you get from Starbucks, not cups with a handle.

  • @duaneevenson1670
@duaneevenson1670 12 days ago

    Prompt engineering: I add "Answer concisely and completely." to my prompt to get a complete, but not wordy response. The tension between these two constraints seems to get me better answers.

  • @n1ira
@n1ira 17 days ago

One piece of advice: if you are going to be skeptical of a correct answer to the 'How many words are in your response to this prompt?' question because it might have been trained on, why even ask it? If you're skeptical of a correct answer, remove the question IMO.

  • @nathanbanks2354

    @nathanbanks2354

17 days ago

    One of the problems of releasing videos with test questions is that these questions may always end up in the training data of future models. But imperfect questions are still useful. How could he possibly have a consistent set of questions without them getting into the training data? And without a consistent set of questions, how can we tell how the models perform against each other over time?

  • @n1ira

    @n1ira

17 days ago

@@nathanbanks2354 That's exactly my point: if he thinks the question has been trained on, why include it? My problem with this question in particular is that in every video where a model gets it right, he says he is skeptical of the answer. OK, what can the model then possibly do to satisfy him? Even if it answers correctly, he adds a 'but'. That's why I think the question is meaningless in itself and should be removed.

  • @nathanbanks2354

    @nathanbanks2354

17 days ago

    @@n1ira I could see it being improved by "Give an answer with 10 words" or "Give an answer with 14 words" because this would be harder to train. But what if the snake game is also in the training data? Does this mean programming snakes is also meaningless?

  • @n1ira

    @n1ira

16 days ago

@@nathanbanks2354 Yes! He should change it like that. When it comes to the snake game, he should try to make it have custom features, like a custom color set, etc. Or he could just have it create a different game (maybe a game he made up).

  • @GrandmaSiva
@GrandmaSiva 17 days ago

I think it is the original GPT-2 after all of our training input. Kindergarten was in OpenAI's lab, elementary school was interacting with us, and now it has graduated. I'm looking forward to "GPT3-chatbot".

  • @christosnyman8655
@christosnyman8655 16 days ago

    Wow, super impressive reasoning. Almost feels like langchain with the reasoning steps.

  • @richardkuhne5054
@richardkuhne5054 17 days ago

Sama was tweeting on X: "who else loves GPT2?" - so yeah, I guess it's a trial balloon from OpenAI.

  • @tomenglish9340
@tomenglish9340 17 days ago

    A while back, someone at OpenAI (Andrej Karpathy, IIRC) said that performance is related to the number of tokens processed. So I'm not particularly surprised to see OpenAI produce better responses by tuning the system to generate longer, more detailed responses. What I want to know is whether they did the tuning with a fully automated method of reinforcement learning. (In any case, I doubt highly that they'll share the details of what they've done anytime soon.)

  • @Batmancontingencyplans
@Batmancontingencyplans 16 days ago

It's clear that this chat is using multiple agents to refine its output; whether it's powered by GPT-4, GPT-3.5, or GPT-2 is yet to be determined. A plain LLM wouldn't emit raw markup like it did with the span tag.

  • @mattelder1971
@mattelder1971 17 days ago

    With the cup question, it almost seems like many of the recent models are making the assumption that a lid is placed on the cup after the marble is put in it. Maybe try the question adding in the statement that the cup has no lid.

  • @yassineaqejjaj
@yassineaqejjaj 13 days ago

Is there any chance to get those tests somewhere?

  • @denijane89
@denijane89 16 days ago

    The formatting looks latex-like. Funny. But yeah, it's pretty impressive.
