Real Gemini demo? Rebuild with GPT4V + Whisper + TTS

Science & Technology

How do you build a Jarvis-like, super-interactive AI that can listen, watch, and talk back? We rebuilt the Gemini demo with GPT4V + Whisper + TTS; here is how it really performed…
Build AI powered ad assets at scale with Hubspot campaign assistant for free: www.hubspot.com/campaign-assi...
🔗 Links
- Follow me on twitter: / jasonzhou1993
- Join my AI email list: crafters.ai/
- My discord: / discord
- Github - Gemini demo with GPT4V: www.crafters.ai/aitools/rebui...
⏱️ Timestamps
0:00 Quick demo
1:41 Project plan & challenges
3:11 Open source Gemini demo & overview
9:37 Project setup
11:22 Setup video recorder
14:37 Setup silence aware audio recorder
16:36 Create img grid
19:44 Whisper
24:31 Connect to GPT4V
27:36 Streaming result & TTS
29:19 Demo
👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
#gpt4v #gemini #autogen #gpt4 #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #chatgpt #largelanguagemodels #largelanguagemodel #bestaiagent #chatgpt #agentgpt #agent #babyagi

Comments: 66

@show-me-the-data 5 months ago
I'm genuinely honored that you thought of making the same thing as I just did on my channel! I like how you made it the same style as the original demo! One advantage of your project over mine is that since you're using the audio recording browser API instead of the speech recognition one, you open the app up to choosing different microphones (webkit speech recognition doesn't support that). Yours also supports more than webkit browsers. Great work as always Jason! 👋
@gabrieleguo 3 months ago
Thanks man, huge fan here. Always love the projects you do and how you explain the whole process, keep it up man!
@kennycommentsofficial 5 months ago
Just found your channel yesterday and I've now watched all of your videos. You do great work! I'm now subscribed and ready to get my hype on next time you hit us with a let's get it.
@Dron008 5 months ago
What a great and big piece of work! You are very productive.
@bastianstrauss 5 months ago
Hi Jason, thanks for the next sleepless nights spent understanding and reproducing the code. You did a great job!
@SamiSabirIdrissi 5 months ago
Amazing!! Good job!👏🏼
@jaanireel 5 months ago
00:05 Gemini demo showcased potential of GPT-4V + Whisper + TTS
02:05 GPT-4V + Whisper + TTS for video transformation and interaction challenges
07:48 Build a Gemini demo using GPT4V, Whisper, TTS
09:41 Setting up a Next.js project with necessary packages and customizing functionality
13:41 Adding video component and functionality for recording and streaming video and audio
15:41 Dynamically update volume indicator in the UI
19:25 Implementing Whisper for transcription
21:22 Providing optimal infrastructure for API
25:10 Integrating OpenAI and Vercel AI for real-time video analysis
27:05 Integration of GPT4V, Whisper, and TTS for text to speech functionality
31:11 Recreate Gemini demo with GPT-4V, Whisper, and TTS
@Palmac-ep1hg 5 months ago
This is great! Would it be possible to provide a quick response before the vision response, with a quicker/smaller model responding to the speech-to-text prompt? Something along the lines of thinking fast and slow? That would fill the time spent waiting for the right answer with something like "Hmm, let me think a bit. Which jacket to wear... That is indeed a good question..." Just as we humans do when asked a complex question.
@kambiz.h
3 months ago
Hi there, that is what I was thinking about as well. Did you figure out a proper solution for your suggested point?
@tdb2012 5 months ago
This was excellent!
@aaronfallis 5 months ago
great demonstration
@pointlesspos8440 5 months ago
If you want the AI to know when it can process, without having to 'press enter', you would process each incoming token. You can then establish a 'conclusion' threshold. This would either be 'enter' or the point at which the AI has heard enough and wants to answer. This will lead to what I call 'the interrupting AI': like a person who constantly interrupts the conversation, it sporadically decides the user has said enough and that it is ready to speak.
@RealLexable 5 months ago
Extraordinary... time to build a new friend soon 😮
@phinneasmctipper7518 5 months ago
That is super cool! Finally something that is tangible and not a marketing video. I was wondering if it is possible to decrease latency by starting to stream the TTS response, rather than waiting for the audio buffer to be fully created. It seems possible in the Python package, but the Node.js package does not allow for such a solution (yet)
@lightweightabsorbent2008
5 months ago
We were already aware that this was 'tangible'; many indie devs have been doing it since GPT-4V came out. (Why would we show it when it's so simple that a child could figure out how to do it using 3 endpoints?) As for streaming the audio, you can do so directly from the OpenAI endpoint. Now, the biggest issue is how lackluster the OpenAI API is at responding. Unless OpenAI upgrades, there is no way to 'make it faster' than what is shown in the video. I've had responses take upwards of a minute just for reading a simple wiki page.
@shivamkumar-qp1jm
5 months ago
We can stream the audio as well. I have built the same thing using OpenAI TTS.
@erics.a.288
1 month ago
Why is it not available with node.js??
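The thread above discusses streaming the TTS audio instead of waiting for the full buffer. One way this could be sketched without relying on any SDK feature is to call the OpenAI speech endpoint over plain HTTP and pass the response body through as it arrives. This is a minimal sketch, not the code from the video: `streamSpeech` and `FetchLike` are hypothetical names, the `tts-1` model and `alloy` voice are example choices, and a runtime with global `fetch` and `Response` (for example Node 18+) is assumed.

```typescript
// Hypothetical type so a fake fetch can be injected for local testing.
type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

// Requests speech for `text` and returns a Response whose body is the
// upstream byte stream, forwarded without buffering the whole file first.
async function streamSpeech(
  text: string,
  apiKey: string,
  fetchImpl: FetchLike = fetch
): Promise<Response> {
  const upstream = await fetchImpl("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // Model and voice are example values, not taken from the video.
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });

  // Surface upstream failures as a gateway error instead of empty audio.
  if (!upstream.ok || !upstream.body) {
    return new Response("TTS request failed", { status: 502 });
  }

  // Passing the ReadableStream through lets the client start receiving
  // audio chunks as soon as the upstream sends them.
  return new Response(upstream.body, {
    headers: { "Content-Type": "audio/mpeg" },
  });
}
```

In a Next.js route handler, the returned `Response` could be handed back directly. Whether playback starts earlier also depends on how the browser side consumes the stream.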
@elie2222 5 months ago
Good stuff 🎉
@RolandoLopezNieto 5 months ago
Awesome guys
@micbab-vg2mu 5 months ago
Great video - thank you - you have built Gemini Ultimate:)
@sandeepphophaliya2456 5 months ago
Excellent work..
@shubhamdayma5209
5 months ago
I was shocked to see you here. I thought you were a sales numbers enthusiast. Btw I am trying to build something similar (along with style heads) with open source LLMs. - From Ex Emp. Sigma Jod.
@fabriziocasula 5 months ago
it is incredible
@kenchang3456 3 months ago
Amazing
@stephennfernandes 5 months ago
How do you use video and enable GPT4V to use the video's temporal information? I believe GPT4V only supports images and not video.
@paulevans3060 4 months ago
Question: is there a way to teach GPT-4 / GPTs a sequence and have it store that information in its Knowledge? Say I teach it to move a red ball to a box. GPT-4 will store this information; then I test with a yellow ball (incorrect colour) and get GPT-4 to tell me my movement was incorrect, and even stop me before I complete the movement of putting the yellow ball in the box?
@alpineai 4 months ago
2:11 = 🔥🔥🔥
@fabriziocasula 4 months ago
Hello Jason, I want to try running this app on my iPhone but the camera doesn't work. Can you help me? With other apps the camera works fine.
@dailywisdomquotes518 5 months ago
I had the UI up and running, but it isn't able to process anything, not sure why.
@minutecodingtips 5 months ago
Is there a github repo for all the code you wrote sir?
@GianniDAlerta 5 months ago
Would be cool to see if vision can understand a person's sentiment and/or speaking. Are they happy or sad, or could it recommend how the speaker could improve the way they present or deliver a speech?
@gauravgarg-wc4zl 5 months ago
Fully functional. I had to make one fix in the src/app/api/chat/route.ts file: it was messages: [{ role: "system", content: systemMessage(lang) }].concat( messages ), and I changed it to messages: [{ role: "system", content: systemMessage(lang) }, ...messages], could be related to the openai version that I have. Another great video 😊
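The comment does not say what error appeared, so the following is only one plausible explanation, sketched under assumptions. With `.concat`, TypeScript first infers the array literal's element type from the single system message (string content) and may then reject messages whose content is a list of text and image parts. Spreading into a literal that is checked against the full message type avoids that. The `ChatMessage` type and the body of `systemMessage` below are hypothetical stand-ins, not the project's real definitions.

```typescript
// Hypothetical message shapes, loosely modeled on chat messages that can
// carry either plain text or a mix of text and image parts.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string | Array<TextPart | ImagePart>;
};

// Stand-in for the systemMessage(lang) helper named in the comment.
function systemMessage(lang: string): string {
  return `You are a helpful assistant. Reply in ${lang}.`;
}

// Prepends the system message using spread, so the whole literal is
// checked against ChatMessage[] rather than a narrower inferred type.
function buildMessages(lang: string, messages: ChatMessage[]): ChatMessage[] {
  return [{ role: "system", content: systemMessage(lang) }, ...messages];
}

// Example: a user message that mixes text with an image part.
const example = buildMessages("English", [
  {
    role: "user",
    content: [
      { type: "text", text: "What is in this picture?" },
      { type: "image_url", image_url: { url: "data:image/jpeg;base64,..." } },
    ],
  },
]);
console.log(example.length);
```

At runtime both forms produce the same array; the difference, if this explanation applies, is only in type checking.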
@ibrahimhalouane8130 5 months ago
Can you re-make it using only open LLMs? (e.g. BakLLaVA+StyleTTS2+dolphin-2.5-mixtral-8x7b+SpeechBrain)
@AestethicSounds
5 months ago
Jason, please can you? The API is super costly
@petswolrd280
5 months ago
I'm trying @@AestethicSounds
@pixelgo1704
4 months ago
Where from, Hima? I'm exactly like you, I want it made to work offline
@dogemaster6361 5 months ago
This is so good Jason, thanks for showing us. Can you comment on what the costs turned out to be?
@hidroman1993
5 months ago
That would be so useful
@andreabuzzolan9807
5 months ago
I guess it's around $500-1000/hr
@hidroman1993
5 months ago
@@andreabuzzolan9807 lulz ok
@kingtaro6844 5 months ago
At 28:08, I'd like to suggest an alternative. Blob didn't work for me, no matter what I tried. So if anyone has the same issue, the buffer should do the trick. As this is also how it's documented by OpenAI: const buffer = Buffer.from(arrayBuffer); return new Response(buffer, { headers: { "Content-Type": "audio/mpeg", }, }); Nonetheless, as always, amazing job, Jason.
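The Buffer-based alternative from the comment above can be laid out as a small helper. This is a minimal sketch assuming a Node 18+ runtime where `Buffer` and `Response` are globals; `audioResponse` is a hypothetical name, and the `ArrayBuffer` is assumed to hold the MP3 bytes returned by the TTS call.

```typescript
// Wraps raw audio bytes in an HTTP response, as suggested in the comment.
function audioResponse(arrayBuffer: ArrayBuffer): Response {
  // Buffer.from(arrayBuffer) creates a view over the same memory (no copy).
  const buffer = Buffer.from(arrayBuffer);
  return new Response(buffer, {
    headers: {
      "Content-Type": "audio/mpeg",
    },
  });
}

// Example with placeholder bytes standing in for real audio data.
const demo = audioResponse(new Uint8Array([0x49, 0x44, 0x33]).buffer);
console.log(demo.headers.get("Content-Type"));
```

On the client, the response body could then be read as a blob or array buffer and played back.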
@techfren 5 months ago
Jason beats multi billion dollar company, Google
@saiprakash6483 5 months ago
Can you tell us how much it may cost for an hour? Please.
@kliusd 5 months ago
Awesome! I wonder if it’s possible to build something that translates and speaks other languages at the native level using the person’s own voice as he or she speaks in real time 🤔
@user-hm9iy5nr3i 5 months ago
Hi jason
@Tika-fz6li 5 months ago
freakin video
@ayush2565 5 months ago
What are some ways to reduce the latency?
@rajatag27
5 months ago
Looking for similar solution
@ahsin.shabbir 5 months ago
Building the system end to end is nice. However, the examples shown seem cherry picked successful ones. Interested in seeing examples where the gpt model fails when the prompt is quite simple in terms of human understanding. Going to try this to validate how well it actually works.
@lokeshart3340 2 months ago
can we not use gemini vision ?
@NguyenTien-yu9gu 5 months ago
any code available?
@codeplaywatch 5 months ago
probably not but I think I met someone who looked very similar to you and talked about onigiri (?)
@elpablitorodriguezharrera 4 months ago
Damn, you and julian are making google a joke🤣🤣🤣 really cool! By the way Jason, if I may ask, how much would it cost to hire you to develop an app as an MVP? Can you at least give me a price range? And do you accept equity as compensation? I'd greatly appreciate a response.
@jaysonp9426 5 months ago
Fyi, your website is down. Maybe an expired ssl?
@fabriziocasula 5 months ago
ok :-) npm install now it works thanks
@user-tk7sc4gz2v 5 months ago
this just shows how lazy google is
@pixelgo1704 4 months ago
Please make this offline with an open source model and support the Arabic language for speaking and listening
@brytonkalyi277 5 months ago
✓ I believe we are meant to be like Jesus in our hearts and not in our flesh. But be careful of AI, for it knows only things of the flesh which are our fleshly desires and cannot comprehend things of the spirit such as true love and eternal joy that comes from obeying God's Word. Man is a spirit and has a soul but lives in a body which is flesh. When you go to bed it is the flesh that sleeps, but your spirit never sleeps and that is why you have dreams, unless you have died in peace physically. More so, true love that endures and last is a thing of the heart. When I say 'heart', I mean 'spirit'. But fake love, pretentious love, love with expectations, love for classic reasons, love for material reasons and love for selfish reasons those are things of the flesh. In the beginning God said let us make man in our own image, according to our likeness. Take note, God is Spirit and God is Love. As Love He is the source of it. We also know that God is Omnipotent, for He creates out of nothing and He has no beginning and has no end. That means, our love is but a shadow of God's Love. True love looks around to see who is in need of your help, your smile, your possessions, your money, your strength, your quality time. Love forgives and forgets. Love wants for others what it wants for itself. However, true love works in conjunction with other spiritual forces such as patience and faith - in the finished work of our Lord and Savior, Jesus Christ, rather than in what man has done such as science, technology and organizations which won't last forever. To avoid sin and error which leads to the death of your body and your spirit-soul in hell fire (second death), you must make God's Word the standard for your life, not AI. 
If not, God will let you face AI on your own (with your own strength) and it will cast the truth down to the ground, it will be the cause of so much destruction like never seen before, it will deceive many and take many captive in order to enslave them into worshipping it and abiding in lawlessness. We can only destroy ourselves but with God all things are possible. God knows us better because He is our Creater and He knows our beginning and our end. The prove text can be found in the book of John 5:31-44, 2 Thessalonians 2:1-12, Daniel 2, Daniel 7-9, Revelation 13-15, Matthew 24-25 and Luke 21. *HOW TO MAKE GOD'S WORD THE STANDARD FOR YOUR LIFE?* You must read your Bible slowly, attentively and repeatedly, having this in mind that Christianity is not a religion but a Love relationship. It is measured by the love you have for God and the love you have for your neighbor. Matthew 5:13 says, "You are the salt of the earth; but if the salt loses its flavor, how shall it be seasoned? It is then good for nothing but to be thrown out and trampled underfoot by men." Our spirits can only be purified while in the body (while on earth) but after death anything unpurified (unclean) cannot enter Heaven Gates. No one in his right mind can risk or even bare to put anything rotten into his body nor put the rotten thing closer to the those which are not rotten. Sin makes the heart unclean but you can ask God to forgive you, to save your soul, to cleanse you of your sin, to purify your heart by the blood of His Son, our Lord and Savior, Jesus Christ which He shed here on earth - "But He was wounded for our transgressions, He was bruised for our iniquities; the chastisement for our peace was upon Him, and by His stripes we are healed", Isaiah 53:5. Meditation in the Word of God is a visit to God because God is in His Word. We know God through His Word because the Word He speaks represent His heart's desires. Meditation is a thing of the heart, not a thing of the mind. 
Thinking is lower level while meditation is upper level. You think of your problems, your troubles but inorder to meditate, you must let go of your own will, your own desires, your own ways and let the Word you read prevail over thinking process by thinking of it more and more, until the Word gets into your blood and gains supremacy over you. That is when meditation comes - naturally without forcing yourself, turning the Word over and over in your heart. You can be having a conversation with someone while meditating in your heart - saying 'Thank you, Jesus...' over and over in your heart. But it is hard to meditate when you haven't let go of offence and past hurts. Your pain of the past, leave it for God, don't worry yourself, Jesus is alive, you can face tomorrow, He understands what you are passing through today. Begin to meditate on this prayer day and night (in all that you do), "Lord take more of me and give me more of you. Give me more of your holiness, faithfulness, obedience, self-control, purity, humility, love, goodness, kindness, joy, patience, forgiveness, wisdom, understanding, calmness, perseverance... Make me a channel of shinning light where there is darkness, a channel of pardon where there is injury, a channel of love where there is hatred, a channel of humility where there is pride..." The Word of God becomes a part of us by meditation, not by saying words but spirit prayer (prayer from the heart). When the Word becomes a part of you, it will by its very nature influence your conduct and behavior. Your bad habits, you will no longer have the urge to do them. You will think differently, dream differently, act differently and talk differently - if something does not qualify for meditation, it does not qualify for conversation. Glory and honour be to God our Father, our Lord and Savior Jesus Christ and our Helper the Holy Spirit. Let us watch and pray... Thank you for your time.
@ChrisK-cp1ro
5 months ago
Excuse me, sir - this is a Wendy's
@PySnek
5 months ago
It's too late jesus man. This is as big as the internet was in the early 2000s. The next big thing for society. All your prayers and resistance will not help. Adapt or get stuck in the past like a caveman in the eyes of coming generations. Imagine someone today without a phone... Nonetheless, enjoy Christmas! Maybe it's going to be the last one without AI assistants everywhere.
@othername2428 5 months ago
Man I can't understand half of what you're saying without captions. Please practice improving your English pronunciation!
@denzelcanvasYT
5 months ago
I can understand him just fine, idk what that says about you though.
@fabriziocasula 5 months ago
help 🙂
(base) fabriziocasula@air-di-fabrizio gpt-video % npm run dev
> gpt-video@0.1.0 dev
> next dev
sh: next: command not found
(base) fabriziocasula@air-di-fabrizio gpt-video %
@petswolrd280
5 months ago
I got the same error
@Kno2gether
5 months ago
@@petswolrd280 try `npm install --force`