AI agent + Vision = Incredible

Science & Technology

A step-by-step tutorial on how to build a vision-powered AI agent with AutoGen + LLaVA + Stable Diffusion, plus a breakdown of the 160-page analysis of GPT-4V capabilities
🤘 Get 15% off on sceneXplain via my code AIJASON : go.jina.ai/scenexplainjason
🔗 Links
- Follow me on twitter: / jasonzhou1993
- Join my AI email list: www.ai-jason.com/
- My discord: / discord
- sceneXplain: go.jina.ai/scenexplainjason
- Vision-agent Github: github.com/JayZeeDesign/visio...
⏱️ Timestamps
0:00 Intro
1:15 What is a multimodal model
2:12 GPT-4V ability breakdown
4:34 sceneXplain
6:00 Visual prompt techniques
10:53 Use cases
13:00 Build vision agent #1 - Setup
14:20 Build vision agent #2 - Use Llava model
15:58 Build vision agent #3 - Use Stable diffusion
16:52 Build vision agent #4 - Set agent system via autogen
18:53 Build vision agent #5 - Demo
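The build steps in the timestamps pair an image generator (Stable Diffusion) with an image reviewer (LLaVA) under an agent framework (AutoGen). A minimal sketch of that generate, review, refine loop follows. It is not the code from the video or the linked repository: `refine_loop`, `Review`, `fake_generate`, and `fake_review` are hypothetical names, and the two stubs stand in for real model calls so the loop runs without any API key.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    score: int      # 0-10 rating given by the reviewer
    feedback: str   # suggestion for improving the prompt


def refine_loop(
    prompt: str,
    generate: Callable[[str], str],
    review: Callable[[str, str], Review],
    target_score: int = 8,
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, Review]]]:
    """Generate an image, review it, and fold the feedback into the prompt."""
    history: List[Tuple[str, Review]] = []
    for _ in range(max_rounds):
        image = generate(prompt)
        result = review(image, prompt)
        history.append((prompt, result))
        if result.score >= target_score:
            break  # good enough, stop iterating
        prompt = f"{prompt}. {result.feedback}"
    return prompt, history


# Hypothetical stand-ins for the real models (assumptions, not real APIs).
def fake_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


def fake_review(image: str, prompt: str) -> Review:
    if "photorealistic" in prompt:
        return Review(score=9, feedback="Looks good")
    return Review(score=4, feedback="Make it photorealistic")


final_prompt, history = refine_loop("a raccoon astronaut", fake_generate, fake_review)
```

In a real setup, `generate` would wrap a Stable Diffusion call and `review` would wrap a vision model call. The `max_rounds` cap keeps the number of paid model calls bounded.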
👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
#gpt4 #autogen #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #chatgpt #largelanguagemodels #largelanguagemodel #bestaiagent #chatgpt #agentgpt #agent #babyagi #llava #stablediffusion

Comments: 100

@AIJasonZ 7 months ago
Which vision-enabled agent would you like to see me build? Leave a comment and let me know! 🤖
@jasonfinance
7 months ago
Would love to see an AI agent that can control a browser!
@Nodeagent
7 months ago
Yes browser control would be hot. Also image manipulation for things like precise mockups for customers - useful for Ecom stores who sell personalized goods
@ld-yt.
7 months ago
Successfully getting this done with a local llm would be interesting to see.
@PeterAustin666
7 months ago
mine.
@dawidzurawski8870
7 months ago
I would like to see an agent that can read old handwritten documents and turn them into PDFs
@T33KS 7 months ago
Your content has the right amount of abstraction, making your videos short, sweet, and appealing to a wide audience (it's not a course). But at the same time it has the right amount of technical detail for devs and engineers to replicate what you are demonstrating. Thank you for this great content
@asithakoralage628 7 months ago
Hi Jason, yet another great video, I learned a lot from your channel. Thanks for sharing your knowledge.
@MattLuceen 7 months ago
This is exactly what I needed. Thank you.
@frankchangshow 7 months ago
I really appreciate you and the videos you're creating, AI Jason. They are helping me a lot in learning this space
@craigcasee7183 7 months ago
I've been needing to see a video like this where someone strings together some ai with code, glad to see. I want to add eye tracking to ar and ai vision. It would be nice to quickly ask questions in the real world. And the automation aspect is very nice for you to share, plus continue to make informative instructional demonstrative amazing videos like this! Thank you!
@markksantos 7 months ago
You're the best. PLEASE POST MORE OFTEN!
@leu2304 7 months ago
This channel is real gold! Thank you so much
@cliffordramsey2500 7 months ago
Thank you for this clever integration of tools!
@ryzikx 7 months ago
Great stuff I was looking for vision autogen tutorials
@Hisma01 7 months ago
Great content. You have a new sub. Keep up the great work!
@moberpriller 7 months ago
Thanks for the great content!
@SamuelHollis 7 months ago
🎯 Key Takeaways for quick navigation:
00:00 🌐 Introduction to AI Vision Integration
- The video begins with an introduction to the integration of AI agents and vision capabilities.
- AI agents with vision capabilities can revolutionize various applications, from web design to answering complex questions and enabling general-purpose robots.
02:06 📸 Multimodal Models and Their Potential
- Multimodal models can process not only text but also images, audio, and video, enabling them to understand different types of data and their relationships.
- GPT-4 Vision (GPT-4V) can handle various image types, including photographs, text within images, diagrams, tables, and floor plans, unlocking numerous use cases.
04:36 🧠 Understanding GPT-4V's Abilities
- GPT-4V demonstrates impressive out-of-the-box performance, such as identifying objects, recognizing people, counting objects, and even understanding perspective.
- However, it also has limitations and can make mistakes, particularly in tasks like text extraction and chart interpretation.
06:49 🚀 Prompting GPT-4V for Better Performance
- Different prompting techniques can be used to improve GPT-4V's performance in image-related tasks.
- Techniques include providing detailed text instructions, setting performance expectations, using few-shot prompts, and visual referring prompts.
09:19 🌟 Expanding Use Cases with GPT-4V
- GPT-4V's ability to understand the relationship between multiple images opens up new possibilities, such as calculating costs from images or determining the sequence of images in a task.
- It can also facilitate interactions through visual annotations, allowing users to point at or circle objects for AI understanding.
11:44 🤖 Building Autonomous AI Agents
- GPT-4V's capabilities make it possible to create autonomous AI agents that can continuously improve image generation and perform tasks like desktop automation.
- These agents have potential applications in various industries, from architecture and engineering to customer support and medical diagnosis.
Made with HARPA AI
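The text-side prompting techniques in the summary above (detailed instructions, setting performance expectations, few-shot examples) can be combined into one prompt string. The sketch below only builds the text; it does not call any model, and visual referring prompts are left out because they require annotating the image itself. `build_vision_prompt` and its parameters are hypothetical names chosen for illustration.

```python
from typing import Iterable, Optional, Tuple


def build_vision_prompt(
    task: str,
    role: Optional[str] = None,
    instructions: Iterable[str] = (),
    examples: Iterable[Tuple[str, str]] = (),
) -> str:
    """Assemble a text prompt to send alongside an image."""
    parts = []
    if role:
        # Setting performance expectations, e.g. "an expert in counting".
        parts.append(f"You are an expert in {role}.")
    for question, answer in examples:
        # Few-shot examples show the expected question/answer shape.
        parts.append(f"Example question: {question}\nExample answer: {answer}")
    parts.append(f"Task: {task}")
    for step in instructions:
        # Detailed text instructions, one per line.
        parts.append(f"- {step}")
    return "\n".join(parts)


prompt = build_vision_prompt(
    "Count the apples in the image.",
    role="counting",
    instructions=["Count row by row.", "State the total as a single number."],
    examples=[("How many pears are in the image?", "3")],
)
print(prompt)
```

The resulting string would be sent together with the image to whichever vision model is in use.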
@skanderbegvictor6487 7 months ago
Wow this content is great. Subscribed
@vakman9497 7 months ago
Hey bro, good job on that thumbnail! I didn't even realize it was one of your videos. I honestly thought I was clicking on a VICE video lmao
@kodeengatai1347 7 months ago
Thanks mate, great stuff. I would really be interested in agents that can generate video based on prompts, even if the agents need to be trained on sample videos first.
@ultimategolfarchives4746 7 months ago
Always providing incredible content. 👍 👍👍
@AI-Wire 7 months ago
Great job, Jason. In the future could you please consider showing how to use these tools without paying for any API keys. For example, using PaLM API or some of the open source models. This is because building projects at scale is cost prohibitive using recursive tools like Autogen.
@joxxen 7 months ago
Really nice video. I for one would love it if the agents could run Stable Diffusion on a local machine. Any chance you want to create a video about that?
@KCM25NJL 7 months ago
I cannot even begin to imagine the API costs for running things like these on frontier models right now. As impressive as it is, you'll need a real profitable use case if you wanna use it like this.
@user-ug3pf3uw6x 7 months ago
You are the best!
@PrincepsPolycap 7 months ago
Notification enabled for that parse automation!
@GlenBland 6 months ago
I would love to see one video that summarizes the most popular libraries and APIs for LLMs, along with which work best together and which have replaced older ones. Include: AutoGPT, MemGPT, ChromaDB, LangChain, Ollama, Pinecone, etc.
@leandrogoethals6599 6 months ago
How can I use an uncensored Stable Diffusion variant with this? Great vid by the way, can't wait for what u do next! Also, could it be that the Discord invite link is broken? Can't wait to join!
@Ychuah_1997 7 months ago
Chatgpt: I can't count apples... Prompt: You are an expert in counting! Chatgpt: Giving the correct answer :) These prompts are just fascinating - and great content as usual!
@GabrielVeda 7 months ago
Brilliant
@krisograbek 7 months ago
Would it be possible to build a similar agent, but one that improves illustrated short stories for kids? That way it would improve both the images and the text provided in the stories... BTW, I've been learning so much from you, Jason! Your channel is a gem! As a fellow KZreadr, you make me feel small...
@stereotyp9991 7 months ago
I'm always hitting the token limit after just a few posts of the agent. Is there a way to work around this?
@spicer41282 7 months ago
My request, please: can you apply this GPT-4V agent to a photo of a simple shed and have it analyze the shed's size, the pitch of the roof, and perhaps how much wood was used to build it? Thank you for considering this and testing the multimodal capabilities with this use case.
@Joy_jester a month ago
Hey can u do an agent where it has to do instruction following in a simulator? I think that will be a very practical and interesting application
@lifeofdean3647 7 months ago
very good man :))
@pocoso 7 months ago
First! Good tutorial man
@amandamate9117 7 months ago
Can you write agents that operate a headless browser? Within this browser, one window can use the GPT-4 website features designed for Plus users, while another window can generate images using DALL-E 3. These images can then be uploaded for review in the same headless browser session. Although you'll be limited to 50 prompts every 3 hours, this setup should still be sufficient for most use cases. Additionally, this approach allows you to conduct user interface analysis or other tasks without incurring API costs.
@popfizz311 7 months ago
Can you use this feature with the gpt4 api?
@aliyousefi9735 7 months ago
AI Jason is da man
@carterjames199 7 months ago
I think another good video would be comparing these different agent creation frameworks. Feel like I see another one everyday. I specifically would like to hear your opinion on autogen vs superagi
@brando2818 7 months ago
How do you finetune llava?
@JosephDefendre 7 months ago
This is nuts. AutoGen is a game changer
@bbproperties-oq5vu 7 months ago
Hi Jason, this is really good. Can you upload a video on browser automation? I am really interested in it.
@darkbelg 7 months ago
For what I'm trying to do, LLaVA isn't yet as good as GPT-4V. GPT-4V has once again raised the bar for me. And now the waiting begins for an API.
@ibrahimhalouane8130 7 months ago
How about a SuperAgent that can create other agents on its own to perform a complex task?
@markksantos 7 months ago
make a video about memgpt
@nashvillebrandon 7 months ago
Would be awesome to give the agent the ability to do inpainting!
@ward_jl 7 months ago
So interesting. Is it possible to get the code to experiment with it?
@AIJasonZ
7 months ago
Yep it is in the description
@georgecochran4091 7 months ago
Ok, you know how in the game No Man's Sky you get an analysis visor to scan the environment and save data on plants, rocks, and animals? Something like that for IRL. I would be collecting data all the time
@olivMertens 7 months ago
Could you give the source for the file and examples shown in this video?
@olivMertens
6 months ago
So I found it myself: arxiv.org/pdf/2309.17421.pdf ;)
@spookyrays2816 7 months ago
Create a bot that can read, and visually react to output, so that way it can create a Deep Learning type feedback loop improving upon itself until it no longer can
@jtjames79
7 months ago
I was thinking AutoGen, an artist agent, and editor agent. I don't know how to do it, but theoretically it should work.
@itshuskai 7 months ago
Now to really test it, see if it can pass the "Are you a robot?" prompts lol.
@jtjames79 7 months ago
I want to be able to use AutoGen or something like that, to set up adversarial agents to use Stable Diffusion for me. So I can ask for an image before I go to bed, and by morning it'll have worked out something.
@jtjames79
7 months ago
I should have just kept watching, instead of commenting before watching.
@matthewboyd8689 7 months ago
They need to make it be able to work on less information and make correct deductions that aren't in its training data before trying to make it more generalized. Otherwise it will just compound hyperbolically the amount of information they need to be able to understand as much as a human can.
@brisonvsn 7 months ago
Can agents browse and interact with the internet yet?
@yasinyaqoobi 7 months ago
Great video as always. Can you please put your head to the bottom right. It cut off a lot of the content. :)
@defaultdefault812 7 months ago
It got the speedometer right - just equated the wrong measurement circle to MPH.
@pissmilker2313 7 months ago
Our obsolescence as human beings isn't to be feared, but celebrated. Rejoice!
@raresmircea
7 months ago
It's gonna be a long time until AI is conscious & matches my subtlety. But even then, this take would still be so myopic. Did birds, elephants & dolphins "become obsolete" when humans arrived? Did your mother & brother "become obsolete" when that Indian boy was found to have a huge IQ? These kinds of extreme opinions, desires & manifestations that most people have often betray some unmet need, and I’m sorry for that.
@greengoblin9567
7 months ago
@raresmircea We don’t need the AI to be conscious. We just need it to be more intelligent.
@arpitkumar2981
7 months ago
@greengoblin9567 yes
@jp00738 7 months ago
hahaha oh man, you are legendary.
@SkyJensen 7 months ago
Full website builder. Full website builder. Full Website Builder
@aghasaad2962 7 months ago
GPT-4V will soon be able to take research papers, write code, write a thesis, get a job, then marry....wait what, that's what humans are for....
@psychxx7146 7 months ago
« 2023 »
@AntonioRonde 7 months ago
There were too many basics in the video. I enjoyed your videos where you provided a more in-depth review, like in the AutoGen video
@AIJasonZ
7 months ago
Thanks for the feedback - is there a specific area you would like to see me dive deeper into?
@Huru_ 7 months ago
I wonder what kind of results you'd get if you were to feed that model some proper English...
@soulspawn
7 months ago
Well, it generated human hands because it has been tasked to create palms instead of hooves (see manager reply @20:04 ). I'd call this a win. 👀
@Huru_
7 months ago
@soulspawn Didn't say it wasn't one. Just giving pointers for optimization. Also, I wasn't even going that deep. Just regular ass grammar and complete sentences for starters...
@PeterAustin666
7 months ago
@Huru_ wastes tokens
@AIJasonZ
7 months ago
I honestly didn’t know "palm" is specifically for humans, hah 😂😂 thanks, will try again
@Huru_
7 months ago
@AIJasonZ Lol, that's why you need to read your Manager's input.
@brytonkalyi277 7 months ago
I believe we are meant to be like Jesus in our hearts and not in our flesh. But be careful of AI, for it is just our flesh and that is it. It knows only things of the flesh (our fleshly desires) and cannot comprehend things of the spirit such as peace of heart (which comes from obeying God's Word). Whereas we are a spirit and we have a soul but live in the body (in the flesh). When you go to bed it is your flesh that sleeps but your spirit never sleeps (otherwise you have died physically) that is why you have dreams. More so, true love that endures and lasts is a thing of the heart (when I say 'heart', I mean 'spirit'). But fake love, pretentious love, love with expectations, love for classic reasons, love for material reasons and love for selfish reasons, that is a thing of our flesh. In the beginning God said let us make man in our own image, according to our likeness. Take note, God is Spirit and God is Love. As Love He is the source of it. We also know that God is Omnipotent, for He creates out of nothing and He has no beginning and has no end. That means, our love is but a shadow of God's Love. True love looks around to see who is in need of your help, your smile, your possessions, your money, your strength, your quality time. Love forgives and forgets. Love wants for others what it wants for itself. Take note, love works in conjunction with other spiritual forces such as faith and patience. We should let the Word of God be the standard of our lives, not AI. If not, God will let us face AI on our own and it will cast the truth down to the ground, enslave us and make us worship it. We can only destroy ourselves but with God all things are possible. God knows us better because He is our Creator and He knows our beginning and our end. Our proof text is taken from the book of John 5:31-44, Daniel 7-9, Revelation 13-15, Matthew 24-25 and Luke 21. Let us watch and pray... God bless you as you share this message with others.