ClippyGPT - How I Built Supabase’s OpenAI Doc Search (Embeddings)
Science & Technology
Supabase hired me to build ClippyGPT - their next-generation doc search. You can ask our old friend Clippy anything you want about Supabase, and it will answer using natural language. Powered by OpenAI + prompt engineering.
In this video I will be showing you exactly how I did this, and how you can do the same in your projects. We'll be covering:
- Prompt engineering and best practices
- Working with a custom knowledge base via context injection + OpenAI embeddings
- How to store embeddings in Postgres using pgvector
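For readers new to embeddings, the whole search boils down to comparing vectors. A minimal sketch of cosine similarity (illustrative only: real `text-embedding-ada-002` vectors have 1536 dimensions, and since OpenAI normalizes them to unit length, a plain dot product gives the same ranking):

```typescript
// Cosine similarity between two embedding vectors.
// The tiny 3-dimensional vectors below are just for illustration.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const norm = (v: number[]) => Math.sqrt(dot(v, v));
  return dot(a, b) / (norm(a) * norm(b));
}

// Vectors pointing the same way score near 1, unrelated ones near 0.
const query = [0.1, 0.9, 0.2];
const docA = [0.12, 0.88, 0.21]; // similar meaning
const docB = [0.9, -0.1, 0.4];   // different meaning
console.log(cosineSimilarity(query, docA) > cosineSimilarity(query, docB)); // true
```

pgvector exposes the same comparisons as SQL operators (`<=>` for cosine distance, `<#>` for negative inner product), so the math above is what runs inside Postgres.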
Supabase blog post:
supabase.com/blog/chatgpt-sup...
pgvector extension:
github.com/pgvector/pgvector
Generate embeddings implementation:
github.com/supabase/supabase/...
Clippy edge function implementation:
github.com/supabase/supabase/...
Clippy frontend implementation:
github.com/supabase/supabase/...
Prompt engineering:
prmpts.ai/blog/what-is-prompt...
00:00 Why?
01:40 Let's get started
03:15 Custom knowledge base
04:49 Context injection
06:13 Pre-process MDX files
13:40 Embeddings
15:40 Storing in Postgres + pgvector
22:21 API endpoint (edge function)
23:44 Calculating similarity in pgvector
27:55 Prompt engineering
33:15 Prompt best practices
38:37 Demo time!
41:32 Thanks for watching!
Comments: 333
This channel is awesome! Love the rabbit holes you take us on! Keep them coming please!
The clarity of this video while maintaining detailed granularity of the subject is very impressive and very appreciated. Thank you for making this video.
This content is really top notch. I appreciate the clear and detailed explanations of everything!
@RabbitHoleSyndrome
A year ago
Glad you found it helpful!
Found your channel while learning React Three Fiber, subbed with notifications immediately. Today I get a notification for a well-explained ChatGPT tutorial, right as I embark on building a similar thing. Fantastic continued work, thank you very much!
@RabbitHoleSyndrome
A year ago
Awesome! Thank you!
This was so valuable to just settle the thoughts into a clear action plan as to how to implement in production. Thank you!
@RabbitHoleSyndrome
A year ago
Glad it helped!
I am blown away at how much information is densely packed into this. You got yourself a new subscriber, sir. It’s staggering to think about how these technologies will shape the landscape for data and analytics. This is just the beginning.
@RabbitHoleSyndrome
A year ago
Thanks for the sub! I am continuously blown away by the possibilities of large language models 😀
Fantastic content, didn't expect it to be so informative in just 40 mins. Looking forward to the next one!
@RabbitHoleSyndrome
A year ago
Thanks for watching 😀
First time on this channel. The way you structured the video, the pace and the explanations are all on point. Keep up the good work. +1 subscriber.
@RabbitHoleSyndrome
A year ago
Glad to hear that, thanks for the sub!
Insane, I love how you are able to do so many things. My laptop at the moment is unable to power every wish of mine (getting into 3D), but I hope I will soon be able to do so.
Damn, I spent like a week researching this shit on my own, and have been working on almost exactly the same thing. Processing MDX files into embeddings etc. It’s really cool to see somebody doing almost the same exact thing. Makes me think I am really on the right track!
@RabbitHoleSyndrome
A year ago
Nice! Glad to help give some validation 😄 are you also building for docs?
@automioai
11 months ago
Hey! How did you turn your PDF files into a proper MDX format? Thanks!
@RabbitHoleSyndrome
11 months ago
@@automioai In this project no PDF files were used - all documentation had been written directly in MDX. You'll have to do some research on ways to extract text from PDF files. Once you have that, I wouldn't bother with MDX at all - just generate embeddings directly on that content.
This is amazing, you’ve created Clippy just as enthusiastic and helpful you are! Thanks a lot
@RabbitHoleSyndrome
A year ago
Thanks for watching 😄
@maertscisum7243
A year ago
@@RabbitHoleSyndrome How did you generate the MDX files?
@RabbitHoleSyndrome
A year ago
The MDX files weren’t generated - the Supabase team wrote them as you would any markdown file.
Wow, this is incredibile... I can see a future where every docs site does the same thing. This is truly powerful stuff! Well done 👏
@RabbitHoleSyndrome
A year ago
I’m both excited and blown away by the possibilities. Thanks for watching!
What a fantastic video and content. I’ve gone through multiple videos trying to better understand embedding and how to work with ChatGPT in the best way for querying large amount of content and producing an analyzed response. I’m not a developer, have a background in computer science but I’m a software sales person that is curious about technology and I was able to completely understand your video and content. Subscribed, liked and will be watching more of your videos. Thank you!
@RabbitHoleSyndrome
A year ago
Glad it was helpful!
This is a glorious illustration. Thank you very much! I've been trying to find an example of doing this, and yours has put it all together for me! Subscribed!
@RabbitHoleSyndrome
A year ago
Glad it helped, thanks for the sub!
Ive been looking for this information for months. Such an excellent tutorial and I love that Supabase's code is all open source so i can actually clone it and read how it works in detail later. Thank you so much for the walk through. Super talented dude too - love the blender stuff.
@RabbitHoleSyndrome
A year ago
Glad to hear it helped! Agreed - open source is amazing! Let me know if you hit any road blocks along the way, happy to help 😃
@fraternitas5117
A year ago
@@RabbitHoleSyndrome why is the generate embeddings file in the video so different from what is in the repo now? I can't find anything you talk about in minutes 10-13.
@RabbitHoleSyndrome
11 months ago
Hey @@fraternitas5117! Supabase moves pretty quickly - the code I referenced has been refactored to support multiple knowledge sources (i.e. more than just markdown). You can find the markdown-specific code here: github.com/supabase/supabase/blob/1b2361c099c2573afa1fe59d3187343bb8f1bcab/apps/docs/scripts/search/sources/markdown.ts
This really is an epic video, clearly and well laid out!
You do such a great job, bro! I love what you have built and your video, keep building great things.
Amazed by your content. Fantastic work here .
@RabbitHoleSyndrome
10 months ago
Thanks, glad it helps!
Wow. I'm so thrilled to know that you were one of the ones behind that great feature. I've been using Supabase for 6 months, and have been pretty happy with it. Except for the docs and the transition to 2.0. I was blown away when I saw that it generated the code for me when I started writing its documentation
@RabbitHoleSyndrome
A year ago
Glad to hear it has helped you! Any feedback on the docs that you think should improve?
@javiasilis
A year ago
@@RabbitHoleSyndrome So far so good! I think one challenge is to know how we can check if a user's email address exists. (Or other specific user's metadata) I couldn't find it in the docs. There was a GitHub issue which said to store the user's data in a separate table as the auth table was private. I ended up doing that and haven't had any problems. Btw, thanks again for all the awesomeness!
Wonderful primer on prompt engineering.
GREAT content. It explained everything you need to know about creating 'chat docs' or similar in one run, and all open source. Kudos! And subscribed.
@RabbitHoleSyndrome
A year ago
Thanks for the sub! Great to hear 😃
I'm doing something even cooler with the vector embeddings, I can't wait till I can share!
incredible end to end tutorial, nicely done
@RabbitHoleSyndrome
A year ago
Thank you! 😃 Have you explored embeddings or pgvector?
Excellent. You are hands-on & practical.
Thanks Greg for a great explanation. I like your presentational style!
@RabbitHoleSyndrome
A year ago
Glad it was helpful!
This video was fantastic, lots of information given in an easily digestible way. Subscribing!
@RabbitHoleSyndrome
A year ago
Glad it was helpful, thanks for the sub!
Absolutely amazing quality here, glad I found your video. Subbed!
@RabbitHoleSyndrome
A year ago
Thank you!
You are awesome! Now I can understand how these things work together.
@RabbitHoleSyndrome
11 months ago
Happy to hear it! 😀 thanks for supporting the channel!
Seems I have found one great coding channel among the noise of today; back to the good old days when coding was boring & nerdy. Great job!
incredible value in this video
I am grateful for this video. It opened my eyes to embeddings.
@RabbitHoleSyndrome
A year ago
Glad it helped 😃
So you're the one who made this beautiful thing. Pretty nice.
This is truly valuable and useful content, thanks a lot.
@RabbitHoleSyndrome
A year ago
You bet! Glad it was helpful
Thank you for sharing . I was struggling on how to get started . This was very well presented . From this side of the world asante sana !
@RabbitHoleSyndrome
11 months ago
Glad it was helpful 😃
Prediction: this is gonna get a million views. Just saw the Fireship video about vector databases and wanted to understand embeddings. Before I could even search, this video was on the page. Though I wasn’t interested in a 40 min video (had a feeling I’d just stop after 5 mins like I usually do), I ended up watching it all. The rabbit hole 🐇 🕳️ format is so naturally elegant. Clear end-to-end use case. I secretly don’t want to share it with anyone, but I am forced to fulfill my prediction.
@RabbitHoleSyndrome
A year ago
Thanks for the comment! Glad to hear the format is working 😃
That's great, you did a great job! It helps me a lot.
@RabbitHoleSyndrome
A year ago
Glad it helped!
Just what I was looking for - thank you very much🥳
@RabbitHoleSyndrome
A year ago
Great 😃 Thanks for watching!
Amazing video! I'm going to put this in practice right now!
@RabbitHoleSyndrome
A year ago
Awesome! Feel free to share as you make progress!
Many thanks for this insightful and helpful video!
@RabbitHoleSyndrome
A year ago
Happy to help, thanks for watching!
First time here. This is so well done. Subscribed. Your viewer numbers will explode! I like how you approached the topic in a very calm way without jumping on the "LLMs will take over the world" train :) You don't happen to have the Clippy Blender asset somewhere?
Thanks for saving me hours, if not days. I wanted to add OpenAI to my Supabase app and found the exact tutorial I needed.
@RabbitHoleSyndrome
A year ago
Glad it helped!
This 40-minute video feels like 10 minutes. I have been researching related engineering topics recently, and they have been very inspiring to me.
@RabbitHoleSyndrome
A year ago
The potential of LLMs seems to be endless 🤯
Fireship guy? Either way, it's been pretty useful in terms of learning APIs and how to connect them to my no-code builder. I spent hours trying to get things working, and the Assistant basically told me what I was doing wrong and how to fix it. So well done with the implementation.
@RabbitHoleSyndrome
A year ago
Glad it helped!
Amazing video! Great and detailed information!
@RabbitHoleSyndrome
A year ago
Glad it helped!
This is really helpful and valuable, thanks a ton!!!
@RabbitHoleSyndrome
A year ago
You bet, thanks for watching!
Such a great video.
Cool video. I was thinking about doing something similar for some reference pdfs. Thanks for the video
@RabbitHoleSyndrome
A year ago
Thanks for watching & best of luck!
Thanks for explaining everything so clearly :))
@RabbitHoleSyndrome
A year ago
You bet! Thanks for watching 😃
Great job 👍, this was really helpful
@RabbitHoleSyndrome
A year ago
Glad it helped!
Thanks for the knowledge share my friend. 💪🙏
@RabbitHoleSyndrome
A year ago
You bet!
Subscribed, amazing content.
Thanks for the share !
I'm at 1:42 and this video already is 10/10
🔥🔥🔥 This is amazing!
The most amazing thing is that I basically made a chatbot app in less than a week with only the help of GPT-4; I had no knowledge of AWS services, PostgreSQL, or Python. Everything you covered in the video is what GPT-4 told me. All of the servers and the database are set up, and it has memory, STT, TTS, and Cognito login/register.
@RabbitHoleSyndrome
A year ago
It is quite amazing - and I’m sure it will only get better!
Damn that's a great, helpful video!
@RabbitHoleSyndrome
A year ago
Glad to hear it was helpful, thanks for watching!
I'm reading the comments, and I'm like.. Yeah WTF 🗿🔥
Really good video, nice job! 🎉
@RabbitHoleSyndrome
A year ago
Cheers! 😀
Amazing content!
Hi, I'm a lil bit confused. Do we need lemmatization or stopword removal before we generate embeddings? And do we need chunking after embedding?
Dude, that's exactly what I was looking for. No more bullshit articles with clickbait titles, just DIY in its essence. Is there a way to support you through Patreon or something?
@RabbitHoleSyndrome
7 months ago
Glad to hear it! You support by watching 🙌
excellent info, great presentation
@RabbitHoleSyndrome
A year ago
Glad it was helpful!
This is awesome, thank you!
@RabbitHoleSyndrome
A year ago
Glad it was helpful!
This is really interesting! I'm looking to build something similar.
@RabbitHoleSyndrome
A year ago
Best of luck!
super helpful. thanks. love supabase
@RabbitHoleSyndrome
A year ago
Happy to help!
28:37 Whenever I see examples of decoder (GPT) prompts starting with "You are a helpful finance advisor" or "You are an enthusiastic support rep", I can almost see the AI clearing its throat and sitting up straight and saying "right, ok". Gimme that can-do attitude, GPT!
@RabbitHoleSyndrome
A year ago
🤣
Good job! :)
Brilliant.
What database are you using to store the data, and would it be good to then use that data to fine-tune the model?
The documentation you stored in the database must be a couple of megabytes, so it's able to find the relevant chunks pretty quickly. What if you wanted to fill a database with gigabytes of text, would it be much slower? Is that a case where retraining the whole model would unfortunately be the best route to go?
So if I understood correctly: The embeddings were only used to check for similarity between the user's input and the doc's content, in order to provide the prompt with relevant (text) context, right? Is there a way to provide the GPT model with the embeddings instead?
@RabbitHoleSyndrome
A year ago
That’s correct. You could have used an alternate search method, but embeddings align nicely with LLMs, since they are produced by language models themselves. Unfortunately, no, there is currently no way to inject embeddings directly into GPT. Maybe this will change in the future, or become available in open-source models like LLaMA, the same way we’ve seen it happen with Stable Diffusion.
I have not seen such a great video in a while! How wonderfully you explained the whole process 👍👍! Could you explain a bit more about how you chose 0.78 as the threshold for embedding comparison? Have you verified statistically whether the most relevant sections can be found with it?
@RabbitHoleSyndrome
A year ago
Glad you liked it! 0.78 was a first-stab threshold that worked best based on a limited sample of test queries. I wouldn’t claim that this number is universal - almost certainly this could change by domain.
I've got an internal API that returns train times for given stations. Could I use user prompts to ask the GPT LLM to find train times for a particular station, fetch an array of departing trains via an API GET request, and then get GPT to format that back to the user in a tailored, natural-language way?
@RabbitHoleSyndrome
A year ago
Definitely possible, will just take some work on the prompt engineering side. Your best bet today is probably LangChain - I recommend you check out their documentation on creating custom tools: python.langchain.com/en/latest/modules/agents/tools.html
This is Awesome
Is it possible to use the semantic text search with the embeddings without the database? I don't know much about running databases (newbie here) and would like to skip that step if possible. Can I, for example, do this with a pandas DataFrame in Python while saving the data (both the source text chunks themselves and the corresponding embeddings) as a CSV file? Then, just like in the video, I could calculate the dot product between my query and all other embeddings and take the best k matches. Is there a downside to this?
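For small datasets, the in-memory approach described in the question above works fine: keep chunks and their embeddings in arrays (or a DataFrame/CSV), compute dot products against the query embedding, and take the top k. A minimal sketch with hypothetical data (a database like pgvector mainly adds persistence and indexed search at scale):

```typescript
type Chunk = { text: string; embedding: number[] };

// Return the k chunks whose embeddings have the highest dot product with
// the query embedding (equivalent to cosine similarity when the vectors
// are unit-normalized, as OpenAI's are).
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, v, i) => sum + v * b[i], 0);
  return [...chunks]
    .sort((a, b) => dot(query, b.embedding) - dot(query, a.embedding))
    .slice(0, k);
}

// Hypothetical 3-dimensional embeddings for illustration.
const chunks: Chunk[] = [
  { text: "Auth docs", embedding: [0.9, 0.1, 0.0] },
  { text: "Storage docs", embedding: [0.0, 0.9, 0.1] },
  { text: "Realtime docs", embedding: [0.1, 0.0, 0.9] },
];
console.log(topK([0.95, 0.05, 0.0], chunks, 1)[0].text); // "Auth docs"
```

The main downside is that this linear scan is O(n) per query: fine for thousands of chunks, slow for millions, which is where pgvector's indexes pay off.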
Yo, thank you for sharing your knowledge :). But I have a question regarding the data search for the context. So basically you're creating a graph database with the GPT models, and then you input the user's question into the database to find the relevant articles/information?
16:10 - would love an explanation of how embeddings work with a document structure, i.e. the query is "summarize chapter 3". The embeddings, sans retrieval, don't seem to capture the structure of the chunks that are contained under the title chunk "chapter 3". All explanations of embeddings I've seen rely on the text content within a chunk.
About preprocessing the data, do you know about these Generative Pseudo-Labeling techniques (T5 model, negative mining)? Or any chunking techniques, like overlapping chunks? I'm pretty interested in all the preprocessing techniques, but I have only read about those ones.
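On overlapping chunks specifically: the idea is to slide a window over the text so that context at chunk boundaries isn't lost. A minimal sketch (naive word-based windows; real pre-processing pipelines usually split on headings or sentences instead):

```typescript
// Split text into overlapping word-based chunks.
// size = words per chunk, overlap = words shared with the previous chunk.
function overlappingChunks(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap;
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(" "));
    if (i + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

const text = "one two three four five six seven eight";
console.log(overlappingChunks(text, 4, 2));
// [ "one two three four", "three four five six", "five six seven eight" ]
```

Each chunk then gets its own embedding, so a sentence straddling a boundary still appears whole in at least one chunk.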
Thank you
You are so cool! I love you!
This is a fantastic video! Thank you very much for sharing :D Quick question: currently, if the info is not in the documentation, it responds "Sorry, I don't know how to help with that." But how can we make it respond like this: "Sorry, I don't have relevant info in the documentation, but you can do something like this"? For example: "I don't have any info about how to make banana pancakes in the documentation, but here is how you can make them...". The idea here is to make it act like ChatGPT on top of the information provided. Keen to know more on this, and thank you so much for making this video :D
Thanks a lot. How can I use the Supabase OpenAPI schema inside GPT-4?
Amazing! How does Clippy update the vector DB and process the text as embeddings when NEW documentation is added? Is it automatic? Maybe I missed it in the video.
@RabbitHoleSyndrome
A year ago
Great question! The `generate-embeddings` script was designed to be diff-based, so the next time you run it, it will pull in only the documents that have changed and re-create embeddings for just those. It currently works using checksums:
1. Generate a checksum for the content and store it in the DB.
2. The next time the script runs, compare checksums. If they don't match, the content has changed and embeddings should be regenerated.
The script runs on CI, so any time documents change, a GitHub Action triggers the script. See this PR for details: github.com/supabase/supabase/pull/13936
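The checksum diff described in the reply above can be sketched like this. This is a simplified stand-in: `changedDocs` and the in-memory `Map` are hypothetical names, and the real script persists checksums in Postgres rather than in memory:

```typescript
import { createHash } from "node:crypto";

// Decide which documents need their embeddings regenerated by comparing
// each document's content checksum against the previously stored one.
function changedDocs(
  docs: { path: string; content: string }[],
  storedChecksums: Map<string, string>, // path -> checksum from the last run
): string[] {
  return docs
    .filter(({ path, content }) => {
      const checksum = createHash("sha256").update(content).digest("hex");
      // New docs (no stored checksum) and edited docs both fail this check.
      return storedChecksums.get(path) !== checksum;
    })
    .map((d) => d.path);
}
```

Only the paths returned here go back through the (paid) embedding API, which is what keeps repeated CI runs cheap.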
When breaking up a document into smaller chunks to generate the embeddings, is proximity of the sections (same document) taken into consideration when generating the similarity scores? What if some key information is in a section that is separate from key words located in another section?
@RabbitHoleSyndrome
A year ago
Great point! Remember though that embeddings are not matching keywords - they’re matching meaning as understood by the underlying embedding model. But I agree that proximity should almost certainly be accounted for in the ranking since some context could be missed otherwise.
@concuben
A year ago
@@RabbitHoleSyndrome thanks for your reply... Do you have a suggestion for how to add this context? I didn't see from your tutorial how it was being accounted for.
Great video! How closely do the .mdx files have to match this structure before they can be processed into embeddings? Do they need to export the meta const, for example?
@RabbitHoleSyndrome
7 months ago
The meta const is optional! You’re also free to tweak the pre-processing logic to fit whichever format you need to work with.
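For anyone tweaking the pre-processing to their own format, a rough sketch of heading-based splitting (simplified: the real pipeline parses MDX into an AST rather than scanning lines with a regex):

```typescript
type Section = { heading: string; body: string };

// Split markdown into sections at "## " headings; content before the
// first heading becomes an untitled leading section.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: "", body: "" };
  for (const line of markdown.split("\n")) {
    const match = line.match(/^##\s+(.*)/);
    if (match) {
      if (current.heading || current.body.trim()) sections.push(current);
      current = { heading: match[1], body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  if (current.heading || current.body.trim()) sections.push(current);
  return sections;
}
```

Each resulting section is small enough to embed on its own while still carrying a meaningful heading for display in search results.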
Is there a version of this that can apply to any type of content? Not just specifically made for Supabase documentation, but documentation of any type, or insert a book or a textbook, and the search outputs relevant GPT responses?
@RabbitHoleSyndrome
A year ago
Hey! I don’t have a pre-built suggestion off the top of my head, but solutions to this seem to be popping up everywhere (just hang out on Twitter for a few hours 😅). If you don’t mind coding, it should be relatively straightforward to swap out the MDX docs with really any kind of content source, and the remaining steps should be identical.
Watched because of the content, subscribed because of the dog. 👍🏻
@RabbitHoleSyndrome
A year ago
Thanks for the sub! 🐶
I'm curious how hot my OpenAI API key would get in practice. How much would this cost on average, based on the size of the doc base? I like free tiers, but I do not think they exist here.
Cool project. It would be cool to create a tool like this that you can embed in any documentation, reading the markdown directly or scraping the website.
@RabbitHoleSyndrome
A year ago
Definitely!
you are great!
@RabbitHoleSyndrome
A year ago
Thanks for watching!
What kind of vectors did you generate from ChatGPT? Are they word vectors? You passed one whole section of the MDX, so they are not word vectors but paragraph vectors?
@RabbitHoleSyndrome
11 months ago
Yeah you got it - the community mostly calls these "sentence embeddings". Check out SBERT/sentence transformers for some good info
Amazing video! This is exactly what I was looking for, for a long time. You basically explain everything I wanted to know about how to create a search engine using OpenAI. But I have a few questions: How much did you spend on the OpenAI embedding API building this? How much does Supabase spend monthly on searches using the OpenAI API? Is it possible to use an open-source embedding API instead of calling the OpenAI API? Wouldn't that be less expensive than the approach you took?
@RabbitHoleSyndrome
A year ago
Glad it was useful! As for costs, you may be surprised how inexpensive OpenAI embeddings are (at least I was). To put it in perspective, for the Supabase guides we currently have around 1500 page sections which total just over 220000 tokens. At OpenAI's current embedding price ($0.0004/1k tokens), that brought us to just less than $0.10 for the entire guide knowledge base (~one-time pre-processing). After that the average query is likely
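The back-of-envelope math in the reply above works out like this (prices as quoted; OpenAI's pricing has changed since):

```typescript
// One-time cost to embed the whole guide knowledge base at the
// text-embedding-ada-002 price quoted above ($0.0004 per 1k tokens).
const totalTokens = 220_000;        // ~1500 page sections
const pricePer1kTokens = 0.0004;    // USD
const oneTimeCost = (totalTokens / 1000) * pricePer1kTokens;
console.log(oneTimeCost.toFixed(3)); // "0.088" -> just under $0.10
```

Per-query cost is similarly tiny, since only the short user query needs a fresh embedding; the completion call dominates.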
I'd love to learn how to also incorporate user-feedback (thumbs up/down)
@RabbitHoleSyndrome
A year ago
This will likely make it into the next iterations. Will be a good challenge!
@phemartin
A year ago
@@RabbitHoleSyndrome That's awesome! Can't wait
@LV-md6lb
A year ago
@@RabbitHoleSyndrome amazing video! Please let us know if you get to it:)
@forbiddenera
A year ago
@@RabbitHoleSyndromeany progress?
Are there alternatives to OpenAI for creating these vectors? I don't really feel comfortable building something around a closed-source API that is controlled by one vendor.
@RabbitHoleSyndrome
A year ago
Really great question. You’ll want to look into sentence embeddings. There has been a lot of work on the OSS side with Sentence-BERT (SBERT) you can check out. You might also want to look into Universal Sentence Encoder (USE) and InferSent.
@RabbitHoleSyndrome
A year ago
LlamaIndex actually uses OpenAI (text-embedding-ada-002) by default for embeddings today. They're more of a toolkit layer to assist with the workflow. There are many other alternatives though (which LlamaIndex supports via LangChain) that are worth checking out: langchain.readthedocs.io/en/latest/reference/modules/embeddings.html
@trejohnson7677
A year ago
LOL what computer arch r u on
Can you make the completion also tell which chunks have been used and link to them, or get "read more" links? Or would you do that by just listing the "top 5" matches from the context?
One thing that might help is if the question result showed links to the documents it acquired the information from. Since you are currently choosing which documents to run ChatGPT over based on similarity of features, maybe you can change the prompt so that it also returns the link of the document that was deemed a "similar document".
@RabbitHoleSyndrome
A year ago
Absolutely! This should definitely be the next progression.
Good stuff! Curious how you are handling performance of the response? We set up a similar pattern and have found that GPT-3.5 obviously returns faster, but GPT-4 returns with much better quality. Have you experienced similar results? In this context, does the speed of GPT-3.5 outweigh the GPT-4 quality? For us, some GPT-4 responses take >30 seconds, which is pretty terrible from a UX perspective. Also, curious if you are still using Algolia for the other "basic" search experience? We were thinking about using Algolia for search, and then sending the top few most relevant results to OpenAI for summarization. Not sure if the quality of Algolia's search results (even with all of their AI synonyms, dynamic re-ranking, etc.) would be meaningfully different than creating your own vector database? I appreciate you taking the time to make this video, it's validating some stuff I'm working on for sure!
Isn't it super expensive to calculate the similarity twice (~27:00 in the video), in the SELECT and the WHERE?
Loved the video; it validated a lot of the decisions we are making at work. I have a question, however, on the section about context injection. You mention that you search for relevant information to inject into the prompt. How do you accomplish the search part? Is it using an index or a SQL query across all columns?
@RabbitHoleSyndrome
A year ago
Glad it was helpful! The search is done through embeddings - we perform a similarity search between the embedding generated from the user's query and the pre-generated embeddings of the knowledge base (stored in a column using pgvector).
@WilbertoCasillas
A year ago
@@RabbitHoleSyndrome Ahh, so: 1) call the OpenAI embedding API for the query, 2) use cosine similarity to compare the query embedding against the stored embeddings, 3) inject the top results into a prompt that we compile and send to the OpenAI completion API?
@RabbitHoleSyndrome
A year ago
You got it 👍
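The three-step flow confirmed above can be sketched end to end. Everything here is a hypothetical stand-in: `embed` would call the OpenAI embeddings API, and the returned prompt string would go to the completions API; injecting the embed function makes the ranking logic testable without the network:

```typescript
type Doc = { text: string; embedding: number[] };
type EmbedFn = (text: string) => number[]; // stand-in for the embeddings API

// Step 1: embed the query. Step 2: rank stored docs by similarity and take
// the top k. Step 3: compile the matches into a prompt for the completion API.
function buildPrompt(query: string, docs: Doc[], embed: EmbedFn, k = 2): string {
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, v, i) => sum + v * b[i], 0);
  const queryEmbedding = embed(query); // step 1
  const context = [...docs]            // step 2
    .sort((a, b) => dot(queryEmbedding, b.embedding) - dot(queryEmbedding, a.embedding))
    .slice(0, k)
    .map((d) => d.text)
    .join("\n---\n");
  // step 3: this string is what gets sent to the completions endpoint
  return `Context sections:\n${context}\n\nQuestion: ${query}\nAnswer:`;
}
```

In the real implementation, step 2 runs inside Postgres via a pgvector similarity query rather than in application code.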
This would be amazing for the Obsidian note-taking application.
@RabbitHoleSyndrome
A year ago
Great idea!