Johnny Code
9 months ago
34,986
1

Getting Started with ChromaDB - Lowest Learning Curve Vector Database For Semantic Search

Science & Technology

Start testing out semantic searches on a vector database within minutes. Everything works locally and is free. You don't need to sign up for a cloud vector database account or learn LangChain first.
Buy Me a Coffee: www.buymeacoffee.com/johnnycode
Get the code: github.com/johnnycode8/chroma...
ChromaDB Playlist: • ChromaDB Vector Databa...

Comments: 57

@johnnycode
7 months ago
Please help me out with a subscribe if this video helped you 😀 AND I would love to know what you're doing in your ChromaDB project. Check out my new video: How to vectorize 33K embeddings to ChromaDB in 3 minutes: kzread.info/dash/bejne/aXqqxtmwptTYdJc.html
@efexzium
4 months ago
Hi Johnny, my name's Johnny, nice to meet you lol.
@5uryaprakashp112 days ago
Now I can put ChromaDB on my resume. Thank you for creating such a crisp and straightforward tutorial.
@Bryan-mw1wj
7 days ago
Perfect, all I was interested in was persisting the database. Thanks for adding that at the end.
@deeplearning7097
7 months ago
Short and Sweet! Thank you very much.
@sherozeajmal
8 months ago
Man, that was amazing. Thank you so much.
@newcooldiscoveries5711
6 months ago
Excellent tutorial. Thank You!
@sanjeevKumar-eg6hp
6 months ago
Such an amazing video man, Thanks for this valuable Knowledge
@youngzproduction7498
3 months ago
Concise and precise are what you did here.
@kenchang3456
4 months ago
You have a new subscriber. This video was very timely for my proof of concept.
@devopsmentor9511
7 months ago
Awesome demo
@thatoshebe5505
4 months ago
great demo and concise
@user-tv7nv1wr9d
7 months ago
Thanks for the excellent explanation
@VELTIONoptimum
6 months ago
Excellent tutorial.
@Connor5144017 days ago
Very helpful! Thank you
@davidtindell950
3 months ago
Good Intro / Review. Thank You from a NEW Subscriber !!!
@RevMan001
5 months ago
This helped a lot! 👍
@freepythoncode
5 months ago
Amazing video thank you so much 🙂❤
@daenindanielrae5013
5 months ago
Thank you my guy 🙂
@popalex
8 months ago
Very interesting !
@dan_taninecz_geopol
7 months ago
Rad. Thanks.
@kenchang3456
4 months ago
The POC use case I want to try is to query vehicle parts based on a user description, which can be somewhat vague, so keyword search is not ideal most of the time. I'd also want to do search-as-you-type with a vector db. My understanding is that there are embedding models trained on vehicle parts, and seeing how easy it is to specify a new embedding model (although you have to remember the collection), I hope I can prove my POC. In addition, I want to use the persisted collection for other ideas as well. Thanks for making this video, it really helps.
@johnnycode
4 months ago
Thanks for sharing, Ken. Your app would have come in handy for me when I was searching for what turned out to be a "cap" or "flute" oil filter wrench :D
@ddricci12
1 month ago
Thanks!
@johnnycode
1 month ago
Thank you for the support!!!!😀😀😀
@user-dp7lr5qh6o
4 months ago
thank you
@webchicka
4 months ago
Thank you so much for an easy-to-follow, practical example! I’m curious about something… at one point in the video you illustrate that the default embeddings are returning something unrelated (sesame ball) as the #1 choice. Your solution is to swap it out for another embedding provider. But how would you go about digging in here and debugging further?
@johnnycode
4 months ago
Unfortunately, when it comes to the search results, we are at the mercy of how 'smart' the embedding model is. The term 'sesame ball' is a translation and unique to this restaurant's menu, so I wouldn't expect models to know the meaning of the term. Somehow, its vector representation is close to the word shrimp for the first model, but we don't have a way to 'debug' it. Here are some things that we have control of:
1. Change the amount of text (short phrase, sentences, paragraphs, or pages) per embedding. The more that is packed into one embedding, the harder it is for the model to be accurate.
2. Switch the distance function to another one, like cosine similarity: docs.trychroma.com/usage-guide#changing-the-distance-function
3. Switch to a more powerful, possibly paid, model; see the list of models and the section on Custom Embedding Functions: docs.trychroma.com/embeddings
4. Fine-tune a model to understand the terms used by your organization.
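To see why switching the distance function can change which result comes first, here is a minimal sketch in plain Python (it does not use ChromaDB itself). The 2-D vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; compares direction and ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank(query, items, distance):
    """Return item names sorted from nearest to farthest."""
    return sorted(items, key=lambda name: distance(query, items[name]))

# Hypothetical vectors, invented for this illustration.
query = [1.0, 0.0]
items = {
    "same_direction_but_long": [3.0, 0.0],
    "nearby_but_tilted": [0.8, 0.6],
}

l2_ranking = rank(query, items, squared_l2)
cosine_ranking = rank(query, items, cosine_distance)
print(l2_ranking)
print(cosine_ranking)
```

The two metrics disagree on the top result for these vectors, which is why trying another distance function is a cheap experiment. How to set the distance function on a collection is covered in the usage guide linked above.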
@webchicka
4 months ago
@@johnnycode Awesome, thanks for the incredibly thorough and helpful answer!
@rsg477
2 months ago
Hi there, I have a CSV file with 150 rows. I have created a collection and added the document to the collection. When I query the collection, the document field is giving me None. The ids field gives the correct id, but the document field is None. What should I do? Is there any size limit for the field returned by document?
@johnnycode
2 months ago
If you use my code and the CSV provided in my GitHub repo (github.com/johnnycode8/chromadb_quickstart), does the document field show up? If the document field does show up, then you should check your loading logic. If the document field does not show up, check the part of the video that talks about using "include": collection.query(... include=[ "documents" ] ) I don't see a published document field size limit. Are you loading extremely long docs?
@bk3460
1 month ago
@johnnycode, is there any idea how to manage this error that occurs when I try to load a previously saved chromadb file, e.g. "vectordb": InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768?
@johnnycode
1 month ago
You must use the same Embedding Function that you used to create that database. Embedding Functions convert text to a vector of numbers, and different Embedding Functions produce vectors with different numbers of dimensions, so you can't use them interchangeably.
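The mismatch can be pictured with a toy stand-in. This is a sketch only: `ToyCollection`, `embed_768` and `embed_384` are hypothetical names invented here, and the class is not ChromaDB's implementation. It only mimics the rule that a collection's dimensionality is fixed by the first embedding stored in it.

```python
class ToyCollection:
    """Stand-in for a vector collection; NOT ChromaDB's implementation."""

    def __init__(self):
        self.dimension = None  # fixed by the first embedding stored
        self.embeddings = []

    def add(self, embedding):
        if self.dimension is None:
            self.dimension = len(embedding)
        elif len(embedding) != self.dimension:
            raise ValueError(
                f"Embedding dimension {len(embedding)} does not match "
                f"collection dimensionality {self.dimension}"
            )
        self.embeddings.append(embedding)

# Hypothetical embedding functions that only mimic the output sizes
# named in the error message (768 and 384).
def embed_768(text):
    return [0.0] * 768

def embed_384(text):
    return [0.0] * 384

collection = ToyCollection()
collection.add(embed_768("shrimp dumpling"))  # collection is now 768-dimensional

try:
    collection.add(embed_384("prawn"))  # different model, different vector size
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In the commenter's case, the collection was presumably built with a 768-dimension model and later opened with a 384-dimension one, so passing the original Embedding Function again when loading the collection is the fix described above.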
@DinosaurSuccess
5 months ago
What do you do when all-MiniLM-L6-v2 is not very good at judging what's similar? It gets it wrong a lot!
@johnnycode
5 months ago
Here are a few suggestions, hope this helps:
1. Are you embedding a short phrase, a few sentences, paragraphs, or pages? The more that is packed into one embedding, the harder it is for the model to be accurate.
2. Try switching the distance function to another one, like cosine similarity: docs.trychroma.com/usage-guide#changing-the-distance-function
3. Switch to a more powerful, possibly paid, model; see the list of models and the section on Custom Embedding Functions: docs.trychroma.com/embeddings
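For the first suggestion, one possible way to pack less text into each embedding is to split long text into small chunks before adding it. The sketch below uses a naive sentence splitter. The function name `split_into_chunks`, the sample text and the ID scheme are all assumptions made for illustration.

```python
import re

def split_into_chunks(text, max_sentences=2):
    """Split text into chunks of at most `max_sentences` sentences each."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

# Hypothetical menu description, invented for this example.
text = (
    "Shrimp dumplings are steamed. They come four per order. "
    "Sesame balls are fried. They are filled with red bean paste. "
    "Tea is free."
)
chunks = split_into_chunks(text, max_sentences=2)
ids = [f"chunk-{i}" for i in range(len(chunks))]

for chunk_id, chunk in zip(ids, chunks):
    print(chunk_id, "->", chunk)
```

Each chunk would then be stored as its own document with its own ID, so a query is compared against a couple of sentences at a time instead of a whole page.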
@spinze
1 month ago
How do the models know that shrimp and prawn are the same thing? Like how did the first model not get all 5 dishes, but the second model found all 5 dishes?
@johnnycode
1 month ago
The models are trained to understand language like ChatGPT’s models. The second model in the video is a larger “smarter” model, so it performs better than the first. The disadvantages of a larger model are that it takes up more storage space, uses more computing power and memory, and is slower than a smaller model.
@reubengeorge7470
2 months ago
I have a CSV file with 150 rows. I have created a collection and added my document to it. When I query it, the id field gives me the correct id but the document field is always None. Is there any size constraint for the document field? How should I solve it?
@johnnycode
2 months ago
If you use my code and the CSV provided in my GitHub repo (github.com/johnnycode8/chromadb_quickstart), does the document field show up? If the document field does show up, then you should check your loading logic. If the document field does not show up, check the part of the video that talks about using "include": collection.query(... include=[ "documents" ] ) I don't see a published document field size limit. Are you loading extremely long docs?
@reubengeorge7470
2 months ago
@@johnnycode Got it! The problem was caused by passing multiple columns from a row in the CSV file.
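The commenter does not say how the loading code was changed. One possible approach, sketched here with an invented CSV and invented column names, is to join the columns of each row into a single string, so that every ID maps to exactly one document.

```python
import csv
import io

# Hypothetical CSV content; the column names are invented for this example.
csv_text = """item_id,name,description
1,Shrimp Dumpling,Steamed dumpling with shrimp filling
2,Sesame Ball,Fried dough ball coated in sesame seeds
"""

ids = []
documents = []
for row in csv.DictReader(io.StringIO(csv_text)):
    ids.append(row["item_id"])
    # Join the columns into ONE string per row, so each ID maps to one document.
    documents.append(f'{row["name"]}: {row["description"]}')

print(ids)
print(documents)
# Two equal-length lists of strings like these are what would be passed
# as collection.add(ids=ids, documents=documents).
```

With a real file, `io.StringIO(csv_text)` would be replaced by an opened file handle.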
@reubengeorge7470
2 months ago
Have you come across this warning?
Add of existing embedding ID: 1
Add of existing embedding ID: 2
... and so on for all IDs.
I am only querying the database, but I am still getting this warning.
@johnnycode
2 months ago
I think you should create a fresh database and collection and try again. If you tried to insert records under IDs that already exist, it causes weird issues.
@reubengeorge7470
2 months ago
@@johnnycode Okay. I deleted the collection and did it again.... It worked. Thanks!!!
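The warning text suggests that IDs were being added that already existed in the collection. A likely cause, though the thread does not confirm it, is a script that re-runs its add step against an existing collection. A minimal sketch for checking IDs before adding them follows. The helper names and the sample IDs are hypothetical.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return IDs that occur more than once, in order of first appearance."""
    counts = Counter(ids)
    return [id_ for id_ in counts if counts[id_] > 1]

def new_ids_only(candidate_ids, existing_ids):
    """Return the candidate IDs that are not already stored."""
    existing = set(existing_ids)
    return [id_ for id_ in candidate_ids if id_ not in existing]

# Hypothetical IDs for illustration.
batch = ["1", "2", "2", "3", "1"]
already_stored = ["1", "2"]

duplicates = find_duplicate_ids(batch)
to_add = new_ids_only(["1", "2", "3", "4"], already_stored)
print(duplicates)
print(to_add)
```

In a real script, `already_stored` could be filled from the "ids" field of the collection's `get()` result, and only the rows in `to_add` would then be passed to `add`.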
@SoundTamilan
1 month ago
Where user defind db space we can create
@johnnycode
1 month ago
Sorry, I don’t understand your question.
@SoundTamilan
1 month ago
@@johnnycode Where will it take the memory if it is the default database? Can the user manually create the db and assign that path?
@johnnycode
1 month ago
You can run ChromaDB in-memory if you are prototyping and don't need to retain the data. However, if you want to retain the data, use persistence mode: client = chromadb.PersistentClient(path="/path/to/save/to") You can see that I use persistence mode in my other videos: kzread.info/head/PL58zEckBH8fA-R1ifTjTIjrdc3QKSk6hI I hope this answers your question.
@kevinehsani3358
8 months ago
Do you have a list of your code or a Colab link?
@johnnycode
8 months ago
Here you go: github.com/johnnycode8/chromadb_quickstart
@kevinehsani3358
8 months ago
Thanks @@johnnycode
@kevinehsani3358
8 months ago
Thanks. I do have one or two questions if you don't mind. First, client = chromadb.PersistentClient(path='content/drive') does not create the db on Colab, though the folder exists. It just defaults to in-memory and stores it there; not sure if that is because of Colab? Also, when I retrieve using document = collection.get(ids=[document_id], include=['documents']), it still brings back the entire record instead of just the documents: {'ids': ['kk'], 'embeddings': None, 'metadatas': None, 'documents': ['**Section 1: Numbers 1-5 in ......' Am I doing this wrong? Thanks a bunch.
@johnnycode
8 months ago
For question 1: path='content/drive' points to your Google Drive folder. Change it to something like path='content/myvectordb' or path='content/drive/My Drive/Colab Notebooks/myvectordb'. For question 2: The 'get' function will always return the entire record structure, but you can see that the data is not returned for your example: embeddings:None,metadatas:None.
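For question 2, the documents can be pulled out of the returned structure in one line. A small sketch, using a dict shaped like the one quoted in the question (the document text is a hypothetical placeholder, and `documents_by_id` is a helper name invented here):

```python
# A result dict shaped like the one quoted in the question; the document
# text here is a hypothetical placeholder.
get_result = {
    "ids": ["kk"],
    "embeddings": None,
    "metadatas": None,
    "documents": ["example document text"],
}

def documents_by_id(result):
    """Map each ID to its document, ignoring the fields that came back as None."""
    return dict(zip(result["ids"], result["documents"]))

docs = documents_by_id(get_result)
print(docs["kk"])
```

The fields that were not requested through "include" stay in the structure as None, so only "ids" and "documents" need to be read.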
@kevinehsani3358
8 months ago
I tried all sorts of combinations for the persist directory, like './', 'content', './content'. Nothing works.
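The thread does not record a resolution for this. One way to rule out a path mix-up is to resolve the directory to an absolute path, create it, and confirm it is writable before handing it to the client. The sketch below uses only the standard library. `resolve_persist_dir` is a hypothetical helper, and whether this fixes the Colab behaviour described above is not known.

```python
import os
import tempfile
from pathlib import Path

def resolve_persist_dir(path):
    """Return an absolute, existing, writable directory for the given path."""
    # Relative paths are resolved against the current working directory.
    directory = Path(path).expanduser().resolve()
    directory.mkdir(parents=True, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"Cannot write to {directory}")
    return directory

# Demonstrated in a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    persist_dir = resolve_persist_dir(os.path.join(tmp, "myvectordb"))
    print(persist_dir)
    is_absolute = persist_dir.is_absolute()
    existed = persist_dir.is_dir()
```

Printing the resolved path shows where a relative value such as 'content' really points, and `str(persist_dir)` could then be passed as the `path` argument of `chromadb.PersistentClient`.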
@djs4553
3 months ago
UserWarning: Unsupported Windows version (11). ONNX Runtime supports Windows 10 and above, only. What trouble you people are...
@sivuyilesifuba
4 months ago
@efexzium
3 months ago
U should really sell ur code everyone's benefiting for free.