
Semantic Chunking - 3 Methods for Better RAG

Semantic chunking allows us to build more context-aware chunks of information. We can use this for RAG, splitting video and audio, and much more.
In this video, we will use a simple RAG-focused example. We will learn about three different types of chunkers: StatisticalChunker, ConsecutiveChunker, and CumulativeChunker.
At the end, we also discuss semantic chunking for video, such as for the new gpt-4o and other multi-modal use cases.
📌 Code:
github.com/aur...
⭐️ Article:
www.aurelio.ai...
👋🏼 AI Consulting:
aurelio.ai
👾 Discord:
/ discord
Twitter: / jamescalam
LinkedIn: / jamescalam
#ai #artificialintelligence #chatbot #nlp
00:00 3 Types of Semantic Chunking
00:42 Python Prerequisites
02:44 Statistical Semantic Chunking
04:38 Consecutive Semantic Chunking
06:45 Cumulative Semantic Chunking
08:58 Multi-modal Chunking
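The statistical method covered in the video can be sketched in a few lines. The toy example below illustrates the idea only (it is not the semantic-chunkers library itself): embed each sentence, measure the similarity between neighbours, and split wherever similarity drops below a threshold derived from the data rather than hand-set. Bag-of-words vectors stand in for a real embedding model such as OpenAI's.

```python
# Toy sketch of the statistical-chunking idea: split where similarity between
# consecutive sentence "embeddings" drops below a data-derived threshold.
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def statistical_chunk(sentences: list[str]) -> list[list[str]]:
    vecs = [embed(s) for s in sentences]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    threshold = mean - std  # "statistical": derived from the data itself
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

docs = [
    "Cats are small mammals.",
    "Cats are popular pets.",
    "The stock market fell sharply today.",
    "Investors sold stock quickly.",
]
chunks = statistical_chunk(docs)
print(chunks)
```

With these four sentences, the similarity between the two cat sentences and between the two finance sentences stays above the threshold, so the split lands between the two topics.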

Comments: 25

  • @wassfila · 2 months ago

    This is really promising, thank you. It's really hard to get an overview of the cost/benefit of end results from a RAG end-user perspective. Something like a comparison table would help.

  • @KenRossPhotography · 2 months ago

    Super interesting - thanks for that! I'll definitely be experimenting with those chunking variations.

  • @jamesbriggs · 2 months ago

    Awesome, would love to hear how it goes

  • @BB-ou5ui · 2 months ago

    Hi! That's exactly what I was looking for and have been exploring with some personal implementations, trying different strategies beyond dense vectors... Have you considered using multi-vector models like ColBERT? To some extent, you could work with matrix similarities on bigger contexts... I'm also testing some weighted strategies using SPLADE, but it's too early to make claims 😊

  • @jamesbriggs · 2 months ago

    📌 Code: github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb ⭐ Article: www.aurelio.ai/learn/semantic-chunkers-intro

  • @samcavalera9489 · 2 months ago

    Hi James, first off, I want to express my immense gratitude for your insightful videos on RAG and other AI topics. Your content is so enriching that I find myself watching each video at least twice! I do have a couple of questions that I hope you can shed some light on:

    1) When using OpenAI's small embedding model with the RecursiveCharacterTextSplitter, is there a general guideline for determining the optimal chunk size and overlap size? I'm looking for a rule of thumb that could help me set the right values for these parameters.

    2) My work primarily involves using RAG on scientific papers, which often include figures that sometimes convey more information than the text itself. Is there a technique to incorporate these figures into the vector database along with the paper's text? Essentially, for multi-modal vector embedding that includes both text and images, what's the best approach?

    I greatly appreciate your insight 🙏🙏🙏

  • @jamesbriggs · 2 months ago

    Hey, thanks for the message! For (1) my rule of thumb is 200-300 tokens with a 20-40 token overlap. For (2) you can use multimodal models (like gpt-4o) to describe what is in the image, then embed that description. Alternatively, you could use a text-image embedding model, but those don't capture as much detail as what you could get from a multimodal LLM. Hope that helps :)
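The (1) rule of thumb above can be sketched as a simple overlapping splitter. This is a rough illustration only: whitespace-separated words stand in for tokens, and a real implementation would count tokens with a tokenizer such as tiktoken.

```python
# Sketch of the 200-300 token / 20-40 token overlap rule of thumb.
# "Tokens" here are whitespace words, a crude proxy for real tokenizer tokens.
def split_with_overlap(text: str, chunk_size: int = 250, overlap: int = 30) -> list[str]:
    # Assumes overlap < chunk_size, so each step advances the window.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk shares its last `overlap` tokens with the start of the next, which helps retrieval when a relevant passage straddles a chunk boundary.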

  • @samcavalera9489 · 2 months ago

    @@jamesbriggs many thanks James 🙏🙏🙏

  • @AGI-Bingo · 2 months ago

    Hi James, could you please cover how to do "citing" with RAG, with the option to open the original source? That would be cool ❤ I'd also love to see an example of a "live RAG" that watches certain files or folders for changes, then re-chunks, re-embeds, removes outdated entries, and saves diffs. What do you think about these? Thanks a lot!

  • @tarapogancev · 1 month ago

    If you are using Pinecone or a similar vector database, along with the vector entry you can usually also add specific metadata. I mostly keep the original text stored with that vector as a 'content' metadata field, and then add other fields for the file's name, topic, etc. :) This way, you can cross-reference your data so users can navigate it easily.
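The metadata approach described above can be sketched with a hypothetical in-memory store. Field names such as "content", "filepath", and "topic" are illustrative, not a fixed schema; in Pinecone they would live in the metadata dict attached to each upserted vector.

```python
# Hypothetical sketch: store source metadata next to each vector so that
# retrieved chunks can cite, and link back to, their origin.
records: list[dict] = []

def upsert(vec_id: str, values: list[float], content: str, filepath: str, topic: str) -> None:
    # Mirrors the vector-database pattern of {id, values, metadata}.
    records.append({
        "id": vec_id,
        "values": values,
        "metadata": {"content": content, "filepath": filepath, "topic": topic},
    })

def cite(record: dict) -> str:
    # Build a user-facing citation from the stored metadata.
    meta = record["metadata"]
    return f'"{meta["content"]}" (source: {meta["filepath"]})'

upsert("chunk-0", [0.1, 0.2], "Cats are mammals.", "docs/cats.pdf", "animals")
print(cite(records[0]))
```

At query time, the same metadata lets the UI offer an "open the original source" action per retrieved chunk.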

  • @AGI-Bingo · 1 month ago

    Got it, so you could also add a "filepath" field and trigger opening the file. I wonder if there's a way to jump to and highlight a specific part of the text after opening (i.e. a PDF). Also, @@tarapogancev do you know of a way to run diffs on files and delete/re-upload all the relevant chunks? Watching files and folders for changes, then triggering re-embedding, to keep everything automatically up to date. Thanks 🙏 👍

  • @tarapogancev · 1 month ago

    @@AGI-Bingo The idea of highlighting relevant text sounds great! I am yet to face the UI portion of this problem while trying to achieve similar results. :) I haven't worked with automatic syncs, but they would be very useful! From what I've seen, AWS Knowledge Bases and Azure's AI Search (if I remember correctly) both offer options to sync data manually when needed. It's not as convenient, but I'm thinking it's not a bad solution either, considering it's possibly less work on the server side, and maybe fewer credits for OpenAI or other LLM services. Sorry I couldn't offer help on this topic, but I hope you come up with a great solution! :D

  • @Piero-xi1yi · 2 months ago

    Could you please explain the logic and concepts behind your code? How does this compare with the semantic chunkers from LangChain / LlamaIndex? (They use something like your cumulative approach, with a sliding window of n sentences and an "adaptive" threshold based on a percentile.)
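The percentile-threshold idea this comment describes (as in LangChain's SemanticChunker) can be sketched as follows. Here `distances` stands in for the distances between consecutive sentence embeddings, which a real implementation would compute from an embedding model.

```python
# Sketch of percentile-threshold splitting: break wherever the distance
# between consecutive sentence embeddings exceeds the pct-th percentile.
def percentile_split(sentences: list[str], distances: list[float], pct: float = 95.0) -> list[list[str]]:
    ranked = sorted(distances)
    # Simple percentile pick (no interpolation), clamped to the last element.
    threshold = ranked[min(len(ranked) - 1, int(len(ranked) * pct / 100))]
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

The contrast with the StatisticalChunker sketchable from the video is in how the threshold is set: a fixed percentile of the observed distances versus a mean/deviation-style statistic.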

  • @hughesadam87 · 19 days ago

    I've been using a tool called unstructured to split my documents into known sections (i.e. title, abstract, paragraphs) - it can do the splitting. Do you think having these sections a priori is helpful to chunking, or is it better to just feed plaintext to the chunking strategy and let it do all the grouping/separating etc.?

  • @user-eh2ji5xs8k · 2 months ago

    Can we use Ollama for the embedding?

  • @CBCELIMUPORTALORG · 2 months ago

    🎯 Key points for quick navigation:
    📘 The video introduces three semantic chunking methods for text data, improving retrieval-augmented generation (RAG) applications.
    💻 Demonstrates use of the semantic-chunkers library, showcasing practical examples via a Colab notebook, requiring an OpenAI API key.
    📊 Focuses on a dataset of AI arXiv papers, applying semantic chunking to manage the data's complexity and improve processing efficiency.
    🤖 Discusses the need for an embedding model to facilitate semantic chunking, highlighting OpenAI's embedding model as a primary tool.
    📈 Outlines the statistical chunking method as a recommended approach for its efficiency, cost-effectiveness, and automatic parameter adjustments.
    🔍 Explains consecutive chunking as cost-effective and relatively fast, but requiring more manual input for tuning parameters.
    📝 Presents cumulative chunking as a method that builds embeddings progressively, offering noise resistance but at a higher computational cost.
    🌐 Notes the adaptability of chunking methods to different data modalities, with specific mention of their suitability for text and potential for video.
    Made with HARPA AI

  • @ariugarte · 2 months ago

    Hello, it's a fantastic tool! But I encountered some problems with tables in PDFs, and with strings that use characters such as '-' to separate phrases or sections. I end up with chunks that are much bigger than the maximum size.
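One hedged workaround for the oversized-chunk issue described above (a post-processing step, not a feature of the library): hard-split any chunk that still exceeds a maximum size after semantic chunking. Splitting on raw character count is crude; a real version would prefer sentence or token boundaries.

```python
# Post-processing safety net: hard-split any chunk above a character budget.
def enforce_max_size(chunks: list[str], max_chars: int = 500) -> list[str]:
    out = []
    for chunk in chunks:
        while len(chunk) > max_chars:
            out.append(chunk[:max_chars])  # crude cut; prefer sentence breaks
            chunk = chunk[max_chars:]
        out.append(chunk)
    return out
```

This guarantees every chunk fits an embedding model's context limit even when the semantic splitter finds no good break point, e.g. inside a table or a long dash-separated run.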

  • @prasunkumar2106 · 2 days ago

    How can I use llama3.1 to achieve this?

  • @talesfromthetrailz · 2 months ago

    How would you compare the Statistical chunker with the rolling window splitter you used for semantic chunking? Do you prefer one over the other? I'm designing a recommendation system that uses user queries to match to certain outputs they may want. Thanks!

  • @jamesbriggs · 2 months ago

    The StatisticalChunker is actually just a more recent version of the rolling window splitter; it includes handling for larger documents and some other optimizations, so I'd recommend the StatisticalChunker.

  • @maxlgemeinderat9202 · 2 months ago

    Nice video! So, e.g., if I am reading in docs with unstructured.io, can I then use the semantic chunker instead of a RecursiveCharacterTextSplitter?

  • @jamesbriggs · 2 months ago

    Yes you can, there's an (old, I should update) example here: github.com/aurelio-labs/semantic-router/blob/main/docs/examples/unstructured-element-splitter.ipynb The "splitter" there is equivalent to the StatisticalChunker in semantic-chunkers.

  • @looppp · 2 months ago

    Great video!

  • @lavamonkeymc · 2 months ago

    Where's the advanced LangGraph video?