Semantic-Text-Splitter - AI Based Text-Splitting with LangChain

In this video I will build on my last video, where I introduced the semantic-text-splitter package. This time I will show you how to split texts with an LLM.
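
For readers who want to follow along, here is a minimal sketch of the idea: ask an LLM to insert chunk boundaries and split on them. The model name, prompt, and delimiter are illustrative assumptions, not the exact code from the video.

```python
# Minimal sketch of LLM-based splitting (model, prompt and delimiter are illustrative, not the video's code).
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template(
    "Split the following text into semantically coherent chunks. "
    "Reproduce the text verbatim and place the delimiter '###' between chunks.\n\n{text}"
)

def llm_split(text: str) -> list[str]:
    """Let the LLM mark chunk boundaries, then split on the delimiter."""
    response = (prompt | llm).invoke({"text": text})
    return [chunk.strip() for chunk in response.content.split("###") if chunk.strip()]

print(llm_split("Your long document text goes here ..."))
```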

Comments: 46

  • @awakenwithoutcoffee
    1 month ago

    This is an amazing method, brother. I think the next step would be to train a local LLM to pre-process documents, extract images and tables (with metadata), etc.

  • @andydataguy
    4 months ago

    This is awesome! I'm working on a text splitter this weekend. Thank you for this video 🙌🏾

  • @Challseus
    4 months ago

    Very happy you continue to do advanced topics 💪🏾

  • @codingcrashcourses8533
    4 months ago

    The next advanced topic will follow on Monday :)

  • @user-sw2se1xz6r
    4 months ago

    Thanks for the addition to the first video! This makes it all clear now! 👍

  • @codingcrashcourses8533
    4 months ago

    Great ;). Thank you for your comment

  • @pragyantiwari3885
    1 month ago

    Here is what I did: first I extracted the text from my PDF files and passed it to this class... it is working well.
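
For reference, a small sketch of that first step, assuming the pypdf library for extraction; "this class" refers to the LLM-based splitter from the video.

```python
# Sketch: pull the raw text out of a PDF before handing it to the splitter (pypdf assumed).
from pypdf import PdfReader

reader = PdfReader("my_document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
# 'text' can now be passed to the LLM-based splitter shown in the video.
```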

  • @codingcrashcourses8533
    1 month ago

    What do you do with images and tables? :)

  • @swiftmindai
    4 months ago

    Looking forward to your upcoming Monday video. My two cents about this channel: honestly, the quality of the content here is far above par compared to many other channels with several times the subscribers. Personally, I've shared it with many of my friends and all of them have benefited from it. You deserve a lot of recognition. Keep up the good work; I wish you and your channel all the very best.

  • @codingcrashcourses8533
    4 months ago

    Thanks so much

  • @say.xy_
    4 months ago

    Finally I am able to connect the three dots: LLMs perform better when references are given, every RAG system uses those references as chunks, and better-quality chunks will eventually reduce hallucinations and increase the chance of more accurate outputs.

  • @andreypetrunin5702
    4 months ago

    Thank you so much!!!

  • @nathank5140
    4 months ago

    Love it. Thank you. I'd love to hear your ideas on how to preprocess a raw meeting transcript, say for example a Joe Rogan episode. In my case it would be a business meeting between a prospect and an onboarding agent. The goal is to process the meeting into something that would be useful later to a retriever. I've thought about writing an article about the meeting, then chunking that, with each chunk being enriched with metadata about who/what/where/when. But I just can't find the right approach to make the chunks useful/valuable/dense enough without losing context.

  • @codingcrashcourses8533
    4 months ago

    Depends on what you want to do with it. You can also summarise parts of it if every detail is not that important
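
One hedged way to combine both ideas from this exchange (summarize the less important parts, enrich each chunk with who/what/where/when metadata), assuming LangChain's Document class; the metadata keys and values are invented for illustration.

```python
# Sketch: wrap transcript chunks as Documents with speaker/topic metadata (keys and values are illustrative).
from langchain_core.documents import Document

chunk = "Prospect asks about onboarding timelines; agent commits to a two-week rollout."
doc = Document(
    page_content=chunk,
    metadata={
        "speakers": ["prospect", "onboarding_agent"],
        "meeting_date": "2024-05-01",
        "topic": "onboarding timeline",
    },
)
```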

  • @Kevin-jm1go
    4 months ago

    The reason you don't see this in LangChain yet is probably that the text splitters there are meant to split text that is much larger than the context length of the model. In that case, you cannot just make a single LLM call to chunk your document. If you find a way to semantically chunk text that is much larger than the model's context length without making too many assumptions about the structure of the text, that would be really interesting.

  • @codingcrashcourses8533
    4 months ago

    Yes, but you could probably build a multi-step splitter that takes the size of the context window into consideration :).
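
A rough sketch of what such a multi-step splitter could look like: pre-split the text into token windows that fit the context, let the LLM split each window semantically, and carry the last chunk over into the next window as overlap. It assumes tiktoken for token counting and the llm_split helper sketched above; this is one possible design, not a tested implementation.

```python
# Sketch: multi-step splitting for texts larger than the context window (one possible design, untested).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def multistep_split(text: str, window_tokens: int = 3000) -> list[str]:
    """Pre-split into token windows, then let the LLM refine each window semantically."""
    tokens = enc.encode(text)
    chunks: list[str] = []
    carry = ""  # last chunk of the previous window, re-fed so a topic is not cut in half
    for start in range(0, len(tokens), window_tokens):
        window = enc.decode(tokens[start:start + window_tokens])
        window_chunks = llm_split(carry + window)  # llm_split: the LLM-based splitter sketched earlier
        carry = window_chunks.pop() if window_chunks else ""
        chunks.extend(window_chunks)
    if carry:
        chunks.append(carry)
    return chunks
```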

  • @Kevin-jm1go
    4 months ago

    @codingcrashcourses8533 I like the idea and I would love to see a video about this 😉 to get an idea of what kind of challenges this might bring and how they are tackled. One of these challenges could be that when sliding a fixed window size over your text, you might involuntarily split a coherent piece of text into two topics. Also, sometimes a specific topic is mentioned in distributed parts of the whole text. I guess you would want to merge those chunks, but the challenge lies in identifying them. It would also be interesting to see this with some other LLMs, e.g. Aleph Alpha.

  • @efexzium
    4 months ago

    The semantic text splitter can adjust the token size, so it can work with most models' context sizes.
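
For comparison, a small sketch of that adjustable capacity with the semantic-text-splitter package; the constructor has changed between package versions, so the exact call is an assumption to verify against the installed version.

```python
# Sketch: token-capacity splitting with semantic-text-splitter (API may differ between versions).
from semantic_text_splitter import TextSplitter

max_tokens = 500  # tune this to the target model's context window
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo", max_tokens)
chunks = splitter.chunks("Your long document text goes here ...")
```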

  • @karthikarunr8505
    2 months ago

    Hey, can you share the code/notebook for this? I have a similar use case and it'd be great if I could take some excerpts from it.

  • @yinxing418
    4 months ago

    I actually liked the semantic text splitter; too bad it's not so semantic. This way doesn't seem like it would scale to large amounts of documents, so I don't think I would use it.

  • @codingcrashcourses8533
    4 months ago

    Yes, I made this video just as a brain teaser. As you can see, I did not make a repo for this :)

  • @DzulkifleeTaib
    4 months ago

    Is it correct to say that the limitation is the context window size of the LLM that processes the inbound input?

  • @codingcrashcourses8533
    4 months ago

    Yes and no. RAG is also about the amount of noise you send to an LLM. The more unrelated data you send to an LLM, the worse its performance becomes.

  • @anthonycadden120
    4 months ago

    Have you tried doing this with a smaller model, either for speed or for a local instance?

  • @codingcrashcourses8533
    4 months ago

    No, I rarely use local models due to my old computer ;-)

  • @maxiweisei
    4 months ago

    Looks cool! I'm thinking about implementing this with large, complex PDFs. What would be a suitable way to use this approach with PDFs while keeping the information about the page_number of a chunk?
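
One hedged way to keep page numbers is to load the PDF page by page and split the resulting documents, so each chunk inherits its page metadata. The sketch below assumes LangChain's PyPDFLoader and uses a character splitter for brevity; the LLM-based splitter from the video could be applied per page in the same way.

```python
# Sketch: keep page numbers by loading per page and splitting documents (metadata is carried over).
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pages = PyPDFLoader("report.pdf").load()  # one Document per page, with the page number in metadata
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)  # each chunk keeps its source page's metadata

print(chunks[0].metadata)  # e.g. {'source': 'report.pdf', 'page': 0}
```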

  • @codingcrashcourses8533
    4 months ago

    For PDFs I made a video about multimodal RAG with gpt-4-vision.

  • @mrchongnoi
    4 months ago

    Edited: Looks good. What about large documents? How would this work for tables?

  • @codingcrashcourses8533
    4 months ago

    Tables are a different topic... they probably should not be embedded at all, but rather stored in a DB and queried via function calling. That's at least my experience with it.
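
A compressed sketch of that pattern, assuming SQLite and LangChain tool calling; the database, table, and question are invented for illustration.

```python
# Sketch: keep tabular data in a database and expose it to the LLM as a tool (all names are illustrative).
import sqlite3
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def query_pricing_table(sql: str) -> list:
    """Run a read-only SQL query against the pricing table and return the rows."""
    with sqlite3.connect("tables.db") as conn:
        return conn.execute(sql).fetchall()

llm_with_tools = ChatOpenAI(model="gpt-4o").bind_tools([query_pricing_table])
response = llm_with_tools.invoke("What is the list price of product X?")
print(response.tool_calls)  # the model requests a SQL query instead of relying on embedded table text
```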

  • @swiftmindai
    4 months ago

    Appreciate the update. Please share the GitHub link for the above codebase if you have time. Thanks.

  • @codingcrashcourses8533
    4 months ago

    I did not create a repo for that since it's just very basic code. The important stuff is the idea behind it :)

  • @swiftmindai
    4 months ago

    Yeah, thanks for the wake-up call. Sometimes people tend to get lazy. I've managed to get it working perfectly.

  • @andreweducates
    3 months ago

    I'm attempting to do Q&A retrieval with a legal document and the RecursiveCharacterTextSplitter hasn't been the best for me. Do you think chunking semantically as you've shown here would work well in my use case? Appreciate it! 🙏🏾

  • @codingcrashcourses8533
    3 months ago

    I don't know your documents. Are they in PDF format? HTML? It heavily depends on what they look like.

  • @andreweducates
    3 months ago

    @codingcrashcourses8533 I'm extracting all the text out of the Microsoft Word document, so it's just a huge string of text. Then I'm calling the createDocuments function on it and then the recursive splitting.
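
For reference, a minimal Python sketch of that pipeline; createDocuments suggests LangChain.js, so the Python equivalent create_documents is shown here, and the python-docx extraction step is an assumption.

```python
# Sketch: extract text from a Word file, then split it recursively (python-docx assumed).
from docx import Document as DocxDocument
from langchain_text_splitters import RecursiveCharacterTextSplitter

docx = DocxDocument("contract.docx")
full_text = "\n".join(paragraph.text for paragraph in docx.paragraphs)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = splitter.create_documents([full_text])  # Python counterpart of createDocuments
```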

  • @andreweducates
    3 months ago

    @codingcrashcourses8533 I'm basically getting the document as one huge chunk of text. Do you have an email/LinkedIn/Twitter where I can reach out, brother? Thanks!

  • @awakenwithoutcoffee
    1 month ago

    Did you end up finding a solution? I think most devs are facing similar issues right now.

  • @codingcrashcourses8533
    1 month ago

    @awakenwithoutcoffee We use chunking with GPT-4... it works by far the best.

  • @yawboateng9904
    21 days ago

    Is there a GitHub repo where we can see this code and walk through it?

  • @codingcrashcourses8533
    21 days ago

    All of the projects I made videos about are available on GitHub. Everything ;)