Data Extraction with Large Language Models

Science & Technology

➡️ JSON Extraction Scripts and/or ADVANCED-inference Repo Access: trelis.com/enterprise-server-...
➡️ ADVANCED-fine-tuning Repo: trelis.com/advanced-fine-tuni...
➡️ Trelis Function-calling Models: trelis.com/function-calling/
➡️ One-click Fine-tuning & Inference Templates: github.com/TrelisResearch/one...
➡️ Trelis Newsletter: Trelis.Substack.com
➡️ Tip Jar and Discord: ko-fi.com/trelisresearch
Affiliate Links (support the channel):
- Vast AI - cloud.vast.ai/?ref_id=98762
- RunPod - tinyurl.com/4b6ecbbn
Resources:
- Slides: tinyurl.com/3m9ckm4s
- One-click-llms: github.com/TrelisResearch/one...
- Chat interfaces: chat.trelis.com or chatbotui.com
Hat tip to Sagar Desai for his insights and help on this vid. Check out his blog on LLMs here: sdcodehub.github.io/
Chapters
0:00 Introduction to Data Extraction with Language Models
0:28 Overview of the Video
3:26 Challenges in Data Extraction
5:13 JSON Extraction and YAML Extraction
13:27 Practical Demonstration of Data Extraction Using OpenChat
31:44 Comparing GPT-4 and GPT-3.5 for data extraction
34:37 Comparing Performance of Different Models
40:34 Extracting Data from Long Contexts
51:53 Exploring the Cost of Different Data Extraction Approaches
55:43 Conclusion and Final Thoughts

Comments: 38

@heski6847
4 months ago
Wow, man, this is a real quality content! Thank you!
@Xaelum
4 months ago
This is the best video on the topic out there. Great work!
@hope42
4 months ago
My prompting style ... I have great success rambling with no care for grammar, spelling, and punctuation, just run on sentences. I always inject humor and in turn it responds with humor. With humor it does not go lazy on me. Also the deeper you go with inception type comments it doesn't get lazy either. Just jacks up AI and it answers with excitement. I also ask to answer in analogies if it is complicated. A negative prompt is truly important. I always say concise, matrix and tables, and answer in YAML. Further I say please do not put comments in code snippets for me. I definitely state if I need any extra information I'll ask. I also say save the flops in azure and the bandwidth by eliminating chattiness and be seriously concise. I have noticed it doesn't always pay attention to your custom instructions either so sometimes you have to remind it. I have been at this since November 2020. 😮
@patrickmauboussin
4 months ago
Do you have one refined custom instructions prompt you always use or do you rotate?
@Danne980
4 months ago
Great information Trelis!
@sagardesai1253
4 months ago
Useful for consistency in production systems, and the code is reusable. Thanks for the detailed video!
@Ali-me4tv
4 months ago
This is really helpful content! Thank you
@user-ef4df8xp8p
4 months ago
Awesome. Pure gold....
@easyaistudio
4 months ago
another banger video
@sherpya
4 months ago
I found that a simple description and a complete example were enough to generate YAML with ChatGPT 4. I also specified the formatting I like and the max line length.
@seandiamond7983
4 months ago
Nice man. Ty
@etticat
4 months ago
Great video, enjoyed the insight into open source models. Did you check the performance of GPT-3.5 and GPT-4 when using JSON mode or function calling?
@TrelisResearch
4 months ago
Everything was JSON/YAML mode. I had thought function calling might help, but these models don't need that for data extraction.
@carthagely122
4 months ago
Thanks very much
@unshadowlabs
4 months ago
In some of your past videos, you mentioned using LLMs to do other data tasks such as data summarization, data cleaning, and creating data Q&A sets from text. Do you have any updated recommendations on which open-source models work better for some of these tasks?
@TrelisResearch
4 months ago
Here are my recs:
- 7B model: OpenChat 3.5
- Coding model: DeepSeek Coder 33B
- General, large: Yi 34B or DeepSeek 67B
OpenChat is quite remarkable. It performs incredibly well on function calling and extraction. Perhaps the only drawback is a short context length (although the latest model versions support larger contexts, albeit with a bit more hallucination).
@robboerman9378
4 months ago
Thanks, very interesting to see the different ways to approach it, especially when comparing the cost. I need to look into RunPod. I have not looked at it yet since it seemed very expensive, so it would be interesting to see whether you pay just for uptime or for actual GPU usage. The other thing I am struggling with is going to production with something like this. Sure, you can tweak the parameters and models until you get the output you have already identified manually to be true, but that kind of defeats the purpose.
@TrelisResearch
4 months ago
One imperfect approach is to run twice with different chunking, then combine the results and use an LLM to post-process them. No guarantee of 100% perfection, same as with humans.
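A minimal sketch of the "run twice with different chunking" idea from the reply above, in Python. Everything here is an assumption made for illustration: `chunk_text`, `extract_with_two_passes` and `toy_extract` are hypothetical names, chunks are measured in characters, and `toy_extract` is a plain string match standing in for a real LLM call. The final LLM post-processing step mentioned in the reply is not shown.

```python
from typing import Callable


def chunk_text(text: str, size: int, offset: int = 0) -> list[str]:
    """Split text into pieces of `size` characters.

    A non-zero `offset` shortens the first piece, so every later
    boundary falls in a different place than in the unshifted pass.
    """
    starts = [0] if offset else []
    starts += list(range(offset, len(text), size))
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends) if e > s]


def extract_with_two_passes(
    text: str, extract: Callable[[str], list[str]], size: int
) -> list[str]:
    """Run `extract` over two differently aligned chunkings and merge the results."""
    chunks = chunk_text(text, size) + chunk_text(text, size, offset=size // 2)
    merged: list[str] = []
    for chunk in chunks:
        for item in extract(chunk):
            if item not in merged:  # keep first occurrence, drop exact repeats
                merged.append(item)
    return merged


def toy_extract(chunk: str) -> list[str]:
    """Hypothetical stand-in for an LLM extraction call: plain substring match."""
    known = ("Acme Corp", "Globex")
    return [name for name in known if name in chunk]


# "Acme Corp" straddles the boundary at character 20 in the first pass.
text = "." * 15 + "Acme Corp" + "." * 8 + "Globex" + "." * 5
single = [n for c in chunk_text(text, 20) for n in toy_extract(c)]
both = extract_with_two_passes(text, toy_extract, 20)
```

In this toy input the single pass misses the name that is cut in half by a chunk boundary, while the shifted second pass picks it up. The cost is roughly double the number of extraction calls.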
@matbeedotcom
4 months ago
This may be outside of your target market for videos, but showcasing these models and strategies in systems like autogen / autogen studio would probably be a hit. I'm a little amused that YAML and JSON produce different outputs, but it could make sense, since JSON has so much other stuff like quotes, commas and a lot of white space that could muddle up the relation between words. YAML for documents like this doesn't have unnecessary punctuation, and dashes "-" lend themselves well to lists in regular documents. Fantastic content, and I appreciate the GitHub repos as well.
@TrelisResearch
4 months ago
Cheers Mat for the detailed comments. I'll mark down autogen in my list of potential vids, I suspect you're right on it being in demand. I don't have a great framework for thinking about JSON vs YAML. My sense is that the models shown have significant experience with both, so the difference between the two is mostly noise, i.e. I don't see much that is systematically different. The most common failure mode is the LLM either i) repeating terms (which will break any syntax) or ii) omitting data.
@Soniboy84
3 months ago
Thank you, lots of useful information there. Do you have a resource where I could learn how to engineer a good prompt?
@TrelisResearch
3 months ago
Hmm, I don't have a specific resource to recommend, but if someone does, kindly comment below with it
@Soniboy84
3 months ago
One more thing. Is it possible that the OpenChat response sometimes included duplicate items with different spellings because you had a chunk length of 8000 set? Each parallel execution only got part of your text and wasn't aware of the previous mention of the company/name.
@TrelisResearch
3 months ago
Yes, thanks, I think that's a big factor
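One possible way to clean up the duplicates discussed above is to merge near-identical spellings after the per-chunk results are combined. The sketch below is an illustration only, not what the video does: `normalize`, `merge_near_duplicates` and the 0.85 threshold are hypothetical choices, and the similarity measure is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case, drop periods and commas, and collapse runs of whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def merge_near_duplicates(names: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first spelling of each name; drop later ones that look the same.

    `threshold` is an assumed cut-off on SequenceMatcher.ratio() and would
    need tuning on real data.
    """
    kept: list[str] = []
    for name in names:
        is_duplicate = any(
            SequenceMatcher(None, normalize(name), normalize(seen)).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(name)
    return kept


# Names as they might come back from separate chunks (made-up example data).
raw = ["Trelis Research", "Trelis  research", "OpenChat", "Open Chat", "DeepSeek"]
unique = merge_near_duplicates(raw)
```

String similarity can wrongly merge two genuinely different entities with similar names, so a higher threshold or a manual review step may be safer for production use.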
@nicolasportu
2 months ago
Outstanding! How would you extract a Table of Contents from a PDF? Challenging task I suppose.... Thanks!
@TrelisResearch
2 months ago
Well first you write a script to go from pdf to txt, then you ask the LLM to reconstruct the table from the txt!
@readmarketings9061
3 months ago
Could the next lesson cover creating a knowledge graph with Neo4j?
@TrelisResearch
3 months ago
Just added to my list. Before I get to graphs, though, I probably need to get back to RAG first and cover more advanced methods there.
@TaiwoMegbope
4 months ago
Poof! My mind just got blown again. Thank you Ronan! I know you said you don't currently do consultancy gigs because of how much time you're putting into these models and other work, but no harm asking. I'm doing research on Autonomous Cognitive Entities in Finance, creating autonomous entities for portfolio creation, portfolio optimization, Investor Policy Statement generation, and sentiment analysis. I'm currently looking for an ML expert (you 😊) who'd be able to contribute at the planning, and maybe analysis, stage, just to give direction on how the agents should be trained or fine-tuned. I will definitely be getting the repo access, whether you agree to come on board or not, but it would be the perfect world for me if you said yes to this. It's not much of an opportunity to make good money (the budget is really thin), but it is an opportunity to be part of research that, if successful, could become a big hit in the investment community. So my ask, again, Ronan, is would you, pretty please, consider being a contributor to this research project? I promise to keep the work as light, and the time contribution as minimal, as possible: just strategic guidance on how best to approach your area of specialty. Thanks Ronan.
@TrelisResearch
4 months ago
Thanks! Yeah, still not doing full-on consulting, but I have a corporate LLM review product now that may be of some help. I'm not entirely sure, but you can take a look here: trelis.com/corporate-product-llm-review/
@TaiwoMegbope
4 months ago
Thank you! I think that will work just fine too. Checked it out already.
@alexxx4434
4 months ago
Batching is not entirely free, because interacting with the KV cache during generation is also a memory operation.
@TrelisResearch
4 months ago
💯
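To put rough numbers on the KV cache point above, here is a back-of-the-envelope sketch in Python. It uses the standard size estimate for a transformer KV cache (keys plus values, per layer, per KV head, per token). The model shape in the example (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed, illustrative configuration and not a model measured in the video. `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16/bf16 storage
) -> int:
    """Estimate KV cache size: keys and values (factor 2) for every layer,
    KV head, token position and sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed, illustrative model shape (not taken from the video).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
batch_of_8 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(f"{per_token / 1024**2:.1f} MiB per token")       # 0.5 MiB per token
print(f"{batch_of_8 / 1024**3:.1f} GiB for the batch")  # 16.0 GiB for the batch
```

The weights are read once per decoding step regardless of batch size, but the cache that has to be read grows linearly with both batch size and sequence length, which is why large batches with long contexts are not free.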
@hubstrangers3450
4 months ago
Thank you, however, never try anything that's expensive or even cost effective scenarios, in modern technological space...example cloud computing...
@TrelisResearch
4 months ago
hmm, not exactly sure what you mean here
@hubstrangers3450
4 months ago
@TrelisResearch Avoid anything that's expensive in the tech space, if you're into greenfield projects...