Mindee docTR - Probably the Best Open-Source OCR

Do you want to build an ML pipeline to automate data extraction from business documents (receipts, invoices, forms)? Then your first step should be to integrate OCR for text extraction. OCR quality must be good, because the whole pipeline depends on the quality of the initial text extraction. If the extracted data is accurate, ML models will be able to run proper classification. I spent time researching available OCR solutions, and I think Mindee docTR is currently one of the best open-source OCR solutions available. Check the video, where I run and show multiple tests.
Mindee docTR GitHub:
github.com/mindee/doctr
SRD Receipts dataset:
expressexpense.com/blog/free-...
Sparrow GitHub:
github.com/katanaml/sparrow/t...
0:00 Introduction
2:41 Mindee docTR
5:27 Test 1
7:43 Test 2
9:12 Test 3
11:58 Test 4
13:19 Test 5
14:21 Summary
CONNECT:
- Subscribe to this KZread channel
- Twitter: / andrejusb
- LinkedIn: / andrej-baranovskij
- Medium: / andrejusb
#OCR #MachineLearning #Python

Comments: 36

  • @NickWindham
    @NickWindham 1 year ago

    Great video. Thanks for bringing attention to this awesome open source OCR engine

  • @shaheerzaman620
    @shaheerzaman620 2 years ago

    very useful

  • @cataori_program7292
    @cataori_program7292 1 year ago

    Hello Andrej, and again, I am truly grateful to you for introducing me to ML and these libraries. If possible, I would appreciate some tips, references, or precautions for installing doctr locally, since despite following the instructions my PC does not seem to recognize all the modules.

  • @nostalgia18rishi

    @nostalgia18rishi

    8 months ago

    It happened to me too. I am having trouble installing doctr.
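When modules are "not recognized" after installation, a first step is to check which import names the active Python environment can actually see. The sketch below is only a diagnostic aid, not part of docTR. It assumes the package was installed as python-doctr (imported as doctr) and that a deep-learning backend such as torch or tensorflow is expected to be present; adjust the list to your setup.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-installed packages.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names: "doctr" (from python-doctr) plus candidate backends.
    print(missing_modules(["doctr", "torch", "tensorflow"]))
```

If doctr shows up as missing here, the package was most likely installed into a different interpreter or virtual environment than the one running the script.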

  • @cataori_program7292
    @cataori_program7292 1 year ago

    Andrej, I am very grateful for your content. I have been trying to install doctr and sparrow, unsuccessfully though, to do my own research. I would be happy to hear any tips or configuration steps you could provide. Thanks again!

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    Hi. The code structure was refactored and moved to another directory; here is the source location: github.com/katanaml/sparrow/tree/main/sparrow-ocr Thanks for your feedback :)

  • @Mayo-ep2ps
    @Mayo-ep2ps 1 year ago

    Thank you for the video. I have been testing out Mindee's docTR. Something I've wondered about is how you get the confidence to show up on the document after processing. I'm running this in PyCharm on Windows and can't seem to get that to show. Much appreciated!

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    Hey. I haven't tested it on Windows, but on macOS the document is shown visually by calling predictions.show(doc), where doc = DocumentFile.from_images(data_path + data_file) and predictions = self.model(doc).
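If the visual window does not appear, the confidence values can also be read as plain data. The following is a minimal sketch, assuming the nested pages / blocks / lines / words layout that docTR's result.export() returns, with a "value" and a "confidence" on each word. The run_doctr helper and the sample dictionary are hypothetical and only illustrate the shape; the 0.8 threshold is arbitrary.

```python
def words_with_confidence(export):
    """Flatten an export-style dict into a list of (text, confidence) pairs."""
    pairs = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    pairs.append((word["value"], word["confidence"]))
    return pairs


def run_doctr(image_path):
    """Hypothetical helper: run docTR on one image and return the exported dict.

    Requires python-doctr; imports are local so the rest of the sketch
    works without it.
    """
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)
    doc = DocumentFile.from_images(image_path)
    return model(doc).export()


# Hand-written sample in the assumed export layout (not real OCR output).
sample = {"pages": [{"blocks": [{"lines": [{"words": [
    {"value": "TOTAL", "confidence": 0.98},
    {"value": "12.50", "confidence": 0.61},
]}]}]}]}

# Keep only the words below an arbitrary confidence threshold.
low = [(text, conf) for text, conf in words_with_confidence(sample) if conf < 0.8]
print(low)  # [('12.50', 0.61)]
```

With docTR installed, words_with_confidence(run_doctr("receipt.jpg")) would give the same kind of list for a real image, where receipt.jpg is a placeholder file name.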

  • @TheChaoticTranquil
    @TheChaoticTranquil 1 year ago

    Hi Andrej, thank you for the great video. What model/library would you recommend if I'm trying to label and extract bounding boxes of text regions? I'm open to training/fine-tuning an existing LLM with labels and bounding boxes of text groups (e.g. paragraphs, headings, etc.) but would like to know what the best model to work with is.

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    Hey Sina, thanks. Text boxes are constructed with OCR, and for data mapping/labeling we built our own open-source tool. It's part of Sparrow - github.com/katanaml/sparrow You can check my latest videos: kzread.info/dash/bejne/dKVt16ehpbW6ac4.html
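For working with the OCR text boxes in a labeling tool, the coordinates usually need to be in pixels. A minimal sketch, assuming the box geometry is given as relative coordinates ((xmin, ymin), (xmax, ymax)) in the range 0 to 1, which is how docTR reports straight boxes; the function name and the page size are illustrative only.

```python
def to_pixel_box(geometry, width, height):
    """Convert a relative ((xmin, ymin), (xmax, ymax)) box to pixel coordinates.

    Returns (left, top, right, bottom) as integers for a page of the
    given width and height in pixels.
    """
    (xmin, ymin), (xmax, ymax) = geometry
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


# Illustrative box on an 800x600 page.
print(to_pixel_box(((0.25, 0.5), (0.75, 1.0)), 800, 600))  # (200, 300, 600, 600)
```

The resulting tuple follows the (left, top, right, bottom) order that image libraries such as Pillow expect for cropping.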

  • @pandorasbox7541
    @pandorasbox7541 1 year ago

    How can I use it with simple HTML and jQuery? Please guide.

  • @big-blade
    @big-blade 2 years ago

    Hi Andrej, that was really helpful. Do you know if we can extract handwritten text from documents?

  • @AndrejBaranovskij

    @AndrejBaranovskij

    2 years ago

    Based on my tests with docTR, yes, it extracts handwritten digits. I wasn't testing with handwritten letters, but I guess it should work too.

  • @big-blade

    @big-blade

    2 years ago

    @AndrejBaranovskij Thank you, I will run that test.

  • @muhammadali.8542

    @muhammadali.8542

    1 year ago

    @big-blade Did it work on handwritten docs?

  • @shlokjivtode3769
    @shlokjivtode3769 5 months ago

    Hello! I am having an issue with docTR: after installing it, it shows a NoModuleFind error and suggests that I add something like gtk.

  • @AndrejBaranovskij

    @AndrejBaranovskij

    5 months ago

    Unfortunately I can't help; for me, installation worked fine by following the install guide...

  • @taissacosta4998
    @taissacosta4998 5 months ago

    Hello, I'm getting some errors on my Linux machine. Can you recommend some tutorials for the installation process?

  • @AndrejBaranovskij

    @AndrejBaranovskij

    5 months ago

    Hey, I doubt I could help. I run it on macOS. Sorry...

  • @3ombieautopilot
    @3ombieautopilot 1 year ago

    Do you think it'll extract Russian text as well?

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    I haven't tested it with Cyrillic characters, but you should give it a try. Maybe it will work.

  • @agerray
    @agerray 1 year ago

    I would be interested in trying this, but (!) I'm afraid that as a "simple Windows 11" user I haven't a clue how to install the software. Is there any very basic, simple guide that people like me might have a chance of understanding?

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    Hi. I'm running it on macOS and Linux; I have no experience running it on Windows.

  • @agerray

    @agerray

    1 year ago

    @AndrejBaranovskij OK, thanks anyway.

  • @sudhitpanchal4996
    @sudhitpanchal4996 1 year ago

    Can we get relation extraction here, like PaddleOCR gives? For example, PaddleOCR can return an item name and its price together. Can that be implemented in Mindee docTR?

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    Hi. PaddleOCR doesn't provide key/value pair relations. There is an option to use LayoutXLM with PaddleOCR, but that is not flexible, and moreover LayoutXLM is not allowed for commercial use. I would recommend using the Donut model; it returns key/value pairs as JSON and no OCR is needed for inference. More about this solution in Sparrow - kzread.info/dash/bejne/YnaC27FwhrCuZLw.html

  • @sudhitpanchal4996

    @sudhitpanchal4996

    1 year ago

    @AndrejBaranovskij Yes, I thought so, but Donut is heavy, and this is for internal purposes, not commercial.

  • @AndrejBaranovskij

    @AndrejBaranovskij

    1 year ago

    @sudhitpanchal4996 Why do you think Donut is "heavy"? Donut produces much better output than LayoutXLM. In LayoutXLM you only get questions/answers, while Donut can be fine-tuned to return real key/value pairs.

  • @erikatorres1069

    @erikatorres1069

    1 month ago

    @AndrejBaranovskij PaddleOCR offers Vi-LayoutXLM, which returns key/value pairs.

  • @zesciarizo362
    @zesciarizo362 2 years ago

    The recognition part is not doing great on cropped images (when you give it slices of text zones), and confidence scores are not representative of the output text, unlike Tesseract.

  • @AndrejBaranovskij

    @AndrejBaranovskij

    2 years ago

    Could be, but based on my findings, Mindee docTR overall beats Tesseract on text extraction quality.

  • @zesciarizo362

    @zesciarizo362

    2 years ago

    @AndrejBaranovskij I tested it by giving it "word size" boxes (using another text detector); the results of the recognition part are not coherent compared to Tesseract (sometimes it doesn't predict anything). Same thing for the detection part: when given cropped layouts, e.g. tables, the performance is not good compared to other competitors. It seems it was fine-tuned only on A4-type documents.

  • @AndrejBaranovskij

    @AndrejBaranovskij

    2 years ago

    @zesciarizo362 I was testing it with documents of various sizes, e.g. receipts, and was getting good results, far better than Tesseract. Tesseract fails on documents with shadows, even with preprocessing. If Mindee doesn't work for you, give PaddleOCR a try; it was working quite decently for me too.

  • @zesciarizo362

    @zesciarizo362

    2 years ago

    @AndrejBaranovskij Paddle is indeed far better than docTR for text detection (and faster). I'm using it with Tesseract for recognition (it performs OK when given word-level boxes).

  • @AndrejBaranovskij

    @AndrejBaranovskij

    2 years ago

    @zesciarizo362 My tests with raw receipt images show the opposite: docTR produces more accurate results and outperforms PaddleOCR. I'm now focused on Django; later, when I return to the OCR part, I will review it again in more detail.