How can I extract the text of a document?

May 20, 2021 09:24
Updated

Athento has several automation tasks that allow you to extract text from a document. The two most commonly used are Extract Text and Extract OCR.

Screenshot_2021-02-26_at_09.39.06.png

When to use Extract text?

Activate this operation for PDF files that have been born digital. That is, documents that have not been scanned, but we have received the original digital version.

A very easy way to know if a document is digital is to hover the mouse over it. If we can select the text, it is a digital document.

Screenshot_2021-02-26_at_09.43.49.png

For digital documents, it is not necessary to apply OCR, because the text is contained within the file itself.

You can view the text extracted from a document from the document features.

This automation task stores the full text of the document in the feature feature.text that can be used for field extraction.

The advantage of working with the native text of the document is that the quality is very good.

When to use Extract OCR?

We use OCR when working with image files (TIF, JPG, PNG, etc.) or PDFs resulting from scanning and to which the scanner has not previously applied an OCR process.

OCR is a process that transforms an image into text, and the result of this transformation is not always accurate. It depends on many factors, including the quality of the documents.

In this article you can see how to extract OCR from a document.

When to use Extract text?

When to use Extract OCR?

Related articles