Extraction quality

The quality of extraction in OCR processes depends on the quality of the documents, their structure and other factors.

If you have control over the digitization process, we recommend following good digitization practices.

If you have no control over the nature of the documents, we recommend at least testing with real documents you have actually received, in order to set more realistic expectations about the results.

In any case, Athento recommends always planning for a manual or semi-automatic validation step as part of OCR processes.
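As an illustration of semi-automatic validation, here is a minimal sketch in Python. It assumes that each extracted field comes with a confidence score between 0 and 1, which is an assumption for this example and not a description of Athento's output. The function name `split_for_validation` and the 0.9 threshold are hypothetical.

```python
def split_for_validation(fields, threshold=0.9):
    """Separate fields that can be accepted from those needing manual review.

    `fields` maps a field name to a (value, confidence) pair. The confidence
    score and the threshold are assumptions made for this sketch.
    """
    accepted, to_review = {}, {}
    for name, (value, confidence) in fields.items():
        # Fields at or above the threshold are accepted automatically;
        # everything else is queued for a person to check.
        target = accepted if confidence >= threshold else to_review
        target[name] = value
    return accepted, to_review


# Hypothetical extraction result for one document.
fields = {
    "invoice_number": ("F-2023-001", 0.98),
    "total": ("1250.00", 0.72),
}
accepted, to_review = split_for_validation(fields)
```

With these sample values, only the low-confidence field would be sent to a person for review. The right threshold depends on how costly an undetected error is in your process.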

Please note that Athento does not guarantee extraction rates. These must be calculated for the specific configuration that has been set up, and they will depend on the nature of the documents.
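One simple way to calculate an extraction rate on a test sample is to compare the extracted values against values checked by hand. The following Python sketch shows the idea under stated assumptions: the field names and values are invented, and exact string matching is only one possible definition of a correct extraction.

```python
def extraction_rate(expected, extracted):
    """Return the fraction of expected fields that were extracted correctly.

    `expected` holds the hand-checked values and `extracted` holds the OCR
    output, both as dictionaries keyed by field name. A field counts as
    correct only if the values match exactly after trimming whitespace.
    """
    if not expected:
        raise ValueError("expected must contain at least one field")
    correct = sum(
        1
        for field, value in expected.items()
        if extracted.get(field, "").strip() == value.strip()
    )
    return correct / len(expected)


# Hypothetical sample document: the date was misread by the OCR.
expected = {
    "invoice_number": "F-2023-001",
    "total": "1250.00",
    "date": "2023-04-12",
    "vendor": "ACME",
}
extracted = {
    "invoice_number": "F-2023-001",
    "total": "1250.00",
    "date": "2023-04-17",
    "vendor": "ACME",
}
rate = extraction_rate(expected, extracted)  # 3 of 4 fields match
```

Averaging this figure over a representative set of real documents gives a more realistic expectation than a single document would.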

Document processing times

OCR processing times depend on the size of the documents and the number of pages. If there are time requirements, for example if real-time results are expected, it may be necessary to extend the architecture.

On average, the processing time is approximately 11 seconds per page. If a shorter time is desired for a large volume of documents, it is advisable to extend the architecture.
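The average figure can be used for a rough capacity estimate. The Python sketch below assumes that pages are spread evenly across parallel workers and that throughput scales linearly with the number of workers. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math

SECONDS_PER_PAGE = 11  # average quoted above; measure your own documents


def estimated_seconds(pages, workers=1):
    """Rough wall-clock time, assuming pages are split evenly across workers."""
    if pages < 0 or workers < 1:
        raise ValueError("pages must be >= 0 and workers must be >= 1")
    return math.ceil(pages / workers) * SECONDS_PER_PAGE


def workers_needed(pages, deadline_seconds):
    """Smallest number of parallel workers that meets the deadline."""
    pages_per_worker = deadline_seconds // SECONDS_PER_PAGE
    if pages_per_worker < 1:
        raise ValueError("deadline is shorter than the time for one page")
    return math.ceil(pages / pages_per_worker)


# Example: 10,000 pages that must be processed within 8 hours.
single = estimated_seconds(10_000)            # 110,000 s, about 30.6 hours
needed = workers_needed(10_000, 8 * 3600)     # 4 workers under these assumptions
```

Real throughput also depends on page size, image quality and shared resources, so an estimate like this is only a starting point for sizing.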

Can results be improved?

As more documents are incorporated and more edge cases and scenarios are detected, strategies can be defined to improve the results: for example, loading databases with known values, comparing against other documents, or prioritizing the most common values.
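The known-values strategy can be sketched as follows, using approximate string matching from Python's standard `difflib` module. This is an illustrative example, not Athento's implementation; the vendor list, the function name and the 0.8 cutoff are hypothetical.

```python
import difflib


def correct_with_known_values(ocr_value, known_values, cutoff=0.8):
    """Return the closest known value, or None if nothing is similar enough.

    `cutoff` is the minimum similarity ratio (0 to 1) required for a match.
    """
    matches = difflib.get_close_matches(ocr_value, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical list of vendors loaded from a database of known values.
known_vendors = ["ACME Industries", "Globex Corporation", "Initech"]

# The OCR misread the capital "I" as a lowercase "l".
fixed = correct_with_known_values("ACME lndustries", known_vendors)
unknown = correct_with_known_values("Zzzz", known_vendors)
```

Here `fixed` resolves to the known vendor name, while `unknown` is `None` and could be sent to manual validation. A stricter cutoff reduces false corrections at the cost of leaving more values unmatched.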

Is it possible to train OCR?

Normally the OCR engine itself is not trained; this is possible, but it is a costly process. The layout analysis and some other document analysis processes, however, can be trained. Typically 1,000 to 2,000 documents are needed to perform such an analysis.

How much can I expect my processes to improve with the introduction of OCR?

It depends a lot on the business process. In some cases the improvement is 10% of the effort, and in others it is 90%. There are also cases where 100% automation is achieved by accepting an error threshold.