How to extract fields automatically? – Athento

Athento can extract data automatically. The configuration of this extraction must be done from the field administration.

Most of Athento's extraction mechanisms use text analysis strategies to get the information. Below we will see some of the options available from the user interface.

Before starting this tutorial, we recommend that you load several samples of the same type of document into the system and extract the OCR from them.

Indicate where the value should be taken from

To extract any field, use the Extract from drop-down to indicate where the data should be taken from. You can choose a specific page, or the Full OCR option to search the entire document. You can also extract the value from the file name with the Filename option.


Extracting a value using other words to delimit the data

You can use this mechanism when the value you want to extract is always between two known words or expressions.

For example, suppose that the OCR extracted from your document shows the following information.

If you would like to extract the number after the expression CUIT:, you must delimit the location of the data by indicating two expressions: one preceding the data and one following it.

For example:

Extract starting from word (Extraer a partir) -> CUIT:
Extract finishing in (Extraer hasta) -> Apellido y Nombre

You can indicate several start or end expressions by separating them with the | (pipe) character.
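The logic behind delimiter-based extraction can be sketched in a few lines of Python. This is only an illustration of the idea, not Athento's actual implementation: the function name extract_between, the sample OCR text, and the alternative expressions "C.U.I.T.:" and "Razón Social" are all made up for the example.

```python
import re

def extract_between(text, start_exprs, end_exprs):
    """Return the text between a start expression and the next end expression.

    Both arguments accept several alternatives separated by "|",
    mirroring the pipe syntax described above (hypothetical helper).
    """
    # Escape each alternative so characters such as "." are matched literally.
    starts = [re.escape(s.strip()) for s in start_exprs.split("|")]
    ends = [re.escape(e.strip()) for e in end_exprs.split("|")]
    pattern = "(?:" + "|".join(starts) + ")(.*?)(?:" + "|".join(ends) + ")"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Invented OCR text, for illustration only.
ocr_text = "CUIT: 20-12345678-9 Apellido y Nombre: PEREZ JUAN"
value = extract_between(ocr_text, "CUIT:|C.U.I.T.:", "Apellido y Nombre|Razón Social")
print(value)
```

If neither a start nor an end expression is found, the sketch returns None, which is a reasonable way to think about a field that stays empty after extraction.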

Extract using a regular expression

To use this extraction mechanism, you must tell Athento a text pattern to find in the OCR of the document. This method works very well with data that has a defined pattern, such as an ID, a VAT number, a date, etc.

In the Regular Expression to Extract field in the field administration, indicate the pattern you want to search for. For example, if you are looking for a 7-digit number, you can enter an expression like the one below:

Regular Expression to Extract -> [0-9]{7}
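Before saving a pattern, it can help to try it outside Athento against a sample of OCR text. The sketch below uses Python's re module with an invented sample string; it assumes Athento's regular expression syntax behaves like Python's for this pattern, which you should confirm with a real document.

```python
import re

# Invented OCR text, for illustration only.
ocr_text = "Invoice 0012345 issued for account 9876543210"

pattern = r"[0-9]{7}"

# The first run of seven consecutive digits.
first = re.search(pattern, ocr_text)
print(first.group(0))

# Without boundaries, the pattern also matches inside longer numbers.
all_matches = re.findall(pattern, ocr_text)
print(all_matches)

# Word boundaries restrict the match to numbers of exactly seven digits.
# Whether \b is supported in Athento's field is an assumption here.
strict_matches = re.findall(r"\b[0-9]{7}\b", ocr_text)
print(strict_matches)
```

The second result shows why a bare [0-9]{7} can pick up the first seven digits of a longer number, so it is worth testing with documents that contain other numeric data.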

To extract a date, you can use the expression shown in the screenshot below.
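As a rough illustration of what a date pattern can look like, the Python sketch below matches dates written as DD/MM/YYYY. The pattern, the sample text, and the date format are assumptions chosen for the example and are not necessarily the expression used in the Athento documentation.

```python
import re
from datetime import datetime

# Invented OCR text, for illustration only.
ocr_text = "Fecha de emisión: 21/06/2022 Total: 1500"

# Hypothetical pattern for DD/MM/YYYY dates.
date_pattern = r"\b[0-9]{2}/[0-9]{2}/[0-9]{4}\b"

match = re.search(date_pattern, ocr_text)
raw_date = match.group(0) if match else None
print(raw_date)

# A pattern only checks the shape of the text, so parse the value
# to confirm it is a real calendar date (raises ValueError if not).
parsed = datetime.strptime(raw_date, "%d/%m/%Y")
print(parsed.date())
```

A pattern like this accepts any two digits as a day or month, so a value such as 99/99/2022 would match; parsing the result afterwards is one way to catch that.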

Test the extraction of a field

Once you have set up the extraction of a field, test it by opening a sample document classified with the same Document Form that contains the field.

Unlock the buttons below the fields and use the peephole option to extract the data.
