Interpreting Your OCR and ETL Output
Introduction
The ETL (Extract, Transform, Load) output is an organized summary of your raw data. It provides a high-level overview of the document contents, with links to detailed OCR (Optical Character Recognition) outputs for further insights. You can retrieve the ETL output directly from your JSON results.
These detailed OCR outputs record each character in your document along with its exact location on the page.
ETL Output
In your ETL output, you may encounter the following key elements:
- doc_start_offset / doc_end_offset: Indicate the index, within the full document text, of the first and last character on the page.
- image: A link to an Indico storage URL that contains a PNG image of the page.
- page_info: Links to a file containing OCR data for the page. Its contents are described in the OCR Output section of this document.
- thumbnail: Links to a JPG thumbnail of the page.
- ocr_statistics: Provides OCR confidence statistics, which may help in determining whether documents or fields are processed automatically.
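As a rough illustration of how ocr_statistics might feed an automation decision, the sketch below checks every page's mean confidence against a cutoff. The threshold value and the function name are hypothetical, and the field names are assumed to match the ocr_statistics entries shown in the page_info example in this document.

```python
# Hypothetical cutoff; pick a value that suits your own documents.
AUTO_PROCESS_THRESHOLD = 95.0


def can_auto_process(page_stats, threshold=AUTO_PROCESS_THRESHOLD):
    """Return True only if there is at least one page and every page's
    mean OCR confidence meets the threshold."""
    return bool(page_stats) and all(
        stats["mean_confidence"] >= threshold for stats in page_stats
    )


# Illustrative values only; the second page is made up to show a failure.
page_stats = [
    {"mean_confidence": 99.94, "median_confidence": 100.0, "mode_confidence": 100},
    {"mean_confidence": 91.2, "median_confidence": 96.0, "mode_confidence": 100},
]

print(can_auto_process(page_stats))      # False: page 2 is below the cutoff
print(can_auto_process(page_stats[:1]))  # True
```

Using the mean is only one option; a stricter policy could compare the median or the lowest per-character confidence instead.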
ETL Output Structure in Version 3
Version 3 introduces a new structure for ETL output. OCR data is now segmented into subfiles, allowing for more efficient data retrieval. This structure provides direct access to specific details, such as full text or character-level information, based on your needs. For more information, refer to the OCR Output section.
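Judging by the example outputs in this document, one practical consequence of the new structure is that page offsets move from the flat doc_start_offset / doc_end_offset keys to a nested doc_offset object. The helper below is a minimal sketch, with a hypothetical function name, that reads offsets from either layout so downstream code does not need to care which version produced the file.

```python
def page_offsets(page):
    """Return (start, end) for a page entry from either ETL layout."""
    if "doc_offset" in page:
        # Nested layout, as in the Version 3 example.
        offset = page["doc_offset"]
        return offset["start"], offset["end"]
    # Flat layout, as in the Version 1 example.
    return page["doc_start_offset"], page["doc_end_offset"]


v1_page = {"doc_start_offset": 0, "doc_end_offset": 1812, "page_num": 0}
v3_page = {"doc_offset": {"start": 0, "end": 725}, "page_num": 0}

print(page_offsets(v1_page))  # (0, 1812)
print(page_offsets(v3_page))  # (0, 725)
```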
ETL Output Examples
Below are examples of ETL output for different versions, illustrating how document data is organized.
- Version 1: Shows basic page information, including document offsets and image links.
- Version 3: Keeps the per-page layout of Version 1 and adds page size, DPI, and transformation details, plus links to separate text, characters, tokens, and blocks files.
Version 1:
{
"num_pages": 2,
"pages": [
{
"doc_end_offset": 1812,
"doc_start_offset": 0,
"image": "indico-file:///storage/submission/13##/9/original_00001_page_1.png",
"page_info": "indico-file:///storage/submission/13##/9/page_info_1.json",
"page_num": 0,
"thumbnail": "indico-file:///storage/submission/13##/9/original_00001_thumbnail_1.jpg"
},
{
"doc_end_offset": 3256,
"doc_start_offset": 1813,
"image": "indico-file:///storage/submission/13##/9/original_00002_page_2.png",
"page_info": "indico-file:///storage/submission/13##/9/page_info_1.json",
"page_num": 1,
"thumbnail": "indico-file:///storage/submission/13##/9/original_00002_thumbnail_2.jpg"
}
]
}
Version 3:
{
"pages": [
{
"size": {
"width": 2550,
"height": 3300
},
"image": "indico-file:///storage/submission/28020/79064/184226/original_page_0.png",
"thumbnail": "indico-file:///storage/submission/28020/79064/184226/original_thumbnail_0.png",
"dpi": {
"dpix": 300,
"dpiy": 300
},
"transformation": {
"skew": 0.0,
"rotation": 0,
"crop": {
"top": 0,
"left": 0,
"right": 2550,
"bottom": 3300
},
"method": "OCRFormatter"
},
"filename": "Charity Commission.pdf",
"doc_offset": {
"start": 0,
"end": 725
},
"page_num": 0,
"text": "indico-file:///storage/submission/28020/79064/184226/page_0_text.txt",
"ocr_format": "default",
"characters": "indico-file:///storage/submission/28020/79064/184226/page_0_chars.json",
"tokens": "indico-file:///storage/submission/28020/79064/184226/page_0_tokens.json",
"blocks": "indico-file:///storage/submission/28020/79064/184226/page_0_blocks.json",
"page_info": "indico-file:///storage/submission/28020/79064/184226/page_info_0.json"
},
...
],
"num_pages": 3,
"full_text": "indico-file:///storage/submission/28020/79064/184226/full_text.txt",
"email_metadata": {}
}
OCR Output
- dpi describes dots per linear inch.
- size describes the height and width in pixels.
- block_type describes whether the block contains images or text.
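Because dpi is expressed per linear inch and size is expressed in pixels, dividing one by the other gives the physical page size. The sketch below assumes the size and dpi shapes shown in the Version 3 example in this document; the function name is hypothetical.

```python
def page_size_inches(page):
    """Convert a page's pixel dimensions to inches using its DPI."""
    width_in = page["size"]["width"] / page["dpi"]["dpix"]
    height_in = page["size"]["height"] / page["dpi"]["dpiy"]
    return width_in, height_in


page = {
    "size": {"width": 2550, "height": 3300},
    "dpi": {"dpix": 300, "dpiy": 300},
}

print(page_size_inches(page))  # (8.5, 11.0), i.e. US Letter
```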
page_info and Other File Examples
The examples provided here are greatly simplified. A real page_info file records the position of every character on the page, so these files can become very large.
- Version 1: The output is a single file with information about each page’s text, block positions, and OCR statistics.
- Version 3: The newest version introduces multiple links for text, tokens, and blocks, as well as confidence scores.
OCR Output Version 3
In Version 1, the OCR data was bundled into a single page_info file containing all details for each page. With Version 3, the OCR output has been reorganized into separate subfiles, making it easier to retrieve only the necessary details. For example:
- Use the text link for simple text extraction.
- Access the characters link for precise character positions.
- The full_text link provides the complete document text in one file.
This segmentation makes data retrieval more efficient, allowing you to avoid downloading unnecessary information.
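The choice of subfile can be sketched as a simple lookup from what you need to the key that holds the matching link. The task names and the link_for function are hypothetical; the keys (text, characters, tokens, blocks, full_text) are taken from the Version 3 ETL example in this document. Downloading the file behind the link is left out here.

```python
# Hypothetical task names mapped to per-page keys from the Version 3 example.
PAGE_LINK_FOR_TASK = {
    "plain_text": "text",
    "char_positions": "characters",
    "token_positions": "tokens",
    "layout": "blocks",
}


def link_for(etl_output, task, page_num=0):
    """Return the storage link for the smallest file that serves the task."""
    if task == "document_text":
        # Document-level link, not tied to a single page.
        return etl_output["full_text"]
    page = next(p for p in etl_output["pages"] if p["page_num"] == page_num)
    return page[PAGE_LINK_FOR_TASK[task]]


base = "indico-file:///storage/submission/28020/79064/184226"
etl_output = {
    "pages": [
        {
            "page_num": 0,
            "text": f"{base}/page_0_text.txt",
            "characters": f"{base}/page_0_chars.json",
            "tokens": f"{base}/page_0_tokens.json",
            "blocks": f"{base}/page_0_blocks.json",
        }
    ],
    "full_text": f"{base}/full_text.txt",
}

print(link_for(etl_output, "plain_text"))
print(link_for(etl_output, "document_text"))
```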
Standard Output
Below are examples of the OCR output divided into smaller sections, demonstrating how the document data is structured.
- page_info: This consolidated file contains all data from tokens, blocks, and characters. It is similar to what your complete ETL output will look like upon retrieval.
- For convenience and to provide flexibility in viewing specific data types, tokens, blocks, and characters have been separated into their own files. This separation allows you to access only the relevant information, especially since the full ETL output can be quite large.
{
"pages": [
{
"doc_offset": {
"start": 0,
"end": 1024
},
"text": "<original_text_would_be_here>",
"dpi": {
"dpix": 300,
"dpiy": 300
},
"page_num": 0,
"size": {
"height": 3300,
"width": 2550
},
"ocr_statistics": {
"mean_confidence": 99.94318181818181,
"median_confidence": 100.0,
"mode_confidence": 100
}
}
],
"blocks": [
{
"block_type": "text",
"doc_offset": {
"end": 7,
"start": 0
},
"page_num": 0,
"page_offset": {
"end": 7,
"start": 0
},
"position": {
"bottom": 157,
"left": 168,
"right": 320,
"top": 128
},
"text": "INVOICE"
},
...
],
"chars": [
{
"doc_index": 0,
"confidence": 100,
"text": "I",
"position": {
"top": 406,
"bottom": 440,
"left": 272,
"right": 280,
"bbBot": 442,
"bbTop": 405,
"bbLeft": 272,
"bbRight": 285
},
"page_num": 0,
"block_index": 0,
"page_index": 0
},
...
],
"tokens": [
{
"block_offset": {
"end": 7,
"start": 0
},
"doc_offset": {
"end": 7,
"start": 0
},
"page_num": 0,
"page_offset": {
"end": 7,
"start": 0
},
"position": {
"bbBot": 157,
"bbLeft": 168,
"bbRight": 320,
"bbTop": 128,
"bottom": 157,
"left": 168,
"right": 320,
"top": 128
},
"style": {
"background_color": "ffffff",
"bold": true,
"font_face": "DejaVu Sans",
"font_size": 187,
"italic": false,
"text_color": "000000",
"underlined": false
},
"text": "INVOICE"
},
...
]
}
OCR Output for Tables
In addition to standard text and image data, the OCR output in Version 3 can include table structures with additional headers:
- cells: Provides the position of each table cell.
- columns: Describes the column information for the table.
[
{
"page_offset": {
"start": 0,
"end": 3
},
"doc_offset": {
"start": 0,
"end": 3
},
"block_offset": {
"start": 0,
"end": 3
},
"page_num": 0,
"text": "To:",
"position": {
"top": 130,
"bottom": 167,
"left": 242,
"right": 314,
"bbTop": 130,
"bbBot": 167,
"bbLeft": 242,
"bbRight": 314
},
"style": {
"bold": null,
"italic": null,
"underlined": null,
"font_face": null,
"background_color": null,
"handwriting": false,
"font_size": null,
"text_color": null
}
},
...
]
[
{
"doc_index": 0,
"page_index": 0,
"block_index": 0,
"page_num": 0,
"text": "T",
"position": {
"top": 130,
"bottom": 167,
"left": 242,
"right": 279,
"bbTop": 130,
"bbBot": 167,
"bbLeft": 242,
"bbRight": 279
},
"confidence": 100
},
...
]
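Token entries and character entries like those above can be joined through their offsets. The sketch below collects the characters that belong to a token and reports the lowest confidence among them. It assumes offsets are half-open (start inclusive, end exclusive), which is consistent with the three-character token "To:" spanning 0 to 3 but is not stated explicitly in this document. The function names and the confidence values other than the first are made up for illustration.

```python
def chars_for_token(token, chars):
    """Return characters whose doc_index falls inside the token's doc_offset.

    Assumption: offsets are half-open, so `end` itself is excluded.
    """
    start = token["doc_offset"]["start"]
    end = token["doc_offset"]["end"]
    return [c for c in chars if start <= c["doc_index"] < end]


def min_confidence(token, chars):
    """Lowest character confidence within the token, or None if no match."""
    matched = chars_for_token(token, chars)
    return min(c["confidence"] for c in matched) if matched else None


token = {"text": "To:", "doc_offset": {"start": 0, "end": 3}}
chars = [
    {"doc_index": 0, "text": "T", "confidence": 100},
    {"doc_index": 1, "text": "o", "confidence": 97},   # illustrative value
    {"doc_index": 2, "text": ":", "confidence": 88},   # illustrative value
    {"doc_index": 3, "text": "\n", "confidence": 100}, # outside the token
]

print("".join(c["text"] for c in chars_for_token(token, chars)))  # To:
print(min_confidence(token, chars))  # 88
```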
