DocumentExtraction

class indico.queries.documents.DocumentExtraction(files, json_config=None, upload_batch_size=None, ocr_engine='OMNIPAGE')

Extract raw text from PDF or TIF files.

DocumentExtraction performs Optical Character Recognition (OCR) on PDF or TIF files to
extract raw text for model training and prediction.

  • Parameters:
    • files (List[str]) – Pathnames of one or more files to OCR
    • json_config (dict or JSON str) – Configuration settings for OCR. See Notes below.
    • upload_batch_size (int) – Size of batches for document upload if uploading many documents
    • ocr_engine (str) – The OCR engine to use. Defaults to OMNIPAGE.
  • Returns:
    Job object

Notes

DocumentExtraction is highly configurable. Five preset configurations are provided:

simple - Provides a simple and fast response for native PDFs (3-5x faster). Will NOT work with scanned PDFs.

legacy - Provided to mimic the behavior of Indico’s older pdf_extraction function. Use this if your model was trained with data from the older pdf_extraction.

detailed - Provides detailed bounding box information on tokens and characters. Returns data in a nested format at the document level with all metadata included.

ondocument - Provides detailed information at the page-level in an unnested format.

standard - Provides page text and block text/position in a nested format.
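As a minimal sketch of how a preset might be selected, the helper below validates a preset name against the five listed above and builds a json_config dict from it. The "preset_config" key is an assumption (this page does not name the key), and build_json_config and submit_extraction are hypothetical helper names, not part of the library. The client is assumed to be an already-configured IndicoClient exposing call().

```python
from typing import Any, Dict, List

# Preset names as listed in the Notes above.
PRESETS = ("simple", "legacy", "detailed", "ondocument", "standard")


def build_json_config(preset: str) -> Dict[str, str]:
    """Return a json_config dict selecting one of the documented presets.

    Assumption: the preset is chosen with a "preset_config" key; this page
    does not name the key, so check it against the OCR article.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    return {"preset_config": preset}


def submit_extraction(client: Any, files: List[str], preset: str = "standard") -> Any:
    """Submit files for OCR through an already-configured client.

    Assumption: `client` is an IndicoClient exposing call(). The import is
    deferred so build_json_config can be used without the package installed.
    """
    from indico.queries import DocumentExtraction

    query = DocumentExtraction(files=files, json_config=build_json_config(preset))
    return client.call(query)
```

Validating the name up front catches typos before any upload happens. Note that, per the list above, "simple" should only be chosen for native PDFs, since it will not work with scanned ones.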

For more information, see the Indico knowledgebase article on OCR:
https://docs.indicodata.ai/articles/documentation-publication/ocr