DocumentExtraction
class indico.queries.documents.DocumentExtraction(files, json_config=None, upload_batch_size=None, ocr_engine='OMNIPAGE')
Extract raw text from PDF or TIF files.
DocumentExtraction performs Optical Character Recognition (OCR) on PDF or TIF files to
extract raw text for model training and prediction.
- Parameters:
  - files (List[str]) – Pathnames of one or more files to OCR
  - json_config (dict or JSON str) – Configuration settings for OCR. See Notes below.
  - upload_batch_size (int) – Batch size for document upload when uploading many documents
  - ocr_engine (str) – Which OCR engine to use. Defaults to OMNIPAGE.
- Returns:
  Job object
Notes
DocumentExtraction is extremely configurable. Five preset configurations are provided:
simple - Provides a simple and fast response for native PDFs (3-5x faster). Will NOT work with scanned PDFs.
legacy - Provided to mimic the behavior of Indico’s older pdf_extraction function. Use this if your model was trained with data from the older pdf_extraction.
detailed - Provides detailed bounding box information on tokens and characters. Returns data in a nested format at the document level with all metadata included.
ondocument - Provides detailed information at the page level in an unnested format.
standard - Provides page text and block text/position in a nested format.
For more information, please reference the Indico knowledgebase article on OCR:
https://docs.indicodata.ai/articles/documentation-publication/ocr
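As a rough illustration of how a preset might be selected, the sketch below builds a json_config for one of the presets named above and passes it to DocumentExtraction. It rests on assumptions not stated on this page: that a preset is chosen through a "preset_config" key in json_config, and that `client` is an already-configured IndicoClient. The helper names `build_ocr_config` and `submit_for_ocr` are hypothetical and not part of the library.

```python
# Preset names as listed in the Notes above.
PRESETS = ("simple", "legacy", "detailed", "ondocument", "standard")


def build_ocr_config(preset="standard"):
    """Return a json_config dict that selects one preset (hypothetical helper)."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset {preset!r}; expected one of {PRESETS}")
    # Assumption: presets are selected with a "preset_config" key.
    return {"preset_config": preset}


def submit_for_ocr(client, files, preset="standard"):
    """Submit files for OCR and return whatever client.call returns (hypothetical helper).

    `client` is assumed to be an already-configured IndicoClient.
    """
    # Imported lazily so build_ocr_config works without the client library installed.
    from indico.queries.documents import DocumentExtraction

    query = DocumentExtraction(files=files, json_config=build_ocr_config(preset))
    return client.call(query)
```

A call such as `submit_for_ocr(client, ["invoice.pdf"], preset="simple")` would then request the fast native-PDF preset; the file name here is only a placeholder. Validating the preset name locally catches typos before any upload starts.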