Creating a Dataset

Introduction

Datasets are collections of original documents that you can use to train your agents to extract data.

CreateDataset creates a dataset. By default, the query waits for the dataset to finish processing before returning a Dataset response object (see the definition of the Dataset object below). To return without waiting, set wait=False.
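As a minimal sketch of the non-blocking path, the helpers below pass wait=False so the call does not block on processing. The helper names create_dataset_kwargs and create_without_waiting are hypothetical; only CreateDataset and its name, files, and wait parameters come from this page.

```python
from typing import Any, Dict, List


def create_dataset_kwargs(name: str, files: List[str], wait: bool = False) -> Dict[str, Any]:
    """Collect the CreateDataset arguments documented on this page (hypothetical helper)."""
    if not files:
        raise ValueError("files must contain at least one path")
    return {"name": name, "files": list(files), "wait": wait}


def create_without_waiting(client, name: str, files: List[str]):
    """Submit the dataset with wait=False instead of blocking on processing."""
    # Imported lazily so create_dataset_kwargs can be used without the indico package.
    from indico.queries.datasets import CreateDataset

    return client.call(CreateDataset(**create_dataset_kwargs(name, files, wait=False)))
```

With wait=False the dataset has presumably not finished processing when the call returns, so it seems prudent to check the status field of the Dataset before relying on it.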

For more information about datasets and the different configurations, check out our user guides.

Dataset (Type)

The Dataset object is returned from queries such as CreateDataset, GetDataset, and ListDatasets.

Imported from: from indico.types.dataset import Dataset

  • id: int The ID of the dataset
  • name: str Name of the dataset
  • row_count: int Number of rows in the dataset
  • status: str Status of the dataset
  • permissions: str Permissions on the dataset
  • files: List[Datafile] Names of the file(s) included in the dataset
  • labelsets: List[LabelSet] Label sets associated with this dataset
  • datacolumns: List[DataColumn] Data column(s) of the dataset
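The fields above can be read as plain attributes. The sketch below formats a few of them into a one-line summary; DatasetLike is a stand-in defined here so the example runs on its own (it is not the library class), and the "COMPLETE" status value is a placeholder rather than a documented value.

```python
from dataclasses import dataclass


@dataclass
class DatasetLike:
    """Stand-in carrying a few of the Dataset fields listed above (not the library class)."""
    id: int
    name: str
    row_count: int
    status: str


def describe_dataset(dataset) -> str:
    """Format the core Dataset attributes as a one-line summary."""
    return (
        f"{dataset.name} (id={dataset.id}): "
        f"{dataset.row_count} rows, status={dataset.status}"
    )


# "COMPLETE" is a placeholder status value chosen for illustration.
example = DatasetLike(id=42, name="pdf-dataset", row_count=3, status="COMPLETE")
print(describe_dataset(example))
```

Because describe_dataset only reads attributes, it should work equally with the Dataset object returned by client.call.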

OcrEngine (type/enum)

The enum to use when defining ocr_engine in CreateDataset.

Imported from: from indico.types.dataset import OcrEngine

  • OMNIPAGE

  • READAPI

  • READAPI_V2

  • READAPI_TABLES_V1
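When the engine name comes from a configuration file or command-line argument, it can help to validate it before building the query. The sketch below checks a string against the four members listed above; normalize_engine_name and resolve_ocr_engine are hypothetical helpers, and the name lookup OcrEngine[name] assumes OcrEngine is a standard Python Enum.

```python
OCR_ENGINE_NAMES = ("OMNIPAGE", "READAPI", "READAPI_V2", "READAPI_TABLES_V1")


def normalize_engine_name(raw: str) -> str:
    """Upper-case and validate an engine name against the documented members."""
    name = raw.strip().upper()
    if name not in OCR_ENGINE_NAMES:
        raise ValueError(f"unknown OCR engine {raw!r}; expected one of {OCR_ENGINE_NAMES}")
    return name


def resolve_ocr_engine(raw: str):
    """Look up the enum member by name (assumes OcrEngine is a standard Enum)."""
    # Imported lazily so normalize_engine_name works without the indico package.
    from indico.types.dataset import OcrEngine

    return OcrEngine[normalize_engine_name(raw)]
```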

OmnipageOcrOptionsInput (type)

Omnipage specific OCR options for dataset creation.

Imported from: from indico.types.dataset import OmnipageOcrOptionsInput

  • auto_rotate: bool Auto rotate

  • single_column: bool Read table as a single column.

  • upscale_images: bool Scale up low-resolution images.

  • languages: List[str] List of strings representing Omnipage Language Options.

  • cells: bool Return table information for post-processing rules

  • force_render: bool Force rendering

  • native_layout: bool Native layout

  • native_pdf: bool Native PDF

  • table_read_order: TableReadOrder Read table by row or column.

ReadApiOcrOptionsInput (type)

ReadAPI OCR engine options for dataset creation.

Imported from: from indico.types.dataset import ReadApiOcrOptionsInput

  • auto_rotate: bool Auto rotate

  • single_column: bool Read table as a single column

  • upscale_images: bool Scale up low resolution images

  • languages: List[str] List of strings representing ReadAPI Language Options.
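Both option types share several fields, so a typo in an option name is easy to miss. The sketch below validates a plain dictionary of options against the field names documented above. check_ocr_options is a hypothetical helper, the "ENG" language code is a placeholder, and table_read_order is left out because its TableReadOrder values are not listed on this page. How the validated values are handed to OmnipageOcrOptionsInput or ReadApiOcrOptionsInput is not shown here, so the sketch stops at validation.

```python
from typing import Any, Dict, Set

# Boolean fields taken from the two option lists above.
OMNIPAGE_BOOL_OPTIONS = {
    "auto_rotate", "single_column", "upscale_images", "cells",
    "force_render", "native_layout", "native_pdf",
}
READAPI_BOOL_OPTIONS = {"auto_rotate", "single_column", "upscale_images"}


def check_ocr_options(options: Dict[str, Any], bool_fields: Set[str]) -> Dict[str, Any]:
    """Reject unknown option names and values of the wrong type; return a copy."""
    unknown = set(options) - (bool_fields | {"languages"})
    if unknown:
        raise ValueError(f"unknown OCR options: {sorted(unknown)}")
    for key, value in options.items():
        if key == "languages":
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise TypeError("languages must be a list of strings")
        elif not isinstance(value, bool):
            raise TypeError(f"{key} must be a bool")
    return dict(options)


# "ENG" is a placeholder language code, not a documented value.
readapi_options = check_ocr_options(
    {"auto_rotate": True, "languages": ["ENG"]}, READAPI_BOOL_OPTIONS
)
```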

CreateDataset

Inputs

Imported from: from indico.queries.datasets import CreateDataset

  • name: str Name of the dataset (required)
  • files: List[str] List of file paths (required)
  • wait: bool = True Wait for the dataset to finish processing before returning. (optional)
  • dataset_type: str = "TEXT" Type of the dataset. See above for more information. (optional)
  • from_local_images: bool = False ? (optional)
  • image_filename_col: str = "filename" Name of the column in the CSV with the images (optional)
  • batch_size: int = 20 Number of files to submit at a time (optional)
  • ocr_engine: OcrEngine = None Which OCR engine to use (OcrEngine defined above) (optional)
  • omnipage_ocr_options: OmnipageOcrOptionsInput = None Omnipage OCR engine options (optional)
  • read_api_ocr_options: ReadApiOcrOptionsInput = None ReadAPI OCR engine options (optional)
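To illustrate what batch_size means, the sketch below splits a file list into groups of at most batch_size paths. This is only a picture of the idea of submitting files a fixed number at a time; it is not the library's internal implementation, and the batches helper is hypothetical.

```python
from typing import Iterator, List


def batches(files: List[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


# 45 placeholder paths split with the default batch size of 20.
paths = [f"/path/to/file/file{i}.pdf" for i in range(45)]
sizes = [len(batch) for batch in batches(paths)]
print(sizes)  # [20, 20, 5]
```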

Outputs

Returns a Dataset object.

Try It Out

Try out the CreateDataset call.

from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf",
    "/path/to/file/file2.pdf",
    "/path/to/file/file3.pdf",
]

# CreateDataset waits for the dataset to finish processing the files by default.
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)
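Rather than listing each path by hand, the files argument can be built from a directory. The sketch below uses only the standard library; collect_pdf_paths is a hypothetical helper and the directory is whatever folder holds your documents.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(directory: str) -> List[str]:
    """Return the sorted paths of all .pdf files directly inside `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(directory)
    # glob("*.pdf") does not descend into subdirectories.
    return sorted(str(path) for path in folder.glob("*.pdf"))
```

The resulting list can be passed as files=collect_pdf_paths("/path/to/file") in the CreateDataset call.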