Creating a Dataset

Create a dataset and upload the associated files.


CreateDataset creates a Dataset: the collection of original documents, supplied in a variety of file types, that your model learns to label.

By default, this query waits for the dataset to finish processing before returning a Dataset response object (see below for the definition of the Dataset object). If you do not want this behavior, set wait=False.
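With wait=False the caller has to check the dataset status later on its own. A minimal sketch of such a polling loop follows. The helper wait_for_status and its fetch_status callable are hypothetical and not part of the client library, and the terminal status strings "COMPLETE" and "FAILED" are assumptions; in practice fetch_status might wrap a GetDataset call and return the status field of the result.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    terminal_states: Iterable[str] = ("COMPLETE", "FAILED"),  # assumed values
    interval: float = 1.0,
    timeout: float = 300.0,
) -> str:
    """Call fetch_status repeatedly until it returns a terminal state."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset still {status!r} after {timeout} seconds")
        time.sleep(interval)
```

Passing the status lookup as a callable keeps the loop independent of how the status is actually retrieved.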

For more information related to Datasets and the different configurations, check out our user guides.

Dataset (type)


The Dataset object returned from queries such as CreateDataset, GetDataset, and ListDatasets.

Imported from: from indico.types.dataset import Dataset

id: int The ID of the dataset

name: str Name of the dataset

row_count: int Number of rows in the dataset

status: str Status of the dataset

permissions: str Permissions on the dataset

files: List[Datafile] Names of the file(s) included in the dataset

labelsets: List[LabelSet] LabelSets associated with this dataset

datacolumns: List[DataColumn] DataColumn(s) of the dataset
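As a rough illustration of reading these fields, the sketch below formats a one-line summary from any object that exposes the attributes listed above. The helper summarize_dataset is hypothetical, and the example uses a SimpleNamespace stand-in with made-up values rather than a real Dataset returned by the API.

```python
from types import SimpleNamespace


def summarize_dataset(dataset) -> str:
    """Build a short summary from the documented Dataset attributes."""
    file_count = len(dataset.files or [])
    return (
        f"Dataset {dataset.id} ({dataset.name!r}): "
        f"{dataset.row_count} rows, {file_count} files, status={dataset.status}"
    )


# Stand-in object for illustration only; a real Dataset comes from the client.
example = SimpleNamespace(
    id=42,
    name="Invoices",
    row_count=3,
    status="COMPLETE",  # assumed status value
    files=["file1.pdf", "file2.pdf", "file3.pdf"],
)
print(summarize_dataset(example))
```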

OcrEngine (type/enum)


The enum to use when defining ocr_engine in CreateDataset.

Imported from: from indico.types.dataset import OcrEngine





OmnipageOcrOptionsInput (type)


Omnipage specific OCR options for dataset creation.

Imported from: from indico.types.dataset import OmnipageOcrOptionsInput

auto_rotate: bool Auto rotate

single_column: bool Read table as a single column.

upscale_images: bool Scale up low-resolution images.

languages: List[str] List of strings representing Omnipage Language Options.

cells: bool Return table information for post-processing rules

force_render: bool Force rendering

native_layout: bool Native layout

native_pdf: bool Native pdf

table_read_order: TableReadOrder Read table by row or column.

ReadApiOcrOptionsInput (type)


ReadAPI OCR engine options for dataset creation.

Imported from: from indico.types.dataset import ReadApiOcrOptionsInput

auto_rotate: bool Auto rotate

single_column: bool Read table as a single column

upscale_images: bool Scale up low resolution images

languages: List[str] List of strings representing ReadAPI Language Options.
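Because the option names and types are easy to mistype, it can help to check them before building the query. The sketch below validates a plain dict against the four ReadAPI fields listed above. The helper check_read_api_options is hypothetical, the language code shown is only a placeholder, and whether the client accepts a plain dict in place of ReadApiOcrOptionsInput is an assumption to verify against the library.

```python
# Field names and types taken from the ReadApiOcrOptionsInput listing above.
READ_API_OPTION_TYPES = {
    "auto_rotate": bool,
    "single_column": bool,
    "upscale_images": bool,
    "languages": list,
}


def check_read_api_options(options: dict) -> dict:
    """Return a copy of options after checking key names and value types."""
    for key, value in options.items():
        if key not in READ_API_OPTION_TYPES:
            raise KeyError(f"unknown ReadAPI OCR option: {key}")
        expected = READ_API_OPTION_TYPES[key]
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
    if not all(isinstance(lang, str) for lang in options.get("languages", [])):
        raise TypeError("languages must be a list of strings")
    return dict(options)


# "ENG" is a placeholder, not a confirmed ReadAPI language option.
read_api_options = check_read_api_options(
    {"auto_rotate": True, "upscale_images": True, "languages": ["ENG"]}
)
```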




CreateDataset (query) Inputs

Imported from: from indico.queries.datasets import CreateDataset


name: str Name of the dataset (required)

files: List[str] List of file paths (required)

wait: bool = True, Wait for the dataset to finish processing before returning (optional)

dataset_type: str = "TEXT", Type of the dataset. See above for more information. (optional)

from_local_images: bool = False, ? (optional)

image_filename_col: str = "filename", Name of the column in the CSV with the images (optional)

batch_size: int = 20, Number of files to submit at a time (optional)

ocr_engine: OcrEngine = None, Which OCR engine to use (OcrEngine defined above) (optional)

omnipage_ocr_options: OmnipageOcrOptionsInput = None, Omnipage OCR engine options (optional)

read_api_ocr_options: ReadApiOcrOptionsInput = None, ReadAPI OCR engine options (optional)
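To make the effect of batch_size concrete, the sketch below splits a list of file paths into groups of at most 20, the default. It only illustrates the arithmetic; the helper batches is hypothetical, and how the client actually submits each batch is not described here.

```python
from typing import Iterator, List, Sequence


def batches(files: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of files, each holding at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield list(files[start:start + batch_size])


# 45 placeholder paths split into batches of 20, 20, and 5.
paths = [f"/path/to/file/file{i}.pdf" for i in range(1, 46)]
sizes = [len(batch) for batch in batches(paths)]
print(sizes)  # [20, 20, 5]
```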



CreateDataset (query) Outputs

Returns a Dataset object.

Try It Out

Try out the CreateDataset call:

from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

my_config = IndicoConfig(
    host="", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
  "/path/to/file/file1.pdf", "/path/to/file/file2.pdf", "/path/to/file/file3.pdf"
]

response: Dataset = client.call(
  CreateDataset( # CreateDataset waits for the dataset to finish processing the files by default