Introduction

CreateDataset creates a Dataset: the collection of original documents, in a variety of file types, that your model learns to label.

By default, this query waits for the dataset to finish processing before returning a Dataset response object (see below for the definition of the Dataset object). If you do not want this behavior, set wait=False.
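
As a rough sketch of the non-blocking path, the helper below polls a status until it reaches a terminal value. The names wait_for_status and create_without_waiting, the terminal status strings "COMPLETE" and "FAILED", and the interval and timeout defaults are assumptions for illustration, not part of the documented API. The second function also assumes that GetDataset accepts the dataset ID and that the object returned with wait=False exposes an id.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    terminal: Iterable[str] = ("COMPLETE", "FAILED"),  # assumed status values
    interval: float = 2.0,
    timeout: float = 600.0,
) -> str:
    """Call fetch_status() until it returns a terminal value or the timeout elapses."""
    terminal_set = set(terminal)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset still in status {status!r} after {timeout}s")
        time.sleep(interval)


def create_without_waiting(client, name: str, files: list) -> str:
    """Hypothetical usage: start dataset creation, then poll on our own schedule."""
    # Imported lazily so the polling helper above works without the client installed.
    from indico.queries.datasets import CreateDataset, GetDataset

    dataset = client.call(CreateDataset(name=name, files=files, wait=False))
    # Assumption: GetDataset takes the dataset ID and returns a Dataset with .status
    return wait_for_status(lambda: client.call(GetDataset(dataset.id)).status)
```

Keeping the polling loop separate from the client calls makes it possible to do other work between checks, or to swap in a different status source.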

For more information related to Datasets and the different configurations, check out our user guides.

Dataset (Type)

📘
Dataset (type)
The Dataset Object that's returned from queries such as CreateDataset, GetDataset, ListDatasets, etc.
Imported from: from indico.types.dataset import Dataset
id: int The ID of the dataset
name: str Name of the dataset
row_count: int Number of rows in the dataset
status: str Status of the dataset
permissions: str Permissions on the dataset
files: List[Datafile] Names of the file(s) included in the dataset
labelsets: List[LabelSet] LabelSets associated with this dataset
datacolumns: List[DataColumn] DataColumn(s) of the dataset
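
To illustrate how the fields listed above might be read, here is a minimal sketch that formats a one-line summary. The function name summarize_dataset is hypothetical; it relies only on the id, name, status, row_count, and files attributes described in this section, so any object exposing them will work.

```python
def summarize_dataset(dataset) -> str:
    """Build a short, human-readable summary from the documented Dataset fields."""
    return (
        f"Dataset {dataset.id} ({dataset.name!r}): "
        f"status={dataset.status}, rows={dataset.row_count}, "
        f"files={len(dataset.files)}"
    )
```

A summary like this can be handy for logging the result of a CreateDataset or GetDataset call.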

OcrEngine (type/enum)

📘
OcrEngine (type/enum)
The enum to use when defining ocr_engine in CreateDataset.
Imported from: from indico.types.dataset import OcrEngine
OMNIPAGE
READAPI
READAPI_V2
READAPI_TABLES_V1

OmnipageOcrOptionsInput (type)

📘
OmnipageOcrOptionsInput (type)
Omnipage specific OCR options for dataset creation.
Imported from: from indico.types.dataset import OmnipageOcrOptionsInput
auto_rotate: bool Auto rotate
single_column: bool Read table as a single column.
upscale_images: bool Scale up low-resolution images.
languages: List[str] List of strings representing Omnipage Language Options.
cells: bool Return table information for post-processing rules
force_render: bool Force rendering
native_layout: bool Native layout
native_pdf: bool Native pdf
table_read_order: TableReadOrder Read table by row or column.

ReadApiOcrOptionsInput (type)

📘
ReadApiOcrOptionsInput (type)
ReadAPI OCR engine options for dataset creation.
Imported from: from indico.types.dataset import ReadApiOcrOptionsInput
auto_rotate: bool Auto rotate
single_column: bool Read table as a single column
upscale_images: bool Scale up low resolution images
languages: List[str] List of strings representing ReadAPI Language Options.
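
The option names above can be checked before a dataset is created. The sketch below is one possible approach: check_ocr_options and create_with_readapi are hypothetical helpers, and passing the options to CreateDataset as a plain dict is an assumption that should be verified against the client version in use. The allowed field names are taken from the two type listings above.

```python
# Field names copied from the ReadApiOcrOptionsInput and OmnipageOcrOptionsInput listings.
READAPI_FIELDS = {"auto_rotate", "single_column", "upscale_images", "languages"}
OMNIPAGE_FIELDS = READAPI_FIELDS | {
    "cells", "force_render", "native_layout", "native_pdf", "table_read_order",
}


def check_ocr_options(options: dict, allowed: set) -> dict:
    """Return a copy of options, or raise ValueError if any key is not allowed."""
    unknown = sorted(set(options) - allowed)
    if unknown:
        raise ValueError(f"unsupported OCR option(s): {', '.join(unknown)}")
    return dict(options)


def create_with_readapi(client, name: str, files: list, options: dict):
    """Hypothetical usage: create a dataset with the READAPI engine and checked options."""
    # Imported lazily so the validation helper works without the client installed.
    from indico.queries.datasets import CreateDataset
    from indico.types.dataset import OcrEngine

    return client.call(
        CreateDataset(
            name=name,
            files=files,
            dataset_type="DOCUMENT",
            ocr_engine=OcrEngine.READAPI,
            # Assumption: the options are accepted as a plain dict
            read_api_ocr_options=check_ocr_options(options, READAPI_FIELDS),
        )
    )
```

Validating the keys locally surfaces a misspelled or engine-mismatched option before any files are uploaded.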

CreateDataset

Inputs

📥
CreateDataset (query) Inputs
Imported from: from indico.queries.datasets import CreateDataset
Inputs
name: str Name of the dataset (required)
files: List[str] List of file paths (required)
wait: bool = True, Wait for the dataset to finish processing before returning (optional)
dataset_type: str = "TEXT", Type of the dataset. See above for more information. (optional)
from_local_images: bool = False, ? (optional)
image_filename_col: str = "filename", Name of the column in the CSV with the images (optional)
batch_size: int = 20, Number of files to submit at a time (optional)
ocr_engine: OcrEngine = None, Which OCR engine to use (OcrEngine defined above) (optional)
omnipage_ocr_options: OmnipageOcrOptionsInput = None, Omnipage OCR engine options (optional)
read_api_ocr_options: ReadApiOcrOptionsInput = None, ReadAPI OCR engine options (optional)
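
Since files expects a list of file paths, a small helper can build that list from a folder. The sketch below uses only the Python standard library; collect_files and its suffixes parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def collect_files(directory: str, suffixes: Iterable[str] = (".pdf",)) -> List[str]:
    """Return sorted paths of files directly inside directory with a matching suffix."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )
```

The returned list could then be passed as the files argument, for example files=collect_files("/path/to/file").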

Outputs

📤
CreateDataset (query) Outputs
Returns a Dataset object

Try It Out

Try out the CreateDataset call:

from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf",
    "/path/to/file/file2.pdf",
    "/path/to/file/file3.pdf",
]

# CreateDataset waits for the dataset to finish processing the files by default
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)