Creating a Dataset

Create a dataset and upload the associated files.

Introduction

CreateDataset creates a Dataset: a collection of original documents, in a variety of file types, that your model learns to label.

By default, this query waits for the dataset to finish processing before returning a Dataset response object (see below for the definition of the Dataset object). If you do not want this behavior, set wait=False.
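With wait=False the call returns before processing has finished, so the caller has to check the dataset status later. The following is a minimal sketch of such a polling loop, not the library's own implementation. The helper names wait_for_dataset and demo_with_indico, the terminal status strings "COMPLETE" and "FAILED", and the GetDataset import path and id argument are assumptions made for illustration; check them against your installed client and cluster.

```python
import time
from typing import Callable, Iterable


def wait_for_dataset(
    fetch_status: Callable[[], str],
    done_statuses: Iterable[str] = ("COMPLETE", "FAILED"),  # assumed status values
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Call fetch_status() until it returns a terminal status or polls run out."""
    done = set(done_statuses)
    status = fetch_status()
    polls = 1
    while status not in done and polls < max_polls:
        time.sleep(poll_seconds)
        status = fetch_status()
        polls += 1
    return status


def demo_with_indico() -> str:
    """Hypothetical wiring; needs the indico client and a reachable cluster."""
    from indico import IndicoClient, IndicoConfig
    from indico.queries.datasets import CreateDataset, GetDataset  # path assumed

    client = IndicoClient(
        config=IndicoConfig(
            host="your-cluster.example.com",
            api_token_path="./indico_api_token.txt",
        )
    )
    # wait=False: return without waiting for processing to finish.
    dataset = client.call(
        CreateDataset(
            name="pdf-dataset",
            files=["/path/to/file/file1.pdf"],
            wait=False,
        )
    )
    # Re-fetch the dataset and read its status field until it is terminal.
    return wait_for_dataset(
        lambda: client.call(GetDataset(id=dataset.id)).status
    )
```

The polling helper takes a plain callable, so it can be exercised without a cluster by passing a function that returns canned status strings.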

For more information related to Datasets and the different configurations, check out our user guides.

Dataset (Type)

The Dataset Object that's returned from queries such as CreateDataset, GetDataset, ListDatasets, etc.

Imported from: from indico.types.dataset import Dataset


id: int The ID of the dataset

name: str Name of the dataset

row_count: int Number of rows in the dataset

status: str Status of the dataset

permissions: str Permissions on the dataset

files: List[Datafile] Names of the file(s) included in the dataset

labelsets: List[LabelSet] LabelSets associated with this dataset

datacolumns: List[DataColumn] DataColumn(s) of the dataset
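As a small illustration of reading the fields listed above, the sketch below formats a one-line summary from a Dataset-like object. The helper name describe_dataset is hypothetical, and the stand-in object and its values are made up so the snippet runs without a cluster; a real Dataset would come from client.call(...).

```python
from types import SimpleNamespace


def describe_dataset(dataset) -> str:
    """Build a one-line summary from the documented Dataset fields."""
    return (
        f"Dataset {dataset.id} ({dataset.name}): "
        f"{dataset.row_count} rows, status={dataset.status}"
    )


# Stand-in with made-up values, used in place of a real Dataset response.
fake = SimpleNamespace(id=1, name="pdf-dataset", row_count=3, status="COMPLETE")
print(describe_dataset(fake))
```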

OcrEngine (type/enum)

The enum to use when defining ocr_engine in CreateDataset.

Imported from: from indico.types.dataset import OcrEngine


OMNIPAGE

READAPI

READAPI_V2

READAPI_TABLES_V1

OmnipageOcrOptionsInput (type)

Omnipage specific OCR options for dataset creation.

Imported from: from indico.types.dataset import OmnipageOcrOptionsInput


auto_rotate: bool Auto rotate

single_column: bool Read table as a single column.

upscale_images: bool Scale up low-resolution images.

languages: List[str] List of strings representing Omnipage Language Options.

cells: bool Return table information for post-processing rules

force_render: bool Force rendering

native_layout: bool Native layout

native_pdf: bool Native pdf

table_read_order: TableReadOrder Read table by row or column.

ReadApiOcrOptionsInput (type)

ReadAPI OCR engine options for dataset creation.

Imported from: from indico.types.dataset import ReadApiOcrOptionsInput


auto_rotate: bool Auto rotate

single_column: bool Read table as a single column

upscale_images: bool Scale up low resolution images

languages: List[str] List of strings representing ReadAPI Language Options.
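The two option types share four fields, and the Omnipage type adds five more. A hedged sketch of a pre-flight check that catches an option name used with the wrong engine is shown below. The helper build_ocr_kwargs is hypothetical and not part of the indico client. It also assumes that all three READAPI engine variants take the ReadAPI options, and that options can be collected as a plain dict before being handed to the client; neither is confirmed on this page.

```python
from typing import Any, Dict, Optional

# Field names copied from the ReadApiOcrOptionsInput and
# OmnipageOcrOptionsInput lists on this page.
READAPI_FIELDS = {"auto_rotate", "single_column", "upscale_images", "languages"}
OMNIPAGE_FIELDS = READAPI_FIELDS | {
    "cells",
    "force_render",
    "native_layout",
    "native_pdf",
    "table_read_order",
}


def build_ocr_kwargs(
    engine: str, options: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Pair an engine name with its options, rejecting unlisted option names."""
    options = options or {}
    if engine == "OMNIPAGE":
        allowed, key = OMNIPAGE_FIELDS, "omnipage_ocr_options"
    elif engine in {"READAPI", "READAPI_V2", "READAPI_TABLES_V1"}:
        # Assumption: every READAPI variant uses the ReadAPI option set.
        allowed, key = READAPI_FIELDS, "read_api_ocr_options"
    else:
        raise ValueError(f"Unknown OCR engine: {engine}")
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"Options not listed for {engine}: {sorted(unknown)}")
    kwargs: Dict[str, Any] = {"ocr_engine": engine}
    if options:
        kwargs[key] = dict(options)
    return kwargs


print(build_ocr_kwargs("OMNIPAGE", {"auto_rotate": True, "cells": True}))
```

The engine is kept as a string name here. Converting it with OcrEngine[name] before the real call assumes OcrEngine is a standard Python Enum.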

CreateDataset

Inputs

CreateDataset (query) Inputs

Imported from: from indico.queries.datasets import CreateDataset


name: str Name of the dataset (required)

files: List[str] List of file paths (required)

wait: bool = True, Wait for the dataset to finish processing before returning (optional)

dataset_type: str = "TEXT", Type of the dataset, for example "DOCUMENT" as used in the example on this page. See the user guides for more information. (optional)

from_local_images: bool = False, ? (optional)

image_filename_col: str = "filename", Name of the column in the CSV with the images (optional)

batch_size: int = 20, Number of files to submit at a time (optional)

ocr_engine: OcrEngine = None, Which OCR engine to use (OcrEngine defined above) (optional)

omnipage_ocr_options: OmnipageOcrOptionsInput = None, Omnipage OCR engine options (optional)

read_api_ocr_options: ReadApiOcrOptionsInput = None, ReadAPI OCR engine options (optional)
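To make the batch_size input concrete, the sketch below shows how a list of file paths splits into groups of at most batch_size. It only illustrates the arithmetic behind "number of files to submit at a time" and is not the library's upload code. The helper name batches and the 45 file paths are made up for the example.

```python
from typing import Iterator, List, Sequence


def batches(files: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size file paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield list(files[start:start + batch_size])


# 45 made-up paths split with the default batch size of 20.
paths = [f"/path/to/file/file{i}.pdf" for i in range(1, 46)]
sizes = [len(group) for group in batches(paths, batch_size=20)]
print(sizes)  # [20, 20, 5]
```

With the default of 20, 45 files are submitted as two full groups and one group of five.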

Outputs

CreateDataset (query) Outputs

Returns a Dataset object (defined above).

Try It Out

Try out the CreateDataset call:

from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf",
    "/path/to/file/file2.pdf",
    "/path/to/file/file3.pdf",
]

# CreateDataset waits for the dataset to finish processing the files by default.
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)