Creating a Dataset
Introduction
Datasets are collections of original documents that you can use to train your agents to extract data.
CreateDataset is used to create a dataset. By default, this query waits for the dataset to finish processing before returning a Dataset response object (see the definition of the Dataset object below). If you do not want this behavior, set wait=False.
For more information about datasets and the different configurations, check out our user guides.
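If you create a dataset with wait=False, you will likely want to check on it later. The sketch below is one possible way to do that, not part of the Indico client: wait_for_status is a hypothetical helper that polls a caller-supplied function until it reports a terminal status. The status strings used in the example ("CREATING", "COMPLETE", "FAILED") are assumptions for illustration; check the values your cluster actually returns.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    terminal_statuses: Iterable[str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Poll fetch_status() until it returns a terminal status or the timeout expires.

    Hypothetical helper: fetch_status is any zero-argument function that
    returns the dataset's current status string.
    """
    terminal = set(terminal_statuses)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"dataset still in status {status!r} after {timeout} seconds"
            )
        time.sleep(poll_interval)


# Simulated status sequence; the status names here are assumed, not documented values.
fake_statuses = iter(["CREATING", "CREATING", "COMPLETE"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_statuses={"COMPLETE", "FAILED"},
    poll_interval=0.0,
)
print(final_status)
```

In real use, fetch_status would be a small function that re-fetches the dataset (for example with GetDataset) and returns the status field of the resulting Dataset object.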
Dataset (Type)
The Dataset object is returned by queries such as CreateDataset, GetDataset, and ListDatasets.
Imported from: from indico.types.dataset import Dataset
- id: int. The ID of the dataset
- name: str. Name of the dataset
- row_count: int. Number of rows in the dataset
- status: str. Status of the dataset
- permissions: str. Permissions on the dataset
- files: List[Datafile]. Names of the file(s) included in the dataset
- labelsets: List[LabelSet]. Label sets associated with this dataset
- datacolumns: List[DataColumn]. Data column(s) of the dataset
OcrEngine (type/enum)
The enum to use when defining ocr_engine in CreateDataset.
Imported from: from indico.types.dataset import OcrEngine
- OMNIPAGE
- READAPI
- READAPI_V2
- READAPI_TABLES_V1
OmnipageOcrOptionsInput (type)
Omnipage specific OCR options for dataset creation.
Imported from: from indico.types.dataset import OmnipageOcrOptionsInput
- auto_rotate: bool. Auto rotate.
- single_column: bool. Read table as a single column.
- upscale_images: bool. Scale up low-resolution images.
- languages: List[str]. List of strings representing Omnipage Language Options.
- cells: bool. Return table information for post-processing rules.
- force_render: bool. Force rendering.
- native_layout: bool. Native layout.
- native_pdf: bool. Native PDF.
- table_read_order: TableReadOrder. Read table by row or column.
ReadApiOcrOptionsInput (type)
ReadAPI OCR engine options for dataset creation.
Imported from: from indico.types.dataset import ReadApiOcrOptionsInput
- auto_rotate: bool. Auto rotate.
- single_column: bool. Read table as a single column.
- upscale_images: bool. Scale up low-resolution images.
- languages: List[str]. List of strings representing ReadAPI Language Options.
CreateDataset
Inputs
Imported from: from indico.queries.datasets import CreateDataset
- name: str. Name of the dataset. (required)
- files: List[str]. List of file paths. (required)
- wait: bool = True. Wait for the dataset to finish processing before returning. (optional)
- dataset_type: str = "TEXT". Type of the dataset. See above for more information. (optional)
- from_local_images: bool = False. ? (optional)
- image_filename_col: str = "filename". Name of the column in the CSV with the images. (optional)
- batch_size: int = 20. Number of files to submit at a time. (optional)
- ocr_engine: OcrEngine = None. Which OCR engine to use (OcrEngine is defined above). (optional)
- omnipage_ocr_options: OmnipageOcrOptionsInput = None. Omnipage OCR engine options. (optional)
- read_api_ocr_options: ReadApiOcrOptionsInput = None. ReadAPI OCR engine options. (optional)
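To make the batch_size input concrete, here is a rough illustration of what "number of files to submit at a time" means for a list of paths. This is a standalone sketch, not the client's actual upload code: batched is a hypothetical helper, and the real implementation may group files differently.

```python
from typing import Iterator, List, Sequence


def batched(files: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size file paths (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield list(files[start:start + batch_size])


# 45 placeholder paths, split with the default batch size of 20
paths = [f"/path/to/file/file{i}.pdf" for i in range(1, 46)]
batch_lengths = [len(batch) for batch in batched(paths)]
print(batch_lengths)  # [20, 20, 5]
```

Under this reading, 45 files with the default batch_size would go up in three groups; a smaller value means more, smaller submissions.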
Outputs
Returns a Dataset object.
Try It Out
Try out the CreateDataset call.
from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf",
    "/path/to/file/file2.pdf",
    "/path/to/file/file3.pdf",
]

# CreateDataset waits for the dataset to finish processing the files by default
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)
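Listing every path by hand gets tedious for larger datasets. As a convenience, the files list could be built from a directory instead. The following is a minimal sketch using only the standard library; collect_pdf_paths is a hypothetical helper, not part of the Indico client, and it assumes all of the PDFs sit directly inside one folder.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(directory: str) -> List[str]:
    """Return the sorted paths of all .pdf files directly inside directory."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    # Match the extension case-insensitively so "scan.PDF" is also included
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The returned list can then be passed as the files input, for example files=collect_pdf_paths("/path/to/file").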
