Creating a Dataset
Create a dataset and upload the associated files.
Introduction
CreateDataset creates a Dataset: a collection of original documents, in a variety of file types, that your model learns to label.
By default, this query waits for the dataset to finish processing before returning a Dataset response object (see below for the definition of the Dataset object). If you do not want this behavior, set wait=False.
For more information related to Datasets and the different configurations, check out our user guides.
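When wait=False is used, the caller has to decide when the dataset is ready. The following is a minimal sketch of a generic polling loop, not part of the Indico client. The `get_status` callable is hypothetical, and the terminal status names "COMPLETE" and "FAILED" are assumptions; check the values your cluster actually reports.

```python
import time
from typing import Callable, Iterable


def wait_for_dataset(
    get_status: Callable[[], str],
    terminal_statuses: Iterable[str] = ("COMPLETE", "FAILED"),  # assumed names
    poll_interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Call get_status() until it returns a terminal status or the timeout passes."""
    terminal = set(terminal_statuses)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"dataset still in status {status!r} after {timeout} seconds"
            )
        time.sleep(poll_interval)
```

With a real client, `get_status` could be something like `lambda: client.call(GetDataset(id=dataset.id)).status`, assuming GetDataset accepts the dataset ID in that form.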
Dataset (type)
The Dataset object that's returned from queries such as CreateDataset, GetDataset, ListDatasets, etc.
Imported from:
from indico.types.dataset import Dataset
id: int - The ID of the dataset
name: str - Name of the dataset
row_count: int - Number of rows in the dataset
status: str - Status of the dataset
permissions: str - Permissions on the dataset
files: List[Datafile] - Names of the file(s) included in the dataset
labelsets: List[LabelSet] - LabelSets associated with this dataset
datacolumns: List[DataColumn] - DataColumn(s) of the dataset
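As a small illustration of reading these fields, the sketch below builds a one-line summary from any object that exposes the documented attributes. `summarize_dataset` is a hypothetical helper, and the `SimpleNamespace` value is only a stand-in for a real Dataset returned by `client.call(...)`.

```python
from types import SimpleNamespace


def summarize_dataset(dataset) -> str:
    """Build a one-line summary from the documented Dataset fields."""
    file_count = len(dataset.files or [])
    return (
        f"Dataset {dataset.id} ({dataset.name!r}): status={dataset.status}, "
        f"rows={dataset.row_count}, files={file_count}"
    )


# Stand-in object for illustration only; not a real indico Dataset.
example = SimpleNamespace(
    id=42,
    name="pdf-dataset",
    row_count=3,
    status="COMPLETE",
    files=["file1.pdf", "file2.pdf", "file3.pdf"],
)
print(summarize_dataset(example))
```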
OcrEngine (type/enum)
The enum to use when defining ocr_engine in CreateDataset.
Imported from:
from indico.types.dataset import OcrEngine
OMNIPAGE
READAPI
READAPI_V2
READAPI_TABLES_V1
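If the engine name comes from a config file or command-line flag, it can help to validate it before building the query. The sketch below uses a local stand-in enum that mirrors only the member names listed above; the member values are an assumption. If the real OcrEngine is a standard Python Enum, `OcrEngine[key]` would be expected to behave the same way.

```python
from enum import Enum


class OcrEngineStandIn(Enum):
    """Local stand-in mirroring the documented member names (values assumed)."""
    OMNIPAGE = "OMNIPAGE"
    READAPI = "READAPI"
    READAPI_V2 = "READAPI_V2"
    READAPI_TABLES_V1 = "READAPI_TABLES_V1"


def parse_ocr_engine(name: str) -> OcrEngineStandIn:
    """Map a user-supplied string to an enum member, ignoring case and padding."""
    key = name.strip().upper()
    try:
        return OcrEngineStandIn[key]
    except KeyError:
        valid = ", ".join(member.name for member in OcrEngineStandIn)
        raise ValueError(
            f"unknown OCR engine {name!r}; expected one of: {valid}"
        ) from None
```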
OmnipageOcrOptionsInput (type)
Omnipage-specific OCR options for dataset creation.
Imported from:
from indico.types.dataset import OmnipageOcrOptionsInput
auto_rotate: bool - Auto rotate
single_column: bool - Read table as a single column
upscale_images: bool - Scale up low-resolution images
languages: List[str] - List of strings representing Omnipage Language Options
cells: bool - Return table information for post-processing rules
force_render: bool - Force rendering
native_layout: bool - Native layout
native_pdf: bool - Native PDF
table_read_order: TableReadOrder - Read table by row or column
ReadApiOcrOptionsInput (type)
ReadAPI OCR engine options for dataset creation.
Imported from:
from indico.types.dataset import ReadApiOcrOptionsInput
auto_rotate: bool - Auto rotate
single_column: bool - Read table as a single column
upscale_images: bool - Scale up low-resolution images
languages: List[str] - List of strings representing ReadAPI Language Options
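The ReadAPI options are a subset of the Omnipage ones, so an Omnipage-only key such as cells is easy to pass by mistake. The sketch below checks option names against the four fields documented above. `ReadApiOptionsSketch` and `make_read_api_options` are hypothetical names, and whether the client accepts a plain dict for read_api_ocr_options is an assumption rather than something this page confirms.

```python
from typing import List, TypedDict


class ReadApiOptionsSketch(TypedDict, total=False):
    """Stand-in with the four documented ReadAPI option fields."""
    auto_rotate: bool
    single_column: bool
    upscale_images: bool
    languages: List[str]


ALLOWED_KEYS = set(ReadApiOptionsSketch.__annotations__)


def make_read_api_options(**options) -> ReadApiOptionsSketch:
    """Reject option names that are not documented for the ReadAPI engine."""
    unknown = set(options) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported ReadAPI OCR options: {sorted(unknown)}")
    return ReadApiOptionsSketch(**options)
```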
CreateDataset
Imported from:
from indico.queries.datasets import CreateDataset
Inputs
name: str - Name of the dataset (required)
files: List[str] - List of file paths (required)
wait: bool = True - Wait for the dataset to finish processing before returning (optional)
dataset_type: str = "TEXT" - Type of the dataset. See the user guides for more information. (optional)
from_local_images: bool = False - ? (optional)
image_filename_col: str = "filename" - Name of the column in the CSV with the images (optional)
batch_size: int = 20 - Number of files to submit at a time (optional)
ocr_engine: OcrEngine = None - Which OCR engine to use (OcrEngine is defined above) (optional)
omnipage_ocr_options: OmnipageOcrOptionsInput = None - Omnipage OCR engine options (optional)
read_api_ocr_options: ReadApiOcrOptionsInput = None - ReadAPI OCR engine options (optional)
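Because files takes local paths and batch_size controls how many are submitted at a time, a quick pre-flight check can catch bad paths before any upload starts. The sketch below is illustrative only: `find_missing` and `batched` are hypothetical helpers, and the chunking shows what submitting 20 files at a time could look like, not how the client is known to split work internally.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def find_missing(paths: Sequence[str]) -> List[str]:
    """Return the paths that do not point at an existing regular file."""
    return [p for p in paths if not Path(p).is_file()]


def batched(paths: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])
```

Calling `find_missing(dataset_filepaths)` before `CreateDataset` and stopping if the result is non-empty avoids a partially submitted dataset.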
Outputs
Returns a Dataset object.
Try It Out
Try out the CreateDataset call:
from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

# Point the client at your cluster and API token file
my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf", "/path/to/file/file2.pdf", "/path/to/file/file3.pdf"
]

# CreateDataset waits for the dataset to finish processing the files by default
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)
