Creating a Dataset
Create a dataset and upload the associated files.
Introduction
CreateDataset creates a Dataset: the collection of original documents, in a variety of file types, that your model learns to label.
By default, this query waits for the dataset to finish processing before returning a Dataset response object (see below for the definition of the Dataset object). If you do not want this behavior, set wait=False.
For more information related to Datasets and the different configurations, check out our user guides.
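When wait=False is set, the caller has to check for completion itself. The following is a minimal sketch of one way to poll. The helper `wait_until_done` and its parameters are hypothetical and not part of the Indico client, and the status strings "COMPLETE" and "FAILED" are assumptions that should be checked against your cluster.

```python
import time
from typing import Callable, Iterable


def wait_until_done(
    fetch_status: Callable[[], str],
    done_states: Iterable[str] = ("COMPLETE", "FAILED"),  # assumed status values
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal state or the timeout elapses."""
    done = set(done_states)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dataset still in state {status!r} after {timeout}s")
        sleep(interval)
```

With the client, fetch_status could be a small function that re-fetches the dataset (for example with GetDataset) and returns its status field. The exact arguments GetDataset takes are not shown on this page, so treat that wiring as an assumption.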
Dataset (type)
The Dataset object returned from queries such as CreateDataset, GetDataset, and ListDatasets.
Imported from:
from indico.types.dataset import Dataset
id: int
The ID of the dataset
name: str
Name of the dataset
row_count: int
Number of rows in the dataset
status: str
Status of the dataset
permissions: str
Permissions on the dataset
files: List[Datafile]
Names of the file(s) included in the dataset
labelsets: List[LabelSet]
LabelSets associated with this dataset
datacolumns: List[DataColumn]
DataColumn(s) of the dataset
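As an illustration of reading these fields, the sketch below builds a one-line summary from an object that exposes the documented attribute names. The helper `summarize_dataset` is hypothetical, and a stand-in object is used so the snippet runs without a cluster.

```python
from types import SimpleNamespace


def summarize_dataset(dataset) -> str:
    """Build a one-line summary from the documented Dataset fields."""
    files = getattr(dataset, "files", None) or []
    return (
        f"Dataset {dataset.id} ({dataset.name!r}): "
        f"status={dataset.status}, rows={dataset.row_count}, files={len(files)}"
    )


# Stand-in with the documented attribute names; a Dataset returned by the
# client is expected to expose the same attributes.
example = SimpleNamespace(
    id=42, name="pdf-dataset", row_count=3, status="COMPLETE", files=["a", "b", "c"]
)
print(summarize_dataset(example))
```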
OcrEngine (type/enum)
The enum to use when defining ocr_engine in CreateDataset.
Imported from:
from indico.types.dataset import OcrEngine
OMNIPAGE
READAPI
READAPI_V2
READAPI_TABLES_V1
OmnipageOcrOptionsInput (type)
Omnipage-specific OCR options for dataset creation.
Imported from:
from indico.types.dataset import OmnipageOcrOptionsInput
auto_rotate: bool
Auto rotate
single_column: bool
Read table as a single column.
upscale_images: bool
Scale up low-resolution images.
languages: List[str]
List of strings representing Omnipage Language Options.
cells: bool
Return table information for post-processing rules
force_render: bool
Force rendering
native_layout: bool
Native layout
native_pdf: bool
Native pdf
table_read_order: TableReadOrder
Read table by row or column.
ReadApiOcrOptionsInput (type)
ReadAPI OCR engine options for dataset creation.
Imported from:
from indico.types.dataset import ReadApiOcrOptionsInput
auto_rotate: bool
Auto rotate
single_column: bool
Read table as a single column
upscale_images: bool
Scale up low-resolution images
languages: List[str]
List of strings representing ReadAPI Language Options.
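A typo in an option name is easy to miss, so it can help to validate options before sending them. The sketch below checks a set of ReadAPI options against the four fields listed above. The helper `build_readapi_options` is hypothetical, the language code "ENG" is a placeholder rather than a confirmed ReadAPI value, and whether CreateDataset accepts a plain dict for read_api_ocr_options is an assumption.

```python
from typing import Any, Dict

# Field names and types copied from the ReadApiOcrOptionsInput listing.
READAPI_OPTION_TYPES = {
    "auto_rotate": bool,
    "single_column": bool,
    "upscale_images": bool,
    "languages": list,
}


def build_readapi_options(**options: Any) -> Dict[str, Any]:
    """Return a dict of ReadAPI options, rejecting unknown or mistyped fields."""
    for key, value in options.items():
        expected = READAPI_OPTION_TYPES.get(key)
        if expected is None:
            raise KeyError(f"unknown ReadAPI option: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return dict(options)


# "ENG" is a placeholder language code used only for illustration.
opts = build_readapi_options(auto_rotate=True, languages=["ENG"])
```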
CreateDataset
Inputs
Imported from:
from indico.queries.datasets import CreateDataset
name: str
Name of the dataset (required)
files: List[str]
List of file paths (required)
wait: bool = True
Whether to wait for the dataset to finish processing before returning. Defaults to True. (optional)
dataset_type: str = "TEXT"
Type of the dataset. See above for more information. (optional)
from_local_images: bool = False
? (optional)
image_filename_col: str = "filename"
Name of the column in the CSV with the images (optional)
batch_size: int = 20
Number of files to submit at a time (optional)
ocr_engine: OcrEngine = None
Which OCR engine to use (OcrEngine defined above) (optional)
omnipage_ocr_options: OmnipageOcrOptionsInput = None
Omnipage OCR engine options (optional)
read_api_ocr_options: ReadApiOcrOptionsInput = None
ReadAPI OCR engine options (optional)
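To make the batch_size input concrete, the sketch below shows what submitting a fixed number of files at a time looks like for a list of paths. It is only an illustration of the idea, not the client's actual implementation, and the helper name `batches` is hypothetical.

```python
from typing import Iterator, List


def batches(paths: List[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of paths, each holding at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


# 45 placeholder paths split with the default batch size of 20.
files = [f"/path/to/file/file{i}.pdf" for i in range(1, 46)]
sizes = [len(batch) for batch in batches(files)]
```

Here the 45 paths are split into batches of 20, 20, and 5.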
Outputs
Returns a Dataset object.
Try It Out
Try out the CreateDataset call:
from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

# Point the client at your cluster and API token file
my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf",
    "/path/to/file/file2.pdf",
    "/path/to/file/file3.pdf",
]

# CreateDataset waits for the dataset to finish processing the files by default
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)
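Because files is a list of local paths, a quick pre-flight check can catch a mistyped path before anything is uploaded. The sketch below uses only the standard library. The helper `missing_files` is hypothetical and not part of the Indico client.

```python
import os
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not point at an existing regular file."""
    return [p for p in paths if not os.path.isfile(p)]
```

One possible use is to call missing_files(dataset_filepaths) and stop with an error if the result is non-empty, before calling CreateDataset.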