Creating a Dataset
Create a dataset and upload the associated files.
Introduction
CreateDataset creates a Dataset: a collection of original documents, in a variety of file types, that your model learns to label.
By default, this query waits for the dataset to finish processing before returning a Dataset response object (see below for the definition of the Dataset object). If you do not want this behavior, set wait=False.
For more information related to Datasets and the different configurations, check out our user guides.
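When wait=False is set, the caller has to decide when the dataset is ready. The following is a minimal polling sketch, not the client's own waiting logic. The helper name wait_until_done, the terminal status names "COMPLETE" and "FAILED", and the interval and timeout defaults are all assumptions for illustration; check the status values against your Indico version.

```python
import time
from typing import Callable, Iterable

# Assumed terminal status names; verify them against your Indico version.
TERMINAL_STATUSES = ("COMPLETE", "FAILED")


def wait_until_done(
    get_status: Callable[[], str],
    terminal: Iterable[str] = TERMINAL_STATUSES,
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    terminal = set(terminal)
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"dataset still {status!r} after {timeout} seconds")
        # Pause between polls; the sleep function is injectable for testing.
        sleep(interval)
        waited += interval
```

With a real client, get_status could be something like `lambda: client.call(GetDataset(dataset.id)).status`, assuming GetDataset accepts the dataset ID and the returned Dataset exposes the status field documented below.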
Dataset (type)
The Dataset object returned from queries such as CreateDataset, GetDataset, ListDatasets, etc.
Imported from:
from indico.types.dataset import Dataset
id: int
The ID of the dataset
name: str
Name of the dataset
row_count: int
Number of rows in the dataset
status: str
Status of the dataset
permissions: str
Permissions on the dataset
files: List[Datafile]
Names of the file(s) included in the dataset
labelsets: List[LabelSet]
LabelSets associated with this dataset
datacolumns: List[DataColumn]
DataColumn(s) of the dataset
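As a small illustration of reading the fields listed above, the sketch below builds a one-line summary from a Dataset-like object. The summarize_dataset helper is hypothetical, and the SimpleNamespace is only a stand-in with the documented field names; a real Dataset would come from client.call(...).

```python
from types import SimpleNamespace
from typing import Any


def summarize_dataset(dataset: Any) -> str:
    """Build a one-line summary from the documented Dataset fields."""
    return (
        f"Dataset {dataset.id} ({dataset.name!r}): status={dataset.status}, "
        f"rows={dataset.row_count}, files={len(dataset.files)}, "
        f"labelsets={len(dataset.labelsets)}"
    )


# Stand-in object carrying the documented field names (not a real Dataset).
example = SimpleNamespace(
    id=42,
    name="pdf-dataset",
    row_count=3,
    status="COMPLETE",
    files=["f1", "f2", "f3"],
    labelsets=[],
)
print(summarize_dataset(example))
```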
OcrEngine (type/enum)
The enum to use when defining ocr_engine in CreateDataset.
Imported from:
from indico.types.dataset import OcrEngine
OMNIPAGE
READAPI
READAPI_V2
READAPI_TABLES_V1
OmnipageOcrOptionsInput (type)
Omnipage-specific OCR options for dataset creation.
Imported from:
from indico.types.dataset import OmnipageOcrOptionsInput
auto_rotate: bool
Auto rotate
single_column: bool
Read table as a single column.
upscale_images: bool
Scale up low-resolution images.
languages: List[str]
List of strings representing Omnipage Language Options.
cells: bool
Return table information for post-processing rules
force_render: bool
Force rendering
native_layout: bool
Native layout
native_pdf: bool
Native pdf
table_read_order: TableReadOrder
Read table by row or column.
ReadApiOcrOptionsInput (type)
ReadAPI OCR engine options for dataset creation.
Imported from:
from indico.types.dataset import ReadApiOcrOptionsInput
auto_rotate: bool
Auto rotate
single_column: bool
Read table as a single column
upscale_images: bool
Scale up low resolution images
languages: List[str]
List of strings representing ReadAPI Language Options.
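The two option types above share four fields (auto_rotate, single_column, upscale_images, languages), and each pairs with a different CreateDataset keyword. The sketch below collects those shared options into plain dictionaries keyed by the matching keyword. It is an illustration only: the ocr_kwargs helper is hypothetical, the rule that every non-Omnipage engine uses read_api_ocr_options is an assumption, "ENG" is a placeholder language value, and whether the client accepts plain dictionaries or requires the imported types should be checked against your client version.

```python
from typing import Any, Dict, List, Optional

# Engine names as listed for the OcrEngine enum above.
OCR_ENGINES = ("OMNIPAGE", "READAPI", "READAPI_V2", "READAPI_TABLES_V1")


def ocr_kwargs(
    engine: str,
    auto_rotate: Optional[bool] = None,
    single_column: Optional[bool] = None,
    upscale_images: Optional[bool] = None,
    languages: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Collect the options shared by both engines under the matching keyword."""
    if engine not in OCR_ENGINES:
        raise ValueError(f"unknown OCR engine: {engine!r}")
    options = {
        "auto_rotate": auto_rotate,
        "single_column": single_column,
        "upscale_images": upscale_images,
        "languages": languages,
    }
    # Drop unset options so that any defaults on the platform side still apply.
    options = {key: value for key, value in options.items() if value is not None}
    # Assumed routing: Omnipage options for OMNIPAGE, ReadAPI options otherwise.
    key = "omnipage_ocr_options" if engine == "OMNIPAGE" else "read_api_ocr_options"
    return {"ocr_engine": engine, key: options}


print(ocr_kwargs("READAPI", auto_rotate=True, languages=["ENG"]))
```

In a real call, the engine name would be replaced by the corresponding OcrEngine member before being passed as ocr_engine.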
CreateDataset
CreateDataset (query) Inputs
Imported from:
from indico.queries.datasets import CreateDataset
name: str
Name of the dataset (required)
files: List[str]
List of file paths (required)
wait: bool = True
Whether to wait for the dataset to finish processing before returning. By default, CreateDataset waits. (optional)
dataset_type: str = "TEXT"
Type of the dataset; the default is "TEXT", and the example in Try It Out uses "DOCUMENT". (optional)
from_local_images: bool = False
? (optional)
image_filename_col: str = "filename"
Name of the column in the CSV with the images (optional)
batch_size: int = 20
Number of files to submit at a time (optional)
ocr_engine: OcrEngine = None
Which OCR engine to use (OcrEngine is defined above) (optional)
omnipage_ocr_options: OmnipageOcrOptionsInput = None
Omnipage OCR engine options (optional)
read_api_ocr_options: ReadApiOcrOptionsInput = None
ReadAPI OCR engine options (optional)
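To make the batch_size input concrete, the sketch below splits a file list into groups of at most batch_size paths. It illustrates the idea of submitting files a batch at a time; it is not the client's internal implementation, and the batches helper is a hypothetical name.

```python
from typing import Iterator, List


def batches(files: List[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of files, each holding at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


# 45 placeholder paths split with the default batch size of 20.
paths = [f"/path/to/file/file{i}.pdf" for i in range(1, 46)]
print([len(batch) for batch in batches(paths)])  # prints [20, 20, 5]
```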
CreateDataset (query) Outputs
Returns a Dataset object.
Try It Out
Try out the CreateDataset call:
from indico import IndicoClient, IndicoConfig
from indico.queries.datasets import CreateDataset
from indico.types.dataset import Dataset

# Point the client at your cluster and API token file.
my_config = IndicoConfig(
    host="your-cluster.example.com", api_token_path="./indico_api_token.txt"
)
client = IndicoClient(config=my_config)

dataset_filepaths = [
    "/path/to/file/file1.pdf",
    "/path/to/file/file2.pdf",
    "/path/to/file/file3.pdf",
]

# CreateDataset waits for the dataset to finish processing the files by default.
response: Dataset = client.call(
    CreateDataset(
        name="pdf-dataset",
        files=dataset_filepaths,
        dataset_type="DOCUMENT",
    )
)