Advanced Model Training Options

Introduction

Advanced users may want to change some of the model training settings to fine-tune their models' performance in certain areas. Whether it be improving subjective classification or prediction speed, your model training settings may be the answer to achieving your desired results. The available settings differ based on the model type. Below, you will find a comprehensive list of configurable settings organized by model type.

🚧

For most use cases, Indico out-of-the-box models will meet all of your needs.

Modifying advanced model options will not be necessary for the majority of users. Proceed with caution.


Text Extraction Model Settings

Options

Name | Default | Meaning
max_empty_chunk_ratio | Default: 1.0 | Controls how much negative data is used during training. The default of 1.0 means one negative chunk for each positive chunk.
auto_negative_sampling | Default: True | A model with auto_negative_sampling enabled trains in three phases:

1. It trains on the chunks of the document that have labels.
2. It predicts over all of the training data.
3. It retrains on both the chunks with labels and the chunks that produced predictions ("negative" examples).

When auto_negative_sampling is turned off, the model uses only the labeled chunks plus a configurable number of negative chunks taken from the areas around the labeled chunks (set by max_empty_chunk_ratio, which defaults to 1.0).
optimize_for | Default: "predict_speed" | Primarily controls the maximum number of words the model takes into account when making a prediction.
By default, models can "see" roughly 64 words before and after a given word. Setting optimize_for to "accuracy" increases this to roughly 256 words. The "speed" setting keeps the context size the same but reduces the number of times the model sees each example in the dataset.

Valid options: "predict_speed", "accuracy", "speed", "accuracy_fp16", and "predict_speed_fp16"
Note: "accuracy_fp16" and "predict_speed_fp16" use mixed precision to speed up training and prediction, possibly at the cost of a slight drop in model accuracy. These options are advanced and experimental.
subtoken_predictions | Default: True | Allow Subword Predictions - Controls whether model predictions can contain partial words or must correspond to full word boundaries.
base_model | Default: English-only "RoBERTa" model | Custom model starting points for problems that require higher throughput or multilingual support.

Valid options: "roberta", "small" (distilled version of RoBERTa), "multilingual", "fast", "textcnn", "fasttextcnn"
See also: "My model cannot handle non-English languages" below.
class_weight | Default: "sqrt" | Upweights Rare Classes - Rewards the model more for correctly predicting rare classes.

Options (from rare-class biased to common-class biased): "linear", "sqrt", "log", None
min_steps | Default: | Values from 0 to 300,000 are valid. Controls the minimum number of updates made to the model before training is considered complete. In scenarios where data is scarce (e.g., if you have only 50 documents in your dataset), it can make sense to set this value to 100000 or higher to try to squeeze the best performance out of your limited dataset.
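The options in the table above can be collected into a plain dictionary and checked before a training run is submitted. The sketch below is illustrative only: `validate_extraction_options` is a hypothetical helper, not part of any Indico client, and it encodes nothing beyond the valid values listed in the table. How the options are actually passed to the platform is not covered here.

```python
# Allowed values copied from the table above; the helper itself is hypothetical.
VALID_OPTIMIZE_FOR = {
    "predict_speed", "accuracy", "speed", "accuracy_fp16", "predict_speed_fp16",
}
VALID_BASE_MODEL = {
    "roberta", "small", "multilingual", "fast", "textcnn", "fasttextcnn",
}
VALID_CLASS_WEIGHT = {"linear", "sqrt", "log", None}
BOOLEAN_OPTIONS = ("auto_negative_sampling", "subtoken_predictions")


def validate_extraction_options(options):
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    if "optimize_for" in options and options["optimize_for"] not in VALID_OPTIMIZE_FOR:
        problems.append(f"optimize_for: unknown value {options['optimize_for']!r}")
    if "base_model" in options and options["base_model"] not in VALID_BASE_MODEL:
        problems.append(f"base_model: unknown value {options['base_model']!r}")
    if "class_weight" in options and options["class_weight"] not in VALID_CLASS_WEIGHT:
        problems.append(f"class_weight: unknown value {options['class_weight']!r}")
    for name in BOOLEAN_OPTIONS:
        if name in options and not isinstance(options[name], bool):
            problems.append(f"{name}: expected True or False")
    if "min_steps" in options:
        steps = options["min_steps"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(steps, int) and not isinstance(steps, bool)
        if not (is_int and 0 <= steps <= 300_000):
            problems.append("min_steps: expected an integer from 0 to 300,000")
    return problems


options = {
    "optimize_for": "accuracy",
    "base_model": "multilingual",
    "class_weight": None,
    "min_steps": 100000,
}
print(validate_extraction_options(options))
```

An empty list here only means the values match the table; it says nothing about whether they suit a particular dataset.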


Common Text Extraction Challenges and Resolutions

Text Extraction

Issue: I'm getting low recall and/or all-zero metrics.

Cause: This happens when a user does not label all valid instances of an extraction in a document (e.g., a value appears at the bottom of every page, but the user only labels the value on the first page).

Resolution: Label exhaustively. Make sure ALL valid instances of a label are labeled. Alternatively, set auto_negative_sampling = False.

📘

Note on Turning off auto_negative_sampling

Turning off auto_negative_sampling will likely get you a result, but you may have problems with false positives. To help counteract false positives, you may need to increase the value of max_empty_chunk_ratio (e.g., to 10.0).
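To get a feel for what raising max_empty_chunk_ratio does, the arithmetic can be sketched as follows. This assumes the ratio acts as a simple cap of "negative chunks per positive chunk", as the option description suggests; the function name and the rounding-down behavior are assumptions made for illustration, not documented behavior.

```python
import math


def max_negative_chunks(n_positive_chunks, max_empty_chunk_ratio=1.0):
    """Upper bound on negative chunks used in training (assumed rounding down)."""
    return math.floor(n_positive_chunks * max_empty_chunk_ratio)


# A dataset with 40 labeled (positive) chunks under three ratio settings.
for ratio in (1.0, 5.0, 10.0):
    print(ratio, max_negative_chunks(40, ratio))
```

A higher ratio shows the model more unlabeled text, which is why it can help with false positives, and it also means more data to train on.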


I have poorly performing fields that require the information from many contextualizing words to produce the right prediction.

Cause: With max_length = 128, models look at roughly 64 words on either side of a given word to classify that word.
Solution: Pass optimize_for = 'accuracy' as a model training option. This increases max_length from 128 to 512, which lets your model look at roughly 256 words on either side of the target word. It is also roughly five times more expensive to train and predict with this setting, so it may not be appropriate for high-throughput use cases.
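The relationship between max_length and visible context can be written down as a rough rule of thumb. The sketch below assumes the window is split evenly around the target word, which matches the two figures quoted above (128 gives about 64 words per side, 512 gives about 256); the function and dictionary names are hypothetical, and only the two settings whose max_length is stated here are included.

```python
# max_length values stated in this section; other optimize_for settings are
# omitted because their max_length is not given here.
MAX_LENGTH_BY_SETTING = {"predict_speed": 128, "accuracy": 512}


def approx_context_words_per_side(max_length):
    """Rough number of words visible on each side of the target word."""
    return max_length // 2


for setting, max_length in MAX_LENGTH_BY_SETTING.items():
    print(setting, approx_context_words_per_side(max_length))
```

If the words that disambiguate a field usually sit more than about 64 words away from it, the default window is unlikely to include them.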


My model is predicting partial words.

Cause: With subtoken_predictions = True, the model can capture only a portion of a word or number. This may affect the way the model identifies long numbers, addresses, etc.
Solution: Set subtoken_predictions = False and retrain to correct the partial predictions. Note that this might prevent you from extracting fields like "Currency" if you are highlighting only the "$" in "$123.45", because it forces the model to return the full price rather than just the "$".
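The effect of requiring full word boundaries can be illustrated with a small span-snapping function. This is a sketch of the general idea only, assuming words are delimited by whitespace; `snap_to_word_boundaries` is a hypothetical name, and the platform's real tokenization may differ.

```python
def snap_to_word_boundaries(text, start, end):
    """Expand the character span [start, end) outward to the nearest whitespace."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "Total due: $123.45 by Friday"
dollar = text.index("$")

# A partial-word span covering only the "$" grows to cover the whole price.
start, end = snap_to_word_boundaries(text, dollar, dollar + 1)
print(text[start:end])
```

This is the trade-off described above: partial numbers are no longer truncated, but a label that is intentionally a fragment of a word can no longer be returned on its own.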


My model cannot handle non-English languages.

Cause: The base_model used defaults to a model that is English-only.
Solution: Set base_model = 'multilingual'


My model is performing poorly with rare classes.

Cause: This is likely a class_weight issue. The class_weight argument offers a sliding scale from "linear" to None. Options are "linear", "sqrt", "log", and None. Setting it to "linear" rewards the model much more for predicting rare classes in order to compensate for their rarity, whereas setting it to None makes no attempt to compensate for class rarity.
Solution: If rare classes are performing poorly, try "linear" to reward the model heavily for correctly predicting them. Performance on common classes may suffer as a result.


My model is performing poorly with common classes.

Cause: This is likely a class_weight issue. The class_weight argument offers a sliding scale from "linear" to None. Options are "linear", "sqrt", "log", and None. Setting it to "linear" rewards the model much more for predicting rare classes in order to compensate for their rarity, whereas setting it to None makes no attempt to compensate for class rarity.
Solution: If your common classes are the ones performing poorly, try "log" scaling or None instead.
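The sliding scale behind class_weight can be made concrete with a toy weighting function. The exact formulas used during training are not stated in this document, so the ones below are assumptions chosen only to show the ordering: the rarest class gets the largest boost under "linear", a smaller one under "sqrt" and "log", and none under None.

```python
import math


def class_weights(counts, scheme="sqrt"):
    """Toy per-class weights relative to the most common class (weight 1.0)."""
    most_common = max(counts.values())
    weights = {}
    for label, count in counts.items():
        ratio = most_common / count  # 1.0 for the most common class
        if scheme == "linear":
            weights[label] = ratio
        elif scheme == "sqrt":
            weights[label] = math.sqrt(ratio)
        elif scheme == "log":
            weights[label] = 1.0 + math.log10(ratio)  # assumed log form
        elif scheme is None:
            weights[label] = 1.0
        else:
            raise ValueError(f"unknown scheme: {scheme!r}")
    return weights


# Hypothetical label counts: one common field and one rare field.
counts = {"Invoice Number": 1000, "PO Number": 10}
for scheme in ("linear", "sqrt", "log", None):
    print(scheme, class_weights(counts, scheme))
```

With these assumed formulas, a class that is 100 times rarer is weighted 100, 10, 3, and 1 times as heavily under the four schemes, which is why moving toward "linear" helps rare classes and moving toward None helps common ones.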


My model is training too slowly.

Solution: Setting auto_negative_sampling = False results in much faster training times for long documents, but it will likely come at the expense of your model producing more false positive predictions.
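The reason for the speedup is easier to see when the three training phases of auto_negative_sampling are written out. The sketch below is a simplified stand-in, not the platform's implementation: the chunk format, the `select_training_chunks` name, and the `predict` callable (which plays the role of the model trained in phase 1) are all hypothetical.

```python
def select_training_chunks(chunks, predict, auto_negative_sampling=True):
    """Return the chunks used for the final training pass.

    chunks:  list of dicts with "text" and "labels" keys (hypothetical format).
    predict: callable returning a list of predictions for a chunk's text;
             it stands in for the model trained on labeled chunks in phase 1.
    """
    labeled = [c for c in chunks if c["labels"]]
    if not auto_negative_sampling:
        # Only labeled chunks; the extra negatives governed by
        # max_empty_chunk_ratio are left out of this sketch.
        return labeled
    # Phase 2: predict over all of the training data.
    # Phase 3: keep unlabeled chunks that produced predictions as negatives.
    negatives = [c for c in chunks if not c["labels"] and predict(c["text"])]
    return labeled + negatives


chunks = [
    {"text": "Total: $50", "labels": ["Total"]},
    {"text": "Total on page 2: $50", "labels": []},
    {"text": "Thank you for your business", "labels": []},
]


def toy_predict(chunk_text):
    # Toy phase-1 model: predicts "Total" wherever the word appears.
    return ["Total"] if "Total" in chunk_text else []


print(len(select_training_chunks(chunks, toy_predict, True)))
print(len(select_training_chunks(chunks, toy_predict, False)))
```

With sampling enabled, the unlabeled "Total on page 2" chunk is fed back as a negative example, which also shows why incomplete labeling hurts recall. With it disabled, the prediction pass and the retraining pass are skipped entirely.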


My model produces false positives when there are similar extractions in areas of the document that are not labeled.

Cause: This may be a result of disabling auto_negative_sampling.
Solution: Increase the max_empty_chunk_ratio value to 5.0 or higher to help correct for this. Re-enabling auto_negative_sampling and labeling more exhaustively might also be needed.



Text Classification Model Settings

Options


Name | Default | Meaning
model_type | Default: Standard | Options (spectrum from cheap/fast to accurate/slow*):

TFIDF_LR = "tfidf_lr"
TFIDF_GBT = "tfidf_gbt"
STANDARD = "standard" (default)
FINETUNE = "finetune"

*The most accurate option is not particularly slow.

Common Text Classification Challenges and Resolutions

My classification accuracy is not what I want.

Solution: Setting model_type = "finetune" may help for classification tasks that are more fine-grained or more subjective.


My model isn't picking up on synonyms of the words I've annotated.

Cause: TFIDF options for model_type are most likely your issue here. They work well when just looking for a key phrase, but they are not adept at subjective classification (e.g., sentiment analysis, emotion, or anything that requires knowing the order of words in the document).

Solution: Set model_type = "finetune"


My model is struggling with subjective classification (e.g., sentiment analysis, emotion, or anything that requires knowing the order of words in the document)

Cause: TFIDF options for model_type are most likely your issue here. They work well when just looking for a key phrase, but they are not adept at subjective classification (e.g., sentiment analysis, emotion, or anything that requires knowing the order of words in the document).

Solution: Set model_type = "finetune"
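Why word order matters for the TFIDF options can be shown with plain word counts. The sketch below assumes single-word features and shows only the counting step that TF-IDF weighting builds on; it is an illustration of the limitation, not the platform's feature pipeline.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercase, whitespace-separated word, ignoring order."""
    return Counter(text.lower().split())


positive = "the service was good not bad"
negative = "the service was bad not good"

# Opposite sentiment, identical word counts.
print(bag_of_words(positive) == bag_of_words(negative))
```

Because both sentences reduce to the same counts, a model that sees only those counts cannot tell them apart, whereas a model that reads the words in sequence can.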



Object Detection/Image Classification Model Settings

Name | Default | Meaning
filter_empty | Default: False | When False, object detection will remove any images that do not have labels.
n_epochs | Default: 8 | Number of Epochs - The number of times each training example is shown to the model. Values from 1 to 256 are valid. On complex problems, increasing this value may lead to better performance at the cost of longer training times. In cases where data is scarce, you can increase this value to 32 or 64 and may get better performance. It serves a similar role to min_steps, so you likely only need to change one or the other.
use_small_model | Default: False | When True, uses a smaller model that trains and predicts faster.
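The cost side of raising n_epochs can be estimated with simple arithmetic. The sketch below assumes training time grows roughly in proportion to the number of times examples are shown to the model; the function names are hypothetical, and only the default of 8 and the valid range of 1 to 256 come from the table above.

```python
DEFAULT_N_EPOCHS = 8


def example_presentations(n_examples, n_epochs=DEFAULT_N_EPOCHS):
    """Total number of times training examples are shown to the model."""
    if not 1 <= n_epochs <= 256:
        raise ValueError("n_epochs must be between 1 and 256")
    return n_examples * n_epochs


def relative_training_cost(n_epochs):
    """Training cost relative to the default, assuming linear scaling."""
    return n_epochs / DEFAULT_N_EPOCHS


# A small dataset of 200 images at the default and at 32 epochs.
print(example_presentations(200))
print(example_presentations(200, 32), relative_training_cost(32))
```

Under that assumption, moving from 8 to 32 epochs means about four times the training work, which is the trade-off to weigh against the possible accuracy gain on small datasets.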

Object Detection Common Challenges and Resolutions


I want better model performance.
Cause: The model may not have trained for long enough.

Solution: Increase n_epochs or max_iter.

🚧

This may result in overfitting.


My image-heavy model is outputting false positives.
Cause: By default, object detection will remove any images that do not have labels. This will speed up training significantly if you have a lot of images with empty labels, but it can cause problems with false positives when the model is shown images that it hasn’t seen during training. Anything the model hasn’t seen during training should be considered “undefined behavior.”

Solution: Set filter_empty = true in order to ensure the model sees examples of documents with no labels during training.


I want my model to train faster.

Solution: Set filter_empty = True. This will speed up training significantly if you have a lot of images with empty labels, but it can cause problems with false positives when the model is shown images that it hasn't seen during training. Anything the model hasn't seen during training is effectively "undefined behavior." Alternatively, set use_small_model = True.


I want my model to predict faster.

Solution: Set use_small_model = True. No other settings will affect prediction speed.



Image Classification Common Challenges and Resolutions


Issue: I want to improve my model's accuracy.

Solution: Set model_type to 'finetune' for a model that is more accurate at the cost of some speed. Refer to the settings for object detection above to learn more about the potential implications of this change.