Advanced Training Options

Introduction

Advanced users may want to change some of the agent training settings to fine-tune the agents' performance in certain areas. Whether it be improving subjective classification or prediction speed, your agent training settings may be the answer to achieving your desired results. The available settings differ based on the agent type. Below, you will find a comprehensive list of configurable settings organized by agent type.

🚧

For most use cases, Indico out-of-the-box agents will meet all of your needs. Modifying advanced agent options will not be necessary for the majority of users. Proceed with caution.

Text Extraction

Agent Settings

Name | Default | Meaning
max_empty_chunk_ratio
Default: 1.0
Controls how much negative data is used during training. The default is the float 1.0, meaning one negative chunk for each positive chunk.

auto_negative_sampling
Default: true
An agent with auto_negative_sampling enabled trains in three phases:
  1. It trains on the chunks of the document that have labels.

  2. It predicts over all of the training data.

  3. It retrains on both the chunks with labels and the chunks that produce any predictions ("negative" examples).

    When auto_negative_sampling is turned off, the agent uses only the labeled chunks and a configurable number of negative chunks from the areas around the labeled chunks (set by max_empty_chunk_ratio, which defaults to 1.0).

optimize_for
Default: "predict_speed"

Primarily controls the maximum number of words the agent takes into account when making a prediction.
By default, agents can "see" roughly 64 words before and after a given word. Setting optimize_for to "accuracy" increases this to roughly 256 words. Using the "speed" setting reduces the number of times the agent sees each example in the dataset and keeps the context size the same.

Valid options: "predict_speed", "accuracy", "speed", "accuracy_fp16", "predict_speed_fp16"
Note: "accuracy_fp16" and "predict_speed_fp16" use mixed precision to speed up training and prediction, possibly at the cost of a slight degradation in agent performance. These options are advanced and experimental.

subtoken_predictions
Default: True
Allow Subword Predictions - This setting controls whether agent predictions can contain partial words or must correspond to full word boundaries.

base_model
Default: English-only "RoBERTa" agent

Custom agent starting points for problems that require higher throughput or multilingual support.

Valid options: "roberta", "small" (distilled version of RoBERTa), "multilingual", "fast", "textcnn", "fasttextcnn"

class_weight
Default: "sqrt"

Upweights Rare Classes - Rewards the agent more for correctly predicting rare classes.

Options (from rare-class biased to common-class biased): "linear", "sqrt", "log", None

min_steps
Default:
Values from 0 to 300,000 are valid. Controls the minimum number of updates made to the agent before training is considered complete. In scenarios where data is scarce (e.g., if you have only 50 documents in your dataset), it can make sense to set this value to 100000 or higher to squeeze the best performance out of your limited dataset.
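The settings listed above can be collected into a single options object and checked before training. The following is a minimal sketch in Python, assuming the options are expressed as a plain dictionary. The helper name build_extraction_options, the DEFAULTS mapping, the use of "roberta" as the value behind the English-only RoBERTa default, and the rule that max_empty_chunk_ratio cannot be negative are all assumptions made for illustration. Nothing here calls the Indico platform or its client library.

```python
# Illustrative only: a hypothetical validator that mirrors the settings table.
# It does not contact the Indico platform.

VALID_OPTIMIZE_FOR = {
    "predict_speed", "accuracy", "speed", "accuracy_fp16", "predict_speed_fp16",
}
VALID_BASE_MODELS = {
    "roberta", "small", "multilingual", "fast", "textcnn", "fasttextcnn",
}
VALID_CLASS_WEIGHTS = {"linear", "sqrt", "log", None}

# Documented defaults. min_steps is left out because its default is not stated.
# "roberta" for base_model is an assumption based on the English-only RoBERTa default.
DEFAULTS = {
    "max_empty_chunk_ratio": 1.0,
    "auto_negative_sampling": True,
    "optimize_for": "predict_speed",
    "subtoken_predictions": True,
    "base_model": "roberta",
    "class_weight": "sqrt",
}


def build_extraction_options(**overrides):
    """Merge overrides into the documented defaults and check each value."""
    unknown = set(overrides) - set(DEFAULTS) - {"min_steps"}
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")

    options = {**DEFAULTS, **overrides}

    if options["optimize_for"] not in VALID_OPTIMIZE_FOR:
        raise ValueError(f"Invalid optimize_for: {options['optimize_for']!r}")
    if options["base_model"] not in VALID_BASE_MODELS:
        raise ValueError(f"Invalid base_model: {options['base_model']!r}")
    if options["class_weight"] not in VALID_CLASS_WEIGHTS:
        raise ValueError(f"Invalid class_weight: {options['class_weight']!r}")
    # Assumption: a negative ratio is meaningless, so reject it.
    if options["max_empty_chunk_ratio"] < 0:
        raise ValueError("max_empty_chunk_ratio must not be negative")
    # Documented range for min_steps: 0 to 300,000.
    if "min_steps" in options and not 0 <= options["min_steps"] <= 300_000:
        raise ValueError("min_steps must be between 0 and 300,000")

    return options


if __name__ == "__main__":
    # Scarce multilingual data: switch the base model and train for longer.
    print(build_extraction_options(base_model="multilingual", min_steps=100_000))
```

Validating locally like this catches a misspelled option name or an out-of-range value before a long training run is started.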

Common Challenges and Resolutions

Issue: I'm getting low recall and/or all-zero metrics.

Cause: This happens when a user does not label all valid instances of an extraction in a document (e.g., a value appears at the bottom of every page, but the user only labels the value on the first page).

Resolution: Label exhaustively. Make sure ALL valid instances of a label are labeled. Alternatively, set auto_negative_sampling=false.

📘

Note on Turning off auto_negative_sampling:

Turning off auto_negative_sampling will likely get you a result, but you may have problems with false positives. To help counteract false positives, you may need to increase the value of max_empty_chunk_ratio (e.g., to 10.0).
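As a rough illustration of what raising max_empty_chunk_ratio means, the sketch below computes the number of negative chunks implied by the "one negative chunk per positive chunk" description of the default. The function name and the rounding-down behavior are assumptions for this example; the platform's actual sampling logic may differ.

```python
import math


def negative_chunk_budget(n_positive_chunks: int,
                          max_empty_chunk_ratio: float = 1.0) -> int:
    """Approximate number of negative chunks used alongside the labeled ones.

    Assumption: the budget is the ratio times the number of positive
    (labeled) chunks, rounded down.
    """
    if n_positive_chunks < 0 or max_empty_chunk_ratio < 0:
        raise ValueError("inputs must not be negative")
    return math.floor(n_positive_chunks * max_empty_chunk_ratio)


if __name__ == "__main__":
    # Hypothetical dataset with 20 labeled chunks.
    print(negative_chunk_budget(20))        # default ratio of 1.0
    print(negative_chunk_budget(20, 10.0))  # ratio raised to 10.0

    # The corresponding settings, written as a plain dictionary.
    options = {"auto_negative_sampling": False, "max_empty_chunk_ratio": 10.0}
    print(options)
```

Under these assumptions, raising the ratio from 1.0 to 10.0 exposes the agent to ten times as many unlabeled chunks, which is why it can help with false positives.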


Issue: I have poorly performing fields that require the information from many contextualizing words to produce the right prediction.

Cause: Agents look at roughly 64 words on either side of a given word to classify that word when max_length=128.

Solution: Pass optimize_for='accuracy' as an agent training option. This increases max_length from 128 to 512, which lets your agent look at roughly 256 words on either side of the target word. This setting is also roughly five times more expensive to train and predict with, so it may not be appropriate for high-throughput use cases.
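The relationship between max_length and the visible context can be sketched as simple arithmetic. The example below assumes the window is split evenly around the target word, which matches the "roughly 64" and "roughly 256" figures given here; the mapping name and function name are hypothetical, and the fp16 variants are left out because their context sizes are not stated.

```python
# max_length values taken from the text: 128 by default, 512 for "accuracy".
# "speed" is documented as keeping the context size the same as the default.
MAX_LENGTH_BY_OPTIMIZE_FOR = {
    "predict_speed": 128,
    "speed": 128,
    "accuracy": 512,
}


def approx_words_per_side(optimize_for: str) -> int:
    """Rough number of words visible before and after a given word.

    Assumption: the window is split evenly on both sides of the target word.
    """
    max_length = MAX_LENGTH_BY_OPTIMIZE_FOR[optimize_for]
    return max_length // 2


if __name__ == "__main__":
    for setting in ("predict_speed", "accuracy"):
        print(setting, approx_words_per_side(setting))
```

If the words that disambiguate a field usually sit more than about 64 words away from it, the default window cannot see them, and the larger window is worth its extra cost.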


Issue: My agent is predicting partial words.

Cause: When subtoken_predictions=true, the agent may capture only a portion of a word or number. This may affect the way the agent identifies long numbers, addresses, etc.

Solution: Setting subtoken_predictions=false and retraining will correct the partial predictions. Note that this might prevent you from extracting fields like "Currency" if you are highlighting only the "$" in "$123.45", because the agent would be forced to return the full price rather than just the "$".
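The effect of requiring full word boundaries can be illustrated with a small span-snapping function. This is a sketch only: it assumes words are delimited by whitespace, and the function name is hypothetical. It is not the platform's implementation, but it shows why a label on "$" alone cannot survive when partial-word predictions are disallowed.

```python
from typing import Tuple


def snap_to_word_boundaries(text: str, start: int, end: int) -> Tuple[int, int]:
    """Expand the span [start, end) until it covers whole words.

    Assumption: a word is any run of non-whitespace characters.
    """
    # Walk left until the previous character is whitespace or the text begins.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Walk right until the next character is whitespace or the text ends.
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


if __name__ == "__main__":
    text = "Total due: $123.45 today"
    dollar = text.index("$")

    # A partial-word span covering only the currency symbol.
    print(text[dollar:dollar + 1])

    # The same span once it is forced onto full word boundaries.
    start, end = snap_to_word_boundaries(text, dollar, dollar + 1)
    print(text[start:end])
```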


Issue: My agent cannot handle non-English languages.

Cause: The base_model used defaults to an agent that is English-only.

Solution: Set base_model = 'multilingual'


Issue: My agent is performing poorly with rare classes.

Cause: This is likely a class_weight issue. The class_weight argument offers a "sliding scale" from "linear" to None. Options are "linear", "sqrt", "log", and None. Setting it to "linear" rewards the agent much more for predicting rare classes in order to compensate for their rarity, whereas setting it to None makes no attempt to compensate for class rarity.

Solution: If you're having trouble with the performance of rare classes, try "linear", which rewards the agent heavily for correctly predicting rare classes. Performance on common classes may suffer as a result.


Issue: My agent is performing poorly with common classes.

Cause: This is likely a class_weight issue. The class_weight argument offers a "sliding scale" from "linear" to None. Options are "linear", "sqrt", "log", and None. Setting it to "linear" rewards the agent much more for predicting rare classes in order to compensate for their rarity, whereas setting it to None makes no attempt to compensate for class rarity.

Solution: If your common classes are performing poorly, try "log" scaling or None instead.
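To make the class_weight "sliding scale" concrete, the sketch below computes per-class weights from label counts under each option. The formulas are assumptions chosen for illustration (a weight based on how much rarer a class is than the most common class, passed through an identity, square-root, or logarithmic scaling); the exact formulas used by the platform are not stated here. The label names and counts are made up.

```python
import math
from typing import Dict, Optional


def class_weights(counts: Dict[str, int], scheme: Optional[str]) -> Dict[str, float]:
    """Illustrative per-class weights for "linear", "sqrt", "log", or None.

    Assumption: rarity is measured as most_common_count / class_count, so the
    most common class always gets a weight of 1.0.
    """
    most_common = max(counts.values())
    weights = {}
    for label, count in counts.items():
        ratio = most_common / count
        if scheme == "linear":
            weights[label] = ratio
        elif scheme == "sqrt":
            weights[label] = math.sqrt(ratio)
        elif scheme == "log":
            weights[label] = 1.0 + math.log(ratio)
        elif scheme is None:
            weights[label] = 1.0  # no compensation for rarity
        else:
            raise ValueError(f"Unknown class_weight scheme: {scheme!r}")
    return weights


if __name__ == "__main__":
    # Hypothetical label counts: one common class and one rare class.
    counts = {"invoice_number": 1000, "late_fee": 10}
    for scheme in ("linear", "sqrt", "log", None):
        print(scheme, class_weights(counts, scheme))
```

With strongly imbalanced counts like these, the weight given to the rare class shrinks as you move from "linear" through "sqrt" and "log" to None, which is the trade-off between rare-class and common-class performance described in the two issues above.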


Issue: My agent is training too slowly.

Solution: Setting auto_negative_sampling=false will result in much faster training times for long documents but will likely come at the expense of having your agent produce more false positive predictions.


Issue: My agent produces false positives when there are similar extractions in areas of the document that are not labeled.

Cause: This may be a result of disabling auto_negative_sampling.

Solution: Increase the max_empty_chunk_ratio value to 5.0 or higher to help correct for this. Re-enabling auto_negative_sampling and labeling more exhaustively might also be needed.
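The difference between the two sampling modes can be sketched as a chunk-selection step. This is a toy model under stated assumptions, not the platform's training code: chunks are identified by position, the produces_prediction callback stands in for the agent's predictions after its first training pass, and "areas around the labeled chunks" is interpreted as the unlabeled chunks nearest to a labeled one. All names are hypothetical.

```python
from typing import Callable, List, Set


def select_training_chunks(
    n_chunks: int,
    labeled: Set[int],
    produces_prediction: Callable[[int], bool],
    auto_negative_sampling: bool = True,
    max_empty_chunk_ratio: float = 1.0,
) -> List[int]:
    """Return the chunk indices used for the final training pass."""
    if not labeled:
        return []
    unlabeled = [i for i in range(n_chunks) if i not in labeled]

    if auto_negative_sampling:
        # After training on labeled chunks and predicting over all chunks,
        # unlabeled chunks that produce predictions become negative examples.
        negatives = [i for i in unlabeled if produces_prediction(i)]
    else:
        # Only a limited number of unlabeled chunks near the labeled ones.
        budget = int(len(labeled) * max_empty_chunk_ratio)
        by_distance = sorted(
            unlabeled,
            key=lambda i: (min(abs(i - j) for j in labeled), i),
        )
        negatives = by_distance[:budget]

    return sorted(set(labeled) | set(negatives))


if __name__ == "__main__":
    # Six chunks, one labeled; the agent wrongly fires on chunk 5.
    fires_on = {2, 5}
    print(select_training_chunks(6, {2}, lambda i: i in fires_on))
    print(select_training_chunks(6, {2}, lambda i: i in fires_on,
                                 auto_negative_sampling=False,
                                 max_empty_chunk_ratio=2.0))
```

In this toy model, the false-positive chunk far from the label is only seen during training when auto_negative_sampling is on, or when max_empty_chunk_ratio is large enough for the budget to reach it.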

Text Classification

Agent Settings

Name | Default | Meaning
model_type
Default: Standard

Options (spectrum from cheap/fast to accurate/slow*):

TFIDF_LR = "tfidf_lr"
TFIDF_GBT = "tfidf_gbt"
STANDARD = "standard"
FINETUNE = "finetune" (default)

*The most accurate option is not particularly slow.
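The four model_type values can be represented as an enumeration so that a typo is caught before training starts. This is a minimal sketch: the ModelType class and parse_model_type helper are hypothetical names for this example and are not part of the Indico client library. Only the member names and string values come from the list above.

```python
from enum import Enum


class ModelType(str, Enum):
    """model_type values, ordered from cheap/fast to accurate/slow."""
    TFIDF_LR = "tfidf_lr"
    TFIDF_GBT = "tfidf_gbt"
    STANDARD = "standard"
    FINETUNE = "finetune"


def parse_model_type(value: str) -> ModelType:
    """Convert user input to a ModelType, raising ValueError if it is unknown."""
    return ModelType(value.strip().lower())


if __name__ == "__main__":
    chosen = parse_model_type("finetune")
    # Written as a plain dictionary of training options.
    options = {"model_type": chosen.value}
    print(options)
```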

Common Challenges and Resolutions

Issue: My classification accuracy is not what I want.

Solution: Setting model_type = "finetune" may help for classification tasks that are more fine-grained or more subjective.


Issue: My agent isn't picking up on synonyms of the words I've annotated.

Cause: TFIDF options for model_type are most likely your issue here. They work well when just looking for a key phrase, but they are not adept at subjective classification (e.g., sentiment analysis, emotion, anything that requires knowing the order of words in the document).

Solution: Set model_type = "finetune".


Issue: My agent is struggling with subjective classification (e.g., sentiment analysis, emotion, or anything that requires knowing the order of words in the document).

Cause: TFIDF options for model_type are most likely your issue here. They work well when just looking for a key phrase, but they are not adept at subjective classification (e.g., sentiment analysis, emotion, anything that requires knowing the order of words in the document).

Solution: Set model_type = "finetune".

Object Detection and Image Classification

Agent Settings

Name | Default | Meaning
filter_empty
Default: false
When false, object detection will remove any images that do not have labels.

n_epochs
Default: 8

Number of Epochs - The number of times the agent is updated on each training example. On complex problems, increasing this value may lead to better performance at the cost of longer training times.

Values from 1 to 256 are valid. Controls the number of times each example is shown to the agent. In cases where data is scarce, you can increase this value to 32 or 64 and may get better performance. It serves a similar role to min_steps, so you likely only need to change one or the other.

use_small_model
Default: false
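Because n_epochs controls how many times each example is shown to the agent, its effect on training length scales with dataset size. The sketch below estimates the number of updates for a given n_epochs value. The batch_size parameter and its default of 4 are assumptions for illustration and are not documented settings; only the default of 8 epochs and the valid range of 1 to 256 come from the description above.

```python
import math


def approx_total_updates(n_examples: int, n_epochs: int = 8,
                         batch_size: int = 4) -> int:
    """Estimate how many updates a training run performs.

    Assumption: one update per batch, with a hypothetical batch_size.
    """
    if not 1 <= n_epochs <= 256:
        raise ValueError("n_epochs must be between 1 and 256")
    if n_examples < 1 or batch_size < 1:
        raise ValueError("n_examples and batch_size must be at least 1")
    updates_per_epoch = math.ceil(n_examples / batch_size)
    return n_epochs * updates_per_epoch


if __name__ == "__main__":
    # A scarce dataset of 100 images at the default and at a raised setting.
    print(approx_total_updates(100))               # n_epochs = 8
    print(approx_total_updates(100, n_epochs=32))  # n_epochs = 32
```

Under these assumptions, raising n_epochs from 8 to 32 quadruples the number of updates, which is why it can help on small datasets and also why training takes longer.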

Common Challenges and Resolutions

Issue: I want better agent performance.

Cause: The agent may not have trained for long enough.

Solution: Increase n_epochs or max_iter.

🚧

Warning: This may result in overfitting.


Issue: My image-heavy agent is outputting false positives.

Cause: By default, object detection will remove any images that do not have labels. This will speed up training significantly if you have a lot of images with empty labels, but it can cause problems with false positives when the agent is shown images that it hasn’t seen during training. Anything the agent hasn’t seen during training should be considered "undefined behavior".

Solution: Set filter_empty = true in order to ensure the agent sees examples of documents with no labels during training.


Issue: I want my agent to train faster.

Solution: Set filter_empty = true. This will speed up training significantly if you have a lot of images with empty labels, but it can cause problems with false positives when the agent is shown images that it hasn't seen during training. Anything the agent hasn't seen during training is basically "undefined behavior". Alternatively, set use_small_model = true.


Issue: I want my agent to predict faster.

Solution: Set use_small_model = true. No other settings will affect prediction speed.



Issue: I want to improve my agent's accuracy.

Solution: Set model_type = 'finetune' for an agent that is more accurate at the cost of some speed. Refer to the settings for object detection above to learn more about the potential implications of this change.