Advanced Training Options
Introduction
Advanced users may want to change some of the agent training settings to fine-tune the agents' performance in certain areas. Whether it be improving subjective classification or prediction speed, your agent training settings may be the answer to achieving your desired results. The available settings differ based on the agent type. Below, you will find a comprehensive list of configurable settings organized by agent type.
For most use cases, Indico out-of-the-box agents will meet all of your needs. Modifying advanced agent options will not be necessary for the majority of users. Proceed with caution.
Text Extraction
Agent Settings
| Name | Default | Meaning |
|---|---|---|
| max_empty_chunk_ratio | Default: 1.0 | Controls how much negative data is used during training. The default of 1.0 means one negative chunk for each positive chunk. |
| auto_negative_sampling | Default: true | An agent with auto_negative_sampling enabled trains in three phases. |
| optimize_for | Default: "predict_speed" | Primarily controls the maximum number of words the agent takes into account when making a prediction. Valid options include "predict_speed" and "accuracy". |
| subtoken_predictions | Default: true | Allow Subword Predictions - This setting controls whether agent predictions can contain partial words or must correspond to full word boundaries. |
| base_model | Default: English-only "RoBERTa" agent | Custom agent starting points for problems that require higher throughput or multilingual support. Valid options: "roberta", "small" (distilled version of RoBERTa), "multilingual", "fast", "textcnn", "fasttextcnn". |
| class_weight | Default: "sqrt" | Upweights Rare Classes - Rewards the agent more for correctly predicting rare classes. Options (from rare-class biased to common-class biased): "linear", "sqrt", "log", None. |
| min_steps | Default: | Controls the minimum number of updates made to the agent before training is considered complete. Values from 0 to 300,000 are valid. When data is scarce (e.g., you have only 50 documents in your dataset), it can make sense to set this value to 100000 or higher to squeeze the best performance out of your limited dataset. |
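As a rough illustration of the table above, the following Python sketch checks a settings dictionary against the documented options before it is used for training. The function name `validate_extraction_settings` and the dictionary shape are hypothetical; the sketch does not call any Indico API and only encodes the values listed in the table.

```python
# Hypothetical helper: validates a plain dict of text extraction settings
# against the values documented in the table. It does not call any Indico API.
VALID_BASE_MODELS = {"roberta", "small", "multilingual", "fast", "textcnn", "fasttextcnn"}
VALID_CLASS_WEIGHTS = {"linear", "sqrt", "log", None}
MIN_STEPS_RANGE = (0, 300_000)


def validate_extraction_settings(settings):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if "base_model" in settings and settings["base_model"] not in VALID_BASE_MODELS:
        problems.append(f"base_model {settings['base_model']!r} is not a documented option")
    if "class_weight" in settings and settings["class_weight"] not in VALID_CLASS_WEIGHTS:
        problems.append(f"class_weight {settings['class_weight']!r} is not a documented option")
    if "min_steps" in settings:
        low, high = MIN_STEPS_RANGE
        if not low <= settings["min_steps"] <= high:
            problems.append(f"min_steps must be between {low} and {high}")
    return problems


# Example: a small, multilingual dataset with rare classes.
settings = {"base_model": "multilingual", "class_weight": "linear", "min_steps": 100000}
print(validate_extraction_settings(settings))
```

Settings that are absent from the dictionary are left unchecked, on the assumption that the platform applies its own defaults.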
Common Challenges and Resolutions
Issue: I'm getting low recall and/or all-zero metrics.
Cause: This happens when a user does not label all valid instances of an extraction in a document (e.g., a value appears at the bottom of every page, but the user only labels the value on the first page).
Resolution: Label exhaustively. Make sure ALL valid instances of a label are labeled. Alternatively, set auto_negative_sampling=false.
Note on turning off auto_negative_sampling: Turning off auto_negative_sampling will likely get you a result, but you may have problems with false positives. To help counteract false positives, you may need to increase the value of max_empty_chunk_ratio (e.g., 10.0).
Issue: I have poorly performing fields that require the information from many contextualizing words to produce the right prediction.
Cause: Agents look at roughly 64 words on either side of a given word to classify that word when max_length=128.
Solution: Pass optimize_for='accuracy' as an agent training option. This increases max_length from 128 to 512, which lets your agent look at roughly 256 words on either side of the target word. This setting is also roughly five times more expensive to train and predict with, so it may not be appropriate for high-throughput use cases.
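The arithmetic behind this trade-off can be sketched in a few lines of Python. The mapping from optimize_for to max_length is inferred from the text above (128 for "predict_speed", 512 for "accuracy"), and the helper name is hypothetical.

```python
# Assumed mapping, inferred from the description above; not read from any API.
MAX_LENGTH_BY_OPTIMIZE_FOR = {"predict_speed": 128, "accuracy": 512}


def context_words_per_side(optimize_for):
    """Approximate number of words visible on each side of the target word."""
    max_length = MAX_LENGTH_BY_OPTIMIZE_FOR[optimize_for]
    return max_length // 2


for option in MAX_LENGTH_BY_OPTIMIZE_FOR:
    print(option, context_words_per_side(option))
```

If a field depends on context that sits more than about 64 words away from the value, the default setting is unlikely to capture it.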
Issue: My agent is predicting partial words.
Cause: Only a portion of a word or a digit is captured when subtoken_predictions=true. This may be affecting the way the agent identifies long numbers, addresses, etc.
Solution: Setting subtoken_predictions=false and retraining will correct the partial predictions. Note that this might prevent you from extracting fields like "Currency" if you are highlighting only the "$" in "$123.45", because it forces the agent to return the full price rather than just the "$".
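To make the "$123.45" example concrete, here is a minimal Python sketch of what constraining a prediction to full word boundaries means. It treats whitespace as the word boundary, which is an assumption for illustration only; the function is hypothetical and is not how the platform implements the setting.

```python
def expand_to_word_boundaries(text, start, end):
    """Grow the half-open span [start, end) until whitespace or the text edge bounds it."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "Total due: $123.45 today"
dollar = text.index("$")
partial = (dollar, dollar + 1)                    # a label covering only the "$"
full = expand_to_word_boundaries(text, *partial)  # the same label snapped to the whole word

print(text[partial[0]:partial[1]])
print(text[full[0]:full[1]])
```

With full word boundaries enforced, the "$" label can only ever come back as the whole price, which is why a "Currency" field labeled this way stops being extractable on its own.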
Issue: My agent cannot handle non-English languages.
Cause: The base_model used defaults to an agent that is English-only.
Solution: Set base_model = 'multilingual'
Issue: My agent is performing poorly with rare classes.
Cause: This is likely a class_weight issue. The class_weight argument offers a "sliding scale" from "linear --> None". Options are "linear", "sqrt", "log", and None. Setting to "linear" rewards the agent much more for predicting rare classes in order to compensate for their rarity, whereas setting to None makes no attempt to compensate for class rarity.
Solution: If you're having trouble with the performance of rare classes you can try using "linear" to really reward the agent highly for correctly predicting rare classes, but performance on common classes may suffer as a result.
Issue: My agent is performing poorly with common classes.
Cause: This is likely a class_weight issue. The class_weight argument offers a "sliding scale" from "linear --> None". Options are "linear", "sqrt", "log", and None. Setting to "linear" rewards the agent much more for predicting rare classes in order to compensate for their rarity, whereas setting to None makes no attempt to compensate for class rarity.
Solution: If your common classes are performing more poorly, you can try "log" scaling or None instead.
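The "sliding scale" can be illustrated with a small Python sketch. The formulas below are assumptions chosen only to show the ordering from "linear" (strongest compensation) to None (no compensation); the weighting actually used during training may differ.

```python
import math


def illustrative_class_weights(class_counts, scheme):
    """Per-class weights under an assumed formula for each class_weight option."""
    largest = max(class_counts.values())
    weights = {}
    for label, count in class_counts.items():
        ratio = largest / count  # 1.0 for the most common class, larger for rarer ones
        if scheme == "linear":
            weights[label] = ratio
        elif scheme == "sqrt":
            weights[label] = math.sqrt(ratio)
        elif scheme == "log":
            weights[label] = 1.0 + math.log10(ratio)
        elif scheme is None:
            weights[label] = 1.0
        else:
            raise ValueError(f"unknown class_weight option: {scheme!r}")
    return weights


# Hypothetical label counts: one common field and one rare field.
counts = {"invoice_number": 900, "late_fee": 100}
for scheme in ("linear", "sqrt", "log", None):
    print(scheme, illustrative_class_weights(counts, scheme))
```

Under these assumed formulas the rare class receives a progressively smaller boost as you move from "linear" to "sqrt" to "log" to None, while the most common class always keeps a weight of 1.0.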
Issue: My agent is training too slowly.
Solution: Setting auto_negative_sampling=false will result in much faster training times for long documents but will likely come at the expense of having your agent produce more false positive predictions.
Issue: My agent produces false positives when there are similar extractions in areas of the document that are not labeled.
Cause: This may be a result of disabling auto_negative_sampling.
Solution: Increase the max_empty_chunk_ratio value to 5.0 or higher to help correct for this. Re-enabling auto_negative_sampling and labeling more exhaustively might also be needed.
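As a back-of-the-envelope illustration, the ratio can be read as "negative chunks per positive chunk". The Python sketch below uses a hypothetical helper and assumes the ratio is applied as a simple multiplier; the sampling performed during training may be more involved.

```python
def approx_negative_chunks(positive_chunks, max_empty_chunk_ratio=1.0):
    """Approximate number of negative (empty) chunks used for a given ratio."""
    return int(positive_chunks * max_empty_chunk_ratio)


positive_chunks = 200  # hypothetical dataset size
for ratio in (1.0, 5.0, 10.0):
    print(ratio, approx_negative_chunks(positive_chunks, ratio))
```

Raising the ratio exposes the agent to more unlabeled regions during training, which is why it can help with false positives, at the cost of more data to train on.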
Text Classification
Agent Settings
| Name | Default | Meaning |
|---|---|---|
| model_type | Default: Standard | Options span a spectrum from cheap/fast to accurate/slow, although the most accurate option is not particularly slow. |
Common Challenges and Resolutions
Issue: My classification accuracy is not what I want.
Solution: Setting model_type = "finetune" may help for classification tasks that are more fine-grained or more subjective.
Issue: My agent isn't picking up on synonyms of the words I've annotated.
Cause: TFIDF options for model_type are most likely your issue here. They work well when just looking for a key phrase, but they are not adept at subjective classification (e.g., sentiment analysis, emotion, anything that requires knowing the order of words in the document).
Solution: Set model_type = "finetune".
Issue: My agent is struggling with subjective classification (e.g., sentiment analysis, emotion, or anything that requires knowing the order of words in the document).
Cause: TFIDF options for model_type are most likely your issue here. They work well when just looking for a key phrase, but they are not adept at subjective classification (e.g., sentiment analysis, emotion, anything that requires knowing the order of words in the document).
Solution: Set model_type = "finetune".
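The word-order limitation is easy to see in a small Python sketch. It assumes single-word (unigram) features, which is the simplest form of the term counts that TF-IDF weighting is built on; the helper is illustrative and is not the platform's feature extractor.

```python
from collections import Counter


def bag_of_words(text):
    """Unigram term counts: the order of the words is discarded."""
    return Counter(text.lower().split())


a = "the service was good not bad"
b = "the service was bad not good"

# The two sentences express opposite sentiment but have identical term counts.
print(bag_of_words(a) == bag_of_words(b))
```

Because both sentences produce the same counts, any classifier working from those counts alone cannot tell them apart, whereas a fine-tuned agent reads the words in sequence.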
Object Detection and Image Classification
Agent Settings
| Name | Default | Meaning |
|---|---|---|
| filter_empty | Default: false | When false, object detection will remove any images that do not have labels. |
| n_epochs | Default: 8 | Number of Epochs - The number of times the agent is updated on each training example. On complex problems, increasing this value may lead to better performance at the cost of longer training times. Values from 1 to 256 are valid. Controls the number of times each example is shown to the agent. In cases where data is scarce, you can increase this value to 32 or 64 and may get better performance. It serves a similar role to |
| use_small_model | Default: false | |
Common Challenges and Resolutions
Issue: I want better agent performance.
Cause: The agent may not have been trained for long enough.
Solution: Increase n_epochs or max_iter.
Warning: This may result in overfitting.
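To get a feel for the cost of training longer, the Python sketch below estimates how many times examples are shown to the agent in total. It uses the n_epochs name, default, and valid range from the settings table; the helper itself is hypothetical.

```python
N_EPOCHS_RANGE = (1, 256)  # valid range from the settings table


def total_example_presentations(n_examples, n_epochs=8):
    """Each example is shown n_epochs times, so the total scales linearly."""
    low, high = N_EPOCHS_RANGE
    if not low <= n_epochs <= high:
        raise ValueError(f"n_epochs must be between {low} and {high}")
    return n_examples * n_epochs


n_examples = 50  # hypothetical small dataset
for n_epochs in (8, 32, 64):
    print(n_epochs, total_example_presentations(n_examples, n_epochs))
```

Going from 8 to 64 epochs multiplies the training work by eight, and repeating the same few examples that many times is what raises the risk of overfitting.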
Issue: My image-heavy agent is outputting false positives.
Cause: By default, object detection will remove any images that do not have labels. This will speed up training significantly if you have a lot of images with empty labels, but it can cause problems with false positives when the agent is shown images that it hasn’t seen during training. Anything the agent hasn’t seen during training should be considered "undefined behavior".
Solution: Set filter_empty = true in order to ensure the agent sees examples of documents with no labels during training.
Issue: I want my agent to train faster.
Solution: Set filter_empty = true. This will speed up training significantly if you have a lot of images with empty labels, but it can cause problems with false positives when the agent is shown images that it hasn't seen during training. Anything the agent hasn't seen during training is basically "undefined behavior". Alternatively, set use_small_model = true.
Issue: I want my agent to predict faster.
Solution: Set use_small_model = true. No other settings will affect prediction speed.
Issue: I want to improve my agent's accuracy.
Solution: Set model_type = 'finetune' for an agent that is more accurate at the cost of some speed. Refer to the settings for object detection above to learn more about the potential implications of this change.
