Why is it necessary?
SmartPDF AI uses supervised training with millions of manually processed invoices as its training set. However, these training data are prone to human error, at both the instance level and the population level. The rate of human error in training data ranges from 0.8% to 30%, which makes automatic cleaning of the training data set essential. Mosquito is the tool used for this automatic cleaning, and it ensures improved accuracy.
How does it work?
Data cleaning for SmartPDF AI combats human errors in the training data set using two boosting techniques, one of which is centered on the Mosquito data cleaner.
The first data cleaning technique is the committee technique, which uses a committee of six models, each trained on a fraction of the data set that is rotated when possible. The rotation factor depends on the amount of training data per group. Because the models are trained on significantly different data sets, each committee member is unique, which helps with outlier detection. Disagreement among the committee's predictions indicates human error in the data set. Rules define how many committee models must agree on the final field prediction, especially for important fields. The committee technique removed approximately 30% of errors from the training data.
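The disagreement check above can be sketched as follows. This is a minimal illustration, not the production implementation: the function names, the agreement threshold of 4, and the toy models are all assumptions.

```python
from collections import Counter

def committee_predictions(models, sample):
    # Each of the six committee models predicts the field value independently.
    return [m(sample) for m in models]

def is_suspect(predictions, min_agreement=4):
    # If fewer than `min_agreement` models agree on the top prediction,
    # the sample's label is flagged as a possible human error.
    # (The threshold of 4 is an illustrative assumption.)
    top_value, count = Counter(predictions).most_common(1)[0]
    return count < min_agreement

# Hypothetical committee: five models agree, one disagrees.
models = [lambda s, i=i: s["amount"] if i < 5 else "wrong" for i in range(6)]
sample = {"amount": "120.00"}
preds = committee_predictions(models, sample)
print(is_suspect(preds))  # prints False: 5 of 6 agree, above the threshold
```

In practice the agreement rule would be stricter for important fields (e.g. invoice total) than for minor ones, per the field-specific rules mentioned above.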
The second technique is anomaly detection using a weak learner. The weak learner, called Mosquito, is lightweight and deliberately low-capacity, so it trains quickly on the labelled training data. The model is then reapplied to its own training data to find the regions it fits especially badly. The samples the model fails to predict properly are considered anomalies and are removed from the main model's training. This is the opposite of standard boosting (such as XGBoost), where the weight of poorly fit samples is increased so that the model fits them better. In Mosquito, the sample weight of the wrong patterns is reduced so that the model ignores these suspicious data, deliberately fitting worse to them. The boosting in Mosquito ultimately affects the final model rather than the training data, which remains dirty: the model gets smarter and learns to ignore the dirtiness, so it makes fewer mistakes in its predictions. This technique also removes about 30% of errors from the final prediction.
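The inverse-boosting idea can be sketched as below: train a cheap weak learner, reapply it to its own training data, and reduce (rather than increase) the weight of samples it gets wrong. This is a toy sketch under stated assumptions; the weak learner here is a simple per-feature majority predictor, and the 0.1 down-weight factor and all names are illustrative.

```python
from collections import Counter

def train_weak_learner(samples, labels):
    # Toy stand-in for Mosquito: predicts the most frequent label
    # seen for each feature value in the labelled training data.
    table = {}
    for x, y in zip(samples, labels):
        table.setdefault(x, Counter())[y] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in table.items()}

def reweight(samples, labels, weak_model, down=0.1):
    # Reapply the weak learner to its own training data; samples it fails
    # on are treated as suspicious and down-weighted, so the main model
    # effectively ignores them (opposite of XGBoost-style up-weighting).
    return [1.0 if weak_model.get(x) == y else down
            for x, y in zip(samples, labels)]

samples = ["inv_a", "inv_a", "inv_a", "inv_b"]
labels  = ["total", "total", "date",  "total"]  # one likely mislabel
weak = train_weak_learner(samples, labels)
print(reweight(samples, labels, weak))  # prints [1.0, 1.0, 0.1, 1.0]
```

The key design point is that the dirty samples are never deleted; their weights are lowered, so the final model is trained on the full data set but is steered away from the suspicious labels.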
During invoice extraction, Mosquito uses techniques such as hierarchical clustering or pivot-based clustering to group invoices into templates. Invoices within a template share static parts called anchors and variable parts called fields. Ground truth (GT) is used to match labels to fields, but because GT contains human errors, this matching is ambiguous. The ambiguity is resolved by majority voting: each anchor candidate is scored by its number of correct extractions, the anchor with the highest score is considered the winner, and all GT samples that disagree with the winner are considered anomalies. Such anomalies are marked with lower sample weight for the main model training. This approach significantly boosts the quality of the production model.
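The majority-voting step above can be sketched as follows. This is a minimal sketch, assuming a simplified GT representation: the function name, the sample structure, and the 0.1 anomaly weight are all assumptions, not the production schema.

```python
def resolve_anchor(gt_samples, candidates, low_weight=0.1):
    # Score each anchor candidate by its number of correct extractions,
    # i.e. how many GT samples agree with that anchor choice.
    scores = {a: sum(1 for s in gt_samples if s["anchor"] == a)
              for a in candidates}
    winner = max(scores, key=scores.get)
    # GT samples that disagree with the winning anchor are anomalies
    # and receive a lower sample weight for main model training.
    weights = [1.0 if s["anchor"] == winner else low_weight
               for s in gt_samples]
    return winner, weights

# Two GT samples point at "Total:", one (likely a labelling error)
# points at "Amount:"; majority voting picks "Total:".
gt = [{"anchor": "Total:"}, {"anchor": "Total:"}, {"anchor": "Amount:"}]
winner, weights = resolve_anchor(gt, {"Total:", "Amount:"})
print(winner, weights)  # prints: Total: [1.0, 1.0, 0.1]
```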