Mosquito data cleaner

What is it?

Mosquito is a novel data cleaning technology used for cleaning data extracted from invoice PDFs.

Why is it necessary?

SmartPDF AI uses supervised training with millions of manually processed invoices as a training set. However, these training data sets are prone to human error, both at the instance level and at the population level. The share of human error in the training data ranges from 0.8 to 30 percent, which makes automatic cleaning of the training data set essential. Mosquito is the tool used for this automatic cleaning, which improves accuracy.

Contact

Timo Sinisalmi
Basware

How does it work?

Data cleaning for SmartPDF AI combats human errors in the training data set by using two boosting techniques, one of which is centered on the Mosquito data cleaner.

The first data cleaning technique is the committee technique. It uses a committee of six models, each trained on a fraction of the data set; the fraction is rotated between models when possible, and the rotating factor depends on the size of the training data per group. Because the models are trained on significantly different subsets, each model is distinct, which helps in outlier detection. Disagreement among the committee models' predictions indicates a possible human error in the data set. Rules define how many committee models must agree on the final field prediction, especially for important fields. The committee boosting technique removed approximately 30 percent of errors during data cleaning.
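
The committee idea can be illustrated with a minimal sketch. This is not the SmartPDF AI implementation: the toy lookup "model", the function names (rotated_subsets, train_lookup_model, flag_suspects), the 50 percent training fraction, and the agreement threshold of four out of six are all assumptions chosen for illustration. Only the committee size of six, the rotated subsets, and the use of an agreement rule come from the description above.

```python
from collections import Counter, defaultdict


def rotated_subsets(n_items, n_models=6, fraction=0.5):
    """Give each model a window of the data, shifted by a rotating offset.

    The fraction and the rotation step are illustrative assumptions.
    """
    size = max(1, int(n_items * fraction))
    step = max(1, n_items // n_models)
    return [[(m * step + i) % n_items for i in range(size)]
            for m in range(n_models)]


def train_lookup_model(rows):
    """Toy stand-in for a real model: most common label per feature value."""
    counts = defaultdict(Counter)
    for feature, label in rows:
        counts[feature][label] += 1
    return {f: c.most_common(1)[0][0] for f, c in counts.items()}


def flag_suspects(data, n_models=6, min_agree=4):
    """Return indices whose label contradicts a sufficiently large committee majority.

    min_agree is a hypothetical agreement rule, not a documented value.
    """
    subsets = rotated_subsets(len(data), n_models)
    models = [train_lookup_model([data[i] for i in idx]) for idx in subsets]
    suspects = []
    for i, (feature, label) in enumerate(data):
        # Only models that have seen this feature value get a vote.
        votes = Counter(m[feature] for m in models if feature in m)
        if not votes:
            continue
        winner, n_votes = votes.most_common(1)[0]
        if n_votes >= min_agree and winner != label:
            suspects.append(i)
    return suspects


# Twelve (vendor, currency) rows with one mislabeled row at index 5.
data = [("acme", "EUR")] * 5 + [("acme", "USD")] + [("acme", "EUR")] * 6
print(flag_suspects(data))  # [5]
```

In this sketch a row is flagged only when enough committee members agree with each other and disagree with the stored label, which mirrors the idea that committee predictions contradicting the training data point to a likely human error. A stricter min_agree could be used for important fields.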