Why is it necessary?
In practice, tabular data generated by real-world applications usually suffers from different types of discrepancies that hinder its use in AI applications. For instance, data collected manually in a certain application, e.g., customer and purchase records, may contain missing values due to a lack of information or to data entry errors. Along a similar line, duplicates may occur due to improper join operations. Other types of discrepancies include outliers, rule/pattern violations, mislabeling, inconsistencies, and typos.
How does it work?
SAGED makes use of meta-learning concepts to exploit historical knowledge and to provide a knob for controlling the execution time of the error detection task. The core idea is to exploit the knowledge embedded in historical datasets while training the detection classifier. Specifically, SAGED consists of two phases: a knowledge-gathering phase and a detection phase. In the former, a set of ML models is trained to identify errors in the historical datasets. The latter begins by matching the new dirty dataset against a set of the historical datasets, and then uses the corresponding pre-trained models to generate the feature vectors for the meta detection classifier.
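To make the two phases more concrete, the following Python sketch outlines one possible realization of this pipeline. It assumes that cell- or row-level feature matrices have already been extracted from each dataset; the helper names (`gather_knowledge`, `detect_errors`), the confidence-based matching step, and the scikit-learn models are illustrative assumptions, not SAGED's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# --- Phase 1: knowledge gathering (offline) --------------------------------
def gather_knowledge(historical_data):
    """Train one base error detector per historical dataset.

    historical_data: list of (X_hist, y_hist) pairs, where X_hist is a
    feature matrix and y_hist holds binary labels (1 = erroneous, 0 = clean).
    """
    base_models = []
    for X_hist, y_hist in historical_data:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X_hist, y_hist)
        base_models.append(model)
    return base_models

# --- Phase 2: detection (online) --------------------------------------------
def detect_errors(X_new, y_new_labeled, base_models, top_k=3):
    """Match the new dirty dataset to relevant base detectors, build
    meta-features from their predictions, and train a meta-classifier.

    y_new_labeled: float array with labels for a small labeled sample
    and NaN for the remaining (unlabeled) rows.
    """
    # Simplified stand-in for the dataset-matching step: rank base models
    # by their average predicted error probability on the new data and
    # keep the top_k most relevant ones.
    scores = [m.predict_proba(X_new)[:, 1].mean() for m in base_models]
    selected = np.argsort(scores)[-top_k:]

    # Each selected base model contributes one meta-feature column.
    meta_features = np.column_stack(
        [base_models[i].predict_proba(X_new)[:, 1] for i in selected]
    )

    # Train the meta detection classifier on the labeled sample only,
    # then predict error labels for the whole dataset.
    labeled_idx = ~np.isnan(y_new_labeled)
    meta_clf = LogisticRegression()
    meta_clf.fit(meta_features[labeled_idx], y_new_labeled[labeled_idx])
    return meta_clf.predict(meta_features)
```

In this sketch, the `top_k` parameter plays the role of the knob mentioned above: selecting fewer base models reduces the work done in the detection phase at the cost of using less historical knowledge.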