Data Quality Dashboard
What is it?
The data quality dashboard developed by Software AG is a modular, interactive dashboard that streamlines and automates multiple aspects of the data quality management process, in particular data profiling, validation, error detection, and correction. The tool is automated, interactive, and iterative, and it is intended to improve data quality for downstream applications such as Business Intelligence (BI) and machine learning (ML) platforms.
Why is it necessary?
Crafting a data quality management pipeline is challenging without extensive data science expertise. Many tools and technologies are available, each with its own strengths and limitations, and choosing and using them effectively for a specific data quality issue requires a deep understanding of each. Effective data quality management also demands knowledge of the underlying data quality problems: the domain context, the potential sources of quality issues, and their implications for the data pipeline and downstream tasks. The work is not limited to applying tools; it also involves ongoing monitoring and adjustment as new data comes in or as business requirements evolve. Data quality dashboards assist in this process by defining data collection rules and presenting exceptions to data owners, who must then take corrective action. However, data owners with limited data science knowledge may struggle to determine appropriate actions or to fine-tune rules. In addition, confirming that corrections are accurate and actually benefit downstream models can require further analysis or ML techniques.
How does it work?
The process begins with data ingestion from various sources, including SQL databases and CSV files, managed by a data loader that feeds into a dashboard controller. The modular design allows integration with external tools via standard REST APIs. An automated data profiling module analyzes and records the data's characteristics, while an automated rule extraction module generates rules based on statistical properties and domain-specific features. These rules guide the automated error detection module, which scans for inconsistencies and outliers, and the automated error repair module, which applies advanced algorithms to correct detected errors, minimizing the need for manual intervention. A version control tracking module records each data version, enabling robust data management and facilitating rollbacks if necessary. The system also generates DataSheets capturing critical information such as data version tags, hyperparameters, generated rules, data quality metrics, and utilized cleaning tools.
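The flow from profiling through rule extraction, error detection, and repair can be illustrated with a minimal Python sketch. It is not the dashboard's implementation: the function names, the dictionary-based rule format, the median-based range rule, the median fill, and the DataSheet fields are all assumptions chosen to keep the example small.

```python
import hashlib
import json
from statistics import median

def profile(rows, column):
    """Record simple characteristics of one numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    return {"column": column, "count": len(rows),
            "missing": len(rows) - len(values), "median": med, "mad": mad}

def extract_rule(stats, k=5.0):
    """Derive a range rule from the profile (hypothetical rule format)."""
    return {"column": stats["column"],
            "min": stats["median"] - k * stats["mad"],
            "max": stats["median"] + k * stats["mad"],
            "fill": stats["median"]}

def detect(rows, rule):
    """Return indices of rows whose value is missing or outside the rule's range."""
    col = rule["column"]
    return [i for i, r in enumerate(rows)
            if r[col] is None or not rule["min"] <= r[col] <= rule["max"]]

def repair(rows, rule, bad):
    """Return a new list in which flagged values are replaced by the fill value."""
    bad_set = set(bad)
    return [{**r, rule["column"]: rule["fill"]} if i in bad_set else r
            for i, r in enumerate(rows)]

# Illustrative input: one missing value and one implausible reading.
rows = [{"temp": 20.0}, {"temp": 21.0}, {"temp": 19.0},
        {"temp": 22.0}, {"temp": None}, {"temp": 400.0}]

stats = profile(rows, "temp")
rule = extract_rule(stats)
bad = detect(rows, rule)
cleaned = repair(rows, rule, bad)

# A DataSheet-like record (assumed fields): version tag, rule, and metrics.
datasheet = {
    "version": hashlib.sha256(
        json.dumps(cleaned, sort_keys=True).encode()).hexdigest()[:12],
    "hyperparameters": {"k": 5.0},
    "rules": [rule],
    "metrics": {"rows": stats["count"], "errors_detected": len(bad)},
}
```

Because `repair` returns a new list instead of modifying `rows`, the original version stays available, which is the property a rollback depends on. Note that a column with many identical values yields a median absolute deviation of zero, in which case this simple rule would flag every differing value.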
The data quality dashboard is further enhanced by two key features: the user-in-the-loop module and the iterative cleaning module. The user-in-the-loop module lets users validate or adjust system-generated rules and corrections and introduce custom rules. Users can annotate specific data samples to train ML models for validation and correction, and can proactively manage errors by tagging known corrupted samples. The iterative cleaning module executes multiple cleaning cycles on the input data, progressively improving data quality and, with it, the performance of downstream applications. Repeated refinement addresses complex or stubborn errors that a single pass would miss. The system outputs cleaned data suitable for BI or ML applications and provides data visualization tools that let stakeholders such as data scientists, developers, domain experts, managers, and business owners review and understand the data cleaning process and its results. Overall, the dashboard offers a comprehensive, automated, and user-friendly approach to data quality management, with the aim of improving efficiency, effectiveness, and transparency.
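One way to see why repeated cleaning cycles help is the masking effect, where an extreme outlier inflates the spread so much that a smaller outlier goes unnoticed until the first one is repaired. The sketch below is a simplified illustration and not the dashboard's algorithm: the z-score detector, the median fill, and the `review` callback standing in for the user-in-the-loop step are all assumptions.

```python
from statistics import mean, median, pstdev

def detect_outliers(values, threshold=2.5, skip=frozenset()):
    """Return indices whose z-score exceeds the threshold, ignoring `skip`."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values)
            if i not in skip and abs(v - mu) / sigma > threshold]

def iterative_clean(values, review=lambda index, old, new: True, max_cycles=5):
    """Run detect/repair cycles until nothing is flagged or max_cycles is hit.

    `review` is a hypothetical user-in-the-loop hook: it receives a proposed
    correction and returns True to accept it or False to keep the old value.
    """
    data = list(values)
    rejected = set()  # indices where the user declined a correction
    history = []      # number of flagged values in each cycle
    for _ in range(max_cycles):
        bad = detect_outliers(data, skip=rejected)
        history.append(len(bad))
        if not bad:
            break
        # Fill value: median of the values not flagged in this cycle.
        fill = median(v for i, v in enumerate(data) if i not in bad)
        for i in bad:
            if review(i, data[i], fill):
                data[i] = fill
            else:
                rejected.add(i)  # do not propose this index again
    return data, history

readings = [10, 11, 9, 10, 12, 10, 11, 9, 50, 1000]

# Fully automatic run: every proposed correction is accepted.
cleaned, history = iterative_clean(readings)

# User-in-the-loop run: the user confirms that 50 is a genuine reading.
kept, kept_history = iterative_clean(
    readings, review=lambda index, old, new: old != 50)
```

In the first cycle only 1000 is flagged, because it dominates the standard deviation; once it is replaced, the second cycle flags 50, and the third cycle finds nothing left to repair. In the second run the user rejects the correction for 50, so that value is kept and not proposed again.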