Data Quality Evaluation Tool
What is it?
The data quality evaluation tool conducts comprehensive assessments of the provided data, generating a holistic evaluation of its overall quality. The quality checks are based on the ISO/IEC 25012 and ISO/IEC 25024 standards and cover quality measures related to attributes such as accuracy, consistency, and completeness, among others. The tool can be used independently or in the context of CABC, where the data quality reports are generated and made available to the platform. In its current state, the application only supports COCO-formatted image datasets.
Why is it necessary?
Besides providing reliable, consistent, and thorough quality checks by assessing data against a well-defined standard, the evaluation tool can be integrated with tools like the Pipeline Probe, which facilitates automated quality assessment of data in an MLOps process.
How does it work?
The tool utilizes the FiftyOne API to compute various Quality Measure Elements (QMEs), for instance, Intersection over Union (IoU) for detecting annotation overlaps, image hashes for duplicate detection, and incorrect-annotation detection for semantic accuracy checks (ref. Figure 2). These QMEs are then used to compute quality measures using the measurement functions described in the ISO/IEC 25024 document.
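As a minimal sketch of how such QMEs could be computed with the FiftyOne API, consider the snippet below. The dataset paths, field names, IoU threshold, and choice of MD5 for hashing are illustrative assumptions, not the tool's actual configuration.

```python
import hashlib

import fiftyone as fo
import fiftyone.utils.iou as foui
from fiftyone import ViewField as F

# Load a COCO-formatted detection dataset (paths are placeholders).
dataset = fo.Dataset.from_dir(
    dataset_type=fo.types.COCODetectionDataset,
    data_path="/path/to/images",
    labels_path="/path/to/annotations.json",
    label_field="ground_truth",
)

# QME: maximum pairwise IoU per detection, used to flag overlapping
# annotations (the 0.75 threshold is an illustrative assumption).
foui.compute_max_ious(dataset, "ground_truth", iou_attr="max_iou")
overlaps = dataset.filter_labels("ground_truth", F("max_iou") > 0.75)

# QME: per-image file hash, used to flag exact duplicates.
for sample in dataset:
    with open(sample.filepath, "rb") as f:
        sample["file_hash"] = hashlib.md5(f.read()).hexdigest()
    sample.save()
```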
Figure 1 shows a general overview of the workflow. As can be seen, the tool can be triggered as a Python script, utilizing settings retrieved from a YAML file, or as a subprocess invoked by an external process, in which case the default configurations in the YAML file are overridden by the configurations from the caller. Figure 2 shows a more detailed look at the QMEs used by the tool during evaluation, along with the corresponding quality measures.
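A minimal sketch of this configuration handling is shown below; the file name, the config keys, and the mechanism for passing caller overrides (a JSON string argument) are assumptions for illustration.

```python
import json
import sys

import yaml

# Load the default settings from the YAML file (hypothetical name
# and keys; the tool's actual schema may differ).
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# When invoked as a subprocess, the caller could pass overrides,
# e.g. as a JSON string; these take precedence over the defaults.
if len(sys.argv) > 1:
    overrides = json.loads(sys.argv[1])
    config.update(overrides)

iou_threshold = config.get("iou_threshold", 0.75)
```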
Further, the tool generates comprehensive logs that enable users to visualize data with identified issues (ref. Figure 1). The final evaluation results take the form of a JSON file containing a detailed list of the quality measures along with their corresponding computed values, which allows downstream applications to consume the results of the quality evaluations.
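For illustration, a downstream application might consume such a report as follows; the file name and the "measures"/"name"/"value" fields are assumed here, since the actual report schema is defined by the tool.

```python
import json

# Hypothetical report path and schema, e.g.
# {"measures": [{"name": "record_completeness", "value": 0.98}, ...]}
with open("quality_report.json") as f:
    report = json.load(f)

# Iterate over the quality measures and their computed values.
for measure in report["measures"]:
    print(f"{measure['name']}: {measure['value']}")
```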