A short overview of data quality definitions and challenges, written in support of Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality," and I helped prepare the daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website: Defining Data Quality; Documenting, Describing, Defining; Good Data Examples; Finding the Right Data; Rescuing Unloved Data.
Attempts to develop discipline-independent frameworks have produced several models, which define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Kahn, Strong and Wang, 2002), as syntactic, semantic and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). None of these frameworks has been widely adopted, and new ones continue to appear. Several systematic syntheses have compared the existing frameworks, only to point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009).
Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):
- how to assess quality
- how to manage quality
- what impact quality has on the organization
Image: "3 Reasons Why Data Quality Should Be Your Top Priority This Year"
Completeness, for example, is the extent to which data is not missing and is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or processing errors, the sample may become less representative and inferences about the population may be distorted. Similarly, if the dataset contains inaccurate or outdated values, modeling and inference built on it become unreliable.
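One simple way to make the completeness dimension concrete is to measure the share of values that are actually present. The sketch below is only an illustration under stated assumptions: the survey records, the use of `None` to mark a missing value, and the `completeness` helper are all hypothetical, and real assessments would also consider breadth and depth for the task at hand.

```python
# Hypothetical survey records; None marks a missing value
# (for example, a question the respondent did not answer).
records = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},
    {"age": None, "income": 61000},
    {"age": 45, "income": 58000},
]


def completeness(rows, field):
    """Return the share of rows in which `field` is present (not None)."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(field) is not None)
    return present / len(rows)


# Completeness per field.
age_completeness = completeness(records, "age")
income_completeness = completeness(records, "income")

# Share of rows with no missing values at all. Analyses that silently
# drop incomplete rows work with only this fraction of the sample.
complete_rows = [r for r in records if all(v is not None for v in r.values())]
row_completeness = len(complete_rows) / len(records)

print(f"age: {age_completeness:.2f}")
print(f"income: {income_completeness:.2f}")
print(f"fully complete rows: {row_completeness:.2f}")
```

Each field here looks fairly complete on its own, yet only half of the rows are fully usable, which is where the risk to representativeness comes from.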
As data goes through many stages during the research lifecycle, from collection or acquisition through transformation and modeling to publication, each stage creates additional challenges for maintaining the integrity and quality of data. In one of the most recent attempts to discredit climate change studies, for example, the authors of a study were accused of not following the NOAA Climate Data Record policies, which maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This raises questions for further study:
- How does non-compliance with policies undermine the quality of data?
- What role does scientific community consensus play in establishing the quality of data?
- Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized?
Madnick et al. (2009) identify three approaches to solving data quality problems: the technical or database approach, the computer science / information technology (IT) approach, and the digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking, and probabilistic modeling. Computer science / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, and the establishment of protocols and standards, among others. Digital curation focuses on metadata, long-term preservation, and provenance.
References and further reading:
- Altman, M. (2012). Mitigating Threats to Data Quality Throughout the Curation Lifecycle.
- Ballou, D. P., & Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems. Management Science, 31(2), 150–162.
- Fan, W., & Geerts, F. (2012). Foundations of data quality management. Morgan & Claypool.
- Ge, M., & Helfert, M. (2007). A review of information quality research - develop a research agenda. In The International Conference on Information Quality (pp. 76–91).
- International Monetary Fund. (2001). Fourth Review of the Fund’s Data Standards’ Initiatives.
- Letzter, R. (2017, February 9). No, the NOAA didn’t fake climate change data. Science Alert.
- Madnick, S. E., Wang, R. Y., Lee, Y. W., & Zhu, H. (2009). Overview and Framework for Data and Information Quality Research. Journal of Data and Information Quality, 1(1), 1–22.
- Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211.
- Price, R., & Shanks, G. (2005). A semiotic information quality framework: development and comparative analysis. Journal of Information Technology, 20(2), 88–102.
- Redman, T. C. (2001). Data Quality: The Field Guide. Boston: Digital Press.
- Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86–95.