A short overview of data quality definitions and challenges in support of the Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality" and I was part of preparing daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website: Defining Data Quality, Documenting, Describing, Defining, Good Data Examples, Finding the Right Data, Rescuing Unloved Data
Data quality is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is the data that can be used for the task at hand even if it has some issues (e.g., missing data, poor metadata, value inconsistencies, etc.) Data that has errors, is hard to retrieve or understand, or has no context or traces of where it came from is generally considered bad.
Numerous attempts to define data quality over the last few decades relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of quality of data is recognized in business and commercial data warehousing (Fan and Geertz, 2012; Redman, 2001), in government operations (Information Quality Act, 2001) and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate quality of information, including decision, measurement, test, and estimation theories (Altman, 2012).
Attempts to develop discipline-independent frameworks resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Khan, Strong and Wang, 2002), as syntactic, semantic and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts to define discipline-independent data quality frameworks, they have not been widely adopted and more frameworks continue to appear. Several systematic syntheses compared many existing frameworks to only point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Battini et al, 2009).
Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):
Madnick et al. (2009) identify three approaches to possible solutions to data quality: technical or database approach, computer science / information technology (IT) approach, and digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking and probabilistic modeling. Computer / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes paying attention to metadata, long-term preservation, and provenance.
Attempts to develop discipline-independent frameworks resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Khan, Strong and Wang, 2002), as syntactic, semantic and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts to define discipline-independent data quality frameworks, they have not been widely adopted and more frameworks continue to appear. Several systematic syntheses compared many existing frameworks to only point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Battini et al, 2009).
Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):
- how to assess quality
- how to manage quality
- what impact quality has on organization
3 Reasons Why Data Quality Should Be Your Top Priority This Year |
Each dimension of data quality, such as completeness, accuracy, timeliness, or consistency creates challenges for data quality.
Completeness, for example, is the extent to which data is not missing or is of sufficient breadth and depth for the task at hand (Khan, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
As data goes through many stages during the research lifecycle, from its collection / acquisition to transformation and modeling to publication, each of the stages creates additional challenges for maintaining integrity and quality of data. In one of the most recent attempts to discredit climate change studies, for example, the authors of the study were blamed for not following the NOAA Climate Data Record policies that maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This brings out possibilities for further studies:
Completeness, for example, is the extent to which data is not missing or is of sufficient breadth and depth for the task at hand (Khan, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
As data goes through many stages during the research lifecycle, from its collection / acquisition to transformation and modeling to publication, each of the stages creates additional challenges for maintaining integrity and quality of data. In one of the most recent attempts to discredit climate change studies, for example, the authors of the study were blamed for not following the NOAA Climate Data Record policies that maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This brings out possibilities for further studies:
- How does non-compliance with policies undermine the quality of data?
- What role does scientific community consensus play in establishing the quality of data?
- Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized?
Madnick et al. (2009) identify three approaches to possible solutions to data quality: technical or database approach, computer science / information technology (IT) approach, and digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking and probabilistic modeling. Computer / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes paying attention to metadata, long-term preservation, and provenance.
Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility that is shared between data providers, data curators and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage and metadata. Maintaining current and consistent metadata across copies and systems also benefits contributions from those who intend to re-use the data. Data and software documentation is another aspect of data quality that cannot be solved technically and needs a combination of organizational / information science solutions.
References and further reading:
References and further reading:
- Altman, M. (2012). Mitigating Threats to Data Quality Throughout the Curation Lifecycle.
- Ballou, D. P., & Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems. Management Science, 31(2), 150–162.
- Fan, W., & Geerts, F. (2012). Foundations of data quality management. Morgan & Claypool.
- Ge, M., & Helfert, M. (2007). A review of information quality research - develop a research agenda. In The International Conference on Information Quality (pp. 76--91).
- International Monetary Fund. (2001). Fourth Review of the Fund’s Data Standards’ Initiatives.
- Letzter, R. (2017, February 9). No, the NOAA didn’t fake climate change data. Science Alert.
- Madnick, S. E., Wang, R. Y., Lee, Y. W., & Zhu, H. (2009). Overview and Framework for Data and Information Quality Research. Journal of Data and Information Quality, 1(1), 1–22.
- Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211.
- Price, R., & Shanks, G. (2005). A semiotic information quality framework: development and comparative analysis. Journal of Information Technology, 20(2), 88–102.
- Redman, T. C. (2001). Data Quality: The Field Guide. Boston: Digital Press.
- Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86–95.
No comments:
Post a Comment