Feb 13, 2017

Data quality - a short overview


A short overview of data quality definitions and challenges in support of the Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality" and I was part of preparing daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website:

  • Defining Data Quality
  • Documenting, Describing, Defining
  • Good Data Examples
  • Finding the Right Data
  • Rescuing Unloved Data

Data quality is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is data that can be used for the task at hand even if it has some issues (e.g., missing values, poor metadata, or value inconsistencies). Data that has errors, is hard to retrieve or understand, or carries no context or traces of where it came from is generally considered bad.

Numerous attempts to define data quality over the last few decades have relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of data quality is recognized in business and commercial data warehousing (Fan and Geerts, 2012; Redman, 2001), in government operations (Information Quality Act, 2001), and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate the quality of information, including decision, measurement, test, and estimation theories (Altman, 2012).

Attempts to develop discipline-independent frameworks have resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Kahn, Strong and Wang, 2002), as syntactic, semantic, and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts, no discipline-independent framework has been widely adopted, and more frameworks continue to appear. Several systematic syntheses have compared the existing frameworks, only to underscore the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009).

Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):

  • how to assess quality
  • how to manage quality
  • what impact quality has on organizations

The multitude of definitions, frameworks, and contexts in which data quality is used shows that making data quality a useful paradigm remains a persistent challenge. Progress would benefit from a dynamic network of researchers and practitioners in the area of data quality, and from a framework that is general yet flexible enough to accommodate highly specific attributes and measurements from particular domains.

Each dimension of data quality, such as completeness, accuracy, timeliness, or consistency, creates its own challenges.

Completeness, for example, is the extent to which data is not missing and is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that the representativeness of the sample is reduced and inferences about the population are thus distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
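To make the completeness dimension concrete, here is a minimal sketch of a per-column completeness check on tabular data; the column names and values are hypothetical illustrations, not tied to any particular framework or tool.

```python
import csv
import io

def completeness(rows):
    """Return the fraction of non-missing values per column."""
    counts = {}
    for field in rows[0]:
        present = sum(1 for r in rows if r[field] not in ("", None))
        counts[field] = present / len(rows)
    return counts

# A tiny hypothetical survey extract with two missing values.
raw = io.StringIO(
    "respondent_id,age,income\n"
    "1,34,52000\n"
    "2,,61000\n"
    "3,29,\n"
)
rows = list(csv.DictReader(raw))
print(completeness(rows))
# → {'respondent_id': 1.0, 'age': 0.666..., 'income': 0.666...}
```

A report like this only measures one dimension; whether 67% completeness in the age column is acceptable still depends on the task at hand.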

As data goes through many stages of the research lifecycle, from collection / acquisition to transformation and modeling to publication, each stage creates additional challenges for maintaining data integrity and quality. In one recent attempt to discredit climate change studies, for example, the study's authors were accused of not following the NOAA Climate Data Record policies, which maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This raises questions for further study:
  • How does non-compliance with policies undermine the quality of data?
  • What role does scientific community consensus play in establishing the quality of data?
  • Should quality management efforts focus on improving the quality of data at every stage, or on the quality of procedures, so that the possibility of errors is minimized?

Another aspect of data quality that complicates formalized treatment of its dimensions is that data is often heterogeneous and applied in varied contexts. As noted above, data quality frameworks and approaches are being developed in business, government, and research contexts, and quality solutions have to consider structured, semi-structured, and unstructured data and their combinations. Most previous data quality research has focused on structured or semi-structured data. Additionally, spatial, temporal, and volume dimensions of data contribute to quality assessment and management.

Madnick et al. (2009) identify three approaches to data quality solutions: a technical or database approach, a computer science / information technology (IT) approach, and a digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking, and probabilistic modeling. Computer science / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes attention to metadata, long-term preservation, and provenance.
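Monitoring and cleaning, one of the technical approaches above, is often implemented as rule-based validation. The following is a minimal sketch under assumed, hypothetical rules and record fields; it does not represent any specific tool's API.

```python
def validate(record, rules):
    """Return the names of the rules that the record violates."""
    return [name for name, check in rules.items() if not check(record)]

# Hypothetical validation rules for a climate-observation record.
rules = {
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
    "year_not_future": lambda r: r.get("year", 9999) <= 2017,
}

# A record with an implausible age value.
record = {"age": 245, "year": 2015}
print(validate(record, rules))  # → ['age_in_range']
```

Flagging violations rather than silently correcting them keeps a trace of what was changed, which matters for the provenance concerns discussed above.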

Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility shared among data providers, data curators, and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage, and metadata. Maintaining current and consistent metadata across copies and systems also benefits those who intend to re-use the data. Data and software documentation is another aspect of data quality that cannot be solved purely technically and needs a combination of organizational and information science solutions.

References and further reading: