Feb 12, 2014

Kinds of data and their implications for data re-use

Notes on the paper "What Are Data? The Many Kinds of Data and Their Implications for Data Re-Use" (Journal of Computer-Mediated Communication, 2007, Vol. 12, No. 2, pp. 635–651):

The paper reports on an ethnographic study of data sharing practices in four projects that served as case studies. The goal was to reflect on the technical and social contexts that enable data re-use. The four projects were:

  • SkyProject – a collaborative project to build the data grid infrastructure for U.K. astronomy
  • SurveyProject – a project to produce a yearly large-scale complex survey dataset (~10,000 U.K. households)
  • CurationProject – a digitization and access project of artifacts and photographs collected since 1884
  • AnthroProject – a digitization of anthropological materials collected in a range of countries over one researcher's academic career

The fieldwork included interviews with participants, who were recruited via snowball sampling and by following the paths that the data took through each project, as well as document analysis and observation (including project websites, conferences, and other face-to-face meetings). Participants included people involved in data collection, processing, analysis, and re-use.

Below are some observations on the data practices at four stages of the data lifecycle:

  • Data Collection: some disciplines produce data that are digital from the start, while others (e.g., AnthroProject) work with a mix of digital, non-digital, and legacy data (tapes, diaries, photographs, etc.). The labor of digitization is often ignored, but it is an important part of the data collection stage.
  • Data Formatting: data need to be transformed (converted, re-formatted, flattened, etc.) to be re-usable by others. In SurveyProject, for example, variables are renamed and recoded, and files are renamed and loaded into a database. Converting variables (e.g., words into numbers) and successively renaming and restructuring them makes the collected materials visible, manageable, communicable, and intelligible to others. Such transformations into manageable and communicable chunks are difficult for disciplines that see their primary goal as describing the specificities of particular contexts and drawing distinctions, as opposed to generalizing and simplifying.
  • Data Release: ownership, consent, and ethics differ depending on whether people are represented in the source data or not. In AnthroProject, where the point of the data is their subject-specificity, anonymization is largely impossible to achieve.
  • Data Re-Use: the case studies suggested that the histories and configurations of research communities influence how data are documented, contextualized, and checked for quality. Overall, the following aspects are important for data to become re-usable: conditions and context of data capture (e.g., atmospheric conditions or community place-time); instrument quality and calibration techniques; data points and variables to be collected; and transformation techniques (e.g., statistical methods and parameters).
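
To make the Data Formatting step more concrete, here is a minimal Python sketch of the kind of renaming and recoding (words into numbers) described for SurveyProject. It is not taken from the paper: the field names, answer labels, numeric codes, and the missing-value code are all invented for illustration.

```python
# Hypothetical raw survey responses; field names and answer labels are
# invented for illustration and do not come from the paper.
RAW_RESPONSES = [
    {"Q1_tenure": "Owned outright", "Q2_hhsize": "3"},
    {"Q1_tenure": "Rented", "Q2_hhsize": "1"},
    {"Q1_tenure": "Refused", "Q2_hhsize": "2"},
]

# Assumed mapping from questionnaire field names to analysis variable names.
RENAME = {"Q1_tenure": "tenure", "Q2_hhsize": "household_size"}

# Assumed codebook: answer labels (words) become numeric codes.
TENURE_CODES = {"Owned outright": 1, "Rented": 2}
MISSING = -9  # assumed code for answers not listed in the codebook


def recode(record):
    """Rename the fields of one response and convert its values to numbers."""
    out = {RENAME.get(name, name): value for name, value in record.items()}
    out["tenure"] = TENURE_CODES.get(out["tenure"], MISSING)
    out["household_size"] = int(out["household_size"])
    return out


CLEANED = [recode(r) for r in RAW_RESPONSES]
```

Even in this toy version, the answer "Refused" collapses into a generic missing code, which hints at why such flattening sits uneasily with disciplines that value contextual detail.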

This study introduced some interesting observations, but it did not provide enough detail to make them meaningful. Given this lack of detail and specificity, it is hard to be convinced by the claims being made. For example, what questions were asked during the interviews? Were there any differences between the interview responses and the information in documents, meetings, etc.? Could we make comparisons across all four projects for each stage of the data lifecycle? Are the quotes mere illustrations, or are they indicative of some patterns?