May 9, 2014

Big data report from the White House

Another big data review, this time from the White House - "Big Data: Seizing Opportunities, Preserving Values" (pdf). The report explains what big data is (large, diverse, complex, longitudinal, distributed, making possible unexpected discoveries and creating an asymmetry of power between those who hold the data and those who intentionally or inadvertently supply it) and describes the implications of big data for the public and private sectors. In addition to many well-known and less-known examples of how big data can be good or bad, the report offers initial thoughts on recommendations for big data governance. It divides its approach to a policy framework into four overlapping core areas:

1. Big data and citizens - improve public services while preventing the government from accruing unlimited power through increased surveillance, algorithmic profiling, and metadata tracking.

2. Big data and consumers - reduce the cost of commercial services and personalize them while mitigating security breaches, risks of discrimination based on consumer profiles, and lack of consumer awareness and data transparency.

3. Big data and discrimination - do less harm and prevent discriminatory uses of identification and re-identification techniques.

4. Big data and privacy - get used to less privacy while reconsidering the notice and consent framework.

In the concluding section the report makes the following recommendations:

  • Advance the Consumer Privacy Bill of Rights.
  • Pass National Data Breach Legislation.
  • Extend Privacy Protections to non-U.S. Persons.
  • Ensure Data Collected on Students in School is Used for Educational Purposes.
  • Expand Technical Expertise to Stop Discrimination.
  • Amend the Electronic Communications Privacy Act.

It's a thorough report and is definitely worth a read, but similarly to the big data review by my colleagues and me (pre-print), it's just the beginning of studying the implications and governance of big data.

May 2, 2014

Summary of drivers and barriers in data sharing

A nice summary of the drivers, barriers, and enablers that determine stakeholder engagement, based on expert interviews, in Dallmeier-Tiessen et al., 2014, Enabling Sharing and Reuse of Scientific Data (restricted access).

Drivers and benefits

  • Societal benefits - economic/commercial benefits; continued education; inspiring the young; allowing the exploitation of the cognitive surplus in society; better quality decision making in government and commerce; citizens being able to hold governments accountable.
  • Academic benefits - the integrity of science; increased public understanding of science.
  • Research benefits - validation of scientific results by other scientists; recognition of their contribution; reuse of data in meta-studies to find hidden effects/trends; testing new theories against past data; doing new science not considered when data was collected without repeating the experiment; easing discovery of data by searching/mining across large datasets with benefits of scale; easing discovery and understanding of data across disciplines to promote interdisciplinary studies; combining with other data (new or archived) in the light of new ideas.
  • Organizational benefits - publication of high-quality data and citation of data enhance the organizational profile; preserved data linked to published articles adds value to the product; data preservation brings in more business; the reputation of the institution as a "data holder with expert support" is increased; combining data from multiple sources helps to make policy decisions; reuse of data instead of new data collection reduces the time and cost to new research results; use of data for teaching purposes.
  • Individual contributor benefits - preserving data for the contributor to access later — sharing with your future self; peer visibility and increased respect achieved through publications and citation; increased research funding; increased control of organizational resources when more established in their careers; the socio-economic impact of their research (e.g., spin-out companies, patent licenses, inspiring legislation); status, promotion and pay increase with career advancement; status-conferring awards and honors.

Barriers and enablers are related to:

  • Individual contributor incentives
  • Availability of a sustainable preservation infrastructure
  • Trustworthiness of the data, data usability, pre-archive activities
  • Data discovery
  • Academic defensiveness
  • Finance
  • Subject anonymity and personal data confidentiality
  • Legislation/regulation

Apr 15, 2014

Survey of digital curation and curators

I am conducting a survey of digital curation and digital curators.

If you are involved in taking care of digital materials of any type, form and purpose and are interested in the advancement of digital curation as a professional field, feel free to take the survey and share it with colleagues. The survey takes about 20 min to complete and can be found at

Feb 12, 2014

Kinds of data and their implications for data re-use

Notes on a paper "What Are Data? The Many Kinds of Data and Their Implications for Data Re-Use" (Journal of Computer-Mediated Communication, 2007, Vol. 12, N 2, pp. 635–651):

The paper reports on ethnographic research of data sharing practices in four projects that served as case studies. The goal was to reflect on the technical and social contexts of enabling data for re-use. The four projects were:

  • SkyProject – a collaborative project to build the data grid infrastructure for U.K. astronomy
  • SurveyProject – a project to produce a yearly large-scale complex survey dataset (~10,000 U.K. households)
  • CurationProject – a digitization and access project of artifacts and photographs collected since 1884
  • AnthroProject – a digitization of anthropological materials collected in a range of countries over one researcher's academic career

The fieldwork included interviews of participants recruited via a snowball sampling technique and via the paths that the data took through each project, plus document analysis and observation (including project websites, conferences, and other face-to-face meetings). Participants included people involved in data collection, processing, analysis, and reuse.

Below are some observations on data practices at four stages of the data lifecycle:

  • Data Collection: some disciplines produce digital data, while others (e.g., AnthroProject) work with a mix of digital, non-digital and legacy data (tapes, diaries, photographs, etc.). The labor of digitization is often ignored, but it is still very important at the stage of data collection.
  • Data Formatting: data need to be transformed (converted, re-formatted, flattened, etc.) to be re-usable by others. In SurveyProject, for example, variables are renamed and recoded, files are renamed and loaded into a database. The processes of converting variables (e.g., words into numbers) and of successive renaming and restructuring make collected materials visible, manageable, communicable, and intelligible for others. Such transformations into manageable and communicable chunks are difficult for disciplines that see their primary goal as describing the specificities of particular contexts and drawing distinctions as opposed to generalizations and simplification.
  • Data Release: ownership, consent, and ethics differ depending on whether people are represented in the source data or not. In AnthroProject, where the point of the data is their subject-specificity, anonymization is largely impossible to achieve.
  • Data Re-Use: the case studies suggested that the histories and configurations of research communities influence how data are documented, contextualized and checked for quality. Overall, the following aspects are important for data to become re-usable: conditions and context of data capture (e.g., atmospheric conditions or community place-time); instrument quality and calibration techniques; data points and variables to be collected; transformation techniques (e.g., statistical methods and parameters).

This study introduced some interesting observations, but there were not enough details to make them meaningful. With such a lack of detail and specificity, it's hard to be convinced by the claims being made. For example, what questions were asked during the interviews? Were there any differences between the responses and the information in documents, meetings, etc.? Could comparisons be made across all four projects for each stage of the data lifecycle? Are the quotes mere illustrations, or are they indicative of some patterns?

Jan 17, 2014

Evidence, a conceptual analysis

Notes from "Conceptual Analysis: A Method for Understanding Information as Evidence, and Evidence as Information" by J. Furner (Archival Science, 2004, 4, 233–265, pdf)

The paper examines the concept of evidence and compares it to the concept of information. Conceptual analysis treats concepts as classes of objects or events and seeks to define the meaning of a given concept by specifying the conditions under which any entity could be classified under the concept in question. The goal of conceptual analysis as a method of inquiry is to improve our understanding of the ways in which particular concepts are used for communicating ideas. The utility of this method is to produce provocative or interesting ideas that can be pursued as directions for further research.

Definition: Evidence is that which we consider or interpret in order to draw or infer a conclusion about some aspect of the world.

Evidence is a concept that is used in science, law, history and archival practice/science. In science, evidence is an important component of theories of induction (inference from a set of premises) and explanation (specification of a sequence of cause and effect). Induction can be done by generalizing evidence (e.g., by observing white swans we conclude that all swans are white), by deducing conclusions from a hypothesis (we hypothesize that cigarette smoking causes cancer and expect a smoker to have lung disease - if the smoker has the disease, the hypothesis is confirmed), or by calculating the effect of observed evidence on our degree of belief in a hypothesis (evidence raises our degree of belief and increases the probability of the explanation). Hypothetico-deductive reasoning has a problem of underdetermination (the observed evidence can be entailed by more than one hypothesis). Deductive explanations also assume the existence of a law (the deductive-nomological model), which is problematic for the social sciences. In the social sciences this model is rejected in favor of other accounts of explanation that allow for the effects of human intentionality and free will and that require the explainer to engage in imaginative interpretation of the meanings of events.
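
The third route — evidence adjusting our degree of belief in a hypothesis — is standardly formalized with Bayes' theorem (my gloss, not the paper's own notation):

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

Here $P(H)$ is the prior degree of belief in hypothesis $H$ and $P(H \mid E)$ the degree of belief after observing evidence $E$. The evidence raises our degree of belief exactly when $P(E \mid H) > P(E)$, i.e., when the evidence is more likely under the hypothesis than it is in general.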

Evidence in the legal context is information presented to prove or disprove a given fact. Eyewitness accounts under oath are examples of direct evidence; fingerprints are an example of circumstantial evidence.

Evidence in history depends on whether a historian belongs to the positivist or postmodernist camp. The former camp acknowledges the existence of a reality that is external and independent of human thought and uses evidence in the scientific sense, as a premise to form and test hypotheses. The latter group argues that reality is a construction of human thought and that "the facts" are statements endorsed by the group of people best equipped to impose their values over others. Evidence in this case speaks to historical and social arrangements rather than to generalized conclusions about reality.

Evidence in archives is closely related to the use of evidence by historians. Archival records can be viewed as evidence of "what really happened" or they can be a means of understanding the social structures and processes that contributed to record generation.

Documents contain information; they have the immediate property of being informative. They have meaning (either objectively or subjectively construed) and are a source of relevant ideas. Records stored in archives are different because in addition to having meaning they serve as potential evidence. Their "evidentiariness" is the relationship between the existence of the record and the occurrence of the events that produced it. From the record and its existence (premise) we can infer not only the content of the record, but also the events that produced it (conclusion). To realize their potential as evidence, archival records must be reliable (their content can be trusted as an accurate statement of facts) and authentic (they are what they purport to be). Records as evidence can be used for making inferences to a) context (the circumstances of the object's creation and the identity of its creators), b) function (uses of the object), c) meaning (individualized or conventionalized expressions of ideas).

This is an interesting and dense paper. However, after reading it, my doubts about conceptual analysis as a useful method of inquiry only increased. One of the main conclusions of the paper seems to be that information and evidence are not that different conceptually. They are used differently in information science versus archival science, but that's a matter of practice rather than of their inherent definitions. Both documents and archival records can serve as information or evidence, depending on their use. Concepts, especially those in use in different areas, stabilize in practice and function well without clear definitions. In archival science the question of how to evaluate records as potential evidence and, especially, what to include in or exclude from the archive is crucial (in other words, we need to know that what we save today can be used to make reliable inferences in the future). Is social epistemology, which was suddenly brought up at the end of the paper, an invitation to have an open discussion about this, or a way to overcome the positivist approach to evidence?