Apr 15, 2014

Survey of digital curation and curators

I am conducting a survey of digital curation and digital curators.

If you are involved in taking care of digital materials of any type, form, and purpose, and are interested in the advancement of digital curation as a professional field, feel free to take the survey and share it with colleagues. It takes about 20 minutes to complete and can be found at http://bit.ly/1osgTQ7

Feb 12, 2014

Kinds of data and their implications for data re-use

Notes on the paper "What Are Data? The Many Kinds of Data and Their Implications for Data Re-Use" (Journal of Computer-Mediated Communication, 2007, Vol. 12, No. 2, pp. 635–651):

The paper reports on an ethnographic study of data sharing practices in four projects that served as case studies. The goal was to reflect on the technical and social contexts of enabling data for re-use. The four projects were:

  • SkyProject – a collaborative project to build the data grid infrastructure for U.K. astronomy
  • SurveyProject – a project to produce a yearly large-scale complex survey dataset (~10,000 U.K. households)
  • CurationProject – a digitization and access project of artifacts and photographs collected since 1884
  • AnthroProject – a digitization of anthropological materials collected in a range of countries over one researcher's academic career

The fieldwork included interviews with participants recruited via a snowballing technique and by following the paths that the data took through each project, plus document analysis and observation (including project websites, conferences, and other face-to-face meetings). Participants included people involved in data collection, processing, analysis, and re-use.

Below are some observations on data practices at four stages of the data lifecycle:

  • Data Collection: some disciplines produce digital data, while others (e.g., AnthroProject) work with a mix of digital, non-digital, and legacy data (tapes, diaries, photographs, etc.). The labor of digitization is often overlooked, yet it remains an important part of the data collection stage.
  • Data Formatting: data need to be transformed (converted, re-formatted, flattened, etc.) to be re-usable by others. In SurveyProject, for example, variables are renamed and recoded, and files are renamed and loaded into a database. The processes of converting variables (e.g., words into numbers) and of successive renaming and restructuring make collected materials visible, manageable, communicable, and intelligible to others. Such transformations into manageable and communicable chunks are difficult for disciplines that see their primary goal as describing the specificities of particular contexts and drawing distinctions, as opposed to generalization and simplification.
  • Data Release: ownership, consent, and ethics differ depending on whether people are represented in the source data or not. In AnthroProject, where the point of the data is their subject-specificity, anonymization is largely impossible to achieve.
  • Data Re-Use: the case studies suggested that the histories and configurations of research communities influence how data are documented, contextualized, and checked for quality. Overall, the following aspects are important for data to become re-usable: conditions and context of data capture (e.g., atmospheric conditions or community place-time); instrument quality and calibration techniques; data points and variables to be collected; and transformation techniques (e.g., statistical methods and parameters).
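
To make the formatting step concrete, here is a minimal sketch of the kind of recoding SurveyProject's description suggests: textual responses are mapped to numeric codes and variables are renamed so the dataset becomes manageable for re-users. The variable names and the coding scheme are my own invention, not taken from the project.

```python
RESPONSE_CODES = {"yes": 1, "no": 0, "don't know": 9}  # assumed coding scheme
RENAME_MAP = {"q12_owns_home": "tenure"}               # assumed variable rename

def recode_record(record):
    """Return a new record with variables renamed and textual responses recoded."""
    out = {}
    for var, value in record.items():
        name = RENAME_MAP.get(var, var)
        # Convert words into numbers where a code exists; pass other values through.
        out[name] = RESPONSE_CODES.get(value.lower(), value) if isinstance(value, str) else value
    return out

raw = {"q12_owns_home": "Yes", "household_size": 3}
print(recode_record(raw))  # {'tenure': 1, 'household_size': 3}
```

The point of the exercise is visible even at this toy scale: the recoded record is uniform and machine-comparable, but the specificity of the original answer ("Yes", said how, in what context?) is gone.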

This study introduced some interesting observations, but it lacked the detail needed to make them meaningful. Without such detail and specificity, it is hard to be convinced by the claims being made. For example, what questions were asked during the interviews? Were there differences between the responses and the information in documents, meetings, etc.? Could comparisons be made across all four projects for each stage of the data lifecycle? Are the quotes mere illustrations, or are they indicative of broader patterns?

Jan 17, 2014

Evidence, a conceptual analysis

Notes from "Conceptual Analysis: A Method for Understanding Information as Evidence, and Evidence as Information" by J. Furner (Archival Science, 2004, 4, 233–265):

The paper examines the concept of evidence and compares it to the concept of information. Conceptual analysis treats concepts as classes of objects or events and seeks to define the meaning of a given concept by specifying the conditions under which any entity could be classified under it. The goal of conceptual analysis as a method of inquiry is to improve our understanding of the ways in which particular concepts are used to communicate ideas. The utility of the method lies in producing provocative or interesting ideas that can be pursued as directions for further research.

Definition: Evidence is that which we consider or interpret in order to draw or infer a conclusion about some aspect of the world.

Evidence is a concept used in science, law, history, and archival practice/science. In science, evidence is an important component of theories of induction (inference from a set of premises) and of explanation (specification of a sequence of cause and effect). Induction can proceed by generalizing from evidence (e.g., from observing white swans we conclude that all swans are white), by deducing consequences from a hypothesis (we hypothesize that cigarette smoking causes cancer and expect a smoker to have lung disease; if the smoker has the disease, the hypothesis is confirmed), or by calculating the effect of observed evidence on our degree of belief in a hypothesis (evidence raises our degree of belief and increases the probability of the hypothesis). Hypothetico-deductive reasoning suffers from underdetermination (the observed evidence can be entailed by more than one hypothesis). Deductive explanations also assume the existence of a law (the deductive-nomological model), which is problematic for the social sciences. There, this model is rejected in favor of other accounts of explanation that allow for the effects of human intentionality and free will and that require the explainer to engage in imaginative interpretation of the meanings of events.
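The third route, calculating the effect of evidence on a degree of belief, is essentially Bayesian updating. A small sketch, with made-up probabilities for the smoking example (the numbers are illustrative, not from the paper):

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H|E) via Bayes' theorem; the denominator expands P(E) over H and not-H."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# Assumed numbers: prior belief of 0.3 in the hypothesis; observing lung
# disease in a smoker is more likely if the hypothesis is true (0.8 vs. 0.2).
p = posterior(0.3, 0.8, 0.2)
print(round(p, 3))  # 0.632
```

The evidence raises the degree of belief from 0.3 to about 0.63 without ever "proving" the hypothesis, which is exactly how this account differs from deductive confirmation.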

Evidence in the legal context is information presented to prove or disprove a given fact. Eyewitness accounts given under oath are examples of direct evidence; fingerprints are an example of circumstantial evidence.

Evidence in history depends on whether a historian belongs to the positivist or the postmodernist camp. The former camp acknowledges the existence of a reality that is external to and independent of human thought and uses evidence in the scientific sense, as a premise for forming and testing hypotheses. The latter group argues that reality is a construction of human thought and that "the facts" are statements endorsed by the group of people best equipped to impose their values on others. Evidence in this case speaks to historical and social arrangements rather than to generalized conclusions about reality.

Evidence in archives is closely related to the use of evidence by historians. Archival records can be viewed as evidence of "what really happened" or they can be a means of understanding the social structures and processes that contributed to record generation.

Documents contain information; they have the immediate property of being informative. They have meaning (objectively or subjectively construed) and are a source of relevant ideas. Records stored in archives are different because, in addition to having meaning, they serve as potential evidence. Their "evidentiariness" is the relationship between the existence of a record and the occurrence of the events that produced it. From a record and its existence (premise) we can infer not only the content of the record but also the events that produced it (conclusion). To realize their potential as evidence, archival records must be reliable (their creators can be trusted to tell the truth) and authentic (the records are what they purport to be). Records as evidence can be used to make inferences about a) context (the circumstances of the object's creation and the identity of its creators), b) function (the uses of the object), and c) meaning (individualized or conventionalized expressions of ideas).

This is an interesting and dense paper. However, after reading it, my doubts about conceptual analysis as a useful method of inquiry only increased. One of the main conclusions of the paper seems to be that information and evidence are not that different conceptually. They are used differently in information science versus archival science, but that is a matter of practice rather than of their inherent definitions. Both documents and archival records can serve as information or as evidence, depending on their use. Concepts, especially those used across different areas, stabilize in practice and function well without clear definitions. In archival science, the question of how to evaluate records as potential evidence and, especially, what to include in or exclude from the archive is crucial (in other words, we need to know that what we save today can support reliable inferences in the future). Is social epistemology, which was suddenly brought up at the end of the paper, an invitation to an open discussion of this question, or a way to overcome the positivist approach to evidence?

Jan 13, 2014

Identifier Test-bed Activities Report (ESIP Federation)

Below is a brief summary of a recent report to the ESIP Federation's Data Stewardship Committee that evaluated identifier schemes for Earth system science data and information (see also the executive summary and links). The report appears to be a hands-on continuation of the 2011 paper "On the utility of identification schemes for digital earth science data: an assessment and recommendations" by Ruth Duerr and others (link).

The paper introduced four use cases and three assessment criteria:

Use cases:
  • unique identification (identify a piece of data, no matter which copy)
  • unique location (locate an authoritative copy)
  • citable location (identify cited data)
  • scientifically unique identification (to tell whether two data instances have the same info even if the formats are different)
Assessment criteria:
  • Technical value (e.g., scalability, interoperability, security, compatibility, technological viability)
  • User value (e.g., publishers' commitment, transparency)
  • Archive value (e.g., maintenance, cost, versatility)

The report took those use cases, expanded the assessment criteria, and used them to test the implementation of nine identification schemes (DOI, ARK, UUID, XRI, OID, Handles, PURL, LSID, and URI/URN/URL) against two datasets: the Glacier Photo Collection from the National Snow and Ice Data Center (JPEG and TIFF images) and a numerical dataset from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) sensor.

Report recommendations:

  • UUIDs are the most appropriate as unique identifiers; any other use requires extra effort.
  • DOI, ARK, and Handles are the most suitable as unique locators; DOI and ARK also support citable locators. Handles require a local dedicated server. ARKs are cheaper than the others, but DOIs are more widely accepted by publishers.
  • PURL has no means of creating opaque identifiers, and its API support for batch operations is poor.
  • The rest of the ID schemes are less suitable.
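
The UUID finding is easy to see in practice: the standard library can mint globally unique, opaque identifiers with no registry or resolver behind them, which is why they cover the unique-identification use case and nothing more. A sketch using Python's uuid module (the granule URL is hypothetical):

```python
import uuid

# Random (v4) UUIDs are opaque and unique without any central authority,
# but they carry no location or citation semantics.
granule_id = uuid.uuid4()

# Name-based (v5) UUIDs are deterministic: the same name in the same
# namespace always yields the same identifier, so an ID can be re-derived
# from metadata rather than stored.
a = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/modis/granule/42")
b = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/modis/granule/42")
print(a == b)  # True
```

Nothing in either identifier tells you where the data live or how to cite them; locating an authoritative copy is exactly what schemes like DOI, ARK, and Handles add on top.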

It seems that the overall conclusion is that DOI and ARK are generally better, but that a system still needs to support multiple ID schemes. From the report I couldn't quite tell whether any of the schemes can support the fourth use case, scientifically unique identification; the 2011 paper argued that "none of the identifier schemes assessed here even minimally address this use case".

Nov 19, 2013

DLF forum notes: Data – sharing – libraries – culture

A recent Digital Library Federation (DLF) forum started with an inspiring keynote by R. D. Lankes, who challenged the monopolies of content delivery in higher education and the supposed neutrality of the library profession. He argued that librarians should improve society by facilitating knowledge creation, which includes not only providing access to resources but also teaching literacy, genres, and communication, creating learning environments, and motivating people to learn and do research. He also emphasized that a librarian is somebody who has training, professional experience, or spirit, thereby shifting the emphasis from degrees to actual knowledge and passion.

Less successful was the birds-of-a-feather (BoF) session I led about the Research Data Alliance (RDA). BoFs happened during lunch, which is probably not a good time to learn about a new organization. Spreading the word about new initiatives is also hard because, being new, they offer more anticipation and preparatory work than something ready to use. I had several good conversations about RDA, but there could have been more.

All sessions were informative and productive, but the one that got me thinking was one I couldn't attend: "Creating the New Normal: Fostering a Culture of Data Sharing with Researchers". It's a rich topic, so below is some food for thought: my interpretation of the session theme, based on the materials prepared by the organizers and the community notes taken by participants during the session.

The session was based on the Data Information Literacy (DIL) project, which seeks to identify skills, capacities, and toolkits for data management and to leverage information literacy to bridge the disconnect between faculty, graduate students (who are often the de facto data managers in scientific labs), and librarians.

The disconnect stems from different needs in relation to data: data collection vs. analysis vs. preservation and access. Even though researchers may recognize the importance of data management, they rarely consider it part of the curriculum or of an articulated research culture. To simplify: graduate students collect data, and both faculty and students analyze it and work on publications. The messiness of data collection and storage bothers many researchers, but they don't necessarily know what to do about it.

Librarians are increasingly interested in incorporating research data into their library collections and providing access to it as part of their service. They would like datasets to be better prepared for storage and sharing, and they are ready to help. They also have a mission of helping others to address their information needs, but they don’t necessarily have the power to do that. How could these disconnects be bridged?

By working through the three scenarios proposed for discussion, the session seemed to approach bridging these disconnects by embedding librarians/data specialists into research teams and projects: learning about researchers' needs and helping them with their data while teaching them good data management practices. Librarians' expertise can be useful in the areas of file organization, metadata, and tools for storage and sharing.

This approach is good, but it’s quite demanding in terms of resources. I agree that data literacy should be embedded in a larger teaching of proper information management (including ethics, security, authoritativeness, etc.) and this could help re-use existing channels of library instruction and minimize resources. At the same time, I wonder whether we should also think about the culture of data sharing in terms of private/public epistemic objects. Data is still a private object, which will constantly undermine the "new normal" of sharing, because we don’t typically share objects we consider private. The new norm then should be that data are "shareable" from the beginning.

In some organizations and domains, data are already open and shareable, for example, data from the large observatories supported by the U.S. government. Consequently, data producers in those organizations may have better data information literacy. For other, more "private" domains, particularly in the social sciences, there are still many barriers and fears. Would a public culture of data sharing mean peer review of data collection instruments? Survey respondents as co-owners of data? Data curators as independent decision-makers with regard to choosing and preparing data for public use? A lot of possibilities come with the idea of data sharing cultures.