Oct 2, 2013

About research objects

Notes from the article by Bechhofer, Buchan, De Roure, Missier, Ainsworth et al. "Why linked data is not enough", Future Generation Computer Systems, 2011, (pdf).

Scientific research is increasingly digital and collaborative, so a new framework is needed to facilitate the reuse and exchange of digital knowledge. Simply publishing data neither reflects the research methodology nor respects the rights and reputation of the researcher.

The concept of Research Objects (ROs) as semantically rich aggregations of resources can serve as a cornerstone of such a framework. ROs would include research questions, hypotheses, abstracts, organisational context (e.g., ethical and governance approvals, investigators), study design, methods (workflows, scripts, services, software packages, etc.), data, results, answers (e.g., publications, slides, DOIs), etc. The authors argue that this approach is better than linked data, but later they acknowledge that linked data works fine and only needs to be revised and extended.
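To make the idea of an RO as an aggregation more concrete, here is a minimal sketch of how such a bundle could be modelled as a data structure. This is my own illustration, not a model from the paper: the class names, the field names (`kind`, `uri`, `investigators`) and the example.org URIs are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """One aggregated item. Field names are illustrative, not from the paper."""
    kind: str  # e.g. "hypothesis", "workflow", "dataset", "publication"
    uri: str


@dataclass
class ResearchObject:
    """Hypothetical container: an identified aggregation plus some context."""
    uri: str
    title: str
    investigators: List[str] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)

    def add(self, kind: str, uri: str) -> None:
        """Aggregate one more resource into this RO."""
        self.resources.append(Resource(kind, uri))

    def by_kind(self, kind: str) -> List[Resource]:
        """Return all aggregated resources of the given kind."""
        return [r for r in self.resources if r.kind == kind]


# Made-up example: one study with a hypothesis, a workflow and a dataset.
ro = ResearchObject(uri="http://example.org/ro/1", title="Example study")
ro.add("hypothesis", "http://example.org/ro/1/hypothesis.txt")
ro.add("workflow", "http://example.org/ro/1/analysis.t2flow")
ro.add("dataset", "http://example.org/ro/1/input.csv")
print([r.uri for r in ro.by_kind("workflow")])
```

The point of the sketch is only that the parts are typed and travel together under one identifier; a real RO would also need the relationships between the parts.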

Important assumptions in the paper:

  • ROs work well in the context of e-Laboratories, i.e., environments that are mostly based on automated management systems and execution of in silico experiments.
  • Reproducible research is ultimately possible in any domain and always desirable.
  • All elements of scientific research can be made explicit and encoded in a machine-readable way, if not now, then in the future.

Terms that refer to different kinds of reusability:

  • Reusable - reuse as a whole or single entity.
  • Repurposeable - reuse as parts, e.g., taking an RO and substituting alternative services or data for those used in the study.
  • Repeatable - repeat the study, perhaps years later.
  • Reproducible - reproduce or replicate a result (start with the same inputs and methods and see if a prior result can be confirmed).
  • Replayable - automated studies can be replayed rather than executed again.
  • Referenceable - citations for ROs.
  • Revealable - audit the steps performed in the research in order to be convinced of the validity of results.
  • Respectful - credit and attribution.

The authors describe several environments that try to implement the approach of aggregating resources into ROs.

  • myExperiment Virtual Research Environment relies on the notion of "packs", collections of items that can be shared as a single entity.
  • Systems Biology of Microorganisms (SysMO) project has a web platform SysMO-DB and a catalog SysmoSEEK. It relies on JERM (Just Enough Results Model), which is based on the ISA (Investigation/Study/Assay) format. Another approach to support ROs within the systems biology community is SBRML (Systems Biology Results Markup Language). Most of the experiments in this domain are wet lab experiments, so traceability and referenceability are more relevant than repeatability and replayability.
  • MethodBox is part of the Obesity e-Lab, which allows researchers to "shop for variables" from studies related to obesity in the UK. The paper doesn't describe what method is used to support RO aggregations.

Packs in myExperiment are the most advanced implementation of the idea of ROs and, ironically, they are based on linked data: "Work in myExperiment makes use of the OAI-ORE vocabulary and model in order to deliver ROs in a Linked Data friendly way" (p. 10).

OAI-ORE defines standards for the description and exchange of aggregations of Web resources. It is agnostic to relationship types, so it needs to be extended. The authors propose the following extensions: the Research Objects Upper Model (ROUM) and the Research Object Domain Schemas (RODS). ROUM provides basic vocabulary to describe general properties of ROs, such as the basic lifecycle states. RODS provide domain-specific vocabulary. Few details are provided about these two extensions.
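A rough sketch of what "agnostic to relationship types" means in practice: OAI-ORE can say that a pack aggregates some resources, but saying what role each resource plays requires an extra vocabulary. The snippet below writes such a description as N-Triples using only the Python standard library. The ORE terms (`ResourceMap`, `Aggregation`, `describes`, `aggregates`) are real; the `EX` namespace, the `playsRole` predicate, the role names and all example.org URIs are hypothetical stand-ins, not actual ROUM or RODS terms, since the paper gives few details about those.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical extension vocabulary, standing in for something like ROUM/RODS.
EX = "http://example.org/ro-vocab/"


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_pack(map_uri, pack_uri, members):
    """Describe an aggregation; members is a list of (resource_uri, role)."""
    lines = [
        triple(map_uri, RDF_TYPE, ORE + "ResourceMap"),
        triple(map_uri, ORE + "describes", pack_uri),
        triple(pack_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    for resource_uri, role in members:
        # Plain ORE: the pack contains the resource, nothing more.
        lines.append(triple(pack_uri, ORE + "aggregates", resource_uri))
        # Assumed extension: what the resource is for within the study.
        lines.append(triple(resource_uri, EX + "playsRole", EX + role))
    return lines


pack = describe_pack(
    "http://example.org/packs/1.rdf",
    "http://example.org/packs/1",
    [
        ("http://example.org/workflows/12", "Method"),
        ("http://example.org/files/34", "InputData"),
    ],
)
print("\n".join(pack))
```

Without the `playsRole` lines, a consumer would know that the workflow and the file belong together but not which one is the method and which one is the input, and that is the gap the proposed extensions are meant to fill.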

Rather than arguing that linked data is not enough, the paper seems to argue that current implementations of linked data for packaging scientific results need to be revised to explicitly include the structure of aggregations. The purpose of articulating structure in a machine-readable way is to create an environment where every component of research (including hypotheses, methods, data and results) can be re-enacted. A more obvious and important conclusion from the discussion about ROs is that (a) we need to keep encouraging exchange and sharing of research in ways that are more transparent, and (b) there is still a shortage of platforms to do that. myExperiment is a nice example, but it's still domain- and platform-specific.

The approach described in this paper is quite forward-looking. It is a call for rather radical changes in scientific practices. I wonder how many labs have automated experiment management environments where all datasets, workflows, scripts and results can be connected and reconstructed without much back-channeling. Another question is how much effort it takes to create ROs in a way that would make science fully "re-enactable". We probably won't be able to do that with legacy data.