DIKW: Data, Information, Knowledge, Wisdom: metadata

Showing posts with label metadata. Show all posts

Oct 19, 2015

Making progress in data sharing

A few useful tips on making progress in data sharing in a blog post "Data Sharing: Access, Language, and Context All Matter":

To make the global data system less fragmented and disorganized, create data portals with good human-centered designs and support users with varying levels of expertise

JSON and XML are great, but humans read data too. These formats are critical to fueling innovation, but make sure CSVs are available as well

Responsible data use demands proper attention to metadata. Document datasets and don't ignore ReadMe files while re-using them

Feb 20, 2015

Repository features to motivate more data sharing

One of the challenges of creating data stewardship infrastructure is engaging the users and meeting and prioritizing their needs, particularly the needs of long-tail science research. "What would motivate researchers to make their data available?" is a question we continuously grapple with. A recent study "Potential contributor perspectives on desirable characteristics of an online data environment for spatially referenced data" published in First Monday asked a very similar question in the context of geographic data. The researchers hypothesized that potential data contributors of small scale, local spatial data would be more willing to share their data if a repository included a simple, clear licensing mechanism, a simple process for attaching descriptions to the data, and a simple post-publication peer evaluation/commenting mechanism.

The paper draws on 10 qualitative interviews and 110 responses to an online questionnaire. The qualitative interview responses were mixed; they don't seem to reveal any patterns or unusual concerns. Some of the quantitative results were also mixed, but some provide good numbers to support the hypotheses:

90% of respondents said attribution (licensing) is important

62% think that non-commercial attribution is important
54% think that restricting re-use is important, i.e., others may use the data but not modify it in any way

93% said ability to attach keywords or other descriptions to data is important
78% said that commenting capability is important
85% said that stability and long-term maintenance of the repositories matters

Conclusion:

This research, subject to the caveats listed below, suggests that it would be desirable from the perspective of potential contributors of data to provide infrastructure capability that would:

allow users to attach conditions to the use of their data;

provide basic information that could be translated into standards based metadata; and,

receive comments and feedback from users.

Jan 5, 2015

Data politics and the dark sides of data

A bit lengthy post about the dark sides of data discusses whether data and its vast amounts and ubiquitous collection mechanisms help to "tell the truth to power", i.e., to change the world for the better. Will picking up the traces and revealing wrongdoing fix the world? Most likely not, because it's not clear whether we will care or do anything because of data. Here is a great quote:

... lawyers cannot fix human rights abuses, scientists cannot fix global warming, whistle-blowers cannot fix secret services and activists cannot fix politics, and nobody really knows how global finances work - regardless of the data they have at hand...

The framework of countering power and problems with data need to be revised. It's not about quantity or even quality of data, but about using whatever little we know to address not only our understandings (i.e., our rational capacities), but also our feelings and beliefs. Here is what gets in the way (those dark sides that everyone should think about):

corporate infrastructure for data and its cultures that creates an illusion of free and neutral services (i.e., services that have no monetary and no political cost)
non-transparency of most digital data and lack of control over it, which prevents us from copying, deleting, or processing our own data
de-politicizing of digital data or constructing data as fuel for innovation and services rather than a ground for moral, ethical, and political decisions

The post is rather pessimistic, but changes do not happen at once, so we should probably keep trying.

Jan 13, 2014

Identifier Test-bed Activities Report (ESIP Federation)

Below is a brief summary from a recent report to ESIP Federation's Data Stewardship Committee that evaluated identifier schemes for Earth system science data and information(see also executive summary and links). The report seems to be a hands-on continuation of the paper published in 2011 "On the utility of identification schemes for digital earth science data: an assessment and recommendations" by Ruth Duerr and others(link).
The paper introduced four uses cases and three assessment criteria:

Use cases:

unique identification (identify a piece of data, no matter which copy)

unique location (locate an authoritative copy)

citable location (identify cited data)

scientifically unique identification (to tell whether two data instances have the same info even if the formats are different)

Assessment criteria:

Technical value (e.g., scalability, interoperability, security, compatibility, technological viability)

User value (e.g., publishers' commitment, transparency)

Archive value (e.g., maintenance, cost, versatility)

The report took those use cases, expanded assessment criteria and used all of it to test the implementation of 8 identification schemes, DOI, ARK, UUID, XRI, OID, Handles, PURL, LSID, and URI/URN/URL, using two datasets: the Glacier Photo Collection from the National Snow and Ice Data Center (JPEG and TIFF images) and a numerical data set from the NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) sensor.
Report recommendations:

UUID are most appropriate as unique identifiers, any other use requires effort.
DOI, ARK and Handles are the most suitable as unique locators, DOI and ARK also support citable locators. Handles need a local dedicated server. ARKs are cheaper than others, but DOIs are accepted by publishers.
PURL has no means for creating opaque identifiers and the API support for batch operations is poor.
The rest of the ID schemes are less suitable.

It seems that the overall conclusion is that DOI and ARK are generally better, but there is a need for support of multiple ID schemes in a system. From the report I didn't quite get whether any of the ID schemes can support the fourth use case - scientifically unique identification. The paper argued that "none of the identifier schemes assessed here even minimally address this use case".

Oct 2, 2013

About research objects

Notes from the article by Bechhofer, Buchan, De Roure, Missier, Ainsworth et al. "Why linked data is not enough", Future Generation Computer Systems, 2011, (pdf).

Scientific research is increasingly digital and collaborative, therefore a new framework is needed that would facilitate the reuse and exchange of digital knowledge. Simply publishing data fails to reflect the research methodology and respect the rights and reputation of the researcher.

The concept of Research Objects (ROs) as semantically rich aggregations of resources can serve as a cornerstone of such new framework. ROs would include research questions, hypotheses, abstracts, organisational context (e.g., ethical and governance approvals, investigators, etc.), study design, methods (workflows, scripts, services, software packages. etc.), data, results, answers (e.g., publications, slides, DOIs), etc. The authors argue that this approach is better than linked data, but later they acknowledge that linked data works fine, it just needs to be revised and extended.

Important assumptions in the paper:

ROs work well in the context of e-Laboratories - environments that are mostly based on automated management systems and execution of in silico experiments
Reproducible research is ultimately possible in any domain and always desirable.
All elements of scientific research can be made explicit and encoded in a machine-readable way, if not now, then in the future.

Terms that refer to different ways of reusability:

Reusable - reuse as a whole or single entity.
Repurposeable - reuse as parts, e.g., taking an RO and substituting alternative services or data for those used in the study.
Repeatable - repeat the study, perhaps years later.
Reproducible - reproduce or replicate a result (start with the same inputs and methods and see if a prior result can be confirmed).
Replayable - automated studies can be replayed rather than executed again.
Referenceable - citataions for ROs.
Revealable - audit the steps performed in the research in order to be convinced of the validity of results.
Respectful - credit and attribution.

The authors describe several environments that try to implement aggregation of resources into ROs approach.

myExperiment Virtual Research Environment relies on the notion of "packs", collections of items that can be shared as a single entity.
Systems Biology of Microorganisms (SysMO) project has a web plaform SysMO-DB and a catalog SysmoSEEK. It relies on a JERM (Just Enough Results Model), which is based on the ISA (Investigation/Study/Assay) format. Another approach to support ROs within the systems biology community is SBRML (Systems Biology Results Markup Language). Most of the experiments in this domain are wet lab experiments, so traceability and referenceability are more relevant than repeatability and replayability.
MethodBox is part of the Obesity e-Lab, that allows researchers to "shop for variables" from studies related to obesity in the UK. The paper doesn't describe what method is used to support RO aggregations.

Packs in myExperiment is the most advanced implementation of the idea of ROs and, ironically, it's based on linked data: "Work in myExperiment makes use of the OAI-ORE vocabulary and model in order to deliver ROs in a Linked Data friendly way" (p. 10).

OAI-ORE defines standards for the description and exchange of aggregations of Web resources. It is agnostic to relationship types, so it needs to be extended. The authors propose the following extensions: the Research Objects Upper Model (ROUM) and the Research Object Domain Schemas (RODS). ROUM provides basic vocabulary to describe general properties of RO, such as the basic lifecycle states. RODS provide domain specific vocabulary. Not much details are provided about these two extensions.

Rather than arguing that linked data is not enough, it seems that the paper argues that current implementations of linked data in packaging scientific results needs to be revised to explicitly include the structure of aggregations. The purpose of articulating structure in a machine-readable way is to create an environment where every component of research (including hypotheses, methods, data and results) can be re-enacted. A more obvious and important conclusion from the discussion about ROs is that a) we need to keep encouraging exchange and sharing of research in ways that are more transparent; b) there is still a shortage of platforms to do that. MyExperiment is a nice example, but it's still domain and platform-specific.

The approach described in this paper is quite forward-looking. It is a call for rather radical changes in scientific practices. I wonder how many labs have automated experiment management environments where all datasets, workflows, scripts and results can be connected and reconstructed without much back-channeling. Another question is how much effort it takes to create ROs in a way that would make science fully "re-enactable". We probably won't be able to do that with legacy data.

Apr 25, 2013

Strategy for Civil Earth Observations - Data Management for Societal Benefit

The US National Science and Technology Council recently released a National Strategy for Civil Earth Observations. The goal of this strategy is to provide a framework for developing a more detailed plan that would enable "stable, continuous, and coordinated global Earth observation capabilities for the benefit of society."

The strategy establishes a way to evaluate Earth-observing systems and their information products around 12 societal benefit areas: agriculture and forestry, biodiversity, climate, disasters, ecosystems (terrestrial and freshwater), energy and mineral resources, human health, ocean and coastal resources, space weather, transportation, water resources, weather, and reference measurements. The production and dissemination of information products should be based on the following principles:

Full and open access
Timeliness
Non-discrimination
Minimum cost
Preservation
Information quality
Ease of use

Data management in federal agencies that are responsible for earth science data is described based on the three components of the data life cycle: planning and production, data management, and usage. The latter two components are the main focus of the data management strategy. The suggested activities for those are:

Data management

Data collection and processing - initial steps to store data and create usable data records.
Quality control - follow the principles of the “Quality Assurance Framework for Earth Observation” (QA4EO)
Documentation - basic information about the sensor systems, location and time available at the moment of data collection, etc.
Dissemination - data should be offered in formats that are known to work with a broad range of scientific or decision-support tools. Common vocabularies, semantics, and data models should be employed.
Cataloging - establishing formal standards-based catalog services, building thematic or agency-specific portals, enabling commercial search engines to index data holdings, and implementing emerging techniques such as feeds, self-advertising data, and casting.
Preservation and stewardship - guarantee the authenticity and quality of digital holdings over time.
Usage tracking - measuring whether the data are actually being used; to enable better usage tracking, data should be made available through application programming interfaces (APIs).
Final disposition - not all data and derived products must be archived, derived products that most users have access to may adequately replace raw data and processing algorithms.

Usage activities

Discovery - enabled by dissemination, cataloging and documentation activities.
Analysis - includes quick evaluaionts to assess the usefulness of a data set and an actual scientific analysis.
Product generation - creating new products by averaging, combining, differencing, interpolating, or assimilating data.
User feedback - mechanisms to provide feedback to improve usability and resolve data-related issues.
Citation - different data products, e.g., classifications, model runs, data subsets, etc., need to be citable.
Tagging - identify a data set as relevant to some event, phenomenon, purpose, program, or agency without needing to modify the original metadata.
Gap analysis - the determination by users that more data are needed, which influences the requirements-gathering for new data life cycles.

Each activity raises a lot of questions and challenges. The activities of cataloging, usage tracking, final disposition, tagging and gap analysis are particularly interesting. They raise questions that are rarely addressed in the data management literature. Does anybody use data that are being shared? Do all the data need to be preserved? How can we avoid duplicates and unnecessary modifications of metadata if data are being re-used? To what extent do we need to serve immediate user interests versus the future possibilities for research?

Apr 19, 2013

NIH report: Big data recommendations based on small data?

I've been browsing slides from the last BRDI Symposium, "Finding the Needle in the Haystack: A Symposium on Strategies for Discovering Research Data Online", and found a report for the National Institutes of Health about the management and analysis of large biomedical research data (pdf available here).

It is an interesting report that provides a lot of details about data and technologies in biomedical research as well as about existing efforts in data sharing. The recommendations make sense, since they follow most of the recommendations with regard to research data - more money, more policy, more training:

Promote data sharing by establishing a minimal metadata framework for data sharing, creating catalogs and tools and enhancing data sharing policy for NIH-funded research.
Support the development and dissemination of informatics methods and applications by funding software development.
Train the workforce in quantitative sciences by funding quantitative training initiatives and enhancing review expertise in quantitative methods of bioinformatics and biostatistics.

Even more interesting is what evidence is provided to support these recommendations. The report is based on a relatively small literature corpus (~25 citations plus footnotes) and on the analysis of comments that were solicited via an NIH request for information on management, integration, and analysis of large biomedical datasets. Overall, 50 respondents replied and made 244 suggestions. Is it enough data to make recommendations for NIH? If we begin with the assumption that more support for large datasets and biomedical computations is needed (which seems to be the case with this report), then there is almost no need to analyze costs and benefits of data sharing, the role of large datasets in providing solutions for biomedical problems, and so on.

Sep 6, 2012

Digital preservation issues

Notes from Digital preservation, archival science and methodological foundations for digital libraries (S. Ross, 2012, New Review of Information Networking, 17:1, 43-68, doi).

At the beginning the article makes an important observation - there is more to preserving digital objects than saving the content. Approaches to preservation should also include a) retaining the environment and context of creation and use and b) reproducing the experience of use.

The middle of the article is of lesser interest. Many points, such as a lack of systematic practices, policies or research strategies in preservation, can be skipped.

Suggestions for research agenda in this paper come primarily from the DigitalPreservationEurope project (DPE). There are nine important areas of work in digital preservation:

Restoration - restoring damaged digital objects, including content, context and experience and verifying their completeness.
Conservation - saving digital objects before they are damaged and making sure they cannot be damaged or destroyed in the future.
Collection management - making decisions about what goes in and out, etc.
Risk management - determining and quantifying uncertainties and minimizing various threats.
Interpretability and functionality - making sure digital objects remain meaningful, authentic, and usable.
Cohesion and interoperability - maintaining connections and transitions across systems, time, and repositories.
Automation - developing tools for handling big quantities of information.
Preserving the context - retaining information about how the object was created and used.
Storage - developing infrastructure for storing digital objects.

The article concludes with the statement that there is an urgent need for a theory of digital preservation and curation. To me it seems that we have enough theories to rely on. Once structures (technological and social) that support digital preservation become adopted and used, we can start observing existing practices and then decide whether we need a new theory. Otherwise, there is a danger of coming up with something trivial and calling it a new theory.

Aug 23, 2012

Metadata webinar

Notes from the NISO / DCMI webinar "Metadata for managing scientific research data".

General impression: it seems that people who research metadata (and larger information/knowledge organization issues) are so deep into their domains that they think everybody else knows nothing about data/metadata. Perhaps, the audience of this webinar consisted largely of people who are unaware of anything relate to this topic. And that's why the first half hour was spent on pretty simple and uninformative issues of "what is data-metadata-science".

I heard such conversations so many times without any progress, that I began to think we should just skip it and move on. No agreed upon definitions can ever be provided for any more-less complex concept. And still talking about metadata as "data about data" is almost embarrassing. It's better to emphasize that having a shared description of data, e.g., who created them, where they come from, what they are about, etc., helps to produce good and verifiable research and to (re)use the data in the future.

As for how to create metadata, it seems that it still needs to be figured out and systematized, so researchers and librarians are on their own. The metadata world is messy. Possible criteria for the selection and evaluation of metadata schemes include:

From Public Broadcasting Metadata Dictionary Project

Objectives/principles, such as interoperability, specific needs, expertise required.
Domains (genre focus, format variation)
Architectural layout (flat, hierarchical, granular, etc.)

And below are some common schemes according to their level of complexity:

Simple (interoperable, easy to generate, multidisciplinary, flat, 15-25 properties)	Moderate (requires some expertise, more domain focused, extensible via connecting to other schemes)	Complex (requires domain expertise, hierarchical, many properties):
Dublin Core	Darwin Core	FGDC Content standard for digital geospatial metadata
MARC	Access to biological collections data (ABCD)	FGDC Content standard for digital geospatial metadata
DataCite	Ecological metadata language	Data Documentation Initiative (DDI)

A couple of interesting questions/challenges: how to integrate metadata creation into social settings / workflows, automated generation of metadata, metadata as linked data.

Pages