Apr 25, 2013

Strategy for Civil Earth Observations - Data Management for Societal Benefit

The US National Science and Technology Council recently released a National Strategy for Civil Earth Observations. The goal of this strategy is to provide a framework for developing a more detailed plan that would enable "stable, continuous, and coordinated global Earth observation capabilities for the benefit of society."

The strategy establishes a way to evaluate Earth-observing systems and their information products around 12 societal benefit areas: agriculture and forestry, biodiversity, climate, disasters, ecosystems (terrestrial and freshwater), energy and mineral resources, human health, ocean and coastal resources, space weather, transportation, water resources, and weather, as well as reference measurements. The production and dissemination of information products should be based on the following principles:

  • Full and open access
  • Timeliness
  • Non-discrimination
  • Minimum cost
  • Preservation
  • Information quality
  • Ease of use

Data management in federal agencies that are responsible for earth science data is described in terms of the three components of the data life cycle: planning and production, data management, and usage. The latter two components are the main focus of the data management strategy. The suggested activities for these two components are:

  • Data management
    • Data collection and processing - initial steps to store data and create usable data records.
    • Quality control - follow the principles of the “Quality Assurance Framework for Earth Observation” (QA4EO)
    • Documentation - basic information about the sensor systems, location and time available at the moment of data collection, etc.
    • Dissemination - data should be offered in formats that are known to work with a broad range of scientific or decision-support tools. Common vocabularies, semantics, and data models should be employed.
    • Cataloging - establishing formal standards-based catalog services, building thematic or agency-specific portals, enabling commercial search engines to index data holdings, and implementing emerging techniques such as feeds, self-advertising data, and casting.
    • Preservation and stewardship - guarantee the authenticity and quality of digital holdings over time.
    • Usage tracking - measuring whether the data are actually being used; to enable better usage tracking, data should be made available through application programming interfaces (APIs).
    • Final disposition - not all data and derived products must be archived; derived products that most users have access to may adequately replace raw data and processing algorithms.
  • Usage activities
    • Discovery - enabled by dissemination, cataloging and documentation activities.
    • Analysis - includes quick evaluations to assess the usefulness of a data set as well as the actual scientific analysis.
    • Product generation - creating new products by averaging, combining, differencing, interpolating, or assimilating data.
    • User feedback - mechanisms to provide feedback to improve usability and resolve data-related issues.
    • Citation - different data products, e.g., classifications, model runs, data subsets, etc., need to be citable.
    • Tagging - identify a data set as relevant to some event, phenomenon, purpose, program, or agency without needing to modify the original metadata.
    • Gap analysis - the determination by users that more data are needed, which influences the requirements-gathering for new data life cycles.
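
To make the "product generation" activity a bit more concrete, here is a minimal sketch of three of the operations it lists: averaging, differencing, and interpolating. It is only an illustration and is not taken from the strategy. The function names, the two sensors, and all the numbers are hypothetical, and the interpolation assumes observation times that are strictly increasing.

```python
from statistics import mean


def average(series_list):
    """Element-wise mean of several equally long series."""
    return [mean(values) for values in zip(*series_list)]


def difference(a, b):
    """Element-wise difference a - b of two equally long series."""
    return [x - y for x, y in zip(a, b)]


def interpolate(times, values, t):
    """Linear interpolation at time t.

    Assumes `times` is strictly increasing and has the same length as `values`.
    """
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the observed range")
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    # Only reached when there is a single observation and t equals its time.
    return values[0]


# Hypothetical readings from two sensors at hours 0, 6, and 12 (made-up values).
hours = [0, 6, 12]
sensor_a = [10.0, 12.0, 14.0]
sensor_b = [11.0, 13.0, 15.0]

combined = average([sensor_a, sensor_b])   # a new "averaged" product
offset = difference(sensor_b, sensor_a)    # a new "difference" product
at_hour_3 = interpolate(hours, sensor_a, 3)  # value estimated between readings

print(combined)
print(offset)
print(at_hour_3)
```

Each derived list is a new product in its own right, which is why the citation and final disposition activities matter: someone has to decide whether the averaged series, the raw readings, or both should be kept and cited.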

Each activity raises a lot of questions and challenges. The activities of cataloging, usage tracking, final disposition, tagging and gap analysis are particularly interesting. They raise questions that are rarely addressed in the data management literature. Does anybody use data that are being shared? Do all the data need to be preserved? How can we avoid duplicates and unnecessary modifications of metadata if data are being re-used? To what extent do we need to serve immediate user interests versus the future possibilities for research?

Apr 19, 2013

NIH report: Big data recommendations based on small data?

I've been browsing slides from the last BRDI Symposium, "Finding the Needle in the Haystack: A Symposium on Strategies for Discovering Research Data Online", and found a report for the National Institutes of Health about the management and analysis of large biomedical research data (pdf available here).

It is an interesting report that provides a lot of detail about data and technologies in biomedical research as well as about existing efforts in data sharing. The recommendations make sense, since they echo the usual recommendations for research data - more money, more policy, more training:

  • Promote data sharing by establishing a minimal metadata framework for data sharing, creating catalogs and tools and enhancing data sharing policy for NIH-funded research.
  • Support the development and dissemination of informatics methods and applications by funding software development.
  • Train the workforce in quantitative sciences by funding quantitative training initiatives and enhancing review expertise in quantitative methods of bioinformatics and biostatistics.

Even more interesting is the evidence provided to support these recommendations. The report is based on a relatively small literature corpus (~25 citations plus footnotes) and on the analysis of comments that were solicited via an NIH request for information on management, integration, and analysis of large biomedical datasets. Overall, 50 respondents replied and made 244 suggestions. Is that enough data to make recommendations for NIH? If we begin with the assumption that more support for large datasets and biomedical computations is needed (which seems to be the case with this report), then there is almost no need to analyze the costs and benefits of data sharing, the role of large datasets in providing solutions for biomedical problems, and so on.

Feb 11, 2013

Philosophy and science: A need for Ph in PhD

An interesting piece was published recently in Science magazine: Shaking Up Science by Jennifer Couzin-Frankel. It's behind a paywall and pretty long, so here is a quick summary.

The essay is a story about two biologists, Ferric Fang from the University of Washington and Arturo Casadevall from Albert Einstein College of Medicine in the Bronx, New York. They were brought together by "disenchantment", i.e., they both had worries about what is going on in academia and science:

Discovery for its own sake was being sidelined by a push to publish in high-impact journals. Funding was scarcer than ever. Scientists focused on narrow fields and often couldn't communicate their professional passions at a cocktail party.

They were both editors of an immunology journal, so they started writing opinion pieces about grants, the peer review process, and so on. At some point they got interested in research misconduct: more specifically, how many papers are being retracted, where, and why. First, they wanted to see whether there is a connection between a journal's impact factor and its retraction rate. They searched the PubMed database and found a robust correlation - the higher the impact factor, the more retractions the journal had.

Then they looked closely at retractions between 1977 and 2000 and found that about 67% of all the retractions were attributed to scientific misconduct, including fraud and plagiarism.

The next step was (and it's usually the most difficult one) to figure out why it happens. I like the possible explanation, but it's not clear from the essay whether it was supported with evidence or not. It makes a lot of sense though.

The scientists believe that the race for grants and funding encourages misconduct.

"It's all about money," Fang says. "How can you be sure that you get money?" The answer comes back to publications—and sometimes skirting the rules to get them.

The story up to this point is more or less obvious. A lot of people talk about the problems with peer review and with funding science and scientists through grants (soft money). What is interesting is the kind of solutions that are proposed. The scientists argue for more generalized science education instead of extreme specialization, and for more philosophical training, particularly in epistemology and metaphysics, which encourage asking questions like "What is it that you know?" and "How do you know what you know?"

Even though we may never go back to making philosophy a required subject (which I had as part of my graduate studies in Russia), I think it'd be great. Asking broad questions about the nature of knowledge and, more importantly, its justifications and limits, encourages people to step back, look at the larger picture and think critically about what they're doing. By doing that the sciences can be what they're supposed to be - a self-correcting institution based on the Mertonian norms of communalism, universalism, disinterestedness, originality and skepticism.

Jan 9, 2013

Librarians and their skill set

The CLIR blog recently posted a piece on re-skilling for librarians by Christa Williford, focusing on digital humanities librarianship. What kind of skills do librarians need in order to be relevant in contemporary research environments? The list can be pretty long; moreover, there might be multiple lists.

Another list was proposed in a report that Christa mentioned, “Re-skilling for research” by Research Libraries UK (RLUK). The report contains results of a series of studies that aimed to map the needs of researchers onto tasks to be undertaken by subject librarians.

The report is long, but the message is the same over and over: librarians’ roles and skills are quite limited and traditional; they do not match the needs. Subject librarians are not involved at the early stages of research that involve conceptualization and planning. Most of the services are still offered in the areas of literature search and information management (how to store and organize everything). Services that are related to data collection, management, analysis and preservation are in their infancy at best.

The list of 32 skills in this report includes:

  • deep knowledge of the discipline/subject
  • understanding of research experiences and workflows
  • knowledge of funding sources and mandates of funding agencies
  • knowledge of storage and information management techniques
  • knowledge of data sources and data manipulation techniques and good familiarity with metadata and emerging technologies

In addition to all those skills, the report suggests that subject librarians need to move from the liaison model to the engagement model. In other words, librarians need to become proactive and seek opportunities to contribute to research teams at every stage of research.

While all those skill suggestions sound reasonable, I’m not sure that it’s realistic for librarians to sustain such a huge change. The skill sets described above require completely different training. Once a person goes through such training, librarianship might not be the most fulfilling career path for them. Perhaps re-skilling in librarianship should happen not in the areas of actual skills and knowledge sets, but in the orientation of library services - from offering advice and consultation to enabling various forms of scholarly activities. As enablers, librarians are providers of spaces, infrastructure, resources, tools, etc. that facilitate preservation and dissemination of knowledge. In this case it's not about their skills per se; it's about finding the right partners and promoting the right cause.

Dec 5, 2012

Max Weber on ethical neutrality in the social sciences

Finally finished a piece by Max Weber "The meaning of 'ethical neutrality' in sociology and economics" (Methodology of social sciences, 2011, google books link).

In this piece Weber asks whether the social sciences can be ethically neutral and what it means in terms of their research questions and methods. He addresses this issue by distinguishing between value-judgments and factual assertions. Value-judgments are evaluations of phenomena that can be satisfactory or unsatisfactory (positive or negative). They are derived from ethical principles or cultural ideals, which are subjective and therefore cannot be discussed scientifically. Factual statements are logically deducible and empirically observable. While the distinction between empirical statements and value-judgments is difficult to make, it is important according to Weber to keep making this distinction to maintain rigor in the social sciences. Avoiding taking a moral stand as part of one's research is what makes the social sciences science.

Weber's position is that education (and lectures as its ultimate manifestation in his times) should not be based on value-judgments. Students attend educational institutions to cultivate their capacities for observation and reasoning and to acquire a certain body of factual information. Evaluations, which cannot be contested in a lecture hall, should take place somewhere else. However, Weber writes that university decision-makers can decide which path to choose: to include value-judgments in education or not. It depends on whether they believe that education is about molding human beings and developing their political, ethical, and cultural attitudes, or whether it should focus on specialized training.

The methodological question in empirical sciences is not how to avoid value-judgments, but how to distinguish between them and empirical propositions and use both accordingly. Science can ask questions about things which convention makes self-evident. Evaluations (value-judgments) often seem self-evident. They can be examined by empirical sciences with respect to the conditions of their emergence and existence. This leads to an “understanding”, i.e., a greater awareness of the issues and reasons for persistence of norms and opinions as well as conflicts. Empirical sciences can help to understand the means, the repercussions, and the conditioned competition of various evaluations, but choices between means, consequences and ultimately evaluations are matters of choice and compromise.

There is no (rational or empirical) scientific procedure of any kind whatsoever which can provide us with a decision here. The social sciences, which are strictly empirical sciences, are the least fitted to presume to save the individual the difficulty of making a choice, and they should therefore not create the impression that they can do so. - p. 19

One of the tasks of an empirically neutral social science is to analyze standpoints, reduce them to rational, internally consistent forms, and investigate the pre-conditions of their existence and their implications. This can be done by using theoretical constructs, ideal types, which are pure fiction and should be used as such. Ideal types, or rationally correct and consistent Utopian constructions of patterns or behaviors, are useful for comparison with empirical reality in order to establish its divergences or similarities and to understand or explain them causally. Ideal types should not be used for establishing moral imperatives.

In theory, Weber's approach makes sense, especially when he talks about the danger of presenting value-judgments as factual statements and turning them into imperatives. It's obvious that mixing evaluations with facts makes for bad science. But what happens when we make a conscious choice to remain ethically neutral when studying sensitive issues or vulnerable populations? Also, if we become aware of the means and repercussions of evaluations, why doesn't it help us to make better choices?