Jul 7, 2015

Climategate study - simpler methods into the mix

Joe Denier gets a new "climategate" hat
Image from a HuffPo article by Shan Wells

Climategate was a controversy that unfolded in November 2009 after thousands of emails and files from the Climatic Research Unit (CRU) at the University of East Anglia (UEA) were published online without the owners' consent. Climate change opponents used the content of the emails to argue that scientists had manipulated data to support their case for human responsibility for climate change. Several investigations found no scientific misconduct at the CRU, but the reports called for opening up access to research data and for more transparency in methods and communication of results (see Climatic Research Unit email controversy in Wikipedia).


Controversies are always hard to sort through, but they present an interesting research case for those like me who are interested in discourse, language, and media. A recent study "The creation of the climategate hype in blogs and newspapers: mixed methods approach" (paywalled) looked at the Climategate controversy and compared discussions in blogs and newspapers.

Newspaper and blog data were collected from the LexisNexis Academic database using the search term ‘climategate’. Two methods were used to analyze the data: a) ARIMA (Auto Regressive Integrated Moving Average) modeling to create a model of the daily frequencies of postings and to examine the mutual influence of newspapers and blogs and b) semantic co-word maps of blogs and newspaper headlines to compare framings of climategate.
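To make the co-word idea concrete, here is a minimal sketch of the counting step behind a semantic map: tally how often two words appear in the same headline. This is not the study's actual pipeline; the headlines, the stopword list, and the function names are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical headlines, invented for illustration only (not data from the study).
headlines = [
    "Climategate emails show scientists hid data",
    "Scientists defend climate data after email leak",
    "Climategate scandal: scientists under fire",
]

# A tiny, ad hoc stopword list (an assumption; a real analysis would use a fuller one).
STOPWORDS = {"a", "the", "of", "after", "under", "show"}


def tokenize(headline):
    """Lowercase a headline, strip punctuation, drop stopwords, return sorted unique words."""
    words = (w.strip(".,:;!?'\"").lower() for w in headline.split())
    return sorted({w for w in words if w and w not in STOPWORDS})


def coword_counts(texts):
    """Count the number of texts in which each unordered word pair co-occurs."""
    pairs = Counter()
    for text in texts:
        # Words are sorted, so each pair is always stored in alphabetical order.
        pairs.update(combinations(tokenize(text), 2))
    return pairs


counts = coword_counts(headlines)
for pair, n in counts.most_common(3):
    print(pair, n)
```

The pair counts are the edge weights of a co-word network; drawing the map is then a matter of handing those weighted pairs to a graph layout tool.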

The results of the modeling seemed a bit confusing, as they showed a significant link between a high number of blog posts and a large change in newspaper articles (either an increase or a decrease) on the same day. (I'd really like to see simple descriptive statistics of posts per day, etc. Also, a pre-print where all the images and tables are at the end of the article is very hard to read.) At the same time, an increase in newspaper articles on one day had no effect on the number of blog postings on the next day. The conclusion of the article is that blogs influenced newspapers, but not the other way around. The semantic maps showed (predictably) that the blogs used more informal language and framed the topics more negatively, while the newspapers were more formal and stayed more neutral. Both blogs and newspapers picked up similar sub-topics, such as climate change, scientists, and so on, although the word "climategate" occurred more often in blogs.
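The kind of simple descriptive check I have in mind could look like the sketch below: daily means plus a lagged correlation between the two series. It is a much cruder tool than ARIMA and says nothing about causation, and the daily counts are made up for illustration (the newspaper series is deliberately constructed to echo the blog series one day later).

```python
from math import sqrt

# Made-up daily counts for one week; NOT data from the study.
blog_posts = [2, 5, 9, 4, 3, 8, 6]
news_articles = [1, 2, 5, 9, 4, 3, 8]  # built to trail blog_posts by one day


def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lagged_correlation(leader, follower, lag):
    """Correlate leader on day t with follower on day t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


print("mean blog posts per day:", mean(blog_posts))
print("mean news articles per day:", mean(news_articles))
print("blogs -> news, next day:", lagged_correlation(blog_posts, news_articles, 1))
print("news -> blogs, next day:", lagged_correlation(news_articles, blog_posts, 1))
```

With these toy numbers the blogs-to-news correlation at a one-day lag is essentially perfect and the reverse is negative, which is the asymmetric pattern the paper's conclusion implies. A table like this next to the model output would make the results much easier to follow.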

This is an interesting study, although its methods are a bit too complicated for such simple variables and questions. Several thoughts and questions upon reading it:

  • How different are "traditional" and "new" media nowadays? They may still differ in their language style, but what about the speed of publication, audiences, contributors, and so on? The headlines don't get at the differences in main posts and comments either.
  • The word "climategate" did originate in a blog, but it was a journalist who picked it up and popularized it via a newspaper-hosted blog (see Climategate: how the 'greatest scientific scandal of our generation' got its name). Does it change the conclusion that "blogs were independent of the attention in newspapers" (p. 20) if journalists write for both media?
  • It would've been helpful to establish the actual sequence of events via an additional documentary analysis. The paper argues that the word "climategate" originated in blogs, which promoted the hype. But according to the Wikipedia article, news about the release of the emails was published almost simultaneously in blogs and newspapers, on November 20, 2009. So is the hype about the word, or about other, more nuanced exchanges and actions as well?
  • Three blogs received links to the leaked documents. It seems that this was intentional, as the blogs were skeptical of climate change. Did it matter for how the hype originated and developed? Again, what is the connection between the actual controversy and its naming as climategate?
  • How can the link between a large number of blog posts and a decrease in newspaper articles be explained? More quotes and examples of interactions and influences between blogs and newspapers could be very helpful in illustrating all the findings.

Overall, it seems that the studies of controversies benefit from careful tracings of words and actor connections rather than from complicated modeling that is rather confusing and not so eye-opening.

Feb 20, 2015

Repository features to motivate more data sharing

One of the challenges of creating data stewardship infrastructure is engaging the users and meeting and prioritizing their needs, particularly the needs of long-tail science research. "What would motivate researchers to make their data available?" is a question we continuously grapple with. A recent study, "Potential contributor perspectives on desirable characteristics of an online data environment for spatially referenced data", published in First Monday, asked a very similar question in the context of geographic data. The researchers hypothesized that potential contributors of small-scale, local spatial data would be more willing to share their data if a repository included a simple, clear licensing mechanism, a simple process for attaching descriptions to the data, and a simple post-publication peer evaluation/commenting mechanism.

The paper draws on 10 qualitative interviews and 110 responses to an online questionnaire. The qualitative interview responses were mixed; they don't seem to reveal any patterns or unusual concerns. Some of the quantitative results were also mixed, but some provide good numbers to support the hypotheses:

  • 90% of respondents said attribution (licensing) is important
    • 62% think that non-commercial attribution is important
    • 54% think that restricting re-use is important, i.e., others may use the data but not modify it in any way
  • 93% said ability to attach keywords or other descriptions to data is important
  • 78% said that commenting capability is important
  • 85% said that stability and long-term maintenance of the repositories matters

Conclusion:

This research, subject to the caveats listed below, suggests that it would be desirable from the perspective of potential contributors of data to provide infrastructure capability that would:

  • allow users to attach conditions to the use of their data;
  • provide basic information that could be translated into standards based metadata; and,
  • receive comments and feedback from users.

Feb 18, 2015

Research Data Alliance/US Call for Fellows

I'm a co-PI on a project that provides a great opportunity for early career researchers and professionals to engage with the Research Data Alliance, help improve data practices, and make data management and data sharing easier and more transparent. Below are the details from the call for fellows:
The Research Data Alliance (RDA) invites applications for its newly redesigned fellowship program. The program’s goal is to engage early career researchers in the US in Research Data Alliance (RDA), a dynamic and young global organization that seeks to eliminate the technical and social barriers to research data sharing.

The successful Fellow will engage in the RDA through a 12-18 month project under the guidance of a mentor from the RDA community. The project is carried out within the context of an RDA Working Group (WG), Interest Group (IG), or Coordination Group (i.e., Technical Advisory Board), and is expected to have mutual benefit to both Fellow and the group’s goals.

Fellows receive a stipend and travel support and must be currently employed or appointed at a US institution.

Fellows have a chance to work on real-world challenges of high importance to RDA, for instance:
  • Engage with social sciences experts to study the human and organizational barriers to technology sharing
  • Apply a WG product to a need in the Fellow’s discipline
  • Develop plan and disseminate RDA research data sharing practices
  • Develop and test adoption strategies
  • Study and recommend strategies to facilitate adoption of outputs from WGs into the broader RDA membership and other organizations
  • Engage with potential adopting organizations and study their practices and needs
  • Develop outreach materials to disseminate information about RDA and its products
  • Adapt and transfer outputs from WGs into the broader RDA membership and other organizations
The program involves one or two summer internships and travel to RDA plenaries during the duration of the fellowship (international and domestic travel). Fellows will receive a $5000 stipend for each summer of the fellowship. Fellows will be paired with a mentor from the RDA community.

Through the RDA Data Share program, fellows will participate in a cohort building orientation workshop offering training in RDA and data sciences. This workshop is held at the beginning of the fellowship. RDA Data Share program coordinators will work with Fellows and mentors to clarify roles and responsibilities at the start of the fellowship.

Criteria for selection: The Fellows engaging in the RDA Data Share program are sought from a variety of backgrounds: communications, social, natural and physical sciences, business, informatics, and computer science. The RDA Data Share program will look for a T-shaped skill set, where early signs of cross discipline competency are combined with evidence of teamwork and communication skills, and a deep competency in one discipline.

Additional criteria include: interest in and commitment to data sharing and open access; demonstrated ability to work in teams and within a limited time framework; and benefit to the applicant’s career trajectory.

Eligibility: Graduate students and postdoctoral researchers at institutions of higher education in the United States, and early career researchers at U.S.-based research institutions who graduated with a relevant master’s or PhD and are no more than three years beyond receipt of their degree. Applications from traditionally underserved populations are strongly encouraged to apply.

To apply: Interested candidates are invited to submit their resume/curriculum vitae and a 300-500 word statement that briefly describes their education, interests in data issues, and career goals to datashare-inquiry-l@list.indiana.edu. Candidates are encouraged to browse the RDA website https://rd-alliance.org/ and pages of interest and working groups to identify relevant topics and mutual interests.

Important dates:
April 16, 2015 – Fellowship applications are due
May 1, 2015 – Award notifications
June 18-19, 2015 – Fellowship begins with the orientation workshop in Bloomington, IN

RDA Data Share, funded by the Alfred P. Sloan Foundation under award G-2014-13746, engages students and early career researchers in the Research Data Alliance. This engagement builds on foundational infrastructure funded by the National Science Foundation grant # ACI-1349002.

Feb 11, 2015

Institutional analysis of data practices

A short summary of a paper published in JASIST recently: Mayernik, M. S. (2015), Research data and metadata curation as institutional issues. J Assn Inf Sci Tec. doi:10.1002/asi.23425.

The paper begins by noting a mismatch between the findings of two studies on data practices in climate science. One of them (a report commissioned by the UK Research Information Network, RIN) described the level of data sharing in climate science as low, and the other (the book by Edwards, "A vast machine...") argued that data sharing was a strong and common norm in climate science. Which one is true? Or could it be that both studies are correct and climate science includes both high and low levels of data sharing?

Data practices are institutionalized within a number of social systems, including formal organizations (such as universities and research centers), rules and sanctions (such as funding agency requirements and professional guidelines), and the norms of modern Western science. The case study analysis in this paper is therefore grounded in an institutional framework that has five characteristics: (a) norms and symbols, (b) intermediaries, (c) routines, (d) standards, and (e) material objects. Norms are largely associated with the norms of science (Merton and later work). Symbols are logos and other visible signs of collective identity, as well as terminological choices and metaphors. Intermediaries are individuals or collectives who connect resources and facilitate relationships. Routines are frequently repeated patterns of action and interaction, for example, meal or socializing routines. Standards are rules and specifications that define how information can be presented, organized, and transferred. Material objects are ... material objects.

The case studies are comparisons between data practices at the Center for Embedded Networked Sensing (CENS) and the Long Term Ecological Research (LTER) network, and between the University Corporation for Atmospheric Research (UCAR) and the National Center for Atmospheric Research (NCAR).

Although there are some interesting observations in these case studies, it seemed that the first, conceptual part of the paper was stronger than the second. The five characteristics of the institutional framework were applied rather narrowly, without revealing many interconnections and directionality. For example, the standards section focuses on metadata standards and their choice. Are there any other standards relevant to data practices? How does the choice of standards affect norms and what is the role of intermediaries in establishing routines and other aspects of data practices? Another much more important question is: Once we describe the variability of data practices within and across disciplines, what's next? What exactly is the role of each institutional carrier in data practices?

Jan 5, 2015

Data politics and the dark sides of data

A somewhat lengthy post about the dark sides of data discusses whether data, with its vast amounts and ubiquitous collection mechanisms, helps to "tell the truth to power", i.e., to change the world for the better. Will picking up the traces and revealing wrongdoing fix the world? Most likely not, because it's not clear whether we will care or do anything because of data. Here is a great quote:

... lawyers cannot fix human rights abuses, scientists cannot fix global warming, whistle-blowers cannot fix secret services and activists cannot fix politics, and nobody really knows how global finances work - regardless of the data they have at hand...

The framework of countering power and problems with data needs to be revised. It's not about the quantity or even the quality of data, but about using whatever little we know to address not only our understanding (i.e., our rational capacities), but also our feelings and beliefs. Here is what gets in the way (those dark sides that everyone should think about):

  • corporate infrastructure for data and its cultures that creates an illusion of free and neutral services (i.e., services that have no monetary and no political cost)
  • non-transparency of most digital data and lack of control over it, which prevents us from copying, deleting, or processing our own data
  • de-politicizing of digital data or constructing data as fuel for innovation and services rather than a ground for moral, ethical, and political decisions

The post is rather pessimistic, but changes do not happen at once, so we should probably keep trying.