Showing posts with label data stories. Show all posts

Sep 16, 2016

Data for humanitarian purposes

Unni Karunakara, a former president of "Doctors without borders", gave a talk at International Data Week 2016 on September 13 about the role of data in humanitarian organizations. The talk was powerful in its simplicity and in its urgent call for better data, better data management, and better dissemination. It was a story of human suffering, but also a story of care and integrity in using data to alleviate it.

Humanitarian action can be defined as moral activity grounded in the ethics of assistance to those in need. Four principles guide humanitarian action:
  • humanity (respect for the human)
  • impartiality (provide assistance because of a person's need, not politics or religion)
  • neutrality (tell the truth regardless of interests)
  • independence (work independently from governments, businesses, or other agencies)
These principles affect how to collect and use data and how to ensure that data helps. Data collected for humanitarian action is evidence that can be used for direct medical action and for bearing witness, which is a very important activity of humanitarian organizations:  
“We are not sure that words can always save lives, but we know that silence can certainly kill." (quoted from another MSF president)
Awareness of the serious consequences of data for humanitarian action makes "Doctors without borders" work only with data they collect themselves and use only stories they witnessed firsthand. Restraint and integrity in data collection are crucial to maintaining the credibility of the organization.

Lack of data or lack of mechanisms to deliver necessary data hurts people. For example, during the Ebola outbreak it took the World Health Organization about 8 months to declare an emergency, and 3000 people died because data was not available in time or in the right form. The Infectious Diseases Data Observatory (IDDO) was created to help with tracking and researching infectious diseases by sharing data, but many ethical, legal, and other issues still need to be solved.

Humanitarian organizations often do not have trustworthy data available, either because of competing definitions or because of a lack of data collection systems. For example, because of differences in defining "civilian casualty", estimates of the number of civilians killed in drone strikes range from a hundred to thousands. Or, in developing countries or conflict zones where census activities are absent or dangerous, counting graves or tents becomes a proxy for mortality, mobility rates, and other important indicators. Crude estimates are then the only available evidence.

"Doctors without borders" (MSF) does a lot to share and disseminate its information. It has an open data / access policy and aspires to share data, while placing high value on security and well-being of people it helps.

Apr 18, 2016

Dataset on Parkinson's disease

In March 2016 Sage Bionetworks released a dataset that captures the everyday experiences of over 9,500 people with Parkinson's disease (press release). The data described in the data paper "The mPower study, Parkinson disease mobile data collected using ResearchKit" was collected via the mPower iPhone app, where participants were presented with tasks (referred to as ‘memory’, ‘tapping’, ‘voice’, and ‘walking’ activities) and asked to fill out surveys.

Not everybody agreed to share their data broadly with the research community. Out of 14,684 verified participants, 9,520 (65%) agreed to share broadly; the rest were split between withdrawing from the study and agreeing to share narrowly with the team only:

Study cohort description
Figure 1: mPower study cohort description. From
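The sharing rate quoted above is easy to check. Below is a minimal Python sketch that uses only the two counts given in this post; the variable names are made up for illustration, and the split of the remaining participants between withdrawing and narrow sharing is not computed because the post does not give those numbers.

```python
# Counts taken from the post; variable names are illustrative only.
verified_participants = 14_684
share_broadly = 9_520

# Fraction of verified participants who agreed to broad sharing.
broad_share = share_broadly / verified_participants

# Everyone else either withdrew or agreed to narrow sharing (combined count).
remainder = verified_participants - share_broadly

print(f"Agreed to share broadly: {broad_share:.1%}")
print(f"Withdrew or shared narrowly (combined): {remainder}")
```

The unrounded rate is slightly under 65%, which matches the rounded figure quoted in the text.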

To provide proper safeguards and to balance sharing and privacy, the research team established a data governance structure. Access is granted to qualified researchers who agree to specific conditions for use, including the following:

  • participants cannot be re-identified
  • the data may not be redistributed
  • findings need to be published in open access venues
  • both participants and research team need to be acknowledged as data contributors
This effort is another example of the newly forming data sharing culture. It uses Synapse, which seems to make sharing easier from both technical and policy perspectives.

Jan 11, 2016

Pantheon 1.0: A manually verified dataset of globally famous biographies

Scientific Data has published a description of an interesting dataset: "Pantheon 1.0, a manually verified dataset of globally famous biographies". This data collection effort contributes quantitative data for studying historical information, especially information about famous people and events.

Data collection workflow
Workflow diagram (Image from the paper)
The authors retrieved over 2 million records about famous ("globally known") individuals from Google's Freebase, narrowed down the dataset to individuals who have metadata in English Wikipedia, and then reduced it further to people who have records in more than 25 different languages in Wikipedia.

Manual cleaning and verification included a controlled vocabulary for occupations and popularity metrics (defined as a number of Wikipedia edits adjusted by age and pageviews).

The dataset is available for download at Harvard Dataverse. Another entertaining part is a visualization interface that allows one to explore the data and answer questions like "Where were globally known individuals in Math born?" (21% in France) or "Who are the globally known people born within present day by country?". It turns out that Russia produced a lot of politicians and writers, while the US gave us many actors, singers, and musicians.

Globally known people born in the US (from

Oct 19, 2015

Making progress in data sharing

The blog post "Data Sharing: Access, Language, and Context All Matter" offers a few useful tips on making progress in data sharing:

  • To make the global data system less fragmented and disorganized, create data portals with good human-centered designs and support users with varying levels of expertise

  • JSON and XML are great, but humans read data too. These formats are critical to fueling innovation, but make sure CSVs are available as well

  • Responsible data use demands proper attention to metadata. Document datasets, and don't ignore ReadMe files when re-using them

Sep 17, 2015

Valuable lessons from sharing and non-sharing of data

A vivid story from Buzzfeed, "Scientists Are Hoarding Data And It’s Ruining Medical Research", describes two related cases. In one, researchers voluntarily shared their entire dataset, and the re-analysis found errors and miscalculations. In the other, neither the data nor any results from one of the largest drug trials were released for 7 years because the researchers feared criticism and kept double-checking their data. The details of each case are worth following up, but the author of the story comes to an important conclusion: we need to accept that science works through checks and corrections, stop unfair criticism of and doubts about researchers' credibility, and start sharing data for better science, better knowledge, and, ultimately, better informed decisions that impact our lives:

And here is where I think the threads come together. The press releases on the reanalysis of the Miguel and Kremer deworming trial in Kenya will go live this week. Somewhere, I’m sure, people will attack or mock them for their errors. One way or another, I can’t believe they won’t feel bruised by the reanalysis. And that is where we have gone wrong. It’s not just naive to expect that all research will be perfectly free from errors, it’s actively harmful.

There is a replication crisis throughout research. When subjected to independent scrutiny, results routinely fail to stand up. We are starting to accept that there will always be glitches and flaws. Slowly, as a consequence, the culture of science is shifting beneath everyone’s feet to recognise this reality, work with it, and create structural changes or funding models to improve it.


Sep 15, 2015

Dataset: Roads and cities of 18th century France

An interesting dataset has been described in the Scientific Data journal and shared via the Harvard Dataverse repository - "Roads and cities of 18th century France".
The database presented here represents the road network at the french national level described in the historical map of Cassini in the 18th century. The digitization of this historical map is based on a collaborative methodology that we describe in detail. This dataset can be used for a variety of interdisciplinary studies, covering multiple spatial resolutions and ranging from history, geography, urban economics to network science.

The repository page showed 268 downloads on Sept 15, 2015, so hopefully, some examples of data re-use will follow this publication.

Aug 31, 2015

Lessons from replication of research in psychology

Science magazine has published an article “Estimating the reproducibility of psychological science”, which reports the first findings from 100 replications completed by 270 contributing authors. A quasi-random sample was drawn from three psychology journals: Psychological Science (PSCI), Journal of Personality and Social Psychology (JPSP), and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC). The replications were performed by teams and then independently reviewed by other researchers and reproduced by another analyst. The reproducibility was evaluated using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analyses of effect sizes. Some highlights from the results:

  • 35 of the replications showed a statistically significant effect (p < 0.05), compared with 97 of the original studies

  • 82 studies showed a stronger effect size in the original study than in the replication

  • Effect size comparisons showed a 47.4% replication success rate

  • 39 studies were subjectively rated as successfully replicated
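To put these highlights on a common scale, here is a small Python sketch that turns the counts into rates. It assumes (this is my reading, not something stated in the list above) that the 35 significant replications are measured against the 97 originally significant studies, and that the 39 subjective successes are measured against all 100 replications; the names are illustrative.

```python
# Counts quoted in the highlights above; the denominators are assumptions.
significant_originals = 97       # original studies with p < 0.05
significant_replications = 35    # replications with p < 0.05
total_replications = 100         # replications completed
subjective_successes = 39        # rated as successfully replicated

# Share of originally significant findings that were significant again.
significance_rate = significant_replications / significant_originals

# Share of replications that the teams subjectively rated as successful.
subjective_rate = subjective_successes / total_replications

print(f"Significance criterion: {significance_rate:.1%}")
print(f"Subjective assessment:  {subjective_rate:.1%}")
```

Under these assumptions both rates fall below the 47.4% reported for the effect size comparison, which illustrates how much the "success rate" depends on the criterion used.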

While some news about this publication reported failures in the test (e.g., Nature’s "Over half of psychology studies fail reproducibility test"), the Science article emphasized the challenges of reproducibility itself and the care with which interpretations of successes and failures need to be made. The authors of the study pointed out that while replications produced weaker evidence for the original findings,
“It is too easy to conclude that successful replication means that the theoretical understanding of the original finding is correct. Direct replication mainly provides evidence for the reliability of a result. If there are alternative explanations for the original finding, those alternatives could likewise account for the replication. Understanding is achieved through multiple, diverse investigations that provide converging support for a theoretical interpretation and rule out alternative explanations.

It is also too easy to conclude that a failure to replicate a result means that the original evidence was a false positive. Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect. We conducted replications designed to minimize a priori reasons to expect a different result by using original materials, engaging original authors for review of the designs, and conducting internal reviews. Nonetheless, unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes.”

Aug 10, 2015

Losing data from the National Centre for e-Social Science (NCeSS) portal

Submitted by Andy Turner

Edited by Inna Kouper

The National Centre for e-Social Science (NCeSS) was a UK-based program established around 2004 to stimulate the development of digital tools and services for social scientists. Around 2008 it adopted Sakai as a system for communicating, developing information, and storing and managing access to data. NCeSS was configured with a “hub” at the University of Manchester and a network of research nodes across the UK (see the Digital Social Research page for the list of nodes, many now archived).

Andy Turner, a researcher from the University of Leeds, worked on a project to develop demographic models for a geographical simulation system. The project, abbreviated as MoSeS (Modelling and Simulation for e-Social Science), was one of the first-phase research nodes of NCeSS. Some information is available on Andy’s page, but many links from there are now unavailable. Andy explains why:

“In 2011 the NCeSS Sakai Portal went off-line following a server failure and because there were no more resources for replacing the server. All the data was stored in a database on a National Grid Service server which for some reason had a catastrophic failure. All that remained for me to salvage were some backup database dumps, which fortunately also contained the portal front end configuration which enabled me with the help of my local IT team to get a database reader set up and a version of the NCeSS Sakai Portal working almost, but not quite as it had been. This was good enough to get some data out, but my local IT were not willing to make the system accessible again for security reasons. As a consequence of the problems some detailed social simulation model run results were lost. These would take a lot of time and effort to reproduce as they were generated on a fairly massive computer, which we got access to thanks to the UK-CERN collaboration GridPP and my collaborator Tom Doherty from the University of Glasgow. The work with Tom was undertaken as part of the Jisc funded project NeISS (a project to establish a National e-Infrastructure for Social Simulation), which was by the time of the server failure supporting the NCeSS Sakai Portal.

In theory, sufficient metadata has been stored from the simulation runs so that the results can be readily produced, but this is unlikely to transpire as the results were really only academically interesting as their inherent uncertainties were too great to make them of practical use. Anyway, I have given up on all that for now. I have moved on, but at the time it was rather painful seeing what probably amounted to almost three years of my effort turn to nothing. I may still get something more out of it in the long run because of the learning involved in this process. Explaining what happened to my academic superiors who desperately wanted research outputs was hard. One day I may return to research that pushes the boundaries of what we can and can’t do, but I know that is risky as failure is not tolerated well in academia."

Reflecting on the importance of preservation and curation of data, Andy writes:

“Preservation and curation are not easy. Sustaining research effort that may one day generate useful data and software is also not easy, especially when the goal is aspirational and probably quite a long way off and the steps are necessarily baby steps to begin with. In NCeSS, issues of sustainability were discussed from early on for each NCeSS research project and for the organisation itself. Documentation about this from 2008 was stored on the portal and so is now also inaccessible…

The soft learning experiences of failure and how these relate to sustainability and the importance of promoting collaboration, re-use and enrichment in the research process are key, but where can these be written up in the academic literature? The blog might seem ephemeral, but these days they can be captured by a directed Internet Archive WayBack Machine and preserved for the future.”

Our data stories blog is one of the places where such discussions can be recorded, and while we don’t have a solid sustainability plan, we do keep external backup copies of the stories. If you have stories similar to Andy’s to share, use our form, send them to us directly, or register on the website and become a contributor.

Jul 20, 2015

Study: Biomedical data sharing and reuse

A recent publication in PLoS ONE, "Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff", surveyed staff in the Intramural Research Program at the US National Institutes of Health (NIH) with regard to data management, data sharing, and data re-use. The authors received 190 responses and analyzed 135 (scientific and clinical staff). Below are the highlights from their findings:

  • ~60% of respondents rated relevance of data re-use as high, while ~15% rated it as low

  • ~25% rated their expertise in re-using data as high, while ~45% rated it as low

  • ~61% reported that they had never uploaded a dataset into a repository, while ~71% said they had shared data directly with another researcher

  • ~30% indicated that it took them more than 10 hours to prepare data for sharing

  • Only 20 respondents provided reasons for not sharing data and their reasons were pretty scattered (see image below):

Image from the study (t016), responses to "...the reason(s) for not sharing your data"

The data from this study is available on Figshare, but it is not the full survey dataset; it is a subset that supports the results of this publication. For some reason it doesn't contain free text responses to the non-sharing question. It's always informative to see what people say beyond the provided standard categories (that is usually the most interesting story in my mind). Perhaps there were no free text responses.

Citation: Federer LM, Lu Y-L, Joubert DJ, Welsh J, Brandys B (2015) Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS ONE 10(6): e0129506. doi:10.1371/journal.pone.0129506

Jul 15, 2015

Archaeology meets modern scanning technology for preservation and re-use

Submitted by Annemiek van der Kuil, edited by Inna Kouper

Image from "The strange case of 60 frothy beads: puzzling Early Iron Age glass beads from the Netherlands" conference paper by D.J. Huisman et al.
Dr. Dominique Ngan-Tillard, a professor at the Faculty of Civil Engineering and Geosciences at Delft University of Technology, the Netherlands, has deposited a dataset into the 3TU.Datacentrum repository that contains tomography scans of early Iron Age glass beads found during the archaeological excavations in the Netherlands.

The dataset supports a conference publication by Dr. Ngan-Tillard and others “The strange case of 60 frothy beads: puzzling Early Iron Age glass beads from the Netherlands”. The micro-CT scans helped to identify gas bubbles and mineral and metal inclusions in the glass beads, which allowed the researchers to conclude that “the Zutphen glass beads are the result of local, inexpert, reworking of imported glass objects” (p. 231, conference paper).

In addition to the in-depth analysis of the beads’ structure, the scans serve as a form of virtual preservation of the ornaments. Stored in a data repository and made publicly available, they can help other archaeologists, as well as material scientists and museums in their research and educational activities. In the future 3D prints of the ornaments can be produced for a better understanding of the art of making glass and jewels.

According to Dr. Ngan-Tillard’s comment on the 3TU.Datacentrum website, storing digital collections of archaeological remains together with their metadata and interpretation will help advance both arts and research and create more challenges for our knowledge.

Watch a short video about frothy beads or see the full story at

Jul 10, 2015

Withholding data - questionable science or scientific misconduct

Nicole Janz writes in the LSE Impact of Social Sciences blog that not sharing one’s research data should be considered scientific misconduct. Treating it this way would help to fight data secrecy and establish better research practices. A few key points from the post:

  • Many researchers don’t share data even if they promise to do so - see, for example, Krawczyk and Reuben’s 2012 study “(Un)available upon request: field experiment on researchers' willingness to share supplementary materials” [see also “How and why researchers share data (and why they don’t)”]

  • Definitions of scientific misconduct usually include fabrication, falsification, or plagiarism. Sharing research data provides evidence that there was no fabrication or falsification involved, hence it’s crucial in avoiding misconduct allegations and demonstrating proper conduct.

  • A broader definition of scientific misconduct includes departure from the accepted standards and practices of a research community. As many research communities strive to be open with regard to the evaluation of their knowledge claims, obligations to share data can be seen as part of the research standards and practices. Hence, data secrecy can be considered a questionable research practice or misconduct.

The continuum of research practices described by Janz ranges from the gold standards of open data, open code, pre-registration and version control to questionable research practices of p-hacking, sloppy statistical methods and other manipulations to withholding data to misconduct with its fabrication, falsification, and plagiarism.

From Janz' LSE Impact of Social Sciences blog post: Research practices continuum

Jul 7, 2015

Climategate study - simpler methods into the mix

Joe Denier gets a new "climategate" hat
Image from a HuffPo article by Shan Wells

Climategate was a controversy that unfolded in November 2009 after thousands of emails and files from the Climatic Research Unit (CRU) at the University of East Anglia (UEA) were published online without the owners' consent. Climate change opponents used the content of the emails to argue that scientists manipulated data to prove their argument for human responsibility for climate change. Several investigations didn't find any scientific misconduct at the CRU, but the reports called for opening up access to research data and more transparency in methods and communication of results (see Climatic Research Unit email controversy in Wikipedia).

Controversies are always hard to sort through, but they present an interesting research case for those like me who are interested in discourse, language, and media. A recent study "The creation of the climategate hype in blogs and newspapers: mixed methods approach" (paywalled) looked at the Climategate controversy and compared discussions in blogs and newspapers.

Newspaper and blog data were collected from the LexisNexis Academic database using the search term ‘climategate’. Two methods were used to analyze the data: a) ARIMA (Auto Regressive Integrated Moving Average) modeling to create a model of the daily frequencies of postings and to examine the mutual influence of newspapers and blogs and b) semantic co-word maps of blogs and newspaper headlines to compare framings of climategate.

The results of the modeling seemed a bit confusing, as they showed a significant link between a high number of blog posts and a high change in newspaper articles (either increase or decrease) on the same day. (I'd really like to see simple descriptive statistics of posts per day, etc. Also, a pre-print where all the images and tables are at the end of the article is very hard to read.) At the same time, an increase in newspaper articles on one day had no effect on the number of blog postings on the next day. The conclusion of the article is that blogs influenced newspapers, but not the other way around. The semantic maps showed (predictably) that the blogs used more informal language and framed the topics more negatively, while the newspapers were more formal and stayed more neutral. Both blogs and newspapers picked up similar sub-topics, such as climate change, scientists, and so on, although the word "climategate" occurred more in blogs.
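As an illustration of the kind of simple descriptive statistics I have in mind, here is a minimal Python sketch. It is not the ARIMA analysis from the paper: it only computes posts per day and a plain Pearson correlation between two daily series at a chosen lag. The daily counts are made up for the example, and the function names are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] (lag in days, lag >= 0)."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


# Hypothetical daily counts, invented purely for illustration.
blog_posts = [2, 5, 9, 14, 8, 6, 4, 3]
newspaper_articles = [1, 1, 4, 7, 12, 7, 5, 3]

print("Mean blog posts per day:", sum(blog_posts) / len(blog_posts))
print("Mean newspaper articles per day:",
      sum(newspaper_articles) / len(newspaper_articles))
print("Same-day correlation:",
      round(lagged_correlation(blog_posts, newspaper_articles, 0), 2))
print("Blogs leading newspapers by one day:",
      round(lagged_correlation(blog_posts, newspaper_articles, 1), 2))
print("Newspapers leading blogs by one day:",
      round(lagged_correlation(newspaper_articles, blog_posts, 1), 2))
```

A table like this says nothing about causality, but it would make it much easier for a reader to see which medium tends to move first before turning to the time series models.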

Several thoughts and questions after reading this interesting study, which is a bit too methodologically complicated for such simple variables and questions:

  • How different are "traditional" and "new" media nowadays? They may be still different in their language style, but what about the speed of publication, audiences, contributors, and so on? The headlines don't get to the differences in main posts and comments either.
  • The word "climategate" did originate in a blog, but it was a journalist who picked it up and popularized it via a newspaper-hosted blog (see Climategate: how the 'greatest scientific scandal of our generation' got its name). Does it change the conclusion that "blogs were independent of the attention in newspapers" (p. 20) if journalists write for both media?
  • It would've been helpful to establish the actual sequence of events via an additional documentary analysis. The paper argues that the word "climategate" originated in blogs, which promoted the hype. But according to the Wikipedia article, news about the email release was published almost simultaneously in blogs and newspapers - on November 20, 2009. So is the hype about the word or about other, more nuanced exchanges and actions as well?
  • Three blogs received links to leaked documents. It seems that it was intentional - the blogs were skeptical of climate change. Did it matter for how the hype originated and developed? Again, what is the connection between the actual controversy and its naming as climategate?
  • How can the link between the large number of blog posts and the decrease in newspaper articles be explained? More quotes and examples of interactions and influences between blogs and newspapers could be very helpful in illustrating all the findings.

Overall, it seems that the studies of controversies benefit from careful tracings of words and actor connections rather than from complicated modeling that is rather confusing and not so eye-opening.

The International Polar Year 2007-2008 (IPY-4) and the importance of data management

The International Polar Year is an international collaboration that focuses on the Arctic and the Antarctic, or polar regions. The polar regions have many unique phenomena, but the cold harsh environment makes them expensive to visit and study. It takes a large multi-country collaborative effort to put together expeditions, install equipment, and collect data. The first three IPYs occurred in 1882–1883, 1932–1933, and 1957–1958 respectively. The fourth IPY took place between March 2007 and March 2009.

The fourth IPY was dramatically different from the previous efforts (Mokrane and Parsons, 2014). A $1.2 billion effort with participants from more than 60 countries, it had an ambitious vision to enable international sharing and reuse of multidisciplinary datasets and keep the data discoverable, open, linked, useful, and safe (Parsons, Godøy et al., 2011). The enormous efforts to initiate, coordinate, improve, and sustain IPY data stewardship have seen both successes and failures, with some components of the IPY infrastructure struggling to exist and be useful (Lessons and legacies..., 2012).

A fair amount of IPY data is available via online portals such as the IPY data page at the National Snow and Ice Data Center (NSIDC) in the US, the NASA Global Change Master Directory (GCMD) IPY portal, or the Global Cryosphere Watch portal. Some portals, such as the global IPY Data and Information System (IPYDIS) or the Discovery, Access, and Delivery of Data for IPY (DADDI), are broken. Most importantly though, there is no way to track and access all the IPY data via a federated or centralized catalog. There is no good, consistent way for international polar data to “function locally and reach globally”, to use Mokrane and Parsons’ words.

The challenges of making heterogeneous data and metadata work together were exacerbated by the lack of focused international funding for planning data archiving post-IPY, differences in data policies, and researchers “hoarding” data (Lessons and legacies..., 2012; Carlson, 2011). Despite many IPY projects adopting a free and open data-sharing policy, compliance with it and, ultimately, sharing were rather low. Additionally, the researchers in IPY-4 didn’t have access to data from the first IPY projects: some of the data were not available in digital form, while others were scattered or lost. The data centers (WDCs) that were supposed to support the increasing IPY data streams lacked mechanisms for working with heterogeneous data, e.g., they couldn’t support social and ecological data.

Despite the difficulties, the IPY data management experience is crucial to the advancement of global data services and the norms of data sharing and re-use. As Mark Parsons, Secretary General of the Research Data Alliance and former Senior Associate Scientist and Lead Project Manager at NSIDC, put it,
“We were perhaps rather naive going in to IPY. Many of the organizers came from the geoscience background of earlier the IPYs and assumed data systems would exist that could handle IPY data. We weren’t prepared for the incredible diversity of IPY4 with data ranging from Indigenous knowledge to satellite remote sensing to genomic sequencing to cosmology. Although it is unclear what percentage of IPY data are available and much is surely lost, new data services were created and sustained, international coordination continues in sustained organisations, and we learned a lot about different disciplinary cultures and their attitudes to data sharing. The IPY Data Policy was aggressive and not fully honored, but it did drive changes in national policies towards more timely and open release of data. Most critically we saw a change in the conversation within polar science from whether to share to when to share and now how to share. We have a long way to go, but polar data are significantly more accessible than they were prior to IPY.”

Mark’s and others’ publications, some of which are listed below, are a good source of all the lessons learned from IPY data stewardship efforts, one important lesson being that “[e]xperts in data management are critical members of any team attempting internationally coordinated science ...” (Lessons and legacies..., 2012).


Carlson D. 2011. A lesson in sharing. Nature 469 (293).

Lessons and legacies of the International Polar Year 2007-2008. 2012.

Mokrane M and MA Parsons. 2014. Learning from the international polar year to build the future of polar data management. Data Science Journal 13.

Parsons MA, Ø Godøy, E LeDrew, TF de Bruin, B Danis, S Tomlinson, and D Carlson. 2011. A conceptual framework for managing very diverse data for complex interdisciplinary science. Journal of Information Science 37 (6): 555-569.

Parsons MA, T de Bruin, S Tomlinson, H Campbell, Ø Godøy, J LeClert, and IPY Data Policy and Management SubCommittee. 2011. The state of polar data—the IPY experience. In Understanding Earth’s Polar Challenges: International Polar Year 2007-2008. Ed. Krupnik I, I Allison, R Bell, P Cutler, D Hik, J López-Martínez, V Rachold, E Sarukhanian, and C Summerhayes. Edmonton, Canada: CCI Press.

Jul 5, 2015

Don't be the next data disaster!

Amy Hodge, Science Data Librarian at Stanford University Libraries, brings us a story of data loss from the 1970s.

Described as a "gut-wrenching tale", this is the story of a fire that destroyed records made by Dr. Srinivas during a year-long visit to India over four decades earlier.

It is a salutary reminder that if efforts are not taken to protect data, years of research could be ruined, whether by arsonists or a hurricane, or a dropped laptop or spilt cup of coffee. Eighteen years of work went into processing the field records lost by Dr. Srinivas - all three copies were held in the same building where the fire was started.

Source: Case study: Data storage and back up. Stanford University Libraries

Jun 29, 2015

Increasing sharing, expanding user base, and estimating impact of your research data using service tools and social media - a use case study

It has become increasingly important to communicate and share your research with users and to estimate the potential impact of your research data, both during and after its development. Dr. Ge Peng, a research scholar at the Cooperative Institute for Climate and Satellites in North Carolina (CICS-NC), and Thomas Maycock, Science Public Information Officer, share their thoughts on different ways of sharing your data, expanding your user base, and broadening the impact of your research. They compare several communication platforms and service tools in this set of slides, outlining some advantages and disadvantages based on their own experiences.

Below, they describe how tools and services have been used with one of the main research products: the scientific data stewardship maturity assessment model.

The main research product featured in this presentation is the scientific data stewardship maturity assessment model, in the form of a matrix, which was jointly developed by scientists and data managers from CICS-NC and NOAA’s National Centers for Environmental Information (NCEI). The matrix provides a unified framework for assessing stewardship practices applied to individual digital Earth Science datasets and is published in the Data Science Journal.

Prior to baselining and publishing the matrix in the peer-reviewed journal, the vehicle of was utilized to allow public viewing and, by invitation only, downloads of the beta versions of the matrix along with a set of background slides. The slides were used in communicating, either directly or via e-mail lists, to people at data-management-oriented conferences, groups, and organizations. This communication has proven to be very beneficial in improving the consistency of content in the matrix by obtaining feedback from a much wider pool of experts in the field.

The capability of creating shorter and meaningful URLs by was useful for customizing URL links for use in tweets, e-mails, and presentations.

The vehicle of was used to issue a persistent digital object identifier (DOI) in a timely fashion. This DOI is included in the matrix journal publication to provide users with sustained and trackable access to the latest version of the matrix.

Dr. Peng indicated that keeping both sets of slides (matrix and high-level background) at after the publication of the matrix allows her to continue to reach out to users in domains and countries beyond her original expectations. The analytics provided by provide a good indicator of potential interest and impact.

The web stories and social media posts are a coordinated effort by the communication teams of CICS-NC and NCEI. As indicated in slide 6, they bring noticeable traffic to the site. Again, the online presence of those web stories, tweets, and Facebook posts has a long-lasting effect.

Submitted by Ge Peng, Cooperative Institute for Climate and Satellites in North Carolina (CICS-NC)

Jun 8, 2015

Data recovery over the years

Muller Media Conversions, based in the US, shares a number of cases in which they took up the challenge of recovering data at risk of loss through decaying media or obsolescent formats, making acquaintance with some interesting data sets along the way!

A set of slides summarises the case studies, starting at page 8. The interventions include:

  • In 1992, political rivals of a contender for a presidential nomination were curious about his visit to Moscow in 1969. An FOIA request turned up an old State Department tape in an undocumented file format, and some hacking skills were required to reveal the content. (Page 8)

  • The National Archives and Records Administration set up an Archival Preservation System in 1992 as a records maintenance system, dealing with government's electronic mail and other data stored in digital formats. (Page 9)

  • In 1994, a law firm in Little Rock, Arkansas, needed to recover data related to legal work from the 1980s for a real estate project named "Castle Grande". The related records had mysteriously vanished, but were recovered from Wang 8 inch floppies. (Page 10)

  • A presidential appointments calendar and notes from the 1970s were stored in a proprietary database on a White House mainframe, with the tape in danger of decay and the data formats unknown. Similarities in data structure with Vietnam-era military records helped to unravel the data and store them in a new database. (Page 11)

  • Student records from the late 70s were held on Vydec floppy discs in the basement of a school. The school was the subject of litigation because oil wells on the school property were suspected of endangering the health of its pupils (the case was later made into a film). (Page 12)

  • With an international angle, tapes and discs with population data from all over the world are collected and published by the Minnesota Population Centre and  The case of staff at the Bangladesh Bureau of Statistics illustrates the worldwide need for data rescue. Daily power cuts were among the challenges staff faced in keeping legacy tapes stored properly. (Pages 13-14)

  • Dr. Siebert's collection on the Penobscot language included interviews with native speakers and a dictionary in a rare format. (Page 15)

Slides submitted by Chris Muller, Muller Media Conversions. To reuse any of these stories, please contact Chris Muller.

Jun 5, 2015

Data Sharing in Human Paleogenetics

A study published in PLOS ONE, “When Data Sharing Gets Close to 100%: What Human Paleogenetics Can Teach the Open Science Movement”, examined patterns of sharing data related to polymorphisms in ancient mitochondrial DNA and Y and X chromosomes. The authors focused on data that can be considered derivative, i.e., data derived from the processing of raw data obtained via such methods as DNA purification, polymerase chain reaction, and so on. They retrieved 162 PubMed papers on ancient human DNA containing a total of 207 datasets.

The authors classified types of sharing (in the text, files for download, supplementary materials, online database) and sent out a survey to the papers’ authors about their choices in sharing human DNA data.

Of the 207 datasets, 202 (97.6%) were fully available and reusable. The five datasets that were initially withheld were published later. At the same time, more than half of the datasets (57.7%) were shared in the body of the published article, rather than via a database or in separate files.

Most of the 33 researchers who responded to the survey acknowledged the importance of making their studies open to scientific inquiry. Many (97%) also agreed that data sharing should be a common practice in science. The authors of the PLOS paper conclude that a) awareness of the importance of openness in science may help achieve a high data sharing rate; b) the modality of sharing (i.e., how data is shared) plays an important role in sharing behavior; and c) openness to scientific scrutiny of data in human paleogenetics, coupled with the adoption of rigorous standards and cross-laboratory validation, has been crucial in establishing the field, its scientific rigor, and its data reliability.

May 27, 2015


Inna Kouper will be presenting a poster on the data stories blog at IASSIST 2015. If you are attending the event, pop by and have a chat. Would you be able to contribute a data story of your own?

In preparation for the poster, Inna carried out a quick analysis of the current entries, categorised by the data aspect tackled in each post. These can be seen in pie-chart form below:

[Figure: pie chart of data stories by category]

What stories would you find most helpful, and in which form? Let us know by leaving a comment or emailing data

May 11, 2015

Stolen laptop had PhD research

A 2008 story from a local newspaper based in Surrey, British Columbia, about a laptop stolen from a car at a shopping centre. The laptop held the only copy of research data collected for a PhD.

Submitted by Isabel Chadwick, Open University

May 7, 2015

Thesis writing: backing up

A humorous blog post from a PhD student about the importance of backing up data while writing your PhD, which the author calls "thesis insurance". The post includes an example of a near-miss data loss and its recovery (at cost), as well as an amusing cartoon from

Submitted by Isabel Chadwick, Open University