DIKW: Data, Information, Knowledge, Wisdom: data loss


Aug 10, 2015

Losing data from the National Centre for e-Social Science (NCeSS) portal

Submitted by Andy Turner

Edited by Inna Kouper

The National Centre for e-Social Science (NCeSS) was a UK-based program established around 2004 to stimulate the development of digital tools and services for social scientists. Around 2008 it adopted Sakai as a system for communicating, developing information, and storing and managing access to data. NCeSS was configured with a “hub” at the University of Manchester and a network of research nodes across the UK (see the Digital Social Research page for the list of nodes, many now archived).

Andy Turner, a researcher from the University of Leeds, worked on a project to develop demographic models for a geographical simulation system. The project, abbreviated as MoSeS (Modelling and Simulation for e-Social Science), was one of the first-phase research nodes of NCeSS. Some information is available on Andy’s page, but many links from there are now unavailable. Andy explains why:

“In 2011 the NCeSS Sakai Portal went off-line following a server failure and because there were no more resources for replacing the server. All the data was stored in a database on a National Grid Service server which for some reason had a catastrophic failure. All that remained for me to salvage were some backup database dumps, which fortunately also contained the portal front end configuration which enabled me with the help of my local IT team to get a database reader set up and a version of the NCeSS Sakai Portal working almost, but not quite as it had been. This was good enough to get some data out, but my local IT were not willing to make the system accessible again for security reasons. As a consequence of the problems some detailed social simulation model run results were lost. These would take a lot of time and effort to reproduce as they were generated on a fairly massive computer, which we got access to thanks to the UK-CERN collaboration GridPP and my collaborator Tom Doherty from the University of Glasgow. The work with Tom was undertaken as part of the Jisc funded project NeISS (a project to establish a National e-Infrastructure for Social Simulation), which was by the time of the server failure supporting the NCeSS Sakai Portal.
In theory, sufficient metadata has been stored from the simulation runs so that the results can be readily produced, but this is unlikely to transpire as the results were really only academically interesting as their inherent uncertainties were too great to make them of practical use. Anyway, I have given up on all that for now. I have moved on, but at the time it was rather painful seeing what probably amounted to almost three years of my effort turn to nothing. I may still get something more out of it in the long run because of the learning involved in this process. Explaining what happened to my academic superiors who desperately wanted research outputs was hard. One day I may return to research that pushes the boundaries of what we can and can’t do, but I know that is risky as failure is not tolerated well in academia.”

Reflecting on the importance of preservation and curation of data, Andy writes:

“Preservation and curation are not easy. Sustaining research effort that may one day generate useful data and software is also not easy, especially when the goal is aspirational and probably quite a long way off and the steps are necessarily baby steps to begin with. In NCeSS, issues of sustainability were discussed from early on for each NCeSS research project and for the organisation itself. Documentation about this from 2008 was stored on the portal and so is now also inaccessible…
The soft learning experiences of failure and how these relate to sustainability and the importance of promoting collaboration, re-use and enrichment in the research process are key, but where can these be written up in the academic literature? The blog might seem ephemeral, but these days they can be captured by a directed Internet Archive WayBack Machine and preserved for the future.”

Our data stories blog is one of the places where such discussions can be recorded, and while we don’t have a solid sustainability plan, we do keep external backup copies of the stories. If you have stories similar to Andy’s to share, use our form, send them directly to datastories@dcc.ac.uk, or register on the website and become a contributor.

Jul 5, 2015

Don't be the next data disaster!

Amy Hodge, science data librarian at Stanford University Libraries, brings us a story of data loss from the 1970s.

Described as a "gut-wrenching tale", this is the story of a fire that destroyed records made by Dr. Srinivas, related to his observations made during a year-long visit to India over four decades earlier.

It is a salutary reminder that if steps are not taken to protect data, years of research could be ruined, whether by arsonists, a hurricane, a dropped laptop or a spilt cup of coffee. Eighteen years of work went into processing the field records lost by Dr. Srinivas; all three copies were held in the same building where the fire was started.

Source: Case study: Data storage and back up. Stanford University Libraries http://library.stanford.edu/research/data-management-services/case-studies/case-study-data-storage-and-backup

Jun 8, 2015

Data recovery over the years

Muller Media Conversions, based in the US, shares a number of cases in which it took up the challenge of recovering data at risk of loss through decaying media or obsolescent formats, making acquaintance with some interesting data sets along the way!

A set of slides summarises the case studies, starting at page 8. The interventions include:

In 1992, political rivals of a contender for a presidential nomination were curious about his visit to Moscow in 1969. An FOIA request turned up some old State Department tape, which was in an undocumented file format and required some hacking skills to reveal the content. (Page 8)

The National Archives and Records Administration set up an Archival Preservation System in 1992 as a records maintenance system, dealing with the government's electronic mail and other data stored in digital formats. (Page 9)

In 1994 a law firm in Little Rock, Arkansas needed to recover data related to legal work from the 1980s for a real estate project named "Castle Grande". The related records had mysteriously vanished, but were recovered from Wang 8-inch floppies. (Page 10)

A presidential appointments calendar and notes from the 1970s were stored in a proprietary database on a White House mainframe; the tape was in danger of decay and the data formats were unknown. Similarities in data structure with Vietnam-era military records helped to unravel the data and store them in a new database. (Page 11)

Student records from the late 70s were held in the basement of a school on Vydec floppy discs. The school was the subject of litigation because oil wells on the school property were suspected of endangering the health of pupils (the case was later made into a film). (Page 12)

With an international angle, tapes and discs with population data from all over the world are collected and published by the Minnesota Population Centre and IPUMS.org. The case of staff at the Bangladesh Bureau of Statistics illustrates the world-wide need for data rescue. Daily power cuts were among the challenges faced by staff in keeping legacy tapes stored properly. (Pages 13-14)

Dr. Siebert's collection on the Penobscot language included interviews with native speakers and a dictionary in a rare format. (Page 15)

Slides submitted by Chris Muller, Muller Media Conversions. To reuse any of these stories, please contact Chris Muller at chris.muller@mullermedia.com

May 11, 2015

Stolen laptop had PhD research

A 2008 story from a local newspaper based in Surrey, British Columbia, about a laptop stolen from a car at a shopping centre. The laptop held the only copy of research data collected for a PhD.

Submitted by Isabel Chadwick, Open University

May 7, 2015

Thesis writing: backing up

A humorous blog post from a PhD student about the importance of backing up data while writing your PhD, which the author calls "thesis insurance". The post includes an example of a near-miss data loss and its recovery (at cost), as well as an amusing cartoon from www.phdcomics.com

Submitted by Isabel Chadwick, Open University

May 5, 2015

The mystery of a missing dataset

An interesting blog post from computer scientist and engineer David Rosenthal details his stepdaughter's quest to locate a dataset which she had downloaded in 2011, but which is now unavailable.

Although this was an important study in her field (sustainability and life cycle analysis), the original link to download the data is broken and there appear to be no archived versions of the dataset.

Submitted by Isabel Chadwick, Open University

Apr 6, 2015

Digital vellum and other projects to preserve digital information

Submitted by Isabel Chadwick, revised by Inna Kouper

Vint Cerf, the co-designer of the Internet architecture and Google's Vice President and Chief Internet Evangelist, spoke at the 2015 American Association for the Advancement of Science’s annual meeting in San Jose, CA, and warned the audience that we are faced with a forgotten generation or even a forgotten century because we don't have a regime that preserves digital information in a rational and systematic manner. Computer files of many kinds, relating to correspondence, entertainment, education, jobs and so on, are at risk of becoming unreadable.

He proposed a digital vellum, a system capable of preserving the meaning of digital objects over hundreds to thousands of years. Thinking about access to documents hundreds of years later is challenging, especially when every new technology risks being incompatible with older ones. Atari game cartridges, floppy disks, zip drives, and many other older technologies are now hard or impossible to access. One of the efforts to preserve software, games, and other executable content is the Olive Executable Archive, which creates virtual machines that simulate executable environments. Unfortunately, as the project website says, “for legal reasons, the VMs are currently accessible only to our research collaborators.”

See also:

Cerf, V. Digital Vellum and the Expansion of the Internet into the Solar System, video of a similar talk at Carnegie Mellon University

Cerf, V. Digital Vellum, abstract of the talk at AAAS-2015 annual meeting

Sample, I. Google boss warns of 'forgotten century' with email and photos at risk

Gibbs, S. What is 'bit rot' and is Vint Cerf right to be worried?

Mar 30, 2015

Dedoose crash and data loss

Dedoose is a web application that supports qualitative and mixed methods research that relies on text, images, audio, video, spreadsheets, and so on. It was developed at the University of California, Los Angeles (UCLA) with support from the William T. Grant Foundation. Web accessibility coupled with cloud storage and processing are among the key features of Dedoose and its “Anytime, Anywhere, Any Internet” motto. Researchers store data on Dedoose servers and can access it from anywhere and on any platform.

On May 6, 2014, the Dedoose platform crashed. The cascading system failure coincided with a full database encryption and backup and resulted in the corruption of the entire storage system. The team wrote on its blog that “data added to Dedoose up to mid-April will be recovered and restored. … we are not optimistic we will be able to recover data added to the system for roughly the 2 – 3 week window preceding the failure”. It is not clear how many researchers were affected or how much data was lost, but researchers did lose data stored on Dedoose. Some comments below from “Hazards of the Cloud: Data-Storage Service’s Crash Sets Back Researchers” illustrate the issue:

“... I lost about 20 hours of work, which isn't the end of the world, but hurts when you are trying to finish a PhD and work full time. The reason why people don't have back ups is because the back up isn't necessarily useful. The file that you work on in the program is essentially an annotated document (or audio/video file) that you select chunks as excerpts and then apply codes to, so that later you can analyze the corpus of documents for themes. The export from Dedoose is simply an excel file of the excerpts you've made, so it helps to have as a reference, but you wouldn't be able to work from it the way you can work from a word file that you've backed up.”

“... Many of us DID back up... however, I don't think you understand that backing up coded video and or audio files in Dedoose does not back up the project as you would view it within Dedoose (online)... only as a spreadsheet... You CAN, however, fully back up an NVIVO project or any file on your hard drive as an EXACT duplicate (not the case with Dedoose). ... I am completely dependent on them and their promise that they backup nightly and protect our data so well that we don't have to worry about it.”

“Allow me to add only that the fact that Dedoose apparently outputs only a spreadsheet evidences that these platforms, for all their bells and whistles, are databases. It is important, IMHO, that researchers become adapt at building their databases from the ground up, and only after doing so use any CAQDA. This doesn't (always) mean learning mySQL, Phyton, or other programing languages. It does mean knowing your way around Excell (or other spreadsheet app) and how to structure your data so that it can be moved into and out of platforms such as Dedoose.”

“Now I feel kind of empowered by my "keep the data in the hard drive, backup to the cloud, and once a semester, to an external hard drive" regimen.”

The crash raises some interesting questions about cloud vs local storage, backup possibilities and the responsibility of clients and vendors. How can we back up data in the cloud if some of the processing (visualizations, annotations, etc.) is not exportable? How many copies are enough? What does the client (user) need to check for before signing up for cloud services?
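For the part of the problem that concerns plain file copies, one modest habit is to check that each backup still matches the working file. The sketch below is only an illustration, not anything Dedoose or the commenters describe: the function names and the paths in the usage note are hypothetical, and it does nothing for processing that cannot be exported in the first place.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(original, copies):
    """Return (path, problem) pairs for copies that are missing or differ.

    An empty list means every listed copy exists and is byte-identical
    to the original.
    """
    expected = sha256_of(original)
    problems = []
    for copy in copies:
        copy = Path(copy)
        if not copy.is_file():
            problems.append((str(copy), "missing"))
        elif sha256_of(copy) != expected:
            problems.append((str(copy), "differs"))
    return problems
```

A call such as check_copies("thesis.docx", ["/mnt/external/thesis.docx", "/cloud/thesis.docx"]) (hypothetical paths) would list any copy that has gone missing or drifted from the working file. It says nothing about whether the copies sit in different buildings, which was the weak point in the Stanford fire story above.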

Mar 9, 2015

The Availability of Research Data Declines Rapidly with Article Age

The Availability of Research Data Declines Rapidly with Article Age - article from Current Biology (2014)

Highlights:

The availability of data from 516 studies between 2 and 22 years old was studied

The odds of a data set being reported as extant fell by 17% per year

Broken e-mails and obsolete storage devices were the main obstacles to data sharing

Policies mandating data archiving at publication are clearly needed
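To get a feel for what a 17% yearly fall in odds means, the arithmetic can be sketched as below. Only the 17% figure comes from the highlights; the starting odds are a hypothetical assumption chosen for illustration, not a value reported in the article. Note that the decline applies to the odds, p / (1 - p), and not directly to the probability p.

```python
# Illustrative arithmetic only: the 17% yearly decline in odds is taken from
# the highlights above; the starting odds used below are a hypothetical
# assumption, not a figure from the article.
YEARLY_ODDS_FACTOR = 1 - 0.17  # odds are multiplied by 0.83 each year


def odds_after(years, starting_odds):
    """Odds that a data set is still extant after `years` years."""
    return starting_odds * YEARLY_ODDS_FACTOR ** years


def odds_to_probability(odds):
    """Convert odds, p / (1 - p), back to a probability p."""
    return odds / (1 + odds)


if __name__ == "__main__":
    starting_odds = 4.0  # hypothetical: an 80% chance of availability at year 0
    for years in (0, 5, 10, 20):
        odds = odds_after(years, starting_odds)
        probability = odds_to_probability(odds)
        print(f"{years:2d} years: odds {odds:.2f}, probability {probability:.0%}")
```

Because the factor compounds, the odds fall to well under half their starting value within five years (0.83 to the fifth power is about 0.39), whatever starting odds are assumed.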

Submitted by Isabel Chadwick (Open University)

Source

Vines et al. The Availability of Research Data Declines Rapidly with Article Age. Current Biology, Volume 24, Issue 1, 6 January 2014, Pages 94–97. doi:10.1016/j.cub.2013.11.014

http://www.sciencedirect.com/science/article/pii/S0960982213014000