Nov 16, 2012

Workflows for Digital Preservation and Curation

Notes from today's workshop on workflows for digital preservation and curation.

Curation means adding value to the data during its life cycle, i.e., making sure that it has meaning and context and can be re-used outside the creator's environment. Curation infrastructure includes repositories, access procedures, policies, processes and institutional support. The unit of curation and preservation is a file. It's important to maintain file integrity through fixity checks, duplicate storage and format validation.
To be preserved, files need to be in suitable formats, i.e., durable (transparent, documented, widely used, renderable) and supported by standards (syntactic and semantic). Syntactic standards carry no context (e.g., a CSV file doesn't tell us how its columns were created or what they mean). Semantic standards are better because they capture that meaning.
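
As a rough illustration of a fixity check, here is a minimal Python sketch. It is not taken from the workshop; the function names `file_checksum` and `verify_fixity` are my own, and it assumes a checksum was recorded when the file was first ingested. MD5 is the default only because it is the algorithm mentioned later in these notes; any algorithm supported by `hashlib`, such as SHA-256, can be passed instead.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="md5"):
    """Return True if the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

Running `verify_fixity` on a schedule against the stored checksums is one simple way to notice that a file has changed or been corrupted since ingest.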

In preservation a good practice is to use a master file for preservation (highest quality and fidelity) and derivative files for active use and delivery. For example, high-resolution TIFF for preservation and lossy JPEG for viewing.

To implement the curation and preservation practices mentioned above, the following activities are often part of the workflow: ongoing verification (file integrity and object integrity), metadata management, and management of obsolescence (hardware, software, formats, documentation).
Workflow systems simplify repetitive and mundane activities and facilitate best practices, coordination and outreach. Various systems that support scientific, research, and software development workflows exist, e.g., Kepler, Triana, Taverna, Ptolemy II and BPEL.
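
To make the idea of a workflow concrete, here is a small Python sketch of a single object passing through a chain of steps. It does not reflect the API of any of the systems named above; `CurationObject`, `record_fixity`, `record_size` and `run_workflow` are hypothetical names used only for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CurationObject:
    """A single file moving through the workflow (hypothetical structure)."""
    name: str
    content: bytes
    metadata: dict = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


# A step takes an object, does one activity, and returns the object.
Step = Callable[[CurationObject], CurationObject]


def record_fixity(obj: CurationObject) -> CurationObject:
    """Store an MD5 checksum of the content in the object's metadata."""
    obj.metadata["md5"] = hashlib.md5(obj.content).hexdigest()
    obj.log.append("fixity")
    return obj


def record_size(obj: CurationObject) -> CurationObject:
    """Store the content size in bytes as basic technical metadata."""
    obj.metadata["size_bytes"] = len(obj.content)
    obj.log.append("size")
    return obj


def run_workflow(obj: CurationObject, steps: List[Step]) -> CurationObject:
    """Apply each step in order; the log records which steps ran."""
    for step in steps:
        obj = step(obj)
    return obj
```

Calling `run_workflow(CurationObject("report.txt", b"..."), [record_fixity, record_size])` runs both steps in order. The point of the design is that each activity is a small, reusable unit, so a repetitive task becomes a list of steps rather than something done by hand.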

Trident is an open source software package from Microsoft with many components and functions. The Data to Insight Center has been developing components to support data ingest and curation activities.
The following components have been developed so far: fixity (MD5 checksum), data integrity (JHOVE for format verification and validation), metadata creation (MIX and METS data generator and validator), format normalization and generation (PPT/DOCX to PDF, XLSX to CSV, TIFF to JPEG and some more), persistent identification (DOI generator), repository integration (ingest to DSpace via SWORD, DOI

Example of Single Object Workflow

Digital Curation Activities in Trident
It's been great to play with these components and see how repetitive and boring tasks can be automated. The biggest question, of course, is whether the tool is ready for wider adoption. It's relatively easy to work with existing components, but it requires coding experience to modify them and create new ones.