
Mar 29, 2018

Cybersecurity Curricular Guideline

A report released by the Task Force on Cybersecurity Education provides a comprehensive framework and guidelines for post-secondary cybersecurity education (pdf). According to a presentation by one of the task force co-chairs, Diana Burley, it was a huge effort involving many consultations, much travel, and many experts. The report also went through an endorsement process with four major computing organizations: the ACM, the IEEE Computer Society, the Association for Information Systems Special Interest Group on Security (AIS SIGSEC), and the International Federation for Information Processing Technical Committee on Information Security Education (IFIP). The resulting report can hopefully help define cybersecurity as a discipline, describe the proficiencies cybersecurity experts need, and connect academic programs with industry needs. Ultimately, bringing common understanding and standardization into cybersecurity education should improve the education itself and help fill the shortage of security professionals.

In terms of definition, cybersecurity involves the creation, operation, analysis, and testing of secure computer systems. The report assumes that while cybersecurity is an interdisciplinary area that includes law, policy, human factors, ethics, and risk management, it is fundamentally a computing-based discipline. One of the challenges in developing the curricular guidelines was accommodating the large variability of cybersecurity programs: depending on the department or program in which they are created, programs can differ significantly in content and emphasis. The guidelines are therefore designed to allow flexibility through the notion of a disciplinary lens. A program should be built on a solid computer science foundation, with input from computer engineering, software engineering, and information systems and technology, and should include cross-cutting concepts such as confidentiality, integrity, risk, and systems thinking.

The report shows a serious effort to be comprehensive and yet flexible. It includes eight knowledge areas: data, software, component, connection, system, human, organizational, and societal security. Each area comprises several units, each with described essentials and learning outcomes. There is some overlap between areas and units, which, again, helps accommodate the variety of existing education efforts. Below is a summary that provides a quick overview of some areas:

[Figure: overview of selected knowledge areas and units]
It is nice to see that ethics is a significant and explicit component of the curriculum. While it doesn't remove the challenge of educating technical professionals about ethics and human behavior, it certainly provides space for those discussions. More information about the guidelines and the task force is at http://cybered.acm.org/

Jun 8, 2016

Cyberinfrastructure studies overview

In their introduction to the special issue on sociotechnical studies of cyberinfrastructure (CI) and e-research, Ribes and Lee identify the current themes and methodologies of CI studies (Computer Supported Cooperative Work (CSCW), 2010, Volume 19, Issue 3, pp. 231-244, doi: 10.1007/s10606-010-9120-0).

Cyberinfrastructure (CI) is one of the current terms for the technologies that support scientific activities such as collaboration, data sharing and dissemination of findings. CI features that distinguish it from other CSCW work include: community wide and cross-disciplinary scope, computational orientation, and end-to-end (data-to-knowledge-to-user) integration.

Themes in CI studies:

  1. Relationality. What supports the work of another, and who sustains those relationships?
  2. Integration of heterogeneity. CI involves computer specialists, data and information managers, domain scientists, and so on, but also non-human actors such as sensors and databases.
  3. Sustainability. What makes CI a long-term resource?
  4. Standardization. Ways to achieve integration on the technical and human levels.
  5. Scale. How to plan for change and growth in the number of collaborators, the quantity of data, and the geographical reach.
  6. The distribution of work between humans and technological delegation.

Methods include historical, ethnographic, documentary, and interview-based approaches that focus on the following:

  • Investigations of ongoing planning, development and deployment efforts 
  • Activities of maintenance, upgrade and breakdown
  • Adoption of certain expressions of scientific activity and changes in their use
  • Adoption of new technological artifacts

Units of analysis can be a project or CI as a whole (focus on national policies and funding incentives). The introduction concludes by calling for more studies:

The stories of cyberinfrastructure are revealed by looking across multiple levels of granularity, various facets of social life, and diverse technological actors. Much remains to be studied in the areas of supporting domain specific practice, data sharing and curating, and infrastructural organizings. This is an exciting time for CI studies. Research is occurring in new and unexpected places, drawing on and bringing together the traditions of CSCW, information science, organizational studies, and science and technology studies. This cross-pollination, as exemplified by the papers in this issue, seems to be not only fruitful, but also very necessary.

Apr 18, 2016

Dataset on Parkinson's disease

In March 2016 Sage Bionetworks released a dataset that captures the everyday experiences of over 9,500 people with Parkinson's disease (press release). The data described in the data paper "The mPower study, Parkinson disease mobile data collected using ResearchKit" was collected via the mPower iPhone app, where participants were presented with tasks (referred to as ‘memory’, ‘tapping’, ‘voice’, and ‘walking’ activities) and asked to fill out surveys.

Not everybody agreed to share their data broadly with the research community. Of the 14,684 verified participants, 9,520 (65%) agreed to share broadly; the rest split between withdrawing from the study and agreeing to share only narrowly, with the study team:

Figure 1: mPower study cohort description. From http://www.nature.com/articles/sdata201611#methods

To provide proper safeguards and to balance sharing and privacy, the research team established a data governance structure. Access is granted to qualified researchers who agree to specific conditions of use, including the following:

  • users may not attempt to re-identify participants
  • the data may not be redistributed
  • findings need to be published in open access venues
  • both participants and the research team need to be acknowledged as data contributors

This effort is another example of the newly forming data sharing culture. It also uses Synapse, which seems to make sharing easier from both technical and policy perspectives.
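For those curious about the mechanics, below is a minimal sketch of fetching a file with the synapseclient Python package once access has been granted; the Synapse ID is a placeholder, since the real mPower dataset IDs require approved access.

    import synapseclient

    syn = synapseclient.Synapse()
    syn.login()                     # uses cached credentials or a personal access token
    entity = syn.get("syn0000000")  # hypothetical dataset ID
    print(entity.path)              # local path of the downloaded file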

Oct 19, 2015

Making progress in data sharing

A few useful tips on making progress in data sharing in a blog post "Data Sharing: Access, Language, and Context All Matter":

  • To make the global data system less fragmented and disorganized, create data portals with good human-centered designs and support users with varying levels of expertise

  • JSON and XML are great, but humans read data too. These formats are critical to fueling innovation, but make sure CSVs are available as well (a conversion sketch follows this list)

  • Responsible data use demands proper attention to metadata. Document datasets, and don't ignore README files when reusing them
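On the formats point, producing a CSV view of JSON data takes only a few lines of Python. A minimal sketch, assuming a hypothetical records.json that holds a flat list of objects with identical keys (nested data would need flattening first):

    import csv
    import json

    with open("records.json", encoding="utf-8") as f:
        records = json.load(f)  # e.g., [{"id": 1, "site": "A"}, ...]

    with open("records.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)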

Aug 10, 2015

Losing data from the National Centre for e-Social Science (NCeSS) portal

Submitted by Andy Turner

Edited by Inna Kouper

The National Centre for e-Social Science (NCeSS) was a UK-based program established around 2004 to stimulate the development of digital tools and services for social scientists. Around 2008 it adopted Sakai as a system for communicating, developing information, and storing and managing access to data. NCeSS was configured with a “hub” at the University of Manchester and a network of research nodes across the UK (see the Digital Social Research page for the list of nodes, many now archived).

Andy Turner, a researcher at the University of Leeds, worked on a project to develop demographic models for a geographical simulation system. The project, abbreviated MoSeS (Modelling and Simulation for e-Social Science), was one of the first-phase research nodes of NCeSS. Some information is available on Andy’s page, but many links from there are now dead. Andy explains why:

“In 2011 the NCeSS Sakai Portal went off-line following a server failure and because there were no more resources for replacing the server. All the data was stored in a database on a National Grid Service server which for some reason had a catastrophic failure. All that remained for me to salvage were some backup database dumps, which fortunately also contained the portal front end configuration which enabled me with the help of my local IT team to get a database reader set up and a version of the NCeSS Sakai Portal working almost, but not quite as it had been. This was good enough to get some data out, but my local IT were not willing to make the system accessible again for security reasons. As a consequence of the problems some detailed social simulation model run results were lost. These would take a lot of time and effort to reproduce as they were generated on a fairly massive computer, which we got access to thanks to the UK-CERN collaboration GridPP and my collaborator Tom Doherty from the University of Glasgow. The work with Tom was undertaken as part of the Jisc funded project NeISS (a project to establish a National e-Infrastructure for Social Simulation), which was by the time of the server failure supporting the NCeSS Sakai Portal.

In theory, sufficient metadata has been stored from the simulation runs so that the results can be readily produced, but this is unlikely to transpire as the results were really only academically interesting as their inherent uncertainties were too great to make them of practical use. Anyway, I have given up on all that for now. I have moved on, but at the time it was rather painful seeing what probably amounted to almost three years of my effort turn to nothing. I may still get something more out of it in the long run because of the learning involved in this process. Explaining what happened to my academic superiors who desperately wanted research outputs was hard. One day I may return to research that pushes the boundaries of what we can and can’t do, but I know that is risky as failure is not tolerated well in academia."

Reflecting on the importance of preservation and curation of data, Andy writes:

“Preservation and curation are not easy. Sustaining research effort that may one day generate useful data and software is also not easy, especially when the goal is aspirational and probably quite a long way off and the steps are necessarily baby steps to begin with. In NCeSS, issues of sustainability were discussed from early on for each NCeSS research project and for the organisation itself. Documentation about this from 2008 was stored on the portal and so is now also inaccessible…

The soft learning experiences of failure and how these relate to sustainability and the importance of promoting collaboration, re-use and enrichment in the research process are key, but where can these be written up in the academic literature? The blog might seem ephemeral, but these days they can be captured by a directed Internet Archive WayBack Machine and preserved for the future.”

Our data stories blog is one of the places where such discussions can be recorded, and while we don’t have a solid sustainability plan, we do keep external backup copies of the stories. If you have stories similar to Andy’s to share, use our form, send them directly to datastories@dcc.ac.uk, or register on the website and become a contributor.

Jul 7, 2015

The International Polar Year 2007-2008 (IPY-4) and the importance of data management

The International Polar Year is an international collaboration that focuses on the Arctic and the Antarctic, or polar regions. The polar regions have many unique phenomena, but the cold, harsh environment makes them expensive to visit and study. It takes a large multi-country collaborative effort to put together expeditions, install equipment, and collect data. The first three IPYs occurred in 1882-1883, 1932-1933, and 1957-1958, respectively. The fourth IPY took place between March 2007 and March 2009.

The fourth IPY was dramatically different from the previous efforts (Mokrane and Parsons, 2014). A $1.2 billion effort with participants from more than 60 countries, it had an ambitious vision to enable international sharing and reuse of multidisciplinary datasets and to keep the data discoverable, open, linked, useful, and safe (Parsons, Godøy et al., 2011). The enormous efforts to initiate, coordinate, improve, and sustain IPY data stewardship have seen both successes and failures, with some components of the IPY infrastructure struggling to exist and be useful (Lessons and legacies..., 2012).

A fair amount of IPY data is available via online portals such as the IPY data page at the National Snow and Ice Data Center (NSIDC) in the US, the NASA Global Change Master Directory (GCMD) IPY portal, and the Global Cryosphere Watch portal. Other resources, such as the global IPY Data and Information System (IPYDIS) or the Discovery, Access, and Delivery of Data for IPY (DADDI), are broken. Most importantly, though, what is missing is a way to track and access all the IPY data via a federated or centralized catalog. There is no consistent way for international polar data to “function locally and reach globally,” to use Mokrane and Parsons’ words.

The challenges of making heterogeneous data and metadata work together were exacerbated by the lack of focused international funding for post-IPY data archiving, differences in data policies, and researchers “hoarding” data (Lessons and legacies..., 2012; Carlson, 2011). Although many IPY projects adopted a free and open data-sharing policy, compliance with it, and ultimately sharing, was rather low. Additionally, researchers in IPY-4 didn’t have access to data from the first IPY projects: some of those data were never digitized, while others were scattered or lost. The World Data Centers (WDCs) that were supposed to support the increasing IPY data streams lacked mechanisms for working with heterogeneous data; for example, they couldn’t support social and ecological data.

Despite the difficulties, the IPY data management experience is crucial to the advancement of global data services and the norms of data sharing and re-use. As Mark Parsons, Secretary General of the Research Data Alliance and former Senior Associate Scientist and Lead Project Manager at NSIDC, put it,
“We were perhaps rather naive going in to IPY. Many of the organizers came from the geoscience background of the earlier IPYs and assumed data systems would exist that could handle IPY data. We weren’t prepared for the incredible diversity of IPY4 with data ranging from Indigenous knowledge to satellite remote sensing to genomic sequencing to cosmology. Although it is unclear what percentage of IPY data are available and much is surely lost, new data services were created and sustained, international coordination continues in sustained organisations, and we learned a lot about different disciplinary cultures and their attitudes to data sharing. The IPY Data Policy was aggressive and not fully honored, but it did drive changes in national policies towards more timely and open release of data. Most critically we saw a change in the conversation within polar science from whether to share to when to share and now how to share. We have a long way to go, but polar data are significantly more accessible than they were prior to IPY.”

Mark’s and others’ publications, some of which are listed below, are a good source of all the lessons learned from IPY data stewardship efforts, one important lesson being that “[e]xperts in data management are critical members of any team attempting internationally coordinated science ...” (Lessons and legacies..., 2012).

Resources

Carlson D. 2011. A lesson in sharing. Nature 469: 293.

Lessons and legacies of the International Polar Year 2007-2008. 2012.

Mokrane M and MA Parsons. 2014. Learning from the international polar year to build the future of polar data management. Data Science Journal 13.

Parsons MA, Ø Godøy, E LeDrew, TF de Bruin, B Danis, S Tomlinson, and D Carlson. 2011. A conceptual framework for managing very diverse data for complex interdisciplinary science. Journal of Information Science 37 (6): 555-569.

Parsons MA, T de Bruin, S Tomlinson, H Campbell, Ø Godøy, J LeClert, and IPY Data Policy and Management SubCommittee. 2011. The state of polar data—the IPY experience. In Understanding Earth’s Polar Challenges: International Polar Year 2007-2008. Ed. Krupnik I, I Allison, R Bell, P Cutler, D Hik, J López-Martínez, V Rachold, E Sarukhanian, and C Summerhayes. Edmonton, Canada: CCI Press.

Apr 6, 2015

Digital vellum and other projects to preserve digital information

Submitted by Isabel Chadwick, revised by Inna Kouper

Vint Cerf, co-designer of the Internet architecture and Google's Vice President and Chief Internet Evangelist, spoke at the 2015 annual meeting of the American Association for the Advancement of Science in San Jose, CA. He warned the audience that we face a forgotten generation, or even a forgotten century, because we don't have a regime that preserves digital information in a rational and systematic manner. Computer files of many kinds, including correspondence, entertainment, education, and work records, are at risk of becoming unreadable.

He proposed a digital vellum, a system capable of preserving the meaning of the digital objects over hundreds to thousands of years. Thinking about access to documents hundreds of years later is challenging, especially when every new technology has a risk of being incompatible with the old ones. Atari game cartridges, floppy disks, zip drives, and many other older technologies are now hard or impossible to access. One of the efforts to preserve software, games, and other executable content is Olive Executable Archive, which creates virtual machines that simulate executable environments. Unfortunately, as the project website says, “for legal reasons, the VMs are currently accessible only to our research collaborators.”


Mar 30, 2015

Dedoose crash and data loss

Dedoose is a web application that supports qualitative and mixed-methods research relying on text, images, audio, video, spreadsheets, and so on. It was developed at the University of California, Los Angeles (UCLA) with support from the William T. Grant Foundation. Web accessibility coupled with cloud storage and processing is among the key features of Dedoose, reflected in its “Anytime, Anywhere, Any Internet” motto. Researchers store data on Dedoose servers and can access it from anywhere and on any platform.

On May 6, 2014, the Dedoose platform crashed. The cascading system failure coincided with a full database encryption and backup and resulted in the corruption of the entire storage system. The team wrote on its blog that “data added to Dedoose up to mid-April will be recovered and restored. … we are not optimistic we will be able to recover data added to the system for roughly the 2 – 3 week window preceding the failure”. It is not clear how many researchers were affected or how much was lost, but researchers did lose data on Dedoose. Some comments from “Hazards of the Cloud: Data-Storage Service’s Crash Sets Back Researchers” illustrate the issue:
“... I lost about 20 hours of work, which isn't the end of the world, but hurts when you are trying to finish a PhD and work full time. The reason why people don't have back ups is because the back up isn't necessarily useful. The file that you work on in the program is essentially an annotated document (or audio/video file) that you select chunks as excerpts and then apply codes to, so that later you can analyze the corpus of documents for themes. The export from Dedoose is simply an excel file of the excerpts you've made, so it helps to have as a reference, but you wouldn't be able to work from it the way you can work from a word file that you've backed up.”

“... Many of us DID back up... however, I don't think you understand that backing up coded video and or audio files in Dedoose does not back up the project as you would view it within Dedoose (online)... only as a spreadsheet... You CAN, however, fully back up an NVIVO project or any file on your hard drive as an EXACT duplicate (not the case with Dedoose). ... I am completely dependent on them and their promise that they backup nightly and protect our data so well that we don't have to worry about it.”

“Allow me to add only that the fact that Dedoose apparently outputs only a spreadsheet evidences that these platforms, for all their bells and whistles, are databases. It is important, IMHO, that researchers become adapt at building their databases from the ground up, and only after doing so use any CAQDA. This doesn't (always) mean learning mySQL, Phyton, or other programing languages. It does mean knowing your way around Excell (or other spreadsheet app) and how to structure your data so that it can be moved into and out of platforms such as Dedoose.”

“Now I feel kind of empowered by my "keep the data in the hard drive, backup to the cloud, and once a semester, to an external hard drive" regimen.”

The crash raises some interesting questions about cloud vs. local storage, backup options, and the responsibilities of clients and vendors. How can we back up data in the cloud if some of the processing (visualizations, annotations, etc.) is not exportable? How many copies are enough? What does the client (user) need to check before signing up for cloud services?

Mar 11, 2015

Depositing in Netherlands repository adds value, researchers say

3TU.Datacentrum offers researchers within the technical and engineering sciences in the Netherlands several research data services. One of these services is a certified data repository.

A brochure describes good practices on some of the data sets in 3TU.Datacentrum. The brochure also contains a few good examples (stories), where researchers explain how depositing their research data to 3TU.Datacentrum, and making it openly available, has added value to their research.

Contributed by Annemiek van der Kuil, 3TU.Datacentrum


Feb 20, 2015

Repository features to motivate more data sharing

One of the challenges of creating data stewardship infrastructure is engaging the users and meeting and prioritizing their needs, particularly the needs of long-tail science research. "What would motivate researchers to make their data available?" is a question we continuously grapple with. A recent study "Potential contributor perspectives on desirable characteristics of an online data environment for spatially referenced data" published in First Monday asked a very similar question in the context of geographic data. The researchers hypothesized that potential data contributors of small scale, local spatial data would be more willing to share their data if a repository included a simple, clear licensing mechanism, a simple process for attaching descriptions to the data, and a simple post-publication peer evaluation/commenting mechanism.

The paper draws on 10 qualitative interviews and 110 responses to an online questionnaire. The qualitative interview responses were mixed; they don't seem to reveal any patterns or unusual concerns. Some of the quantitative results were also mixed, but some provide good numbers to support the hypotheses:

  • 90% of respondents said attribution (licensing) is important
    • 62% think that non-commercial attribution is important
    • 54% think that restricting re-use is important, i.e., others may use the data but not modify it in any way
  • 93% said ability to attach keywords or other descriptions to data is important
  • 78% said that commenting capability is important
  • 85% said that stability and long-term maintenance of the repositories matters

Conclusion:

This research, subject to the caveats listed below, suggests that it would be desirable from the perspective of potential contributors of data to provide infrastructure capability that would:

  • allow users to attach conditions to the use of their data;
  • provide basic information that could be translated into standards based metadata; and,
  • receive comments and feedback from users.

Feb 18, 2015

Research Data Alliance/US Call for Fellows

I'm a co-PI on a project that provides a great opportunity for early career researchers and professionals to engage with the Research Data Alliance and help improve data practices, making data management and data sharing easier and more transparent. Below are the details from the call for fellows:
The Research Data Alliance (RDA) invites applications for its newly redesigned fellowship program. The program’s goal is to engage early career researchers in the US in the RDA, a dynamic and young global organization that seeks to eliminate the technical and social barriers to research data sharing.

The successful Fellow will engage in the RDA through a 12-18 month project under the guidance of a mentor from the RDA community. The project is carried out within the context of an RDA Working Group (WG), Interest Group (IG), or Coordination Group (i.e., Technical Advisory Board), and is expected to have mutual benefit to both Fellow and the group’s goals.

Fellows receive a stipend and travel support and must be currently employed or appointed at a US institution.

Fellows have a chance to work on real-world challenges of high importance to RDA, for instance:
  • Engage with social sciences experts to study the human and organizational barriers to technology sharing
  • Apply a WG product to a need in the Fellow’s discipline
  • Develop plans to disseminate RDA research data sharing practices
  • Develop and test adoption strategies
  • Study and recommend strategies to facilitate adoption of outputs from WGs into the broader RDA membership and other organizations
  • Engage with potential adopting organizations and study their practices and needs
  • Develop outreach materials to disseminate information about RDA and its products
  • Adapt and transfer outputs from WGs into the broader RDA membership and other organizations
The program involves one or two summer internships and travel to RDA plenaries during the duration of the fellowship (international and domestic travel). Fellows will receive a $5000 stipend for each summer of the fellowship. Fellows will be paired with a mentor from the RDA community.

Through the RDA Data Share program, fellows will participate in a cohort building orientation workshop offering training in RDA and data sciences. This workshop is held at the beginning of the fellowship. RDA Data Share program coordinators will work with Fellows and mentors to clarify roles and responsibilities at the start of the fellowship.

Criteria for selection: The Fellows engaging in the RDA Data Share program are sought from a variety of backgrounds: communications, social, natural and physical sciences, business, informatics, and computer science. The RDA Data Share program will look for a T-shaped skill set, where early signs of cross discipline competency are combined with evidence of teamwork and communication skills, and a deep competency in one discipline.

Additional criteria include: interest in and commitment to data sharing and open access; demonstrated ability to work in teams and within a limited time framework; and benefit to the applicant’s career trajectory.

Eligibility: Graduate students and postdoctoral researchers at institutions of higher education in the United States, and early career researchers at U.S.-based research institutions who graduated with a relevant master’s or PhD and are no more than three years beyond receipt of their degree. Applicants from traditionally underserved populations are strongly encouraged to apply.

To apply: Interested candidates are invited to submit their resume/curriculum vitae and a 300-500 word statement that briefly describes their education, interests in data issues, and career goals to datashare-inquiry-l@list.indiana.edu. Candidates are encouraged to browse the RDA website https://rd-alliance.org/ and pages of interest and working groups to identify relevant topics and mutual interests.

Important dates:
April 16, 2015 – Fellowship applications are due
May 1, 2015 – Award notifications
June 18-19, 2015 – Fellowship begins with the orientation workshop in Bloomington, IN

RDA Data Share, funded by the Alfred P. Sloan Foundation under award G-2014-13746, engages students and early career researchers in the Research Data Alliance. This engagement builds on foundational infrastructure funded by the National Science Foundation grant # ACI-1349002.

Jan 13, 2014

Identifier Test-bed Activities Report (ESIP Federation)

Below is a brief summary of a recent report to the ESIP Federation's Data Stewardship Committee that evaluated identifier schemes for Earth system science data and information (see also the executive summary and links). The report seems to be a hands-on continuation of the 2011 paper "On the utility of identification schemes for digital earth science data: an assessment and recommendations" by Ruth Duerr and others (link).
The paper introduced four use cases and three assessment criteria:
Use cases:
  • unique identification (identify a piece of data, no matter which copy)
  • unique location (locate an authoritative copy)
  • citable location (identify cited data)
  • scientifically unique identification (to tell whether two data instances have the same info even if the formats are different)
Assessment criteria:
  • Technical value (e.g., scalability, interoperability, security, compatibility, technological viability)
  • User value (e.g., publishers' commitment, transparency)
  • Archive value (e.g., maintenance, cost, versatility)
The report took those use cases, expanded the assessment criteria, and used them to test implementations of nine identification schemes (DOI, ARK, UUID, XRI, OID, Handles, PURL, LSID, and URI/URN/URL) on two datasets: the Glacier Photo Collection from the National Snow and Ice Data Center (JPEG and TIFF images) and a numerical dataset from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) sensor.
Report recommendations:
  • UUIDs are most appropriate as unique identifiers; any other use requires effort (see the sketch after this list).
  • DOI, ARK, and Handles are the most suitable as unique locators; DOI and ARK also support citable locators. Handles need a dedicated local server. ARKs are cheaper than the others, but DOIs are accepted by publishers.
  • PURL has no means for creating opaque identifiers and the API support for batch operations is poor.
  • The rest of the ID schemes are less suitable.
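As an aside (my illustration, not from the report), the UUID trade-off is easy to see in code: minting one requires no registry or resolver, which makes it cheap for unique identification but useless, by itself, for location or citation.

    import uuid

    dataset_id = uuid.uuid4()  # random 128-bit identifier, no registration needed
    print(dataset_id)          # e.g., 1b4e28ba-2fa1-4e5b-883f-0016d3cca427
    # Unlike a DOI or ARK, nothing maps this string to a location;
    # a resolution service would have to be built separately.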

The overall conclusion seems to be that DOI and ARK are generally better, but a system needs to support multiple ID schemes. From the report I didn't quite get whether any of the ID schemes can support the fourth use case, scientifically unique identification. The paper argued that "none of the identifier schemes assessed here even minimally address this use case".

Nov 11, 2013

Human infrastructure - build it bottom-up

An article by Procter et al., "Fostering the human infrastructure of e-research" (2013, restricted access), discusses the challenges of embedding computing resources and systems into research. E-infrastructures (aka cyberinfrastructures in the US) are defined as digital information and communication technologies (ICTs) that provide fast and scalable access to remote resources and increase discovery and innovation. Human infrastructure is the arrangement of actors and organizations that makes computer-supported research systems work. It is often acknowledged that human infrastructure is neglected compared to the investments in technical infrastructure. And that's where the problem is: cyberinfrastructures don't work without human adoption and use. As the authors state:

"So far, despite substantial investment, the desired transformative impact has yet to be achieved."

The article describes the Enabling Wider Uptake of e-Infrastructure Services project (ENGAGE/e-Uptake) that was designed to identify inhibitors and enablers of the adoption of e-Infrastructure services. The identification is based on interviews with ~50 researchers from higher education institutions and ~50 "intermediaries", or technical specialists who support researchers in their use of ICTs.

Findings

Obstacles in e-Infrastructure adoption and use:

  • Lack of training - many researchers hear about research computing services, but they often don't know about the nature of the services and the benefits of using them.
  • Lack of local research support - support is often basic, limited, fragmented and difficult to access.
  • Poor project management - projects that involve technical and research personnel have their own management needs, i.e., the need to manage collaborations between people with their own research agendas and temporarily aligned interests. Managers who lack such skills may make biased decisions and favor one type of team member over others.

Conclusions

  • There is complexity in divisions of labor and in organizational structures that may be historical. In cyberinfrastructure projects we may need more flexible and flatter approaches.
  • More teaching and training is needed - not only teaching of e-Research methods in classes, but also lifecycle outreach from the collaborative formation of projects through the acquisition of skills and the appropriation of technologies to the dissemination of experiences back into the community (see, for example, eIUS project for a collection of use cases and tools used in them).
  • User engagement can take the form of relying on "hybrids", i.e., people with both technical and domain expertise, or of co-locating technical experts and users throughout projects. More research is needed into how to do that and how to leverage community engagement.
  • New practices must be embraced not only by researchers, but by the organizations within which researchers work.

The article reinforces the idea that by default software and computing tools are hard to learn and use. Why is that? A common argument is that complex problems require complex solutions. Doesn't a simple fix sometimes work better? Or, perhaps, it's ok to have complex solutions, but they arise from a number of simple solutions combined and overlapped. I wonder whether we should start with building simple local systems ("recognized routes" or local roads) rather than large and multi-purpose systems (interstate highways, to continue the infrastructure metaphor). Once local needs are met and served well, we can move into connecting local systems (i.e., building bridges, gateways, etc.). It circles back to the investment in human infrastructure and bottom-up rather than top-down approaches.

Nov 16, 2012

Workflows for Digital Preservation and Curation

Notes from today's workshop on workflows for digital preservation and curation.

Curation means adding value to data throughout its life cycle, i.e., making sure it has meaning and context and can be re-used outside the creator's environment. Curation infrastructure includes repositories, access procedures, policies, processes, and institutional support. The unit of curation and preservation is a file. It's important to maintain file integrity by ensuring fixity, duplicate storage, and format validation.
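As an illustration (my sketch, not from the workshop materials), fixity is typically verified by recording a checksum at ingest and recomputing it later to detect corruption:

    import hashlib

    def md5_checksum(path, chunk_size=2**20):
        """Return the hex MD5 digest of a file, read in chunks."""
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
        return md5.hexdigest()

    stored = md5_checksum("dataset.tif")  # recorded at ingest
    assert md5_checksum("dataset.tif") == stored, "fixity failure"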
To preserve files, they need to be in proper formats, i.e., durable (transparent, documented, widely used, renderable) and supported by standards (syntactic and semantic). Syntactic standards don't carry context (e.g., in a CSV we don't know how the columns were created and what they mean); semantic standards are better.

In preservation, a good practice is to use a master file for preservation (highest quality and fidelity) and derivative files for active use and delivery. For example, a high-resolution TIFF for preservation and a lossy JPEG for viewing.
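Such a derivative can be generated with a few lines of Python using the Pillow library (my sketch; it assumes an 8-bit RGB or grayscale TIFF, since other bit depths may need extra conversion first):

    from PIL import Image

    master = Image.open("photo_master.tif")                     # preservation master
    master.convert("RGB").save("photo_access.jpg", quality=85)  # lossy access copy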

To implement the curation and preservation practices mentioned above, the following activities are often part of the workflow: ongoing verification (file and object integrity), metadata management, and management of obsolescence (hardware, software, formats, documentation).
Workflow systems simplify repetitive and mundane activities and facilitate best practices, coordination, and outreach. Various systems support scientific, research, and software development workflows, e.g., Kepler, Triana, Taverna, Ptolemy II, and BPEL.

Trident is an open source workflow package from Microsoft with many components and functions. The Data to Insight Center has been working on developing components to support data ingest and curation activities. The following components have been developed so far: fixity (MD5 checksum), data integrity (JHOVE for format verification and validation), metadata creation (MIX and METS data generator and validator), format normalization and generation (PPT/DOCX to PDF, XLSX to CSV, TIFF to JPEG, and more), persistent identification (DOI generator), and repository integration (ingest to DSpace via SWORD, DOI generator).

[Figure: example of a single object workflow]

[Figure: digital curation activities in Trident]
It's been great to play with these components and see how repetitive and boring tasks can be automated. The biggest question, of course, is whether the tool is ready for wider adoption. It's relatively easy to work with existing components, but it requires coding experience to modify them and create new ones.

Aug 7, 2012

(Cyber)infrastructures

Thoughts based on the readings about infrastructure, especially “Understanding Infrastructure: Dynamics, Tensions, and Design,” a great report by P. Edwards, S. Jackson, G. Bowker, and C. Knobel.

Development of (cyber)infrastructures is not merely a technical/engineering issue. To ensure success, we need to be aware of the historical context and socio-political issues as well as the messiness of everyday practices.

Historical (dis)continuities underlie many infrastructural projects. Cyberinfrastructures and data science / curation problems did not appear out of nowhere in the 20th century. They have historical precursors, such as:

  • information gathering activities by the state (statistics as science of state) and the development of sciences as accumulation of records
  • the development of technologies and organizational practices to sort, sift and store information

Questions of ownership, management, control, and access are always present in infrastructural developments. With regard to data, years of private ownership have led to many idiosyncratic practices and formats, which, along with an absence of metadata, prevent understanding and use by other scientists.

A good quote: “The consequence is that much “shared” data remains useless to others; the effort required for one group to understand another’s output, apply quality controls, and reformat it to fit a different purpose often exceeds that of generating a similar data set from scratch.” (p. 19 of the report)

Cyberinfrastructure development means system building. Successful system-builder teams are made up of technical “wizards”, who envision and create the system, a “maestro,” who orchestrates the organizational, financial, and marketing aspects of the system, and a “champion” who stimulates interest in the project, promotes it and generates adoption. During infrastructural growth, users and user communities can also become critical to success or failure.

Design-level perspective differs from the perspective on the ground. The former can be neat and organized, while the latter can be disorderly and requiring a lot of work. Finding ways to translate between these two perspectives and to incorporate lessons learned from “below” into design from “above” is a challenge and a crucial element of success.

A great quote: “It is also possible that a tech-centered approach to the challenge of data sharing inclines us toward failure from the beginning, because it leaves untouched underlying questions of incentives, organization, and culture that have in fact always structured the nature and viability of distributed scientific work.” (p. 32 of the report)

Additional reading – my other post about general issues in data curation.

Jul 30, 2012

Digital science ecosystem

From the GRDI2020 Final roadmap report: Global scientific data infrastructures: The big data challenges (pdf):

Data - any digitally encoded information, including data from instruments and simulations; results from previous research; material produced by publishing, broadcasting and entertainment; digitized representations of diverse collections of objects, e.g., of museums’ curated objects.

Research Data Infrastructures - managed networked environments (services and tools) that support the whole research cycle and the movement of data and information across domains and agencies.

An ecosystem metaphor is used to conceptualize the science universe and its processes. A digital science ecosystem is composed of:

  • Digital Data Libraries that are designed to ensure the long-term stewardship and provision of quality-assessed data and data services.
  • Digital Data Archives that consist of older data that is still important and necessary for future reference, as well as data that must be retained for regulatory compliance.
  • Digital Research Libraries as a collection of electronic documents.
  • Communities of Research as communities organized around disciplines, methodologies, model systems, project types, research topics, technologies, theories, etc.

While I can see how the ecosystem metaphor can be beneficial in conceptualizing the science universe, I don’t think it was developed enough here. The whole report is structured around tools and infrastructure, understood rather narrowly. It seems that the biggest roadblocks are in the domain of human interactions: all those issues of social hierarchies and capital built into our social institutions.

Paul Edwards (one of the authors of another reading that seemed more sophisticated to me) wrote about this in his book “A Vast Machine”, about the infrastructure surrounding weather forecasting and climate change. He talks about how the many efforts of various social actors facilitated the creation and inversion of infrastructure by constantly questioning data, models, and prognoses. Here is a long quote from the concluding chapter of that book that demonstrates the emphasis on people and the making of data-knowledge-infrastructure:

“Beyond the obvious partisan motives for stoking controversy, beyond disinformation and the (very real) “war on science,” these debates regenerate for a more fundamental reason. In climate science you are stuck with the data you already have: numbers collected decades or even centuries ago. The men and women who gathered those numbers are gone forever. Their memories are dust. Yet you want to learn new things from what they left behind, and you want the maximum possible precision. You face not only data friction (the struggle to assemble records scattered across the world) but also metadata friction (the labor of recovering data’s context of creation, restoring the memory of how those numbers were made). The climate knowledge infrastructure never disappears from view, because it functions by infrastructural inversion: continual self-interrogation, examining and reexamining its own past. The black box of climate history is never closed. Scientists are always opening it up again, rummaging around in there to find out more about how old numbers were made. New metadata beget new data models; those data models, in turn, generate new pictures of the past.” (P. N. Edwards, “A vast machine”, p. 432)

Why should we trust climate science and its infrastructures? Because of a “vast machine” built by a large community of researchers who constantly try to invert it. So in order to understand, develop, and advance data-intensive environments, we shouldn’t treat social forces as external. They are part, if not the foundation, of the data universe. I’d propose to equally emphasize tools (storage, transfer, and sharing tools) and social arrangements (individuals, institutions, political contexts, events, and so on) as elements of the ecosystem.