
Feb 13, 2017

Data quality - a short overview

love your data image

A short overview of data quality definitions and challenges in support of the Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality" and I was part of preparing daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website: Defining Data Quality; Documenting, Describing, Defining; Good Data Examples; Finding the Right Data; Rescuing Unloved Data.

Data quality is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is data that can be used for the task at hand even if it has some issues (e.g., missing values, poor metadata, or value inconsistencies). Data that has errors, is hard to retrieve or understand, or has no context or traces of where it came from is generally considered bad.

Numerous attempts to define data quality over the last few decades relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of data quality is recognized in business and commercial data warehousing (Fan and Geerts, 2012; Redman, 2001), in government operations (Information Quality Act, 2001), and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate the quality of information, including decision, measurement, test, and estimation theories (Altman, 2012).

Attempts to develop discipline-independent frameworks resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Kahn, Strong and Wang, 2002), as syntactic, semantic, and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts, discipline-independent data quality frameworks have not been widely adopted, and more frameworks continue to appear. Several systematic syntheses compared the existing frameworks only to point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009).

Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):

  • how to assess quality
  • how to manage quality
  • what impact quality has on organizations
The multitude of definitions, frameworks, and contexts in which data quality is used demonstrates that making data quality a useful paradigm is a persistent challenge. Addressing it would benefit from a dynamic network of researchers and practitioners in the area of data quality and from a framework that is general and yet flexible enough to accommodate highly specific attributes and measurements from particular domains.

data quality attributes
Each dimension of data quality, such as completeness, accuracy, timeliness, or consistency, creates its own measurement and management challenges.

Completeness, for example, is the extent to which data is not missing and is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that the representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
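As a toy illustration (not from the post), completeness can be measured as the fraction of non-missing values per field; the records and field names below are invented:

```python
# Toy records with deliberately missing values (None marks a gap).
records = [
    {"site": "A", "temp": 21.5, "ph": 7.1},
    {"site": "B", "temp": None, "ph": 6.8},   # missing temperature
    {"site": "C", "temp": 19.9, "ph": None},  # missing pH
]

def completeness(rows, field):
    """Fraction of rows where `field` is present and not None."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)

for field in ("site", "temp", "ph"):
    print(f"{field}: {completeness(records, field):.2f}")
    # site: 1.00, temp: 0.67, ph: 0.67
```

A score below 1.0 flags a field whose missing values may bias any downstream analysis, which is exactly the danger described above.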

As data goes through many stages of the research lifecycle, from collection and acquisition through transformation and modeling to publication, each stage creates additional challenges for maintaining the integrity and quality of the data. In one of the most recent attempts to discredit climate change studies, for example, the authors of the study were accused of not following the NOAA Climate Data Record policies, which maintain standards for documentation, software processing, access, and preservation (Letzter, 2017). This suggests possibilities for further study:
  • How does non-compliance with policies undermine the quality of data?
  • What role does scientific community consensus play in establishing the quality of data?
  • Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized? 
Another aspect of data quality that complicates formalized treatment of individual dimensions is that data is often heterogeneous and applied in varied contexts. As noted above, data quality frameworks and approaches are being developed in business, government, and research contexts, and quality solutions have to consider structured, semi-structured, and unstructured data and their combinations. Most previous data quality research has focused on structured or semi-structured data. Additionally, the spatial, temporal, and volume dimensions of data contribute to quality assessment and management.

Madnick et al. (2009) identify three approaches to possible solutions to data quality: technical or database approach, computer science / information technology (IT) approach, and digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking and probabilistic modeling. Computer / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes paying attention to metadata, long-term preservation, and provenance.
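The "monitoring and cleaning" idea from the technical approach can be sketched as declarative rules checked against each record, flagging violations rather than silently fixing them; the rule names, fields, and thresholds below are invented for illustration:

```python
# Hypothetical validation rules: each maps a name to a predicate on a record.
rules = {
    "temp_in_range": lambda r: r.get("temp") is None or -50 <= r["temp"] <= 60,
    "ph_in_range":   lambda r: r.get("ph") is None or 0 <= r["ph"] <= 14,
    "site_present":  lambda r: bool(r.get("site")),
}

def audit(rows):
    """Return (row_index, rule_name) pairs for every rule violation."""
    return [(i, name)
            for i, r in enumerate(rows)
            for name, check in rules.items()
            if not check(r)]

bad = audit([{"site": "A", "temp": 210.5, "ph": 7.1},  # temp out of range
             {"site": "",  "temp": 20.0,  "ph": 6.9}]) # missing site name
print(bad)  # → [(0, 'temp_in_range'), (1, 'site_present')]
```

Keeping the rules as data rather than scattering checks through processing code makes the quality criteria themselves inspectable, which also serves the provenance and documentation concerns of the curation approach.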

Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility shared between data providers, data curators, and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage, and metadata. Maintaining current and consistent metadata across copies and systems also benefits those who intend to re-use the data. Data and software documentation is another aspect of data quality that cannot be solved purely technically and needs a combination of organizational and information science solutions.

References and further reading:

Sep 2, 2016

Workshop: Data Quality in Era of Big Data

The center where I work organizes a workshop of possible interest to many who work with data. Scholarships are available.

Data Quality in Era of Big Data
Bloomington, Indiana
28-29 September 2016


Throughout the history of modern scholarship, the exchange of scholarly data was undertaken through personal interactions among scholars or through highly curated data archives. In either case, implicit or explicit provenance mechanisms gave a relatively high degree of assurance of the quality of the data. However, the ubiquity of the web and mobile digital culture has produced disruptive new forms of data. We need to ask ourselves what we know about the data and what we can trust. Failure to answer these questions endangers the integrity of the science produced from these data.

The workshop will examine questions of quality:
  • Citizen science data
  • Health records
  • Integrity
  • Completeness; boundary conditions
  • Instrument quality
  • Data trustworthiness
  • Data provenance
  • Trust in data publishing

The 2-day workshop begins with a half day of tutorials. The main workshop starts in the early afternoon on 28 September and continues to noon on 29 September. With sufficient interest, there may be another training session following the noon conclusion of the main workshop on 29 September.

Early Career Travel Funds:
Travel funds are available for early career researchers, scholars, and practitioners http://d2i.indiana.edu/mbdh/#scholarships

Important Dates:
  • Workshop: Sep 28-29, 2016
  • Deadline for requesting early career travel funds: Sep 9, 2016, midnight EDT
  • Notification of travel funding: Sep 13, 2016
  • Registration deadline: Sep 19, 2016
Organizing Committee:
General Chair: Beth Plale, Indiana University

Program Committee
Carl Lagoze, University of Michigan, chair
Devan Donaldson, Indiana University
H.V. Jagadish, University of Michigan
Xiaozhong Liu, Indiana University
Jill Minor, Indiana University
Val Pentchev, Indiana University
Hridesh Rajan, Iowa State University

Early Career Chairs
Devan Donaldson, Indiana University 
Xiaozhong Liu, Indiana University

Local Arrangements Chair
Jill Minor, Indiana University

Jun 6, 2016

The Net Data directory

The Berkman Center for Internet & Society announced the launch of the Net Data Directory - a free, publicly available database of data about the Internet that covers topics such as cybersecurity, civil and human rights, social media, and more. The directory currently contains about 150 data source records and includes many types of sources, including website rankings, opinion surveys, maps of activities, and so on.

The press release says that records are maintained by researchers at the Berkman Center, which means that keeping the directory current, relevant and error-free will be a challenge. As the number of sources grows, it will also be harder to navigate the directory through search and browse, without more sophisticated tools of filtering, recommendations, and visualizations.

Jan 11, 2016

Pantheon 1.0: A manually verified dataset of globally famous biographies



Scientific Data has published a description of an interesting dataset: "Pantheon 1.0, a manually verified dataset of globally famous biographies". This data collection effort contributes quantitative data for studying historical information, especially information about famous people and events.


Data collection workflow diagram (image from the paper)
The authors retrieved over 2 million records about famous ("globally known") individuals from Google's Freebase, narrowed the dataset down to individuals who have metadata in the English Wikipedia, and then reduced it further to people who have records in more than 25 different language editions of Wikipedia.

Manual cleaning and verification included developing a controlled vocabulary for occupations and popularity metrics (defined as the number of Wikipedia edits, adjusted by age and pageviews).
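As a toy sketch of what an "adjusted" popularity score might look like, one could combine edit counts and pageviews and damp the result by the figure's age; the formula and weights below are invented and are NOT the paper's actual definition:

```python
import math

def popularity(edits, pageviews, years_since_birth):
    """Toy score: log-compressed activity, damped by the age of the figure.
    Invented for illustration; not the Pantheon paper's metric."""
    raw = math.log1p(edits) + math.log1p(pageviews)
    return raw / math.log1p(years_since_birth)
```

Log compression keeps a handful of extremely viewed pages from dominating the ranking, and dividing by a function of age roughly normalizes older figures, who have had more time to accumulate edits and views.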




The dataset is available for download at Harvard Dataverse http://dx.doi.org/10.7910/DVN/28201. Another entertaining part is the visualization interface at http://pantheon.media.mit.edu, which allows users to explore the data and answer questions like "Where were globally known individuals in Math born?" (21% in France) or "Who are the globally known people born within present-day countries?". It turns out that Russia produced a lot of politicians and writers, while the US gave us many actors, singers, and musicians.

Globally known people born in the US (from http://pantheon.media.mit.edu/treemap/country_exports/US/all/-4000/2010/H15/pantheon)


Feb 18, 2015

Research Data Alliance/US Call for Fellows

I'm a co-PI on a project that provides a great opportunity for early career researchers and professionals to engage with the Research Data Alliance and help improve data practices, making data management and data sharing easier and more transparent. Below are the details from the call for fellows:
The Research Data Alliance (RDA) invites applications for its newly redesigned fellowship program. The program’s goal is to engage early career researchers in the US in the RDA, a dynamic and young global organization that seeks to eliminate the technical and social barriers to research data sharing.

The successful Fellow will engage in the RDA through a 12-18 month project under the guidance of a mentor from the RDA community. The project is carried out within the context of an RDA Working Group (WG), Interest Group (IG), or Coordination Group (i.e., Technical Advisory Board), and is expected to have mutual benefit to both Fellow and the group’s goals.

Fellows receive a stipend and travel support and must be currently employed or appointed at a US institution.

Fellows have a chance to work on real-world challenges of high importance to RDA, for instance:
  • Engage with social sciences experts to study the human and organizational barriers to technology sharing
  • Apply a WG product to a need in the Fellow’s discipline
  • Develop a plan to disseminate RDA research data sharing practices
  • Develop and test adoption strategies
  • Study and recommend strategies to facilitate adoption of outputs from WGs into the broader RDA membership and other organizations
  • Engage with potential adopting organizations and study their practices and needs
  • Develop outreach materials to disseminate information about RDA and its products
  • Adapt and transfer outputs from WGs into the broader RDA membership and other organizations
The program involves one or two summer internships and travel to RDA plenaries (international and domestic) over the duration of the fellowship. Fellows will receive a $5000 stipend for each summer of the fellowship. Fellows will be paired with a mentor from the RDA community.

Through the RDA Data Share program, fellows will participate in a cohort building orientation workshop offering training in RDA and data sciences. This workshop is held at the beginning of the fellowship. RDA Data Share program coordinators will work with Fellows and mentors to clarify roles and responsibilities at the start of the fellowship.

Criteria for selection: The Fellows engaging in the RDA Data Share program are sought from a variety of backgrounds: communications, social, natural and physical sciences, business, informatics, and computer science. The RDA Data Share program will look for a T-shaped skill set, where early signs of cross discipline competency are combined with evidence of teamwork and communication skills, and a deep competency in one discipline.

Additional criteria include: interest in and commitment to data sharing and open access; demonstrated ability to work in teams and within a limited time framework; and benefit to the applicant’s career trajectory.

Eligibility: Graduate students and postdoctoral researchers at institutions of higher education in the United States, and early career researchers at U.S.-based research institutions who graduated with a relevant master’s or PhD and are no more than three years beyond receipt of their degree. Candidates from traditionally underserved populations are strongly encouraged to apply.

To apply: Interested candidates are invited to submit their resume/curriculum vitae and a 300-500 word statement that briefly describes their education, interests in data issues, and career goals to datashare-inquiry-l@list.indiana.edu. Candidates are encouraged to browse the RDA website https://rd-alliance.org/ and pages of interest and working groups to identify relevant topics and mutual interests.

Important dates:
April 16, 2015 – Fellowship applications are due
May 1, 2015 – Award notifications
June 18-19, 2015 – Fellowship begins with the orientation workshop in Bloomington, IN

RDA Data Share, funded by the Alfred P. Sloan Foundation under award G-2014-13746, engages students and early career researchers in the Research Data Alliance. This engagement builds on foundational infrastructure funded by the National Science Foundation grant # ACI-1349002.

Jan 5, 2015

Data politics and the dark sides of data

A somewhat lengthy post about the dark sides of data discusses whether data, its vast amounts, and its ubiquitous collection mechanisms help to "tell the truth to power", i.e., to change the world for the better. Will picking up the traces and revealing wrongdoing fix the world? Most likely not, because it's not clear whether we will care or do anything because of the data. Here is a great quote:

... lawyers cannot fix human rights abuses, scientists cannot fix global warming, whistle-blowers cannot fix secret services and activists cannot fix politics, and nobody really knows how global finances work - regardless of the data they have at hand...

The framework of countering power with data needs to be revised. It's not about the quantity or even quality of data, but about using whatever little we know to address not only our understanding (i.e., our rational capacities), but also our feelings and beliefs. Here is what gets in the way (those dark sides that everyone should think about):

  • corporate infrastructure for data and its cultures that creates an illusion of free and neutral services (i.e., services that have no monetary and no political cost)
  • non-transparency of most digital data and lack of control over it, which prevents us from copying, deleting, or processing our own data
  • de-politicizing of digital data or constructing data as fuel for innovation and services rather than a ground for moral, ethical, and political decisions

The post is rather pessimistic, but changes do not happen at once, so we should probably keep trying.

Nov 11, 2014

4th RDA Plenary - Breakout session on engagement

Below is a summary from a breakout session on engagement that I co-chaired with Andrew Maffei at the Research Data Alliance 4th plenary in Amsterdam, the Netherlands (Monday, September 22, 2014).

Introduction / Overview

The session had about 25 people in attendance.

I provided an overview of the group and its activities. The group receives strong interest and support at plenaries, but interest drops in between them.
Activities to date include working on the model to connect technically oriented groups and domain interest groups (Domain Interest Group Form and Function model, or DIG-FF), a summer internship project, and participation in the RDA/US advisory committee.

DIG-FF Model: we need to observe inter-group interactions and support the form and function of these groups as we can. It may be too early to propose a model. Rather, we can focus on small but practical things that can facilitate inter-group communication (e.g., creating information-collection instruments, disseminating information, etc.).

Objectives for P4:
  • Present the summer internship project
  • Modify the case statement (create a charter)
  • Attend breakout meetings of the domain-specific groups and collect information about their work and outcomes
  • Find opportunities to work on the amplification and adoption theme promoted by the RDA/US within the group and through collaborations
RDA/US summer internship project
The project was carried out by the RDA/US intern Rene Patnode from the University of California San Diego under the mentorship of the EIG chairs. Rene interviewed 16 chairs of the domain interest groups (DIGs) by phone and email. The goal of the project was to understand the barriers researchers face in data sharing.
Observations and findings:
  • There is a significant representation of information systems professionals rather than researchers in RDA
  • Responses were consistent with the literature about barriers: sharing is extra work, user interfaces are poor, no fit with current research culture, no funding for data, lack of good data sources
  • To remove those barriers we might try to make data sharing enjoyable and social (e.g., more interaction between researchers, etc.).
  • Gamification (e.g., adding points, badges, etc.) is one possible approach. Citizen science is another mechanism for data collection and sharing.
  • IT solutions need to better mirror the workflow that is currently in use
  • Suggestions for RDA role: make processes of RDA engagement clear and transparent, support cross-pollination, take a political stance by lobbying, encourage better technical development
Discussion
Many interesting points and questions were raised during the discussion. Below are some of them:
  • Collaborative virtual research environments are one way to improve inter-communication and incentivizing.
  • Do funders' requirements for data management plans, and their implementation, actually improve the outcomes of data stewardship and sharing?
  • Data needs to be useful for someone else to create “an appetite” for removing burdens
  • Cultural change usually means that you have to address ALL the stakeholders. Hence the idea for RDA to take a more political role.
  • Grant budgets need to support data management plans, which need resources.
  • Knowledge Exchange (http://www.knowledge-exchange.info/) is an organization that has interests similar to this group.
  • Fun in sharing is good, but what are the other reasons for sharing? We might want to ask the question “What would you like?” and work on that. Dig into the benefits and show scientists in various areas how sharing data can be of benefit.
  • We have talked a lot about domains. Another orthogonal axis is to look at organizations. Can we get universities, institutions, and membership organizations to declare values around data sharing?
  • Cultural differences in data sharing are often ignored. For example, there are different approaches to privacy and consent.
Next steps for the group

  • Develop a form to collect stories about benefits and pains of data management / sharing 
  • Start collecting stories
  • Identify and reach out to champions of data sharing
  • Design an ISHARE t-shirt 
  • Long term: build practical tools for engagement, pay attention to our own data practices, share the data from RDA, advocate for better RDA website, think about focusing on organizations instead of (or in addition to) domains, collaborate with the “Digital Practices in History and Ethnography” group on studying RDA as an organization 

Conclusion 
Engagement in RDA is very important; we need to keep going!

More about our group here: RDA Engagement Interest Group

May 9, 2014

Big data report from the White House

Another big data review, this time from the White House: "Big Data: Seizing Opportunities, Preserving Values" (pdf). The report explains what big data is (large, diverse, complex, longitudinal, distributed, making possible unexpected discoveries and creating an asymmetry of power between those who hold the data and those who intentionally or inadvertently supply it) and describes its implications for the public and private sectors. In addition to many known and less known examples of how big data can be good or bad, the report provides initial thoughts on recommendations for big data governance. It divides its policy framework into four overlapping core areas:

1. Big data and citizens - improve public services while preventing the government from accruing unlimited power by using increased surveillance, algorithmic profiling, and metadata tracking.

2. Big data and consumers - reduce cost of commercial services and personalize them while mitigating security breaches and risks of discrimination based on consumer profiles and lack of consumer awareness and data transparency.

3. Big data and discrimination - do less harm and prevent discriminatory uses of identification and re-identification techniques.

4. Big data and privacy - get used to less privacy while reconsidering the notice and consent framework.

In the concluding section the report had the following recommendations:

  • Advance the Consumer Privacy Bill of Rights.
  • Pass National Data Breach Legislation.
  • Extend Privacy Protections to non-U.S. Persons.
  • Ensure Data Collected on Students in School is Used for Educational Purposes.
  • Expand Technical Expertise to Stop Discrimination.
  • Amend the Electronic Communications Privacy Act.

It's a thorough report and definitely worth a read, but, similarly to my and my colleagues' big data review (pre-print), it's just the beginning of studying the implications and governance of big data.

May 2, 2014

Summary of drivers and barriers in data sharing

A nice summary of the drivers, barriers, and enablers that determine stakeholder engagement, based on expert interviews, appears in Dallmeier-Tiessen et al., 2014, Enabling Sharing and Reuse of Scientific Data (restricted access).

Drivers and benefits

  • Societal benefits - economic/commercial benefits; continued education; inspiring the young; allowing the exploitation of the cognitive surplus in society; better quality decision making in government and commerce; citizens being able to hold governments accountable.
  • Academic benefits - the integrity of science; increased public understanding of science.
  • Research benefits - validation of scientific results by other scientists; recognition of their contribution; reuse of data in meta-studies to find hidden effects/trends; testing new theories against past data; doing new science not considered when data was collected without repeating the experiment; easing discovery of data by searching/mining across large datasets with benefits of scale; easing discovery and understanding of data across disciplines to promote interdisciplinary studies; combining with other data (new or archived) in the light of new ideas.
  • Organizational benefits - publication of high quality data and citation of data enhance organizational profile; preserved data linked to published articles adds value to the product; data preservation is more business; reputation of institution as “data holder with expert support” is increased; combining data from multiple sources helps to make policy decisions; reuse of data instead of new data collection reduces time and cost to new research results; use of data for teaching purposes.
  • Individual contributor benefits - preserving data for the contributor to access later — sharing with your future self; peer visibility and increased respect achieved through publications and citation; increased research funding; when more established in their careers through increased control of organizational resources; the socio-economic impact of their research (e.g., spin-out companies, patent licenses, inspiring legislation); status, promotion and pay increase with career advancement; status conferring awards and honors.

Barriers and Enablers are Related to:

  • Individual contributor incentives
  • Availability of a sustainable preservation infrastructure
  • Trustworthiness of the data, data usability, pre-archive activities
  • Data discovery
  • Academic defensiveness
  • Finance
  • Subject anonymity and personal data confidentiality
  • Legislation/regulation

Nov 19, 2013

DLF forum notes: Data – sharing – libraries – culture

A recent Digital Library Federation (DLF) forum started with an inspiring keynote by R. D. Lankes and his challenging of the monopolies of content delivery in higher education and the neutrality of the library profession. He argued that librarians should improve society by facilitating knowledge creation, which includes not only providing access to resources, but also teaching literacy, genres, and communication, creating learning environments and motivating people to learn and research. He also emphasized that a librarian is somebody who has either training, professional experience, or spirit, thereby shifting the emphasis from degrees to actual knowledge and passion.

Less successful was the birds-of-a-feather (BoF) session I led about the Research Data Alliance (RDA). BoFs happened during lunch, which is probably not a good time to learn about a new organization. Spreading the word about new initiatives is also hard because they are new: there is more anticipation and preparatory work than something ready to use. I had several good conversations about RDA, but there could have been more.

All sessions were informative and productive, but the one that got me thinking was a session that I couldn’t attend "Creating the New Normal: Fostering a Culture of Data Sharing with Researchers". It’s a rich topic, so below is some food for thought - my interpretation of the session theme based on the materials prepared for the session by organizers and community notes taken by participants during the session.

The session was based on the Data Information Literacy (DIL) project that is looking to identify skills, capacities and toolkits for data management and leverage information literacy to bridge the disconnect between faculty, graduate students (who are often the data managers in scientific labs) and librarians.

The disconnect stems from different needs in relation to data: data collection vs. analysis vs. preservation and access. Even though researchers may recognize the importance of data management, they rarely consider it part of the curriculum or of an articulated research culture. To simplify: graduate students collect data, and both faculty and students analyze it and work on publications. The messiness of data collection and storage bothers many researchers, but they don't necessarily know what to do about it.

Librarians are increasingly interested in incorporating research data into their library collections and providing access to it as part of their service. They would like datasets to be better prepared for storage and sharing, and they are ready to help. They also have a mission of helping others to address their information needs, but they don’t necessarily have the power to do that. How could these disconnects be bridged?

By working through three scenarios proposed for discussion, the session seemed to approach disconnect bridging via embedding librarians/data specialists into research teams and projects, learning about researchers’ needs and helping them with their data, while teaching them good practices of data management. Librarians’ expertise can be useful in the areas of file organization, metadata, and tools for storage and sharing.

This approach is good, but it’s quite demanding in terms of resources. I agree that data literacy should be embedded in a larger teaching of proper information management (including ethics, security, authoritativeness, etc.) and this could help re-use existing channels of library instruction and minimize resources. At the same time, I wonder whether we should also think about the culture of data sharing in terms of private/public epistemic objects. Data is still a private object, which will constantly undermine the "new normal" of sharing, because we don’t typically share objects we consider private. The new norm then should be that data are "shareable" from the beginning.

In some organizations and domains data are already open and shareable, for example, data from the large observatories supported by the US government. Consequently, data producers in those organizations may have better data information literacy. For other "private" domains, particularly, in the social sciences, there are still a lot of barriers and fears. Would a public culture of data sharing mean peer review of data collection instruments? Survey respondents as co-owners of data? Data curators as independent decision-makers with regard to choosing and preparing data for public use? A lot of possibilities come with the idea of data sharing cultures.

Oct 10, 2013

Aug 5, 2013

Research Data Alliance (RDA) Plenary

Spreading the word about an interesting event/initiative I'm involved with:

Second Plenary of the Research Data Alliance, National Academy of Sciences, Washington, DC, September 16–18, 2013

The Research Data Alliance invites participants from all walks of the research/data world to join us at the National Academy of Sciences for our Second Plenary!

  • Great keynote speakers — including John Wilbanks of Sage Bionetworks and Carole Palmer, Univ. of Illinois
  • Representatives from supporting agencies like the US National Science Foundation, the US National Institute of Standards and Technology, the European Commission, and the Department of Innovation through the Australian National Data Service (ANDS)
  • People from organizations like ESIP, CODATA, Microsoft Research, and W3C
  • Opportunity to participate in a poster session
  • Working sessions, breakouts, working and interest groups and much more

More info at https://www.rd-alliance.org/future-events.

Jul 15, 2013

The Practice of Data Curation - Archive Journal issue

A little bit of self-promotion.

Archive Journal in its latest 360° section focuses on the practice of data curation.

As research and teaching produce ever-increasing amounts of data in analog and digital forms, what we do with that data is a question that librarians, archivists, scholars, teachers, and students must address. The four contributors discuss what “data curation” is and might become. We invite you to read through the responses by author or by question.

I was one of the contributing authors. It was a great pleasure and a challenge to write my responses.

Apr 25, 2013

Strategy for Civil Earth Observations - Data Management for Societal Benefit

The US National Science and Technology Council recently released a National Strategy for Civil Earth Observations. The goal of this strategy is to provide a framework for developing a more detailed plan that would enable "stable, continuous, and coordinated global Earth observation capabilities for the benefit of society."

The strategy establishes a way to evaluate Earth-observing systems and their information products around 12 societal benefit areas: agriculture and forestry, biodiversity, climate, disasters, ecosystems (terrestrial and freshwater), energy and mineral resources, human health, ocean and coastal resources, space weather, transportation, water resources, weather, and reference measurements. The production and dissemination of information products should be based on the following principles:

  • Full and open access
  • Timeliness
  • Non-discrimination
  • Minimum cost
  • Preservation
  • Information quality
  • Ease of use
Data management in federal agencies that are responsible for Earth science data is described in terms of three components of the data life cycle: planning and production, data management, and usage. The latter two components are the main focus of the data management strategy. The suggested activities for these are:

  • Data management
    • Data collection and processing - initial steps to store data and create usable data records.
    • Quality control - follow the principles of the “Quality Assurance Framework for Earth Observation” (QA4EO)
    • Documentation - basic information about the sensor systems, location and time available at the moment of data collection, etc.
    • Dissemination - data should be offered in formats that are known to work with a broad range of scientific or decision-support tools. Common vocabularies, semantics, and data models should be employed.
    • Cataloging - establishing formal standards-based catalog services, building thematic or agency-specific portals, enabling commercial search engines to index data holdings, and implementing emerging techniques such as feeds, self-advertising data, and casting.
    • Preservation and stewardship - guarantee the authenticity and quality of digital holdings over time.
    • Usage tracking - measuring whether the data are actually being used; to enable better usage tracking, data should be made available through application programming interfaces (APIs).
    • Final disposition - not all data and derived products must be archived; derived products that most users have access to may adequately replace raw data and processing algorithms.
  • Usage activities
    • Discovery - enabled by dissemination, cataloging and documentation activities.
    • Analysis - includes both a quick evaluation to assess the usefulness of a data set and actual scientific analysis.
    • Product generation - creating new products by averaging, combining, differencing, interpolating, or assimilating data.
    • User feedback - mechanisms to provide feedback to improve usability and resolve data-related issues.
    • Citation - different data products, e.g., classifications, model runs, data subsets, etc., need to be citable.
    • Tagging - identify a data set as relevant to some event, phenomenon, purpose, program, or agency without needing to modify the original metadata.
    • Gap analysis - the determination by users that more data are needed, which influences the requirements-gathering for new data life cycles.
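
The "citation" activity above is easy to make concrete. Below is a minimal sketch, in Python, of formatting a citable reference for a derived data product; the field names and all values (agency, title, and the DOI placeholder) are invented for illustration, and real repositories such as DataCite define their own required metadata:

```python
def format_data_citation(creator, year, title, version, repository, identifier):
    """Build a human-readable citation string for a derived data product."""
    return (f"{creator} ({year}). {title}, version {version}. "
            f"{repository}. {identifier}")

citation = format_data_citation(
    creator="NOAA",
    year=2013,
    title="Sea Surface Temperature, Monthly Mean Subset",
    version="2.1",
    repository="National Climatic Data Center",
    identifier="doi:10.XXXX/example",  # placeholder, not a real DOI
)
```

The point of such a template is that every derived product, e.g., a subset or a model run, gets its own version and identifier, so it can be cited independently of the raw data.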

Each activity raises a lot of questions and challenges. The activities of cataloging, usage tracking, final disposition, tagging and gap analysis are particularly interesting. They raise questions that are rarely addressed in the data management literature. Does anybody use data that are being shared? Do all the data need to be preserved? How can we avoid duplicates and unnecessary modifications of metadata if data are being re-used? To what extent do we need to serve immediate user interests versus the future possibilities for research?
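
The "product generation" activity in the list above can be sketched with a toy example: deriving a combined product by averaging and a change product by differencing two observation series. The numbers are invented and stand in for any gridded or time-series measurements:

```python
def average(a, b):
    """Combine two series into a mean (climatology-style) product."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def difference(a, b):
    """Derive a change (anomaly-style) product by differencing two series."""
    return [x - y for x, y in zip(a, b)]

temps_2011 = [14.2, 14.8, 15.1]  # invented observation values
temps_2012 = [14.6, 15.0, 15.5]

combined = average(temps_2011, temps_2012)   # a new combined product
change = difference(temps_2012, temps_2011)  # a new change product
```

Each of these derived lists is, in the strategy's terms, a new information product that itself needs documentation and citation.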

Apr 19, 2013

NIH report: Big data recommendations based on small data?

I've been browsing slides from the last BRDI Symposium, "Finding the Needle in the Haystack: A Symposium on Strategies for Discovering Research Data Online", and found a report for the National Institutes of Health about the management and analysis of large biomedical research data (pdf available here).

It is an interesting report that provides a lot of detail about data and technologies in biomedical research, as well as about existing efforts in data sharing. The recommendations make sense, as they follow the usual pattern of recommendations with regard to research data - more money, more policy, more training:

  • Promote data sharing by establishing a minimal metadata framework for data sharing, creating catalogs and tools and enhancing data sharing policy for NIH-funded research.
  • Support the development and dissemination of informatics methods and applications by funding software development.
  • Train the workforce in quantitative sciences by funding quantitative training initiatives and enhancing review expertise in quantitative methods of bioinformatics and biostatistics.
Even more interesting is what evidence is provided to support these recommendations. The report is based on a relatively small literature corpus (~25 citations plus footnotes) and on the analysis of comments that were solicited via an NIH request for information on the management, integration, and analysis of large biomedical datasets. Overall, 50 respondents replied and made 244 suggestions. Is that enough data to make recommendations for the NIH? If we begin with the assumption that more support for large datasets and biomedical computation is needed (which seems to be the case with this report), then there is almost no need to analyze the costs and benefits of data sharing, the role of large datasets in solving biomedical problems, and so on.

Aug 23, 2012

Metadata webinar

Notes from the NISO / DCMI webinar "Metadata for managing scientific research data".

General impression: it seems that people who research metadata (and larger information/knowledge organization issues) are so deep into their domains that they assume everybody else knows nothing about data and metadata. Perhaps the audience of this webinar did consist largely of people unaware of anything related to this topic, and that is why the first half hour was spent on pretty simple and uninformative questions of "what is data-metadata-science".

I have heard such conversations so many times without any progress that I have begun to think we should just skip them and move on. No agreed-upon definition can ever be provided for any more or less complex concept. And still, talking about metadata as "data about data" is almost embarrassing. It is better to emphasize that having a shared description of data, e.g., who created them, where they came from, what they are about, etc., helps to produce good and verifiable research and to (re)use the data in the future.
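
To make that point concrete, here is a minimal sketch of treating metadata as a shared description that must travel with the data. The set of required fields is my own choice for illustration, not any standard:

```python
# Required descriptive fields; an illustrative choice, not a formal standard.
REQUIRED_FIELDS = {"creator", "source", "description", "date_collected"}

def missing_description(metadata):
    """Return the required descriptive fields that a record lacks."""
    return REQUIRED_FIELDS - metadata.keys()

record = {
    "creator": "Smith, J.",                             # invented example
    "source": "2011 county-level household survey",
    "description": "Responses on household energy use",
}
gaps = missing_description(record)  # fields still needed before sharing
```

A check like this, run before deposit, is a small way to make "who created them, where they come from, what they are about" a routine requirement rather than an afterthought.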

As for how to create metadata, it seems that this still needs to be figured out and systematized, so researchers and librarians are on their own. The metadata world is messy. Possible criteria for the selection and evaluation of metadata schemes (adapted from the Public Broadcasting Metadata Dictionary Project) include:

  1. Objectives/principles, such as interoperability, specific needs, expertise required
  2. Domains (genre focus, format variation)
  3. Architectural layout (flat, hierarchical, granular, etc.)

And below are some common schemes arranged by level of complexity:

  • Simple (interoperable, easy to generate, multidisciplinary, flat, 15-25 properties): Dublin Core, MARC, DataCite
  • Moderate (requires some expertise, more domain focused, extensible via connecting to other schemes): Darwin Core, Access to Biological Collections Data (ABCD), Ecological Metadata Language
  • Complex (requires domain expertise, hierarchical, many properties): FGDC Content Standard for Digital Geospatial Metadata, Data Documentation Initiative (DDI)

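As a small illustration of the "simple" end of this spectrum, here is a sketch that builds a flat Dublin Core-style record with Python's standard library. The element names follow the Dublin Core element set and its real namespace URI, but the dataset and all values are invented:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

def make_dc_record(fields):
    """Build a flat XML record using Dublin Core element names."""
    root = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

xml_record = make_dc_record({
    "title": "Lake temperature observations, 2010-2012",  # invented dataset
    "creator": "Example Research Group",
    "date": "2012-08-01",
    "format": "text/csv",
})
```

The flat structure, a handful of properties with no nesting, is exactly what makes such simple schemes easy to generate and interoperable across disciplines.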
A few interesting questions and challenges remain: how to integrate metadata creation into social settings and workflows, how to automate metadata generation, and how to expose metadata as linked data.

Aug 7, 2012

(Cyber)infrastructures

Thoughts based upon the readings about infrastructure, especially “Understanding Infrastructures: Dynamics, Tensions, and Designs,” a great report by P. Edwards, S. Jackson, G. Bowker, and C. Knobel.

Development of (cyber)infrastructures is not merely a technical or engineering issue. To ensure success, we need to be aware of the historical context and socio-political issues, as well as the messiness of everyday practices.

Historical (dis)continuities underlie many infrastructural projects. Cyberinfrastructures and data science / curation problems did not appear out of nowhere in the 20th century. They have historical precursors, such as:

  • information gathering activities by the state (statistics as science of state) and the development of sciences as accumulation of records
  • the development of technologies and organizational practices to sort, sift and store information

Questions of ownership, management, control, and access are always present in infrastructural developments. With regard to data, years of private ownership of data have led to many idiosyncratic practices and formats, which, along with an absence of metadata, prevent understanding and use by other scientists.

A good quote: “The consequence is that much “shared” data remains useless to others; the effort required for one group to understand another’s output, apply quality controls, and reformat it to fit a different purpose often exceeds that of generating a similar data set from scratch.” (p. 19 of the report)

Cyberinfrastructure development means system building. Successful system-builder teams are made up of technical “wizards,” who envision and create the system; a “maestro,” who orchestrates the organizational, financial, and marketing aspects of the system; and a “champion,” who stimulates interest in the project, promotes it, and generates adoption. During infrastructural growth, users and user communities can also become critical to success or failure.

The design-level perspective differs from the perspective on the ground. The former can be neat and organized, while the latter can be disorderly and require a lot of work. Finding ways to translate between these two perspectives, and to incorporate lessons learned from “below” into design from “above,” is a challenge and a crucial element of success.

A great quote: “It is also possible that a tech-centered approach to the challenge of data sharing inclines us toward failure from the beginning, because it leaves untouched underlying questions of incentives, organization, and culture that have in fact always structured the nature and viability of distributed scientific work.” (p. 32 of the report)

Additional reading – my other post about general issues in data curation.

Jul 30, 2012

Digital science ecosystem

From the GRDI2020 Final roadmap report: Global scientific data infrastructures: The big data challenges (pdf):

Data - any digitally encoded information, including data from instruments and simulations; results from previous research; material produced by publishing, broadcasting and entertainment; digitized representations of diverse collections of objects, e.g. of museums’ curated objects.

Research Data Infrastructures - managed networked environments (services and tools) that support the whole research cycle and the movement of data and information across domains and agencies.

An ecosystem metaphor is used to conceptualize the science universe and its processes. A digital science ecosystem is composed of:

  • Digital Data Libraries that are designed to ensure the long-term stewardship and provision of quality-assessed data and data services.
  • Digital Data Archives that consist of older data that is still important and necessary for future reference, as well as data that must be retained for regulatory compliance.
  • Digital Research Libraries as a collection of electronic documents.
  • Communities of Research as communities organized around disciplines, methodologies, model systems, project types, research topics, technologies, theories, etc.

While I can see how the metaphor of an ecosystem can be beneficial in conceptualizing the science universe, I don’t think it was developed enough here. The whole report is structured around tools, and infrastructure is understood rather narrowly. It seems that the biggest roadblocks are in the domain of human interactions: all those issues of social hierarchies and capital built into our social institutions.

Paul Edwards (one of the authors of another reading that seemed more sophisticated to me) wrote about this in his book “A Vast Machine,” about the infrastructure surrounding weather forecasting and climate change. He describes how the many efforts of various social actors facilitated the creation and inversion of infrastructure by constantly questioning data, models, and prognoses. Here is a long quote from the concluding chapter of that book to demonstrate the emphasis on people and the making of data-knowledge-infrastructure (emphasis mine):

“Beyond the obvious partisan motives for stoking controversy, beyond disinformation and the (very real) “war on science,” these debates regenerate for a more fundamental reason. In climate science you are stuck with the data you already have: numbers collected decades or even centuries ago. The men and women who gathered those numbers are gone forever. Their memories are dust. Yet you want to learn new things from what they left behind, and you want the maximum possible precision. You face not only data friction (the struggle to assemble records scattered across the world) but also metadata friction (the labor of recovering data’s context of creation, restoring the memory of how those numbers were made). The climate knowledge infrastructure never disappears from view, because it functions by infrastructural inversion: continual self-interrogation, examining and reexamining its own past. The black box of climate history is never closed. Scientists are always opening it up again, rummaging around in there to find out more about how old numbers were made. New metadata beget new data models; those data models, in turn, generate new pictures of the past.” (P. N. Edwards, “A Vast Machine”, p. 432)

Why should we trust climate change science and its infrastructures? Because of a “vast machine” built by a large community of researchers who constantly try to invert it. So in order to understand, develop, and advance data-intensive environments, we shouldn’t treat social forces as external. They are a part, if not the foundation, of the data universe. I’d propose to give equal emphasis to tools (storage, transfer, and sharing tools) and to social arrangements (individuals, institutions, political contexts, events, and so on) as elements of the ecosystem.

Jun 21, 2012

Big data landscape

A nice visualization of the big data landscape: http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/. A good intro to the "whats" and "whos" of big data. Since I'm not an insider but rather a curious outsider to this area, I'd be interested in seeing a more elaborate chart.

Feb 17, 2011

Data and the social sciences

Science magazine has a special issue on data. The article "Ensuring the Data-Rich Future of the Social Sciences" (pay-walled) has some suggestions for how to take advantage of the huge amounts of data expected in the future, facilitate sharing, and at the same time protect privacy.

  1. Promote data visibility and credit its original author, but archive the data professionally using formalized standards.
  2. Nurture replication and encourage sharing by making it a norm or a requirement.
  3. Develop privacy-enhanced data sharing protocols and allow researchers to work with sensitive data in a connected but digitally secure environment (similar to corporations, governments, etc.)
  4. Build a common, open-source, collaborative infrastructure that makes data analysis and sharing easy within and across disciplines.
  5. Develop legal standards.

The abundance of data and their preservation are serious concerns for the social sciences of the future. But by shifting from wisdom to knowledge to information and now to data, are we moving forward or backward? The question is whether improved standards, techniques, and policies for data preservation and sharing ultimately improve our knowledge of the world and our collective wisdom...