Apr 21, 2017

March for Science - to march or not to march

Apparently, there is a big controversy with regard to March for Science (M4S) that will take place this Saturday April 22, 2017 in DC and in many other cities around the US.

The main stated goal of the march is to support publicly funded and publicly communicated science as a pillar of human freedom and prosperity. I was set on going because it seems that nowadays science needs support, and because, regardless of whether you believe in such a thing as objective truth-seeking (I have my doubts), scientists can and should be political in defending their institutions and their role in public life. But mostly I was set on going because we need to resist anti-intellectualism and assaults on reason. My own reasons more or less clear, I didn't pay much attention to the discussion around the march. And I bought a t-shirt, even though merchandising around protest movements seems out of place. Perhaps that's because this march is not a protest or social justice movement.

Many people feel strongly that the march is wrong: that they were excluded from planning and organizing it and, most importantly, that the march marginalizes non-white, non-male scientists and disregards diversity; that it is a microcosm of liberal racism and that the organizers pushed out those who argued for inclusiveness and intersectionality. The controversy is scattered across mass and social media, but to summarize: one side (the organizers) is seen as complicit in making the march a watered-down, non-political "celebration of science". The other side (the #MarginSci-ers) perceives the march as a social justice movement and wants the message of diversity (which applies to any context of American life) to be reinforced through this movement as well. An interesting analysis of the march's diversity discourse shows how the organizers shifted their position with regard to diversity, thereby conforming to existing stereotypes and the dominant discourse:
Unfortunately, through various miscommunications, including from the co-chairs and other key members of the MfS committee, the MfS audience has been primed to reinforce the established discourse about science. It took the better part of two months of constant lobbying and external pressure from minority scientists for the MfS organisers to finally reverse their stance. The fourth diversity statement finally states that science is political. At the same time, more recent media interviews that position diversity as a “distraction” undermine this stance.
In a sense, controversy is good. It highlights gaps in a movement and could potentially help to develop a robust program and action plan. But what is this movement? Reading the history of its organization, the march seems more like a top-down attempt to organize and contain than a grass-roots protest and demand for change. It is being run professionally, with attempts to control the message and the goals. Is "celebration" enough to ensure change? Do I need to celebrate science, or to improve the mutual relationship between science and society? Are we mobilizing only because we want public funding and therefore need to "educate" the public and policy-makers?

There is a high probability that, with goals of celebration, connection, understanding, and outreach, M4S will follow the #Occupy and Women's March movements: much enthusiasm and no action, due to the lack of a clear vision and strategies for change. A strong movement should have strong demands, which can then translate into specific legislation and policies. For example,

  • Equal pay and opportunities in science and research
  • Strong science education across all states
  • Protections for whistle-blowers and government scientists from political repressions
  • No marketization of science and education
  • Exposing and dismantling the military-industrial-scientific complex



Feb 21, 2017

U.S. House, Indiana District 9 General Election 2010-2016 Visualization

This is my first attempt to create a choropleth map - a map that visualizes measurements by shading geographic regions. I used election results data from in.gov for the general elections of the US House representative from Indiana's 9th Congressional District in 2010, 2012, 2014, and 2016. The maps below show the percent of people who voted for the Democratic Party candidates (Baron P. Hill in 2010, Shelli Yoder in 2012 and 2016, and William Bailey in 2014).



The process was tedious, but straightforward:

  • Find data and get it into appropriate format (some manual copying from PDF was needed)
  • Calculate statistics needed for mapping (here percent voting for Democrats within county)
  • Get geographic (shapefile) data
  • Combine stats and geographic data
  • Generate choropleth map
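The original workflow used R and the tmap package; as an illustration of the middle step (calculating the percent voting for Democrats within each county), here is a minimal, hypothetical Python sketch. The vote tallies below are made up and stand in for the real in.gov data:

```python
# Hypothetical per-county vote tallies (county -> {party: votes}).
# These numbers are illustrative only; the real data came from in.gov.
results_2016 = {
    "Monroe": {"D": 31000, "R": 24000},
    "Morgan": {"D": 7000,  "R": 21000},
    "Orange": {"D": 2800,  "R": 5200},
}

def percent_democrat(county_results):
    """Percent of the county's total vote cast for the Democratic candidate."""
    total = sum(county_results.values())
    return round(100.0 * county_results["D"] / total, 1)

# One value per county: this is what would shade each region on the map.
pct_dem = {county: percent_democrat(r) for county, r in results_2016.items()}
```

In the actual pipeline, `pct_dem` would then be joined to the county shapefile attributes before the map is drawn.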

This tutorial on creating maps with R and this vignette about the tmap package were very helpful.

A quick analysis of the District 9 US House elections over time shows that some counties (e.g., Monroe County) consistently vote strongly for Democrats, while others (e.g., Morgan and Orange Counties) vote for them much more weakly. In 2012, though, Orange, Washington, Harrison, and some other counties suddenly had nearly half of their voters voting for Democrats. The turnout in 2012 was higher than in 2010, but comparable to 2016, when Shelli Yoder was the candidate again. 2012 was also the year when only two candidates ran, so maybe we need to look at the other candidates and how they split the vote. More data and more analysis are needed.

Feb 13, 2017

Data quality - a short overview


A short overview of data quality definitions and challenges in support of the Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality" and I was part of preparing daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website:

  • Defining Data Quality
  • Documenting, Describing, Defining
  • Good Data Examples
  • Finding the Right Data
  • Rescuing Unloved Data

Data quality is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is data that can be used for the task at hand, even if it has some issues (e.g., missing data, poor metadata, value inconsistencies, etc.). Data that has errors, is hard to retrieve or understand, or has no context or traces of where it came from is generally considered bad.

Numerous attempts to define data quality over the last few decades have relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of data quality is recognized in business and commercial data warehousing (Fan and Geerts, 2012; Redman, 2001), in government operations (Information Quality Act, 2001), and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate the quality of information, including decision, measurement, test, and estimation theories (Altman, 2012).

Attempts to develop discipline-independent frameworks have resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Kahn, Strong and Wang, 2002), as syntactic, semantic, and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts, no discipline-independent data quality framework has been widely adopted, and more frameworks continue to appear. Several systematic syntheses have compared the existing frameworks, only to underscore the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009).

Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):

  • how to assess quality
  • how to manage quality
  • what impact quality has on organizations

The multitude of definitions, frameworks, and contexts in which data quality is used demonstrates that making data quality a useful paradigm is a persisting challenge - one that can benefit from establishing a dynamic network of researchers and practitioners in the area of data quality and from developing a framework that is general and yet flexible enough to accommodate highly specific attributes and measurements from particular domains.

(Image: data quality attributes, from "3 Reasons Why Data Quality Should Be Your Top Priority This Year")
Each dimension of data quality - such as completeness, accuracy, timeliness, or consistency - poses its own challenges.

Completeness, for example, is the extent to which data is not missing and is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that the representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
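As a rough illustration, completeness can be scored as the fraction of non-missing values per field. The records below are hypothetical, and this naive score is only a sketch of one possible measurement:

```python
# Hypothetical survey records; None marks a missing (non-response) value.
records = [
    {"age": 34,   "income": 52000, "county": "Monroe"},
    {"age": None, "income": None,  "county": "Brown"},
    {"age": 29,   "income": None,  "county": None},
    {"age": 45,   "income": 48000, "county": "Orange"},
]

def completeness(rows, field):
    """Fraction of rows in which `field` has a value (a naive completeness score)."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)

# Score each field; a low score flags a column whose missingness
# may bias any inference drawn from the data (income is 50% here).
scores = {f: completeness(records, f) for f in ("age", "income", "county")}
```

A real assessment would go further - distinguishing, say, values missing at random from values missing systematically - but even this simple score makes the dimension concrete.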

Data goes through many stages during the research lifecycle, from collection / acquisition to transformation and modeling to publication, and each stage creates additional challenges for maintaining the integrity and quality of the data. In one of the most recent attempts to discredit climate change studies, for example, the authors of the study were blamed for not following the NOAA Climate Data Record policies, which maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This brings out possibilities for further studies:
  • How does non-compliance with policies undermine the quality of data?
  • What role does scientific community consensus play in establishing the quality of data?
  • Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized? 

Another aspect of data quality that complicates a formalized treatment of its dimensions is that data is often heterogeneous and can be applied in varied contexts. As noted above, data quality frameworks and approaches are being developed in business, government, and research contexts, and quality solutions have to consider structured, semi-structured, and unstructured data and their combinations. Most previous data quality research has focused on structured or semi-structured data. Additionally, the spatial, temporal, and volume dimensions of data contribute to quality assessment and management.

Madnick et al. (2009) identify three approaches to possible solutions to data quality: technical or database approach, computer science / information technology (IT) approach, and digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking and probabilistic modeling. Computer / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes paying attention to metadata, long-term preservation, and provenance.

Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility shared between data providers, data curators, and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage, and metadata. Maintaining current and consistent metadata across copies and systems also benefits those who intend to re-use the data. Data and software documentation is another aspect of data quality that cannot be solved technically and needs a combination of organizational and information science solutions.


Sep 16, 2016

Data for humanitarian purposes

Unni Karunakara, a former president of "Doctors without borders", gave a talk at International Data Week 2016 on September 13 about the role of data in humanitarian organizations. The talk was powerful in its simplicity and in its urgent call for better data and better data management and dissemination. It was a story of human suffering, but also a story of care and integrity in using data to alleviate it.

Humanitarian action can be defined as moral activity grounded in the ethics of assistance to those in need. Four principles guide humanitarian action:
  • humanity (respect for the human)
  • impartiality (provide assistance based on a person's need, not politics or religion)
  • neutrality (tell the truth regardless of interests)
  • independence (work independently from governments, businesses, or other agencies)
These principles shape how data is collected and used and how to ensure that data helps. Data collected for humanitarian action is evidence that can be used for direct medical action and for bearing witness, which is a very important activity of humanitarian organizations:
“We are not sure that words can always save lives, but we know that silence can certainly kill." (quoted from another MSF president)
Awareness of the serious consequences of data for humanitarian action makes "Doctors without borders" work only with data they collect themselves and use only stories they witnessed firsthand. Restraint and integrity in data collection are crucial to maintaining the credibility of the organization.

Lack of data, or lack of mechanisms to deliver necessary data, hurts people. During the Ebola outbreak, it took the World Health Organization about 8 months to declare an emergency, and 3000 people died because data was not available in time or in the right form. The Infectious Diseases Data Observatory (IDDO) was created to help with tracking and researching infectious diseases by sharing data, but many ethical, legal, and other issues still need to be solved.

Humanitarian organizations often do not have trustworthy data available, either because of competing definitions or because of a lack of data collection systems. For example, because of differences in defining "civilian casualty", estimates of the number of civilians killed in drone strikes range from a hundred to thousands. And in developing countries or conflict zones, where census activities are absent or dangerous, counting graves or tents becomes a proxy for mortality, mobility rates, and other important indicators. Crude estimates are then the only available evidence.

"Doctors without borders" (MSF) does a lot to share and disseminate its information. It has an open data / access policy and aspires to share data, while placing a high value on the security and well-being of the people it helps.


Sep 2, 2016

Workshop: Data Quality in the Era of Big Data

The center where I work organizes a workshop of possible interest to many who work with data. Scholarships are available.

Data Quality in the Era of Big Data
Bloomington, Indiana
28-29 September 2016


Throughout the history of modern scholarship, the exchange of scholarly data was undertaken through personal interactions among scholars or through highly curated data archives. In either case, implicit or explicit provenance mechanisms gave a relatively high degree of assurance of the quality of the data. However, the ubiquity of the web and mobile digital culture has produced disruptive new forms of data. We need to ask ourselves what we know about these data and what we can trust. Failure to answer these questions endangers the integrity of the science produced from these data.

The workshop will examine questions of quality:
  • Citizen science data
  • Health records
  • Integrity
  • Completeness; boundary conditions
  • Instrument quality
  • Data trustworthiness
  • Data provenance
  • Trust in data publishing

The 2-day workshop begins with a half day of tutorials. The main workshop begins in the early afternoon on 28 September and continues until noon on 29 September. With sufficient interest, there may be another training session following the noon conclusion of the main workshop on 29 September.

Early Career Travel Funds:
Travel funds are available for early career researchers, scholars, and practitioners: http://d2i.indiana.edu/mbdh/#scholarships

Important Dates:
  • Workshop: Sep 28-29, 2016
  • Deadline for requesting early career travel funds: Sep 9, 2016, midnight EDT
  • Notification of travel funding: Sep 13, 2016
  • Registration deadline: Sep 19, 2016
Organizing Committee:
General Chair: Beth Plale, Indiana University

Program Committee
Carl Lagoze, University of Michigan, chair
Devan Donaldson, Indiana University
H.V. Jagadish, University of Michigan
Xiaozhong Liu, Indiana University
Jill Minor, Indiana University
Val Pentchev, Indiana University
Hridesh Rajan, Iowa State University

Early Career Chairs
Devan Donaldson, Indiana University 
Xiaozhong Liu, Indiana University

Local Arrangements Chair
Jill Minor, Indiana University