Feb 21, 2017

U.S. House, Indiana District 9 General Election 2010-2016 Visualization

This is my first attempt to create a choropleth map - a map that visualizes measurements by shading geographic regions. I used election results data from in.gov for the general elections of the U.S. House representative from Indiana Congressional District 9 in 2010, 2012, 2014, and 2016. The maps below show the percentage of voters in each county who voted for the Democratic Party candidates (Baron P Hill in 2010, Shelli Yoder in 2012 and 2016, and William Bailey in 2014).

The process was tedious, but straightforward:

  • Find data and get it into appropriate format (some manual copying from PDF was needed)
  • Calculate statistics needed for mapping (here percent voting for Democrats within county)
  • Get geographic (shapefile) data
  • Combine stats and geographic data
  • Generate choropleth map
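
The second and fourth steps can be sketched in a few lines. This is only an illustration in Python, not the R code used for the maps: the counts are made-up placeholders rather than in.gov results, and the names `percent_democratic`, `join_to_shapes`, and `pct_dem` are hypothetical.

```python
from collections import defaultdict

# Made-up placeholder counts, NOT actual in.gov results.
rows = [
    # (year, county, party, votes)
    (2012, "County A", "Democratic", 600),
    (2012, "County A", "Republican", 400),
    (2012, "County B", "Democratic", 250),
    (2012, "County B", "Republican", 700),
    (2012, "County B", "Libertarian", 50),
]


def percent_democratic(rows):
    """Return {(year, county): percent of all votes cast for the Democratic candidate}."""
    totals = defaultdict(int)
    dem = defaultdict(int)
    for year, county, party, votes in rows:
        totals[(year, county)] += votes
        if party == "Democratic":
            dem[(year, county)] += votes
    return {key: 100.0 * dem[key] / total for key, total in totals.items() if total > 0}


def join_to_shapes(stats, shapes, year):
    """Attach the statistic to each county shape record, matching on county name.

    Counties without election data get None, so gaps stay visible on the map.
    """
    return [dict(shape, pct_dem=stats.get((year, shape["county"]))) for shape in shapes]


# Stand-in for shapefile records; a real shapefile would carry actual geometries.
shapes = [
    {"county": "County A", "geometry": None},
    {"county": "County C", "geometry": None},
]

stats = percent_democratic(rows)
joined = join_to_shapes(stats, shapes, 2012)
```

Matching on county name is the fragile part of the join: spelling or capitalization differences between the election data and the shapefile silently produce missing values, so it is worth checking for `None` before mapping.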

This tutorial on creating maps with R and this vignette about the tmap package were very helpful.

A quick analysis of the District 9 US House elections over time shows that some counties (e.g., Monroe County) vote strongly for Democrats, while others (e.g., Morgan and Orange counties) give them much weaker support. In 2012, though, Orange, Washington, Harrison, and some other counties suddenly had nearly half of their voters voting for Democrats. The turnout in 2012 was higher than in 2010, but it was comparable to 2016, when Shelli Yoder was the candidate again. The year 2012 was also the only one in which just two candidates ran, so maybe we need to look at other candidates and how they take votes. More data and more analysis are needed.

Feb 13, 2017

Data quality - a short overview


A short overview of data quality definitions and challenges in support of the Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality" and I was part of preparing daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website: Defining Data Quality; Documenting, Describing, Defining; Good Data Examples; Finding the Right Data; Rescuing Unloved Data.

Data quality is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is data that can be used for the task at hand even if it has some issues (e.g., missing data, poor metadata, value inconsistencies, etc.). Data that has errors, is hard to retrieve or understand, or has no context or traces of where it came from is generally considered bad.

Numerous attempts to define data quality over the last few decades relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of data quality is recognized in business and commercial data warehousing (Fan and Geerts, 2012; Redman, 2001), in government operations (Information Quality Act, 2001), and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate the quality of information, including decision, measurement, test, and estimation theories (Altman, 2012).

Attempts to develop discipline-independent frameworks resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Kahn, Strong and Wang, 2002), as syntactic, semantic and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts, discipline-independent data quality frameworks have not been widely adopted, and more frameworks continue to appear. Several systematic syntheses compared many existing frameworks only to point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009).

Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):

  • how to assess quality
  • how to manage quality
  • what impact quality has on organization
The multitude of definitions, frameworks, and contexts in which data quality is used demonstrates that making data quality a useful paradigm is a persistent challenge. Addressing it would benefit from establishing a dynamic network of researchers and practitioners in the area of data quality and from developing a framework that is general yet flexible enough to accommodate highly specific attributes and measurements from particular domains.

Each dimension of data quality, such as completeness, accuracy, timeliness, or consistency, creates its own challenges.

Completeness, for example, is the extent to which data is not missing or is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that the representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
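
One simple way to put a number on completeness is the share of records that have a non-missing value in each field. The sketch below is an illustration under that assumption, not a measure taken from the cited literature; the function name `completeness`, the sample records, and the choice to treat `None` and empty strings as missing are all hypothetical.

```python
def completeness(records, fields):
    """Return {field: share of records with a non-missing value}.

    A value counts as missing if it is None or an empty string (an assumption;
    real datasets often use other markers such as "NA" or -999).
    """
    n = len(records)
    result = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = present / n if n else 0.0
    return result


# Made-up sample records for illustration only.
records = [
    {"county": "A", "votes": 120},
    {"county": "B", "votes": None},
    {"county": "C", "votes": 95},
    {"county": "", "votes": 40},
]

scores = completeness(records, ["county", "votes"])
```

A ratio like this only covers the "not missing" half of the definition. Whether the data has sufficient breadth and depth for the task at hand still has to be judged against the task itself.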

As data goes through many stages during the research lifecycle, from its collection / acquisition to transformation and modeling to publication, each stage creates additional challenges for maintaining the integrity and quality of data. In one of the most recent attempts to discredit climate change studies, for example, the authors of the study were blamed for not following the NOAA Climate Data Record policies that maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This raises questions for further study:
  • How does non-compliance with policies undermine the quality of data?
  • What role does scientific community consensus play in establishing the quality of data?
  • Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized? 
Another aspect of data quality that complicates a formalized treatment of these dimensions is that data is often heterogeneous and can be applied in varied contexts. As pointed out above, data quality frameworks and approaches are being developed in business, government, and research contexts, and quality solutions have to consider structured, semi-structured, and unstructured data and their combinations. Most previous data quality research focused on structured or semi-structured data. Additionally, spatial, temporal, and volume dimensions of data contribute to quality assessment and management.

Madnick et al. (2009) identify three approaches to solving data quality problems: a technical or database approach, a computer science / information technology (IT) approach, and a digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking, and probabilistic modeling. Computer / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes paying attention to metadata, long-term preservation, and provenance.

Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility shared among data providers, data curators, and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage, and metadata. Maintaining current and consistent metadata across copies and systems also helps those who intend to re-use the data. Data and software documentation is another aspect of data quality that cannot be solved by technical means alone and needs a combination of organizational / information science solutions.


Sep 16, 2016

Data for humanitarian purposes

Unni Karunakara, a former president of "Doctors Without Borders", gave a talk at International Data Week 2016 on September 13 about the role of data in humanitarian organizations. The talk was very powerful in its simplicity and in conveying the urgent need for better data and for better management and dissemination of it. It was a story of human suffering, but also a story of care and integrity in using data to alleviate it.

Humanitarian action can be defined as a moral activity grounded in the ethics of assistance to those in need. Four principles guide humanitarian action:
  • humanity (respect for the human)
  • impartiality (provide assistance because of a person's need, not politics or religion)
  • neutrality (tell the truth regardless of interests)
  • independence (work independently from governments, businesses, or other agencies)
These principles affect how to collect and use data and how to ensure that data helps. Data collected for humanitarian action is evidence that can be used for direct medical action and for bearing witness, which is a very important activity of humanitarian organizations:  
"We are not sure that words can always save lives, but we know that silence can certainly kill." (quoted from another MSF president)
Awareness of the serious consequences of data for humanitarian action makes "Doctors Without Borders" work only with data they collect themselves and use only stories they witnessed firsthand. Restraint and integrity in data collection are crucial in maintaining the credibility of the organization.

Lack of data or lack of mechanisms to deliver necessary data hurts people. For example, during the Ebola outbreak it took the World Health Organization about 8 months to declare an emergency, and 3000 people died because data was not available in time or in the right form. The Infectious Diseases Data Observatory (IDDO) was created to help with tracking and researching infectious diseases by sharing data, but many ethical, legal, and other issues still need to be solved.

Humanitarian organizations often do not have trustworthy data available, either because of competing definitions or because of a lack of data collection systems. For example, because of differences in defining "civilian casualty", estimates of the number of civilians killed in drone strikes range from a hundred to thousands. Or, in developing countries or conflict zones where census activities are absent or dangerous, counting graves or tents becomes a proxy for mortality, mobility rates, and other important indicators. Crude estimates are then the only available evidence.

"Doctors Without Borders" (MSF) does a lot to share and disseminate its information. It has an open data / access policy and aspires to share data, while placing high value on the security and well-being of the people it helps.

Sep 2, 2016

Workshop: Data Quality in Era of Big Data

The center where I work is organizing a workshop that may interest many who work with data. Scholarships are available.

Data Quality in Era of Big Data
Bloomington, Indiana
28-29 September 2016

Throughout the history of modern scholarship, the exchange of scholarly data was undertaken through personal interactions among scholars or through highly curated data archives. In either case, implicit or explicit provenance mechanisms gave a relatively high degree of assurance of the quality of the data. However, the ubiquity of the web and mobile digital culture has produced disruptive new forms of data. We need to ask ourselves what we know about the data and what we can trust. Failure to answer these questions endangers the integrity of the science produced from these data.

The workshop will examine questions of quality:
  • Citizen science data
  • Health records
  • Integrity
  • Completeness; boundary conditions
  • Instrument quality
  • Data trustworthiness
  • Data provenance
  • Trust in data publishing

The 2-day workshop begins with a half day of tutorials. The main workshop begins in the early afternoon on 28 September and continues until noon on 29 September. With sufficient interest, there may be another training session after the main workshop concludes at noon on 29 September.

Early Career Travel Funds:
Travel funds are available for early career researchers, scholars, and practitioners http://d2i.indiana.edu/mbdh/#scholarships

Important Dates:
  • Workshop: Sep 28-29, 2016
  • Deadline for requesting early career travel funds: Sep 9, 2016 midnight EDT
  • Notification of travel funding: Sep 13, 2016
  • Registration deadline: Sep 19, 2016

Organizing Committee:
General Chair: Beth Plale, Indiana University

Program Committee
Carl Lagoze, University of Michigan, chair
Devan Donaldson, Indiana University
H.V. Jagadish, University of Michigan
Xiaozhong Liu, Indiana University
Jill Minor, Indiana University
Val Pentchev, Indiana University
Hridesh Rajan, Iowa State University

Early Career Chairs
Devan Donaldson, Indiana University 
Xiaozhong Liu, Indiana University

Local Arrangements Chair
Jill Minor, Indiana University

Aug 17, 2016

SynBERC, anthropological inquiry and methods of research

Recently I've been searching for guidance on how to describe ethnographic methodology in a grant proposal and found P. Rabinow and A. Stavrianakis' commentary Movement space: Putting anthropological theory, concepts, and cases to the test, where they reflect on the challenges of anthropological inquiry and on what it means to observe in heterogeneous and changing spaces. I had no time to read it slowly and carefully then, so I am filling this gap now.

The essay is a response to another collection of essays, but also a reflection on previous ethnographic research with the Synthetic Biology Engineering Research Center SynBERC (I wish I had paid more attention to it during my own dissertation research). An honest public account like that contributes to the ethics and methodology discussions more than any published "research" article.
Raising the question of "to what end" in anthropological inquiry, Rabinow and Stavrianakis' essay recollects previous collaborative participant-observations as attempts to bring the ethics that exists outside of the instrumental rationality of science into multidisciplinary research projects.

Flourishing is the concept they used to challenge and change the currently existing relations between knowledge and care (see Rabinow, Paul. "Prosperity, Amelioration, Flourishing: From a Logic of Practical Judgment to Reconstruction." Law and Literature 21, no. 3 (2009): 301-20, jstor). Flourishing helps to examine research practices from a holistic perspective, as practices that are performed by human beings without ethical compartmentalization into scientific, individual, and citizen values.

Then the discussion moves to temporality in anthropological research and the distinction between "contemporary" and "present" in ethnography. This distinction was hard for me to understand. Observations are made in the present, but somehow contextualization with experiences from the past (history) helps to challenge the "ethnographic present". Does it mean that something (objects or practices) may be present but not contemporary? Or that the contemporary may include the past? In other words, the distinction between present, contemporary, and modern allows us to stay attuned to constant changes and not to fix descriptions as existing in certain times only. Sometimes it seemed that "contemporary" referred to attempts to reconcile diverse or contradicting practices (e.g., the practices of observation and observers and the practices of the observed).

An interesting point was made on citation (again, as a response to someone else's point). It is about acknowledging more recent work on similar issues. While understanding that those are the rules of the game (esp. to get grants), Rabinow writes that excessive citation also constrains thinking and writing, and authorizes such practices. Why would someone go back to reading and citing Weber, Foucault, or Dewey? Not because what they said is still relevant and true (although some of it is), but because they paid so much attention to problem formation, to the need for conceptual tools, and to the importance of experimentation with form.

Back to SynBERC, it is striking how empty expressions of support from bioscientists and engineers masked indifference and, ultimately, a lack of respect and willingness to change. What made things worse was that social scientists' efforts to develop effective modes of governance and interaction were blocked and downgraded to non-action and "soothing public relations". Moreover, the social scientists themselves failed to coordinate and reflect on their complicity with dominating technoscientific norms and values.

Even though I really appreciated this account of anthropological "failure" (as seen by others, but not by the authors who conceptualize it as an "anthropological test"), there is a larger purpose in it. As the authors put it,
... it is time to thematize the new configurations of power relations in which anthropologists are working today. Critique as denunciation, still the dominant mode in anti-colonial narratives, is no longer sufficient for the complexities of contemporary inquiry. We are arguing for a more fine-grained acceptance of the fact that by refusing the binaries of inside and outside, one’s responsibility for one’s position in the field is made available for reflection and invention.
Anthropology's major task is to map heterogeneity of human and cultural forms, including:
  • cultural heterogeneity with an underlying generality (American anthropology)
  • heterogeneity within common institutional forms such as kinship and law (British anthropology)
  • variations in structural patterns of society and the mind (French anthropology)
However, accounts of heterogeneity have lost their force, in some ways losing their criteria of validity under the pressure of current norms of conducting research. At the same time, critical evaluation of such criteria is an important task in changing present times. Such evaluation can be done through testing - constant re-evaluation of the existing conceptual tools in the context of new situations and experiences. The rest of the detailed discussion on testing was dense and less relevant to me, so it was also harder to follow.

One of the take-aways is that anthropology needs to be a collaborative endeavor, where individual inquiries examine specific cases and then many inquirers create a common space of concepts, problems, and cases. The constant movement between specific cases and a topology of cases creates a space where anthropology can make justifiable, warrantable claims about more than one case, i.e., about heterogeneity and associated generality.