Jan 11, 2016

Pantheon 1.0: A manually verified dataset of globally famous biographies



Scientific Data has published a description of an interesting dataset: "Pantheon 1.0, a manually verified dataset of globally famous biographies". This data collection effort contributes quantitative data for studying history, especially information about famous people and events.


Data collection workflow diagram (image from the paper)
The authors retrieved over 2 million records about famous ("globally known") individuals from Google's Freebase, narrowed the dataset down to individuals who have an English Wikipedia article, and then reduced it further to people who have Wikipedia articles in more than 25 different languages.

Manual cleaning and verification included a controlled vocabulary for occupations and popularity metrics (defined as the number of Wikipedia edits, adjusted by age and pageviews).
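
A minimal sketch of this kind of filtering and scoring, assuming a hypothetical table of Freebase records with per-language Wikipedia coverage (the column names and the toy popularity formula below are illustrative, not the paper's actual schema or metric):

```python
import pandas as pd

# Hypothetical input: one row per Freebase person with Wikipedia coverage info.
# Columns are illustrative, not the actual Pantheon schema.
records = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "has_english_wikipedia": [True, True, False],
    "n_language_editions": [40, 12, 30],
    "wikipedia_edits": [5200, 300, 900],
    "article_age_years": [10, 3, 8],
    "pageviews": [1_200_000, 40_000, 250_000],
})

# Step 1: keep only people with an English Wikipedia article.
famous = records[records["has_english_wikipedia"]]

# Step 2: keep only people covered in more than 25 language editions.
famous = famous[famous["n_language_editions"] > 25]

# Step 3: a toy popularity score -- edits adjusted by article age and pageviews.
# (The paper's actual metric is more elaborate; this only mirrors the description above.)
famous = famous.assign(
    popularity=famous["wikipedia_edits"]
    / famous["article_age_years"]
    * famous["pageviews"].rank(pct=True)
)

print(famous[["name", "popularity"]].sort_values("popularity", ascending=False))
```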




The dataset is available for download at Harvard Dataverse http://dx.doi.org/10.7910/DVN/28201. Another entertaining part is the visualization interface at http://pantheon.media.mit.edu, which allows users to explore the data and answer questions like "Where were globally known individuals in Math born?" (21% in France) or "Who are the globally known people born within a present-day country?". It turns out that Russia produced many politicians and writers, while the US gave us many actors, singers, and musicians.

Globally known people born in the US (from http://pantheon.media.mit.edu/treemap/country_exports/US/all/-4000/2010/H15/pantheon)
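
The same kind of question can be asked of the downloadable table itself. A rough sketch with pandas, assuming a hypothetical local copy of the dataset (the file and column names are assumptions, not the exact Dataverse layout):

```python
import pandas as pd

# Hypothetical local copy of the Pantheon table downloaded from Dataverse;
# the file name and column names are assumptions, not the published layout.
people = pd.read_csv("pantheon.csv")

# Where were globally known mathematicians born?
mathematicians = people[people["occupation"] == "MATHEMATICIAN"]
by_country = (
    mathematicians["birth_country"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(by_country.head(10))  # share of mathematicians per birth country, in percent
```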


Oct 19, 2015

Making progress in data sharing

The blog post "Data Sharing: Access, Language, and Context All Matter" offers a few useful tips on making progress in data sharing:

  • To make the global data system less fragmented and disorganized, create data portals with good human-centered designs and support users with varying levels of expertise

  • JSON and XML are great, but humans read data too. These formats are critical to fueling innovation, but make sure CSVs are available as well (a minimal conversion sketch follows this list)

  • Responsible data use demands proper attention to metadata. Document datasets, and don't ignore README files when re-using them
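
A minimal sketch of the CSV point, assuming a hypothetical flat JSON array of records (nested structures would need flattening first; the file names are placeholders):

```python
import csv
import json

# Hypothetical input: a flat JSON array of record objects, e.g. exported from a data portal.
with open("records.json") as f:
    records = json.load(f)

# Write the same records as CSV so they can also be opened in a spreadsheet.
fieldnames = sorted({key for record in records for key in record})
with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
```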

Sep 17, 2015

Valuable lessons from sharing and non-sharing of data

A vivid story from BuzzFeed, "Scientists Are Hoarding Data And It’s Ruining Medical Research", describes two related cases: one where researchers voluntarily shared their entire dataset and a re-analysis found errors and miscalculations, and another where neither the data nor any results from one of the largest drug trials were released for 7 years because the researchers feared criticism and kept double-checking their data. The details of each case are worth following up, but the author of the story comes to an important conclusion: we need to accept that science works through checks and corrections, stop unfair criticism and doubts about researchers' credibility, and start sharing data for better science, better knowledge, and ultimately better-informed decisions that affect our lives:

And here is where I think the threads come together. The press releases on the reanalysis of the Miguel and Kremer deworming trial in Kenya will go live this week. Somewhere, I’m sure, people will attack or mock them for their errors. One way or another, I can’t believe they won’t feel bruised by the reanalysis. And that is where we have gone wrong. It’s not just naive to expect that all research will be perfectly free from errors, it’s actively harmful.


There is a replication crisis throughout research. When subjected to independent scrutiny, results routinely fail to stand up. We are starting to accept that there will always be glitches and flaws. Slowly, as a consequence, the culture of science is shifting beneath everyone’s feet to recognise this reality, work with it, and create structural changes or funding models to improve it.


Sep 15, 2015

Dataset: Roads and cities of 18th century France

An interesting dataset has been described in the Scientific Data journal and shared via the Harvard Dataverse repository: "Roads and cities of 18th century France".
The database presented here represents the road network at the french national level described in the historical map of Cassini in the 18th century. The digitization of this historical map is based on a collaborative methodology that we describe in detail. This dataset can be used for a variety of interdisciplinary studies, covering multiple spatial resolutions and ranging from history, geography, urban economics to network science.

The repository page showed 268 downloads as of Sept 15, 2015, so hopefully some examples of data re-use will follow this publication.
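
As one illustration of the network-science angle mentioned in the description, a minimal sketch that loads a road network as a graph, assuming a hypothetical edge list of city-to-city road segments (the file and column names are placeholders, not the actual files in the Dataverse deposit):

```python
import pandas as pd
import networkx as nx

# Hypothetical edge list of road segments between cities; file and column
# names are placeholders, not the actual Dataverse files.
segments = pd.read_csv("cassini_road_segments.csv")  # columns: city_a, city_b, length_km

graph = nx.Graph()
for row in segments.itertuples(index=False):
    graph.add_edge(row.city_a, row.city_b, length=row.length_km)

# A couple of simple network measures on the 18th-century road network.
print("cities:", graph.number_of_nodes(), "roads:", graph.number_of_edges())
print("most connected cities:",
      sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:5])
```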

Aug 31, 2015

Lessons from replication of research in psychology

Science magazine has published an article, “Estimating the reproducibility of psychological science”, which reports the first findings from 100 replications completed by 270 contributing authors. A quasi-random sample of studies was drawn from three psychology journals: Psychological Science (PSCI), Journal of Personality and Social Psychology (JPSP), and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC). The replications were performed by teams, independently reviewed by other researchers, and reproduced by another analyst. Reproducibility was evaluated using significance and P values, effect sizes, subjective assessments by the replication teams, and meta-analyses of effect sizes. Some highlights from the results (a rough sketch of two of the evaluation criteria follows the list):

  • 35 of the replications showed a statistically significant effect (p < 0.05), compared with 97 of the original studies

  • In 82 cases, the original study showed a stronger effect size than the replication

  • Effect size comparisons showed a 47.4% replication success rate

  • 39 studies were subjectively rated as successfully replicated
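
A rough sketch of two such criteria on made-up numbers (the figures below are purely illustrative, not the study's data): significance of the replication result, and one common way to operationalize the effect-size criterion, checking whether the original effect size falls inside the replication's 95% confidence interval.

```python
# Purely illustrative numbers -- not data from the actual study.
# Each entry: (replication p-value, original effect size,
#              replication 95% CI lower bound, upper bound).
studies = [
    (0.01, 0.45, 0.10, 0.50),
    (0.20, 0.60, -0.05, 0.30),
    (0.04, 0.35, 0.02, 0.40),
]

significant = sum(1 for p, _, _, _ in studies if p < 0.05)
effect_in_ci = sum(1 for _, orig, lo, hi in studies if lo <= orig <= hi)

print(f"replications with p < 0.05: {significant}/{len(studies)}")
print(f"original effect size inside replication 95% CI: {effect_in_ci}/{len(studies)}")
```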


While some news coverage of this publication reported it as a failed test (e.g., Nature’s "Over half of psychology studies fail reproducibility test"), the Science article emphasized the challenges of reproducibility itself and the care with which interpretations of successes and failures need to be made. The authors of the study pointed out that, while replications produced weaker evidence for the original findings,
“It is too easy to conclude that successful replication means that the theoretical understanding of the original finding is correct. Direct replication mainly provides evidence for the reliability of a result. If there are alternative explanations for the original finding, those alternatives could likewise account for the replication. Understanding is achieved through multiple, diverse investigations that provide converging support for a theoretical interpretation and rule out alternative explanations.

It is also too easy to conclude that a failure to replicate a result means that the original evidence was a false positive. Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect. We conducted replications designed to minimize a priori reasons to expect a different result by using original materials, engaging original authors for review of the designs, and conducting internal reviews. Nonetheless, unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes.”