Jan 11, 2016

Pantheon 1.0: A manually verified dataset of globally famous biographies

Scientific Data has published a description of an interesting dataset: "Pantheon 1.0, a manually verified dataset of globally famous biographies". The collection adds to the quantitative data available for studying history, in particular data about famous people and events.

Data collection workflow
Workflow diagram (Image from the paper)
The authors retrieved over 2 million records about famous ("globally known") individuals from Google's Freebase, narrowed the dataset down to individuals who have metadata in the English Wikipedia, and then reduced it further to people who have records in more than 25 different language editions of Wikipedia.
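The narrowing steps above can be sketched as a simple two-stage filter. This is only an illustration of the idea, not the authors' actual pipeline: the `Person` class, its field names and the sample records are hypothetical, and only the "more than 25 languages" threshold comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record layout, invented for this sketch.
    name: str
    in_english_wikipedia: bool   # has metadata in the English Wikipedia
    wikipedia_languages: int     # number of Wikipedia language editions


# Threshold taken from the post: "more than 25 different languages".
MIN_LANGUAGES = 25


def filter_candidates(records):
    """Apply the two narrowing steps in the order described above."""
    with_english = [p for p in records if p.in_english_wikipedia]
    return [p for p in with_english if p.wikipedia_languages > MIN_LANGUAGES]


# Made-up sample records, not taken from the dataset.
sample = [
    Person("A", True, 40),    # passes both steps
    Person("B", True, 25),    # exactly 25 is not "more than 25"
    Person("C", False, 60),   # no English Wikipedia metadata
]

selected = filter_candidates(sample)
print([p.name for p in selected])  # prints ['A']
```

Note that the threshold is strict: a person present in exactly 25 language editions is dropped.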

Manual cleaning and verification included assigning occupations from a controlled vocabulary and computing popularity metrics (defined as the number of Wikipedia language editions, adjusted by age and pageviews).

The dataset is available for download at Harvard Dataverse: http://dx.doi.org/10.7910/DVN/28201. Another entertaining part is a visualization interface at http://pantheon.media.mit.edu that lets you explore the data and answer questions like "Where were globally known individuals in Math born?" (21% in France) or "Who are the globally known people born within present day by country?". It turns out that Russia produced a lot of politicians and writers, while the US gave us many actors, singers and musicians.

Globally known people born in the US (from http://pantheon.media.mit.edu/treemap/country_exports/US/all/-4000/2010/H15/pantheon)
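Questions such as "Where were globally known individuals in Math born?" boil down to counting birthplaces within one occupation and dividing by the total. Here is a minimal sketch of that calculation, assuming the downloaded data has been loaded as a list of dictionaries. The keys `occupation` and `country` and the toy rows are assumptions made for illustration; check the actual column names in the Dataverse files before adapting it.

```python
from collections import Counter


def birthplace_shares(rows, occupation):
    """Return {country: share of people} for one occupation."""
    countries = [r["country"] for r in rows if r["occupation"] == occupation]
    total = len(countries)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(countries).items()}


# Toy rows with placeholder country codes, not real Pantheon records.
rows = [
    {"occupation": "MATHEMATICIAN", "country": "AA"},
    {"occupation": "MATHEMATICIAN", "country": "AA"},
    {"occupation": "MATHEMATICIAN", "country": "BB"},
    {"occupation": "MATHEMATICIAN", "country": "CC"},
    {"occupation": "WRITER", "country": "AA"},
]

shares = birthplace_shares(rows, "MATHEMATICIAN")
for country, share in sorted(shares.items()):
    print(f"{country}: {share:.0%}")
```

With the real data, the same function applied to the math occupation should give a figure comparable to the 21% for France quoted above, provided the occupation labels match those used by the visualization.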