Apr 18, 2016

Dataset on Parkinson's disease

In March 2016, Sage Bionetworks released a dataset that captures the everyday experiences of over 9,500 people with Parkinson's disease (press release). The data, described in the data paper "The mPower study, Parkinson disease mobile data collected using ResearchKit", were collected via the mPower iPhone app, which presented participants with tasks (referred to as ‘memory’, ‘tapping’, ‘voice’, and ‘walking’ activities) and asked them to fill out surveys.

Not everybody agreed to share their data broadly with the research community. Out of 14,684 verified participants, 9,520 (65%) agreed to share broadly; the rest either withdrew from the study or agreed to share narrowly with the study team only:

Study cohort description
Figure 1: mPower study cohort description. From http://www.nature.com/articles/sdata201611#methods
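
As a quick sanity check on the rounded percentage, here is a minimal sketch that recomputes it from the two counts quoted above. The variable names are my own and are not taken from the paper.

```python
# Counts quoted in the post; variable names are illustrative only.
verified_participants = 14_684
shared_broadly = 9_520

# Share of verified participants who agreed to broad sharing.
broad_share = shared_broadly / verified_participants

# Everyone else either withdrew or agreed to narrow sharing only.
not_shared_broadly = verified_participants - shared_broadly

print(f"{broad_share:.1%} agreed to share broadly")
print(f"{not_shared_broadly:,} withdrew or shared narrowly")
```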

To provide proper safeguards and to balance sharing and privacy, the research team established a data governance structure. Access is granted to qualified researchers who agree to specific conditions for use, including the following:

  • participants must not be re-identified
  • the data may not be redistributed
  • findings need to be published in open access venues
  • both the participants and the research team need to be acknowledged as data contributors
This effort is another example of the newly forming data sharing culture. It uses Synapse, which seems to make sharing easier from both technical and policy perspectives.

Apr 10, 2016

Big data analytics overview

The paper "Beyond the hype: Big data concepts, methods, and analytics" (2015, International Journal of Information Management, Vol. 35, No. 2, pp. 137–144) reviews definitions and analytics techniques of big data and discusses some future developments. The article begins with a chart showing an explosion of publications in the ProQuest database, which is quite similar to the chart in our JASIST publication "Big data, bigger dilemmas". Both charts show that 2013 was the year when the term "big data" gained popularity:
"Beyond the hype ..."
"Big data, bigger dilemmas..."
The paper cites Diebold's paper "A personal perspective on the origin(s) and development of “big data”: The phenomenon, the term, and the discipline" to describe the origin of the term "big data":
"... the term “big data” … probably originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-1990s, in which John Mashey figured prominently".
After summarizing aspects of big data that have been discussed many times elsewhere (volume, velocity, variety, veracity, etc.), the article provides a useful summary of the types of analytics that are common in big data research:
  1. Text analytics
    • Information extraction
      • Entity recognition
      • Relation extraction
  2. Text summarization
    • Extractive (location and frequency of text units)
    • Abstractive (semantic information)
  3. Question answering
  4. Audio (speech) analytics
    • Transcript-based approach (large-vocabulary continuous speech recognition, LVCSR)
    • Phonetic-based approach
  5. Video analytics
  6. Social media analytics
    • Content-based analytics
    • Structure-based analytics
      • Community detection
      • Social influence analysis
      • Link prediction
  7. Predictive analytics
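
To make the "extractive" item in the list above a bit more concrete, here is a minimal sketch of a summarizer that scores sentences by word frequency and by their location in the text. The scoring formula, the function name `extractive_summary`, and the sample text are my own assumptions for illustration; this is not the method described in the paper.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Pick the highest-scoring sentences and return them in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(index):
        tokens = re.findall(r"[a-z']+", sentences[index].lower())
        if not tokens:
            return 0.0
        # Frequency part: average corpus frequency of the sentence's words.
        frequency_score = sum(freq[t] for t in tokens) / len(tokens)
        # Location part: earlier sentences get a larger bonus (an assumption).
        location_bonus = 1.0 / (index + 1)
        return frequency_score + location_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


sample_text = "Big data is big. Cats sleep. Big data grows."
summary = extractive_summary(sample_text, n_sentences=2)
print(summary)
```

An abstractive summarizer, by contrast, would generate new sentences from a semantic representation of the text rather than select existing ones.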
In conclusion, the paper argues for new techniques that address such issues as the irrelevance of statistical significance, heterogeneity, and computational efficiency in big data.

Mar 3, 2016

Predatory journal invitation

A few days ago I received an invitation to join the editorial board of a journal:

Dear Dr. Inna Kouper,
Wishes from The Scientific Pages!
We are glad to announce the successful launch of The Scientific Pages of Information Science. It is my great pleasure inviting you to join our editorial board.
You have been invited because of your contribution and recognized works in this field. Upon acceptance, we request you to send your recent photograph, CV, short biography and research interests. The details are requested in order to create your profile page in our journal.
The grammar and style of the invitation were so off that there was no doubt this was some kind of predatory publishing. Nevertheless, I did some searching, and here is the result:
"New Open-Access Publisher Launches with 65 Unneeded Journals"
The publisher is currently spamming for editorial board members ...
At least this one was easy to spot. What if they get better and become harder to differentiate?

Jan 11, 2016

Pantheon 1.0: A manually verified dataset of globally famous biographies

Scientific Data has published a description of an interesting dataset: "Pantheon 1.0, a manually verified dataset of globally famous biographies". This data collection effort contributes quantitative data for studying history, especially information about famous people and events.

Data collection workflow
Workflow diagram (Image from the paper)
The authors retrieved over 2 million records about famous ("globally known") individuals from Google's Freebase, narrowed the dataset down to individuals who have metadata in English Wikipedia, and then reduced it further to people who have records in more than 25 different language editions of Wikipedia.
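
The two narrowing steps can be sketched as a simple filter. This is only an illustration of the selection logic as described in this post: the `Person` record, its field names, and the sample entries are invented, and the authors' actual pipeline is not shown in the paper excerpt quoted here.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record layout; not the dataset's real schema.
    name: str
    in_english_wikipedia: bool
    language_editions: int


# The post says "more than 25" languages, so the comparison is strict.
MIN_LANGUAGES = 25


def select_globally_known(people):
    """Apply the two filters in the order described in the post."""
    in_english = [p for p in people if p.in_english_wikipedia]
    return [p for p in in_english if p.language_editions > MIN_LANGUAGES]


# Made-up sample records, for illustration only.
sample = [
    Person("Person A", True, 120),
    Person("Person B", True, 25),   # exactly 25: excluded
    Person("Person C", False, 40),  # not in English Wikipedia: excluded
    Person("Person D", True, 26),
]

selected_names = [p.name for p in select_globally_known(sample)]
print(selected_names)
```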

Manual cleaning and verification include a controlled vocabulary for occupations and popularity metrics (defined as the number of Wikipedia language editions, adjusted by age and pageviews).

The dataset is available for download at Harvard Dataverse http://dx.doi.org/10.7910/DVN/28201. Another entertaining part is a visualization interface at http://pantheon.media.mit.edu that lets you explore the data and answer questions like "Where were globally known individuals in Math born?" (21% in France) or "Who are the globally known people born within present day by country?". It turns out that Russia produced a lot of politicians and writers, while the US gave us many actors, singers and musicians.

Globally known people born in the US (from http://pantheon.media.mit.edu/treemap/country_exports/US/all/-4000/2010/H15/pantheon)

Oct 19, 2015

Making progress in data sharing

The blog post "Data Sharing: Access, Language, and Context All Matter" offers a few useful tips on making progress in data sharing:

  • To make the global data system less fragmented and disorganized, create data portals with good human-centered designs and support users with varying levels of expertise

  • JSON and XML are great, but humans read data too. These formats are critical to fueling innovation, but make sure CSVs are available as well

  • Responsible data use demands proper attention to metadata. Document datasets and don't ignore ReadMe files when reusing data
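
On the CSV tip above: offering a CSV alongside JSON can be cheap when the records are flat. Below is a minimal sketch using only the Python standard library. The function name `json_records_to_csv` and the sample records are my own, and it assumes the JSON is a list of flat objects; nested data would need flattening first.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so no field is dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval fills in missing fields; lineterminator keeps the output stable.
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample records, for illustration only.
example_json = '[{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima", "note": "n/a"}]'
csv_text = json_records_to_csv(example_json)
print(csv_text)
```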