Apr 19, 2013

NIH report: Big data recommendations based on small data?

I've been browsing slides from the last BRDI Symposium, "Finding the Needle in the Haystack: A Symposium on Strategies for Discovering Research Data Online", and found a report for the National Institutes of Health about the management and analysis of large biomedical research data (pdf available here).

It is an interesting report that provides a lot of detail about data and technologies in biomedical research, as well as about existing data-sharing efforts. The recommendations make sense, as they echo most recommendations made about research data - more money, more policy, more training:

  • Promote data sharing by establishing a minimal metadata framework, creating catalogs and tools, and enhancing data-sharing policy for NIH-funded research.
  • Support the development and dissemination of informatics methods and applications by funding software development.
  • Train the workforce in quantitative sciences by funding quantitative training initiatives and by enhancing review expertise in the quantitative methods of bioinformatics and biostatistics.

Even more interesting is the evidence provided to support these recommendations. The report is based on a relatively small literature corpus (~25 citations plus footnotes) and on an analysis of comments solicited via an NIH request for information on the management, integration, and analysis of large biomedical datasets. Overall, 50 respondents replied, making 244 suggestions. Is that enough data on which to base recommendations for the NIH? If we begin with the assumption that more support for large datasets and biomedical computation is needed (which seems to be the case with this report), then there is almost no need to analyze the costs and benefits of data sharing, the role of large datasets in solving biomedical problems, and so on.