tag:blogger.com,1999:blog-62190182477717816492024-03-26T23:38:22.573-07:00DIKW: Data, Information, Knowledge, WisdomDIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.comBlogger146125tag:blogger.com,1999:blog-6219018247771781649.post-30295148336405422182019-09-17T12:29:00.000-07:002019-09-17T12:29:42.723-07:00Field interviewers and survey selection bias<b>Notes from:</b> Stephanie Eckman, Achim Koch, Interviewer Involvement in Sample Selection Shapes the Relationship Between Response Rates and Data Quality, Public Opinion Quarterly, Volume 83, Issue 2, Summer 2019, Pages 313–337, <a href="https://doi.org/10.1093/poq/nfz012">https://doi.org/10.1093/poq/nfz012</a><br />
<br />
The paper examines the role of interviewers in sample selection and its impact on data quality, particularly in relation to response rates. The authors use the European Social Survey - a face-to-face survey of behavior and opinions in many EU countries. Selection methods range from countries' registries of individuals (no interviewer involvement), to household and address registries (where the sample may have more than one house and the interviewer selects a housing unit and a person to interview), to random walks, where the interviewer selects every k-th unit and conducts interviews.<br />
<br />
The total survey error framework associates the decrease in data quality with the following sources of error:<br />
- undercoverage (some persons have no chance to be selected)<br />
- nonresponse (not all selected persons participated)<br />
- sampling error<br />
- measurement error (the response given does not match the true value)<br />
<br />
This study focused on undercoverage and nonresponse as <b><i>selection bias</i></b> and used external (another, larger survey in the EU) and internal (difference from a 50% female sample) measures of this bias.<br />
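The internal measure described above can be sketched in a few lines. This is only an illustration of the "difference from 50% female" idea as summarized in these notes, not the authors' code; the function name `internal_bias` and the counts are hypothetical, and the z-score is an added convenience based on the standard error of a proportion under a true share of 50%.

```python
from math import sqrt


def internal_bias(n_female: int, n_total: int) -> dict:
    """Deviation of the observed female share from the 50% benchmark."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    share = n_female / n_total
    # Standard error of a proportion if the true share were exactly 0.5
    se = sqrt(0.25 / n_total)
    return {
        "female_share": share,
        "abs_deviation": abs(share - 0.5),
        "z": (share - 0.5) / se,
    }


# Hypothetical counts for illustration only; not taken from the paper
result = internal_bias(n_female=540, n_total=1000)
print(result)
```

A larger absolute deviation suggests more selection bias in the realized sample; the external measure would instead compare survey estimates against the larger benchmark survey.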
<br />
The results suggest that when interviewers are not involved in sample selection, response rates are unrelated to selection bias. However, when interviewers are involved in sample selection, the response rates are higher, but they're associated with more selection bias. The paper concludes:<br />
<blockquote>
The most important issue for researchers who rely on survey data is how we can prevent manipulation of selection by interviewers. We recommend using sampling methods that minimize interviewer selection, as far as possible. Improved training and supervision of interviewers could also reduce interference in the selection process. If interviewers did not feel pressured to achieve high response rates, they might allow the selection process to be fully random and selection bias would be smaller.</blockquote>
<br />
<br />
<br />
<br />DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-77113758702125983622018-03-29T08:37:00.004-07:002018-03-29T08:37:47.437-07:00Cybersecurity Curricular GuidelineA report released by the <a href="https://cybered.hosting.acm.org/wp/task-force-members/" target="_blank">Task Force on Cybersecurity Education</a> provides a comprehensive framework and guidelines for cybersecurity post-secondary education (<a href="https://www.acm.org/binaries/content/assets/education/curricula-recommendations/csec2017.pdf">pdf</a>). According to the presentation of one of the task force co-chairs, Diana Burley, it was a huge effort with many consultations, travel, and experts involved. And it went through the endorsement process with four major computing organizations: <a href="http://www.acm.org/" target="_blank">ACM</a>, <a href="https://www.computer.org/" target="_blank">IEEE</a>, Association for Information Systems Special Interest Group on Security (<a href="http://aisnet.org/group/SIGSEC" target="_blank">AIS SIGSEC</a>) and International Federation for Information Processing Technical Committee on Information Security Education (<a href="http://www.ifiptc11.org/index.php?id=home0" target="_blank">IFIP</a>). The resulting report can hopefully help to define cybersecurity as a discipline, describe proficiency needed for cybersecurity experts, and connect academic programs with industry needs. Ultimately, bringing some common understanding and standardization into cybersecurity education should improve the education and help fill a shortage of security professionals.<br />
<br />
In terms of definition, cybersecurity involves the creation, operation, analysis, and testing of secure computer systems. The report assumes that while it is an interdisciplinary area that includes law, policy, human factors, ethics, and risk management, it is fundamentally a computing-based discipline. One of the challenges in developing curricular guidelines was to accommodate the large variability of cybersecurity programs: depending on the department or program in which they are created, content and emphasis can differ significantly. So the guidelines are designed to have some flexibility through the notion of a disciplinary lens. The program should be based on a solid computer science foundation with input from computer and software engineering and information systems and technologies, and should include cross-cutting concepts such as confidentiality, integrity, risk, and systems thinking.<br />
<br />
The report shows a serious effort to be comprehensive and yet flexible. It includes eight knowledge areas: <b>data, software, component, connection, system, human, organization, and society.</b> Each area comprises several units, along with described essentials and learning outcomes. There is some overlap between areas and units, which, again, helps to accommodate the variety of existing education efforts. Below is a summary that provides a quick overview of some areas:
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe_UNq5ajxBQyFulh8naqZGdbY_EwOqQb2npH8j7fp1FrYDaRh0dA4JBx4IEBOgOUPU8KMFPnyN1rQ3S7pRYlr6k6h_YFrC14KKgef8y8KCo4dpOHuLjKEgmNIo3GiX8e0zQHW-ndU4zg/s1600/cybersec+areas.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="292" data-original-width="888" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe_UNq5ajxBQyFulh8naqZGdbY_EwOqQb2npH8j7fp1FrYDaRh0dA4JBx4IEBOgOUPU8KMFPnyN1rQ3S7pRYlr6k6h_YFrC14KKgef8y8KCo4dpOHuLjKEgmNIo3GiX8e0zQHW-ndU4zg/s640/cybersec+areas.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
It is nice to see that ethics is a significant and explicit component of the curriculum. While it doesn't remove the challenge of educating technical professionals on ethics and human behavior, it certainly provides space for discussions. More information about the guideline and the task force is at <a href="http://cybered.acm.org/">http://cybered.acm.org/</a> </div>
DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-53785194455670150472018-02-14T08:01:00.000-08:002018-02-14T08:01:16.514-08:00Chomsky and Foucault on human nature and power<p>Notes from a televised debate between N. Chomsky and M. Foucault in 1971 (<a href="https://www.youtube.com/watch?v=3wfNl2L0Gf8" target="_blank">video</a> and <a href="https://chomsky.info/1971xxxx/" target="_blank">transcript</a>).</p>
<p>Chomsky begins with examples from linguistics to illustrate the notion of "innate structures". Children are successful in learning language because they can use "innate language" or "instinctive knowledge" to transform the limited data they are exposed to into organized knowledge. This <i>instinctive knowledge</i>, which allows children to build complex knowledge structures from partial data, is a fundamental constituent of human nature. Such a constituent (a collection of innate organizing principles) must be available in other domains, such as human cognition, behavior, and interaction. This is what Chomsky refers to as human nature.</p>
<p>Foucault mistrusts the notion of human nature - it is one of the concepts that while not being strictly scientific, has the ability to "designate, delimit and situate" certain types of discourses. For Chomsky it is ok to start with the concept of human nature as somewhat mystical (similar to gravitational forces or other scientific concepts) and later explain it through physical components (e.g., neural networks). Chomsky describes his approach as looking at the earlier stages of scientific thinking (great thinkers, more specifically) and understanding how they were able to arrive at concepts and ideas not available to anybody before.</p>
<p>Foucault makes a distinction between individual <i>attribution </i>of a discovery and <i>collective production of knowledge,</i> which can be referred to as "tradition", "mentality", or "modes". The former has been highly valued, while the latter is usually negativized. Another distinction is between knowledge as human activity and <i>truth</i>. The latter may be hidden from humans, but it will be unveiled. Attribution and relation to truth are interconnected. Throughout history we see examples of how the subject of truth (the individual revealing it) has to overcome myths and common thought, he has to "discover". What if this close relation of subject to truth is an effect of knowledge? What if truth is a complex non-individual formation? Can we replace individuals in the production of knowledge?</p>
<p>This position highlights a difference between Chomsky's and Foucault's approach to creativity. According to Foucault, Chomsky had to introduce the speaking subject into linguistics because language has been commonly studied as a system with a collective value. In language we have a few rules and elements and an unknown system of totalities that can be brought to light by individuals. In the history of knowledge, it's similar, but one has to overcome the dominance of individual creativity to show that there are rules and elements that can be transformed without explicitly passing through an individual.</p>
<p>Throughout the debate both scholars touch on many concepts from science and politics. Some of them are described below to highlight their differences:</p>
<table border="1">
<tbody>
<tr><td width="20%"><b>Concept</b></td>
<td><b>Chomsky</b></td>
<td><b>Foucault</b></td>
</tr>
<tr><td>Domain (Focus)</td>
<td>Language</td>
<td>Knowledge</td></tr>
<tr><td>Human nature</td>
<td>Comprised of innate structures that allow for learning and arriving at complex knowledge based on partial information</td>
<td>A historical construct that can organize knowledge, but also can delimit how we see human behavior</td></tr>
<tr><td>Creativity</td>
<td>A common human act of thinking about a new situation, describing it and acting in it</td>
<td>An individualistic act that has been emphasized throughout history without looking at general communal rules that are behind it</td></tr>
<tr><td>Freedom</td>
<td>Limited number of rules with infinite possibilities of application</td>
<td>"Grille" of many determinisms that affects how we arrive at knowledge and understanding</td></tr>
<tr><td>Ideal model of society</td>
<td>A federated, decentralised system of free associations, incorporating economic as well as other social institutions</td>
<td>No such model can be proposed, it is more important to expose the power that controls society, especially institutions such as education and medicine that appear neutral</td>
</tr>
</tbody></table>
<p>Somewhere in the middle, Chomsky also tried to reconcile their differences:</p>
<blockquote>
CHOMSKY: ... That is, I think that an act of scientific creation depends on two facts: one, some intrinsic property of the mind, another, some set of social and intellectual conditions that exist. And it is not a question, as I see it, of which of these we should study; rather we will understand scientific discovery, and similarly any other kind of discovery, when we know what these factors are and can therefore explain how they interact in a particular fashion.</blockquote>
<p>While Foucault didn't completely agree with that, the two were still building upon each other's ideas:</p>
<blockquote>
FOUCAULT: ... ultimately we understand each other very well on these theoretical problems. On the other hand, when we discussed the problem of human nature and political problems, then differences arose between us. And contrary to what you think, you can’t prevent me from believing that these notions of human nature, of justice, of the realisation of the essence of human beings, are all notions and concepts which have been formed within our civilisation, within our type of knowledge and our form of philosophy, and that as a result form part of our class system; and one can’t, however regrettable it may be, put forward these notions to describe or justify a fight which should - and shall in principle – overthrow the very fundaments of our society. This is an extrapolation for which I can’t find the historical justification.</blockquote>
DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-35980842152237272682017-04-21T09:15:00.005-07:002017-04-21T10:25:01.677-07:00March for Science - to march or not to marchApparently, there is a big controversy with regard to <a href="https://www.marchforscience.com/" target="_blank">March for Science (M4S)</a> that will take place this Saturday April 22, 2017 in DC and in many other cities around the US.<br />
<br />
The main stated goal of the march is to support publicly funded and publicly communicated science as a pillar of human freedom and prosperity. I was set on going because it seems that nowadays science needs support and because, regardless of whether you believe in such a thing as objective truth-seeking (I have my doubts), scientists can and should be political in defending their institutions and their role in public life. But mostly I was set on going because we need to resist <a href="https://en.wikipedia.org/wiki/Anti-intellectualism" target="_blank">anti-intellectualism</a> and assaults on reason. With my own reasons more or less clear, I didn't pay much attention to any discussion around the march. And I bought a t-shirt even though merchandising around protest movements seems out of place. Perhaps that is because this march is not a protest or social justice movement.<br />
<br />
Many people feel strongly that the march is wrong. That they were excluded from planning and organizing. Most importantly, that the march <a href="http://www.latinorebels.com/2017/03/14/the-march-for-science-cant-figure-out-how-to-handle-diversity/" target="_blank">marginalizes non-white non-male scientists and disregards diversity</a>. That it is <a href="http://www.theroot.com/marginsci-the-march-for-science-as-a-microcosm-of-lib-1794463442">a microcosm of liberal racism</a> and that march organizers pushed out those who argued for inclusiveness and <a href="https://en.wikipedia.org/wiki/Intersectionality" target="_blank">intersectionality</a>. The controversy is scattered across mass and social media, but to summarize, one side (organizers) is complicit in making the march a watered-down non-political "celebration of science". The other side (#MarginSci-ers) perceives the march as a social justice movement and wants the message of diversity (which applies to any context of American life) to be reinforced through this movement as well. An interesting analysis of <a href="http://www.minoritypostdoc.org/view/2017-8-1-zevallos-MfSDiversityDiscourse.html" target="_blank">the march diversity discourse</a> shows how organizers shifted their position with regard to diversity, thereby conforming to existing stereotypes and dominant discourse:<br />
<blockquote class="tr_bq">
<i>Unfortunately, through various miscommunications, including from the co-chairs and other key members of the MfS committee, the MfS audience has been primed to reinforce the established discourse about science. It took the better part of two months of constant lobbying and external pressure from minority scientists for the MfS organisers to finally reverse their stance. The fourth diversity statement finally states that science is political. At the same time, more recent media interviews that position diversity as a “distraction” undermine this stance.</i></blockquote>
In a sense, controversy is good. It highlights gaps in a movement and could potentially help to develop a robust program and action plan. But what is this movement? Upon reading the history of its organization, the march seems more like a top-down attempt to organize and contain than a grassroots protest and demand for change. It's being done professionally with attempts to control the message and the goals. Is "celebration" enough to ensure change? Do I need to celebrate science or to improve the mutual relationship between science and society? Are we mobilizing only because we want public funding and therefore need to "educate" the public and policy-makers?<br />
<br />
There is a high probability that with the goals of celebration, <a href="https://sciencemarchind.org/about/why-we-march/" target="_blank">connections, understanding and outreach,</a> M4S will follow #Occupy and Women's March movements - much enthusiasm and no action due to the lack of clear vision and strategies for change. A strong movement should have strong demands, which can then translate to specific legislation and policies. For example,<br />
<br />
<ul>
<li>Equal pay and opportunities in science and research</li>
<li>Strong science education across all states</li>
<li>Protections for whistle-blowers and government scientists from political repression</li>
<li>No marketization of science and education</li>
<li>Exposing and dismantling the <a href="http://karlgrossman.blogspot.com/2009/03/military-industrial-scientific-complex.html" target="_blank">military-industrial-scientific complex</a></li>
</ul>
<br />
<br />
<br />DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-1071901334709906402017-02-21T10:29:00.001-08:002017-02-21T10:29:05.883-08:00U.S. House, Indiana District 9 General Election 2010-2016 Visualization<br />This is my first attempt to create a <a href="https://en.wikipedia.org/wiki/Choropleth_map" target="_blank">choropleth map</a> - a map that visualizes measurements by shading geographic regions. I used <a href="http://www.in.gov/sos/elections/2400.htm" target="_blank">election results data from in.gov</a> - general elections of US House representatives from Indiana Congressional District 9 in 2010, 2012, 2014, and 2016. The maps below show the percentage of voters who voted for Democratic Party candidates (<a href="https://ballotpedia.org/Baron_Hill" target="_blank">Baron P Hill</a> in 2010, <a href="https://ballotpedia.org/Shelli_Yoder" target="_blank">Shelli Yoder</a> in 2012 and 2016, and <a href="https://ballotpedia.org/Bill_Bailey" target="_blank">William Bailey</a> in 2014).<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZB-8P6xNZsum9CAG8PEeTN4QbCCUBAouJnW_EpVSYGwYE9xqFAKyIwvBMXIKGwiUgBoLzmjAhbbHqwnc_dQSV45frQr-vq6sx8FXYu5rOVpOZFK0VGZ7sc7hnEuN2eg5PgVhENlxsm-0/s1600/congress+votes+2010-2016.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZB-8P6xNZsum9CAG8PEeTN4QbCCUBAouJnW_EpVSYGwYE9xqFAKyIwvBMXIKGwiUgBoLzmjAhbbHqwnc_dQSV45frQr-vq6sx8FXYu5rOVpOZFK0VGZ7sc7hnEuN2eg5PgVhENlxsm-0/s640/congress+votes+2010-2016.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The process was tedious, but straightforward:<br />
<br />
<ul>
<li>Find data and get it into appropriate format (some manual copying from PDF was needed)</li>
<li>Calculate statistics needed for mapping (here percent voting for Democrats within county)</li>
<li>Get geographic (shapefile) data</li>
<li>Combine stats and geographic data</li>
<li>Generate choropleth map</li>
</ul>
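The two middle steps (calculating the statistic and combining it with geographic data) can be sketched without any mapping library. The maps in this post were made in R; the Python below is only a stand-in that shows the logic of the join. All vote counts are invented for illustration, and the `NAME` key and the function names are assumptions rather than fields from the actual in.gov data or shapefile.

```python
# Hypothetical vote counts; NOT the actual in.gov results.
votes = [
    {"county": "Monroe", "year": 2012, "dem": 600, "total": 1000},
    {"county": "Morgan", "year": 2012, "dem": 300, "total": 1000},
]

# Stand-in for shapefile attribute rows (real rows would carry geometry).
shapes = [
    {"NAME": "Monroe", "geometry": None},
    {"NAME": "Morgan", "geometry": None},
    {"NAME": "Orange", "geometry": None},
]


def percent_dem(rows):
    """Map (county, year) to the percent of votes for the Democratic candidate."""
    return {(r["county"], r["year"]): 100.0 * r["dem"] / r["total"] for r in rows}


def join_to_shapes(regions, stats, year):
    """Left join: keep every region; regions without data get None."""
    return [dict(reg, pct_dem=stats.get((reg["NAME"], year))) for reg in regions]


stats = percent_dem(votes)
mapped = join_to_shapes(shapes, stats, 2012)
for row in mapped:
    print(row["NAME"], row["pct_dem"])
```

Keeping every region in the join makes gaps visible: a county with no matching vote record shows up with a missing value instead of silently disappearing from the map.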
<br />
<a href="http://www.computerworld.com/article/3038270/data-analytics/create-maps-in-r-in-10-fairly-easy-steps.html" target="_blank">This tutorial on creating maps with R</a> and <a href="https://cran.r-project.org/web/packages/tmap/vignettes/tmap-nutshell.html" target="_blank">this vignette about tmap package</a> were very helpful.<br />
<br />
A quick analysis of the District 9 US House elections over time shows that some counties (e.g., Monroe county) are strong in voting for Democrats and some counties (e.g., Morgan and Orange counties) are much weaker. In 2012, though, Orange, Washington, Harrison, and some other counties suddenly had nearly half of their voters voting for Democrats. The turnout in 2012 was higher than in 2010, but it was comparable to 2016, when Shelli Yoder was the candidate again. The year 2012 was also when only two candidates ran, so maybe we need to look at other candidates and how they take votes. More data and more analysis are needed.DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-68801162089463362132017-02-13T08:24:00.000-08:002017-02-13T08:35:11.010-08:00Data quality - a short overview<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Ap_1uKP4G2AATSRbNrU-eOpZMCN_IEqbmciTwf9IfYn1fEMsMFoUzncd8uJWE-kkvlWW0pi5AjMthwDU7GVRMn5rrmvvgkGhjSvhAbJHODqTSNFm51ynWIgfA4Oy62OrEfGEgB4O-nA/s1600/LYD_00_undated.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 2em;"><img alt="love your data image" border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Ap_1uKP4G2AATSRbNrU-eOpZMCN_IEqbmciTwf9IfYn1fEMsMFoUzncd8uJWE-kkvlWW0pi5AjMthwDU7GVRMn5rrmvvgkGhjSvhAbJHODqTSNFm51ynWIgfA4Oy62OrEfGEgB4O-nA/s200/LYD_00_undated.png" title="" width="200" /></a></div>
<br />
A short overview of data quality definitions and challenges in support of the <a href="https://www.blogger.com/">Love Your Data</a> week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality" and I was part of preparing daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website: <a href="https://loveyourdata.wordpress.com/lydw-2017/monday-2017/" target="_blank">Defining Data Quality</a>, <a href="https://loveyourdata.wordpress.com/lydw-2017/tuesday-2017/" target="_blank">Documenting, Describing, Defining</a>, <a href="https://loveyourdata.wordpress.com/lydw-2017/wednesday-2017/" target="_blank">Good Data Examples</a>, <a href="https://loveyourdata.wordpress.com/lydw-2017/thursday-2017/" target="_blank">Finding the Right Data</a>, <a href="https://loveyourdata.wordpress.com/lydw-2017/friday-2017/" target="_blank">Rescuing Unloved Data</a><br />
<br />
<div>
<div>
<i><b>Data quality</b></i> is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is data that can be used for the task at hand even if it has some issues (e.g., missing data, poor metadata, value inconsistencies, etc.). Data that has errors, is hard to retrieve or understand, or has no context or traces of where it came from is generally considered bad.</div>
<div>
<br /></div>
<div>
Numerous attempts to define data quality over the last few decades relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of quality of data is recognized in business and commercial data warehousing (Fan and Geerts, 2012; Redman, 2001), in government operations (Information Quality Act, 2001) and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate quality of information, including decision, measurement, test, and estimation theories (Altman, 2012). <br />
<br />
Attempts to develop discipline-independent frameworks resulted in several models, including models that define quality as <b>data-related versus system-related</b> (Wand and Wang, 1996), as <b>product and service quality</b> (Kahn, Strong and Wang, 2002), as <b>syntactic, semantic and pragmatic dimensions</b> (Price and Shanks, 2005), and as <b>user-oriented and contextual quality</b> (Dedeke, 2000). Despite these many attempts, discipline-independent data quality frameworks have not been widely adopted and more frameworks continue to appear. Several systematic syntheses compared many existing frameworks only to point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009). <br />
<br />
Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):<br />
<br />
<ul>
<li><b>how to assess quality</b></li>
<li><b>how to manage quality</b></li>
<li><b>what impact quality has on organization</b></li>
</ul>
The multitude of definitions, frameworks, and contexts in which data quality is used demonstrates that making data quality a useful paradigm is a persistent challenge. It can benefit from establishing a dynamic network of researchers and practitioners in the area of data quality and from developing a framework that would be general and yet flexible enough to accommodate highly specific attributes and measurements from particular domains.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://dn128h0qfjlc0.cloudfront.net/wp-content/uploads/2014/08/05175105/data-quality.gif" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="data quality attributes" border="0" height="241" src="https://dn128h0qfjlc0.cloudfront.net/wp-content/uploads/2014/08/05175105/data-quality.gif" title="data quality attributes" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.realisedatasystems.com/3-reasons-why-data-quality-should-be-your-top-priority-this-year/" target="_blank">3 Reasons Why Data Quality Should Be Your Top Priority This Year</a></td></tr>
</tbody></table>
<div>
Each dimension of data quality, such as completeness, accuracy, timeliness, or consistency, creates its own challenges.<br />
<br />
Completeness, for example, is the extent to which data is not missing or is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.<br />
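As a rough illustration of the "not missing" part of that definition, completeness can be scored per field as the share of records with a usable value. This is a minimal sketch, not a standard metric from the cited literature: the function name, the choice of what counts as missing (`None` or an empty string), and the sample records are all assumptions made for the example.

```python
def completeness(records, fields, missing=(None, "")):
    """Share of non-missing values per field, between 0.0 and 1.0."""
    if not records:
        raise ValueError("records must not be empty")
    return {
        f: sum(1 for r in records if r.get(f) not in missing) / len(records)
        for f in fields
    }


# Hypothetical survey records; an absent key counts as missing too.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 51, "income": ""},
    {"age": 29},
]

scores = completeness(records, ["age", "income"])
print(scores)
```

A score like this says nothing about whether the data has sufficient breadth and depth for a given task; that part of the definition depends on the intended use and cannot be computed from the dataset alone.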
<br />
As data goes through many stages during the research lifecycle, from its collection / acquisition to transformation and modeling to publication, each of the stages creates additional challenges for maintaining integrity and quality of data. In <a href="http://www.sciencealert.com/no-noaa-didn-t-fake-environmental-data" target="_blank">one of the most recent attempts to discredit climate change studies</a>, for example, the authors of the study were blamed for not following the NOAA Climate Data Record policies that maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This brings out possibilities for further studies:</div>
<div>
<ul>
<li>How does non-compliance with policies undermine the quality of data?</li>
<li>What role does scientific community consensus play in establishing the quality of data?</li>
<li>Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized? </li>
</ul>
Another aspect of data quality that complicates formalized treatment of initial dimensions is that data is often heterogeneous and can be applied in varied contexts. As pointed out above, data quality frameworks and approaches are being developed in business, government, and research contexts, and quality solutions have to consider structured, semi-structured, and unstructured data and their combinations. Most of the previous data quality research focused on structured or semi-structured data. Additionally, spatial, temporal, and volume dimensions of data contribute to quality assessment and management.<br />
<br />
Madnick et al. (2009) identify three approaches to possible solutions to data quality: <b>technical </b>or database approach, <b>computer science / information technology (IT)</b> approach, and<b> digital curation</b> approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking and probabilistic modeling. Computer / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation includes paying attention to metadata, long-term preservation, and provenance.</div>
<div>
<br /></div>
<div>
Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility that is shared between data providers, data curators and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage and metadata. Maintaining current and consistent metadata across copies and systems also benefits from the contributions of those who intend to re-use the data. Data and software documentation is another aspect of data quality that cannot be solved technically and needs a combination of organizational / information science solutions.<br />
<b><br /></b>
<b>References and further reading:</b></div>
</div>
</div>
<div>
<ul>
<li>Altman, M. (2012). <a href="http://openscholar.mit.edu/sites/default/files/dept/files/altman2012-mitigating_threats_to_data_quality_throughout_the_curation_lifecycle.pdf" target="_blank">Mitigating Threats to Data Quality Throughout the Curation Lifecycle</a>. </li>
<li>Ballou, D. P., & Pazer, H. L. (1985). <a href="http://doi.org/10.1287/mnsc.31.2.150" target="_blank">Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems</a>. <i>Management Science, 31</i>(2), 150–162. </li>
<li>Fan, W., & Geerts, F. (2012). <i>Foundations of data quality management.</i> Morgan & Claypool.</li>
<li>Ge, M., & Helfert, M. (2007). <a href="http://mitiq.mit.edu/iciq/PDF/A%20REVIEW%20OF%20INFORMATION%20QUALITY%20RESEARCH.pdf" target="_blank">A review of information quality research - develop a research agenda</a>. In <i>The International Conference on Information Quality </i>(pp. 76–91).</li>
<li>International Monetary Fund. (2001). <a href="http://www.imf.org/external/np/sta/dsbb/2001/review.htm" target="_blank">Fourth Review of the Fund’s Data Standards’ Initiatives.</a></li>
<li>Letzter, R. (2017, February 9). <a href="http://www.sciencealert.com/no-noaa-didn-t-fake-environmental-data" target="_blank">No, the NOAA didn’t fake climate change data</a>. <i>Science Alert</i>.</li>
<li>Madnick, S. E., Wang, R. Y., Lee, Y. W., & Zhu, H. (2009). <a href="http://doi.org/10.1145/1515693.1516680" target="_blank">Overview and Framework for Data and Information Quality Research</a>. <i>Journal of Data and Information Quality, 1</i>(1), 1–22. </li>
<li>Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). <a href="http://doi.org/10.1145/505248.506010" target="_blank">Data quality assessment</a>. <i>Communications of the ACM, 45</i>(4), 211. </li>
<li>Price, R., & Shanks, G. (2005). <a href="http://doi.org/10.1057/palgrave.jit.2000038" target="_blank">A semiotic information quality framework: development and comparative analysis</a>. <i>Journal of Information Technology, 20</i>(2), 88–102. </li>
<li>Redman, T. C. (2001). <i>Data Quality: The Field Guide. </i>Boston: Digital Press.</li>
<li>Wand, Y., & Wang, R. Y. (1996). <a href="http://doi.org/10.1145/240455.240479" target="_blank">Anchoring data quality dimensions in ontological foundations.</a> <i>Communications of the ACM, 39</i>(11), 86–95. </li>
</ul>
</div>
<b>Data for humanitarian purposes</b> (2016-09-16)<div>
<a href="http://www.msf.org/en/dr-unni-karunakara-international-president-jun-2010-sep-2013" target="_blank">Unni Karunakara</a>, a former president of "Doctors without borders", gave a talk at <a href="http://www.internationaldataweek.org/" target="_blank">International Data Week 2016</a> on September 13 about the role of data in humanitarian organizations. The talk was powerful in its simplicity and in its urgent call for better data and for better management and dissemination of that data. It was a story of human suffering, but also a story of care and integrity in using data to alleviate it.</div>
<div>
<br /></div>
<div>
Humanitarian action can be defined as a moral activity grounded in the ethics of assistance to those in need. Four principles guide humanitarian action:</div>
<div>
<ul>
<li>humanity (respect for the human)</li>
<li>impartiality (provide assistance because of a person's need, not politics or religion)</li>
<li>neutrality (tell the truth regardless of interests)</li>
<li>independence (work independently from governments, businesses, or other agencies)</li>
</ul>
</div>
<div>
These principles affect how to collect and use data and how to ensure that data helps. Data collected for humanitarian action is evidence that can be used for direct medical action and for bearing witness, which is a very important activity of humanitarian organizations: </div>
<blockquote class="tr_bq">
“We are not sure that words can always save lives, but we know that silence can certainly kill.” (<a href="http://www.doctorswithoutborders.org/about-us/history-principles/nobel-peace-prize" target="_blank">quoted from another MSF president</a>)</blockquote>
<div>
Awareness of the serious consequences of data for humanitarian action leads "Doctors without borders" to work only with data they collect themselves and to use only stories they witnessed firsthand. Restraint and integrity in data collection are crucial to maintaining the credibility of the organization.</div>
<div>
<br /></div>
<div>
Lack of data or lack of mechanisms to deliver necessary data hurts people. For example, during the Ebola outbreak it took the World Health Organization about 8 months to declare an emergency, and 3000 people died because data was not available in time or in the right form. The Infectious Diseases Data Observatory (<a href="https://www.iddo.org/" target="_blank">IDDO</a>) was created to help with tracking and researching infectious diseases by sharing data, but many ethical, legal, and other issues still need to be solved.</div>
<div>
<br /></div>
<div>
Humanitarian organizations often do not have trustworthy data available, either because of competing definitions or because data collection systems are lacking. For example, because of differences in defining "civilian casualty", estimates of the number of civilians killed in drone strikes range from a hundred to thousands. In developing countries or conflict zones where census activities are absent or dangerous, counting graves or tents becomes a proxy for mortality, mobility rates and other important indicators. Crude estimates are then the only available evidence.</div>
<div>
<br /></div>
"Doctors without borders" (MSF) does a lot to share and disseminate its information. It has an open data / access policy and <a href="http://dx.doi.org/10.1371/journal.pmed.1001562" target="_blank">aspires to share data</a>, while placing high value on the security and well-being of the people it helps.<div>
<br /></div>
<div>
<br /></div>
<b>Workshop: Data Quality in Era of Big Data</b> (2016-09-02)<br />
The center where I work is organizing a workshop of possible interest to many who work with data. Scholarships are available.<br />
<br />
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Data Quality in Era of Big Data</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Bloomington, Indiana</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>28-29 September 2016</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<a href="http://d2i.indiana.edu/mbdh" style="color: #1155cc;" target="_blank">http://d2i.indiana.edu/mbdh</a></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Throughout the history of modern scholarship, the exchange of scholarly data was undertaken through personal interactions among scholars or through highly curated data archives. In either case, implicit or explicit provenance mechanisms gave a relatively high degree of assurance of the quality of the data. However, the ubiquity of the web and mobile digital culture has produced disruptive new forms of data. We need to ask ourselves what we know about the data and what we can trust. Failure to answer these questions endangers the integrity of the science produced from these data.</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
The workshop will examine questions of quality:</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Citizen science data</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Health records</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Integrity</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Completeness; boundary conditions</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Instrument quality</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Data trustworthiness</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Data provenance</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· Trust in data publishing</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
The two-day workshop begins with a half day of tutorials. The main workshop begins in the early afternoon on 28 September and continues until noon on 29 September. With sufficient interest, there may be another training session following the noon conclusion of the main workshop on 29 September.</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Early Career Travel Funds:</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Travel funds are available for early career researchers, scholars, and practitioners <a href="http://d2i.indiana.edu/mbdh/#scholarships">http://d2i.indiana.edu/mbdh/#scholarships</a></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Important Dates:</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· <strong>Workshop:</strong> Sep 28-29, 2016</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· <strong>Deadline for requesting early career travel funds:</strong> Sep 9, 2016 midnight EDT</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· <strong>Notification of travel funding:</strong> Sep 13, 2016</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
· <strong>Registration deadline:</strong> Sep 19, 2016</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Organizing Committee:</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>General Chairs:</strong> Beth Plale, Indiana University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Program Committee</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Carl Lagoze, University of Michigan, chair</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Devan Donaldson, Indiana University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
H.V. Jagadish, University of Michigan</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Xiaozhong Liu, Indiana University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Jill Minor, Indiana University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Val Pentchev, Indiana University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Hridesh Rajan, Iowa State University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Early Career Chairs</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Devan Donaldson, Indiana University </div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Xiaozhong Liu, Indiana University</div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
<strong>Local Arrangements Chair</strong></div>
<div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px;">
Jill Minor, Indiana University</div>
<b>SynBERC, anthropological inquiry and methods of research</b> (2016-08-17)<div class="tr_bq">
Recently I've been searching for guidance on how to describe ethnographic methodology in a grant proposal and found P. Rabinow and A. Stavrianakis' commentary <a href="http://www.haujournal.org/index.php/hau/article/view/hau6.1.021/2297" target="_blank">Movement space: Putting anthropological theory, concepts, and cases to the test</a>, where they reflect on the challenges of anthropological inquiry and on what it means to observe in heterogeneous and changing spaces. I had no time then to read it slowly and carefully, so I am filling this gap now.</div>
<div class="tr_bq">
<br /></div>
The essay is a response to another collection of essays, but also a reflection on previous ethnographic research with the <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=0540879" target="_blank">Synthetic Biology Engineering Research Center SynBERC</a> (I wish I had paid more attention to it during my own dissertation research). An honest public account like that contributes to the ethics and methodology discussions more than any published "research" article.<br />
Raising the question of "to what end" in anthropological inquiry, Rabinow and Stavrianakis' essay recollects previous collaborative participant-observations as attempts to bring the ethics that exists outside of the instrumental rationality of science into multidisciplinary research projects.<br />
<b><br /></b>
<b>Flourishing</b> is the concept they used to challenge and change the currently existing relations between
knowledge and care (see Rabinow, Paul. "Prosperity, Amelioration, Flourishing: From a Logic of Practical Judgment to Reconstruction." Law and Literature 21, no. 3 (2009): 301-20, <a href="http://www.jstor.org/stable/10.1525/lal.2009.21.3.301" target="_blank">jstor</a>). Flourishing helps to examine research practices from a holistic perspective, as practices that are performed by human beings without ethical compartmentalization into scientific, individual, and citizen values.<br />
<br />
Then the discussion moves to temporality in anthropological research and the distinction between "contemporary" and "present" in ethnography. This distinction was hard for me to understand. Observations are made in the present, but somehow contextualization with experiences from the past (history) helps to challenge the "ethnographic present". Does it mean that something (objects or practices) may be present but not contemporary? Or that the contemporary may include the past? In other words, the distinction <i>present vs contemporary vs modern</i> allows us to stay tuned to the constant changes and not to fix descriptions as existing in certain times only. Sometimes it seemed that contemporary referred to attempts to reconcile diverse or contradicting practices (e.g., the practices of observation and observers and the practices of the observed).<br />
<br />
An interesting point was made on citation (again, as a response to someone else's point). It is about acknowledging more recent work on similar issues. While understanding that those are the rules of the game (esp. to get grants), Rabinow writes that excessive citation also constrains thinking and writing, and authorizes such practices. Why would someone go back to reading and citing Weber, Foucault, or Dewey? Not because what they said is still relevant and true (although some of it is), but because they paid so much attention to problem formation, to the need for conceptual tools, and to the importance of experimentation with form.<br />
<br />
Back to SynBERC, it is striking how empty expressions of support from bioscientists and engineers masked indifference and ultimately a lack of respect and willingness to change. What made things worse was that social scientists' efforts to develop effective modes of governance and interaction were blocked and downgraded to non-action and "soothing public relations". Moreover, the social scientists themselves failed to coordinate and reflect on their complicity with dominating technoscientific norms and values.<br />
<br />
Even though I really appreciated this account of anthropological "failure" (as seen by others, but not by the authors who conceptualize it as an "anthropological test"), there is a larger purpose in it. As the authors put it,<br />
<blockquote>
... it is time to thematize the new configurations of power relations in which anthropologists are working today. Critique as denunciation, still the dominant mode in anti-colonial narratives, is no longer sufficient for the complexities of contemporary inquiry. We are arguing for a more fine-grained acceptance of the fact that by refusing the binaries of inside and outside, one’s responsibility for one’s position in the field is made available for reflection and invention.</blockquote>
Anthropology's major task is to map heterogeneity of human and cultural forms, including:<br />
<ul>
<li>cultural heterogeneity with an underlying generality (American anthropology)</li>
<li>heterogeneity within common institutional forms such as kinship and law (British anthropology)</li>
<li>variations in structural patterns of society and the mind (French anthropology)</li>
</ul>
<div>
However, accounts of heterogeneity lost their force, in some ways losing their criteria of validity under the pressure of current norms of conducting research. At the same time critical evaluation of such criteria is an important task in changing present times. Such evaluation can be done through <i>testing </i>- constant re-evaluation of the existing conceptual tools in the context of new situations and experiences. The rest of the detailed discussion on testing was dense, but less relevant to me, so it was also harder to follow. </div>
<div>
<br /></div>
<div>
One of the take-aways is that anthropology needs to be a collaborative endeavor, where individual inquiries examine specific cases and then many inquirers create a common space of concepts, problems, and cases. The constant movement between specific cases and topology of cases creates a space where anthropology can make justifiable, warrantable claims about more than one case, i.e., about heterogeneity and associated generality.</div>
<div>
<br /></div>
<div>
<br /></div>
<b>Mapping scientific fields, domains and specialties</b> (2016-06-14)<br />
I'm embarking on a new project that focuses on mapping research fields and studying the evolution of certain concepts and research communities. I have a certain field in mind that I'd like to investigate, but first I need to learn more about scientometrics and mapping of research domains. This is the first in a series of notes from my readings - a review chapter in the <i>Annual Review of Information Science & Technology</i> (ARIST) titled <a href="http://onlinelibrary.wiley.com/doi/10.1002/aris.2008.1440420113/abstract">"Mapping Research Specialties"</a>.<br />
<br />
The chapter defines a research specialty as a self-organizing network of researchers who tend to study the same research topics, attend the same conferences, publish in the same journals, and also read and cite each other's research papers.<br />
<br />
Other definitions of research specialties:<br />
<br />
<ul>
<li>Kuhn (1970) - communities of one hundred members, sometimes less </li>
<li>Price (1986) - an “invisible college” of approximately 100 “core” scientists, monitoring the work of individuals who are rivals and peers by reading about 100 papers for every one published</li>
<li>Lievrouw (1990) - a set of informal communication relations among scholars or researchers who share a specific common interest or goal </li>
<li>Small (1980) - consensual structure of concepts in a field, employed through its citation and co-citation network </li>
<li>Rogers, Dearing, and Bregman (1993) - a family tree in which earlier studies influence later studies</li>
</ul>
Using the term "specialties" rather than "invisible college" avoids the assumption that the researchers are in frequent informal communication.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0S9Vzjan3ygdAU8Og2bQcK2Xy9UeTNwPKkzQJXN4hh_NNH6OXRdJw6Gz_oJN-kV9l-TPNAKCX4eC91A4MGEntekgc65vG7eLJX-gaOPwcuLVc_Qi76Mo-VA8zI5HASjNENihDzTXHcU8/s1600/research+specialty+model.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="research specialty model" border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0S9Vzjan3ygdAU8Og2bQcK2Xy9UeTNwPKkzQJXN4hh_NNH6OXRdJw6Gz_oJN-kV9l-TPNAKCX4eC91A4MGEntekgc65vG7eLJX-gaOPwcuLVc_Qi76Mo-VA8zI5HASjNENihDzTXHcU8/s400/research+specialty+model.png" title="model of a research specialty" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fig. 6.2 from <a href="http://onlinelibrary.wiley.com/doi/10.1002/aris.2008.1440420113/abstract" target="_blank">"Mapping research specialties"</a></td></tr>
</tbody></table>
A research specialty is therefore an interconnected group of researchers that has its own knowledge base with its own concepts, paradigms and validation standards, and uses particular channels of formal and informal communication.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Studies of research specialties are connected to the key questions raised by Chubin in his 1976 review of the field "<a href="http://www.jstor.org/stable/4105548">The Conceptualization of Scientific Specialties</a>":<br />
<ol>
<li>What are the social and intellectual properties of a specialty? </li>
<li>How do specialties grow, stabilize, and decline? </li>
<li>What are the temporal and spatial dimensions of a specialty? </li>
<li>How do specialties vary in size, scope, and life expectancy? </li>
<li>What are the institutional arrangements that support specialties? </li>
<li>What impact does funding have on the kind and volume of research produced in a specialty?</li>
<li>What kinds of communication relations sustain research activities in a specialty? </li>
</ol>
The following approaches are used in the studies of research specialties:<br />
<br />
<ol>
<li>The sociological approach (seems to be much more developed than others): science as an institution (Merton); science as a system of beliefs (Bloor, Barnes, Collins); science as culture (Latour, Woolgar, Knorr-Cetina); science as collaboration and competition (Whitley, Gibbons); science as boundary making and demarcation (Gieryn)</li>
<li>Bibliographic or bibliometric: relevance (topics, novelty, availability, etc.); citations and co-citations; author co-citations; co-word analysis</li>
<li>Communicative approach: knowledge diffusion through informal channels and discourses and rhetoric in science </li>
<li>Cognitive approach: paradigm shift (Kuhn) and branching of ideas (Mulkay)</li>
</ol>
<br />
Mapping research specialties helps to find the structure and dynamics of a research specialty and can include:<br />
<br />
<ol>
<li>A map of the network of researchers and research teams involved with the specialty.</li>
<li>A map of the base knowledge supporting research in the specialty.</li>
<li>A map of current research topics in the specialty.</li>
</ol>
A map of a specialty is a representation of the structure and interconnection of known elements of the specialty, which includes research topics, teams, concepts, authorities, archival journals, research institutions, and technical vocabularies. Mapping techniques often include bibliometric methods, such as reference co-citation analysis, bibliographic coupling analysis, co-authorship analysis, author co-citation analysis, co-word analysis, paper to paper citation analysis, journal to journal citation analysis, and journal co-citation analysis.<br />
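To make two of these bibliometric methods concrete for myself, here is a minimal sketch of reference co-citation analysis and bibliographic coupling analysis. It is not from the chapter: the paper and reference labels, the <code>citations</code> dictionary and both function names are made up for illustration, and real analyses would start from a citation index export rather than a hand-typed dictionary.<br />

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing paper maps to the set of references it cites.
citations = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"B", "C", "D"},
}


def cocitation_counts(citations):
    """Co-citation: count how often each pair of references is cited together."""
    counts = Counter()
    for refs in citations.values():
        # Every pair of references in one reference list is co-cited once.
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts


def coupling_strength(citations):
    """Bibliographic coupling: count references shared by each pair of citing papers."""
    strength = {}
    for p, q in combinations(sorted(citations), 2):
        shared = len(citations[p] & citations[q])
        if shared:
            strength[(p, q)] = shared
    return strength


cocited = cocitation_counts(citations)
coupled = coupling_strength(citations)

for pair, n in sorted(cocited.items()):
    print("co-cited", pair, n)
for pair, n in sorted(coupled.items()):
    print("coupled", pair, n)
```

The two measures look at the same citation data from opposite sides: co-citation links the cited references (a view of the base knowledge), while coupling links the citing papers (a view of current topics).<br />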
<br />
Other goals of mapping include:<br />
<br />
<ul>
<li>Mapping the social network of researchers - identify and characterize researchers and teams of researchers and their sponsoring institutions in terms of productivity, impact of research results, weak ties, levels of participation and collaborations. </li>
<li>Mapping the base knowledge in the specialty - concepts, theories, methods, controversies</li>
<li>Mapping the topical structure </li>
<li>Mapping the relations - researchers, concepts, and topics </li>
<li>Mapping changes - shifts in base knowledge and topics, new subtopics, productive researchers, changes in funding</li>
</ul>
Techniques of mapping can include surveys of subject matter experts, bibliometric techniques (see above), web content analysis, and analysis of formal literature (most developed and frequently done).<br />
<br />
The conclusion is not very optimistic though:<br />
<blockquote class="tr_bq">
The problem of mapping specialties is complex and poorly defined. A number of techniques have been developed and applied. Each of these techniques reveals some separate aspect of the specialty. For example, co-authorship analysis uncovers the social structure of collaboration and research teams in the specialty, co-citation analysis uncovers structure of base knowledge in the specialty, and bibliographic coupling analysis reveals research subtopics. In and of themselves, these analytic techniques are inadequate as tools to map the whole research specialty: the social structure of researchers, the base knowledge they use, and the research topics they study. ... the metaphor of the blind men and the elephant is appropriate, as each analytic technique reveals the specialty in some limited aspect.</blockquote>
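As a small illustration of the co-authorship analysis mentioned in the quote, the sketch below builds a weighted co-authorship graph and reads off connected groups of collaborators as rough "teams". This is my own toy example under stated assumptions, not the chapter's method: the author names, the <code>papers</code> list and the function names are hypothetical, and treating connected components as teams is a deliberate simplification.<br />

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: author lists of five papers.
papers = [
    ["Ann", "Bo"],
    ["Bo", "Cy"],
    ["Ann", "Bo", "Cy"],
    ["Di", "Ed"],
    ["Fay"],
]


def coauthor_graph(papers):
    """Build a weighted graph: graph[a][b] = number of papers a and b co-wrote."""
    graph = defaultdict(lambda: defaultdict(int))
    for authors in papers:
        for a in authors:
            graph[a]  # register every author, so solo authors become isolated nodes
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def components(graph):
    """Return connected groups of co-authors (a crude stand-in for research teams)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node])  # visit all co-authors of this node
        seen |= group
        groups.append(sorted(group))
    return groups


graph = coauthor_graph(papers)
teams = components(graph)
print(teams)
```

Even this toy version shows the limitation the authors describe: the graph says who works with whom, but nothing about the base knowledge or the topics those groups study.<br />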
<br />
What is the solution for examining a specialty as a whole? Combine as many existing techniques as possible or develop some new techniques?<br />
<br />
<b>Cyberinfrastructure studies overview</b> (2016-06-08)<br />
<p>In their introduction to the special issue on sociotechnical studies of cyberinfrastructure (CI) and e-research, Ribes and Lee identify current themes and methodologies of CI studies (<i>Computer Supported Cooperative Work (CSCW)</i>, 2010, Volume 19, Issue 3, pp 231-244, doi: <a href="http://link.springer.com/article/10.1007/s10606-010-9120-0/fulltext.html?view=classic">10.1007/s10606-010-9120-0</a>).</p>
<p>Cyberinfrastructure (CI) is one of the current terms for the technologies that support scientific activities such as collaboration, data sharing and dissemination of findings. CI features that distinguish it from other CSCW work include: community wide and cross-disciplinary scope, computational orientation, and end-to-end (data-to-knowledge-to-user) integration.</p>
<b>Themes in CI studies:</b><br />
<br />
<ol>
<li>Relationality. What is supporting the work of another and who is sustaining those relationships?</li>
<li>Integration of heterogeneity. CI involves computer specialists, data and information managers, domain scientists, and so on, but also non-human actors such as sensors and databases.</li>
<li>Sustainability. What makes CI a long-term resource?</li>
<li>Standardization. Ways to achieve integration on the technical and human levels.</li>
<li>Scale. How to plan for change and growth in the number of collaborators, the quantity of data, and the geographical reach.</li>
<li>The distribution between human work and technological delegation. </li>
</ol>
<p>Methods include historical, ethnographic, documentary, and interview-based approaches that focus on the following:</p>
<ul>
<li>Investigations of ongoing planning, development and deployment efforts </li>
<li>Activities of maintenance, upgrade and breakdown</li>
<li>Adoption of certain expressions of scientific activity and changes in their use</li>
<li>Adoption of new technological artifacts</li>
</ul>
<p>Units of analysis can be a project or CI as a whole (focus on national policies and funding incentives). The introduction concludes by calling for more studies:
<blockquote>
The stories of cyberinfrastructure are revealed by looking across multiple levels of granularity, various facets of social life, and diverse technological actors. Much remains to be studied in the areas of supporting domain specific practice, data sharing and curating, and infrastructural organizings. This is an exciting time for CI studies. Research is occurring in new and unexpected places, drawing on and bringing together the traditions of CSCW, information science, organizational studies, and science and technology studies. This cross-pollination, as exemplified by the papers in this issue, seems to be not only fruitful, but also very necessary.</blockquote>
<b>The Net Data directory</b> (posted 2016-06-06)<br />
<p>The <a href="https://cyber.law.harvard.edu/">Berkman Center for Internet &amp; Society</a> announced the launch of the <a href="http://netdatadirectory.org/">Net Data Directory</a>, a free, publicly available database of data about the Internet that covers topics such as cyber-security, civil and human rights, social media and many more. The directory currently contains about 150 data source records of many types, including website rankings, opinion surveys, and maps of activities.</p>
<p>The <a href="http://netdatadirectory.org/node/2487" target="_blank">press release</a> says that records are maintained by researchers at the Berkman Center, which means that keeping the directory current, relevant and error-free will be a challenge. As the number of sources grows, it will also become harder to navigate the directory by searching and browsing alone, without more sophisticated tools for filtering, recommendation, and visualization.</p>
<b>Dataset on Parkinson's disease</b> (posted 2016-04-18)<br />
In March 2016 <a href="http://sagebase.org/" target="_blank">Sage Bionetworks</a> released a dataset that captures the everyday experiences of over 9,500 people with Parkinson's disease (<a href="http://sagebase.org/sage-bionetworks-releases-first-of-its-kind-data-from-parkinsons-iphone-study/" target="_blank">press release</a>). The data, described in the data paper <a href="http://www.nature.com/articles/sdata201611" target="_blank">"The mPower study, Parkinson disease mobile data collected using ResearchKit"</a>, was collected via the mPower iPhone app, where participants were presented with tasks (referred to as ‘memory’, ‘tapping’, ‘voice’, and ‘walking’ activities) and asked to fill out surveys.<br />
<br />
Not everybody agreed to share their data broadly with the research community. Out of 14,684 verified participants, 9,520 (65%) agreed to share broadly; the rest either withdrew from the study or agreed to share narrowly, with the research team only:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTtIwNmECyS2O-JX8htzRTl_X1EU0dLJ_00hoEB_gN0PSxUng07Sin5Rc569LnidP5p4MWESvbtYzbVMFpGF4tAsktnLnmiSrWny0EqVgTJpxX5M05TTy44PQlIfKvyeA_K3Bkoh7-kK4/s1600/sdata201611-f1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Study cohort description" border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTtIwNmECyS2O-JX8htzRTl_X1EU0dLJ_00hoEB_gN0PSxUng07Sin5Rc569LnidP5p4MWESvbtYzbVMFpGF4tAsktnLnmiSrWny0EqVgTJpxX5M05TTy44PQlIfKvyeA_K3Bkoh7-kK4/s320/sdata201611-f1.jpg" title="Study cohort description" width="320" /></a></td></tr>
<tr><td class="tr-caption">Figure 1: mPower study cohort description. From http://www.nature.com/articles/sdata201611#methods</td></tr>
</tbody></table>
<br />
To provide proper safeguards and to balance sharing and privacy, the research team established a <a href="https://www.synapse.org/#!Synapse:syn4993293/wiki/247860" target="_blank">data governance structure</a>. Access is granted to qualified researchers who agree to specific conditions for use, including the following:<br />
<br />
<ul>
<li>participants cannot be re-identified</li>
<li>the data may not be redistributed</li>
<li>findings need to be published in open access venues</li>
<li>both participants and research team need to be acknowledged as data contributors</li>
</ul>
<div>
This effort is another example of the newly forming data sharing culture. It relies on <a href="https://www.synapse.org/" target="_blank">Synapse</a>, which seems to make sharing easier from both technical and policy perspectives.</div>
<b>Big data analytics overview</b> (posted 2016-04-10)<br />
The paper <a href="http://www.sciencedirect.com/science/article/pii/S0268401214001066" target="_blank">Beyond the hype: Big data concepts, methods, and analytics</a> (2015, <i>International Journal of Information Management</i>, Vol. 35, No. 2, pp. 137–144) reviews definitions and analytics techniques of big data and discusses some future developments. The article begins with a chart showing an explosion of publications in the <i>ProQuest</i> database, which is quite similar to the chart in our JASIST publication "<a href="https://www.academia.edu/10031775/Big_data_bigger_dilemmas_A_critical_review" target="_blank">Big data, bigger dilemmas</a>". Both charts show that 2013 was the year when the term "big data" gained popularity:
<br />
<table>
<tbody>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif7ZMnaSqDNhzLLqjV1abv-E2AttDFnJOoiAEOsFs8uEjunzvFD214dbUFixQSbfiUqGxryOMuhX1mZPr3pHTjRWgD6-RTPo7cB1KdAgn6a90PFJD8KSUfCx9oDNV_wJJdOCs3VSdoZfA/s1600/1-s2.0-S0268401214001066-gr1.jpg" imageanchor="1" style="margin-bottom: 1em;"><img border="1" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif7ZMnaSqDNhzLLqjV1abv-E2AttDFnJOoiAEOsFs8uEjunzvFD214dbUFixQSbfiUqGxryOMuhX1mZPr3pHTjRWgD6-RTPo7cB1KdAgn6a90PFJD8KSUfCx9oDNV_wJJdOCs3VSdoZfA/s320/1-s2.0-S0268401214001066-gr1.jpg" width="320" /></a></td>
<td><div style="text-align: center;">
<b><i>"Beyond the hype ..."</i></b></div>
</td>
</tr>
<tr>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpZAzvm-EB7c_65cu1uvLV8C2HGwjzOoJu95B4al8QTt2P2vPiWsPadpvQYhJFOxV-SBYsId3Ht8IRbFXrSTkNAlrCOQSfSvIxEKN-_aj1I-tJjw3C9QuRwtgLxXeKONBtq3SJCxnpS1I/s1600/bd+trends+color.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpZAzvm-EB7c_65cu1uvLV8C2HGwjzOoJu95B4al8QTt2P2vPiWsPadpvQYhJFOxV-SBYsId3Ht8IRbFXrSTkNAlrCOQSfSvIxEKN-_aj1I-tJjw3C9QuRwtgLxXeKONBtq3SJCxnpS1I/s320/bd+trends+color.png" width="320" /></a></td>
<td><div style="text-align: center;">
<b><i>"Big data, bigger dilemmas..."</i></b></div>
</td>
</tr>
</tbody></table>
The paper cites Diebold's paper <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2202843" target="_blank">"A personal perspective on the origin(s) and development of “big data”: The phenomenon, the term, and the discipline"</a> to describe the origin of the term "big data":
<br />
<blockquote>
"... the term “big data” … probably originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-1990s, in which John Mashey figured prominently".</blockquote>
After summarizing aspects of big data that have been discussed many times elsewhere (volume, velocity, variety, veracity, etc.), the article provides a useful summary of the types of analytics that are common in big data research:
<br />
<ol>
<li>Text analytics
<ul>
<li>Information extraction
<ul>
<li>Entity recognition</li>
<li>Relation extraction</li>
</ul>
</li>
<li>Text summarization
<ul>
<li>Extractive (location and frequency of text units)</li>
<li>Abstractive (semantic information)</li>
</ul>
</li>
<li>Question answering</li>
</ul>
</li>
<li>Audio (speech) analytics
<ul>
<li>Transcript-based approach (large-vocabulary continuous speech recognition, LVCSR)</li>
<li>Phonetic-based approach</li>
</ul>
</li>
<li>Video analytics</li>
<li>Social media analytics
<ul>
<li>Content-based analytics</li>
<li>Structure-based analytics
<ul>
<li>Community detection</li>
<li>Social influence analysis</li>
<li>Link prediction</li>
</ul>
</li>
</ul>
</li>
<li>Predictive analytics</li>
</ol>
In conclusion, the paper argues for new techniques that would address such issues as the irrelevance of statistical significance, heterogeneity and computational efficiency in big data.<br />
<br />
<b>Predatory journal invitation</b> (posted 2016-03-03)<br />
<p>A few days ago I received an invitation to join an editorial board of a journal:<br />
<blockquote>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 12.8px; margin: 0in 0in 0.0001pt;">
Dear Dr. Inna Kouper,<br>
Wishes from The Scientific Pages!<br>
We are glad to announce the successful launch of <i>The Scientific Pages of Information Science</i>. It is my great pleasure inviting you to join our editorial board.<br>
...<br>
You have been invited because of your contribution and recognized works in this field. Upon acceptance, we request you to send your recent photograph, CV, short biography and research interests. The details are requested in order to create your profile page in our journal.<br>
</div></blockquote>
Of course, the grammar and style of the invitation were so off that there was no doubt this was some kind of <a href="https://en.wikipedia.org/wiki/Predatory_open_access_publishing">predatory publishing</a>. Nevertheless, I did some searching, and here is the result:<br />
<blockquote> "<a href="https://scholarlyoa.com/2016/03/01/new-open-access-publisher-launches-with-65-unneeded-journals/">New Open-Access Publisher Launches with 65 Unneeded Journals</a>"<br>
The publisher is currently spamming for editorial board members ... </blockquote>
At least this one was easy to spot. What if they get better and become harder to differentiate?</p>
<b>Pantheon 1.0: A manually verified dataset of globally famous biographies</b> (posted 2016-01-11)<br />
<br /><i>Scientific Data</i> has published a description of an interesting dataset: "<a href="http://www.nature.com/articles/sdata201575" target="_blank">Pantheon 1.0, a manually verified dataset of globally famous biographies</a>". This data collection effort contributes quantitative data for studying history, especially information about famous people and events.<div>
<br /></div>
<div>
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.nature.com/article-assets/npg/sdata/2016/sdata201575/images_hires/w926/sdata201575-f1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Data collection workflow" border="0" src="http://www.nature.com/article-assets/npg/sdata/2016/sdata201575/images_hires/w926/sdata201575-f1.jpg" height="269" title="" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Workflow diagram (Image from the paper)</td></tr>
</tbody></table>
The authors retrieved over 2 million records about famous ("globally known") individuals from Google's Freebase, narrowed the dataset down to individuals who have metadata in English Wikipedia, and then reduced it further to people who have records in more than 25 different languages in Wikipedia.</div>
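<div>
The narrowing steps described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual pipeline: the <code>Person</code> record, its field names, and the sample entries are hypothetical, and only the "more than 25 languages" threshold comes from the description above.</div>

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record layout; the real dataset uses its own schema.
    name: str
    in_english_wikipedia: bool
    language_editions: int


# Threshold taken from the post: records in more than 25 languages.
MIN_EDITIONS = 25


def select_globally_known(people, min_editions=MIN_EDITIONS):
    """Keep people with English Wikipedia metadata and > min_editions languages."""
    with_english = [p for p in people if p.in_english_wikipedia]
    return [p for p in with_english if p.language_editions > min_editions]


# Made-up sample records, for illustration only.
sample = [
    Person("A", True, 40),
    Person("B", True, 25),   # exactly 25 editions: not "more than 25"
    Person("C", False, 60),  # no English Wikipedia metadata
    Person("D", True, 26),
]

selected = select_globally_known(sample)
print([p.name for p in selected])
```

<div>
Note that the comparison is strict, so a person with exactly 25 language editions is dropped.</div>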
<div>
<br /></div>
<div>
Manual cleaning and verification includes a controlled vocabulary for occupations and popularity metrics (defined as the number of Wikipedia language editions, adjusted by age and pageviews).</div>
<div>
<br /></div>
<div>
The dataset is available for download at Harvard Dataverse <a href="http://dx.doi.org/10.7910/DVN/28201">http://dx.doi.org/10.7910/DVN/28201</a>. Another entertaining part is a visualization interface at <a href="http://pantheon.media.mit.edu/">http://pantheon.media.mit.edu</a> that lets you explore the data and answer questions like "Where were globally known individuals in Math born?" (21% in France) or "Who are the globally known people born within present day by country?". It turns out that Russia produced a lot of politicians and writers, while the US gave us many actors, singers and musicians.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNeyixAGbX9PXVNa45fCS48yRPJDTX48fwo2QB8e5fQf3mXYnwz0Zq0zkeKjITnGYc-UqtjhLvnW-yUoQe4p3mzSZfC6zji91wyiaSg7v9lHa3__Mg0uW7m7zOvW3ns3Hh6PYisCe1-AQ/s1600/pantheon_example.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNeyixAGbX9PXVNa45fCS48yRPJDTX48fwo2QB8e5fQf3mXYnwz0Zq0zkeKjITnGYc-UqtjhLvnW-yUoQe4p3mzSZfC6zji91wyiaSg7v9lHa3__Mg0uW7m7zOvW3ns3Hh6PYisCe1-AQ/s400/pantheon_example.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Globally known people born in the US (from http://pantheon.media.mit.edu/treemap/country_exports/US/all/-4000/2010/H15/pantheon)</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
<br /></div>
<b>Making progress in data sharing</b> (posted 2015-10-19, updated 2018-06-22)<br />
A few useful tips on making progress in data sharing from the blog post "<a href="http://www.developmentgateway.org/2015/09/15/data-sharing-access-language-context/" target="_blank">Data Sharing: Access, Language, and Context All Matter</a>":<br />
<ul>
<li>To make the global data system less fragmented and disorganized, create data portals with good human-centered designs and support users with varying levels of expertise.</li>
<li>JSON and XML are great, but humans read data too. These formats are critical to fueling innovation, but make sure CSVs are available as well.</li>
<li>Responsible data use demands proper attention to metadata. Document your datasets, and don't ignore ReadMe files when re-using data.</li>
</ul>
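To illustrate the second tip, here is a minimal Python sketch of publishing a CSV alongside a JSON file. It uses only the standard library and assumes the JSON is a flat list of records; the field names and values are made up for the example and do not come from the blog post.<br />

```python
import csv
import io
import json

# Made-up records standing in for a machine-readable JSON release.
json_text = (
    '[{"country": "KE", "year": 2015, "projects": 12},'
    ' {"country": "TZ", "year": 2015, "projects": 7}]'
)


def json_records_to_csv(text):
    """Convert a JSON array of flat objects into CSV text with a header row."""
    records = json.loads(text)
    # Use the union of all keys so records with missing fields still fit;
    # DictWriter fills absent values with an empty string by default.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = json_records_to_csv(json_text)
print(csv_text)
```

Nested JSON would need to be flattened first, which is one reason a ReadMe describing the columns matters for the human-readable copy as well.<br />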
<b>Valuable lessons from sharing and non-sharing of data</b> (posted 2015-09-17, updated 2016-03-03)<br/>
A vivid story from Buzzfeed "<a href="http://www.buzzfeed.com/bengoldacre/deworming-trials" target="_blank">Scientists Are Hoarding Data And It’s Ruining Medical Research</a>" describes two related cases. In one, researchers voluntarily shared their entire dataset, and the re-analysis found errors and miscalculations. In the other, neither the data nor any results from one of the largest drug trials were released for 7 years because the researchers feared criticism and kept double-checking their data. The details of each case are worth following up, but the author of the story comes to an important conclusion: we need to accept that science works through checks and corrections, stop unfair criticism and doubts about researchers' credibility, and start sharing data for better science, better knowledge, and ultimately, better-informed decisions that impact our lives:<br/><blockquote><br/><p class="sub_buzz_desc">And here is where I think the threads come together. The press releases on the reanalysis of the Miguel and Kremer deworming trial in Kenya will go live this week. Somewhere, I’m sure, people will attack or mock them for their errors. One way or another, I can’t believe they won’t feel bruised by the reanalysis. And that is where we have gone wrong. It’s not just naive to expect that all research will be perfectly free from errors, it’s actively harmful.</p><br/>There is a replication crisis throughout research. When subjected to independent scrutiny, results routinely fail to stand up. We are starting to accept that there will always be glitches and flaws. 
Slowly, as a consequence, the culture of science is shifting beneath everyone’s feet to recognise this reality, work with it, and create structural changes or funding models to improve it.</blockquote><br/>
<br/>
<b>Dataset: Roads and cities of 18th century France</b> (posted 2015-09-15, updated 2016-03-03)<br/>
An interesting dataset has been described in the <em>Scientific Data</em> journal and shared via the <em>Harvard Dataverse</em> repository - "<a href="http://dx.doi.org/10.7910/DVN/28674" target="_blank">Roads and cities of 18th century France</a>".<br/><blockquote>The database presented here represents the road network at the french national level described in the historical map of Cassini in the 18th century. The digitization of this historical map is based on a collaborative methodology that we describe in detail. This dataset can be used for a variety of interdisciplinary studies, covering multiple spatial resolutions and ranging from history, geography, urban economics to network science.</blockquote><br/>The repository page showed 268 downloads on Sept 15, 2015, so hopefully, some examples of data re-use will follow this publication.<br/>
<br/>
<b>Lessons from replication of research in psychology</b> (posted 2015-08-31, updated 2016-01-11)<br/>
<em>Science</em> magazine has published an article “<a href="http://www.sciencemag.org/content/349/6251/aac4716.full" target="_blank">Estimating the reproducibility of psychological science</a>”, which reports the first findings from 100 replications completed by 270 contributing authors. 
A quasi-random sample was drawn from three psychology journals: <em>Psychological Science</em> (PSCI), <em>Journal of Personality and Social Psychology</em> (JPSP), and <em>Journal of Experimental Psychology: Learning, Memory, and Cognition</em> (JEP:LMC). The replications were performed by teams and then independently reviewed by other researchers and reproduced by another analyst. The reproducibility was evaluated using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analyses of effect sizes. Some highlights from the results:<br/><ul> <li>35 replications showed a significant positive effect (p &lt; 0.05), compared to 97 original studies</li> <li>82 studies showed a stronger effect size in the original study than in the replication</li> <li>Effect size comparisons showed a 47.4% replication success rate</li> <li>39 studies were subjectively rated as successfully replicated</li></ul><br/>While some news about this publication reported failures in the test (e.g., <em>Nature’s</em> "<a href="http://www.nature.com/news/over-half-of-psychology-studies-fail-reproducibility-test-1.18248" target="_blank">Over half of psychology studies fail reproducibility test</a>"), the <em>Science</em> article emphasized the challenges of reproducibility itself and the care with which interpretations of successes and failures need to be made. The authors of the study pointed out that while replications produced weaker evidence for the original findings,<br/><blockquote>“It is too easy to conclude that successful replication means that the theoretical understanding of the original finding is correct. Direct replication mainly provides evidence for the reliability of a result. If there are alternative explanations for the original finding, those alternatives could likewise account for the replication. 
Understanding is achieved through multiple, diverse investigations that provide converging support for a theoretical interpretation and rule out alternative explanations.<br/><br/>It is also too easy to conclude that a failure to replicate a result means that the original evidence was a false positive. Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect. We conducted replications designed to minimize a priori reasons to expect a different result by using original materials, engaging original authors for review of the designs, and conducting internal reviews. Nonetheless, unanticipated factors in the sample, setting, or procedure could still have altered the observed effect magnitudes.”</blockquote>
<br/>
<b>Losing data from the National Centre for e-Social Science (NCeSS) portal</b> (posted 2015-08-10, updated 2016-04-18)<br/>
<p><strong>Submitted by Andy Turner</strong></p><p><strong>Edited by Inna Kouper</strong></p><p>The National Centre for e-Social Science (NCeSS) was a UK-based program established around 2004 to stimulate the development of digital tools and services for social scientists. Around 2008 it adopted <a href="https://sakaiproject.org/">Sakai</a> as a system for communicating, developing information, and storing and managing access to data. NCeSS was configured with a “hub” at the University of Manchester and a network of research nodes across the UK (see the <a href="http://www.esrc.ac.uk/research/research-methods/dsr.aspx">Digital Social Research page</a> for the list of nodes, many now archived).</p><p><a href="http://www.geog.leeds.ac.uk/people/a.turner">Andy Turner</a>, a researcher from the University of Leeds, worked on a project to develop demographic models for a geographical simulation system. 
The project, abbreviated as MoSeS (Modelling and Simulation for e-Social Science), was one of the first phase research nodes of NCeSS. Some information is available on <a href="http://www.geog.leeds.ac.uk/people/a.turner/projects/MoSeS/">Andy’s page</a>, but many links from there are now unavailable. Andy explains why:</p><blockquote><p>“In 2011 the NCeSS Sakai Portal went off-line following a server failure and because there were no more resources for replacing the server. All the data was stored in a database on a National Grid Service server which for some reason had a catastrophic failure. All that remained for me to salvage were some backup database dumps, which fortunately also contained the portal front end configuration which enabled me with the help of my local IT team to get a database reader set up and a version of the NCeSS Sakai Portal working almost, but not quite as it had been. This was good enough to get some data out, but my local IT were not willing to make the system accessible again for security reasons. As a consequence of the problems some detailed social simulation model run results were lost. These would take a lot of time and effort to reproduce as they were generated on a fairly massive computer, which we got access to thanks to the UK-CERN collaboration <a href="http://www.gridpp.ac.uk/">GridPP</a> and my collaborator Tom Doherty from the University of Glasgow. The work with Tom was undertaken as part of the <a href="https://www.jisc.ac.uk/">Jisc </a>funded project NeISS (a project to establish a National e-Infrastructure for Social Simulation), which was by the time of the server failure supporting the NCeSS Sakai Portal.</p><p>In theory, sufficient metadata has been stored from the simulation runs so that the results can be readily produced, but this is unlikely to transpire as the results were really only academically interesting as their inherent uncertainties were too great to make them of practical use. 
Anyway, I have given up on all that for now. I have moved on, but at the time it was rather painful seeing what probably amounted to almost three years of my effort turn to nothing. I may still get something more out of it in the long run because of the learning involved in this process. Explaining what happened to my academic superiors who desperately wanted research outputs was hard. One day I may return to research that pushes the boundaries of what we can and can’t do, but I know that is risky as failure is not tolerated well in academia."</p></blockquote><p>Reflecting on the importance of preservation and curation of data, Andy writes:</p><blockquote><p>“Preservation and curation are not easy. Sustaining research effort that may one day generate useful data and software is also not easy, especially when the goal is aspirational and probably quite a long way off and the steps are necessarily baby steps to begin with. In NCeSS, issues of sustainability were discussed from early on for each NCeSS research project and for the organisation itself. Documentation about this from 2008 was stored on the portal and so is now also inaccessible…</p><p>The soft learning experiences of failure and how these relate to sustainability and the importance of promoting collaboration, re-use and enrichment in the research process are key, but where can these be written up in the academic literature? The blog might seem ephemeral, but these days they can be captured by a directed <a href="https://archive.org/web/">Internet Archive WayBack Machine</a> and preserved for the future.”</p></blockquote><p>Our data stories blog is one of the places where such discussions can be recorded, and while we don’t have a solid sustainability plan, we do keep external backup copies of the stories. 
If you have stories similar to Andy’s to share, <a href="http://datastories.jiscinvolve.org/wp/submit-story/">use our form</a>, send them directly to datastories@dcc.ac.uk or register on the website and become a contributor.</p>
<br />
<b>Study: Biomedical data sharing and reuse</b> (posted 2015-07-20, updated 2016-01-11)<br />
A recent publication in PLoS ONE, <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129506" target="_blank">"Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff"</a>, surveyed staff of the Intramural Research Program at the US National Institutes of Health (NIH) with regard to data management, data sharing, and data re-use. The authors received 190 responses and analyzed 135 (scientific and clinical staff). Below are the highlights from their findings:<br />
<ul>
<li>~60% of respondents rated relevance of data re-use as high, while ~15% rated it as low</li>
<li>~25% rated their expertise in re-using data as high, while ~45% rated it as low</li>
<li>~61% reported that they had <strong>never</strong> uploaded a dataset into a repository, while ~71% said they had shared data directly with another researcher</li>
<li>~30% indicated that it took them more than 10 hours to prepare data for sharing</li>
<li>Only 20 respondents provided reasons for not sharing data and their reasons were pretty scattered (see image below):</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7yPBrQ2ZbRn0PBW2INpFlgcPTAZrLnHCKJBjvFeHvK9D-1VcsjKGksQmPAqzeiXkEinaUmyufoo2l5dBISh7TERTuB6aDbaDwzYae3RKai6FYZz7ZRFksC6Pd5WMN0pQT83nxZxANwJA/s1600/journal.pone.0129506.t016.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7yPBrQ2ZbRn0PBW2INpFlgcPTAZrLnHCKJBjvFeHvK9D-1VcsjKGksQmPAqzeiXkEinaUmyufoo2l5dBISh7TERTuB6aDbaDwzYae3RKai6FYZz7ZRFksC6Pd5WMN0pQT83nxZxANwJA/s320/journal.pone.0129506.t016.PNG" width="320" /></a></div>
<br />
Image from the study (t016), responses to "...the reason(s) for not sharing your data"<br />
<br />
The data from this study is <a href="http://dx.doi.org/10.6084/m9.figshare.1288935" target="_blank">available on Figshare</a>, but it is not the full survey dataset; it's a subset that supports the results of this publication. And for some reason it doesn't contain free text responses to the non-sharing question. It's always informative to see what people say beyond the provided standard categories (that is usually the most interesting story in my mind). Perhaps there were no free text responses.<br />
<br />
<strong>Citation:</strong> <em>Federer LM, Lu Y-L, Joubert DJ, Welsh J, Brandys B (2015) Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS ONE 10(6): e0129506. doi:10.1371/journal.pone.0129506</em><br />
<br />
<b>Archaeology meets modern scanning technology for preservation and re-use</b> (posted 2015-07-15, updated 2018-06-22)<br />
<strong>Submitted by Annemiek van der Kuil, edited by Inna Kouper</strong><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJP2Xz-kv4WqcF0h7wfCUkIQR3Dp0l0qQ-JV5AphpZLvXDe5AR6kMqCz7MIRcBfDN_KHv8qFjZ1q9cyCPngKpqOW3lYxvyMuMTUXag75GpVQACU0RiulTnzo5vDnhKdyex6w61QqYJQo8/s1600/beads+story.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJP2Xz-kv4WqcF0h7wfCUkIQR3Dp0l0qQ-JV5AphpZLvXDe5AR6kMqCz7MIRcBfDN_KHv8qFjZ1q9cyCPngKpqOW3lYxvyMuMTUXag75GpVQACU0RiulTnzo5vDnhKdyex6w61QqYJQo8/s320/beads+story.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small; text-align: start;">Image from "The strange case of 60 frothy beads: puzzling Early Iron Age glass beads from the Netherlands" conference paper by D.J. Huisman et al.</span></td></tr>
</tbody></table>
<a href="http://www.citg.tudelft.nl/nl/over-faculteit/afdelingen/geoscience-engineering/sections/geo-engineering/staff/academic-staff/dr-ir-djm-ngan-tillard/">Dr. Dominique Ngan-Tillard</a>, a professor at the Faculty of Civil Engineering and Geosciences at Delft University of Technology, the Netherlands, has deposited <a href="http://data.3tu.nl/repository/uuid:a4162b99-5110-4b89-bcc8-68985e7de32f">a dataset</a> into the <a href="http://data.3tu.nl/repository/">3TU.Datacentrum repository</a> that contains tomography scans of early Iron Age glass beads found during the archaeological excavations in the Netherlands.<br />
<br />
The dataset supports a conference publication by Dr. Ngan-Tillard and others <a href="https://ees.kuleuven.be/isa2012/proceedings/ISA%202012%20Huisman%20et%20al.%202.pdf">“The strange case of 60 frothy beads: puzzling Early Iron Age glass beads from the Netherlands”</a>. The micro-CT scans helped to identify gas bubbles and mineral and metal inclusions in the glass beads, which allowed the researchers to conclude that “the Zutphen glass beads are the result of local, inexpert, reworking of imported glass objects” (p. 231, conference paper).<br />
<br />
In addition to the in-depth analysis of the beads’ structure, the scans serve as a form of virtual preservation of the ornaments. Stored in a data repository and made publicly available, they can help other archaeologists, as well as material scientists and museums in their research and educational activities. In the future 3D prints of the ornaments can be produced for a better understanding of the art of making glass and jewels.<br />
<br />
According to Dr. Ngan-Tillard’s comment on the 3TU.Datacentrum website, storing digital collections of archaeological remains together with their meta-data and interpretation will help advance both arts and research and create more challenges for our knowledge.<br />
<br />
Watch a short <a href="http://data.3tu.nl/downloads/uuid_a4162b99-5110-4b89-bcc8-68985e7de32f/bead%201%20and%20bead%202-v3.mp4">video about frothy beads</a> or see the full story at <a href="http://datacentrum.3tu.nl/en/researchers-about-3tudatacentrum/showcase-ngan-tillard/">http://datacentrum.3tu.nl/en/researchers-about-3tudatacentrum/showcase-ngan-tillard/</a>DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-43853140259623192422015-07-10T01:00:00.000-07:002018-06-22T07:48:08.539-07:00Withholding data - questionable science or scientific misconductNicole Janz <a href="http://blogs.lse.ac.uk/impactofsocialsciences/2015/07/03/data-secrecy-bad-science-or-scientific-misconduct/">writes in the LSE Impact of Social Sciences blog</a> that not sharing one’s research data should be considered scientific misconduct. Treating it that way would help fight data secrecy and establish better research practices. A few key points from the post:<br />
<ul><br />
<li>Many researchers don’t share data even if they promise to do so - see, for example, Krawczyk and Reuben’s 2012 study “<a href="http://www.ncbi.nlm.nih.gov/pubmed/22686633">(Un)available upon request: field experiment on researchers' willingness to share supplementary materials</a>” [see also “<a href="http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/">How and why researchers share data (and why they don’t)</a>”]</li>
<br />
<li>Definitions of scientific misconduct usually include fabrication, falsification or plagiarism. Sharing research data provides evidence that there was no fabrication or falsification involved, hence it’s crucial in avoiding misconduct allegations and demonstrating proper conduct.</li>
<br />
<li>A broader definition of scientific misconduct includes departure from the accepted standards and practices of a research community. As many research communities strive to be open with regard to the evaluation of their knowledge claims, obligations to share data can be seen as part of the research standards and practices. Hence, data secrecy can be considered a questionable research practice or misconduct.</li>
</ul>
<br />
The continuum of research practices described by Janz ranges from the gold standards of <a href="http://opendatahandbook.org/guide/en/what-is-open-data/">open data</a>, open code, <a href="http://www.theguardian.com/science/blog/2013/jun/05/trust-in-science-study-pre-registration">pre-registration</a> and version control, through questionable research practices such as p-hacking, sloppy statistical methods and other manipulations, to withholding data, and finally to misconduct: fabrication, falsification, and plagiarism.<br />
<br />
<a href="http://blogs.lse.ac.uk/impactofsocialsciences/files/2015/07/Slide1-data-secrecy.jpg"><img alt="From Janz' LSE Impact of Social Sciences blog post: Research practices continuum" class="" src="https://blogs.lse.ac.uk/impactofsocialsciences/files/2015/07/Slide1-data-secrecy.jpg" height="324" width="433"></a> From Janz' LSE Impact of Social Sciences blog post: Research practices continuumDIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0tag:blogger.com,1999:blog-6219018247771781649.post-53305045606475711262015-07-07T02:38:00.000-07:002015-12-07T07:52:36.715-08:00Climategate study - simpler methods into the mix<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://images.huffingtonpost.com/2009-12-18-HP1009CLIMATEGATE.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Joe Denier gets a new &quot;climategate&quot; hat" border="0" src="http://images.huffingtonpost.com/2009-12-18-HP1009CLIMATEGATE.jpg" height="136" title="Joe Denier gets a new &quot;climategate&quot; hat" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image from <a href="http://www.huffingtonpost.com/shan-wells/lord-of-the-climategate_b_397371.html" target="_blank">a HuffPo article by Shan Wells</a></td></tr>
</tbody></table>
<p>Climategate was a controversy that unfolded in November 2009 after thousands of emails and files from the Climatic Research Unit (CRU) at the University of East Anglia (UEA) were published online without the owners' consent. Climate change skeptics used the content of the emails to argue that scientists had manipulated data to support the case for human responsibility for climate change. Several investigations didn't find any scientific misconduct at the CRU, but the reports called for opening up access to research data and more transparency in methods and communication of results (see <a href="https://en.wikipedia.org/wiki/Climatic_Research_Unit_email_controversy">Climatic Research Unit email controversy in Wikipedia</a>).</p><br>
<p>Controversies are always hard to sort through, but they present an interesting research case for those like me who are interested in discourse, language, and media. A recent study "<a href="http://www.emeraldinsight.com/doi/pdfplus/10.1108/IntR-05-2014-0130" target="_blank">The creation of the climategate hype in blogs and newspapers: mixed methods approach</a>" (paywalled) looked at the Climategate controversy and compared discussions in blogs and newspapers.</p>
<p>Newspaper and blog data were collected from the LexisNexis Academic database using the search term ‘climategate’. Two methods were used to analyze the data: a) <b>ARIMA</b> (Autoregressive Integrated Moving Average) modeling to model the daily frequencies of postings and to examine the mutual influence of newspapers and blogs, and b) <b>semantic co-word maps</b> of blogs and newspaper headlines to compare framings of climategate.</p>
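<p>To make the co-word idea concrete, here is a minimal sketch of the counting step that usually sits underneath a co-word map: how often two words share a headline. It is not the procedure from the paper. The headlines, the stopword list, and the function names are all invented for illustration.</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical headlines, invented for illustration only (not data from the study).
HEADLINES = [
    "Climategate emails raise questions for scientists",
    "Scientists defend climate data after climategate",
    "Climate data and emails under review",
]

# A tiny, ad hoc stopword list (an assumption; real studies use fuller lists).
STOPWORDS = {"for", "after", "and", "under", "the", "a", "of"}


def tokenize(headline):
    """Lowercase a headline, drop stopwords, and keep each word once, sorted."""
    return sorted({w for w in headline.lower().split() if w not in STOPWORDS})


def coword_counts(headlines):
    """Count how many headlines each unordered word pair appears in together."""
    counts = Counter()
    for headline in headlines:
        # Sorted tokens give every pair a single canonical (alphabetical) order.
        for pair in combinations(tokenize(headline), 2):
            counts[pair] += 1
    return counts


counts = coword_counts(HEADLINES)
for pair, n in counts.most_common(3):
    print(pair, n)
```

<p>The pair counts are the edge weights of a co-word network. Drawing the map and comparing framings between blogs and newspapers would be separate steps on top of this.</p>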
<p>The results of the modeling seemed a bit confusing: they showed a significant link between a high number of blog posts and a large change in newspaper articles (either an increase or a decrease) on the same day. (I'd really like to see simple descriptive statistics of posts per day, etc. Also, a pre-print where all the images and tables are at the end of the article is very hard to read.) At the same time, an increase in newspaper articles on one day had no effect on the number of blog postings on the next day. The article concludes that blogs influenced newspapers, but not the other way around. The semantic maps showed (predictably) that the blogs used more informal language and framed the topics more negatively, while the newspapers were more formal and stayed more neutral. Both blogs and newspapers picked up similar sub-topics, such as climate change, scientists, and so on, although the word "climategate" occurred more often in blogs.</p>
<p>Several thoughts and questions after reading this study, which is interesting but methodologically a bit too complicated for such simple variables and questions:</p>
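<p>As a rough illustration of the simpler descriptive look I have in mind, the sketch below summarizes daily counts and computes a plain lagged Pearson correlation between two series. This is deliberately not ARIMA and not the paper's analysis. The daily counts are made up, and the function names are mine.</p>

```python
from statistics import mean, median

# Hypothetical daily counts, invented for illustration (not data from the study).
blogs = [2, 5, 9, 4, 3, 7, 1]
news = [1, 2, 6, 8, 3, 2, 5]


def summarize(series):
    """Basic descriptive statistics for a series of daily counts."""
    return {
        "total": sum(series),
        "mean": mean(series),
        "median": median(series),
        "max": max(series),
    }


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a non-negative lag in days."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


print("blogs:", summarize(blogs))
print("news:", summarize(news))
print("blogs today vs. news tomorrow:", lagged_correlation(blogs, news, 1))
print("news today vs. blogs tomorrow:", lagged_correlation(news, blogs, 1))
```

<p>Comparing the two lagged values gives a first, informal hint about which medium tends to move first. A correlation like this says nothing about causation and ignores trends and autocorrelation, which is exactly what ARIMA-style models are meant to handle.</p>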
<ul>
<li>How different are "traditional" and "new" media nowadays? They may be still different in their language style, but what about the speed of publication, audiences, contributors, and so on? The headlines don't get to the differences in main posts and comments either.</li>
<li>The word "climategate" did originate in a blog, but it was a journalist who picked it up and popularized it via a newspaper-hosted blog (see <a href="http://blogs.telegraph.co.uk/news/jamesdelingpole/100018246/climategate-how-the-greatest-scientific-scandal-of-our-generation-got-its-name/">Climategate: how the 'greatest scientific scandal of our generation' got its name</a>). Does it change the conclusion that "blogs were independent of the attention in newspapers" (p. 20) if journalists write for both media?</li>
<li>It would've been helpful to establish the actual sequence of events via an additional documentary analysis. The paper argues that the word "climategate" originated in blogs, which promoted the hype. But according to the Wikipedia article, news about the email release was published almost simultaneously in blogs and newspapers - on November 20, 2009. So is the hype about the word or about other, more nuanced exchanges and actions as well?</li>
<li>Three blogs received links to the leaked documents. It seems that it was intentional - the blogs were skeptical of climate change. Did it matter for how the hype originated and developed? Again, what is the connection between the actual controversy and its naming as climategate?</li>
<li>How can the link between the large number of blog posts and the decrease in newspaper articles be explained? More quotes and examples of interactions and influences between blogs and newspapers could be very helpful in illustrating all the findings.</li>
</ul>
<p>Overall, it seems that studies of controversies benefit more from careful tracing of words and actor connections than from complicated modeling that is confusing and not very eye-opening.</p>DIKWhttp://www.blogger.com/profile/17704613496486511504noreply@blogger.com0