Mar 29, 2018

Cybersecurity Curricular Guideline

A report released by the Task Force on Cybersecurity Education provides a comprehensive framework and guidelines for post-secondary cybersecurity education (pdf). According to a presentation by one of the task force co-chairs, Diana Burley, it was a huge effort involving many consultations, much travel, and many experts. The report also went through an endorsement process with four major computing organizations: the ACM, the IEEE, the Association for Information Systems Special Interest Group on Security (AIS SIGSEC), and the International Federation for Information Processing (IFIP) Technical Committee on Information Security Education. The resulting report can hopefully help define cybersecurity as a discipline, describe the proficiency needed by cybersecurity experts, and connect academic programs with industry needs. Ultimately, bringing some common understanding and standardization into cybersecurity education should improve the education itself and help fill the shortage of security professionals.

In terms of definition, cybersecurity involves the creation, operation, analysis, and testing of secure computer systems. The report holds that while cybersecurity is an interdisciplinary area that includes law, policy, human factors, ethics, and risk management, it is fundamentally a computing-based discipline. One of the challenges in developing the curricular guidelines was accommodating the large variability of cybersecurity programs: depending on the department or program in which they are created, they can differ significantly in content and emphasis. The guidelines are therefore designed to allow some flexibility through the notion of a disciplinary lens. A program should be built on a solid computer science foundation, with input from computer engineering, software engineering, information systems, and information technology, and should include cross-cutting concepts such as confidentiality, integrity, risk, and systems thinking.

The report shows a serious effort to be comprehensive and yet flexible. It includes eight knowledge areas: data security, software security, component security, connection security, system security, human security, organizational security, and societal security. Each area comprises several units, along with described essentials and learning outcomes. There is some overlap between areas and units, which again helps to accommodate the variety of existing education efforts. The report also includes summary tables that provide a quick overview of each area.

It is nice to see that ethics is a significant and explicit component of the curriculum. While it doesn't remove the challenge of educating technical professionals on ethics and human behavior, it certainly provides space for discussions. More information about the guideline and the task force is available online.

Feb 14, 2018

Chomsky and Foucault on human nature and power

Notes from a televised debate between N. Chomsky and M. Foucault in 1971 (video and transcript).

Chomsky begins with examples from linguistics to illustrate the notion of "innate structures". Children are successful in learning language because they can use "innate language" or "instinctive knowledge" to transform the limited data they are exposed to into organized knowledge. This instinctive knowledge, which allows children to build complex knowledge structures from partial data, is a fundamental constituent of human nature. Such a constituent (a collection of innate organizing principles) must also be present in other domains, such as human cognition, behavior, and interaction. This is what Chomsky refers to as human nature.

Foucault mistrusts the notion of human nature: it is one of those concepts that, while not strictly scientific, have the ability to "designate, delimit and situate" certain types of discourses. For Chomsky it is fine to start with the concept of human nature as somewhat mystical (similar to gravitational forces or other scientific concepts) and later explain it through physical components (e.g., neural networks). Chomsky describes his approach as looking at the early stages of scientific thinking (great thinkers, more specifically) and understanding how they were able to arrive at concepts and ideas not available to anybody before.

Foucault makes a distinction between the individual attribution of a discovery and the collective production of knowledge, which can be referred to as "tradition", "mentality", or "modes". The former has been highly valued, while the latter is usually viewed negatively. Another distinction is between knowledge as human activity and truth. The latter may be hidden from humans, but it will eventually be unveiled. Attribution and the relation to truth are interconnected. Throughout history we see examples of how the subject of truth (the individual revealing it) has to overcome myths and common thought; he has to "discover". What if this close relation of subject to truth is an effect of knowledge? What if truth is a complex, non-individual formation? Can we replace individuals in the production of knowledge?

This position highlights a difference between Chomsky's and Foucault's approaches to creativity. According to Foucault, Chomsky had to introduce the speaking subject into linguistics because language has commonly been studied as a system with a collective value. In language we have a few rules and elements and an unknown system of totalities that can be brought to light by individuals. In the history of knowledge it is similar, but one has to overcome the dominance of individual creativity to show that there are rules and elements that can be transformed without explicitly passing through an individual.

Throughout the debate both scholars touch on many concepts from science and politics. Some of them are described below to highlight their differences:

  • Domain (focus). Chomsky: language. Foucault: knowledge.
  • Human nature. Chomsky: comprised of innate structures that allow for learning and arriving at complex knowledge based on partial information. Foucault: a historical construct that can organize knowledge, but can also delimit how we see human behavior.
  • Creativity. Chomsky: a common human act of thinking about a new situation, describing it, and acting in it. Foucault: an individualistic act that has been emphasized throughout history without looking at the general communal rules behind it.
  • Freedom. Chomsky: a limited number of rules with infinite possibilities of application. Foucault: a "grille" of many determinisms that affects how we arrive at knowledge and understanding.
  • Ideal model of society. Chomsky: a federated, decentralised system of free associations, incorporating economic as well as other social institutions. Foucault: no such model can be proposed; it is more important to expose the power that controls society, especially institutions such as education and medicine that appear neutral.

Somewhere in the middle, Chomsky also tried to bring their differences closer:

CHOMSKY: ... That is, I think that an act of scientific creation depends on two facts: one, some intrinsic property of the mind, another, some set of social and intellectual conditions that exist. And it is not a question, as I see it, of which of these we should study; rather we will understand scientific discovery, and similarly any other kind of discovery, when we know what these factors are and can therefore explain how they interact in a particular fashion.

While Foucault didn't completely agree with that, the two were still building upon each other's ideas:

FOUCAULT: ... ultimately we understand each other very well on these theoretical problems. On the other hand, when we discussed the problem of human nature and political problems, then differences arose between us. And contrary to what you think, you can’t prevent me from believing that these notions of human nature, of justice, of the realisation of the essence of human beings, are all notions and concepts which have been formed within our civilisation, within our type of knowledge and our form of philosophy, and that as a result form part of our class system; and one can’t, however regrettable it may be, put forward these notions to describe or justify a fight which should - and shall in principle – overthrow the very fundaments of our society. This is an extrapolation for which I can’t find the historical justification.

Apr 21, 2017

March for Science - to march or not to march

Apparently, there is a big controversy around the March for Science (M4S) that will take place this Saturday, April 22, 2017, in DC and in many other cities around the US.

The main stated goal of the march is to support publicly funded and publicly communicated science as a pillar of human freedom and prosperity. I was set on going because it seems that nowadays science needs support, because regardless of whether you believe in such a thing as objective truth-seeking (I have my doubts), scientists can and should be political in defending their institutions and their role in public life. But mostly I was set on going because we need to resist anti-intellectualism and assaults on reason. With my own reasons more or less clear, I didn't pay much attention to the discussion around the march. And I bought a t-shirt, even though merchandising around protest movements seems out of place. Perhaps that's because this march is not a protest or a social justice movement.

Many people feel strongly that the march is wrong. That they were excluded from planning and organizing. Most importantly, that the march marginalizes non-white, non-male scientists and disregards diversity. That it is a microcosm of liberal racism and that march organizers pushed out those who argued for inclusiveness and intersectionality. The controversy is scattered across mass and social media, but to summarize: one side (the organizers) is complicit in making the march a watered-down, non-political "celebration of science". The other side (the #MarginSci-ers) perceives the march as a social justice movement and wants the message of diversity (which applies to any context of American life) to be reinforced through this movement as well. An interesting analysis of the march's diversity discourse shows how the organizers shifted their position with regard to diversity, thereby conforming to existing stereotypes and the dominant discourse:
Unfortunately, through various miscommunications, including from the co-chairs and other key members of the MfS committee, the MfS audience has been primed to reinforce the established discourse about science. It took the better part of two months of constant lobbying and external pressure from minority scientists for the MfS organisers to finally reverse their stance. The fourth diversity statement finally states that science is political. At the same time, more recent media interviews that position diversity as a “distraction” undermine this stance.
In a sense, the controversy is good. It highlights gaps in a movement and could potentially help develop a robust program and action plan. But what is this movement? Upon reading the history of its organization, the march seems more like a top-down attempt to organize and contain than a grassroots protest and demand for change. It's being done professionally, with attempts to control the message and the goals. Is "celebration" enough to ensure change? Do I need to celebrate science or to improve the mutual relationship between science and society? Are we mobilizing only because we want public funding and therefore need to "educate" the public and policy-makers?

There is a high probability that, with goals of celebration, connections, understanding, and outreach, M4S will follow the #Occupy and Women's March movements: much enthusiasm but no action, due to the lack of a clear vision and strategies for change. A strong movement should have strong demands, which can then translate into specific legislation and policies. For example,

  • Equal pay and opportunities in science and research
  • Strong science education across all states
  • Protections for whistle-blowers and government scientists from political repressions
  • No marketization of science and education
  • Exposing and dismantling the military-industrial-scientific complex

Feb 21, 2017

U.S. House, Indiana District 9 General Election 2010-2016 Visualization

This is my first attempt to create a choropleth map - a map that visualizes measurements by shading geographic regions. I used election results data from the general elections of US House representatives for Indiana Congressional District 9 in 2010, 2012, 2014, and 2016. The maps below represent the percentage of people who voted for the Democratic Party candidates (Baron P. Hill in 2010, Shelli Yoder in 2012 and 2016, and William Bailey in 2014).

The process was tedious, but straightforward:

  • Find data and get it into appropriate format (some manual copying from PDF was needed)
  • Calculate statistics needed for mapping (here percent voting for Democrats within county)
  • Get geographic (shapefile) data
  • Combine stats and geographic data
  • Generate choropleth map

This tutorial on creating maps with R and this vignette about the tmap package were very helpful.
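The statistics and shading steps above can be sketched in a few lines. This is a minimal illustration in Python (the post itself used R with the tmap package), and the county names and vote counts below are invented for demonstration, not the real District 9 returns:

```python
# Sketch of the "calculate statistics" and "generate choropleth" steps:
# compute percent Democratic per county, then assign each county to a
# shade bin as a choropleth tool would. Data here is hypothetical.

def percent_democratic(returns):
    """Map county -> percentage of total votes cast for the Democratic candidate."""
    return {
        county: 100.0 * dem / total
        for county, (dem, total) in returns.items()
        if total > 0  # skip counties with no reported votes
    }

def shade_bin(pct, breaks=(20, 35, 50, 65)):
    """Assign a percentage to one of five shade bins (0 = lightest)."""
    return sum(pct >= b for b in breaks)

# Hypothetical returns: county -> (Democratic votes, total votes)
returns_2012 = {
    "Monroe": (30000, 52000),
    "Morgan": (8000, 28000),
    "Orange": (4000, 8200),
}

shares = percent_democratic(returns_2012)
bins = {county: shade_bin(p) for county, p in shares.items()}
```

Joining `bins` to the shapefile geometries (step 4) is then just a merge on the county name, which is what tmap does internally when given a data column.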

A quick analysis of the District 9 US House elections over time shows that some counties (e.g., Monroe County) vote strongly for Democrats, while others (e.g., Morgan and Orange Counties) are much weaker. In 2012, though, Orange, Washington, Harrison, and some other counties suddenly had nearly half of their voters voting for Democrats. The turnout in 2012 was higher than in 2010, but comparable to 2016, when Shelli Yoder was the candidate again. 2012 was also the year when only two candidates ran, so maybe we need to look at the other candidates and how they draw votes. More data and more analysis are needed.

Feb 13, 2017

Data quality - a short overview


A short overview of data quality definitions and challenges in support of Love Your Data week #lyd17 (February 13 - 17, 2017). The main theme is "Data Quality", and I was part of preparing the daily content. Many of the aspects discussed below are elaborated through stories and resources for each day on the LYD website: Defining Data Quality; Documenting, Describing, Defining; Good Data Examples; Finding the Right Data; Rescuing Unloved Data.

Data quality is the degree to which data meets the purposes and requirements of its use. Good data, therefore, is data that can be used for the task at hand even if it has some issues (e.g., missing values, poor metadata, value inconsistencies). Data that has errors, is hard to retrieve or understand, or has no context or traces of where it came from is generally considered bad.

Numerous attempts to define data quality over the last few decades have relied on diverse methodologies and identified multiple dimensions of data or information quality (Price and Shanks, 2005). The importance of data quality is recognized in business and commercial data warehousing (Fan and Geerts, 2012; Redman, 2001), in government operations (Information Quality Act, 2001), and by international agencies involved in data-intensive activities (IMF, 2001). Many research domains have also developed frameworks to evaluate the quality of information, including decision, measurement, test, and estimation theories (Altman, 2012).

Attempts to develop discipline-independent frameworks have resulted in several models, including models that define quality as data-related versus system-related (Wand and Wang, 1996), as product and service quality (Kahn, Strong and Wang, 2002), as syntactic, semantic, and pragmatic dimensions (Price and Shanks, 2005), and as user-oriented and contextual quality (Dedeke, 2000). Despite these many attempts, discipline-independent data quality frameworks have not been widely adopted, and more frameworks continue to appear. Several systematic syntheses have compared existing frameworks, only to point out the complexity and multidimensionality of data quality (Knight and Burn, 2005; Batini et al., 2009).

Data / information quality research grapples with the following fundamental questions (Ge and Helfert, 2007):

  • how to assess quality
  • how to manage quality
  • what impact quality has on organizations

The multitude of definitions, frameworks, and contexts in which data quality is used demonstrates that making data quality a useful paradigm is a persistent challenge. Addressing it could benefit from establishing a dynamic network of researchers and practitioners in the area of data quality and from developing a framework that is general and yet flexible enough to accommodate highly specific attributes and measurements from particular domains.

Each dimension of data quality, such as completeness, accuracy, timeliness, or consistency, poses its own challenges.

Completeness, for example, is the extent to which data is not missing and is of sufficient breadth and depth for the task at hand (Kahn, Strong and Wang, 2002). If a dataset has missing values due to non-response or errors in processing, there is a danger that the representativeness of the sample is reduced and thus inferences about the population are distorted. If the dataset contains inaccurate or outdated values, problems with modeling and inference arise.
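As a concrete illustration of the completeness dimension, here is a minimal sketch in Python that scores a tabular dataset by the fraction of non-missing cells. The field names and survey records are invented for demonstration; real completeness assessments would also weigh which fields matter for the task at hand:

```python
# Score completeness as the fraction of (record, field) cells that are
# present and non-empty. A score of 1.0 means no missing values.

def completeness(records, fields):
    """Fraction of cells across all records/fields that are filled in."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # an empty dataset is vacuously complete
    filled = sum(
        1
        for rec in records
        for f in fields
        if rec.get(f) not in (None, "")
    )
    return filled / total

# Hypothetical survey data with non-response in two records
survey = [
    {"age": 34, "income": 52000, "zip": "47401"},
    {"age": None, "income": 61000, "zip": "47460"},
    {"age": 29, "income": None, "zip": ""},
]

score = completeness(survey, ["age", "income", "zip"])  # 6 of 9 cells filled
```

A score like this is only a starting point: it says nothing about whether the missingness is random or systematic, which is what actually threatens representativeness.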

As data goes through many stages of the research lifecycle, from collection and acquisition through transformation and modeling to publication, each stage creates additional challenges for maintaining its integrity and quality. In one of the most recent attempts to discredit climate change studies, for example, the authors of the study were blamed for not following the NOAA Climate Data Record policies that maintain standards for documentation, software processing, and access and preservation (Letzter, 2017). This brings out possibilities for further study:
  • How does non-compliance with policies undermine the quality of data?
  • What role does scientific community consensus play in establishing the quality of data?
  • Should quality management efforts focus on improving the quality of data at every stage or the quality of procedures so that possibilities of errors are minimized? 
Another aspect of data quality that complicates a formalized treatment of the initial dimensions is that data is often heterogeneous and can be applied in varied contexts. As pointed out above, data quality frameworks and approaches are being developed in business, government, and research contexts, and quality solutions have to consider structured, semi-structured, and unstructured data and their combinations. Most previous data quality research has focused on structured or semi-structured data. Additionally, the spatial, temporal, and volume dimensions of data contribute to quality assessment and management.

Madnick et al. (2009) identify three approaches to data quality solutions: a technical or database approach, a computer science / information technology (IT) approach, and a digital curation approach. Technical solutions include data integration and warehousing, conceptual modeling and architecture, monitoring and cleaning, provenance tracking, and probabilistic modeling. Computer science / IT solutions include assessments of data quality, organizational studies, studies of data networks and flows, establishment of protocols and standards, and others. Digital curation involves paying attention to metadata, long-term preservation, and provenance.

Most likely, some combination of the above is the best approach. Quality depends on how data was collected as well as on how it was subsequently stored, curated, and made available to others. Data quality is a responsibility shared between data providers, data curators, and data consumers. While data providers can ensure the quality of their individual datasets, curators help with consistency, coverage, and metadata. Maintaining current and consistent metadata across copies and systems also benefits those who intend to reuse the data. Data and software documentation is another aspect of data quality that cannot be solved technically and needs a combination of organizational and information science solutions.

References and further reading: