May 30, 2013

Notes on RDF/Linked data

A few clarifying points for and against linked data / RDF from a couple of interesting resources.

The Big Data journal published an article, "What Do RDF and SPARQL bring to Big Data Projects?". The article provides a concise yet very clear overview of RDF, a standard for representing structured and semi-structured data, based on triples and globally unique IDs. The unique IDs look like URLs, but their job is unambiguous naming, not providing locations. They look like URLs because this allows domain owners to control the naming conventions.
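
To make the triple idea concrete, here is a minimal sketch in plain Python (no RDF library). The example.org identifiers and the vocabulary terms are made up for illustration; the point is only that every statement is a (subject, predicate, object) triple and that the IDs act as names rather than addresses.

```python
# Hypothetical identifiers under example.org, used only as names.
EX = "http://example.org/"

# Each statement is one (subject, predicate, object) triple.
triples = {
    (EX + "person/alice", EX + "vocab/name", "Alice"),
    (EX + "person/alice", EX + "vocab/knows", EX + "person/bob"),
    (EX + "person/bob", EX + "vocab/name", "Bob"),
}


def describe(subject, graph):
    """Return all (predicate, object) pairs stated about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


alice = describe(EX + "person/alice", triples)
for predicate, obj in alice:
    print(predicate, "->", obj)
```

Nothing here ever fetches the URLs; they are compared as plain strings, which is the "naming, not location" idea in practice.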

Benefits of RDF over relational databases:

  • independence from schemas (new classes, properties and relationships can be added easily without the need to modify a schema in advance);
  • easy integration and disintegration of datasets; 
  • shared vocabularies; 
  • efficiency in storing datasets that don't have all the same properties for every instance in a class (similar to NoSQL databases). 
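
The points above can be sketched with sets of triples in plain Python. This is a simplified illustration, not how a real triplestore works, and all identifiers and property names are hypothetical.

```python
# Hypothetical identifiers; plain tuples stand in for RDF triples.
EX = "http://example.org/"

dataset_a = {
    (EX + "book/1", EX + "vocab/title", "Book One"),
    (EX + "book/1", EX + "vocab/author", "Author A"),
}
dataset_b = {
    # A property dataset_a never declared: no schema change is needed.
    (EX + "book/1", EX + "vocab/pages", 300),
    # A book with only a title: missing properties cost nothing to store.
    (EX + "book/2", EX + "vocab/title", "Book Two"),
}

# Integration is a set union, because shared IDs line the data up.
merged = dataset_a | dataset_b

# Disintegration: removing one source's triples restores the other.
restored = merged - dataset_b


def properties_of(subject, graph):
    """Collect the properties actually stated for one subject."""
    return {p: o for s, p, o in graph if s == subject}


book1 = properties_of(EX + "book/1", merged)
book2 = properties_of(EX + "book/2", merged)
```

After the merge, book/1 carries three properties and book/2 only one, with no empty columns in between.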

What sets RDF apart from other NoSQL options is that it comes with a standard query language, SPARQL, which allows you to find the connections between triples in aggregated data. SPARQL can be used to query both RDF and non-RDF data. The D2RQ platform, for example, can execute a single SPARQL query across databases stored in different relational database management systems.
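
To show what "finding connections between triples" means, here is a toy pattern matcher in plain Python. It is not SPARQL and not a real engine, only a sketch of the matching idea: terms starting with "?" are variables, and a variable shared between two patterns joins them. The data and names are hypothetical.

```python
EX = "http://example.org/"

triples = {
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "acme", EX + "locatedIn", EX + "berlin"),
    (EX + "carol", EX + "worksFor", EX + "globex"),
}


def is_var(term):
    """Terms starting with '?' are treated as variables."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple; None if impossible."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if new.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to something else
        elif p_term != t_term:
            return None  # constant term does not match
    return new


def query(patterns, graph):
    """Match patterns one after another, carrying bindings along."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Roughly the shape of this SPARQL query:
#   SELECT ?person WHERE { ?person ex:worksFor ?org .
#                          ?org ex:locatedIn ex:berlin }
results = query(
    [
        ("?person", EX + "worksFor", "?org"),
        ("?org", EX + "locatedIn", EX + "berlin"),
    ],
    triples,
)
people = sorted(b["?person"] for b in results)
print(people)
```

The shared variable ?org is what connects the two triples; carol is left out because nothing says where her organization is located.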

The overall conclusion of the article is that RDF offers a more flexible solution for big data, one that is based on open standards and is therefore interoperable and expandable.

Another resource, the presentation "RDF: Resource Description Failures and Linked Data Letdowns" by Robert Sanderson (YouTube link and Vimeo link; I wish these great presentations existed in some textual form - it's really hard to work with videos), argues that the Semantic Web was a great idea in 2003 and still is a great idea in 2013, but that it has serious difficulties and complications to overcome:

  • RDF complicates querying and limits it. Both structure and data are important for queries, but in RDF, data is treated as a "second class" citizen, because structure is the new thing and more interesting. So full text search, for example, is difficult. You need to know the structure (the triple elements) to query RDF, and that's usually a problem for systems with a developed front end - you can't expect users to know the structure.
  • Serialization and storage become complicated. RDF graphs have no beginning and no end and can be cyclic, but you want to store information without repeating it. A triplestore (an RDF storage solution) is one big lump that has no clear documents or objects. If different users put documents in, they might want to treat their contributions as separate.
  • RDF visualization is hard (as opposed to document visualization), because it focuses on structure. But what about data? It is hard to see what's going on, and labeling and connections are terrible.
  • RDF/Semantic Web is based on the ultimate goal of creating a single global graph. This "Open World" idea is good in theory, as it provides richness of data, its re-use, gradual expansion, etc. But global graphs create several problems, e.g., how to deal with local complexity (when statements are true in one context and not true in another). Local identities are similarly complex.
  • Linked data does not handle changes over time well - a lot of work still needs to be done.
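
The querying complaint in the first point can be illustrated with a small plain-Python sketch. It is a simplified illustration under made-up data, not a description of how any triplestore implements search: a structure-driven lookup only finds what sits under the predicate you already know, while a text search has to ignore structure and scan every literal.

```python
EX = "http://example.org/"

# Hypothetical documents; note that doc/2 uses "label" instead of "title".
triples = {
    (EX + "doc/1", EX + "vocab/title", "Linked data basics"),
    (EX + "doc/1", EX + "vocab/abstract", "An introduction to triples"),
    (EX + "doc/2", EX + "vocab/label", "Notes on linked data"),
}


def by_predicate(graph, predicate, needle):
    """Structure-driven lookup: the caller must know the predicate."""
    return sorted(
        s for s, p, o in graph if p == predicate and needle in o.lower()
    )


def full_text(graph, needle):
    """Data-driven lookup: scan every literal, whatever the predicate."""
    return sorted(
        {
            s
            for s, p, o in graph
            if not o.startswith("http") and needle in o.lower()
        }
    )


structured_hits = by_predicate(triples, EX + "vocab/title", "linked data")
text_hits = full_text(triples, "linked data")
```

The structured lookup misses doc/2 because its text lives under a different predicate, which is exactly what a front-end user cannot be expected to know.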

Good things about RDF: open source, scalable, cross-platform, in use by commercial companies, good for inferences and new knowledge, and interfaces are developing fast.