Linked Open Data

John O’Gorman (john@og.co.nz)

10 May 2015

1 Intro

The internet was created in the early 1970s, and in 1990 Tim Berners-Lee invented the World Wide Web, which links hypertext pages over the internet. The hypertext pages depend on HTML (HyperText Markup Language) and the Hypertext Transfer Protocol (HTTP). Browsers followed soon after: Mosaic, then Firefox, Safari, Chrome, and others, giving users access to linked pages containing text, images, and links to other pages.
Berners-Lee has since proposed extending the Web to include Linked Open Data. Projects such as Wikipedia already embed structured data in their web pages as infoboxes, and this data can be expressed in RDF (Resource Description Framework).
The Linked Open Data (LOD) project aims to turn the World Wide Web into a global database in which any knowledge can be shared and combined.
Linked Data is a set of techniques for publishing and accessing data using standard formats and interfaces defined by the World Wide Web Consortium (W3C), such as OWL, RDF, RDFS, and RDFa.

1.1 HTML5

The original HTML standard has now been extended to a new standard called HTML5, which can be combined with RDFa to embed RDF data and which also has an XML serialisation (XHTML). Current browsers are being enhanced to support HTML5.
To test your browser, point it to https://html5test.com. It will give your browser a score (e.g. 396 out of 555 points) indicating which parts of the specification it does and does not support. HTML5 is still evolving, so browsers cannot be expected to pass the test with a 100% score. By convention HTML5 tags are written in lower case even where they were upper case in older HTML, e.g. <br /> instead of <BR>.
A further problem arises when HTML5 is written in its XML serialisation rather than in the more permissive SGML-derived syntax of older HTML. Many older web pages will not conform, and strict XML parsers will reject the defective tags that the more permissive HTML allowed.
The HTML5 standard also specifies the DOM (Document Object Model), which together with the JavaScript language allows web pages to be made interactive and animated.
To view the HTML page source do the following

1.2 The Basics

1.2.1 Entities

The Web Ontology Language was an early W3C standard. It is whimsically abbreviated OWL, after the Winnie-the-Pooh character Owl, who misspelled his own name as WOL. Ontology (from the Greek ontos, meaning being) is a term borrowed from metaphysics; here it means the formal definition of the types, properties, and relationships of entities.
RDFS (RDF Schema) is now the favoured standard for defining vocabularies of entities used in Linked Data and is compatible with pre-existing OWL constructs.

1.2.2 Triples

Resource Description Framework (RDF) is a W3C specification for describing entities as triples: subject, predicate, object.
A single RDF statement describes two things and a relationship between them. Technically this is called an Entity-Attribute-Value model, but Linked Data people usually call the three elements the subject, the predicate, and the object. For example, "cats eat mice" is a triple in which cats is the subject, eat is the predicate, and mice is the object. Written informally in a Turtle-like notation, the triple is:
"cats" "eat" "mice" .
or, sketched in an RDF/XML-like syntax (namespace declarations omitted):
<rdf:Description rdf:about="cats">
    <eat rdf:resource="mice"/>
</rdf:Description>
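The triple model can also be sketched in a few lines of Python. This is a minimal illustration only: the names TRIPLES and match are invented for this example, and plain strings stand in for the URIs that real Linked Data would use.

```python
# Hypothetical sample data: each statement is a (subject, predicate, object) tuple.
TRIPLES = [
    ("cats", "eat", "mice"),
    ("cats", "chase", "birds"),
    ("owls", "eat", "mice"),
]


def match(triples, s=None, p=None, o=None):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in triples:
        subject, predicate, obj = triple
        if s is not None and subject != s:
            continue
        if p is not None and predicate != p:
            continue
        if o is not None and obj != o:
            continue
        yield triple


if __name__ == "__main__":
    # Who eats mice?
    for subject, _, _ in match(TRIPLES, p="eat", o="mice"):
        print(subject)
```

Calling match(TRIPLES, p="eat", o="mice") yields the triples for cats and owls, i.e. every statement whose predicate is eat and whose object is mice.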

1.2.3 URIs

The example above shows data but not Linked Data. For it to be linked, cats and mice need to be identified by URIs that can be stored and retrieved, and eat needs to be a predicate from a standard, published vocabulary. So a more realistic Turtle representation might be:
@prefix dbpedia: <http://dbpedia.org/resource/> .

dbpedia:cats dbpedia:eat dbpedia:mice .

This assumes that cats, eat, and mice have been defined in DBpedia. If they have, you can trace links from cats and mice to felines, rodents, mammals, and so on, and link into the DBpedia world of knowledge.

1.3 DBPedia

The Wikipedia project is a free collaborative encyclopedia with over 4 million entries in its English edition and has been going since 2001. It has become the custom to put structured data entries called infoboxes at or near the top of each entry. DBpedia is also a free collaborative enterprise, based at the University of Leipzig, which extracts what it can from the infoboxes and builds a linked open database from them. DBpedia has been available since 2007, and Tim Berners-Lee has described it as one of the more famous parts of the Linked Data effort.

1.4 Tools for finding LOD

There are some sites which offer access to Linked Open Data and provide tools for gathering and using the information available.
Point your browser to any of the following:
Site        Comment
LOD Cloud   Linked Open Data Cloud
DBPedia     Extracts from Wikipedia
Sindice     Italian for Semantic Index (pronounced sin-dee-chey)
SameAs.org  Identifies equivalent URIs
Data Hub    Community-run catalogue

1.5 History

1.5.1 BBC

The BBC faced the challenge of producing web pages for 1500 television and radio programmes with a staff of only a handful of people. It also needed to publish web content, updated each day, for every band and the songs they record, as well as web pages for each animal species and its habitat, information the organisation did not itself hold. It met this challenge, during a period of staff cuts, by using Linked Data. Point your browser at any of the following:
http://www.bbc.co.uk/programmes
http://www.bbc.co.uk/music
http://www.bbc.co.uk/nature/wildlife
The BBC collects, filters, and reuses Linked Data from various sources, including the World Wildlife Fund, MusicBrainz, and the DBpedia project.

1.5.2 DBPedia

Wikipedia embeds structured data in its web pages in tables called infoboxes, usually at the top right of the page. The data extracted from them can be accessed from
http://dbpedia.org
whence it can be used by others.

2 RDF

Resource Description Framework (RDF) is a W3C specification for describing entities as triples: subject, predicate, object.
A single RDF statement describes two things and a relationship between them. Technically this is called an Entity-Attribute-Value model, but Linked Data people usually call the three elements the subject, the predicate, and the object. For example, consider the following Turtle representation:
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix foaf: <http://xmlns.com/foaf/0.1/> . 
@prefix dc:   <http://purl.org/dc/elements/1.1/> .

<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:publisher "Wikipedia" .
<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:title "Tony Benn" .
<http://en.wikipedia.org/wiki/Tony_Benn>
    foaf:primaryTopic [
        a foaf:Person ;
        foaf:name "Tony Benn"
    ] .
The above Turtle listing contains three triples whose subject is the Wikipedia page:
Subject                                   Predicate          Object
<http://en.wikipedia.org/wiki/Tony_Benn>  dc:publisher       "Wikipedia"
<http://en.wikipedia.org/wiki/Tony_Benn>  dc:title           "Tony Benn"
<http://en.wikipedia.org/wiki/Tony_Benn>  foaf:primaryTopic  an unnamed node (a foaf:Person with foaf:name "Tony Benn")
Turtle allows the redundant repetition of the subject to be eliminated by putting semicolons between statements that share a subject:
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix foaf: <http://xmlns.com/foaf/0.1/> . 
@prefix dc:   <http://purl.org/dc/elements/1.1/> .

<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:publisher "Wikipedia" ;
    dc:title "Tony Benn" ;
    foaf:primaryTopic [
        a foaf:Person ;
        foaf:name "Tony Benn"
    ] .
Similarly, where several triples share the same subject and predicate, you can list the objects separated by commas:
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

dbpedia:Bonobo
    rdf:type dbpedia-owl:Eukaryote , dbpedia-owl:Mammal , dbpedia-owl:Animal .
The @prefix statements provide a means of reducing verbiage, e.g. the dc prefix allows you to write dc:title instead of the full URI http://purl.org/dc/elements/1.1/title.
The above format is called Turtle and is the most readable of the RDF formats. Each Turtle statement is terminated by a full stop.
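The semicolon and comma abbreviations can be illustrated by a small Python sketch that writes a list of triples in abbreviated Turtle. The function name to_turtle is an assumption made for this example, and the terms are assumed to be already written in Turtle syntax; a real serialiser would also emit the @prefix lines and handle escaping.

```python
def to_turtle(triples):
    """Write (subject, predicate, object) strings using ';' and ','.

    Terms are assumed to be valid Turtle already, e.g. '<http://...>',
    'dc:title' or '"Tony Benn"'. No @prefix lines are produced.
    """
    # Group objects by subject, then by predicate, keeping insertion order.
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)

    blocks = []
    for s, predicates in grouped.items():
        lines = [
            "    {} {}".format(p, " , ".join(objects))  # ',' joins shared-predicate objects
            for p, objects in predicates.items()
        ]
        # ';' separates predicates of one subject, '.' ends the statement.
        blocks.append(s + "\n" + " ;\n".join(lines) + " .")
    return "\n\n".join(blocks)


SAMPLE = [
    ("<http://en.wikipedia.org/wiki/Tony_Benn>", "dc:publisher", '"Wikipedia"'),
    ("<http://en.wikipedia.org/wiki/Tony_Benn>", "dc:title", '"Tony Benn"'),
    ("dbpedia:Bonobo", "rdf:type", "dbpedia-owl:Mammal"),
    ("dbpedia:Bonobo", "rdf:type", "dbpedia-owl:Animal"),
]

if __name__ == "__main__":
    print(to_turtle(SAMPLE))
```

The two Tony Benn triples come out as one statement joined by a semicolon, and the two Bonobo types come out as one predicate with a comma-separated object list.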

2.1 URIs

The use of RDF should keep in mind 4 principles:
Uniform Resource Identifiers (URIs) used to name things in Linked Data are a generalised version of the Uniform Resource Locators (URLs) used to locate web pages.
URIs are often long and difficult to read in triples. So a shorthand is allowed in RDF whereby you can declare a short prefix to be equivalent to a long URI. When you refer to the URI within the document, you can substitute the prefix.
The following prefixes are in common use:
Prefix  URI                                          Name                             Describes
air:    http://www.daml.org/2001/10/html/airport-on  Airport Ontology                 Nearest Airport
bibo:   http://purl.org/ontology/bibo/               BIBO                             Bibliographies
bio:    http://purl.org/vocab/bio/0.1/               Bio                              Biographical info
cc:     http://creativecommons.org/ns#               CC rights expression             Software Licences
doap:   http://usefulinc.com/ns/doap#                DOAP (Description of a Project)  Projects
dc:     http://purl.org/dc/elements/1.1/             Dublin Core Elements             Publications
dct:    http://purl.org/dc/terms/                    Dublin Core Terms                Publications
foaf:   http://xmlns.com/foaf/0.1/                   FOAF (Friend of a Friend)        People
pos:    http://www.w3.org/2003/01/geo/wgs84_pos#     Geo                              Positions
gn:     http://www.geonames.org/ontology#            GeoNames                         Locations
gr:     http://purl.org/goodrelations/v1#            Good Relations                   Products
ore:    http://www.openarchives.org/ore/terms/       Object Reuse and Exchange        Resource Maps
rdf:    http://www.w3.org/1999/02/22-rdf-syntax-ns#  RDF                              Core Framework
rdfs:   http://www.w3.org/2000/01/rdf-schema#        RDFS                             RDF Vocabularies
sioc:   http://rdfs.org/sioc/ns#                     SIOC                             Online Communities
skos:   http://www.w3.org/2004/02/skos/core#         SKOS                             Controlled vocabularies
vcard:  http://www.w3.org/2006/vcard/ns#             vCard                            Business Cards
void:   http://rdfs.org/ns/void#                     VoID                             Vocabularies
owl:    http://www.w3.org/2002/07/owl#               Web Ontology Language            Ontologies
wn:     http://xmlns.com/wordnet/1.6/                WordNet                          English Words
xsd:    http://www.w3.org/2001/XMLSchema#            XML Schema Datatypes             Data Types
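The prefix shorthand amounts to simple string substitution, which the following Python sketch tries to show. The names PREFIXES, expand, and shorten are assumptions for this illustration; only three of the common prefixes are included, and a real RDF library would do the same job more carefully.

```python
# A small, hand-picked subset of common prefixes (note the trailing '/').
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/resource/",
}


def expand(name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'foaf:name' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError("unknown prefix in %r" % name)
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Turn a full URI into a prefixed name, or '<uri>' if no prefix fits."""
    best = None
    for prefix, namespace in prefixes.items():
        # Prefer the longest namespace that the URI starts with.
        if uri.startswith(namespace) and (
            best is None or len(namespace) > len(prefixes[best])
        ):
            best = prefix
    if best is None:
        return "<%s>" % uri
    return best + ":" + uri[len(prefixes[best]):]


if __name__ == "__main__":
    print(expand("foaf:name"))
    print(shorten("http://dbpedia.org/resource/Bonobo"))
```

Because expansion is plain concatenation, a namespace URI that lacks its final / or # produces a wrong identifier, so the ending of each prefix URI matters.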

2.2 RDF Formats

Files containing RDF statements originally were stored in a variant of XML. But more recently other formats have arisen and been standardised, in particular Turtle.
The formats described below can be used for creating, storing, and translating Linked Data.

2.2.1 Turtle

Turtle (Terse RDF Triple Language) is a format for expressing data in the Resource Description Framework (RDF) data model with the syntax similar to SPARQL. RDF, in turn, represents information using "triples", each of which consists of a subject, a predicate, and an object. Each of those items is expressed as a Web URI.
Turtle provides a way to group three URIs to make a triple, and provides ways to abbreviate such information, for example by factoring out common portions of URIs. For example:
<http://example.org/person/Mark_Twain>
   <http://example.org/relation/author>
   <http://example.org/books/Huckleberry_Finn> .
Turtle is an alternative to RDF/XML, the originally unique syntax and standard for writing RDF. As opposed to RDF/XML, Turtle does not rely on XML and is generally recognized as being more readable and easier to edit manually than its XML counterpart.

2.2.2 RDF/XML

RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.

2.2.3 RDFa

RDFa (Resource Description Framework in Attributes) is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The RDF data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.

2.2.4 JSON-LD

JavaScript Object Notation for Linked Data (JSON-LD) is a method of transporting Linked Data using JSON. One design goal was to require as little effort as possible from developers to transform their existing JSON into JSON-LD, so data is serialised in a way that is similar to traditional JSON. It is a World Wide Web Consortium Recommendation that was developed by the JSON for Linking Data Community Group before being transferred to the RDF Working Group for review, improvement, and standardisation.
JSON-LD is designed around the concept of a "context" to provide additional mappings from JSON to an RDF model. The context links object properties in a JSON document to concepts in an ontology. In order to map the JSON-LD syntax to RDF, JSON-LD allows values to be coerced to a specified type or to be tagged with a language. A context can be embedded directly in a JSON-LD document or put into a separate file and referenced from different documents (from traditional JSON documents via an HTTP Link header).
Sample:
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.markus-lanthaler.com",
  "@type": "Person",
  "name": "Markus Lanthaler",
  "homepage": "http://www.tugraz.at/"
}
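To make the role of the context more concrete, the Python sketch below replaces the short terms of the sample with the URIs that its context assigns to them. It is a deliberately simplified illustration: the function name expand_terms is invented here, it handles only a flat document with an embedded context, and it is not the full JSON-LD expansion algorithm.

```python
import json

# The sample document from this section, written as a Python dictionary.
DOC = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "homepage": {
            "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
            "@type": "@id",
        },
        "Person": "http://xmlns.com/foaf/0.1/Person",
    },
    "@id": "http://me.markus-lanthaler.com",
    "@type": "Person",
    "name": "Markus Lanthaler",
    "homepage": "http://www.tugraz.at/",
}


def expand_terms(doc):
    """Return a copy of doc with short terms replaced by their context URIs."""
    context = doc.get("@context", {})

    def uri_for(term):
        # A term maps either directly to a URI or to a dict holding "@id".
        definition = context.get(term, term)
        if isinstance(definition, dict):
            return definition["@id"]
        return definition

    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue  # the context itself is not data
        if key == "@type":
            expanded[key] = uri_for(value)
        elif key.startswith("@"):
            expanded[key] = value  # other keywords such as "@id" stay as they are
        else:
            expanded[uri_for(key)] = value
    return expanded


if __name__ == "__main__":
    print(json.dumps(expand_terms(DOC), indent=2))
```

After this step every property name is a full URI, which is what allows the document to be read as RDF triples about http://me.markus-lanthaler.com.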

3 Using Linked Data

The Linking Open Data project (LOD, pronounced ell-oh-dee) is an ambitious effort whose aim is to make data available to everyone.
http://lod-cloud.net/state/

4 DBPedia

DBPedia is a community project devoted to extracting and manipulating structured data embedded in Wikipedia pages. It was originally implemented in PHP but has been redeveloped in Scala, an object-functional language that compiles to Java bytecode.

4.1 Prerequisites

5 FOAF

Friend of a Friend (FOAF) is a project devoted to linking people and information using the Web. Regardless of whether information is in people’s heads, in physical or digital documents, or in the form of factual data, it can be linked.

6 SPARQL

SPARQL is a recursive acronym for SPARQL Protocol And RDF Query Language. The intention is that it can be used with Linked Data in the same way that SQL is used with relational databases.

6.1 The ARQ tool

To use SPARQL locally, download the ARQ utility from http://apache.org/dist/jena and then, from the command line:
export ARQROOT='/Applications/ARQ-2.8.8'
$ARQROOT/bin/arq -h
SPARQL queries can also be answered remotely by endpoints, e.g. if you point your browser to http://dbpedia.org/sparql you will get an HTML query form which will accept a SPARQL query.
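An endpoint can also be called from a program rather than a browser. The Python sketch below builds an HTTP GET request that carries the query in the standard query parameter and asks for results in the SPARQL JSON format. The function names build_request and run_query are assumptions for this example, and run_query needs network access and a responsive endpoint, so treat it as a starting point rather than a tested client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://dbpedia.org/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask the endpoint for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the decoded JSON (requires network access)."""
    with urlopen(build_request(query, endpoint), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only show the URL here; call run_query() to contact the endpoint.
    print(build_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1").full_url)
```

In the SPARQL JSON results format the rows are found under results["results"]["bindings"], one dictionary per solution.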

6.2 Sample SPARQL Query

PREFIX foaf: <http://xmlns.com/foaf/0.1/> 
SELECT ?name ?email 
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}
The ? prefix is used for variables which will be instantiated when the SPARQL query runs.
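How variables become instantiated can be illustrated with a toy matcher in Python. The sketch below evaluates the three patterns of the sample query against a handful of made-up triples; the names DATA, QUERY, and solve, and the ex: identifiers, are assumptions for this example, and a real SPARQL engine does far more. The keyword a in the query is shorthand for rdf:type, which is written out here.

```python
# Made-up data: Alice has an email address, Bob does not.
DATA = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
]

# The WHERE clause of the sample query as (subject, predicate, object) patterns.
QUERY = [
    ("?person", "rdf:type", "foaf:Person"),
    ("?person", "foaf:name", "?name"),
    ("?person", "foaf:mbox", "?email"),
]


def match_pattern(pattern, triple, bindings):
    """Return bindings extended by this match, or None if it does not fit."""
    result = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def solve(patterns, triples, bindings=None):
    """Yield one dictionary of variable bindings per solution."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match_pattern(first, triple, bindings)
        if extended is not None:
            yield from solve(rest, triples, extended)


if __name__ == "__main__":
    for row in solve(QUERY, DATA):
        print(row["?name"], row["?email"])
```

Only Alice appears in the result, because every pattern must match with the same value of ?person and Bob has no foaf:mbox triple.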

7 Sources

8 Callimachus

Callimachus is a Linked Data management system. It is named after Callimachus of Cyrene who was an ancient Greek researcher in the library of Alexandria and first demonstrated the need for graph data structures in his attempts at classifying books. Download Callimachus from
http://callimachusproject.org
Callimachus 1.2 requires Java Development Kit (JDK) 1.7 or later. JRE is not sufficient. Unfortunately it does not run with JDK version 8 (the current version).

9 Virtuoso

http://virtuoso.openlinksw.com