Christopher Guy Yocum

ORCID: 0000-0002-7241-3264

Following on from earlier posts, this post discusses the concept of RDF Datasets, also known as Named Graphs. The reader will need to be familiar with the basics of Linked Data and SPARQL, and reading the earlier posts is advisable before continuing. This post proceeds in four parts. First, a practical example of why Named Graphs are necessary and useful. Second, a discussion of Named Graphs and the Default Graph and how they interact. Third, how Named Graphs are searched via SPARQL in the context of IrishGen. Fourth, and finally, how and why the owl:sameAs predicate is used in IrishGen and what effect it has on the result sets returned from the two Triplestores. By the end, the reader should appreciate why Named Graphs exist, what this means for the structure of the IrishGen dataset, and how to use this facility in their queries. The reader should also appreciate how and why owl:sameAs is handled in the two Triplestores and what effect that has on a query’s result sets.

Provenance of Data

The medieval Irish genealogical tradition is not presented as a single whole. The texts that comprise it are contained within MSS which are separated in time and space. These were all compiled from various sources and exist in their own right. This situation is very messy from a data modelling point of view. There are competing constraints which need to be balanced when attempting to translate the situation as it stands within the MS tradition to the Linked Data universe. On the one hand, there is a solid core of information contained within the major MS genealogical collections. On the other hand, there are substantive differences which must be respected. Additionally, for users of IrishGen, it is equally important to be able to search one branch of the tradition or the other. Without Named Graphs, as will be discussed shortly, users who only wished to search, for instance, the Book of Leinster would have to contort their searches to narrow them.

When using a database such as IrishGen which replicates in computational form data which is taken directly from primary sources, it is imperative that IrishGen is clear concerning whence that information derives. As it stands, a triple does not carry with it a method for partitioning information. Within IrishGen, there is an ad-hoc attempt at this by encoding the MS that a triple is from into the triple’s URL. While this method can work with some extra diligence from the user, it is far from perfect. Thankfully, others have encountered the same problem and the Semantic Web community has created the concept of a “Named Graph”.

To illustrate, the situation without Named Graphs is shown first, followed by the situation with a Named Graph.

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
where {
    ?a irishRel:nomName "Find"
}

In a situation where Named Graphs are not in use, the above query will return every match for the Basic Graph Pattern without respect to which manuscript the matching triple came from. From a research point of view, this is very unhelpful. Usually, researchers will want to explore a specific genealogical text. A database-wide search does have its uses: a meta-analysis of the entire corpus, such as counting the number of distinct individuals recorded, or checking whether a name found in another source also appears somewhere in the genealogical corpus. Often, however, it is not what is wanted.

The solution to this problem is to partition the dataset into sub-graphs. Named Graphs allow the user to define what sub-graph a triple will belong to. In the case of IrishGen, Named Graphs define the relationship between the triple and the sub-graph. As with all things in Linked Data, the sub-graph is defined by a URL. In the case of IrishGen, the URL is http://example.com/ plus the commonly used scholarly abbreviation. For instance, in the case of the Book of Leinster the URL is: http://example.com/LL.

More technically, when a Named Graph is attached to a triple, the triple is transformed into a “quad”. This extra piece of information is attached to every triple in a Named Graph. For instance, a triple will normally look like this to a Triplestore:

<http://example.com/LL/lagin.trig#Find> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>

This triple states informally that “this particular instance of Find is a person”. When the Named Graph is added, the triple will be transformed thus:

<http://example.com/LL/lagin.trig#Find> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://example.com/LL>

This is now a quad (it has four basic elements rather than three). This quad informally states: “Find is a person in the graph http://example.com/LL”. The quad above can be read more liberally as “Find is a person in the Book of Leinster”. To make the above slightly more comprehensible, the TRiG file format slightly extends the Turtle format to accommodate the sub-graph. So, if a user looks at the data directly, rather than through a Triplestore, it will look something like this:

<http://example.com/LL> {
     <#Find>
         a foaf:Person;
         irishRel:nomName "Find";
         rel:descendantOf <#Baiscni>;
         rdfs:comment "Senchan Torpeist cecinit isin Cocangaib".
}

To search for the instances of a nominal name “Find” in the Book of Leinster’s genealogies, the SPARQL query would be:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
from named <http://example.com/LL>
where {
    graph ?g {
        ?a irishRel:nomName "Find"
    }
}

This ensures, with some very important caveats explored below, that only information from the Book of Leinster will be returned to this query, allowing the user to narrow to a specific MS or set of MSS.

At the moment of writing, Named Graphs are not composable. For instance, you cannot nest a Named Graph within another Named Graph. For readers who are more familiar with the structure of the medieval Irish genealogies, you cannot nest an item graph within a manuscript graph.

Named Graphs and the Default Graph (RDF Dataset)

Because triples were created before quads, the two ways of denoting information in RDF must be compatible with each other. This is done by defining two separate domains: the Default Graph and any Named Graphs. Triples exist in the Default Graph while quads exist within their own, distinct, Named Graphs as defined by the dataset. In the case of IrishGen currently, there are no triples but only quads. Every bit of information stored within IrishGen is given a MS as a Named Graph (sometimes termed “context”). This has consequences for how SPARQL handles queries which deal with Named Graphs.
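Which Named Graphs a dataset actually contains can be checked with a short query. The following is a minimal sketch in standard SPARQL 1.1. It assumes that the Triplestore exposes all stored Named Graphs to the graph keyword when no from or from named clause is given, which depends on the Triplestore in use.

```sparql
# List every Named Graph visible to the query and count the quads in each.
select ?g (count(*) as ?quads)
where {
    graph ?g { ?s ?p ?o }
}
group by ?g
order by ?g
```

For IrishGen, one would expect one row per MS graph, such as <http://example.com/LL>, though the exact output depends on the data loaded.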

Sadly, Named Graphs have many formal definitions which can make understanding them confusing. This situation is due to the fact that the Semantic Web community has yet to come to a consensus as to the direct formal meaning of a Named Graph. However, this does not detract from their usefulness in both SPARQL and IrishGen.

Named Graphs and SPARQL

From the foregoing sections, it may seem straightforward to add this functionality to SPARQL. Sadly, it is not. The two implementations that are recommended (Stardog and Ontotext GraphDB) handle RDF Datasets in very different ways. To anticipate slightly, Stardog handles Named Graphs in a way which adheres closely to the standard, but GraphDB attempts to elide the differences between the Named Graph and the Default Graph. While this elision on GraphDB’s part makes getting started with Named Graphs easier, it makes the entire situation more complicated in the long run.

First, let us look at a naive query in Stardog:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
where {
    ?a irishRel:nomName "Find"
}

The query above will produce zero results when used with IrishGen. This has caused confusion to the curators several times over the course of the project. Why does this produce no results? It is because this query does not query the entire graph; it only queries the Default Graph, the graph which contains only triples. Find does not appear in the Default Graph and thus the query will produce no results. The key to understanding this is that IrishGen is split into MS Named Graphs and does not have triples in the Default Graph thus searching the Default Graph will yield no results because there is nothing in the Default Graph.

How do we create a query which will return what is expected? For this SPARQL has two methods. The first is the from keyword, which is followed by the Named Graph that is of interest. This pulls the entire Named Graph into the Default Graph. In effect, it reduces the quads from the Named Graph to triples in the Default Graph and then searches in the Default Graph. So, if we wanted to pull all quads from LL into the Default Graph, we would use from like so:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
from <http://example.com/LL>
where {
    ?a irishRel:nomName "Find"
}

This will search the Default Graph with all the quads added from the Named Graph <http://example.com/LL>. This is useful when a user wishes to combine triples and quads into a single query.

The query above produces 19 results, which meets expectations. The second method is the from named construct. If the user only wishes to search quads from a single Named Graph, from named, in effect, narrows a query to that particular Named Graph.

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
from named <http://example.com/LL>
where {
    graph ?g {
        ?a irishRel:nomName "Find"
    }
}

Before continuing to demonstrate combining from and from named in a single query, it is useful to restate the purpose of these statements as they can be confusing. from pulls quads from a Named Graph into the Default Graph. This means that triples and quads can be merged together in a single query. from named retains the distinction between the Default Graph and the Named Graph. Queries that use from named need the graph SPARQL keyword to reach quads in a Named Graph, while the Default Graph is always available outside the graph keyword. For instance:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
from <http://example.com/Rawl_B502>
from named <http://example.com/LL>
where {
    graph ?g {
        ?a irishRel:nomName "Find"
    }
}

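The query above will again return only 19 results. Its only pattern sits inside the graph keyword, so it matches against the Named Graph LL and never searches the Default Graph, which here holds the quads pulled in from Rawl B502. To search both, the query needs a pattern outside the graph keyword as well: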

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a ?g
from <http://example.com/Rawl_B502>
from named <http://example.com/LL>
where {
    { ?a irishRel:nomName "Find" }
    union {
        graph ?g {
            ?a irishRel:nomName "Find"
        }
    }
}

This illustrative query will return 21 results. Genealogical quads from Rawl B502 are pulled into the Default Graph and searched there, while information from LL stays within its Named Graph and is searched separately. The union SPARQL keyword then merges the two result sets into a single result set. However, see below concerning how Stardog’s and GraphDB’s owl:sameAs semantics can make these results confusing in different ways depending on the Triplestore used. The same search can be written using two from named statements and just the graph keyword, in which case ?g is bound for the Rawl B502 results as well. As an example:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a ?g
from named <http://example.com/Rawl_B502>
from named <http://example.com/LL>
where {
    graph ?g {
        ?a irishRel:nomName "Find"
    }
}

The last question that may be asked is: how does a user query all graphs without needing to specify each? The SPARQL 1.1 standard does not currently provide a way to do this, so this is where the situation becomes complicated and the solution depends on the Triplestore that the user chooses. In Stardog there are two ways to do this. First, there is a special graph named <tag:stardog:api:context:all>. This special graph, only available in Stardog, will automatically pull all Named Graph quads into the Default Graph. Second, the user can set a database property in the Stardog settings named Query All Graphs, which pulls all Named Graph quads into the Default Graph by default. With this property set, the from named, from, and graph keywords are unnecessary when searching all quads at once.
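As a minimal sketch of the first option, the naive query can be pointed at Stardog’s special graph. This assumes that the special graph can be named in a from clause in the same way as an ordinary Named Graph; it is specific to Stardog and will not carry over to other Triplestores.

```sparql
prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#>

# Stardog only: the special graph stands for the Default Graph together
# with every Named Graph, so the plain pattern below can see all quads.
select ?a
from <tag:stardog:api:context:all>
where {
    ?a irishRel:nomName "Find"
}
```

With the Query All Graphs property set instead, the from line should not be needed.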

Turning to GraphDB, the situation is slightly different. Taking the original naive query from above:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
where {
    ?a irishRel:nomName "Find"
}

In GraphDB, this query will return 105 results, which is the total number of instances of “Find” in the dataset. Why does this return so many results? This is due to the way in which GraphDB constructs its Default Graph:

GraphDB constructs the default dataset as follows:

The dataset’s default graph contains the merge of the database’s default graph AND all the database Named Graphs;

The dataset contains all Named Graphs from the database.

The reason given in the documentation is thus:

There are two reasons for this behavior:

It provides an easy way to execute a triple pattern query over all stored RDF statements.

It allows all Named Graph names to be discovered, i.e., with this query: SELECT ?g { GRAPH ?g { ?s ?p ?o } }.

The query returns so many results because GraphDB treats every query as if Stardog’s “Query All Graphs” preference were enabled. In other words, all quads in all Named Graphs and all triples in the Default Graph are merged by default in GraphDB. If a query specifies a from and from named, GraphDB’s documentation states:

If either FROM or FROM NAMED are used, the database’s default graph is no longer used as input for processing this query. In effect, the combination of FROM and FROM NAMED clauses exactly defines the dataset.

In other words, to have GraphDB act in a similar way as Stardog’s default, the user will need to specify the datasets exactly in their query.

SameAs and SPARQL searches

The medieval Irish genealogies contain instances of the same individual across multiple MSS. IrishGen links the instances of these individuals to each other using OWL’s sameAs predicate. Linking all instances of an individual together in this way has consequences for searching with Named Graphs. Users must be aware of this, as it can have a large effect both on the results returned and on how those results should be interpreted.

Stardog’s documentation states that:

The way sameAs reasoning works differs from the OWL semantics slightly in the sense that Stardog designates one canonical individual for each sameAs equivalence set and only returns the canonical individual. This avoids the combinatorial explosion in query results while providing the data integration benefits.

This means that when the user searches a Named Graph, the canonical individual chosen for an owl:sameAs equivalence set may live in a different graph, and it can then appear in the results in place of the individual that is in the Named Graph being searched. This can surprise a user, although the two are technically and by definition the same. The behaviour follows directly from the way Stardog implements sameAs reasoning as described above. Additionally, for an individual who is the same across many graphs, the canonical individual can change each time the database is loaded, as the system assigns it randomly.

GraphDB’s approach is the exact opposite of Stardog’s. Instead of using forward-chaining to create sameAs relations, GraphDB creates an equivalence class and does backwards-chaining:

There is no restriction on how to choose this single node that will represent the class as a whole, so we pick the first node that enters the class. After creating such a class, all statements with nodes from this class are altered to use the class representative. These statements also participate in the inference.

The equivalence classes may grow when more owl:sameAs statements containing nodes from the class are added to the repository. Every time you add a new owl:sameAs statement linking two classes, they merge into a single class.

During query evaluation, GraphDB uses a kind of backward chaining by enumerating equivalent URIs, thus guaranteeing the completeness of the inference and query results. It takes special care to ensure that this optimization does not hinder the ability to distinguish between explicit and implicit statements.

More detail is available in GraphDB’s documentation.

To demonstrate how this affects GraphDB’s results, consider the following query, first run in Stardog:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a
from <http://example.com/LL>
where {
    ?a irishRel:nomName "Find"
}

There will be 19 results.

http://example.com/LL/ciannacht.trig#Find
http://example.com/LL/clanna_ébir_i_l-leith_chuind.trig#Find-0dc31110
http://example.com/LL/clanna_ébir_i_l-leith_chuind.trig#Find-7b6d0720
http://example.com/LL/clanna_ébir_i_l-leith_chuind.trig#Find-194ec360
http://example.com/LL/dáil_caiss.trig#Find
http://example.com/LL/do_thaicraige_arad.trig#Find
http://example.com/LL/flaithe_h_riacain.trig#Find
http://example.com/LL/forslonti_dáil_messi_corb.trig#Find-6ec2dee0
http://example.com/LL/genelach_clainde_brannduib.trig#Find-68dfd695
http://example.com/LL/genelach_h_falgi.trig#Find
http://example.com/LL/h_airgialla.trig#Find
http://example.com/LL/h_n_echdach.trig#Find
http://example.com/LL/lagin.trig#Find
http://example.com/LL/n_dési.trig#Find
http://example.com/LL/rig_ailig.trig#Find
http://example.com/LL/rig_h_falge.trig#Fhind
http://example.com/LL/síl_daimini.trig#Fhind
http://example.com/LL/síl_daimini.trig#Fhind-a5cb5100
http://example.com/LL/tairdelbaig.trig#Find

If the same query is run under GraphDB, it will return 16 results. The three results that make up the difference are:

http://example.com/LL/clanna_ébir_i_l-leith_chuind.trig#Find-7b6d0720
http://example.com/LL/clanna_ébir_i_l-leith_chuind.trig#Find-194ec360
http://example.com/LL/síl_daimini.trig#Fhind-a5cb5100

Which one of these results is correct? Both are, once the user considers that GraphDB creates an equivalence class that combines answers when doing inferencing. The two answers can be reconciled using GraphDB’s pseudo-graphs:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 
prefix onto: <http://www.ontotext.com/>

select ?a
from <http://example.com/LL>
from onto:explicit
where {
    ?a irishRel:nomName "Find"
}

What is onto:explicit? As the GraphDB documentation discussed above states:

FROM onto:explicit – using only this clause (or with FROM onto:disable-sameAs) produces the same results as when the inferencer is disabled (as with the empty ruleset). This means that the ruleset and the disable-sameAs parameter do not affect the results.

This disables the inferencer and returns what is explicitly stated in the IrishGen files. Is this what the user will always want though? That is up to the user at the time the query is run.
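The documentation quoted above also names a second pseudo-graph, onto:disable-sameAs. The following is a minimal sketch only. It assumes that this pseudo-graph is given in a from clause in the same way as onto:explicit and that, used on its own, it leaves other inference in place while switching off the owl:sameAs expansion. That reading should be checked against GraphDB’s documentation before relying on it.

```sparql
prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#>
prefix onto: <http://www.ontotext.com/>

# GraphDB only: assumed to keep the ruleset active while turning off
# the owl:sameAs equivalence handling for this query.
select ?a
from <http://example.com/LL>
from onto:disable-sameAs
where {
    ?a irishRel:nomName "Find"
}
```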

To return to the query above which contains two Named Graphs, if the user wished to have all the instances of irishRel:nomName "Find" from LL and Rawl B502 without the owl:sameAs equivalencies interfering with the results, the query would be:

prefix onto: <http://www.ontotext.com/>
prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a ?g
from <http://example.com/Rawl_B502>
from named <http://example.com/LL>
from onto:explicit
from named onto:explicit
where {
    { ?a irishRel:nomName "Find" }
    union {
        graph ?g {
            ?a irishRel:nomName "Find"
        }
    }
}

Although, it would be easier to do the following:

prefix onto: <http://www.ontotext.com/>
prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?a ?g
from named <http://example.com/Rawl_B502>
from named <http://example.com/LL>
from named onto:explicit
where {
    graph ?g {
        ?a irishRel:nomName "Find"
    }
}

The above is easier to write and read. This returns 44 results, which contrasts with the 21 results returned from Stardog. In other words, there are 44 instances of the name “Find” but only 21 distinct individuals once instances linked by owl:sameAs are treated as one. To reimpose the owl:sameAs logic, all one needs to do is remove the onto:explicit line, and the above query will return 21 results, just the same as Stardog.
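When only the totals are of interest, the rows can be counted rather than listed. The following is a minimal sketch using standard SPARQL 1.1 aggregates together with GraphDB’s onto:explicit pseudo-graph. The variable names ?instances and ?subjects are chosen here purely for illustration.

```sparql
prefix onto: <http://www.ontotext.com/>
prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#>

# Count the matching rows and the distinct subjects instead of listing them.
select (count(?a) as ?instances) (count(distinct ?a) as ?subjects)
from named <http://example.com/Rawl_B502>
from named <http://example.com/LL>
from named onto:explicit
where {
    graph ?g {
        ?a irishRel:nomName "Find"
    }
}
```

Running the same query with and without the onto:explicit line gives a quick way to compare the two counts side by side.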

As the reader can see, using OWL 2’s sameAs has various effects on the outcome of queries, and this needs to be kept in mind whenever a query is run. Each Triplestore makes different choices in how it implements the OWL 2 standard, so choosing one Triplestore over another can have a large effect on the result sets returned.

Conclusion

Named Graphs allow IrishGen to partition the various MSS sources into their own sub-graphs. This allows the user to interrogate one MS tradition or another, which can be important to certain investigations such as when a user is searching to verify whether a certain piece of information exists in a certain MS version. For instance, a name is found within a MS and the user wishes to check the genealogies contained in the same MS quickly to see if that name is found within it.

However, choosing a Triplestore has consequences for querying in the presence of Named Graphs. GraphDB’s implementation, as described above, makes searching the amalgam of the data taken from all MSS easy. Stardog adheres to the standard more closely, but because of its backwards-chaining reasoning it can be very slow when reasoning is enabled. Moreover, the way in which owl:sameAs is treated in Stardog and GraphDB has an effect on the result set that needs to be considered when writing and running queries.