Making IrishGen SPARQL Part I: Selects

Christopher Guy Yocum

This post will begin to cover the SPARQL query language using IrishGen as the example dataset. This is the culmination of several posts that guide the reader through the sometimes confusing world of Linked Data. The post assumes that the reader is already familiar with Linked Data, Semantic Web, Triplestores, and logical reasoning using OWL 2. Eystein Thanisch has already covered some of this in his post which gives a practical and useful introduction to using SPARQL with IrishGen. This and following posts is meant to deepen the reader’s understanding and to give examples of most forms of SPARQL so that the reader feels confident in asking and answering their own questions of the IrishGen dataset.

Báeth: A Useful Guide

In the Electronic Dictionary of the Irish Language (eDIL) defines báeth as “foolish, stupid, silly, thoughtless, reckless” and also “in Laws applied to one not fully responsible either through nonage or mental deficiency …”. “Báeth” as a name also appears in the genealogies and will serve as a guide for many of the different kinds of queries that one can perform using SPARQL. It is unclear why this was used as a personal name in the genealogies but its use could be connected to the dictionary definition described above.

SPARQL Query Forms: Select

SPARQL has several different kinds of queries: select, construct, ask, and describe. Select is the most common and will be the focus of this post. Construct queries are the next most common but should have their own treatment as they are more complex syntactically so will be deferred to another post.

We will begin our journey with Báeth with a very simple query (with reasoning disabled for the moment):

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x 
from <tag:stardog:api:context:all>
where {
    ?x irishRel:nomName "Baeth"
}

SPARQL resembles the TRiG file format used in IrishGen in many ways. However, there are differences between them. For the most part and in most practical situations, this does not pose a problem and most tools that work with SPARQL that the user will encounter will identify those for the user.

The select keyword tells the Triplestore that what is to follow is a select query. The select’s purpose is to return what is formally termed a solution sequence but for practical purposes can be understood as a list of terms which match the query. The ?x is the query variable that the search is interested in producing and will be reused in the query to indicate to the Triplestore places in the query that will be filled by its search. One can think of the variable ?x as a hole in the RDF which the computer is then asked to fill in. There could be many different solutions to filling the hole so each is returned as a valid way of solving the problem of filling in the hole in the TRiG form presented by the user. The term ?x is nothing special and it could very well have been ?baeth or any other set of characters. Query variables in SPARQL are prefixed with a ? or a $ so the above variable could also be expressed as $x. Naming variables is well-known as one of the hard problems of Computer Science. A rule of thumb is to make the variable name meaningful unless the code is short as it is in this case.

The from keyword controls what graph to create a default graph from. In this instance, the <tag:stardog:api:context:all> is a Triplestore dependent, in this case Stardog, variable which combines all available graphs. If, for instance, one only wanted to search for triples in the Book of Leinster then the from statement would be from <http://example.com/LL>. This is a powerful way to constrain searches to a particular MS or set of MSS and was one of the motivating factors in choosing TRiG with its graph declarations rather than continuing with Turtle which does not have the ability to constrain searches in this way.

The where keyword introduces the main section of the query which is known as the basic graph pattern. The basic graph pattern is basically a set of triples in Turtle form where the ?x is inserted to represent what part of the triple the user is interested in being returned from the query. The irishRel:nomName is the predicate that we are interested in being matched. Finally, “Báeth” is the string literal that the user is interested in matching.

To rewrite the above into more plain language: “Return to me the list of all the subject URLs which have the predicate irishRel:nomName and the object string “Báeth”” or, even more informally, “I want all URLs where the person’s nominative name is Báeth”.

The results of the query using the current IrishGen are:

http://example.com/LL/forthart_fea.trig#Baeth

http://example.com/LL/ciarraige.trig#Baeth

http://example.com/LL/lagin.trig#Baeth

http://example.com/LL/genelach_h_mugroin_i_m-maig_liphi.trig#Baeth

http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-fdda055e

http://example.com/LL/eoganachta_casil.trig#Baeth

http://example.com/LL/de_genelach_dail_messi_corbb.trig#Baeth-a245b020

http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-50d79733

http://example.com/LL/de_genelach_dáil_nia_corbb.trig#Baeth-a3e4de7a

A note about reasoning will be helpful here. The reasoning capabilities of Stardog, due to its backwards chaining strategy, can be enabled or disabled at the time the query is sent to Stardog for processing. If it is turned off, the query will return only results from the dataset as it stands in the files that are loaded into it. GraphDB on the other hand has reasoning always enabled because it must always pre-compute the entire graph before the query can be processed. From this point onward, unless otherwise stated, assume that Stardog is used and reasoning is enabled.

While this may be useful, someone knowledgeable concerning the medieval Irish genealogies would question the overall usefulness as there are many different spelling variations and other details to consider. In the instance of the query above, only instances of “Baeth” that have a nominative name which exactly matches will be returned and all spelling variations will be ignored.

Additionally, Stardog elides some of the information as it will only return one URL if they are all marked owl:sameAs, which is an implementation detail which must be borne in mind by the user (see the note attached to the owl:sameAs reasoning in the Stardog manual for more information). Thus, Stardog will return all canonical URLs for Báeth but not all instances of Báeth in the Triplestore. The difference between individual and instance is a subtle one that would require a more extended treatment elsewhere. In this case, not all URLs which match the query will be returned.

The next query will show how to have queries with multiple triples in it:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x ?y
from <tag:stardog:api:context:all>
where {
    ?x irishRel:nomName "Baeth";
       irishRel:genName ?y
}

This is a slight expansion of the above. This query adds the ; which means “reuse the same subject” then gives the predicate irishRel:genName then the variable ?y which will capture any genitive forms of the name that may appear. More informally, this can be translated: “Give me all subject URLs which have a nominative name “Báeth” and also capture the genitive name and return it with the subject URL”. One detail will need to be noted here but the query will only match where there is an extact match of “Baeth” in the nominative and the URL has a genative name. Three results are returned:

?x	?y
http://example.com/LL/forthart_fea.trig#Baeth	“Baeth”
http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-fdda055e	“Baeth”
http://example.com/LL/de_genelach_dail_messi_corbb.trig#Baeth-a245b020	“Báeth”
http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-50d79733	“Baeth”

To create a more interesting query, let us ask who are the children of Báeth? To do this a new concept will need to be introduced: the blank node. Blank nodes are URLs which have no subject or in terms of IrishGen, there are people who are mentioned in the genealogies but they are not given a name. Since all URLs are predicated on the individual having a name, a URL cannot be constructed for nameless individuals. To avoid this problem, as nameless individuals (most often women) are genealogically important, RDF blank nodes are used to refer to these individuals. In SPARQL specification terms, blank nodes in graph patterns act as variables. This, regrettably, can be confusing to the beginner and expert alike but they serve different purposes and is useful to keep in mind even if often the term variable is used less formally to refer most often to query variables. The usefulness of this will be shown in the query below:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 
prefix rel: <http://purl.org/vocab/relationship/>
prefix foaf:  <http://xmlns.com/foaf/0.1/>

select ?x
from <tag:stardog:api:context:all>
where {
   ?x rel:childOf [
      a foaf:Person;
      irishRel:nomName "Baeth"
   ]
}

The query can be translated as “give me all the people who are the child of something which is a person and has the nominative name of ‘Baeth’”. Another way to think about the blank node in SPARQL is to imagine that it is like ?x but with some added constraints which must be satisfied before the solution is added to the solution set.

In this case there are 7 results:

http://example.com/LL/forthart_fea.trig#Echdach

http://example.com/LL/genelach_h_mugroin_i_m-maig_liphi.trig#Thig

http://example.com/LL/eoganachta_casil.trig#Butheni

http://example.com/LL/de_genelach_dail_messi_corbb.trig#Éitchen

http://example.com/LL/genelach_h_n-enechglais.trig#Echach-faeed481

http://example.com/Rawl_B502/genelach_úa_m-bairrche.trig#Breccán-f3d5ac80

http://example.com/Rawl_B502/genelach_úa_m-bairrche.trig#Fóelchú

To return to the original objection, what about spelling variations? To capture these kinds of variations we will need to find a way to both broaden and filter our results. This is possible in SPARQL using the filter keyword. The filter, according to the SPARQL specification, “restrict[s] the solutions of a graph pattern match according to a given constraint”. Essentially, filter allows the user to constrain the kinds of results which are returned. This can be very useful in a variety of situations. In the question above, the query will need to be broadened but the user wants to restrict the kinds of solution sets returned. To wit:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x ?y
from <tag:stardog:api:context:all>
where {
    ?x irishRel:nomName ?y.
    filter regex(?y, "^B[áa]eth$", "i")
}

This query returns 21 results. The operative feature here is the line filter regex(?y, "^B[áa]eth$", "i"). This starts with the filter keyword which tells the Triplestore that it should restrict the solution set to the expression to follow. regex is a function which applies a regular expression to the variable ?y with the option of "i" which means apply the regular expression with case insensitivity (essentially, match upper and lower case letters with each other). The term regular here comes from Latin regula with the meaning “a rule, pattern, model, example” (see Lewis and Short II regula and Oxford English Dictionary 3a regular). Regular expressions are a very large topic on their own and are well worth the time to study as they appear in many computer programming and query languages. To quickly explain, regular expressions are a string matching pattern language. In the above example, ^B[áa]eth$ can be translated into informal language as “tell me if ?y matches the following pattern: at the beginning of ?y, B followed by either a or á followed by eth and then end of the string”. The option of case-insensitivity is active so B will also match b and so on.

There is an extensive set of functions available to build various filters and other constructs in SPARQL. It is another good use of time to look over the list to gain some familiarity with what is available.

To return to the children of Báeth example, in order to gain a full appreciation of the number of children, now that we have a method for capturing more forms of Báeth, a merged query to see what that might look like:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 
prefix rel: <http://purl.org/vocab/relationship/>
prefix foaf:  <http://xmlns.com/foaf/0.1/>

select ?x
from <tag:stardog:api:context:all>
where {
   ?x rel:childOf [
      a foaf:Person;
      irishRel:nomName ?y.
   ]
   filter regex(?y, "^B[áa]eth$", "i")
}

This query returns 22 results which is only one more than the original but with this we can see how to combine blank nodes with a filter and that single individual may be crucial to a researcher’s interest so it is well worth the extra complexity.

One of the great benefits of RDF and Linked Data is its flexibility. In relational databases, due to the closed world assumption, all columns must be filled with values. Optional values must still have a dummy value inserted and must be accounted for in SQL queries. RDF does not have this restriction. In the context of IrishGen, this means that not all individuals will have all predicates. What this means is that a URL may or may not have any predicate. When a user writes a query with a predicate, that predicate must exist for the match to be made. What if the user did not know in advance? SPARQL has a solution for this through the optional keyword.

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x ?nominative ?genitive
from <tag:stardog:api:context:all>
where {
    ?x irishRel:nomName ?nominative.
    optional { ?x irishRel:genName ?genitive }
    filter regex(?nominative, "^B[áa]eth$", "i")
}

This query returns 19 results. While in earlier queries, the results were deferred so that the reader was not overwhelmed or distracted by the table, it is instructive to reproduce the whole table here.

?x	?nominative	?genitive
http://example.com/LL/forthart_fea.trig#Baeth	“Baeth”	“Baeth”
http://example.com/LL/ciarraige.trig#Baeth	“Baeth”
http://example.com/LL/genelach_h_mugroin_i_m-maig_liphi.trig#Baeth	“Baeth”
http://example.com/Rawl_B502/mínugud_senchusa_laigin_and_so_sís.trig#Báeth	“Baeth”
http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-fdda055e	“Baeth”	“Baeth”
http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-fdda055e	“Báeth”	“Baeth”
http://example.com/LL/eoganachta_casil.trig#Buith	“Baeth”
http://example.com/Rawl_B502/genelach_corco_m_druad.trig#Báeth	“Báeth”
http://example.com/Rawl_B502/genelach_ceníuil_bóguine.trig#Báeth	“Báeth”
http://example.com/Rawl_B502/do_primforslointib_Lagen_inso.trig#Báeth-604282d0	“Báeth”	“Báeth”
http://example.com/LL/de_genelach_dail_messi_corbb.trig#Baeth-a245b020	“Baeth”	“Báeth”
http://example.com/LL/de_genelach_dail_messi_corbb.trig#Baeth-a245b020	“Báeth”	“Báeth”
http://example.com/LL/genelach_h_n-enechglais.trig#Baeth-50d79733	“Baeth”	“Baeth”
http://example.com/LL/genelach_h_falgi.trig#Báeth	“Báeth”
http://example.com/Rawl_B502/genelach_benntraige.trig#Báeth	“Báeth”	“Báeth”
http://example.com/Rawl_B502/clann_aingeda.trig#Báeth	“Báeth”	“Báeth”
http://example.com/Rawl_B502/genelach_ceníuil_dalláin.trig#Báeth	“Báeth”
http://example.com/Rawl_B502/genelach_úa_m-bairrche.trig#Báeth	“Báeth”	“Báeth”
http://example.com/Rawl_B502/genelach_úa_m-bairrche.trig#Bóeth	“Baeth”

In this case though, there is another thing to note. The query will return all results which have a nominative name. This could result in the tens of thousands. It will then filter those by the regular expression and that will spawn tens or hundreds of thousands of computations or more. While in this case it took only 1816 millisecond (1.8 seconds), this is very inefficient. Users will need to have some awareness of the complexity of their queries when running them and strive for the most specific query that will satisfy their requirements. Some queries, however, will just take time and the user will need to keep in mind that their queries will not always be instantaneous.

The next natural question is: what happens when both names could be present? As the reader can see from the example above the optional keyword needs { and } which implies that more than one thing can be optional at a time. The general case can be re-specified as: the subject can have either irishRel:nomName or irishRel:genName or both. In this case SPARQL has the union keyword which combines the results of graph patterns. Thus, in the case where both irishRel:nomName or irishRel:genName can be optional, the query can be rewritten:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x ?y
from <tag:stardog:api:context:all>
where {
    optional { 
      {?x irishRel:nomName ?y }
      union
      { ?x irishRel:genName ?y } 
    }
    filter regex(?y, "^B[áa]eth$", "i")
}

The above query will return 24 results. This will not distinguish between nominative or genitive and will combine both if found. Most of the queries so far have not taken advantage of the reasoning capabilities that were discussed previously. Thus, there is an alternative way for this query to be written so that it will do a similar search, that is search for names irrespective of their grammatical case, but rely instead on reasoning rather than optionality. If the reader consults the definition of nomName predicate in the earlyIrishRelationship.ttl file, the reader will notice that:

:nomName
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf oldIrish:nominative, foaf:name .
	
:genName
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf oldIrish:genitive, foaf:name .

This means that the nominative name is a sub-property of foaf:name. In other words, all :nomName are foaf:name. Further, :genName is similarly defined. Taken as a whole, this means that, in the presence of a Triplestore with a sufficiently powerful reasoner, foaf:name can take the place of the other combination of names. Thus:

prefix foaf:  <http://xmlns.com/foaf/0.1/>

select ?x ?y
from <tag:stardog:api:context:all>
where {
    ?x foaf:name ?y
    filter regex(?y, "^B[áa]eth$", "i")
}

The result of this query is also 24 as there are very few dative and accusative cases encoded in the dataset.

There is a final form of the select query that is useful to discuss before moving to construct queries is aggregates. This kind of query generally count or add or generally combines or aggregates information, hence the name. We will need to part ways for a moment from our useful guide, Báeth. There is encoded in the dataset the :numChild predicate. In the medieval texts of the genealogies, the number of children a person had would occasionally be recorded. Thus we can ask who had the most recorded children in the genealogies:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x (max(?numChild) as ?maxNumChild) 
from <tag:stardog:api:context:all>
where {
    ?x irishRel:numChild ?numChild
} 
group by ?x
order by desc (?maxNumChild)

This query takes all URLs that have a irishRel:numChild declared on them then counts the max of them for each individual. This means that it will find the largest irishGen:numChild that an individual has declared on them. It orders them descending (desc) and it groups them by ?x which means that the query will break each group represented by the variable ?x into a separate group then work on that group. In this case, it is ?x so that each URL is counted as its own group. The result is an order list going from most to least which shows each individual and their :numChild which is more than 1000 results. Thankfully, this is a very fast query at 68 milliseconds. Evaluating software performance, particularly speed of execution, is a exhaustingly large subject. There is a very high variance in what the user may experience on their own computers due to inter alia CPU L1 and L2 cache strategies, RAM clock rate, disk seek speed, query optimiser, and even more complicated events outside the control of the Triplestore.

As the result is over 1000, we will only concern ourselves with the top five. We can do that progrmmatically by using the limit keyword.

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#> 

select ?x (max(?numChild) as ?maxNumChild) 
from <tag:stardog:api:context:all>
where {
    ?x irishRel:numChild ?numChild
} 
group by ?x
order by desc (?maxNumChild)
limit 5

?x	?maxNumChild
http://example.com/Laud_Misc_610/CGH/do_minigud_senchais_fer_muman.trig#Óengus	48
http://example.com/Rawl_B502/de_genelogia_síl_ébir_16.trig#Óengus-e74701aa	48
http://example.com/Rawl_B502/de_peritia_&_de_genealogis_dál_niad_cuirp_incipit.trig#CathairMár	33
http://example.com/Rawl_B502/de_peritia_&_de_genealogis_dál_niad_cuirp_incipit.trig#CathairMár-82bed530	30
http://example.com/LL/de_genelach_dáil_nia_corbb.trig#CathairMár	30

The results of this raise an interesting point. The reader will notice that Cathair Már appears three times in the results. This means that there are three different declared number of children for Cathair Már and they are not counted the same as max should return only the largest for a single individual. As explained above, the difference between individual and instance is subtle and deserves its own treatment. To investigate this is outside the scope of this post but a few hypothesises suggest themselves. First, that Cathair Már has differing numbers and the genealogists had differing accounts and were taking from their own sources or constructing genealogies with competing interests. Second, these are all the same person but they have not been linked together by owl:sameAs by the curators which would then match what is expected, which, given that the same number of children are reported, would suggest that this is the cause, at least in this case. Third, there is a discrepancy between LL and Rawl B502 that is now apparent by looking at this query. Whatever the actual underlying cause, the query demonstrates something that would need to be done painstakingly by hand with an even larger margin for error than using a Triplestore to do the calculation for the user.

There is, of course, another method of counting by taking advantage of the reasoner. The alternative method is to count the children separately and not to return the figures given explicitly in the genealogies. In fact, many do not have explicit numbers of children mentioned so this is the method by which most counting may be done. The query below demonstrates this:

prefix irishRel: <http://example.com/earlyIrishRelationship.ttl#>
prefix rel: <http://purl.org/vocab/relationship/>

select ?x (count(distinct ?y) as ?numChildren)
from <tag:stardog:api:context:all>
where {
    ?x rel:parentOf ?y
}
group by ?x
order by desc(?numChildren)
limit 5

As a reader familiar with the IrishGen dataset will know rel:parentOf is not encoded very often within the dataset. The reasoner knows that rel:parentOf is the inverse of rel:childOf and thus can logically infer the number of children by applying this logic to the dataset. Thus, queries can be constructed based on data that is not encoded within the dataset. This does come at a cost that shifts the burden of calculating this from human beings encoding it in the dataset to the computer but this is a much less labour intensive way of determining these things. In this case, the query is runs for ~1022 milliseconds (about 1 second). The outcome of this query for the top five people is:

?x	?numChildren
http://example.com/Laud_Misc_610/CGH/do_minigud_senchais_fer_muman.trig#Óengus	34
http://example.com/Laud_Misc_610/CGH/senchus_dáil_fíatach.trig#AilellaÁuluim	30
http://example.com/LL/laigsi.trig#CathairMár	29
http://example.com/Rawl_B502/úi_meic_h_eirc.trig#ConallClóen	25
http://example.com/Rawl_B502/do_forslointib_ulad_iar_coitchiund_in_so.trig#Buan	25

While again outside the scope of the present post, it is interesting to note that Cathair Már has 29 calculated children while the genealogists have counted differently. Again, there could be various reasons why this is, including errors in the data introduced by the curators during the translation process into RDF and it would bear more investigation but the overall point is that there are now ways to do these kinds of queries which were not possible or overly onerous previously.

Conclusion

This post has covered some of the basic select query forms that a researcher will encounter when using SPARQL by using Báeth to explore each in turn. There is, of course, much more to discover and it takes time, practice, experimentation, and experience to use a Triplestore with SPARQL. However, a researcher should now feel confident that they have enough to get started asking their own research questions of the IrishGen dataset and begin to explore the new possibilities afforded by the system.

A note of caution is useful here. The data which comprises IrishGen is under review and will change as errors and omissions are corrected. All queries results should be examined for correctness before relying on them. Over time, these errors and omissions will be rectified but it is a time consuming process which will possibly never be fully completed.

In the next post, the second most common form of SPARQL query will be explored: the construct query form which will introduce the reader to the real power of reasoning and how to extract sub-graphs from IrishGen and create visualisations which will help in their own research.