Biodiversity Data Journal:
Software Description
Corresponding author: Mariya Dimitrova (m.dimitrova@pensoft.net)
Academic editor: Anne Thessen
Received: 20 Apr 2021 | Accepted: 08 Sep 2021 | Published: 24 Sep 2021
© 2021 Mariya Dimitrova, Viktor Senderov, Teodor Georgiev, Georgi Zhelezov, Lyubomir Penev
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Dimitrova M, Senderov VE, Georgiev T, Zhelezov G, Penev L (2021) Infrastructure and Population of the OpenBiodiv Biodiversity Knowledge Graph. Biodiversity Data Journal 9: e67671. https://doi.org/10.3897/BDJ.9.e67671
OpenBiodiv is a biodiversity knowledge graph containing a synthetic linked open dataset, OpenBiodiv-LOD, which combines knowledge extracted from academic literature with the taxonomic backbone used by the Global Biodiversity Information Facility. The linked open data are modelled according to the OpenBiodiv-O ontology, which integrates semantic resource types from recognised biodiversity and publishing ontologies with new OpenBiodiv-O resource types introduced to capture the semantics of resources not modelled before.
We introduce the new release of OpenBiodiv-LOD, attained through information extraction and modelling of additional biodiversity entities. It was achieved through further development of OpenBiodiv-O, of the data storage infrastructure and of the workflow and accompanying R software packages used to transform academic literature into Resource Description Framework (RDF). We discuss how to utilise the LOD in biodiversity informatics and give examples by providing solutions to several competency questions. We investigate performance issues that arise due to the large number of inferred statements in the graph and conclude that OWL-Full inference is impractical for the project and that unnecessary inference should be avoided.
The OpenBiodiv system unites biodiversity knowledge extracted from academic publications with knowledge from databases about biological diversity. It is based on a knowledge graph which aims to integrate knowledge sourced from articles from different journals and publishers and to allow querying of this knowledge through the establishment of semantic links within and between articles. Most recently, the general aspects of the system have been discussed and presented by
The existence of multiple biodiversity infrastructures which manage distinct datasets (e.g. species occurrence data, taxonomic data, literature, sequence data, etc.) has necessitated the establishment of a system to link these datasets (
We have chosen the knowledge graph technology as opposed to a relational database because it does not require a rigid schema from the beginning and allows us to add different entity types (e.g. RDF classes and properties) during different development stages of the project. We took full advantage of this by integrating additional resource types into OpenBiodiv, as described in more detail below.
The OpenBiodiv dataset comprises biodiversity information extracted from academic journals and public repositories of biodiversity data. OpenBiodiv-LOD is a synthetic RDF dataset, adhering to the Principles of Linked Open Data (
In the next sections, we discuss the sources of information that were combined to create the OpenBiodiv-LOD, the types of information that have been extracted, as well as the overall data model. We also discuss the Principles of Linked Open Data (LOD) that tie everything together. Finally, we discuss how the dataset was generated and demonstrate some of its applications using examples of SPARQL queries.
OpenBiodiv
Biodiversity informatics and semantic publishing.
The OpenBiodiv architecture
The OpenBiodiv knowledge graph integrates biodiversity and publishing axioms contained in various ontologies, which combined form the OpenBiodiv-O ontology (
The data in OpenBiodiv-LOD comes from three major sources: from the GBIF Backbone Taxonomy (
Data sources: the Global Biodiversity Information Facility (GBIF) backbone taxonomy
GBIF is the largest international repository of occurrence data (
Visualisations of nodes and the relationships between them, generated by GraphDB's Visual Graph.
Keeping in mind this particular aspect of GBIF, it is evident how the backbone taxonomy allows GBIF to integrate name-based information from diverse sources of biodiversity information and to provide a facility for taxonomic searching and browsing. Some of the better known sources of information for GBIF include the Encyclopaedia of Life (EOL), GenBank and the International Union for Conservation of Nature (IUCN). In order to grant the same capabilities to OpenBiodiv-LOD, we have imported Nub as instances of openbiodiv:TaxonomicConcept according to the OpenBiodiv-O ontology (
The RCC-5 representation further allows the future evolution of OpenBiodiv-LOD to incorporate other simultaneous views of taxonomic alignment. For example, as the GBIF backbone taxonomy is updated regularly through an automated process from over 56 sources, future updates may be ingested as new statements into OpenBiodiv-LOD without altering existing records: namely, as a new set of taxonomic concepts and RCC-5 relations linked to potentially already-existing taxonomic names.
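As an illustration of this import, a minimal R sketch is given below; the production conversion uses the TSV4RDF library described later, not R. The sketch assumes the backbone archive's Darwin Core Taxon.tsv with the standard columns taxonID and scientificName, and the URI suffix pattern shown is only indicative.

# Illustrative sketch only; the production import uses the TSV4RDF PHP library.
# Assumes the backbone's Taxon.tsv with Darwin Core columns taxonID and
# scientificName; the "-tc" URI suffix is hypothetical.
library(uuid)

taxon <- read.delim("Taxon.tsv", quote = "", stringsAsFactors = FALSE)

base <- "http://openbiodiv.net/"
concept_uri <- paste0("<", base, UUIDgenerate(n = nrow(taxon)),
                      "-", taxon$taxonID, "-tc>")

triples <- c(
  "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
  "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
  "@prefix openbiodiv: <http://openbiodiv.net/> .",
  paste(concept_uri, "rdf:type openbiodiv:TaxonomicConcept ."),
  paste0(concept_uri, " rdfs:label \"", taxon$scientificName, "\" .")
  # skos:broader / RCC-5 links between concepts would be generated from
  # parentNameUsageID in the same fashion
)
writeLines(triples, "gbif-backbone.ttl")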
Data sources: journal content from Pensoft and Plazi
Pensoft is one of the leading publishers of journals on biodiversity. Its publications are open access and available as HTML, XML and PDF. Plazi is an aggregator specialising in harvesting legacy biodiversity publications and providing open access to them on the web as XML. Articles from the journals listed in Table
Journal name | Number of Articles | Number of treatments
ZooKeys | 4715 | 31966
PhytoKeys | 968 | 4956
Biodiversity Data Journal | 695 | 1360
Journal of Hymenoptera Research | 419 | 1235
Comparative Cytogenetics | 338 | 41
MycoKeys | 365 | 1482
Zoosystematics and Evolution | 158 | 926
Subterranean Biology | 152 | 187
Zoologia | 149 | 78
Nota Lepidopterologica | 124 | 135
Neotropical Biology and Conservation | 100 | 42
Italian Botanist | 81 | 15
Deutsche Entomologische Zeitschrift | 80 | 609
Journal of Orthoptera Research | 78 | 272
Herpetozoa | 72 | 22
African Invertebrates | 55 | 189
Alpine Entomology | 54 | 173
Arctic Environmental Research | 50 | 0
Evolutionary Systematics | 41 | 171
International Journal of Myriapodology | 18 | 97
Snippet of XML markup of a taxonomic name according to the TaxPub schema and the corresponding RDF triples.
XML |
<tp:taxon-name> <tp:taxon-name-part taxon-name-part-type="genus" reg="Zelus">Z.</tp:taxon-name-part> <tp:taxon-name-part taxon-name-part-type="species" reg="casii">casii</tp:taxon-name-part> </tp:taxon-name> |
RDF |
http://openbiodiv.net/5BBC353E-CC39-4F2C-B4CE-DC2636CB2DC8 rdf:type openbiodiv:ScientificName; rdfs:label "Zelus casii"; dwc:genus "Zelus"; dwc:specificEpithet "casii"; dwc:verbatimTaxonRank "species"; openbiodiv:hasGbifTaxon openbiodiv:F1DD0CF0-217D-422B-BAA4-58901976D7B4-9146644-scName . |
The data types (article sections and other objects) which have been marked up in TaxPub and TaxonX, then converted to RDF and integrated in OpenBiodiv-LOD are listed in Table
Data types marked up in articles following TaxPub and TaxonX schemas and the corresponding RDF types of the generated RDF resources. The TaxPub and TaxonX columns contain boolean values (True or False) indicating whether the information about the data type is retrieved from XML files encoded in the corresponding schema or not. For example, Plazi's XMLs, which follow the TaxonX schema, do not contain an Introduction section, hence no resource of type deo:Introduction is created from them.
Data type | TaxPub | TaxonX | RDF Type |
Article metadata | True | True | fabio:JournalArticle and related |
Keyword group | True | False | openbiodiv:KeywordGroup |
Abstract | True | True | sro:Abstract |
Title | True | True | doco:Title |
Author | True | True | foaf:Person |
Introduction section | True | False | deo:Introduction |
Discussion section | True | True | orb:Discussion |
Treatment section | True | True | openbiodiv:Treatment |
Nomenclature section | True | True | openbiodiv:NomenclatureSection |
Materials examined | True | True | openbiodiv:MaterialsExamined |
Diagnosis section | True | True | openbiodiv:DiagnosisSection |
Distribution section | True | True | openbiodiv:DistributionSection |
Taxonomic key | True | True | openbiodiv:TaxonomicKey |
Figure | True | True | doco:Figure |
Taxonomic name usage | True | True | openbiodiv:TaxonomicNameUsage |
Bibliographic reference list | True | False | doco:BibliographicReferenceList |
Bibliographic reference | True | True | deo:BibliographicReference |
Institution | True | True | openbiodiv:Institution, openbiodiv:GRSciCollInstitution |
Identification | True | True | dwc:Identification |
Occurrence | True | True | dwc:Occurrence |
Event | True | True | dwc:Event |
Location | True | True | dwc:Location |
Workflows and processes
In this section, we explain how information from scholarly articles and the GBIF backbone is transformed into Linked Open Data which are stored and queried within the OpenBiodiv knowledge graph.
The inputs of the transformation pipeline are either XML (Pensoft and Plazi) or CSV (GBIF). Thus, the raw data-streams are semi-structured and the dataset generation problem can be thought of as an information retrieval and transformation problem. The input is encoded in three different data models: DarwinCore CSV (GBIF), TaxPub XML (Pensoft) and TaxonX XML (Plazi). The output of the transformation pipeline is knowledge represented in a fully-structured RDF according to the ontology OpenBiodiv-O.
1. Obtaining the data
GBIF’s taxonomic backbone is available at https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c. Plazi treatments can be downloaded on a daily basis from the RSS feed at http://tb.plazi.org/GgServer/xml.rss.xml. Each of Pensoft’s journals has a public API end-point of the form https://[journal_name].pensoft.net/lib/journal_archive.php?issue=xxx, where [journal_name] should be replaced with the name of the Pensoft journal, for example, zookeys, to make https://zookeys.pensoft.net/lib/journal_archive.php?issue=1000. We use these sources of input to periodically obtain data and store it on our local servers.
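For illustration, this harvesting step can be reproduced with base R alone; the issue number and output paths below are examples, not part of the production pipeline.

# Example download of the two article streams; paths and issue number are illustrative.
dir.create("input", showWarnings = FALSE)

# Plazi: RSS feed listing recently published treatments
download.file("http://tb.plazi.org/GgServer/xml.rss.xml",
              destfile = "input/plazi-feed.rss.xml", mode = "wb")

# Pensoft: one archive issue of a journal (here ZooKeys) as TaxPub XML
download.file("https://zookeys.pensoft.net/lib/journal_archive.php?issue=1000",
              destfile = "input/zookeys-issue-1000.xml", mode = "wb")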
2. Tools
In order to carry out the dataset generation, we made use of the following tools:
RDF4R and ROpenBio packages developed by us (https://github.com/pensoft/rdf4r, https://github.com/pensoft/ropenbio).
TSV4RDF, which is a PHP library for mapping CSV to RDF, developed by us (https://github.com/pensoft/tsv4rdf).
The OpenBiodiv base package (https://github.com/pensoft/OpenBiodiv).
In the rest of the section, we describe the transformation from XML as it is implemented in ROpenBio.
3. XML to RDF transformation
In order to transform an article represented as an XML document to RDF, we make use of the hierarchical nature of XML and solve the problem recursively with the Extractor procedure, shown in Fig.
Information extraction from the article XMLs
The atoms of an XML node consist of all text fields that can be reached from the XML node with an XPath expression (attribute values or text values) and can be directly converted to RDF as literals or identifiers. They all belong to one or to several related resources. For example, in Table
XML |
<contrib contrib-type="author" corresp="no"> <name name-style="western"> <surname>Zhang</surname> <given-names>Guanyang Zhang</given-names> </name> <uri content-type="orcid">https://orcid.org/0000-0003-4389-4270</uri> <xref ref-type="aff" rid="3">3</xref> </contrib> <aff id="A3"> <label>3</label> <addr-line>Florida Museum of Natural History, University of Florida, Gainesville, FL, USA</addr-line> </aff> |
RDF |
openbiodiv:51DE6A4F-4651-4540-A54D-21A307105405 rdf:type foaf:Person; rdfs:label "Guanyang Zhang"; foaf:surname "Zhang"; openbiodiv:affiliation "Florida Museum of Natural History, University of Florida, Gainesville, FL, USA"; datacite:hasIdentifier orcid:0000-0003-4389-4270. |
Divide-and-Conquer
After atom extraction, we proceed to transform the content of each atom to RDF. This is done with a recursive call to the Extractor for all nodes that are hierarchically dependent on the current node. For example, the article node contains all the other nodes, such as sections, figures, etc.
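A minimal sketch of this recursion, assuming the xml2 package (this is not the actual ROpenBio implementation), could look as follows; 'spec' stands for a schema node of the kind described in the next subsection.

# Simplified recursive Extractor: atoms first, then constructor, then recursion.
library(xml2)
library(uuid)

extractor <- function(node, spec, parent_id = NULL) {
  node_id <- paste0("openbiodiv:", UUIDgenerate())   # identifier of this resource

  # 1. Atoms: text fields reachable from the node via XPath expressions
  atoms <- lapply(spec$atoms, function(xp) xml_text(xml_find_all(node, xp)))

  # 2. Constructor: atoms plus the parent identifier become RDF statements
  triples <- spec$constructor(atoms, node_id, parent_id)

  # 3. Divide and conquer: recurse into all hierarchically dependent sub-nodes
  for (child_spec in spec$children) {
    sub_nodes <- xml_find_all(node, child_spec$xpath)
    for (i in seq_along(sub_nodes)) {
      triples <- c(triples,
                   extractor(sub_nodes[[i]], child_spec, parent_id = node_id))
    }
  }
  triples
}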
Transformation specification
In order for the Extractor to work, we therefore need to specify an XML schema. The specification lists the XML nodes we are looking for and their locations. It then recursively specifies, for each node, the sub-nodes we are looking for and their XPath locations relative to the parent node. Finally, for every node, we give the atom locations and write a constructor. The transformation specification is done with the R6 framework in R. We have specified two schemata that share some constructors: one for TaxPub (Pensoft) and one for TaxonX (Plazi).
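A hedged sketch of such a schema node specification, using R6 but with illustrative class, field and constructor names rather than the actual ROpenBio definitions:

library(R6)

XmlSchemaNode <- R6Class("XmlSchemaNode",
  public = list(
    xpath = NULL,        # location of the node relative to its parent
    atoms = NULL,        # named list of XPaths to text or attribute values
    children = NULL,     # list of XmlSchemaNode objects to recurse into
    constructor = NULL,  # function(atoms, node_id, parent_id) -> RDF statements
    initialize = function(xpath, atoms, children = list(), constructor) {
      self$xpath <- xpath
      self$atoms <- atoms
      self$children <- children
      self$constructor <- constructor
    }
  )
)

# e.g. a title node that could be shared by the TaxPub and TaxonX schemata
title_node <- XmlSchemaNode$new(
  xpath = ".//article-title",
  atoms = list(text = "."),
  constructor = function(atoms, node_id, parent_id) {
    triples <- c(sprintf("%s rdf:type doco:Title .", node_id),
                 sprintf("%s rdfs:label \"%s\" .", node_id, atoms$text))
    if (!is.null(parent_id))
      triples <- c(triples, sprintf("%s po:contains %s .", parent_id, node_id))
    triples
  }
)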
In the most recent release of OpenBiodiv-LOD and OpenBiodiv-O, we have introduced new resource types to represent biodiversity knowledge from the Materials Examined section and other elements of the article, such as links to the external genomic databases BOLD and GenBank, as well as to the ORCID database of researcher identifiers. These changes to the ontology were also reflected in the TaxPub and TaxonX schema objects in the ROpenBio package, as well as in the respective constructors which generate triples from the information extracted via these schemas.
RDF generation
The process of RDF generation has three parts: (1) setting unique identifiers for each resource, (2) ascribing semantic classes to each resource and (3) linking resources via RDF properties.
Setting identifiers is an essential step to ensure that each resource can be uniquely identified across Linked Open Data. We use a MongoDB database (
The get_or_set_mongoid function retrieves the identifier associated with the matched hash, so it can be re-used in semantic relations within the current RDF serialisation. If there is no matching record within the MongoDB database, the function generates a new universally-unique identifier using the UUIDgenerate function from the R package uuid (
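A simplified re-implementation of this idea, assuming the mongolite, digest and uuid packages (the collection and field names are illustrative, not the production schema):

library(mongolite)
library(digest)
library(uuid)

ids <- mongo(collection = "identifiers", db = "openbiodiv",
             url = "mongodb://localhost")

get_or_set_id <- function(key_string) {
  key_hash <- digest(key_string, algo = "sha256")      # fixed-length lookup key
  found <- ids$find(sprintf('{"hash": "%s"}', key_hash), limit = 1)
  if (nrow(found) > 0) {
    return(found$uuid[1])                               # re-use the stored identifier
  }
  new_uuid <- UUIDgenerate()
  ids$insert(sprintf('{"hash": "%s", "uuid": "%s"}', key_hash, new_uuid))
  new_uuid
}

# The same key string always resolves to the same OpenBiodiv URI
paste0("http://openbiodiv.net/", get_or_set_id("foaf:Person|Guanyang Zhang"))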
Organising resources into semantic classes, according to the OpenBiodiv-O ontology and creating links between them, is conceptually straightforward. For each atom, we know its type because the XML schema used to extract it contains a type field. Each type of resource has its own constructor function which generates RDF statements defining the resource types using rdf:type and links between resources. The author example is given in Table
It should be noted here that the semantics of certain node types, such as taxonomic name usage (reified as :TaxonomicNameUsage), reflect the relative position of the node in the XML document. For example, a taxonomic name usage may be inside a figure, inside an introduction section, inside a title etc. Therefore, besides the atoms, the constructor receives information about the relative position of the resource in the article by means of the unique identifier of the parent node(s). Then this information is encoded in RDF as given in Table
openbiodiv:570F0E79-5632-FF88-A155-73625E50C567 rdf:type fabio:JournalArticle ; prism:doi "10.3897/BDJ.4.e8150" ; dc:publisher "Pensoft Publishers" ; prism:publicationDate "2016-07-08"^^xsd:date ; dcterms:publisher openbiodiv:09EAAD23-3913-421E-9249-3FAAF1BA12DB . openbiodiv:0BD7ED36-1192-47A5-99F9-113998EF3099 rdf:type deo:Introduction ; po:isContainedBy openbiodiv:570F0E79-5632-FF88-A155-73625E50C567 . |
4. Submission to graph database and post-processing
The generated RDF statements are submitted to a repository in a GraphDB instance residing on http://graph.openbiodiv.net/. The repository, OpenBiodiv2020, has been initialised with OpenBiodiv-O*
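For completeness, a minimal sketch of the submission step, assuming the httr package and GraphDB's RDF4J-style REST endpoint (the exact repository URL may differ from the production set-up):

library(httr)

submit_trig <- function(trig_file,
                        endpoint = "http://graph.openbiodiv.net/repositories/OpenBiodiv2020/statements") {
  resp <- POST(endpoint,
               body = upload_file(trig_file, type = "application/x-trig"))
  stop_for_status(resp)   # fail loudly if GraphDB rejects the payload
  invisible(resp)
}

submit_trig("article-10.3897-BDJ.4.e8150.trig")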
Update rule for replacement name. We state that a scientific name A replaces a scientific name B, if there exists a taxonomic name usage of A with taxonomic status :ReplacementName and B is mentioned by a taxonomic name usage in the nomenclatural citations of the treatment, where the discussed taxonomic name usage of A is in the nomenclature section (Table
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX openbiodiv: <http://openbiodiv.net/> INSERT { GRAPH <http://openbiodiv.net/Updates> { ?name2 openbiodiv:replacementName ?name . } } WHERE { ?tnu1 dwciri:taxonomicStatus openbiodiv:ReplacementName ; pkm:mentions ?name. ?name rdfs:label ?vname ; dwc:verbatimTaxonRank ?rank. ?nomenclature rdf:type openbiodiv:NomenclatureSection; po:contains ?tnu1; po:contains ?citations. ?citations rdf:type openbiodiv:NomenclatureCitationsList; po:contains ?citation. ?citation po:contains ?tnu2 . ?tnu2 rdf:type openbiodiv:TaxonomicNameUsage ; pkm:mentions ?name2. ?name2 rdfs:label ?vname2. ?name2 dwc:verbatimTaxonRank ?rank. } |
Update rule for related name. The related names update rule is similar to the one for a replacement name: two scientific names A and B are considered related if they are both mentioned in the nomenclature section of a treatment (Table
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> INSERT { GRAPH <http://openbiodiv.net/Updates> { ?name2 openbiodiv:relatedName ?name . } } WHERE { ?nom_sec rdf:type openbiodiv:NomenclatureSection ; po:contains ?tnu1 . ?tnu1 rdf:type openbiodiv:TaxonomicNameUsage ; pkm:mentions ?name. ?nom_sec po:contains ?tnu2 . ?tnu2 rdf:type openbiodiv:TaxonomicNameUsage ; pkm:mentions ?name2. FILTER(?name != ?name2) } |
For example, the names Muscidae and Aethiopomyia Malloch, 1921 are considered related (Table
PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT * WHERE { ?name_1 openbiodiv:relatedName ?name_2 . ?name_1 rdfs:label ?label_1. ?name_2 rdfs:label ?label_2. } LIMIT 100 |
We shall illustrate and evaluate the LOD by issuing sample SPARQL queries illuminating aspects of it.
1. Simple queries
Query for author. Authors are instances of foaf:Person (except in the rare institutional case, in which they would be foaf:Agent). The SPARQL query in Table
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX foaf: <http://xmlns.com/foaf/0.1/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX fabio: <http://purl.org/spar/fabio/> SELECT (SAMPLE(?name) AS ?author_name) (COUNT(DISTINCT ?paper) as ?npapers) WHERE { ?author rdf:type foaf:Person ; rdfs:label ?name . ?paper dcterms:creator ?author . ?paper a fabio:ResearchPaper. } GROUP BY ?author ORDER BY DESC (?npapers) |
Query for a scientific name. Biological Latin names are stored in the system as :ScientificName and are mentioned by taxonomic name usages. Table
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> SELECT (SAMPLE(?name) AS ?scientific_name) (COUNT(DISTINCT ?tnu) AS ?nmentions) WHERE { ?s rdf:type openbiodiv:ScientificName ; rdfs:label ?name . ?tnu pkm:mentions ?s . } GROUP BY ?s ORDER BY DESC(?nmentions) |
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> SELECT ?label (COUNT(?tnu) AS ?nmentions) WHERE { ?s rdf:type openbiodiv:ScientificName ; rdfs:label ?label ; dwc:specificEpithet ?species ; dwc:genus ?genus . ?tnu pkm:mentions ?s . } GROUP BY ?s ?label |
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX fabio: <http://purl.org/spar/fabio/> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> SELECT (SAMPLE(?name) AS ?n) (COUNT(DISTINCT ?a) AS ?narticles) WHERE { ?s a openbiodiv:ScientificName ; rdfs:label ?name ; dwc:specificEpithet ?sp ; dwc:genus ?g . ?tnu pkm:mentions ?s . ?a po:contains ?tnu ; a fabio:JournalArticle . } GROUP BY ?s ORDER BY DESC(?narticles) |
Query the article structure. A unique feature of OpenBiodiv-LOD is that articles are broken down into their components (see Table 3) and taxonomic name usages are connected to the specific part of the article and not just to the article in general. Combining this feature with queries from the previous paragraph, we can, for example, look for the most mentioned scientific name in a figure (Table
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX doco: <http://purl.org/spar/doco/> SELECT (MAX(?name) AS ?scientific_name) (COUNT(DISTINCT ?a) AS ?nmentions) WHERE { ?s rdf:type openbiodiv:ScientificName ; rdfs:label ?name . ?tnu pkm:mentions ?s . ?a po:contains ?tnu . ?a rdf:type doco:Figure . } GROUP BY ?s ORDER BY DESC(?nmentions) |
PREFIX fabio: <http://purl.org/spar/fabio/> PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/> PREFIX doco: <http://purl.org/spar/doco/> PREFIX c4o: <http://purl.org/spar/c4o/> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> SELECT ?f WHERE { ?a a fabio:JournalArticle ; prism:doi "10.3897/mycokeys.1.1966" . ?f a doco:Figure . ?a po:contains ?f . } |
Query for taxonomic concepts. We can create a query uniting information from the GBIF Backbone Taxonomy with semantics coming from the article structure. The query in Table
PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/> PREFIX skos: <http://www.w3.org/2004/02/skos/core#> PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> SELECT * WHERE { ?n rdfs:label "Curculionidae" . ?c openbiodiv:scientificName ?n . ?s skos:broader ?c . ?s openbiodiv:scientificName ?sn . ?sn dwc:genus ?vgenus . ?tnu pkm:mentions ?name; dwciri:taxonomicStatus openbiodiv:TaxonomicDiscovery . ?name dwc:genus ?vgenus; rdfs:label ?verbatim . ?article po:contains+ ?tnu; prism:publicationDate ?date . } |
Fuzzy Queries via Lucene. The SPARQL endpoint of OpenBiodiv-LOD supports fuzzy matching via a Lucene connector (
Sample Lucene query via SPARQL. We have intentionally misspelled the person’s name.
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#> PREFIX lucene: <http://www.ontotext.com/connectors/lucene#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT * WHERE { ?search a inst:NewSearch-excluded ; lucene:query "label:Lubomir Penev" ; lucene:entities ?resource . ?resource lucene:score ?score ; rdfs:label ?label . } ORDER BY DESC (?score) |
Competency question answering via SPARQL
Validity of a name. Of central importance to biological nomenclature is the question of whether a given taxonomic name is valid or not. We shall consider a taxonomic name invalid if and only if at least one of the following invalidation criteria holds:
The name has been replaced: i.e. there is a :replacementName property originating in the name and there are no loops (it is impossible to follow the :replacementName edges and come back to the name). This query is illustrated in Table
The name has been invalidated: i.e. there is a taxonomic usage of the name with the status :UnavailableName and there is no newer taxonomic name usage revalidating it (:AvailableName). Illustrated in Table
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX openbiodiv: <http://openbiodiv.net/> ASK { ?name rdf:type openbiodiv:ScientificName ; rdfs:label "Pentatomidae" . ?name openbiodiv:replacementName ?replacementName . FILTER NOT EXISTS {?replacementName openbiodiv:replacementName ?anotherName .} } |
PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> ASK { ?tnu pkm:mentions ?name . ?name rdfs:label "Messerschmidia incana G. Mey. 1818" . ?tnu dwciri:taxonomicStatus openbiodiv:UnavailableName . ?article po:contains+ ?tnu . ?article prism:publicationDate ?date . FILTER NOT EXISTS { ?tnu2 pkm:mentions ?name . ?tnu2 dwciri:taxonomicStatus openbiodiv:AvailableName . ?article2 po:contains+ ?tnu2; prism:publicationDate ?date2 . FILTER (?date2 > ?date) } } |
The case of Museu Nacional de Rio de Janeiro (MNRJ). In order to illustrate the capabilities of OpenBiodiv and draw attention to the scientific impact of the tragically lost collection in the fire of the Museu Nacional de Rio de Janeiro (MNRJ), we can ask our system to give us the number of times a specimen from that collection was used in a taxonomic article and in which ones (Table
Impact of the fire in Museu Nacional de Rio de Janeiro (MNRJ) on biodiversity knowledge.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX fabio: <http://purl.org/spar/fabio/> PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/> SELECT ?institution_name (COUNT(?institution_code) AS ?times_mentioned) (COUNT(DISTINCT ?title) AS ?articles) (GROUP_CONCAT(DISTINCT ?title; SEPARATOR=", ") AS ?doi_of_articles) (GROUP_CONCAT(DISTINCT ?name; SEPARATOR=", ") AS ?names_mentioned) (COUNT (DISTINCT ?name) AS ?number_of_taxa) (COUNT(DISTINCT ?tnu) AS ?number_of_tnus) WHERE { BIND("Museu Nacional de Rio de Janeiro (MNRJ)" as ?institution_name) BIND ("MNRJ" as ?institution_code) ?treatment openbiodiv:institutionName|dwc:institutionCode|dwc:collectionCode ?institution_code . OPTIONAL { ?treatment openbiodiv:institutionName ?institution_name } OPTIONAL {?treatment dwc:institutionID <http://grbio.org/cool/zi1i-a0b5>} ?treatment (po:contains)|(po:contains/po:contains) ?tnu; a openbiodiv:Treatment. ?tnu pkm:mentions ?s. ?s a openbiodiv:ScientificName; rdfs:label ?name. ?article po:contains ?treatment ; rdf:type fabio:JournalArticle ; prism:doi ?title . } GROUP BY ?institution_name |
It turns out that MNRJ has been mentioned 362,062 times in our system in a total of 509 articles. Perhaps more interestingly, we can see specimens of which taxa may have been lost, have declining populations or are threatened by extinction. Examples include the insects (Xestoblatta, Charinus, Lamproclasiopa etc.) which are extinct, Keays's Rice Rats (Nephelomys keaysi) which have declining populations and many others for a total of 1,348 distinct names mentioned in taxonomic articles which reference MNRJ.
Specimen collection. Some of the most important information about biodiversity in an article is within the Materials Examined section. It contains information about the collection of biodiversity samples (specimens), the location where they were found, the taxonomists who identified them, their habitats, the institutions where the specimens are kept and much more. For example, we can query all people who have collected specimens belonging to the insect genus Zelus (Table
People who have collected specimens belonging to the insect genus Zelus.
PREFIX : <http://openbiodiv.net/> PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX frbr: <http://purl.org/vocab/frbr/core#> PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/> PREFIX dc: <http://purl.org/dc/elements/1.1/> PREFIX fabio: <http://purl.org/spar/fabio/> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX c4o: <http://purl.org/spar/c4o/> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> SELECT ?label ?recorder ?eventDate WHERE { ?article dc:title ?articleTitle ; po:contains ?treatment. ?treatment rdf:type openbiodiv:Treatment; po:contains ?materials; po:contains ?nomenclature. ?materials rdf:type openbiodiv:MaterialsExamined; dwc:occurrenceID ?occurrence; dwc:eventID ?event. ?occurrence dwc:recordedBy ?recorder. ?event dwc:eventDate ?eventDate. ?nomenclature rdf:type openbiodiv:NomenclatureSection; po:contains ?tnu. ?tnu pkm:mentions ?name. ?name rdfs:label ?label; dwc:genus "Zelus". } |
Institutional impact. We can use a SPARQL query to understand how collections from different institutions are used to describe taxa. In the example query, we have linked institutional identifiers with the treatments which mention them to find out the institutional impact per family (Table
PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX : <http://www.essepuntato.it/2008/12/pattern#> SELECT ?family (COUNT(?treatment) AS ?ntreatments) ?inst ?instName WHERE { ?tnu pkm:mentions ?scName. ?scName dwc:family ?family. ?treatment po:contains ?tnu; a openbiodiv:Treatment; dwc:institutionID ?inst. ?inst a openbiodiv:Institution; openbiodiv:institutionName ?instName. } GROUP BY ?inst ?instName ?family |
Links between holotype descriptions (literature), institutions holding the holotypes and genomics. OpenBiodiv integrates information about examined specimens and taxonomic descriptions from literature with external identifiers of institutions holding the specimens, as well as records identifiers pertaining to the genomic sequences of the specimens. We can retrieve this information with the SPARQL query in Table
PREFIX datacite: <http://purl.org/spar/datacite/> PREFIX openbiodiv: <http://openbiodiv.net/> PREFIX deo: <http://purl.org/spar/deo/> PREFIX doco: <http://purl.org/spar/doco/> PREFIX po: <http://www.essepuntato.it/2008/12/pattern#> PREFIX pkm: <http://proton.semanticweb.org/protonkm#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dwc: <http://rs.tdwg.org/dwc/terms/> PREFIX fabio: <http://purl.org/spar/fabio/> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/> SELECT ?materialsExamined ?genomicLabel ?system ?name ?label ?institution ?doi WHERE { ?genomicIdentifier datacite:usesIdentifierScheme ?system; rdfs:label ?genomicLabel. FILTER (?system IN (datacite:genbank, datacite:boldsystems)) . ?materialsExamined openbiodiv:mentionsIdentifier ?genomicIdentifier; a openbiodiv:MaterialsExamined; po:contains ?holotypeDescr. ?holotypeDescr a openbiodiv:HolotypeDescription. ?treatment po:contains ?materialsExamined; a openbiodiv:Treatment; po:contains ?nomenclature; dwc:institutionID ?institution. ?nomenclature a openbiodiv:NomenclatureSection; po:contains ?tnu. ?tnu pkm:mentions ?name. ?name rdfs:label ?label. ?article a fabio:JournalArticle; po:contains ?treatment; prism:doi ?doi. } |
Fulfilment of the Principles of Linked Open Data
Linked Open Data (
Use URIs as names for things.
Use HTTP URIs so people can look up these things.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
Include links to other URIs so they can discover more things.
We have followed these guidelines when creating the OpenBiodiv-LOD. We will now discuss each of these points separately.
Usage of URIs as resource identifiers. Every instance in OpenBiodiv-LOD is uniquely identifiable by an HTTP URI of the following form: http://openbiodiv.net/uuid-(suffix). All instance identifiers in OpenBiodiv-LOD follow this schema. The optional suffix field is assigned only to resources extracted from GBIF.
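As an illustration of this pattern, a small R helper could mint GBIF-derived URIs of the form uuid-gbifKey-marker; the exact suffix composition (e.g. the "scName" marker seen in the earlier Zelus example) is shown here only indicatively.

library(uuid)

gbif_uri <- function(gbif_key, marker = "scName",
                     base = "http://openbiodiv.net/") {
  paste0(base, UUIDgenerate(), "-", gbif_key, "-", marker)
}

gbif_uri(9146644)
# e.g. "http://openbiodiv.net/f1dd0cf0-217d-422b-baa4-58901976d7b4-9146644-scName"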
Identifiers for resources from Pensoft and Plazi. During the RDF-isation of the sources Pensoft and Plazi, when a new concept is discovered (e.g. a person, a scientific name etc.), a UUID is generated. Then the resource is always referred to in the database by this UUID in the OpenBiodiv namespace, http://openbiodiv.net/. Pensoft and Plazi furthermore share the UUID part of the identifier in the semi-structured representation of treatments. For example, Lyubomir Penev is a resource identified by http://openbiodiv.net/416FDF84-1029-4115-B43F-E9E734004489.
Identifiers for GBIF taxonomic concepts. GBIF offers its taxonomic backbone as a DarwinCore (
Usage of HTTP URIs and dereferencing. As per the Linked Data principles, we use dereferenceable HTTP URIs for our resources. For example, if a web browser opens http://openbiodiv.net/416FDF84-1029-4115-B43F-E9E734004489, a webpage is displayed (Fig.
Visualisation of a semantic resource via a template on the OpenBiodiv website. The figure shows information related to a Person resource which is displayed when the resource is resolved.
Linking to other resources. All resources in OpenBiodiv form a graph (there are no disconnected parts), following a data model discussed in the next subsection. Second, resources are linked to external databases via properties like datacite:hasIdentifier and openbiodiv:mentionsIdentifier. These identifiers can be: GBIF IDs, ZooBank IDs, Zenodo IDs, ORCIDs, BOLD BINs, BOLD Records or GenBank accession numbers. We have created links between people and their ORCID records, publications and their GBIF dataset records, as well as Zoobank records and genomic records within BOLD and GenBank. See Table
Data Model
When creating the RDF graph, we have conformed to the OpenBiodiv-O ontology described in
OpenBiodiv-O is an ontology that links the publishing domain with the biodiversity domain. Major resource types covered by each of the ontology families are given in the box below the Venn diagram. Each of them is present in the OpenBiodiv-O ontology as a class. Important resources from the publishing domain are listed in the left-most column and from biodiversity informatics in the right-most column. The middle one covers important OpenBiodiv-O resources.
SPAR provides facilities to deal with the dichotomy between the abstract representation of knowledge through the class Work and its concrete representation through the class Expression. For example, a fabio:JournalArticle can be the realisation of a fabio:ResearchPaper. On the other hand, the DwC community standard gives a standard way to express properties from taxonomy and biodiversity science.
In the most recent version of OpenBiodiv-LOD (
Performance
The current iteration of the database holds over 360 million triples. The expansion ratio under the RDFS-Plus (Optimised) ruleset is 1.20, i.e. for each asserted statement, we materialise, on average, 1.20 implicit statements. In a previous release of the dataset, which used the OWL2-RL rule-set, the most complex rule-set supported by GraphDB, the expansion ratio was about 3.7; however, we encountered significant performance issues using it. Even with the lighter rule-set (RDFS-Plus Optimised) (
We observed a steady increase of implicit (inferred) statements during the upload of new triples. An example of such an inferred statement is :A po:contains :B, generated from the statement :B po:isContainedBy :A because po:isContainedBy is an inverse property of po:contains. Upon closer inspection, it turned out that the import of external ontologies, in addition to OpenBiodiv-O, leads to the generation of superfluous inferred statements. For instance, in the SKOS ontology, the property skos:exactMatch is transitive and is also a sub-property of skos:closeMatch. The same ontology defines that skos:closeMatch is a sub-property of skos:mappingRelation. Therefore, after importing the SKOS ontology, GraphDB infers that all treatments which have the property skos:exactMatch (these are only Plazi treatments for which we have information about their Plazi treatment id, for example, openbiodiv:03894A65-5824-FFE9-571B-B65D2F47F95E skos:exactMatch plazi_treatment:03894A65-5824-FFE9-571B-B65D2F47F95E), also have an additional statement with property skos:mappingRelation. This inferred statement does not actually bring any new semantic information to the knowledge graph, hence we consider it superfluous.
We came to the conclusion that all necessary RDF logic is stored in OpenBiodiv-O and does not require the import of other ontologies, since OpenBiodiv-O already includes the essential relations from these ontologies. Therefore, in the latest release of the repository, we have only imported the OpenBiodiv-O ontology.
Another important aspect of performance is the RDF-isation time, or the time it takes to convert a single XML document into RDF in TriG serialisation and to upload it to the database. Our observations show that the most time-consuming part of this process is the MongoDB requests used to get and set resource identifiers. Even though they are an improvement over the previous model, which used queries to GraphDB to obtain and set identifiers, MongoDB requests can add up to a significant amount of time per XML document. We noted that adding a MongoDB index and using it to search for text content does not improve the speed at all. As an alternative solution, we now use sha256 hashing to compact the value strings associated with identifiers to a fixed-length hash string. This method is explained in detail in the Methods section.
The generated dataset OpenBiodiv-LOD, similar to the expanded ontology OpenBiodiv-O, is already a solid resource for biologists, as it includes information from most articles published by Pensoft and Plazi and counts over 360 million RDF triples.
An important conclusion that can be drawn from this work is that it is possible to use a semantic graph for the integration of a large volume of data on biodiversity. We were unexpectedly given the opportunity to illustrate the power of the knowledge graph by analysing the damage from the tragic fire at the Museu Nacional in Rio de Janeiro. In addition, we have illustrated that it is possible to write relatively simple logical rules to check the validity of a taxonomic name.
Due to the large amount of data, we found that, although the use of a semantic graph was possible, some of the initially-chosen technologies proved to be inapplicable or difficult to apply. We observed that the practical application of the full OWL logical model is difficult due to performance problems. Instead, in the end, we utilised RDFS-based inference, which is less powerful, but faster. In addition, we found that triple stores are not a universal solution to all data integration problems, but can be used in combination with other database technologies (e.g. MongoDB) to efficiently store and query semantic resources.
A great difficulty was the disambiguation of resources, such as author names or taxonomic names. In the functional design of the RDF4R package, we have input modules that allow us to insert a list of functions/rules for disambiguation when searching for an identifier for a given resource. However, we had only limited success with rule-based disambiguation and, for this reason, it has been discontinued in the production system for the moment.
Considering these and other "lessons", the future development of the OpenBiodiv-LOD project can be outlined in the following way:
We envision OpenBiodiv-LOD as an integral part of the existing semantic network of biodiversity knowledge, based on HTTP identifiers and controlled vocabularies. By semantically enhancing and linking the knowledge in OpenBiodiv to existing machine-readable data, we augment biodiversity data quality and increase the potential for its reuse.
This research received funding from the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Sklodowska-Curie grant agreements BIG4 (No 642241) and IGNITE (No 764840).
M.D. authored the final draft of the manuscript and leads the development of the OpenBiodiv system.
V.S. led the original effort on the OpenBiodiv system and prepared the first draft of the manuscript. T.G. and G.Z. were involved in the development of the OpenBiodiv system. L.P. supervised the development of the OpenBiodiv system and edited the manuscript.