Biodiversity Data Journal: Taxonomy & Inventories
Corresponding author: Giulio Montanaro (giuliomontanaro98@gmail.com), Sergei Tarasov (sergei.tarasov@helsinki.fi)
Academic editor: Panakkool Thamban Aneesh
Received: 23 Feb 2024 | Accepted: 22 May 2024 | Published: 13 Jun 2024
© 2024 Giulio Montanaro, James Balhoff, Jennifer Girón, Max Söderholm, Sergei Tarasov
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Montanaro G, Balhoff JP, Girón JC, Söderholm M, Tarasov S (2024) Computable species descriptions and nanopublications: applying ontology-based technologies to dung beetles (Coleoptera, Scarabaeinae). Biodiversity Data Journal 12: e121562. https://doi.org/10.3897/BDJ.12.e121562
Taxonomy has long struggled with analysing vast amounts of phenotypic data due to computational and accessibility challenges. Ontology-based technologies provide a framework for modelling semantic phenotypes that are understandable by computers and compliant with FAIR principles. In this paper, we explore the use of Phenoscript, an emerging language designed for creating semantic phenotypes, to produce computable species descriptions. Our case study centers on the application of this approach to dung beetles (Coleoptera, Scarabaeinae).
We illustrate the effectiveness of Phenoscript for creating semantic phenotypes. We also demonstrate the ability of the Phenospy Python package to automatically translate Phenoscript descriptions into natural language (NL), which eliminates the need for writing traditional NL descriptions. We introduce a computational pipeline that streamlines the generation of semantic descriptions and their conversion to NL. To demonstrate the power of the semantic approach, we apply simple semantic queries to the generated phenotypic descriptions. This paper addresses the current challenges in crafting semantic species descriptions and outlines the path towards future improvements. Furthermore, we discuss the promising integration of semantic phenotypes and nanopublications, as emerging methods for sharing scientific information. Overall, our study highlights the pivotal role of ontology-based technologies in modernising taxonomy and aligning it with the evolving landscape of big data analysis and FAIR principles.
Phenoscript, taxonomy, semantic data, phenotypic traits, characters, morphology, Grebennikovius, microCT
Taxonomists have produced vast amounts of phenotypic data through species descriptions published in numerous papers and monographs. Yet, scientists outside taxonomy largely under-utilise this resource because it is challenging to comprehend these data and analyse them computationally (
Ontology-based technologies have emerged as a promising solution to this challenge (
However, ontology-driven modelling of species descriptions remains challenging due to their more flexible nature compared with character matrices for phylogenetic analyses. Furthermore, previous semantic approaches to phenotypes were mostly class-based in which phenotypic statements were expressed as ontology classes (
In this paper, we aim to explore the utility of an individual-based approach by semantically describing four species of dung beetles from the genus Grebennikovius as a case study (
To accelerate the creation of semantic species descriptions, we apply Phenoscript, a newly designed computer language. Phenoscript enables constructing knowledge graphs from textual code in the text editor VS Code, using the respective Phenoscript plugin. The Phenoscript code is then converted into the Web Ontology Language (OWL), a standard format for working with ontologies, allowing for computational comparisons and analyses of semantic data. This conversion is mediated by Phenospy, a Python package that also translates OWL phenotypes into annotated NL descriptions for publication and traditional scientific communication.
Phenoscript and Phenospy, still in development, are assessed in this study for their practicality and effectiveness in managing phenotypic data. This is the second paper in the series that tests Phenoscript (
The data files and scripts necessary to reproduce the results of this study are available as Supplementary material that can be accessed either through Zenodo or via the GitHub repository.
For this proof-of-concept study, we selected the dung beetle genus Grebennikovius (Coleoptera, Scarabaeinae), recently revised by
To observe and illustrate morphological characters in great detail, we obtained micro-CT images of a specimen of Grebennikovius basilewskyi (Balthasar, 1960). Imaging was conducted at the Finnish Museum of Natural History LUOMUS (University of Helsinki) using a Nikon XT H 225 with the following settings: multi-metal target with molybdenum setting, 70–100 kV beam energy, 70–100 µA beam current, 1420 ms exposure time and 4476 projection images with four frames of averaging per projection. Detector binning was set to 1x1, gain to 24 dB and white target to 60k. The complete scan time was approximately seven hours and the resulting voxel size of the dataset was 2.998 µm. The volumetric dataset was reconstructed from the projection images using Nikon CT pro-3D Version XT 6.9.1 and the dataset was exported to VGSTUDIO MAX 2023.2 (Volume Graphics GmbH, Heidelberg, Germany) in 16-bit format. Excess material was excluded from the dataset. The dataset was visualised using volume renderer (Phong) and aligned correctly. Images from the sample were rendered from anterodorsal, dorsal, lateral, posterior and ventral views. The aedeagus of G. basilewskyi in Fig.
Visual guide to positional terminology; see the section "Anatomically consistent positional conventions" for details.
Grebennikovius basilewskyi in dorsal and ventral views. Numbers next to arrows indicate patterns of phenotype statements explained in the section "Phenoscript: main patterns of phenotype statements". Arrow numbers from T1 to T5 illustrate individual body parts.
To describe species semantically, we employed the Phenoscript language powered by its dedicated plugin for VS Code (Fig.
For this study, we used the ontologies listed in Table
Ontologies used in the species descriptions. For details, see the OBO Foundry repository https://obofoundry.org.
Ontology | URI | Description
Ontology for the Anatomy of the Insect SkeletoMuscular system (AISM) | http://purl.obolibrary.org/obo/aism.owl | General anatomy of insects, includes terms such as “pronotum”, “wing”.
Coleoptera Anatomy Ontology (COLAO) | | Anatomy of Coleoptera, for example, “elytron”, “mesoventrite”.
Phenoscript Ontology (PHS) | | Phenoscript metadata, for example, "has trait", "OTU Block".
Phenotype And Trait Ontology (PATO) | | Phenotypic qualities, for example, “red”, “convex”, “length”, "setose".
Biological Spatial Ontology (BSPO) | | Spatial regions of anatomical parts, for example, “distal region”, “ventral side”.
Comparative Data Analysis Ontology (CDAO) | | Taxon metadata, for example, “TU” (taxonomic unit).
Information Artifact Ontology (IAO) | | Information entities, for example, “denotes”.
Relation Ontology (RO) | | Mostly relationships between anatomical parts and qualities, for example, “part of”, “has characteristic”.
Units of measurement ontology (UO) | | Units of measurement, for example, "millimeter".
Biological Collections Ontology (BCO) | | Darwin Core terms, for example, "catalogNumber", "TaxonID".
Uberon multi-species anatomy ontology (UBERON) | | General anatomy terms, for example, "female organism", "adult organism".
Taxonomic rank vocabulary (TAXRANK) | | Taxonomic rank terms, for example, "species".
Writing in Phenoscript resembles composing natural language (NL) descriptions: the language has its own syntax, but that syntax stays close to NL. The language documentation and tutorials are available on the Phenoscript repository. The initial step typically involves setting up a YAML configuration file to specify author names, project title and the ontologies to be used. As a next step, Phenospy can generate snippets for the necessary ontology terms. Snippets, which are ontology terms or small blocks of Phenoscript code, can be selected from a drop-down menu that appears in the Phenoscript description once the user types the first letters. Once the snippets are ready, the user can begin coding semantic descriptions in VS Code using the Phenoscript plugin. For convenience, we present below an overview of the major character patterns used in describing species of Grebennikovius, both in NL and in Phenoscript (see the section "Phenoscript: main patterns of phenotype statements").
Once the Phenoscript description is complete, it can be processed and analysed as outlined in the pipeline shown in Fig.
The pipeline (Fig.
Step 1. Once the description is written as a Phenoscript file, it must be converted into OWL format using the Phenospy package, which provides the necessary command-line tools for this conversion. This creates the ABox component of the ontology for further processing.
Step 2. This stage involves validating the OWL file with SHACL (Shapes Constraint Language) to ensure that semantic data satisfy the requirements of the data models employed by the user. SHACL is a conventional tool for validating RDF graph patterns against a set of predefined criteria. As an example, in our context, these criteria require that all phenotypes are linked to species names and include the necessary metadata. We used the SHACL command-line interface provided by the Apache Jena framework. Proceed to the next step if validation succeeds. If it does not, return to the Phenoscript description and correct it.
Step 3. Make a TBox file by downloading and merging all the source ontologies used to create semantic descriptions. This step is automated using Phenospy and ROBOT (
Step 4. Perform ontology reasoning using the ABox (step 1) and TBox (step 3) files. This step is mediated by the materializer tool which uses the whelk reasoner. Ontology reasoning refers to the process of deriving logical conclusions from a set of asserted facts or axioms within an ontology and knowledge graph. Reasoning is used to logically validate the ontology and infer the class membership of the individuals in the ABox.
Logical validation checks that the ontology contains no contradictions in its structure, definitions or relationships between its entities. An ontology that passes this check is referred to as "consistent". If the ontology is found to be inconsistent at this stage, it is most likely because there are logical errors within the semantic descriptions that need to be corrected. Additionally, class inference generates new data from the initial assertions, which can be used for downstream semantic queries. If the validations at steps 2 and 4 are successful, the user can proceed to the next stages.
Step 5. Using Phenospy, automatically generate the annotated NL description from the OWL file. See the section below for more information.
Step 6. Perform semantic queries to extract trait data from the descriptions. See the section below for more information.
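The logic of steps 2 and 4 can be illustrated with a minimal, standard-library sketch. It is not the pipeline itself, which relies on Apache Jena SHACL, materializer and whelk. All identifiers below (the `ex:` terms, the `ex:described_in` link and both function names) are hypothetical and chosen only for illustration, and the toy TBox allows a single parent per class.

```python
# Toy ABox: (subject, predicate, object) triples about individuals.
# Every identifier here is hypothetical, not taken from the real ontologies.
ABOX = {
    ("ex:pronotum_1", "rdf:type", "ex:pronotum"),
    ("ex:pronotum_1", "ex:described_in", "ex:species_A"),
    ("ex:elytron_1", "rdf:type", "ex:elytron"),
    ("ex:elytron_1", "ex:described_in", "ex:species_A"),
    ("ex:species_A", "rdf:type", "ex:TU"),
}

# Toy TBox: direct subclass axioms, child class -> parent class.
TBOX = {
    "ex:pronotum": "ex:sclerite",
    "ex:elytron": "ex:wing",
    "ex:sclerite": "ex:anatomical_structure",
    "ex:wing": "ex:anatomical_structure",
}


def unlinked_phenotypes(abox, link="ex:described_in", taxon_class="ex:TU"):
    """Shape-like check (cf. step 2): typed non-taxon individuals lacking a taxon link."""
    taxa = {s for s, p, o in abox if p == "rdf:type" and o == taxon_class}
    typed = {s for s, p, o in abox if p == "rdf:type"}
    linked = {s for s, p, o in abox if p == link and o in taxa}
    return sorted(typed - taxa - linked)


def superclasses(cls, tbox):
    """Return cls followed by all of its ancestors in the toy TBox."""
    chain = []
    while cls is not None and cls not in chain:
        chain.append(cls)
        cls = tbox.get(cls)
    return chain


def infer_types(abox, tbox):
    """Class inference (cf. step 4): add rdf:type triples for every ancestor class."""
    inferred = set(abox)
    for s, p, o in abox:
        if p == "rdf:type":
            for ancestor in superclasses(o, tbox):
                inferred.add((s, "rdf:type", ancestor))
    return inferred


if __name__ == "__main__":
    print("individuals failing the shape check:", unlinked_phenotypes(ABOX))
    new_triples = infer_types(ABOX, TBOX) - ABOX
    for triple in sorted(new_triples):
        print("inferred:", triple)
```

In this sketch, an individual that has a type but no link to a taxon is reported by `unlinked_phenotypes`, mirroring the requirement that all phenotypes are linked to species names. The triples added by `infer_types` are the kind of derived class memberships that downstream queries can rely on.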
NL descriptions were created using Phenospy's algorithm, which traverses the knowledge graph encoded in an OWL file and translates the graph patterns into NL. The algorithm searches for character patterns, such as the presence or absence of anatomical entities or their measurements, and then translates them into human-readable NL text.
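As a rough illustration of the idea of graph traversal followed by pattern translation, the sketch below walks a toy graph and emits entity-quality style lines. It is not Phenospy's algorithm: the graph, the labels and the `describe` function are assumptions made for this example, and only two edge types are handled.

```python
# Toy knowledge graph as an ordered list of (subject, edge, object) triples.
# Individuals and labels are hypothetical.
GRAPH = [
    ("ex:organism_1", "has_part", "ex:pronotum_1"),
    ("ex:pronotum_1", "has_characteristic", "ex:convex_1"),
    ("ex:organism_1", "has_part", "ex:elytron_1"),
    ("ex:elytron_1", "has_characteristic", "ex:black_1"),
]

LABELS = {
    "ex:organism_1": "organism",
    "ex:pronotum_1": "pronotum",
    "ex:elytron_1": "elytron",
    "ex:convex_1": "convex",
    "ex:black_1": "black",
}


def describe(node, graph, labels, depth=0):
    """Return indented 'entity: quality;' lines for node and all of its parts."""
    qualities = [
        labels[o] for s, p, o in graph
        if s == node and p == "has_characteristic"
    ]
    text = labels[node]
    if qualities:
        text += ": " + ", ".join(qualities)
    lines = ["  " * depth + text + ";"]
    # Recurse into parts so the output mirrors the part-whole hierarchy.
    for s, p, o in graph:
        if s == node and p == "has_part":
            lines.extend(describe(o, graph, labels, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(describe("ex:organism_1", GRAPH, LABELS)))
```

The indentation follows the part-whole hierarchy, which is one simple way to obtain hierarchical trait statements from a graph.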
Generated NL descriptions consist of hierarchical trait statements that usually resemble entity-quality syntax (
In order to demonstrate the ease with which phenotypic information can be retrieved from our descriptions, we employed two sets of semantic queries. The first set aimed to determine the number of individuals per species associated with an ontology class representing one of the following phenotypic characteristics: colour, shape, size and texture (Table
Entities \ Species | G. armiger | G. basilewskyi | G. lupanganus | G. pafelo
1. colour (PATO:0000014) | 3 | 5 | 3 | 3
2. shape (PATO:0000052) | 32 | 25 | 28 | 34
3. size (PATO:0000117) | 24 | 24 | 22 | 24
4. texture (PATO:0000150) | 2 | 1 | 2 | 2
5. insect head (AISM:0000107) or its parts | 23 | 23 | 20 | 20
6. insect thorax (AISM:0000108) or its parts | 65 | 74 | 58 | 70
7. insect abdomen (AISM:0000109) or its parts | 11 | 15 | 12 | 11
8. insect leg (AISM:0000031) or its parts | 29 | 32 | 27 | 31
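Counts of the kind reported in the table above come from asking, for each species, how many individuals belong to a given class or to one of its subclasses. The sketch below shows that counting logic in plain Python standing in for a semantic query language. The data, the `ex:` class names and the function names are hypothetical, and the numbers it produces are unrelated to the study's results.

```python
from collections import Counter

# Toy subclass axioms (child -> parent); hypothetical terms.
SUBCLASS_OF = {
    "ex:black": "ex:colour",
    "ex:brown": "ex:colour",
    "ex:convex": "ex:shape",
}

# Toy individuals: identifier -> (species, asserted class).
INDIVIDUALS = {
    "ex:q1": ("species A", "ex:black"),
    "ex:q2": ("species A", "ex:convex"),
    "ex:q3": ("species B", "ex:brown"),
    "ex:q4": ("species B", "ex:black"),
}


def is_a(cls, target, subclass_of):
    """True if cls equals target or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = subclass_of.get(cls)
    return False


def count_per_species(target, individuals, subclass_of):
    """Count individuals per species whose class falls under target."""
    counts = Counter()
    for species, cls in individuals.values():
        if is_a(cls, target, subclass_of):
            counts[species] += 1
    return dict(counts)


if __name__ == "__main__":
    for target in ("ex:colour", "ex:shape"):
        print(target, count_per_species(target, INDIVIDUALS, SUBCLASS_OF))
```

Queries over body regions "or their parts" would follow the same pattern, with a part-of hierarchy taking the place of the subclass hierarchy.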
The second set of queries focused on determining the number of individuals associated with the major body parts: head, thorax, abdomen and leg (Table
We expanded the Insect Anatomy Ontology (AISM;
New AISM terms (110 terms, v.2023-04-14 to v.2024-05-11) describe the more general insect body plan and most of them are applicable to a range of insect orders. These terms cover appendage subdivisions (specific tarsomeres, flagellomeres, palpomeres etc.), specific cuticular protrusions (protibial and clypeal teeth, carinae, etc.) and different types of cuticular punctures.
Punctures are particularly important characters in dung beetle taxonomy, especially at the interspecific level (
Conversely, the new terms introduced in COLAO (42 terms, v.2023-03-30 to v.2024-02-14) are more specific to beetles, particularly to Scarabaeoidea and Scarabaeinae. These terms include descriptions of elytral striation, pronotal protrusions and depressions and genital structures, such as parameres and endophallites (
Notably, we use the term “mesometaventral sulcus” instead of the more traditional “meso-metasternal suture”, since the use of meso- and metaventrite should be preferred over meso- and metasternum (
Traditional positional terminology used in Scarabaeinae taxonomy sometimes does not reflect the true positional relationships between the insect body and its parts. For example, in most scarab beetles (Scarabaeoidea), legs are flattened anteroposteriorly and rotated forwards (fore legs) or backwards (middle and hind legs), making it more intuitive (but incorrect) to use dorsal and ventral instead of anterior and posterior for referring to their broad, flattened sides and the reverse for the narrow dorsal and ventral sides (see Fig.
Another example can be found in the parameres of the male genitalia (Fig.
The revised interpretation of position has been implemented in the natural language (NL) descriptions by
In the examples below, we provide traditional NL statements followed by their equivalent Phenoscript statements (in italics), for the main types of semantic traits used in our descriptions. The "note" section briefly explains the rationale for the use of certain semantic constructs.
In a nutshell, a Phenoscript statement consists of a sequence of nodes and edges (i.e. relationships) in a knowledge graph. Edges are defined by a preceding dot "." symbol. Each statement should begin and end with a node. The nodes are followed by edges, which specify the relationships between the nodes. Node–edge sequences may be as long as necessary. The semantic statement is closed with a semicolon ";". Every ontology term is prefixed with the abbreviation of its originating ontology, functioning as a namespace. For instance, the term "aism-cuticular_carina" signifies that "cuticular carina" comes from the AISM ontology. This prefixing helps to clearly identify the source ontology of each term.
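The rules in the paragraph above (alternating nodes and dot-prefixed edges, a closing semicolon, an ontology prefix on each term) can be made concrete with a small parser sketch. This is not the Phenospy parser. It assumes a simplified, whitespace-separated form, and the example statement is hypothetical and built only from terms mentioned in this paper. Real Phenoscript has further syntax, such as aliases, that is not handled here.

```python
def parse_statement(statement):
    """Split a simplified statement into (node, edge, node) triples."""
    text = statement.strip()
    if not text.endswith(";"):
        raise ValueError("a statement must be closed with ';'")
    tokens = text[:-1].split()
    # A valid sequence is node (edge node)*, so the token count is odd.
    if not tokens or len(tokens) % 2 == 0:
        raise ValueError("a statement must begin and end with a node")
    triples = []
    for i in range(1, len(tokens), 2):
        subject, edge, obj = tokens[i - 1], tokens[i], tokens[i + 1]
        if not edge.startswith("."):
            raise ValueError(f"expected an edge with a leading dot, got {edge!r}")
        if subject.startswith(".") or obj.startswith("."):
            raise ValueError("an edge must be surrounded by nodes")
        triples.append((subject, edge[1:], obj))
    return triples


def namespace(term):
    """Return the ontology abbreviation that prefixes a term."""
    return term.split("-", 1)[0]


if __name__ == "__main__":
    # Hypothetical statement in the simplified form assumed by this sketch.
    example = (
        "aism-pronotum .has_part aism-cuticular_carina "
        ".has_characteristic pato-convex;"
    )
    for triple in parse_statement(example):
        print(triple)
    print(namespace("aism-cuticular_carina"))
```

Each pair of neighbouring nodes, together with the edge between them, becomes one triple, which is how a linear statement maps onto a path in the knowledge graph.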
Most edges refer to object properties, such as "has_part", "has_characteristic" and their inverses. To improve readability, Phenoscript uses aliases instead of long labels. The following aliases are used:
We encourage the reader to consult the Phenoscript language guide for syntax details. For more information about semantic characters, we suggest the Phenoscape Guide to Character Annotation and other relevant materials (
Presence Phenotypes
Grebennikovius basilewskyi in different views. Numbers next to arrows indicate patterns of phenotype statements explained in the section "Phenoscript: main patterns of phenotype statements". Arrows T6–T12 illustrate individual body parts.
Absence Phenotypes
Count Phenotypes
Qualitative Phenotypes
Absolute and Relative Measurement Phenotypes
Relative Comparison Phenotypes (within the same species)
Relative Comparison Phenotypes (between species)
Other Phenotypes
The asterisk (*) next to "hind wing: length;" denotes an incomplete conversion from OWL to generated NL, suggesting that the Phenospy algorithm requires refinement. The correct statement should read "the length of the hind wing is smaller than that of the male of G. basilewskyi".
The asterisk (*) next to "hind wing of male organism" denotes an improper conversion to NL. The statement should not appear in the generated NL description, but does so due to similar errors elsewhere.
The asterisk (*) next to "hind wing: length;" denotes an incomplete conversion to NL; see the description of G. armiger for details.
The asterisk (*) next to "hind wing: length;" denotes an incomplete conversion to NL; see the description of G. armiger for details.
In this work, we assessed the utility of the Phenoscript language for creating taxonomic descriptions of four Grebennikovius species, based on an individual-based approach. We initially wrote the descriptions in Phenoscript code and subsequently converted them to ontology format (OWL) and annotated NL text.
The ontology format represents a semantic description as a knowledge graph (i.e. the ABox composed of ontology individuals), where nodes indicate anatomical structures, their metadata (or characteristics) and edges indicate their relationships. The nodes and edges come from pre-selected ontologies. To create the semantic descriptions for this study, we used twelve different ontologies (Table
While the ontology format is not easily understandable for humans, it is essential for making the descriptions semantically queryable. Using simple queries, we demonstrated the practical application of the semantic approach. With them, we were able to obtain the number of individuals for the selected ontology classes (Table
In contrast to the ontology format, the generated description in NL format was included in this publication to facilitate human-friendly reading. The NL format annotates all phenotypic terms with hyperlinks, allowing the reader to access the term's definition, properties and relationships directly through a web browser.
Generally, the algorithm for converting OWL to NL in Phenospy worked well since most of the statements are easily readable. However, the algorithm could not properly convert four semantic statements (one for each species) dealing with relative comparisons between species into NL. In the taxon treatments above, these statements are indicated by an asterisk "*" and discussed in the "Note" section therein.
Processing semantic descriptions involves several steps and requires the use of a variety of software programmes. To streamline this process, we created an openly available computational pipeline using the makefile tool (Fig.
We also created four nanopublications (see the section "Nanopublications") using the nanodash tool, accessible via the Biodiversity Data Journal (BDJ) portal. With this service, nanopublications can easily be created and integrated with BDJ. Our nanopublications specify that each Grebennikovius species inhabits a forest environment.
The exploration of ontology-based technologies in our study highlights their significant potential for modelling computable phenotypes and species descriptions, effectively integrating taxonomy into the domain of phenomics (
Our paper assessed the utility of Phenoscript, an emerging language for semantic descriptions and its associated tools for producing semantic data using an individual-based approach. Our results demonstrated the effectiveness of Phenoscript in creating semantic descriptions thanks to its syntax that is similar to NL expressions. In addition, the syntax was also improved over the previous version of the language (
The proposed computational pipeline automates the production of semantic descriptions and can be applied to any other taxon using a desktop computer. Due to our focus on dung beetles, the developed approach can be specifically applied to them with ease, enabling further semantic-based research in this group. Thus, the proposed semantic approach opens up possibilities for new types of publications, where taxonomists can semantically re-describe known species in order to unlock their traits for other cross-disciplinary research within biology.
Semantic descriptions, however, are currently slower to write than traditional NL ones for a number of reasons. A shortage of comprehensive educational resources and a relatively small community create a high initial barrier to learning semantic methods in evolutionary biology (
At present, composing semantic descriptions involves the addition of many new terms to ontologies, due to ontology incompleteness. In this study, we had to add 152 terms to AISM and COLAO to cover all the necessary morphological terminology for complete species descriptions. Eventually, we expect this task will diminish as usage of ontologies increases and they become saturated with terms.
Significant time is spent thinking about how to code particular traits semantically. Despite the establishment of the necessary protocols (
To demonstrate the queryability of the descriptions, we conducted semantic queries as a proof of concept. Currently, such queries can be used to retrieve phenotypic information semi-manually for analysis and comparison across species. The process involves creating a query targeting specific traits and applying it to a set of semantic phenotypes. However, automatic comparison, such as identifying common or different traits between species, is not feasible with current methods and remains a topic for future research.
The conversion of semantic descriptions to natural language is not trivial. Our study encountered difficulties in accurately translating certain character patterns, underscoring the need for improved methods in this area. It is also essential to develop new methods for post-processing and analysing semantic descriptions. Particularly in taxonomy, innovative approaches for species diagnosis and comparison would be highly beneficial.
A nanopublication is a concept in scientific data management that is particularly relevant in the context of big data and FAIR principles (
Although the concept of nanopublications is still emerging, it promises to revolutionise information sharing and analysis (
Currently, nanopublications created using the nanodash service are not subject to peer review. Thus, authors are encouraged to take full responsibility for the content. However, the nanodash service does distinguish peer-reviewed from non-peer-reviewed nanopublications. Moreover, its integration with BDJ facilitates the direct integration of nanopublications into conventional academic publications, where they do undergo the peer-review process.
As semantic phenotypes and nanopublications have technological similarities and both aim to be computable and FAIR, integrating them into one framework would be beneficial. In our research, we generated both semantic phenotypes and nanopublications. At the moment, these datasets are not integrated: they are stored separately rather than in a unified semantic graph or triple store. As a result of this separation, we are unable to simultaneously query species traits and the data from nanopublications. Additional methodological advancements are therefore needed to achieve effective integration. Data from this study can be used to explore and develop new methods for such an integration.
Semantic phenotypes offer a significant improvement in the generation, analysis and sharing of taxonomic data, marking a substantial move towards FAIR phenotypes and computable information. To fully integrate semantic phenotypes and descriptions into standard practices in taxonomy and biology, further advancements in computational methods are needed, along with the development of platforms for managing semantic phenotypes and active engagement from the scientific community.
We express our gratitude to Tobias Kuhn (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands) and Lyubomir Penev (Pensoft Publishers, Sofia, Bulgaria) for their support and help with the creation of nanopublications; to István Mikó (University of New Hampshire, New Hampshire) for valuable comments on the definition of terms from ontologies. GM thanks all the people at the Finnish Museum of Natural History (Helsinki, Finland) and at the University of Padova (Padova, Italy) who helped him with his MSc thesis of which this article is partially a continuation. We also thank the Editor and the Reviewer for their constructive comments on the draft.