Biodiversity Data Journal: Short Communications
Corresponding author: Brian J. Stucky (stucky.brian@gmail.com)
Academic editor: Vincent Smith
Received: 23 Jan 2019 | Accepted: 28 Feb 2019 | Published: 13 Mar 2019
© 2019 Brian Stucky, James Balhoff, Narayani Barve, Vijay Barve, Laura Brenskelle, Matthew Brush, Gregory Dahlem, James Gilbert, Akito Kawahara, Oliver Keller, Andrea Lucky, Peter Mayhew, David Plotkin, Katja Seltmann, Elijah Talamas, Gaurav Vaidya, Ramona Walls, Matt Yoder, Guanyang Zhang, Rob Guralnick
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Stucky B, Balhoff J, Barve N, Barve V, Brenskelle L, Brush M, Dahlem G, Gilbert J, Kawahara A, Keller O, Lucky A, Mayhew P, Plotkin D, Seltmann K, Talamas E, Vaidya G, Walls R, Yoder M, Zhang G, Guralnick R (2019) Developing a vocabulary and ontology for modeling insect natural history data: example data, use cases, and competency questions. Biodiversity Data Journal 7: e33303. https://doi.org/10.3897/BDJ.7.e33303
Insects are possibly the most taxonomically and ecologically diverse class of multicellular organisms on Earth. Consequently, they provide nearly unlimited opportunities to develop and test ecological and evolutionary hypotheses. Currently, however, large-scale studies of insect ecology, behavior, and trait evolution are impeded by the difficulty in obtaining and analyzing data derived from natural history observations of insects. These data are typically highly heterogeneous and widely scattered among many sources, which makes developing robust information systems to aggregate and disseminate them a significant challenge. As a step towards this goal, we report initial results of a new effort to develop a standardized vocabulary and ontology for insect natural history data. In particular, we describe a new database of representative insect natural history data derived from multiple sources (but focused on data from specimens in biological collections), an analysis of the abstract conceptual areas required for a comprehensive ontology of insect natural history data, and a database of use cases and competency questions to guide the development of data systems for insect natural history data. We also discuss data modeling and technology-related challenges that must be overcome to implement robust integration of insect natural history data.
insects, natural history, biodiversity informatics, ontology, data modeling
Insects are possibly the most diverse class of multicellular organisms on Earth, not only in sheer number of species, but also in terms of ecological diversity (
Currently, data about the natural history of insects are widely scattered among a multitude of sources, including labels on specimens in biological collections, specialized (and often obscure) publications, field notebooks, and taxon-specific databases. Thus, finding relevant natural history data for a given insect species can be a daunting task. Furthermore, insect natural history data are highly heterogeneous. For example, they commonly differ in observational methodology (e.g., observations in the field versus in the lab), observational detail (e.g., differences in temporal resolution or certainty of biotic associations), or in the terminology used by the observers. Aggregating these data so that they can be analyzed and disseminated efficiently, without information loss, is a major informatics challenge.
A critical step towards meeting this challenge is developing comprehensive standards to guide the design and implementation of data systems for aggregating insect natural history data. To support robust data integration, these standards need to include two major components: first, a well-defined vocabulary of natural history terms that is suitable for recording natural history observations across all insect taxa and, second, an ontology that provides computable semantics for the vocabulary so that computers can understand how the terms in the vocabulary relate to one another (ontologies are described in the next section). Such data standards can have a major impact on large-scale biodiversity science, as exemplified by the success of the "Darwin Core" vocabulary for aggregating and exchanging species occurrence data (
Here, we report initial results of a new effort to develop a standardized vocabulary and ontology for insect natural history data, an effort that was initiated at a three-day workshop, held at the University of Florida from 30 May to 1 June 2018, that convened entomologists, computer scientists, and data modelers. Although work on a draft ontology is still in progress, in this short communication we describe several key results of our work so far that are likely to be of broader interest, including an analysis of high-level ontology concept areas, a conceptually comprehensive database of example insect natural history data, and a database of ontology use cases and ontology competency questions.
To make this work tractable, we have mostly focused on natural history information from specimens in collections, with taxonomic scope limited to the five mega-diverse insect orders (Hemiptera, Coleoptera, Diptera, Lepidoptera, Hymenoptera), which include the vast majority of insect species and ecological diversity (
Before turning to discussion of our vocabulary and ontology development efforts, we recognize that many readers might have little experience with ontologies, so we briefly introduce ontologies and why they are important for integrating natural history data.
An ontology, as the term is used in computer and information science, is an explicit, precise, machine-interpretable conceptualization of some knowledge domain. Although we do not have space in this manuscript to provide a detailed introduction to ontologies, we will try to provide some intuition by way of a simple example. Suppose we have two natural history observations: observation 1 asserts that an individual of species A was a parasitoid of an individual of species B, and observation 2 asserts that an individual of species C was a predator of an individual of species B (Fig.
Given observations 1 and 2, a human biologist can easily infer that species A and C are both known to feed on species B, but a computer does not automatically understand that “parasitoid of” and “predator of” both imply trophic relationships. With an ontology, we can provide formal logic statements, called axioms, that allow a computer to make this inference. To continue with the example, we could write axioms that assert that the relationships “parasitoid of” and “predator of” are both special cases of a more general relationship called “feeds on” Fig.
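The inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an ontology implementation: the dictionary `SUBPROPERTY_OF`, the tuple layout of the observations, and the function names are all hypothetical, and a real system would express the axioms in an ontology language and rely on a reasoner.

```python
# Hypothetical axioms: each relationship maps to the more general
# relationship of which it is a special case.
SUBPROPERTY_OF = {
    "parasitoid of": "feeds on",
    "predator of": "feeds on",
}

# The two example observations as (subject, relationship, object) statements.
OBSERVATIONS = [
    ("species A", "parasitoid of", "species B"),
    ("species C", "predator of", "species B"),
]


def generalizations(relationship):
    """Yield the relationship itself, then each more general relationship."""
    while relationship is not None:
        yield relationship
        relationship = SUBPROPERTY_OF.get(relationship)


def infer(observations):
    """Return the stated observations plus every statement the axioms imply."""
    inferred = set()
    for subject, relationship, obj in observations:
        for general in generalizations(relationship):
            inferred.add((subject, general, obj))
    return inferred


def who_feeds_on(target, observations):
    """List every subject that is stated or inferred to feed on the target."""
    return sorted(
        subject
        for subject, relationship, obj in infer(observations)
        if relationship == "feeds on" and obj == target
    )


print(who_feeds_on("species B", OBSERVATIONS))  # ['species A', 'species C']
```

Neither observation mentions "feeds on" directly; the query succeeds only because the two axioms let the program derive the more general statements.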
With only two observations and a few vocabulary terms, this might seem like a trivial accomplishment, but when we have hundreds, thousands, or even millions of heterogeneous natural history observations, with hundreds of logical relationships among the terms in a large vocabulary, ontologies make it possible to automate complex data integration and querying tasks that would be practically impossible for a human. Thus, ontologies are critical to any effort to develop robust systems for aggregating insect natural history data. Furthermore, although this brief discussion has focused on the value of ontologies for data aggregators and users, ontologies are also beneficial for data creators and providers because they provide a standardized vocabulary that, once adopted, makes an individual's or organization's data immediately interoperable with similar data from other sources. This, in turn, makes the data more likely to be used (and cited) by other researchers. For readers who wish to learn more about data modeling with ontologies,
We now return to discussion of the ontology design and development work initiated at the workshop, which has been organized around four major tasks: 1) assembly of example data; 2) analysis of example data and ontology scoping; 3) high-level ontology design and concept identification; and 4) identifying use cases (and users) and authoring ontology competency questions. We briefly describe each of these tasks and present the results of our work so far.
Insect natural history is an extremely broad domain, which means that identifying an appropriate scope for a new data vocabulary and ontology is not a simple task. Our approach to this problem was to assemble example natural history data, drawn from real data sources, for each of the five major insect orders. This served two purposes. First, examining a well-drawn set of example data is a practical method for delimiting the scope of a new vocabulary and ontology, and second, a good example dataset also provides valuable test cases for use during vocabulary and ontology development.
To generate the example dataset, we worked in five small groups. Each group was assigned one of the five major insect orders, and we ensured that each group included at least one entomologist with expertise in the assigned order. Then, each group gathered example natural history data for their insect order, with the goal of compiling a concise dataset that represented the various kinds of natural history information recorded on specimen labels for each major insect order. We attempted to capture both the breadth of biological information and the range of observational detail found in label data. Although we focused on information from insect specimen labels, we also included some data from literature sources and online databases such as iNaturalist (https://www.inaturalist.org) and GloBI (
Our final dataset includes 189 natural history observations covering a wide range of concepts and observation types (see next section). We expect that this dataset will have value to other researchers as well, so we have included it with this manuscript as two supplemental files, with one file formatted as a PDF document (Suppl. material
After assembling the example data, we used them to delimit the high-level scope of the new vocabulary and ontology. Again working in small groups, we analyzed the kinds of information contained in the example data, with each group focusing on one of the five major insect orders. For each order, we summarized the kinds of biological information that were observed (e.g., various multi-organism interactions, developmental data) and the ways in which the information was recorded (e.g., qualitative or quantitative). Then, we reconvened as a large group, each small group reported their findings, and we synthesized the results to arrive at a set of 10 high-level conceptual areas required for the final ontology (Table
Ten top-level conceptual areas required for a comprehensive ontology of insect natural history data. "Relevant extant ontologies" are existing ontologies that provide at least partial coverage of the concepts in a given conceptual area. All of the ontologies mentioned here are part of the Open Biological and Biomedical Ontology (OBO) Foundry (
| Conceptual area | Description | Relevant extant ontologies |
| Observations and observing processes | Observations of insect natural history and the processes that generate them, including information about the observers (whether human or machine) and where and when observations are made. | Biological Collections Ontology [1,2] |
| Relationships and interactions | Behaviors that involve interactions among organisms. Includes pairwise interactions (e.g., mating or herbivory) and multi-way interactions (e.g., cooperative colony defense or ants defending aphids from a potential predator). | Gene Ontology [3,4], Relations Ontology [5] |
| Single-organism behaviors | Behaviors that do not necessarily involve interactions with other organisms (e.g., perching or locomotion). | Neurobehavior Ontology [6] |
| Ontogeny | Developmental information (e.g., instar number or length of larval stage). | Gene Ontology [3,4], Uberon [7] |
| Organism products and traces | Non-living objects or artifacts generated by insects (e.g., nests or leaf mines). | |
| Habitat, locality, and substrates | The physical context in which an organism is found, at all scales (e.g., a geopolitical boundary or a specific microhabitat). | Environment Ontology [8,9], GAZ [10] |
| Positional and spatial information | Information about the location of an organism relative to some other object or reference point (e.g., underneath the bark of a log, the south side of a rock). | Biological Spatial Ontology [11], Relations Ontology [5] |
| Weather and climate | Information about weather conditions or climate (e.g., momentary or long-term observations of temperature or precipitation) at any spatial scale. | |
| Collecting methods | The methods used to obtain specimens or individuals for observation (e.g., sweep netting or pitfall trapping) and information about how those methods are implemented. | Biological Collections Ontology [1,2] |
| Curation | Information about how specimens or other artifacts are managed (e.g., where they are housed and how they are preserved). | Biological Collections Ontology [1,2] |
Together, these conceptual areas cover virtually all of the kinds of information contained in the example data we assembled, and we therefore propose that an ontology that provides suitable coverage of all 10 of these areas will be sufficient for modeling nearly all insect natural history data from specimen labels as well as a substantial proportion of insect natural history data from other sources, including literature-based data. This conclusion is dependent, of course, on the extent to which our example data capture the conceptual breadth and depth of all available insect natural history information. Although we were not able to formally evaluate this, given the collective entomological expertise of the workshop participants (many of whom have years of experience examining specimens and labels from entomology collections around the world) and the effort spent compiling example data, we are confident that we at least came close to achieving this goal for natural history data from insect specimen labels.
We also note that several of these conceptual areas overlap with the domains of extant ontologies, and in Table 1, we list the ontologies that are most relevant to each conceptual area. To ensure broad compatibility, reusability, and extensibility, we plan to use existing ontological resources wherever possible and contribute (or suggest) new entities for extant ontologies, when appropriate.
Of the 10 conceptual areas we identified (Table
This initial design work revealed several critical data modeling challenges, the thorniest of which is the problem of recording metadata about natural history observations that include interactions between organisms. Such observations are common in natural history data and include, for example, observations about feeding relationships, parasite/host relationships, courtship, and many more. As with any other natural history observation, it is important to be able to record metadata about interaction observations, such as who made the observations, when they occurred, and so on. Without plunging into too much technical detail, the central problem is that the technology most often used for implementing ontology-enabled data, the Resource Description Framework (RDF,
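To make the difficulty concrete, the following Python sketch contrasts two ways of recording an interaction. It rests on one well-known property of RDF: each statement is a triple of subject, predicate, and object, so a direct triple linking two organisms has no slot for metadata about the interaction itself. One widely used workaround, shown here only as an illustration and not necessarily the approach that will be adopted, is to represent the interaction as its own node. All identifiers and property names (`has agent`, `has target`, `interaction type`, and the metadata keys) are hypothetical.

```python
def direct_triple(agent, relationship, target):
    """Record an interaction as a single (subject, predicate, object) triple.

    Nothing in the triple identifies the interaction event itself, so there
    is nowhere to attach who observed it or when.
    """
    return [(agent, relationship, target)]


def interaction_as_node(node_id, agent, relationship, target, **metadata):
    """Record an interaction as its own node, described by several triples.

    Because the interaction now has an identifier, metadata about it can be
    stated as ordinary triples with that identifier as the subject.
    """
    triples = [
        (node_id, "interaction type", relationship),  # hypothetical property
        (node_id, "has agent", agent),                # hypothetical property
        (node_id, "has target", target),              # hypothetical property
    ]
    for key, value in metadata.items():
        triples.append((node_id, key.replace("_", " "), value))
    return triples


# Hypothetical example data, for illustration only.
simple = direct_triple("wasp 1", "parasitoid of", "caterpillar 1")
detailed = interaction_as_node(
    "interaction 1",
    "wasp 1",
    "parasitoid of",
    "caterpillar 1",
    observed_by="observer 1",
    observed_on="2018-06-01",
)

for triple in detailed:
    print(triple)
```

The trade-off is visible in the output: the node-based form can carry arbitrary metadata, but the simple statement that one organism was a parasitoid of another is no longer a single triple and must be reassembled by queries or inference.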
A second important data modeling problem is the challenge of accurately capturing information about what organisms were observed, which means dealing with the myriad difficulties posed by the use of taxonomic names (
The last major task of our preliminary design and development work was drafting detailed ontology competency questions and identifying potential users and use cases. Ontology competency questions (OCQs,
To identify use cases and develop OCQs, we divided into three groups on the last day of the workshop, with each group working independently and recording their results. After the workshop, one of us (BJS) synthesized the results of each group’s efforts into a single, comprehensive set of use cases and OCQs. The use cases we identified cover seven main user groups or domains:
The full sets of use cases and OCQs are too large to report in the main text, so we instead provide them in Suppl. material
With the work and results reported in this paper, we have laid a foundation for ongoing efforts to design, develop, and implement a robust vocabulary and ontology for modeling insect natural history data. Our next immediate goals are to identify the best solution for dealing with the problem of interactions metadata, discussed above, and to produce and release a draft ontology implementation for public review. We welcome additional participants in these efforts; readers who would like to be involved should contact the corresponding author (BJS). In the meantime, we hope that the foundational work reported in this paper, including the comprehensive example dataset and OCQs, will prove useful to other researchers interested in the informatics challenges surrounding insect natural history data.
This work was supported by the National Science Foundation Postdoctoral Research Fellowship in Biology under grant no. 1612335 to BJS, a University of Florida Informatics Institute fellowship to BJS, and by an iDigBio (NSF DBI-1547229) workshop grant. We thank C. Bester, K. Love, and other iDigBio personnel who helped coordinate workshop logistics. We also thank four reviewers for their helpful and insightful comments. Publication of this article was funded by the University of Florida Open Access Publishing Fund.