Biodiversity Data Journal
Research Article
Corresponding author: Sophia Ananiadou (sophia.ananiadou@manchester.ac.uk)
Academic editor: Anne Thessen
Received: 08 Sep 2018 | Accepted: 03 Jan 2019 | Published: 22 Jan 2019
© 2019 Nhung Nguyen, Roselyn Gabud, Sophia Ananiadou
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nguyen N, Gabud R, Ananiadou S (2019) COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal 7: e29626. https://doi.org/10.3897/BDJ.7.e29626
Background
Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus—a gold standard corpus that covers a wide range of biodiversity entities.
Results
Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Names Recognition and Discovery (GNRD) tool. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.
Conclusion
The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.
Keywords: Biodiversity, text mining, named entity recognition, species occurrence, gold standard
Biodiversity plays a central role in our daily lives, given its implications on ecological resilience, food security, species and subspecies endangerment and natural sustainability. Research in this domain has recently seen accelerated growth, leading to the "big data" scenario of the biodiversity literature. For instance, the Biodiversity Heritage Library (BHL)*
Text mining can be defined as a process that aims to extract interesting and non-trivial patterns or knowledge from unstructured textual data in document collections (
This work is part of the COPIOUS project*
Most text mining work in the biodiversity domain has focused on discovering species names; tools designed for this purpose include TaxonGrab (
Up until now, there are no existing resources (either corpora or tools) that correspond directly to our area of interest. To address this situation, we have constructed the COPIOUS corpus, a gold standard corpus annotated with five different categories of entities that are relevant to biodiversity: Taxon, Geographical location, Habitat, Person and Temporal expression. The basis for the gold standard corpus was a set of English documents downloaded from the Biodiversity Heritage Library (BHL). We randomly selected 668 documents and asked our annotators to manually mark up the documents based on our guidelines. The average inter-annotator agreement of 78.22% F-score demonstrated that the annotations in our corpus were consistent and reliable.
To demonstrate the utility of the gold standard corpus, we used it to assist in the development of two types of text mining tools necessary for the construction of a biodiversity knowledge repository, i.e. named entity recognition (NER) and relation extraction. We trained two NER tools on the gold standard using two different machine learning approaches, i.e. Conditional Random Fields (CRF) (
For the relation extraction experiment, we aimed to extract relations that can be used to form species occurrence records. These relations include Taxon-Geographical location, Taxon-Habitat, Taxon-Person and Taxon-Temporal expression. Since we do not have any gold standard annotations for these relations, we applied PASMED (
There are two corpora that are annotated with taxon entities similar to our work, i.e. Linnaeus (
Corpus | Document Type | Num. of Documents | Num. of Sentences | Num. of Words | Num. of Mentions |
Linnaeus | PMC full paper | 100 | 17,580 | 502,507 | 4,259 |
S800 | PubMed abstract | 800 | 8,064 | 201,981 | 3,708 |
Although Linnaeus and S800 are useful corpora and are large enough to allow training of a machine learning-based NER, they were developed for the biomedical domain rather than the biodiversity domain. Additionally, most of the annotated scientific names in both corpora are in the format of binomial nomenclature, i.e. names consisting of two parts, genus and species, a format that overlooks other variants of scientific names, e.g. family names, genus names and names that include authority information. We therefore decided to construct a novel gold standard corpus for biodiversity species names, whose annotations cover both variants of scientific names and vernacular names.
Previous work on recognising taxonomic names has mostly used dictionary-based approaches, i.e. text is matched against a predefined dictionary of species names. TaxonGrab (
BiOnym (
NetiNeti (
In addition to species names, extracting locations and habitats of species from literature is also important for domain experts, because such information can help to answer questions such as "What lives here?" or "What is the distribution of this organism?" (
Bacteria Biotope (
ACE 2005 (
The same situation applies to both CoNLL 2003 (
In contrast to previous work in the biodiversity domain, which has focused only on taxon names or on microorganisms and their habitats, and to other work in the general domain, whose annotated entity types only partially overlap with the types of information that are of interest to us, the work described in this article has produced a corpus that is especially designed for the biodiversity domain, including documents relevant to this domain. The corpus has been manually annotated with domain-specific entities belonging to five different semantic categories. These categories were chosen with the specific target of detecting species occurrences from literature.
In this section, we describe in detail how we constructed the gold standard corpus. We present our method of selecting the data, the annotation guidelines and the annotation process.
The source of data for our corpus is the Biodiversity Heritage Library (BHL), an open access library that has digitised millions of pages of legacy literature on biodiversity spanning over 200 years (
As mentioned above, we annotated five categories of entities in our corpus, i.e. taxon names, geographical locations, habitats, temporal expressions and persons. Details of each category are described in the following subsections. It should be noted that, in the examples provided, text enclosed in square brackets should be annotated, while the underlined terms should not be annotated.
Taxon entities are expressions that pertain to members of the taxonomic ranks, e.g. species, genus, family, etc. Specifically, we annotated current and historical scientific names (e.g. [[E. salmonis] Müller, 1784]; [[Salvelinus alpinus] (L.)]). For scientific names that include authorship information, two overlapping entities (with/without the authorship) are annotated, as shown in the examples. In this category, vernacular names of species (e.g. [flying fox], [insectivorous bats]) are also marked up. However, vernacular names of taxonomic classes for general species, i.e. general names such as fish, birds, mammals, reptiles, amphibians, animals and plants, were not tagged as taxon entities. For example, "birds" in "A few birds seem to range widely from ..." and "amphibian" in "… a list of amphibian known from South Gigante Island" are excluded from annotation. In contrast to the Bacteria Biotope corpus (
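To make the overlapping annotation convention concrete, the following minimal sketch represents the two nested taxon spans as simple character-offset tuples and checks them against the sentence text. The offsets and the tuple-based representation are illustrative assumptions; the corpus itself may be distributed in a different format.

    # Minimal sketch: two overlapping Taxon annotations over one sentence,
    # represented as (start, end, label) tuples (hypothetical offsets).
    sentence = "Salvelinus alpinus (L.) is recorded from several mountain lakes."

    annotations = [
        (0, 18, "Taxon"),   # scientific name without authorship: "Salvelinus alpinus"
        (0, 23, "Taxon"),   # full name including authorship: "Salvelinus alpinus (L.)"
    ]

    for start, end, label in annotations:
        print(label, start, end, repr(sentence[start:end]))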
Mentions of geographical locations, i.e. any identifiable point or area on the planet, ranging from continents, major bodies of water (e.g. oceans, rivers, lakes), named landforms, countries, states, cities and towns, were marked up as geographical location entities. These mentions include not only Philippine geographical locations but also locations elsewhere in the world. Similarly to several corpora annotated with geographical entities, e.g. MUC-7 (
Habitat entities are mentions of environments in which organisms live. These are textual expressions describing natural environments, e.g. [Lowland forest], [coconut groves] and [banana plantations], as well as places where ectoparasites or epiphytes reside, e.g. "… parasitic on [Achillea holosericea]". It should be noted that informative modifiers, i.e. those which provide information in terms of composition, altitude or weather conditions, should be included in text spans, e.g. [subalpine calcareous pastures] or [rocky slopes]. Since microorganisms are excluded from the current annotation effort, their habitats, such as diseases, symptoms, experimental materials and methods, are excluded too. Other exclusions from annotations of habitats are (1) habitat attributes, i.e. altitude, depth, elevation or area, e.g. "In the [mossy forest], altitude about" and "... [second-growth forests] at elevations from ..."; (2) habitat attribute values, i.e. descriptive references containing numerical values to indicate habitat attributes, e.g. 12-29 fathoms or 520 metres. We also excluded modifiers that convey information within the context of a geographic location but not on their own, e.g. the western [slopes], and adverbs or prepositions that precede the habitat, e.g. under [logs] or [rocks]. Similarly to Geographical Location entities, each item within enumerations of habitat descriptions was tagged separately, i.e. coordinating words and characters, e.g. and, or, and commas, were excluded from the annotation.
We labelled proper nouns pertaining to person names, including generational suffixes (e.g. Jr and Jnr), used in the context of an occurrence or a historical account (e.g. "In 1905, [Tattersall] follows [Milne Edwards] in..."). Person names in citations that convey observations related to a species were marked up, e.g. "In the East China Sea, [Koto] et al. (1959) report that sailfish migrate northward...". However, we did not label them if they were not related to any observations, e.g. "These three genera included the main component species ... (Inoue & Hamid, 1994; LaFrankie et al., 1995)". Names of persons that appear as parts of taxon names (e.g. Scolopsis bulanensis Evermann & Seale) were not tagged. Titles (e.g. Dr. [Waring] recommends ...) and characters which are not part of the name but appear in the same token (e.g. Dr. [Johnston]'s findings) were also excluded from the annotation span. Unlike MUC-7 (
We annotated spans of text pertaining to points in time as temporal expressions. These expressions can be any mention of a specific date (e.g. [10 June 2013]), month or year (e.g. from [March] to [November]), decade (e.g. in the [1920s]), a regular occurrence (e.g. seasons) or a geochronological age (e.g. during the [late Pleistocene]). In contrast to temporal expressions in MUC-7, we did not mark up mentions pertaining to time-of-the-day information, e.g. "Specimens were found between 19:40 and 20:10". Similarly to Person entities, if temporal expressions in citations conveyed species observations, we annotated them. However, if they did not convey such observations, we did not annotate them, e.g. "In the East China Sea, Koto et al. ([1959]) report that sailfish migrate northward...". Expressions used as part of a taxonomic name's authority (e.g. Emesopsis infenestra Tatarnic, Wall & Cassis, 2011) were not tagged as temporal expressions. Characters and coordinating words used to indicate a range (e.g. the words "from" and "to" in the previous example) were also excluded from the tagged span of text.
The detailed guidelines, with further instructions and more examples, are provided in Suppl. material
We firstly recruited two annotators with expertise in biology: one a master's student and the other a graduate with a BSc. We then conducted a two-stage annotation process. In the first stage, we randomly selected 200 documents for double annotation by the two annotators. During this stage, the annotators were encouraged to provide us with feedback or comments to improve the annotation guidelines. We iteratively revised the guidelines and the annotations until we obtained an acceptable inter-annotator agreement between the two annotators. In the second stage, each annotator was assigned a separate portion of the 468 remaining documents to annotate.
In order to support the annotators, we utilised Argo, a workbench for building text mining solutions (
In this section, we present details of the COPIOUS corpus and the results of NER experiments applied to the corpus. We conducted two different NER experiments. Firstly, we trained NER tools using CRF (
During the first stage of the annotation process, we calculated the inter-annotator agreement (IAA) between the two annotators using F-scores. We applied 'strict' matching criteria, which means that the two annotators were considered to agree on a named entity only if they tagged exactly the same span and assigned the same entity category. Table
Inter-annotator agreement on different named entity categories over 200 doubly-annotated documents. The categories are arranged in descending order of agreement.
Category | Precision | Recall | F-score |
Geographical Location | 94.32 | 94.89 | 94.60 |
Person | 88.93 | 91.76 | 90.33 |
Temporal Expression | 86.59 | 87.25 | 86.92 |
Taxon | 81.09 | 83.87 | 82.45 |
Habitat | 45.85 | 48.36 | 47.07 |
Overall | 82.09 | 81.62 | 81.86 |
The level of agreement between the two annotators was high for most entity types, except for Habitat. Aside from usual human errors, e.g. confusions between Geographical Location and Habitat entities, the disagreements between the two annotators on Habitat entities were often due to the specificity of the expressions in this category. Specifically, one annotator tagged more general habitat terms while the other tagged longer and more descriptive terms. Examples include "extensive [forests]" vs "[extensive forests]" and "margins of [primitive forests]" vs "[margins of primitive forests]". Another reason is the inclusion of adverbs or prepositions between two general habitat terms that may pertain to a more specific habitat description. In such cases, one annotator tended to exclude the adverbs or prepositions and tagged two separate habitat terms, while the other annotator included the prepositions and tagged only a single habitat term. For instance, "[primary forest] on [hilly ground]" vs "[primary forest on hilly ground]" and "[damp ravines] at [low altitudes]" vs "[damp ravines at low altitudes]". Although it was mentioned in the guidelines that coordinating words should be excluded, one of the annotators sometimes made mistakes, e.g. by annotating "[hilly or steep localities]" instead of "[hilly] or [steep localities]".
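As an illustration of the strict matching criterion described above, the sketch below treats one annotator's spans as the reference and the other's as predictions, counting only exact (document, span, category) matches. This is an assumed implementation for illustration, not the authors' actual evaluation script.

    # Minimal sketch: strict-match inter-annotator agreement as an F-score.
    # Each annotation is (doc_id, start, end, category); only exact matches count.
    def strict_f_score(annotator_a, annotator_b):
        a, b = set(annotator_a), set(annotator_b)
        matched = len(a & b)
        precision = matched / len(b) if b else 0.0
        recall = matched / len(a) if a else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    a = [("doc1", 0, 18, "Taxon"), ("doc1", 30, 44, "Habitat")]
    b = [("doc1", 0, 18, "Taxon"), ("doc1", 27, 44, "Habitat")]  # span disagreement on Habitat
    print(strict_f_score(a, b))  # (0.5, 0.5, 0.5)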
After the double annotation stage, we asked each annotator to label their own individual sets of documents. As a result, our gold standard consists of 668 documents. The numbers of sentences, words and entities of the corpus are presented in Table
Statistics of the gold standard corpus. The categories are arranged in descending order of the number of instances.
Number of documents | 668 | |
Number of sentences | 26,277 | |
Number of words | 33,475 | |
Number of entities | Taxon | 12,227 |
Geographical Location | 9,921 | |
Person | 2,889 | |
Temporal Expression | 2,210 | |
Habitat | 1,554 |
The gold standard is publicly available at: http://nactem.ac.uk/copious/copious_published.zip.
We randomly divided the annotated corpus into three different sets: (1) the training set with 80% of the data (543 documents), (2) the development set with 10% (67 documents) and (3) the test set with the remaining 10% (67 documents). This division is provided in Suppl. material
Category | Train | Dev | Test |
Taxon | 9,357 | 1,548 | 1,322 |
Geographical Location | 8,121 | 992 | 878 |
Person | 2,479 | 180 | 230 |
Temporal Expression | 1,800 | 157 | 253 |
Habitat | 1,308 | 91 | 115 |
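The division into training, development and test sets was performed at the document level, as described above. The sketch below shows one way such a random 80/10/10 split could be produced; the document identifiers and the fixed seed are illustrative assumptions, and the exact division used in the experiments is the one provided as supplementary material.

    import random

    # Minimal sketch of an 80/10/10 document-level split (illustrative only).
    doc_ids = [f"doc_{i:03d}" for i in range(668)]
    random.seed(42)          # fixed seed so the split is reproducible
    random.shuffle(doc_ids)

    n_train = int(0.8 * len(doc_ids))
    n_dev = int(0.1 * len(doc_ids))
    train = doc_ids[:n_train]                # ~80% of documents
    dev = doc_ids[n_train:n_train + n_dev]   # ~10% of documents
    test = doc_ids[n_train + n_dev:]         # remaining ~10%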
We trained both CRF and BiLSTM models by using the training set, tuned the models using the development set and evaluated their performance using the test set. To train the CRF model, we used NERSuite (
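For readers unfamiliar with CRF-based sequence tagging, the sketch below shows a small CRF tagger built with the sklearn-crfsuite library. This is only an approximation of the setup described above: the actual experiments used NERSuite, and the toy features and training sentence here are illustrative assumptions rather than the paper's feature set or data.

    import sklearn_crfsuite

    # Minimal sketch of a CRF sequence tagger with lexical and POS features.
    def token_features(sent, i):
        word, pos = sent[i]
        feats = {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.isdigit": word.isdigit(),
            "pos": pos,
        }
        if i > 0:
            feats["-1:word.lower"] = sent[i - 1][0].lower()
        if i < len(sent) - 1:
            feats["+1:word.lower"] = sent[i + 1][0].lower()
        return feats

    # One toy training sentence: (token, POS) pairs with BIO labels.
    sent = [("Salvelinus", "NNP"), ("alpinus", "NNP"), ("inhabits", "VBZ"),
            ("mountain", "NN"), ("lakes", "NNS")]
    labels = ["B-Taxon", "I-Taxon", "O", "B-Habitat", "I-Habitat"]

    X = [[token_features(sent, i) for i in range(len(sent))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))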
Performance of CRF and BiLSTM on the testing set. The categories are arranged in descending order of F-score for each type of model.
Model | Category | Precision | Recall | F-score |
CRF | Geographical Location | 82.35 | 83.49 | 82.92 |
Taxon | 75.27 | 62.40 | 68.23 | |
Temporal Expression | 77.19 | 52.17 | 62.26 | |
Person | 72.82 | 43.10 | 54.15 | |
Habitat | 63.55 | 44.16 | 52.11 | |
Overall | 77.67 | 66.29 | 71.53 | |
Bi-LSTM | Geographical Location | 85.05 | 85.63 | 85.34 |
Taxon | 77.42 | 69.67 | 73.34 | |
Habitat | 64.10 | 64.94 | 64.52 | |
Temporal Expression | 70.67 | 54.36 | 61.45 | |
Person | 58.92 | 48.44 | 53.17 | |
Overall | 77.49 | 71.89 | 74.58 |
Amongst the five categories, CRF performed the worst on Habitat entities, with an F-score of 52.11%. This is expected, as the number of Habitat entities is the lowest amongst all the categories (as shown in Table
The second lowest F-score for the CRF model is 54.15%, for Person entities. Identifying Person names was challenging for a number of reasons. Firstly, they can sometimes be part of a Taxon name, leading to confusion between Person and Taxon entities. We observed that Person entities with abbreviations, i.e. containing a comma or full stop within the text, e.g. Alonzo, S., Apostolaki, P,E., were sometimes predicted as part of a Taxon name. Determining whether a Person name forms part of a citation could also be confusing. The model also failed to recognise some instances of Person names that are followed by a year in parentheses and that pertain to an actual observation, e.g. "[Voss] (1953) believe that there be a population of sailfish present". Furthermore, Person names that were spelled in all uppercase were not identified by the model. The general performance of the CRF model over all five entity types was acceptable, with an F-score of 71.53%.
Regarding the BiLSTM approach, the performance on Person entities was surprisingly low. Similarly to the CRF model, the BiLSTM model often tagged person names appearing in citations and within species names, even though these mentions should be excluded. For example, BiLSTM labelled "Schepman" in "... a foreign journal (Schepman, 1907)" as a Person entity, which is not correct according to our annotation scheme. Another reason for the low performance is that the model sometimes confused Person and Geographical Location entities. For instance, "Charles Glass" in "... have been received from Charles Glass of Santa Barbara ..." should be a Person name, while the model tagged the whole phrase "Charles Glass of Santa Barbara" as a Geographical Location entity. In contrast, "Ringim Mukr" in "... Ringim Mukr, 2500 ft., flowers bright pink ..." should be a Geographical Location entity, but the BiLSTM tagged it as a Person.
Although the BiLSTM model obtained higher scores than those achieved by the CRF model for the majority of categories, the overall performance of the two models was not significantly different. It can be seen that the BiLSTM had wider coverage, i.e. higher recall, for all categories, but in some cases, e.g. Person and Temporal Expression, it was less precise than the CRF model. This can be explained by the fact that the BiLSTM only used word vectors as input features, while the CRF model used additional features, namely POS and chunk tags. However, the fact that both types of models obtained good results demonstrates that our corpus has potentially wide utility for developing NER tools for biodiversity.
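To illustrate the point that the BiLSTM tagger consumed only word vectors, the sketch below shows a minimal bidirectional LSTM tagger in PyTorch whose sole input is a word embedding layer. The architecture, dimensions and label count (five entity types in a BIO scheme plus O) are assumptions for illustration, not the hyperparameters used in the paper.

    import torch
    import torch.nn as nn

    # Minimal sketch of a BiLSTM tagger that uses only word embeddings as input.
    class BiLSTMTagger(nn.Module):
        def __init__(self, vocab_size, num_labels, emb_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, num_labels)

        def forward(self, token_ids):              # token_ids: (batch, seq_len)
            h, _ = self.lstm(self.embed(token_ids))
            return self.out(h)                     # (batch, seq_len, num_labels)

    model = BiLSTMTagger(vocab_size=10000, num_labels=11)  # 5 entity types in BIO + O
    tokens = torch.randint(0, 10000, (1, 7))               # one dummy sentence of 7 tokens
    predictions = model(tokens).argmax(dim=-1)
    print(predictions.shape)  # torch.Size([1, 7])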
To the best of our knowledge, there is no available tool that can automatically detect all of the above-mentioned categories of named entities in biodiversity texts. Rather, the only other relevant tools that are currently available are those that can detect taxon names (
The results reported in Table
Performance of different NER tools on Taxon entities in the COPIOUS corpus test set. In this table, we report the performance of the BiLSTM model, our best-performing model for taxon names.
Tool | Precision | Recall | F-score |
Our NER (BiLSTM) | 77.42 | 69.67 | 73.34 |
GNRD | 77.61 | 54.02 | 63.70 |
SPECIES Tagger | 86.79 | 4.51 | 8.57 |
Since the SPECIES tagger detected species names based on the NCBI Taxonomy (
In terms of F-scores, the model trained on our gold standard attained better performance than both GNRD and the SPECIES tagger.
Occurrence data and species distribution play an important role in monitoring as well as preserving biodiversity (
To this end, we firstly consulted our domain experts to define a schema for relations between entities for species occurrence records. The schema specifically describes two types of relations: occur and seen_by. Occur relations pertain to occurrence records of species, i.e. Taxon, in specific Geographical Locations or Habitats or at a point in time, i.e. Temporal Expression. Meanwhile, seen_by relations denote observations of specific species by a specific Person. Consequently, we attempted to identify four binary relations between Taxon entities and the other types of entities, as shown in Fig.
Since PASMED extracts relations based on predicate-argument structures, we firstly applied the Enju parser (
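As a simplified illustration of how Taxon entities can be paired with the other entity types under this schema, the sketch below generates candidate occur and seen_by relations from the entities found in a single sentence. This plain co-occurrence pairing is an assumption made for illustration; the actual system relied on PASMED over Enju predicate-argument structures rather than co-occurrence alone.

    # Minimal sketch: sentence-level pairing of a Taxon with the other entity types.
    def candidate_relations(entities):
        """entities: list of (text, category) mentions found in one sentence."""
        taxa = [e for e in entities if e[1] == "Taxon"]
        others = [e for e in entities if e[1] != "Taxon"]
        relations = []
        for taxon in taxa:
            for other in others:
                rel_type = "seen_by" if other[1] == "Person" else "occur"
                relations.append((rel_type, taxon[0], other[0]))
        return relations

    sentence_entities = [
        ("Salvelinus alpinus", "Taxon"),
        ("mountain lakes", "Habitat"),
        ("Luzon", "Geographical Location"),
    ]
    print(candidate_relations(sentence_entities))
    # [('occur', 'Salvelinus alpinus', 'mountain lakes'),
    #  ('occur', 'Salvelinus alpinus', 'Luzon')]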
Extracting species occurrences from text would be an initial step towards developing a semi-automatic system that can complement the primary data of species occurrences with those described in literature. A potential system would consist of three steps. The first step is to ask domain experts to verify the extracted species observations. The second step is to normalise taxon names, geographical locations and habitats. Finally, we can straightforwardly convert the normalised information into Darwin Core Standard (
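As an illustration of the final conversion step, the sketch below maps one hypothetical normalised occurrence onto standard Darwin Core occurrence terms (scientificName, locality, habitat, eventDate, recordedBy) and writes it as a CSV row. The example record and the choice of exactly these fields are assumptions; they are not output of the paper's pipeline.

    import csv
    import sys

    # Minimal sketch: one hypothetical extracted occurrence mapped to Darwin Core terms.
    occurrence = {
        "taxon": "Salvelinus alpinus",
        "location": "Luzon",
        "habitat": "mountain lakes",
        "date": "June 1913",
        "observer": "Tattersall",
    }

    dwc_row = {
        "scientificName": occurrence["taxon"],
        "locality": occurrence["location"],
        "habitat": occurrence["habitat"],
        "eventDate": occurrence["date"],
        "recordedBy": occurrence["observer"],
    }

    writer = csv.DictWriter(sys.stdout, fieldnames=list(dwc_row.keys()))
    writer.writeheader()
    writer.writerow(dwc_row)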
In this paper, we have described the process of constructing the COPIOUS corpus, which is annotated with five entity categories relevant to the study of biodiversity: taxon names, geographical locations, habitats, temporal expressions and persons. With 668 documents and 28,801 entity annotations, the corpus is sufficiently large for both training and evaluating text mining tools. Our experimental results have demonstrated that the corpus is useful for text mining biodiversity texts in terms of both NER and occurrence extraction.
As future work, we aim to improve the performance of the NER tools, especially for the most problematic categories of Habitat and Person, and then to apply the NER to the whole collection of BHL English pages. This will allow us to produce another semantic layer for BHL documents, in addition to the current layer of annotated scientific names, which should pave the way for an advanced semantic search system over the BHL. Another long-term goal is to extract species occurrence data from the whole BHL collection using the two-step method of occurrence extraction. Although our gold standard was developed specifically for the use case of Philippine species, the corpus is general enough to be employed for the whole BHL. However, beyond the large amount of computation that will be required to do this, there is one further limitation in terms of scaling up the task: BHL documents contain a large number of misspelt words caused by errors from OCR tools, and such errors may adversely affect NER performance. Accordingly, we are investigating the application of OCR correction tools, such as,
We would like to thank Prof. Marilou Nicolas and Dr. Riza Batista-Navarro for their valuable inputs. We sincerely thank our annotators for their hard work and fruitful feedback. We also thank Mr. Paul Thompson for his valuable comments.
Newton Fund Institutional Links
Conserving Philippine Biodiversity by Understanding Big Data (COPIOUS): Integration and analysis of heterogeneous information on Philippine biodiversity
The National Centre for Text Mining, University of Manchester, Manchester, United Kingdom
All authors contributed to the production of this work. SA proposed the idea and supervised all steps of the work. NN and RG constructed the guidelines, trained annotators, calculated the IAA and conducted all experiments. It is noted that NN and RG contributed equally. All authors read and approved the final manuscript.
The authors declare that they have no conflicts of interests.
A .pdf file presenting our guidelines for marking up the five categories of entities. The guidelines provide specific instructions to annotators about the annotation scope and the annotation span of each category. Examples are used to demonstrate these instructions. The guidelines also describe some exceptions that the annotators must follow during the annotation process.
A compressed file containing the three subsets of the corpus used in our named entity recognition experiments: 80% for training, 10% for development and 10% for testing.
Biodiversity Heritage Library. https://www.biodiversitylibrary.org
Conserving Philippine biodiversity by understanding big data (COPIOUS). NaCTeM. http://nactem.ac.uk/copious
Global Biodiversity Information Facility. https://gbif.org
NCBI Taxonomy Database. https://www.ncbi.nlm.nih.gov/taxonomy
Biodiversity Heritage Library API v2 Documentation. http://www.biodiversitylibrary.org/api2/docs/docs.html
ARGO: A Web-based Text Mining Workbench by the National Centre for Text Mining. http://argo.nactem.ac.uk
Global Names Architecture. Global Names Recognition and Discovery. http://gnrd.globalnames.org
Taxon Finder. http://taxonfinder.org