Biodiversity Data Journal :
Data Paper (Biosciences)
Corresponding author: Nora Abdelmageed (nora.abdelmageed@uni-jena.de)
Academic editor: Vincent Smith
Received: 24 Jun 2022 | Accepted: 07 Sep 2022 | Published: 07 Oct 2022
© 2022 Nora Abdelmageed, Felicitas Löffler, Leila Feddoul, Alsayed Algergawy, Sheeba Samuel, Jitendra Gaikwad, Anahita Kazem, Birgitta König-Ries
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Abdelmageed N, Löffler F, Feddoul L, Algergawy A, Samuel S, Gaikwad J, Kazem A, König-Ries B (2022) BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain. Biodiversity Data Journal 10: e89481. https://doi.org/10.3897/BDJ.10.e89481
Biodiversity is the variety of life on Earth, covering evolutionary, ecological, biological and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has resulted in numerous works being published in this field. With this, a large amount of textual data (publications) and metadata (e.g. dataset descriptions) has been generated. To support the management and analysis of these data, two techniques from computer science are of particular interest: Named Entity Recognition (NER) and Relation Extraction (RE). While the former enables better content discovery and understanding, the latter fosters analysis by detecting connections between entities and, thus, allows us to draw conclusions and answer relevant domain-specific questions. To automatically predict entities and their relations, machine-learning and deep-learning techniques can be used. Training and evaluating such techniques requires labelled corpora.
In this paper, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE), generated from biodiversity dataset metadata and abstracts, that can be used as evaluation benchmarks for the development of new computer-supported tools requiring machine-learning or deep-learning techniques. These corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets. Moreover, we describe the underlying ontology of the classes and relations used to annotate the corpora.
entity annotation, relation annotation, Named Entity Recognition (NER), Relation Extraction (RE), Information Extraction (IE), biodiversity research, gold standard
The increasing amount of scientific datasets in public data repositories calls for more intelligent systems that automatically analyse, process, integrate, connect or visualise data. An essential building block in the evolution of such computer-supported analysis tools is Information Extraction with its sub-tasks, Named Entity Recognition (NER) and Relation Extraction (RE). This process aims to automatically identify important terms (entities) and groups of terms/expressions that fall into a certain category in the data (NER), as well as relationships between these entities (RE). However, the advancement of such tools is only feasible if gold standards, i.e. manually labelled test corpora, are available. These support the training of machine-learning approaches and allow an evaluation of the developed tool. For applied domains, such as biodiversity research, gold standards are very rare.
In this work, we present a novel gold standard for biodiversity research. We provide a NER corpus, based on scientific metadata files and abstracts with manual annotations of important terms, such as species (ORGANISM), environmental terms (ENVIRONMENT), data parameters and variables measured (QUALITY), geographic locations (LOCATION), biological, chemical and physical processes (PHENOMENA) and materials (MATTER), for example, chemical compounds. In addition, we provide an RE corpus, based on a portion of the same data that consists of important binary and multi-class relationships amongst entities, such as OCCUR_IN (Organism, Environment), INFLUENCE (Organism, Process) and HAVE/OF (Quality, Environment). We also added these identified entities and relationships to a conceptual model developed in our previous work (
Our contribution is threefold:
We provide the results in formats that allow easy further processing for various Natural Language Processing (NLP) tasks, based on machine-learning and deep learning techniques. The code and the data are publicly available as follows:
Biodiversity research is a sub-domain of the Life Sciences that encompasses the totality and variability of organisms, their morphology and genetics, life history and habitats, and geographical ranges (
Natural Language Processing (NLP), with its sub-task Information Extraction, is a research field that works on such structured data and scientific publications. The aim is to develop systems that automatically identify important terms and phrases in text. This supports scholars in obtaining a quick overview of unfamiliar texts, for example, in search, or allows improved filtering. In the Life Sciences, Information Extraction has a long history (
In the first step, we had to figure out which categories (or entity types) are relevant for biodiversity research. In addition, we also had to explore occurring relations amongst these entities. Therefore, we selected two sources from our previous works: 1) BiodivOnto (
Biodiversity Questions
The biodiversity question corpus consists of 169 questions provided by around 70 scholars of three biodiversity research-related projects (
The identified relevant entity types from this question corpus were aligned with the detected categories of classes from BiodivOnto in several discussion rounds. The final outcome (see Table
Summary of the categories (entity types) used for NER annotation. Explanations are adapted from (
Tag | Explanations | Examples |
---|---|---|
ORGANISM | all individual life forms such as microorganisms, plants, animals | mammal, insect, fungi, bacteria |
PHENOMENA | occurring natural, biological, physical or chemical processes including events | decomposition, colonisation, climate change, deforestation |
MATTER | chemical and biological compounds, and natural elements | carbon, H2O, sediment, sand |
ENVIRONMENT | natural or man-made environments that ORGANISMs live in | groundwater, garden, aquarium, mountain |
QUALITY | data parameters measured or observed, phenotypes and traits | volume, age, structure, morphology |
LOCATION | geographic location (no coordinates) | China, United States |
The main idea of the relation detection process was to come up with a categorisation of relations similar to the categories for noun entities. The detection process was conducted in several rounds. In the first pilot phase, three scholars analysed a small set of questions for the existence of relations. The initial instruction was to manually inspect the questions and to identify binary relations between the occurring entities. Scholars were also advised to inspect the given verbs (which mainly describe a relation) and to think about suitable categories for the relations. In a second round, the proposed relation categories were discussed. The outcome was used for the final detection round. The final agreed relation categories are:
Complex questions with several entities were split into several relations. For example, the question "How do (autotrophic microorganisms)[ORGANISM] influence (carbon cycling)[PHENOMENA] in (groundwater aquifers)[ENVIRONMENT]?" resulted in two relations: influence (autotrophic microorganisms ORGANISM, carbon cycling PHENOMENA) and occur (carbon cycling PHENOMENA, groundwater aquifers ENVIRONMENT). Fig.
BiodivOnto
BiodivOnto is a conceptual model of the core concepts and relations in the biodiversity domain. The first version of BiodivOnto (
BiodivOnto initially had the following relations:
However, we merged the outcome from the analysis of the Biodiversity Questions as we did for classes. Thus, we included new relations as follows:
On the other hand, BiodivOnto initially included both "part_of" and "is_a" relations. However, we do not include them in the new ontology version, since they barely occur amongst the most common relations in the Biodiversity Questions. We kept only the relations that appear in both sources.
Fig.
Data Sources
To construct our corpora, we re-used the metadata and abstracts collected in our previous work (
The loss of biodiversity raises many concerns and is considered a major issue in our lives (
Named Entity Recognition (NER) Corpora
BIOfid (
COPIOUS (
Species-800 (
Linnaeus (
QEMP (
The existing datasets have several limitations. Some focus on species only, as in the case of BIOfid and COPIOUS. Some are based on legacy data, as in COPIOUS and BIOfid. Others rely solely on PubMed abstracts, as in the case of Species-800 and Linnaeus. Finally, some miss an important concept of the field, as in the case of QEMP, which does not contain species. In this work, we create an NER corpus that covers various biodiversity classes for both abstracts and metadata files.
Relation Extraction (RE) Corpora
Identifying the important entities is the first step in creating an RE gold standard. Based on this information, relationships amongst the entities in a sentence can be determined in a second step. There is a variety of approaches in the biomedical domain to identify relations amongst genes, diseases, proteins and drugs, for example, BioInfer (
There are only two approaches for relation extraction in the biodiversity domain: BacteriaBiotop (
To the best of our knowledge, there is no gold standard that also provides relations from dataset metadata. The corpora introduced so far focus mainly on species, habitats and locations. However, biodiversity research is a diverse research field with other important categories, such as data parameters, processes, materials and data types (
This project aims at constructing two corpora for NER and RE tasks, based on abstracts and metadata files from biodiversity datasets.
Methodology
In this section, we describe the process of constructing the NER and RE corpora.
BiodivNER Construction Pipeline
In this section, we explain the construction pipeline of the NER corpus as shown in Fig.
We followed a modified version of our previous project guidelines to construct the QEMP corpus (
We parsed the original data collection into sentences. Each sentence was tokenised into a set of words using the nltk library. Our annotation format is the BIO scheme, where a word is annotated with a B-tag (beginning of an entity), an I-tag (inside of an entity) or O (outside of any entity); hence, each word is initialised with an O tag. Each sentence, as a set of words with O tags, is stored vertically in a CSV file, as shown in Fig.
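For illustration, this preprocessing step could look like the following minimal sketch; the function name, output path and example document are hypothetical, and only the use of nltk and the vertical CSV layout come from the text above:

```python
# A minimal sketch of the described preprocessing, assuming the raw
# collection is available as plain-text strings; all names are illustrative.
import csv
import nltk

nltk.download("punkt", quiet=True)  # tokeniser models used below

def initialise_bio_csv(documents, out_path="biodivner_raw.csv"):
    """Split documents into sentences, tokenise each sentence and write
    one word per row with an initial 'O' (outside) tag."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Sentence#", "Word", "Tag"])
        sentence_id = 0
        for doc in documents:
            for sentence in nltk.sent_tokenize(doc):
                sentence_id += 1
                for word in nltk.word_tokenize(sentence):
                    writer.writerow([f"Sentence: {sentence_id}", word, "O"])

initialise_bio_csv(["Autotrophic microorganisms influence carbon cycling."])
```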
Four authors of this paper were responsible for annotating the corpus. Two of them have previous experience with biodiversity text annotation. The four annotators received periodic guidance from two biodiversity experts. Initially, we established a trial or pilot phase before the actual annotation process took place. The purpose of this phase was to train the annotators (participant guidance) and to revise the annotation guidelines. Around 2% (450 sentences) of the entire corpus was assigned to each annotator pair. Each annotator labelled a local copy of the pilot-phase data in an Excel file. During this process, each annotator was asked to annotate a relevant term with one and only one tag from the provided tags. The results of this process are represented in Fig.
After the pilot phase, we had familiarised ourselves with the annotation process and the guidelines. Each half of the corpus was assigned to an annotator pair. We followed the same procedure as in the pilot phase. Each annotator of a pair worked blindly on a local copy of the sheet, i.e. without access to the annotations of the other colleague. This procedure ensures a higher quality of the annotated data and allows the calculation of the inter-rater agreement. Each annotator was asked to complete the annotation of half of the corpus. This annotation process was time-consuming and lasted for several months. The annotation of a term was considered done once the annotator found the target tag in the selected existing data sources. However, if the annotator was unsure about the correct annotation, the term with a suggested tag was kept in a separate sheet named "Open Issues". We held various meetings with the biodiversity experts during this stage to resolve the open issues. Since we had two annotator pairs, say teams A and B, for two different sheets, where each sheet represented half of the corpus, we were able to calculate the inter-rater agreement for each team. We used Cohen's kappa score for the agreement computation, since it is one of the most common statistics to test inter-rater reliability (
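As a hypothetical illustration of this computation, scikit-learn provides an implementation of Cohen's kappa; the two tag sequences below are made-up examples, not data from our corpus:

```python
# Made-up tag sequences from two annotators over the same five tokens.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["B-Organism", "O", "B-Quality", "O", "B-Matter"]
annotator_b = ["B-Organism", "O", "B-Quality", "O", "B-Quality"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```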
We extracted the mismatches into a separate sheet per annotator pair. A sheet contained the actual sentence with each annotator's answer. The task of each annotator pair was to reconcile their mismatches and to reach a final annotation that both agreed on. We noticed that a significant cause of the mismatches was the longest-text-span rule in the annotation guidelines. For example, one annotator labelled the entire phrase "Secondary Metabolites" as MATERIAL, while the other tagged only "Metabolites" as MATERIAL. Such cases were the easiest to solve. Other cases, where an annotator pair could not agree on one correct annotation, were discussed with the biodiversity experts. For example, "Soil lipid biomass" seemed to be confusing, as it could be classified as either MATTER or QUALITY. In such a case, we followed the biodiversity experts' opinion and settled on MATTER.
BiodivRE Construction Pipeline
In this section, we describe our pipeline for constructing the binary and multi-class RE corpus on top of BiodivNER. Initially, we transformed the annotated NER data to suit the RE annotation process. Then, we sampled a subset of sentences to obtain an RE corpus of reasonable size for annotation. For each sampling method, we detail its advantages and disadvantages. Afterwards, we explain the annotation process for the RE corpus.
We considered the final NER corpus as the input for the RE corpus construction. We prepared the data in a more readable way: each sentence is represented by one row, followed by its corresponding NER annotations in the following row. The NER corpus contains sentences with multiple tags, whereas an RE corpus should be designed such that each sentence contains exactly two tags. For sentences with more than two tags, we therefore generated all possible combinations containing exactly two tags, as sketched below. Fig.
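The expansion into exactly-two-tag instances can be sketched as follows; the function and data structures are our own illustration of the described step, not the released code:

```python
# Expand a sentence with several entity mentions into all candidate
# two-entity combinations for relation annotation.
from itertools import combinations

def expand_to_pairs(sentence, entities):
    """entities: list of (surface form, tag) tuples found in the sentence."""
    return [(sentence, e1, e2) for e1, e2 in combinations(entities, 2)]

pairs = expand_to_pairs(
    "Autotrophic microorganisms influence carbon cycling in groundwater aquifers.",
    [("autotrophic microorganisms", "ORGANISM"),
     ("carbon cycling", "PHENOMENA"),
     ("groundwater aquifers", "ENVIRONMENT")],
)
for _, e1, e2 in pairs:
    print(e1, "<->", e2)  # three candidate pairs for three mentions
```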
In the pilot phase of the BiodivRE construction, we used a random sampling mechanism over the created corpus. We did not consider any selection criteria: we directly stacked the entire corpus in a list, shuffled it and randomly picked "n" sentences. We started annotating the resulting smaller corpus and, by doing so, encountered two issues. First, we found long sentences whose tags were too far apart, i.e. with many words between them, which makes the existence of a relation between the two tags very unlikely. Second, some of the relation pairs in the ontology did not appear in the corpus at all. There are two reasons for the second issue: either such relations do not appear in the original corpus or they are missed by the sampler, since it depends purely on random selection. The conclusion from the pilot phase was the need to change the sampling strategy.
We developed a Balance-Biased sampler to have more control over what to include in the final RE corpus. It is inspired by the Round-Robin scheduler. We grouped the sentences from the initial construction by tag pair, where a valid pair is one appearing in BiodivOnto; unsupported co-occurrences were grouped into a new category, "Other". At this stage, we handled the relations bidirectionally between entities of interest to cover cases like ENVIRONMENT have QUALITY and QUALITY of ENVIRONMENT. Afterwards, we iterated over the groups, including the entire set of tag pairs as well as the "Other" group, and picked one sentence from each group until a threshold was reached. By this means, we avoided any bias that could be caused by a random sampler. In our case, we selected 4,000 sentences as the threshold. An additional criterion is that we limit the number of words between the two entities of interest to a certain value, for example, 30 words. In this way, we solved the two problems that appeared with the random sampling method. First, we guarantee that the final corpus covers every relation of BiodivOnto that exists in the text. Second, we avoid FALSE sentences caused by entities that are too far apart, where it is clear that no relation could exist between them.
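A minimal sketch of such a Balance-Biased sampler, assuming the grouping keys and the illustrative ontology pairs given below (all names are our assumptions based on the description above), could look like this:

```python
# Round-robin ("Balance-Biased") sampling over tag-pair groups; sentences
# whose entities are more than `max_gap` words apart are discarded, and
# tag pairs unsupported by the ontology fall into the "Other" group.
from collections import defaultdict
from itertools import cycle

ONTOLOGY_PAIRS = {                       # illustrative subset, order-free
    frozenset({"ORGANISM", "ENVIRONMENT"}),
    frozenset({"ORGANISM", "PHENOMENA"}),
    frozenset({"QUALITY", "ENVIRONMENT"}),
}

def balance_biased_sample(candidates, threshold=4000, max_gap=30):
    """candidates: dicts with 'tags' (pair of entity tags), 'gap' (number
    of words between the entities) and 'sentence'."""
    groups = defaultdict(list)
    for c in candidates:
        if c["gap"] > max_gap:
            continue                     # entities too far apart
        key = frozenset(c["tags"])
        groups[key if key in ONTOLOGY_PAIRS else "Other"].append(c)

    sampled = []
    for key in cycle(list(groups)):      # one sentence per group per round
        if len(sampled) >= threshold or not any(groups.values()):
            break
        if groups[key]:
            sampled.append(groups[key].pop())
    return sampled
```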
We directly referred to BiodivOnto and limited the accepted relations to those supported by the ontology. For each sentence, we checked, on the one hand, whether there is a relation between its two named entities and, on the other hand, whether this relation has a semantic correspondence in BiodivOnto. For example, the verb phrase "has an impact on" is considered a synonym of the ontological relation "influence". FALSE examples arise either when the relation is not supported by BiodivOnto or when it has a different meaning than the ontological relation. For example, "Climate change (B-Phenomena I-Phenomena) impacts the carbon dioxide (B-Matter I-Matter)" is a FALSE sentence, since there is no ontological relation between PHENOMENA and MATTER. Such sentences appear because we also sample from the "Other" group in the selected sampling method. Another FALSE example may occur between two entities whose tag pair does have a relation in BiodivOnto: "Trees (B-Organism) with extrafloral nectaries (B-Matter I-Matter)" is a FALSE statement, since the word "with" does not imply the relation influence between ORGANISM and MATTER.
Similar to our procedure for constructing the NER corpus, we also applied a pilot phase for the RE annotation. Two of the authors annotated the same 50 randomly-picked sentences. Afterwards, we calculated the inter-rater agreement (Cohen's kappa), which resulted in 0.94. Due to this high score, we decided to split the corpus and continue the annotation individually.
During the real annotation phase, we encountered issues regarding the entity tags, especially the longest-span annotation. This rule was not always followed correctly during the annotation of the NER corpus. For example, "earthworm invasion" was annotated as "B-Organism" "B-Phenomena" instead of "B-Phenomena" "I-Phenomena". We fixed those cases to follow the rule declared originally in the NER guidelines; a programmatic sketch of such a correction is given below. Fig.
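The helper below is a hypothetical sketch of re-tagging a known multi-word entity so that it respects the longest-span rule; the function name and calling convention are our assumptions:

```python
# Re-tag the tokens of a known multi-word entity so the first token gets a
# B- tag and the continuation tokens get I- tags of the same class.
def fix_longest_span(tokens, tags, phrase, entity_class):
    """tokens/tags: parallel lists; phrase: words forming one entity."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            tags[i] = f"B-{entity_class}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{entity_class}"
    return tags

print(fix_longest_span(["earthworm", "invasion"],
                       ["B-Organism", "B-Phenomena"],  # wrong annotation
                       ["earthworm", "invasion"], "Phenomena"))
# -> ['B-Phenomena', 'I-Phenomena']
```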
Not Applicable
Three files for named entity recognition (NER) represent the train, dev and test splits.
Column label | Column description |
---|---|
Sentence# | Sentence number, in increasing order. |
Word | Word of the tokenised sentence. |
Tag | Corresponding NER tag, following the BIO scheme. Possible values are B/I-Environment, B/I-Phenomena, B/I-Matter, B/I-Quality, B/I-Location and B/I-Organism. |
Three files for Relation Extraction (RE) represent the train, dev and test splits.
Column label | Column description |
---|---|
Not Applicable | Possible values are 1 (a relation exists) and 0 (no relation exists). |
Not Applicable | The actual sentence, with the two anonymised entities that may or may not have a relation. |
Column label | Column description |
---|---|
Not Applicable | Possible values are NA (Not Applicable, where the relation is undetermined), influence, have and occur_in. |
Not Applicable | The actual sentence, with the two anonymised entities that may or may not have a relation. |
Results and Discussion
In this section, we give an overview of our final NER and RE corpora. We illustrate the characteristics of each corpus, for example, the class distribution in the NER corpus. In addition, we compare them to existing state-of-the-art corpora.
BiodivNER Characteristics
The final version of the NER corpus consists of three splits, train, dev and test, since our corpus mainly addresses various NLP tasks that can be solved with machine-learning techniques. We followed a split of 80%, 10% and 10% for the train, dev and test sets, respectively. All files are given in CSV format, each consisting of three columns, Sentence#, Word and Tag, as shown in Fig.
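For example, a consumer of the corpus might regroup the rows of one split into sentences as follows; the file name and the forward-fill step are assumptions about how the CSV is laid out:

```python
# Load one split and regroup the word-per-row layout into sentences.
import pandas as pd

df = pd.read_csv("train.csv", encoding="utf-8")  # columns: Sentence#, Word, Tag
df["Sentence#"] = df["Sentence#"].ffill()        # in case the column is sparse

sentences = (
    df.groupby("Sentence#", sort=False)
      .apply(lambda g: list(zip(g["Word"], g["Tag"])))
      .tolist()
)
print(sentences[0][:5])  # first five (word, tag) pairs of the first sentence
```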
Moreover, we compare our BiodivNER to the existing common corpora. Table
State-of-the-art comparison of NER corpora. The number of documents, statements and categories are given by #Doc., #Stat. and #Cate. respectively.
Corpus | Data Source | Type | #Doc. | #Stat. | #Words (#Tokens) | #Cate. | #Mentions (#Annotations) | #Unique Mentions |
---|---|---|---|---|---|---|---|---|
COPIOUS | BHL | Publications | 668 | 26,277 | 502,507 | 5 | 26,007 | 6,753 |
QEMP | iDiv, BExIS, Pangaea, Dryad, BEF-China | Dataset Metadata | 50 | 2,226 | 90,344 | 4 | 5,154 | 480 |
Species-800 | PubMed | Abstracts | 800 | 14,756 | 381,259 | 1 | 5,330 | 1,441 |
Linnaeus | PubMed Central (PMC) | Publications | 100 | 34,310 | 828,278 | 1 | 3,884 | 324 |
BiodivNER | iDiv, BExIS, Pangaea, Dryad, BEF-China, PubMed | Dataset Metadata, Abstracts | 150 | 2,398 | 102,113 | 6 | 9,982 | 1,033 |
COPIOUS has two categories closely related to biodiversity (Habitat and Taxon) and two general categories (TemporalExpression and GeographicalLocation). QEMP has four categories derived from the biodiversity domain (Environment, Material, Process and Quality); as there was already a variety of corpora for species, QEMP concentrated only on the missing categories. BiodivNER covers this essential species category in addition to the same closely-related classes as QEMP and the general-domain LOCATION category.
BiodivRE Characteristics
Similar to BiodivNER, we created three splits in CSV format for both the binary and the multi-class RE corpus. The files consist of two columns: (1) the relationship, either in binary or label form and (2) the sentence, where the actual named entities are encoded with their tags. An example line in the binary-relations file is: "1 Our study shows a significant decline of the @QUALITY$ of @ENVIRONMENT$.". In the multi-class files, it would be: "have, Our study shows a significant decline of the @QUALITY$ of @ENVIRONMENT$.". This format facilitates the training procedure for any machine-learning technique. We followed the same split setting of 80%, 10% and 10% for the train, dev and test sets, respectively.
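Reading such a split could look like the sketch below; the file name and the absence of a header row are assumptions about the released files:

```python
# Load the binary RE split, assuming two CSV columns: label, sentence.
import pandas as pd

re_df = pd.read_csv("re_binary_train.csv", header=None,
                    names=["label", "sentence"])
n_true = (re_df["label"] == 1).sum()
print(f"{n_true} TRUE statements out of {len(re_df)}")
```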
Fig.
Table
Corpus | Relations | #TRUE Statements | #FALSE Statements | Total |
---|---|---|---|---|
GAD | Binary | 25,209 | 22,761 | 53,300 |
EU-ADR | Binary | 2,358 | 837 | 3,550 |
BioRelEx | Multi-class | 1,379 | 62 | 1,606 |
BiodivRE | Binary, Multi-class | 1,369 | 2,631 | 4,000 |
Conclusions and Future Work
We introduced BiodivNERE, a package of two corpora for NER and RE tasks, based on abstracts and metadata from the biodiversity domain. We manually annotated them and revised them with biodiversity experts. BiodivNER, the NER corpus, covers six important classes of the biodiversity domain. BiodivRE is a binary and multi-class benchmark containing three relations from the domain. Both classes and relations are derived from the analysis of our previous work (Biodiversity Questions and BiodivOnto). We publicly release our corpora and code.
Future Work
We see multiple areas in which to extend this work. We plan to include more classes and relations from the biodiversity domain, for example, by restoring the relations dropped from BiodivOnto, such as "part_of" and "is_a". In addition, we will include more data sources to cover a broader range of the domain and evaluate them in terms of annotation quality. Last but not least, we will apply both corpora to a machine-learning model to bring them to an actual use case.
The authors thank the Carl Zeiss Foundation for the financial support of the project "A Virtual Werkstatt for Digitization in the Sciences (K3, P5)" within the scope of the program line "Breakthroughs: Exploring Intelligent Systems for Digitization - explore the basics, use applications", which funds Nora Abdelmageed and Sheeba Samuel. Alsayed Algergawy's work has been funded by the Deutsche Forschungsgemeinschaft (DFG) as part of CRC 1076 AquaDiva (Project Number 218627073). Jitendra Gaikwad acknowledges the support provided by the Deutsche Forschungsgemeinschaft (DFG) and Friedrich Schiller University Jena via NFDI4Biodiversity (Project Number 442032008). Felicitas Löffler was partially funded by the DFG in the scope of the GFBio project (Project Number 229241684). Anahita Kazem is funded by the DFG in the scope of the German Centre for Integrative Biodiversity Research (iDiv) (Project Number 202548816).