Biodiversity Data Journal :
Research Article
Corresponding author: Maria Auxiliadora Mora (mariamoracross@gmail.com), José Enrique Araya (jaraya@itcr.ac.cr)
Academic editor: Yasen Mutafchiev
Received: 28 Sep 2017 | Accepted: 11 Jun 2018 | Published: 26 Jun 2018
© 2018 Maria Mora, José Araya
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Mora MA, Araya JE (2018) Semi-automatic Extraction of Plants Morphological Characters from Taxonomic Descriptions Written in Spanish. Biodiversity Data Journal 6: e21282. https://doi.org/10.3897/BDJ.6.e21282
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most taxonomic information is available as text in scientific publications. The number of publications is very large; processing them to obtain highly structured texts would therefore be complex and very expensive. Approaches like citizen science may help the process by selecting whole fragments of texts dealing with morphological descriptions, but a deeper analysis, compatible with accepted ontologies, will require specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles).
It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation.
This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The developed algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3), Trees of Costa Rica Volume IV (TCRv4) and to a subset of descriptions of the Manual of Plants of Costa Rica (MPCR) with very competitive results (more than 92.5% of average performance). The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows documenting structures, characters and relations between characters and structures. Each extracted object is associated with attributes such as name, value, modifiers, restrictions and ontology term id.
The implemented tool is free software. It was developed using Java and integrates existing technology such as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.
Information Extraction, Natural Language Processing, Biodiversity Informatics
The transformation of texts from taxonomic literature into structured data remains a key challenge in biodiversity informatics, recognised by international initiatives such as the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EOL), and the Biodiversity Heritage Library (BHL) (
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. The scientific community has described more than 1.9 million species, which represents around 17% of the planet's expected biodiversity (
Morphological descriptions synthesise observations made by taxonomists over centuries of research. They contain statements that detail morphological aspects (i.e. shape and structure) of species that are useful for identifying them. Statements may describe structures, substructures, characters, states, and relationships between structures. Examples of structures are leaves, apex, flowers, or fruits. Examples of characters are length, width, pigmentation patterns, smell, or architecture. An example of part of a description is the statement "hojas simples, alternas, 4-10 (-14) × 1-3.3 cm. / Simple leaves, alternate, 4-10 (-14) × 1-3.3 cm.". In this example, the described structures are the "hojas / leaves", one of their characters is the architecture (not mentioned) and the state of this character is "simple". To take full advantage of this information and to work towards integrating it with repositories of biodiversity knowledge such as the ones developed by GBIF, EOL and BHL, the biodiversity informatics community first needs to convert plain text into a machine-processable format. More precisely, the names of structures and substructures and the characters that describe them need to be identified. The generated data would allow, amongst other services, the development of applications to identify specimens (e.g. electronic keys), to improve information search mechanisms, to perform data analysis of species having particular characteristics, and to compare species descriptions.
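As an illustration of what "machine-processable" could mean for the example statement, the following minimal Python sketch turns the size expression into numeric ranges and records the architecture character as a triple. It is not the authors' implementation. The names `Range` and `parse_size` are hypothetical, and two readings are assumptions: that the first range is the length and the second the width, and that the parenthesised "(-14)" is an occasional extreme maximum.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Range:
    low: float
    high: float
    extreme_high: Optional[float] = None  # parenthesised value such as "(-14)"


# One side of the "×" sign: "4-10 (-14)" or "1-3.3".
_RANGE = r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)(?:\s*\(-\s*(\d+(?:\.\d+)?)\))?"
_SIZE = re.compile(rf"{_RANGE}\s*[×x]\s*{_RANGE}\s*([a-z]+)")


def parse_size(statement):
    """Return length, width and unit found in a statement, or None."""
    match = _SIZE.search(statement)
    if match is None:
        return None
    groups = match.groups()

    def to_range(low, high, extreme):
        return Range(float(low), float(high), float(extreme) if extreme else None)

    return {
        "length": to_range(*groups[0:3]),  # assumption: first range is length
        "width": to_range(*groups[3:6]),   # assumption: second range is width
        "unit": groups[6],
    }


statement = "hojas simples, alternas, 4-10 (-14) × 1-3.3 cm."
# (structure, character, state) triple taken from the example in the text.
triples = [("hojas", "architecture", "simples")]
size = parse_size(statement)
print(triples, size)
```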
Fortunately, morphological descriptions of plants use a semi-structured language characterised by:
Examples of types of phrases (chunks) used in morphological descriptions included in TCRv4 (total number of phrases = 2,457). File is semi-colon separated.
| Phrase | Grammar structure (Spanish) | Number of phrases with this structure | % of occurrence in text |
| --- | --- | --- | --- |
| inflorescencias fasciculadas. / fasciculated inflorescences. | Name + adjective + punctuation mark (NAF) | 1152 | 17.5 |
| deciduas, / deciduous, | Adjective + punctuation mark (AF) | 828 | 12.5 |
| 6-30 m de altura. / 6-30 m high. | Number + unit of measurement + preposition + name + punctuation mark (ZUSNF) | 613 | 9.8 |
| 6-30 × 2-10.5 cm | Number + area symbol + number + name + punctuation mark (ZGZNF) | 289 | 4.2 |
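The grammar-structure codes in the table above (NAF, AF, ZUSNF) are concatenations of one-letter categories, one per token. A minimal Python sketch of how such codes could be derived and counted is shown below. The tags are written by hand here; in the described system they would come from the morphosyntactic analysis. The function name `signature` and the small chunk list are illustrative assumptions, not data from the books.

```python
from collections import Counter


def signature(tagged_chunk):
    """Concatenate the one-letter category of each token, e.g. 'NAF'."""
    return "".join(tag for _, tag in tagged_chunk)


# Hand-tagged chunks: N = name, A = adjective, F = punctuation mark,
# Z = number, U = unit of measurement, S = preposition.
chunks = [
    [("inflorescencias", "N"), ("fasciculadas", "A"), (".", "F")],
    [("deciduas", "A"), (",", "F")],
    [("6-30", "Z"), ("m", "U"), ("de", "S"), ("altura", "N"), (".", "F")],
    [("hojas", "N"), ("simples", "A"), (",", "F")],
]

counts = Counter(signature(chunk) for chunk in chunks)
total = sum(counts.values())
# Percentage of chunks that follow each grammar structure.
shares = {code: 100 * n / total for code, n in counts.most_common()}
print(shares)
```

Ranking the codes by frequency in this way indicates which grammar structures are worth covering with rules first.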
Biodiversity Informatics (BI) provides techniques and mechanisms to capture, process, integrate, and publish data and information on the planet's biodiversity. BI initiatives, such as GBIF, EOL, BHL, and the Bar Code of Life, work on the discovery, aggregation and free exchange of genetic data, species occurrences, natural history information, conservation status, management, conservation and geographic data, amongst others. The integrated data enable users to answer questions related to processes that occur in time and space, for example, the possible effects of climate change on particular species, the effects of land-use change on species in an area, or the prediction of the introduction routes of invasive species. More recently, species traits databases have been added to the aforementioned data types. They store, in the form of triplets (e.g. flowers, colour, white), information extracted manually or semi-automatically from morphological descriptions, habitat, natural history, species interactions, and distribution. An example of such repositories is the TraitBank designed by EOL to integrate data from multiple databases (
This research is related to the information extraction area (
The implemented system will provide the basis for processing more than one hundred field guides of plants and other biological groups such as arthropods, mollusks, vertebrates, and fungi published by the National Biodiversity Institute of Costa Rica (INBio) and, in the future, for supporting the Latin American community in this process. Processing this information manually would be a monumental task, and algorithms like the one proposed in this research would greatly reduce the effort required.
The following are the major contributions of this research:
With the present investigation, an effort has been made to semi-automatically generate structured information with high semantic value from morphological descriptions of plants written in Spanish. The goal of the algorithm is to extract structures, substructures, characters and states, to relate each character to its corresponding structure or substructure, and to establish relationships between structures. Specifically, the algorithm must convert morphological descriptions, which are initially in text format, into database records.
The system receives morphological descriptions in tabular format (i.e. scientific name and description), processes them, and generates documents in XML according to the scheme proposed by
Fig.
As can be seen in the red frame on Fig.
<character name="arrangement" value="elípticas" notes="Carácter repetido"/>
<character name="shape" value="elípticas" notes="Carácter repetido"/>
The system adds a note of "Carácter repetido / Repeated character" so that, at a later stage, an expert could determine the correct character to be used.
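A minimal Python sketch of this behaviour follows, using the standard `xml.etree.ElementTree` module. The `CANDIDATES` lookup and the function `character_elements` are hypothetical stand-ins for the knowledge base and the extraction step. Only the `character` element and its `name`, `value` and `notes` attributes are taken from the fragment above.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup: state -> character names it may belong to.
CANDIDATES = {
    "elípticas": ["arrangement", "shape"],
    "simples": ["architecture"],
}


def character_elements(state):
    """Build one <character> element per candidate character name."""
    names = CANDIDATES.get(state, [])
    elements = []
    for name in names:
        attributes = {"name": name, "value": state}
        if len(names) > 1:
            # Ambiguous state: flag it so an expert can pick the right one.
            attributes["notes"] = "Carácter repetido"
        elements.append(ET.Element("character", attributes))
    return elements


for element in character_elements("elípticas"):
    print(ET.tostring(element, encoding="unicode"))
```

Emitting every candidate with a note, rather than guessing, keeps the decision with the expert reviewer.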
The rules and conditions that guide the extraction process were defined from a morphosyntactic analysis performed on descriptions of the TCRv4 book (
The algorithm analyses the role of each token and the dependencies between tokens in a chunk and creates or modifies the corresponding objects in a database. Three types of objects are generated: structures, characters, and relationships. Tokens that do not generate one of those types of objects are considered modifiers. Chunks that are part of the same clause are processed from left to right. The order of processing is important because not all chunks include the structure with which the characters should be associated; in those cases, the algorithm looks for it in previously processed chunks.
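The left-to-right processing can be sketched in Python as below. This is a deliberately simplified illustration: it assumes each chunk has already been reduced to an optional structure and a list of states, and it falls back to the most recently seen structure. The dictionary layout and the function `attach_characters` are hypothetical.

```python
def attach_characters(chunks):
    """Pair each state with a structure, scanning chunks left to right."""
    records = []
    current_structure = None
    for chunk in chunks:
        # A chunk that names a structure replaces the current one.
        if chunk.get("structure"):
            current_structure = chunk["structure"]
        # States in a chunk without a structure reuse the previous one.
        for state in chunk.get("states", []):
            records.append((current_structure, state))
    return records


clause = [
    {"structure": "hojas", "states": ["simples"]},
    {"states": ["alternas"]},  # no structure in this chunk
]
print(attach_characters(clause))
```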
To associate each character with the correct structure or substructure, the algorithm uses the order of appearance of tokens and their concordance in gender and number. Fig.
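A minimal Python sketch of the agreement check is given below, assuming that gender and number are already available for each token. The `Token` type, the function `find_structure` and the "C" code for adjectives with common gender (such as "simples") are assumptions made for this illustration.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    gender: str  # "M", "F", or "C" for common gender (assumption)
    number: str  # "S" singular or "P" plural


def agrees(adjective, structure):
    """True when number matches and gender matches or is common."""
    same_gender = "C" in (adjective.gender, structure.gender) or (
        adjective.gender == structure.gender
    )
    return same_gender and adjective.number == structure.number


def find_structure(adjective, seen_structures):
    """Return the nearest preceding structure that agrees, else None."""
    for candidate in reversed(seen_structures):
        if agrees(adjective, candidate):
            return candidate
    return None


hojas = Token("hojas", "F", "P")
margen = Token("margen", "M", "S")
seen = [hojas, margen]  # order of appearance in the clause

print(find_structure(Token("alternas", "F", "P"), seen).form)
print(find_structure(Token("aserrado", "M", "S"), seen).form)
```

In this sketch "alternas" skips the closer structure "margen" because gender and number do not agree, and is attached to "hojas" instead.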
The algorithm does not extract all information available in taxon descriptions. It does not process all prepositional or verbal phrases; however, as a proof of concept, the prepositional phrases that begin with the tokens "sin / without" or "con / with" are structured. The remaining prepositional and verbal phrases are only delimited as constraint_preposition and constraint_verb, respectively. Fig.
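The routing of prepositional and verbal phrases can be sketched in Python as follows. The labels `constraint_preposition` and `constraint_verb` come from the text; the tag letters ("S" for preposition, "V" for verb), the function `classify_phrase` and the return value "structured" are assumptions for this illustration.

```python
STRUCTURED_PREPOSITIONS = {"sin", "con"}


def classify_phrase(tokens):
    """Decide how a phrase is handled from its first (form, tag) pair."""
    form, tag = tokens[0]
    if tag == "S":
        if form.lower() in STRUCTURED_PREPOSITIONS:
            return "structured"  # processed further into objects
        return "constraint_preposition"  # only delimited
    if tag == "V":
        return "constraint_verb"  # only delimited
    return "other"


print(classify_phrase([("con", "S"), ("tricomas", "N"), ("rojos", "A")]))
print(classify_phrase([("en", "S"), ("la", "D"), ("parte", "N"), ("distal", "A")]))
```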
Semantic annotation is done using dependency trees generated by the FreeLing Dependency Parser (
Adjectives (A) rules:
Names (E) rules:
Numerals (Z) rules:
Adverbs (R) rules:
An example of associating an adverb with a character is shown in Fig.
If adverb R1 appears in a sentence in parentheses, the algorithm associates R1 with the token that corresponds within the sentence as shown in Fig.
| Chunk | Contents |
| --- | --- |
| T11L5S7 | levemente revoluta / slightly revolute |
| T112L4S2 | hojas simples, alternas (rara vez opuestas) / simple leaves, alternate (rarely opposite) |
| T3L9S5 | densamente tomentulosas / densely tomentulose |
| T16L5S8 | margen finamente aserrado / finely serrate margin |
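For chunks like those in the table, the adverb can be stored as a modifier of the character value it qualifies. The Python sketch below assumes a simple rule, namely that an adverb (tag "R") modifies the next adjective (tag "A") in the chunk; this rule and the function `attach_adverbs` are illustrative assumptions, not the rule set of the described algorithm.

```python
def attach_adverbs(tagged_chunk):
    """Attach each adverb to the next adjective as a modifier."""
    characters = []
    pending_adverbs = []
    for form, tag in tagged_chunk:
        if tag == "R":
            pending_adverbs.append(form)
        elif tag == "A":
            characters.append({"value": form, "modifiers": pending_adverbs})
            pending_adverbs = []  # start fresh for the next adjective
    return characters


chunk = [("margen", "N"), ("finamente", "R"), ("aserrado", "A")]
print(attach_adverbs(chunk))
```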
Determiners (D) rules:
The algorithm processes quantifiers (i.e. some, quite, none, all, several) and articles (i.e. the) as a proof of concept. Table
| Chunk | Contents |
| --- | --- |
| T235L6S1 | varias por racimo / several per cluster |
| T6L5S8 | el margen dentado-mucronado en la parte distal de la lámina / dentate-mucronate margin in the distal part of the lamina |
Other elements such as articles, pronouns, and conjunctions are processed in the same way by the algorithm.
The texts used in this research are from the books Trees of Costa Rica Volume III - ACRv3 (
In this section, we describe the results of executing the algorithm on a random sample of clauses extracted from the ACRv3 and MPCR books (5% of the total available clauses). The data sample was produced using the Roulette Wheel selection algorithm (
Average number of structures and characters in the evaluated clauses of the ACRv3 and MPCR books.
| Book | Number of descriptions | Total clauses | Sample size (clauses) | Average number of structures | Average number of characters |
| --- | --- | --- | --- | --- | --- |
| ACRv3 | 233 | 1,738 | 87 (5%) | 2.85 | 3.62 |
| MPCR | 237 | 2,230 | 106 (5%) | 3.42 | 3.69 |
The ACRv3 book includes information on 233 species with 1,738 clauses, of which 87 (5%) were included in the data sample. From the MPCR, 237 species descriptions with 2,230 clauses were selected, from which a random sample of 106 clauses (5%) was extracted.
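The sampling step can be sketched in Python as follows. Roulette Wheel selection picks each item with a probability proportional to its weight. The text does not state which weights were used or whether clauses could be drawn twice, so this sketch assumes equal weights and sampling without replacement; the function `roulette_sample` is hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_sample(items, weights, k, rng):
    """Draw k distinct items, each spin proportional to the weights."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(k):
        cumulative = list(accumulate(weights))
        spin = rng.random() * cumulative[-1]
        # First slot whose cumulative weight exceeds the spin.
        index = min(bisect_right(cumulative, spin), len(items) - 1)
        chosen.append(items.pop(index))  # remove so it cannot repeat
        weights.pop(index)
    return chosen


clause_ids = range(1738)               # clauses of ACRv3
sample_size = round(0.05 * 1738)       # 5% of the clauses
sample = roulette_sample(clause_ids, [1.0] * 1738, sample_size, random.Random(0))
print(sample_size, len(set(sample)))
```

With equal weights this reduces to uniform random sampling; unequal weights would bias the sample towards the clauses with larger weights.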
The complexity of the clauses in the samples taken from the ACRv3 and MPCR books was well distributed: 52% of clauses were simple and 48% complex in ACRv3, and 53% were simple and 47% complex in MPCR. A clause is considered simple if it has two or fewer structures and complex if it has more than two. Fig.
The metrics generally used in IE to evaluate the results are precision and recall (
Precision of the algorithm when applied to samples of ACRv3 and MPCR books.
| Book | Identification of structures (precision) | Character structuring (precision) | Association of characters to structures (precision) | Association of conjunctions (precision) |
| --- | --- | --- | --- | --- |
| ACRv3 | 97.9% | 98.1% | 98.7% | 96.4% |
| MPCR | 98.1% | 92.8% | 86.4% | 92.4% |
Recall of the algorithm when applied to samples of ACRv3 and MPCR books.
| Book | Identification of structures (recall) | Character structuring (recall) | Association of characters to structures (recall) | Association of conjunctions (recall) |
| --- | --- | --- | --- | --- |
| ACRv3 | 97.9% | 99.0% | 98.7% | 96.4% |
| MPCR | 98.1% | 93.0% | 86.4% | 92.4% |
Performance (F) of the algorithm when applied to samples of the ACRv3 and MPCR books.
| Book | Identification of structures (F-1) | Character structuring (F-1) | Association of characters to structures (F-1) | Association of conjunctions (F-1) | Average (F-1) |
| --- | --- | --- | --- | --- | --- |
| ACRv3 | 97.9 | 98.5 | 98.7 | 96.4 | 97.9 |
| MPCR | 98.1 | 93.3 | 86.4 | 92.4 | 92.5 |
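For reference, the F-1 measure is the harmonic mean of precision and recall. The short Python sketch below applies the standard formula to the ACRv3 character structuring values reported above (precision 98.1%, recall 99.0%) and averages the four ACRv3 F-1 values. It assumes that the reported figures follow this standard definition; the function name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ACRv3, character structuring: precision 98.1, recall 99.0.
acr_character_f1 = f1(98.1, 99.0)

# Mean of the four ACRv3 F-1 values from the table.
acr_average = sum([97.9, 98.5, 98.7, 96.4]) / 4

print(round(acr_character_f1, 1), round(acr_average, 1))
```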
The following datasets used by this research are available in GitHub (
That site also contains a detailed description of the files.
The semantic annotation results showed that, due to the semi-structured nature of morphological descriptions of plants, it is feasible to implement, with excellent results, a simple semantic analysis algorithm based on rules using available technology (i.e. FreeLing, OTO, PO, and Flora Mesoamericana English-Spanish Glossary). From Table
The algorithm is scalable (within the biological group of plants) as demonstrated by evaluating it not only in tree records (ACRv3), but also in records of aquatic plants, shrubs, epiphytes, grasses, and lianas described by different authors of the MPCR. Although the MPCR clauses are somewhat more complex as shown in Fig.
Fig.
English translation: "Leaf blade 3-12 (-13) x 2-6 (-7) cm, oblong or elliptic, obtuse or rounded at base, acutely or shortly acuminate at apex, very closely serrated to sub-entire or entire, sparsely pubescent with red trichomes (rarely reddish-cream), usually with black spots on the underside."
In this example, all chunks should be associated with the main structure (Fig.
The performance of the algorithm can be improved using ontologies which include hierarchies of structures / substructures and controlled vocabularies (list of valid characters to describe a structure) or storing additional information in the knowledge base, as follows:
This result should be considered if the algorithm is going to be extended for application to other biological groups (e.g. vertebrates or arthropods), since ontologies do not exist for all of them.
Prepositional and Verbal Phrases: The algorithm does not process all prepositional or verbal phrases; however, as a proof of concept, the prepositional phrases that begin with the tokens "without" or "with" were structured. The remaining prepositional and verbal phrases were only delimited as constraint_preposition and constraint_verb, respectively. Based on the results of this investigation, a later refinement of the algorithm should define the scope of the extraction goal for these cases in more detail. The refinement should take into account the meaning of each preposition and verb when processing the chunk.
Fig.
Fig.
This research presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. It achieves very competitive results (more than 92.5% of average performance) in annotating structures, characters, associating characters with structures and processing conjunctions.
The algorithm is based on rules defined after analysing the morphosyntactic patterns of the dependency trees for the most-used grammatical structures in the ACRv4 book. To define those rules, 73.72% of the book's chunks were analysed.
Because the quality of the results depends strongly on the roles assigned to the words in a chunk being correct, it was necessary to implement a machine learning algorithm to correct the POS labels assigned by FreeLing's Morphosyntactic Analyzer. The technical terminology based on Latin and the semi-structured language used in morphological descriptions, which is full of names, adjectives and adverbs with few verbs, cause FreeLing to misassign POS tags to names, adjectives, and verbs.
Although the algorithm achieved a very good performance, improvements are needed in the following stages of the process:
Translation of tokens into English: This is the manual process that requires the most user attention. Possible improvements include the use of synonyms of the PO; addition of other glossaries; use of Google synonyms; and incorporation of Wiktionary which, in most definitions, includes the botanical meaning of each term.
Semantic annotation of descriptions: Part of the information available in morphological descriptions was not extracted because it was part of a prepositional or verbal phrase. The scope of the information extraction process should be extended in order to structure the information contained in these phrases. Additionally, the selection of the appropriate character amongst the repeated characters could be done using machine learning algorithms.
The implemented algorithm is based on the telegraphic language used by the community of botanical experts. However, it can be generalised to other biological groups by preprocessing the texts of the descriptions to omit some function words (e.g. the verb "to be"), which brings them closer to the telegraphic language used by botanists, and by extending the functionality of the algorithm.
Ontologies are an important resource to extract information from morphological descriptions since they include functionality to agree on the characters that describe a structure / substructure, to document hierarchies of structures and substructures, and to define the controlled vocabulary for improving the association of characters to structures. However, not all biological groups have general ontologies such as the PO. An ontology integrator such as OTO or a knowledge base designed to include this information will help to improve results and to further develop the algorithm to work with other biological groups.