Biodiversity Data Journal
Software Description
Corresponding author: Gabriel Muñoz (nasua.research@gmail.com)
Academic editor: Donat Agosti
Received: 30 Jul 2018 | Accepted: 19 Dec 2018 | Published: 16 Jan 2019
© 2019 Gabriel Muñoz, W. Daniel Kissling, E. Emiel van Loon
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Muñoz G, Kissling W, van Loon E (2019) Biodiversity Observations Miner: A web application to unlock primary biodiversity data from published literature. Biodiversity Data Journal 7: e28737. https://doi.org/10.3897/BDJ.7.e28737
A considerable portion of primary biodiversity data is digitally locked inside published literature, which is often stored as PDF files. Large-scale approaches to biodiversity science could benefit from retrieving this information and making it digitally accessible and machine-readable. Nonetheless, the amount and diversity of digitally published literature pose many challenges for knowledge discovery and retrieval. Text mining has been used extensively for data discovery in large collections of documents. However, text mining approaches for knowledge discovery and retrieval have seen limited use in biodiversity science compared to other disciplines.
Here, we present a novel, open source text mining tool, the Biodiversity Observations Miner (BOM). This web application, written in R, allows the semi-automated discovery of punctual biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with the scientific names present inside a corpus of scientific literature. Furthermore, BOM enables users to rapidly screen large quantities of literature based on word co-occurrences that match custom biodiversity dictionaries. This tool aims to increase the digital mobilisation of primary biodiversity data and is freely accessible via GitHub or through a web server.
biodiversity data, biodiversity knowledge, biotic interactions, data mobilisation, scientific names, text mining, R.
Mobilisation, digitalization and interoperability of data on biodiversity are vital for sharing our global knowledge of nature (
Text mining is a computational technique used for the automatic and semi-automatic discovery of useful information from large quantities of text (
Here, we present the Biodiversity Observations Miner (BOM), a text mining tool designed to help ecologists and biodiversity scientists integrate text mining into their data compilation workflows. A first way of using BOM in biodiversity research is to speed up and standardise the selection of candidate articles for large-scale meta-analyses. BOM can also be used for the rapid discovery of specific biodiversity data across multiple articles at once. As such, this web tool can be used to discover observations from literature and to populate global biodiversity databases, for example on species traits (e.g. TRY) or species interactions (e.g. GloBI), thereby increasing the digital accessibility and availability of biodiversity data. The main feature of BOM is the identification of snippets of text that potentially contain biodiversity information (i.e. data of biodiversity observations) within a given corpus of literature. BOM finds these snippets either by locating text statements linked to taxonomic entities (e.g. species, genus or family names) or by using specific keywords to filter a ranked list of annotated word co-occurrences inside the corpus of literature. These keywords are curated lists of terms describing a particular biodiversity observation and are provided in BOM as biodiversity dictionaries. Biodiversity Observations Miner is open source and freely accessible via GitHub (BiodiversityObservationsMiner) or via a web server (goo.gl/wt6V9R).
User interface:
The web application follows a dashboard design containing a header, a sidebar menu and the main page (Fig.
Sections of the Biodiversity Observations Miner (BOM) user interface: The figure illustrates the different parts that compose the user interface of the BOM web application. The interface is composed of three main components: a header (white bar on top), a sidebar menu (dark blue, on the left side) and the main page (cyan, in the centre). The header includes the application name (1), a button to collapse the sidebar menu (2) and a notification menu (3). The sidebar menu (4) contains the individual tabs to navigate across the functionalities of BOM. The main page (5) is where parameters are set and the results of the mining steps are displayed. On the main page, the headers of settings boxes are colour-coded yellow, whereas the headers of result boxes (i.e. text snippets) are colour-coded red.
Functional description:
OCR of PDF files
Before using Biodiversity Observations Miner, a user needs to create a corpus of relevant literature, stored as a collection of individual PDF files. This biodiversity literature corpus can be compiled by downloading PDFs of scientific articles from web databases such as Web of Science and Google Scholar. The collection of PDF files can be uploaded in batch to BOM. PDF versions from different publications can be very heterogeneous in nature. As such, plain text from PDF file(s) is recognised with the Google Tesseract tool for Optical Character Recognition (OCR) (
Scientific name recognition
Biodiversity Observations Miner makes use of the Global Names Recognition and Discovery (GNRD) (
Calculating word co-occurrences
Individual sentences across the whole literature corpus are considered as text snippets that potentially contain one or more biodiversity observations of particular interest for a user of BOM. As such, word co-occurrence patterns can provide useful information to characterise the content of these text snippets. For example, the words "body" + "size" can be used to tag individual text snippets with information on allometric relations, functional trait relationships etc. In BOM, text strings from the literature corpus are split into sentences using a sentence tokeniser. Then, the individual elements (e.g. nouns, verbs, articles) of these sentences are annotated with a pre-trained, English based, natural language processing (NLP) model (
The skip-n-gram model is a practical, powerful model to infer context from text and is usually applied in processes such as speech recognition (
Example of a moving window of n = 6 of a skip-n-gram model over a piece of text from
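The moving-window idea described in this section can be illustrated with a minimal Python sketch. This is not BOM's code (the application is written in R and annotates sentences with an NLP model); it assumes simple lower-cased, letters-only tokenisation, and the function name `skipgram_cooccurrences` is hypothetical. A window of n tokens is interpreted here as pairing each word with the n - 1 words that follow it.

```python
from collections import Counter
import re

def skipgram_cooccurrences(sentence, window=6):
    """Count unordered word pairs that fall inside a moving window of `window` tokens."""
    # Assumption: crude tokenisation that keeps only runs of letters.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the current token with the tokens that follow it inside the window.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Made-up example sentence, not taken from the article.
example = "Body size of the toucan limits the size of fruit it can swallow"
pairs = skipgram_cooccurrences(example, window=6)
print(pairs[("body", "size")])      # pair inside the window
print(pairs[("body", "fruit")])     # pair too far apart to co-occur
```

Pairs are stored in alphabetical order so that "body" + "size" and "size" + "body" are counted together, which matches the use of co-occurrences as unordered tags for a snippet.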
Retrieving text snippets
BOM uses indexed scientific names and word co-occurrences to retrieve text snippets across the whole uploaded literature corpus. This allows rapid discovery of targeted biodiversity observations inside the corpus text. First, in the byTaxa tab, scientific names are used to retrieve text snippets and word co-occurrences are used to characterise their content, which allows rapid screening of literature based on the particular taxonomic interest of an individual user. Second, in the byKeywords tab, BOM allows the retrieval of text snippets based on individual word co-occurrences only. These word co-occurrences can be further filtered using custom biodiversity dictionaries.
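The two retrieval routes can be sketched as follows. This is a simplified Python illustration, not BOM's implementation: the sentence splitter is a naive regular expression, the scientific name is supplied by hand (in BOM, names are indexed through GNRD), and all function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence tokeniser: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def snippets_by_taxon(sentences, name):
    """Return the sentences that mention a given scientific name (case-insensitive)."""
    return [s for s in sentences if name.lower() in s.lower()]

def snippets_by_keywords(sentences, word_a, word_b):
    """Return the sentences in which both words of a co-occurrence appear."""
    hits = []
    for s in sentences:
        words = set(re.findall(r"[a-z]+", s.lower()))
        if word_a in words and word_b in words:
            hits.append(s)
    return hits

# Made-up corpus text for illustration only.
text = (
    "Ramphastos toco was observed eating fruits of Cecropia. "
    "The study site was a lowland forest. "
    "Body size was measured for each bird."
)
sentences = split_sentences(text)
print(snippets_by_taxon(sentences, "Ramphastos toco"))
print(snippets_by_keywords(sentences, "body", "size"))
```

A splitter this simple would break on abbreviated names such as "R. toco", which is one reason a dedicated sentence tokeniser is preferable in practice.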
Biodiversity dictionaries
A biodiversity dictionary is a list of common terms used to describe a particular biodiversity observation. Currently, BOM lists biodiversity dictionaries matching text observations of frugivory and pollination, i.e. specific biotic interaction types. For example, the written description of a plant-animal interaction of frugivory might include terms such as fruit, eat, disperse, swallow, etc. (Fig.
Example of one text snippet resulting from running Biodiversity Observations Miner with
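A dictionary filter of this kind can be sketched in Python as below. The four terms are the frugivory examples named in this section; everything else is an assumption made for illustration, in particular the prefix matching (so that "eat" also matches "eating") and the ranking of snippets by the number of distinct terms matched. The function names are hypothetical and this is not BOM's code.

```python
import re

# Example frugivory terms mentioned in the text; a real dictionary would be longer.
FRUGIVORY_TERMS = ["fruit", "eat", "disperse", "swallow"]

def matched_terms(snippet, dictionary):
    """Return the dictionary terms that prefix-match at least one word of the snippet."""
    words = re.findall(r"[a-z]+", snippet.lower())
    return sorted({term for term in dictionary for w in words if w.startswith(term)})

def rank_snippets(snippets, dictionary):
    """Keep snippets matching at least one term, ordered by number of matched terms."""
    scored = [(len(matched_terms(s, dictionary)), s) for s in snippets]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [s for score, s in ranked if score > 0]

# Made-up snippets for illustration only.
snippets = [
    "Toucans swallow whole fruits and disperse the seeds.",
    "The plot was surveyed in 2015.",
    "Birds were seen eating fruits.",
]
print(matched_terms(snippets[0], FRUGIVORY_TERMS))
print(rank_snippets(snippets, FRUGIVORY_TERMS))
```

Prefix matching is a crude stand-in for lemmatisation: it would miss "ate" and "dispersal", for example, so a production filter would more plausibly match on lemmas.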
Published literature in ecology holds a vast amount of information from centuries of research (
In ecology and biodiversity science, computational methods such as machine learning algorithms have slowly integrated into research frameworks when compared with other disciplines (
The heterogeneity of the terminology used to describe particular biodiversity observations makes it challenging to convert text-based observations automatically into standardised biodiversity data. Currently, there is a lack of standard terminologies to describe particular biodiversity observations. For instance, the term "eat" might match the textual description of many forms of biotic interactions (e.g. predation, frugivory, commensalism). We believe that initiatives such as BOM can benefit from future work that promotes the standardisation of terms via ontologies and controlled vocabularies. Furthermore, the set of biodiversity dictionaries could be expanded to match observations of natural history (e.g. dispersal distances, habitat preferences), biotic interactions (e.g. parasitism) or species functional traits (e.g. leaf area, flower phenology, body mass, wing length, mandible type, lifetime reproductive output) (
The target audience for this web application includes ecologists and biodiversity scientists at all career stages. Additionally, this application invites developers (ecologists or not) to suggest ideas for improvement. We are open to discussing additional ideas or new tools to expand the current functionalities of this web application.
Biodiversity Observations Miner was written in R (
Biodiversity Observations Miner makes use of R packages designed for text mining and of base R functions. The taxize package is used to establish the API connection to the Global Names Recognition and Discovery (GNRD) tool. taxize is also used to obtain the text of the PDFs via Optical Character Recognition (OCR), which is performed by GNA using the Google Tesseract tool. The stringr package is used for string manipulation. Details on the code and custom functions written for this application can be found in the GitHub repository of this application. In addition, BOM requires the following R packages to run locally: shiny, shinydashboard, stringi, stringr, taxize, reshape, udpipe, tibble, DT.
Biodiversity Observations Miner uses tools from GNA (Global Names Architecture) implemented in the taxize package. Thanks to Scott Chamberlain for modifications to the scrapenames function in taxize so that it returns the OCR content of PDF files (https://github.com/ropensci/taxize/issues/614). Credit goes to the developers of the individual packages on which Biodiversity Observations Miner depends. Terms composing the pollination biodiversity dictionary were selected in collaboration with Joan Casanelles. Tomas Medina provided grammatical corrections and feedback on the first draft of the text.
GM developed Biodiversity Observations Miner with guidance, comments and input from WDK and EvL. GM wrote the first draft of the manuscript and WDK and EvL provided input. Terms composing the frugivory interactions dictionary were discussed between GM and WDK.
Biodiversity Observation User's manual. Follow this guide to upload literature and mine biodiversity observations using BOM.