Biodiversity Data Journal :
Research Article
Corresponding author: Nubia Marques (nubia.marques@pq.itv.org)
Academic editor: Paulo Borges
Received: 02 Aug 2024 | Accepted: 04 Sep 2024 | Published: 20 Sep 2024
© 2024 Nubia Marques, Carla Danielle de Melo Soares, Daniel de Melo Casali, Erick Guimarães, Fernanda Fava, João Marcelo da Silva Abreu, Ligiane Moras, Letícia Gomes da Silva, Raphael Matias, Rafael Leandro de Assis, Rafael Fraga, Sara Almeida, Vanessa Lopes, Verônica Oliveira, Rafaela Missagia, Eduardo Carvalho, Nikolas Carneiro, Ronnie Alves, Pedro Souza-Filho, Guilherme Oliveira, Margarida Miranda, Valéria da Cunha Tavares
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Marques N, Soares CDdeM, Casali DdeM, Guimarães E, Fava F, Abreu JMdaS, Moras L, Silva LGda, Matias R, Assis RLde, Fraga R, Almeida S, Lopes V, Oliveira V, Missagia R, Carvalho E, Carneiro N, Alves R, Souza-Filho P, Oliveira G, Miranda M, Tavares VdaC (2024) Retrieving biodiversity data from multiple sources: making secondary data standardised and accessible. Biodiversity Data Journal 12: e133775. https://doi.org/10.3897/BDJ.12.e133775
Biodiversity data, particularly species occurrence and abundance, are indispensable for testing empirical hypotheses in the natural sciences. However, datasets built for research programmes often do not meet the FAIR (findable, accessible, interoperable and reusable) principles, which raises questions about data quality, accuracy and availability. The 21st century has markedly been a new era for data science and analytics, and efforts to aggregate, standardise, filter and share biodiversity data from multiple sources have become increasingly necessary. In this study, we propose a framework for refining secondary biodiversity data and conforming them to FAIR standards, so that they are available for uses such as macroecological modelling and other studies. We relied on a Darwin Core base model to standardise the data and to facilitate their curation and validation, including the occurrence and abundance of multiple taxa of a region that encompasses estuarine ecosystems in an ecotonal area bordering easternmost Amazonia. We further discuss the significance of feeding standardised public data repositories to advance scientific progress and highlight their role in contributing to biodiversity management and conservation.
Darwin Core standard, FAIR data, Golfão Maranhense, secondary data
High-quality, openly available biodiversity datasets (e.g. species occurrence, abundance, traits) are indispensable for the monitoring of species and ecosystems and to improve the development of conservation and management policies (
The availability of biodiversity data is influenced by several factors, including geographic region, scientific interest and resource availability (e.g. financial and infrastructure constraints), which can affect the quality and type of data produced (
Finally, data must be shared and archived to ensure findability, accessibility and reusability. Data sharing can occur in a variety of ways, ranging from private sharing on request to depositing data on a public platform. Often, authors make their data available as supplementary material in scientific publications and post datasets on public websites. Sharing biodiversity data is essential for ecological research, conservation and management, education and policy decision-making (
Research programmes from megadiverse regions, such as the Neotropics, face difficulties in retrieving, organising and providing quality data due to their intrinsic complexity, which includes unrecognised species and unresolved taxa complexes and taxonomy. To address these challenges and to improve the reusability of secondary biodiversity data, our study had three main objectives:
Retrieving biodiversity data is not an easy task, as the use of systematic literature searches alone does not guarantee the quality of the data. Additionally, commonly used pipelines for retrieving, standardising and making secondary data available typically overlook grey literature, despite its potential for biodiversity studies. The proposed pipeline is a combination of data retrieval and data management tools that are typically used separately, such as systematic review and the Darwin Core Standard. Following this Biodiversity Data Retrieval Pipeline ensures that secondary data are cleaned, normalised, shared and archived according to the FAIR Data Principles. Additionally, we discuss the challenges of efficient data retrieval, the potential reuse of secondary data in future studies and its limitations. Finally, we predict that initiatives to collect biodiversity data and make it available for reuse can improve knowledge and advance conservation efforts to protect the species, communities and ecosystems of these regions.
The Biodiversity Data Retrieval Pipeline was built following four stages (Fig.
Step-by-step guide for the proposed Biodiversity Data Retrieval Pipeline to retrieve secondary biodiversity data from various sources (e.g. scientific articles, technical reports, theses, dissertations, databases) according to the FAIR principles (findable, accessible, interoperable and reusable).
To test the pipeline, we selected the Golfão Maranhense, a region located in the extreme north of the Amazon (Brazil), due to its richness, ecological diversity and importance as an ecotonal mosaic between the Amazon Forest and the dry ecosystems of eastern South America and because of the scarcity of knowledge about the biodiversity in the region. Although reports on biodiversity in the region exist, they are presented in heterogeneous forms, including scientific articles and non-peer-reviewed technical reports, making it difficult to understand the true distribution of biodiversity richness in the region.
The study was conducted in the Golfão Maranhense (Maranhão State, Brazil) including 13 municipalities in the surroundings (Fig.
The first step in retrieving secondary data is to find the data, for which a systematic review of the literature is recommended. We conducted a systematic review by searching the platforms Science Direct and Google Scholar and public data repositories such as GBIF, VertNet, Wikiaves and SpeciesLink. We included all the works found, covering both published literature (i.e. scientific articles and books) and unpublished literature (i.e. theses, dissertations and environmental consultancy reports focused on licensing). The searches were carried out over two months (June and July 2021) in Brazil. To ensure transparency, completeness and consistency, we followed the "Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)" guidelines. The PRISMA framework, with its checklist and flow diagram, facilitates reader comprehension and allows for the assessment of the reliability and validity of the findings.
We followed four steps:
To ensure that the data have the lowest possible error rate, they need to go through a validation process. This process reduces the chance that the final data will contain typographical errors, which can make the data difficult to interpret, or records of species that have been incorrectly identified.
We conducted a manual validation process for the Golfão Maranhense data that we optimised in two steps:
1. Identifying and fixing errors - We conducted a thorough examination of the data to identify and correct any errors or inaccuracies.
These errors included:
(a) Scientific names (e.g. “Boana ranipcs” [wrong] vs. “Boana raniceps” [correct]);
(b) Geographic coordinates (e.g. “44,321605 [typo without negative sign]” vs. “-44,321605 [correct]”);
(c) Date: we standardised the sampling dates to "Start Month" "Start Year" "End Month" and "End Year" (the original data column, verbatim date, contains the day of sampling, if available).
Additionally, our data cleaning process involved removing duplicates and standardising entries to ensure consistency. By meticulously correcting these issues, we ensured the data's integrity and reliability, making it suitable for further analysis and interpretation.
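The validation described above was carried out manually. Purely as an illustration, an automated first-pass screen for the same kinds of errors could be sketched in Python as follows. The reference name list, the longitude window and the record keys are assumptions made for this sketch and are not part of the study; any suggestion produced this way would still need confirmation by a specialist.

```python
import difflib

# Hypothetical reference list; a real screen would use an authoritative
# taxonomic checklist for each group.
REFERENCE_NAMES = ["Boana raniceps", "Boana punctata", "Leptodactylus fuscus"]


def suggest_name(name, reference=REFERENCE_NAMES, cutoff=0.8):
    """Return the closest reference name, or None if nothing is similar enough."""
    if name in reference:
        return name
    matches = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_coordinate(text):
    """Convert a decimal-comma string such as '-44,321605' to a float."""
    return float(text.strip().replace(",", "."))


def longitude_needs_review(lon, west=-50.0, east=-40.0):
    """Flag longitudes outside an assumed west/east window around the study area."""
    return not (west <= lon <= east)


def drop_duplicates(records, keys=("scientificName", "decimalLatitude",
                                   "decimalLongitude", "verbatimEventDate")):
    """Keep the first record for each combination of the key fields."""
    seen, unique = set(), []
    for rec in records:
        fingerprint = tuple(rec.get(k) for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique
```

Here `suggest_name("Boana ranipcs")` proposes "Boana raniceps", and `longitude_needs_review(parse_coordinate("44,321605"))` flags the coordinate that lost its negative sign. The sketch only flags values for review rather than correcting them silently, so the original verbatim values can be kept.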
2. Checking the records - Species occurrence data are susceptible to misidentification and taxonomic inconsistencies, making this a challenging and dynamic task. To ensure the reliability and validity of the species occurrence data, we reviewed the relevant literature for known geographic species distributions and compared them with the collected points. Our team of taxa specialists meticulously checked each entry for inconsistencies and up-to-date taxonomy, according to the most recent accepted taxonomy of each group. Any mismatches between known and collected geographic distributions served as a first alert, indicating the need for further investigation. Additionally, we reviewed the literature for changes in synonymy and updated the occurrence records accordingly.
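The comparison between known and collected distributions was made by specialists using the literature. As a hedged illustration of how a "first alert" of this kind could be raised automatically, the sketch below compares each record with a bounding box per taxon. The taxon name and the box values are placeholders invented for the example, not real distributions, and a bounding box is a much coarser test than an expert range assessment.

```python
# Hypothetical reference ranges stored as bounding boxes:
# (min_latitude, max_latitude, min_longitude, max_longitude).
# The values are placeholders, not real distributions.
KNOWN_RANGES = {
    "Example species A": (-10.0, 0.0, -50.0, -40.0),
}


def check_against_range(record, ranges=KNOWN_RANGES):
    """Return 'ok', 'review' or 'no reference' for one occurrence record."""
    box = ranges.get(record["scientificName"])
    if box is None:
        return "no reference"
    min_lat, max_lat, min_lon, max_lon = box
    inside = (min_lat <= record["decimalLatitude"] <= max_lat
              and min_lon <= record["decimalLongitude"] <= max_lon)
    return "ok" if inside else "review"


records = [
    {"scientificName": "Example species A",
     "decimalLatitude": -2.5, "decimalLongitude": -44.3},
    {"scientificName": "Example species A",
     "decimalLatitude": -2.5, "decimalLongitude": 44.3},
    {"scientificName": "Example species B",
     "decimalLatitude": -2.5, "decimalLongitude": -44.3},
]
statuses = [check_against_range(r) for r in records]
```

Records marked "review" are candidates for further investigation, and records marked "no reference" show where a reference distribution still has to be compiled.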
To standardise data from the Golfão Maranhense region, we used the Darwin Core standard (DwC) (Wieczorek et al. 2012). In addition, we manually added columns for data that are not covered by the DwC (e.g. conservation status). DwC is one of the most widely used standards for biodiversity data and serves as a common language for sharing such data that can be understood by human users and interpreted by computational systems. It provides a straightforward, stable standard that simplifies the process of publishing biodiversity data, promoting the sharing, use and reuse of openly accessible biodiversity data (
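To make the standardisation step concrete, the sketch below renames the columns of one extracted row to Darwin Core terms and keeps an extra column that DwC does not cover. The source column headers and the custom column name `conservationStatus` are assumptions for illustration; the DwC terms used (`scientificName`, `decimalLatitude`, `decimalLongitude`, `verbatimEventDate`, `locality`, `individualCount`) are standard term names, but this is not the study's actual mapping.

```python
# Hypothetical source headers mapped to Darwin Core term names.
DWC_MAP = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "verbatimEventDate",
    "site": "locality",
    "count": "individualCount",
}

# Columns without a DwC equivalent are kept under a custom name (assumed here).
EXTRA_COLUMNS = {"conservation_status": "conservationStatus"}


def to_dwc(row):
    """Rename the keys of one row; return the new row and any unmapped keys."""
    standardised, unmapped = {}, []
    for key, value in row.items():
        if key in DWC_MAP:
            standardised[DWC_MAP[key]] = value
        elif key in EXTRA_COLUMNS:
            standardised[EXTRA_COLUMNS[key]] = value
        else:
            unmapped.append(key)
    return standardised, unmapped


row = {"species": "Boana raniceps", "lat": "-2,5", "lon": "-44,321605",
       "conservation_status": "LC", "notes": "from technical report"}
dwc_row, leftovers = to_dwc(row)
```

Returning the unmapped keys separately, instead of dropping them, makes it easy to see which variables still need a term or a documented custom column.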
The last step is to choose the right repository to store the data. For species occurrence data, GBIF (
- Integrated Digitized Biocollections (
-
-
- Open Science Framework (
To test whether the number of occurrences depended on the number of taxa in each group, a simple linear regression was performed using R software.
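The analysis itself was run in R. As a language-neutral illustration of what a simple linear regression computes, the following Python sketch fits an ordinary least-squares line and reports the coefficient of determination. The function name and the toy values are assumptions for this example and are not data from the study; the p-value is not computed here, since that requires a t-distribution that the standard library does not provide.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Toy values only: x would be the number of taxa per group and
# y the number of occurrences per group.
slope, intercept, r_squared = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```

In this setting, each biotic group contributes one point, with the number of taxa as the predictor and the number of occurrences as the response.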
Considering all biotic groups, a total of 161 bibliographical references, including papers and technical reports, were included in the systematic review of the literature (Fig.
Flowchart of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) for all groups showing the process of selecting studies throughout the systematic review. The selection process includes three stages: (1) identifying the database and choosing the papers; (2) scanning the references and selecting the papers to be included; (3) including the selected papers.
A total of 2,070 occurrence events were obtained from bibliographic references and 43,947 from public repositories (n = 46,017), representing 3,871 taxa. These include birds (Aves, 458 species; three at other taxonomic levels), amphibians (Amphibia, 55 species; nine to the genus level), reptiles (two Crocodylia; 86 Squamata; 11 Testudines), mammals (Class Mammalia; 101 species; 21 to the genus level), fish (268 species; 74 at other taxonomic levels), phytoplankton (370 species; 105 at other taxonomic levels), benthos (188 species; 204 at other taxonomic levels) and plants (1,624 species; 292 at other taxonomic levels) (Suppl. material
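The totals reported above can be reconciled with the per-group counts by a short tally. The sketch below uses only the figures given in the text; the dictionary layout and variable names are assumptions for the example.

```python
# Taxa per group as reported: species plus records at other taxonomic levels.
TAXA_PER_GROUP = {
    "birds": 458 + 3,
    "amphibians": 55 + 9,
    "reptiles": 2 + 86 + 11,   # Crocodylia + Squamata + Testudines
    "mammals": 101 + 21,
    "fish": 268 + 74,
    "phytoplankton": 370 + 105,
    "benthos": 188 + 204,
    "plants": 1624 + 292,
}

# Occurrence events by source type, as reported.
OCCURRENCES_BY_SOURCE = {
    "bibliographic references": 2070,
    "public repositories": 43947,
}

total_taxa = sum(TAXA_PER_GROUP.values())
total_occurrences = sum(OCCURRENCES_BY_SOURCE.values())
```

The per-group counts add up to the 3,871 taxa and the two sources to the 46,017 occurrence events stated in the text.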
Data were carefully analysed by specialists in each group to check for inconsistencies in identification and spelling and, as far as possible, to assess whether the identifications were plausible (e.g. by checking whether the geographic locations were within the known geographic distribution of each taxon and by checking vouchers when possible). A total of 93 occurrence events were deleted, including 92 from taxa that were not correctly identified (76 birds and 16 mammals) and one bird specimen that was a victim of animal trafficking.
Of the 76 variables extracted from the studies, 59 were standardised using DwC terms and 17 were adapted due to the lack of appropriate terms for these variables within the current DwC models (Suppl. material
We decided not to publish the database in a specialised database such as GBIF because it contains secondary data that includes information extracted from other databases, including GBIF itself, which would result in duplication of information. Instead, we decided to store the data in the Open Science Framework (OSF). OSF is an open access repository that maintains version control, allowing users to track changes to their projects over time. The OSF also assigns Digital Object Identifiers (DOIs), making the data citable and ensuring their long-term preservation. The database is publicly available at the link (
The number of occurrences was dependent on the number of taxa in each sampled group (R² = 0.47, p = 0.03). While amphibians and non-bird reptiles were represented by low numbers of both taxa and occurrences, plants, birds and phytoplankton were highly represented for both occurrences and richness. On the other hand, the group “benthos” had a high number of occurrences and a low number of taxa (Fig.
We proposed a workflow to improve our ability to recover higher quality biodiversity data using secondary data sources. We were able to extract a large amount of information about the biodiversity of the Golfão Maranhense and transform this unrelated data into organised and re-usable data. This systematic approach ensured data accuracy and reliability, facilitating the potential reuse of information in future studies. A further step that we have begun to take for some groups is the systematic survey of museum collections and analyses, focusing on relevant questions that we have identified along the way (e.g. general patterns of occurrence of migratory birds, sampling biases and gaps for many groups etc.).
Researchers can use existing datasets, such as those obtained through our biodiversity data retrieval method, to conduct a wide range of studies to advance scientific research (
Long-term monitoring datasets can help to understand patterns and changes in ecological variables over time (
While secondary data can be a valuable resource for scientific research, it is crucial to recognise and address its limitations and ideally estimate the errors within. Common challenges include species identification accuracy, geographic coordinate precision and data entry errors. In addition, datasets from different studies may differ in their sampling methods, data structure and definitions of key variables, making direct comparisons difficult. Finally, some datasets may not be openly accessible, which has implications for data availability and complicates data access and sharing policies.
Other limitations are the sampling and temporal biases, which can arise when working with secondary data, making data interpretation more challenging. Sampling bias occurs when the data sampling disproportionately favours certain species or areas over others, for example, the concentration of specimen records in more easily accessible sites, such as major cities, roads and navigable rivers (
In our study, sampling bias is evident in the São Marcos Bay area, where an industrial ship port is located. Most of the data for this area were obtained from environmental monitoring reports linked to the environmental licensing process. These reports, produced for a port area, inherently prioritise the species and ecological aspects most relevant to the licensing process, overlooking other important components of biodiversity. Within our database, it becomes apparent that some species records originate from technical reports that are not easily available. For example, we found 365 species and varieties of phytoplankton in technical reports, but 101 were not previously catalogued on the Brazilian Biodiversity Platform REFLORA (
The workflow that we employed has facilitated the retrieval of biodiversity data from the ecologically rich and megadiverse Golfão Maranhense region in Maranhão, Brazil. By combining a systematic review approach with standardised worksheets built on a Darwin Core base, we were able to effectively search and explore a wide range of scientific articles, technical reports and specialised public repositories. The potential of secondary data for the advancement of scientific research is significant, although such data must be used with care and analysed with due regard to the biases, limitations and filters involved. Many technical survey reports were produced in the Golfão Maranhense, linked to the environmental licensing processes for the port and surrounding activities. By using existing datasets, researchers can carry out a wide range of activities, which include meta-analyses, comparative studies, ecological modelling and, most of all, building hypotheses and producing experimental designs to monitor diversity on a standardised basis. Our study highlights the value of systematic review methods and the need for an approach to address data limitations and biases. Likewise, this method can facilitate collaboration amongst researchers, enable comparative analyses across different datasets and support evidence-based conservation strategies and policy-making.
We would like to thank the Environmental Management of the Ponta da Madeira Maritime Terminal for their support in developing the project. We are grateful to the reviewer Pedro Cardoso for his suggestions for improving the manuscript and to the collectors of the data we retrieved from the literature. Daniel M. Casali is currently being funded by the grant #2022/00044-7, São Paulo Research Foundation (FAPESP)
Vale Institute of Technology
Keywords used in the systematic review of each biotic group.
Table containing the Darwin Core (DwC) standard terms that were used to make the table and extract the information from the bibliographic references previously selected in the systematic review. Label = name of the column in the DwC standard; Definition = Brief definition of what each column means.
Flowchart of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) separated by groups showing the process of selecting studies throughout the systematic review. The selection process includes three stages: (1) identifying the database and choosing the papers; (2) scanning the references and selecting the papers to be included; (3) including the selected papers.
List of species from the Golfão Maranhense (Maranhão State, Brazil) that were retrieved through the systematic literature review.