Biodiversidata: A novel dataset for the vascular plant species diversity in Uruguay

Abstract
Background
South America hosts some of the world’s most prominent biodiversity hotspots. Yet, Uruguay – a country where multiple major ecosystems converge – ranks amongst the countries with the lowest levels of available digital biodiversity data in the continent. Such prevalent data scarcity has significantly undermined our ability to progress towards evidence-based conservation actions – a critical limitation for a country with a strong focus on agricultural industries and only 1.3% of the land surface guarded by protected areas. Under today’s rapid biodiversity loss and environmental changes, the need for open-access biodiversity data is more pressing than ever before. To address this national issue, Biodiversidata – Uruguay’s first Consortium of Biodiversity Data – has recently emerged with the aim of assembling a constantly growing database for the biodiversity of this country. While the first phase of the project targeted vertebrate biodiversity, the second phase presented in this paper spans the biodiversity of plants.
New information
As part of the second phase of the Biodiversidata initiative, we present the first comprehensive open-access species-level database of the vascular plant diversity recorded in Uruguay to date (i.e. all species for which data are currently available and species presence has been confirmed). It contains 12,470 occurrence records from across 1,648 species and 160 families, which roughly represents 60% of the total recorded flora of Uruguay. The primary biodiversity data include extant native and introduced species from the lycophytes, ferns, gymnosperms and angiosperms groups. Records were collated from multiple sources, including data available in peer-reviewed scientific literature, institutional scientific collections and datasets contributed by members of the Biodiversidata initiative. The complete database can be accessed at the Zenodo repository: doi.org/10.5281/zenodo.3954406


Introduction
South America stands out as one of the planet's regions with the highest levels of species richness and endemism (Myers 2003, Myers et al. 2000). Within this subcontinent, Uruguay occupies an area characterised by a high floristic diversity in one of the most extensive temperate grasslands on the globe (Soriano et al. 1992, Andrade et al. 2018). This country embodies a transitional territory where multiple floristic elements of diverse origin converge (Chebataroff 1942, Chebataroff 1960, Brussa and Grela 2007, Grela 2004). However, the boundaries and ecotones of these phytogeographic regions have remained under sustained debate, fundamentally because of the lack of comprehensive and accessible databases on the diversity and distribution of species from Uruguay. While recent efforts have endeavoured to resolve this limitation for vertebrates (Grattarola et al. 2019a), data on the country's plant biodiversity remain limited and scattered across a range of small-scale databases, the majority of which are inaccessible. Our work fills this gap by presenting such a comprehensive database for the plant biodiversity of Uruguay.
The compilation of georeferenced plant data is a relatively recent practice in Uruguay (Brazeiro et al. 2008, Grela 2004, Haretche et al. 2012). In all cases, maps have been created from lists of species found within individual grid-cells of 660 km² size, without specific information about their actual locations or when the species were observed or collected. This method inherently entails a serious loss of spatial and temporal information (i.e. data leakage, Peterson et al. 2018). Yet, until today, it has been the standard approach to build national species distribution databases. It is imperative for Uruguay to shift its biodiversity information management towards a widespread approach, based on the digitisation of specimen and literature records and on open access to the available databases. Under accelerating scenarios of human-induced alterations of the global climate and natural landscapes (Butchart et al. 2010, Lovejoy and Hannah 2019, Parmesan 2006), with more than 36.2% of the Uruguayan territory already modified (Brazeiro et al. 2020, Cespedes-Payret et al. 2009), the least developed network of protected areas in the region (Soutullo and Gudynas 2006) and the lowest levels of digitally-available biodiversity data of Latin America (Grattarola et al. 2019a), developing scientific databases in Uruguay is a critical need.

The Biodiversidata Initiative
Biodiversidata, Uruguay's first Consortium of Biodiversity Data (https://biodiversidata.org/), has recently emerged with the aim of assembling a constantly growing, open-access database for Uruguay's biodiversity (Grattarola et al. 2019a). The range of beneficiaries of the biodiversity data resources that this initiative provides is wide, including individuals and institutions from the scientific, educational and governmental sectors. Biodiversidata brings together biodiversity experts, with the aim of collating a comprehensive database spanning all components of the country's biodiversity (Grattarola et al. 2019a). This aim is being achieved by overcoming the main obstacles detected in the process of data-sharing (Grattarola and Pincheira-Donoso 2019), focusing on data digitisation, curation and standardisation, as well as on the use of the data to collaboratively address questions of global conceptual interest. The data collated for the tetrapod vertebrates have shown that most of the country remains neglected by scientific efforts, while a few areas have historically been consistently sampled. In this second stage, we focused on plants. The sampling bias appears to mirror that described for tetrapods (Grattarola et al. 2019a). Here, we present the first comprehensive open-access database of vascular plant species of Uruguay, including all species for which data are currently available. The total number of records collated is 12,470, which includes 1,648 species out of the 3,000 species that have been reported for this country (including records to be confirmed) (Marchesi et al. 2013, Zuloaga et al. 2019) (Table 1). Combined with the first phase that presented a database for tetrapods (Grattarola et al. 2019b), this current expansion of Biodiversidata to plant biodiversity provides an unprecedented resource, anticipated to have a major impact on the development of biodiversity studies and management in Uruguay.

Sampling methods
Sampling description: The primary data were collated from a range of different sources such as online databases, field guides, reports and primary literature, as well as Biodiversidata members' original field/herbarium records. A complete list of sources for the occurrence records is shown in Table 2. The data from bibliographic references were obtained from searches across more than 30 sources, which were largely heterogeneous in the amount of information available for each record. The information about the source was captured for each record using the 'associatedReferences' Darwin Core term. The data extracted consisted of taxa names, their geographic location and date of the collection/observation event when available, as well as information about collectors and identifiers. In some cases, georeferencing of the point locations was needed and relevant information was captured under the terms 'coordinateUncertaintyInMeters', 'coordinatePrecision' and 'georeferenceRemarks' (see more details in Steps description subsection).
The data from online sources were accessed through GBIF via 'rgbif' (Chamberlain et al. 2020a), using the following search parameters: 1) Tracheophyta as taxon, 2) 'UY' in the country code (= Uruguay), 3) year of collection from 1900 onwards, 4) with geographic coordinates and no spatial issues associated and 5) including data of 'Preserved Specimen' and 'Human Observation' categories. These parameters were chosen to reduce data cleaning time, given the intended use of the data and the limited timeline of the project. Records with 'LivingSpecimen' or 'FossilSpecimen' as the basis of record were filtered out to avoid crop/cultivated and extinct species. Records with 'Unknown' as the basis of record should be checked in the future. Likewise, the lower date limit and the restriction to records with coordinates and no associated spatial issues were chosen to minimise the taxonomic and geospatial uncertainties/errors that would need to be checked. It would be desirable to process the excluded data and include them in Biodiversidata's database in the future. For instance, georeferencing efforts could be implemented to increase the number of records to include (Chapman and Wieczorek 2020, Zermoglio et al. 2020).
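As an illustration only, the five search parameters listed above can be expressed as a query string for the public GBIF occurrence search API. This is a sketch, not the authors' workflow, which used the R package 'rgbif' and a GBIF download. The function name, the upper year bound and the taxon key argument are assumptions; the Tracheophyta backbone key must be looked up and supplied by the caller.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint (the paper used rgbif, not this URL).
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_gbif_query(taxon_key, year_from=1900, year_to=2020):
    """Return a query string mirroring the five filters described in the text.

    taxon_key is the GBIF backbone key for Tracheophyta, supplied by the caller.
    year_to is an assumed upper bound; the text only states "from 1900 onwards".
    """
    params = [
        ("taxonKey", taxon_key),                  # 1) Tracheophyta as taxon
        ("country", "UY"),                        # 2) Uruguay country code
        ("year", f"{year_from},{year_to}"),       # 3) collection year range
        ("hasCoordinate", "true"),                # 4) georeferenced records...
        ("hasGeospatialIssue", "false"),          #    ...with no spatial issues
        ("basisOfRecord", "PRESERVED_SPECIMEN"),  # 5) specimens
        ("basisOfRecord", "HUMAN_OBSERVATION"),   #    and observations
    ]
    return urlencode(params)


if __name__ == "__main__":
    # 12345 is a placeholder key, not the real Tracheophyta key.
    print(f"{GBIF_OCCURRENCE_SEARCH}?{build_gbif_query(12345)}")
```

Listing the two accepted record types explicitly means 'LivingSpecimen', 'FossilSpecimen' and 'Unknown' records never enter the result set.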
A single dataset with 5,138 occurrence records was downloaded, available at: https://doi.org/10.15468/dl.wc2fm7. After the data cleaning and quality check process was performed (see details in Quality control subsection), we kept 3,424 data records. Of those records, 1,787 corresponded to specimens and were contributed to GBIF by 51 different institutions around the world. The major contributor was the Missouri Botanical Garden (28.8% of the 1,787 records), followed by Universidade Federal do Rio Grande do Sul of Brazil (11.8%) and Universidade de São Paulo (6.6%). The 1,637 human observations were mainly derived (99.6%) from the citizen-science platform iNaturalist.
The data provided by Biodiversidata members were curated (e.g. taxonomic names updated, fields standardised) and uploaded to GBIF as four separate datasets, one for each data contributor (see sources in Table 2). These records mostly originated from research project surveys; 77.5% of them are observations and 22.5% have a specimen deposited in national natural history collections, such as the Herbarium of the Museo Nacional de Historia Natural de Uruguay (MVM) and the Museo y Jardín Botánico Prof. Atilio Lombardo (MVJB).
Quality control: For data to be fit for use, they must be accurate, complete, consistent with other sources and provide a proper level of detail (Chapman 2005). To meet these standards, we performed the following steps for all the data (see R scripts and working examples at github.com/bienflorencia/rBiodiversidata):
Step description: We checked for misspellings and format errors, resolved synonymy and completed the higher taxonomic and infraspecific rank terms and the taxonomic authority for the scientific names using the R packages 'taxize' (Chamberlain et al. 2020b) and 'WorldFlora' (Kindt 2020). To check and unify species scientific names, for simplicity we first compared the list of species names against the World Flora Online (WFO) Taxonomic Backbone. For species derived from literature and Biodiversidata members' sources, we checked the verbatim species names against the authority sources and, for GBIF data, we used the 'scientificName' field. If the matched species name was accepted by Zuloaga et al. (2019) in Darwinion, we kept the name and taxon ID of WFO; otherwise, we used the accepted name from Darwinion and searched for a taxon ID in Tropicos. The original species name was kept under the term 'previousIdentification'. Additionally, the term 'establishmentMeans' was added, categorising species as native or introduced (in Spanish: nativa and introducida) according to Andrade et al. (2018). Species with unverified occurrence in the country were excluded. The final species list was checked by the Biodiversidata plant experts.
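The name-resolution rule described above (keep the WFO name and ID when Darwinion accepts the match, otherwise use the Darwinion name with a Tropicos ID) can be sketched as a small function. This is a minimal sketch under stated assumptions: the three lookup tables are hypothetical in-memory stand-ins for WFO, Darwinion and Tropicos, the function name is invented and the example names and IDs are placeholders rather than real taxa. The authors' actual implementation is in their R scripts.

```python
def reconcile_name(verbatim_name, wfo_backbone, darwinion_accepted, tropicos_ids):
    """Apply the decision rule from the text to one verbatim name.

    Hypothetical lookup tables (assumptions, not real service clients):
      wfo_backbone:       verbatim name -> (WFO name, WFO taxon ID)
      darwinion_accepted: name -> name accepted in Darwinion
      tropicos_ids:       accepted name -> Tropicos taxon ID
    Returns None when the name cannot be resolved.
    """
    wfo_name, wfo_id = wfo_backbone.get(verbatim_name, (None, None))
    candidate = wfo_name if wfo_name is not None else verbatim_name
    accepted = darwinion_accepted.get(candidate)

    if wfo_name is not None and accepted == wfo_name:
        # Darwinion accepts the WFO match: keep the WFO name and taxon ID.
        name, taxon_id = wfo_name, wfo_id
    elif accepted is not None:
        # Otherwise use the Darwinion accepted name and a Tropicos taxon ID.
        name, taxon_id = accepted, tropicos_ids.get(accepted)
    else:
        return None  # unresolved name, left for expert review

    return {
        "scientificName": name,
        "taxonID": taxon_id,
        "previousIdentification": verbatim_name,  # original name is preserved
    }


if __name__ == "__main__":
    # Placeholder names and IDs, for illustration only.
    wfo = {"Genus alfa": ("Genus alpha", "WFO-EXAMPLE-1"),
           "Genus betta": ("Genus beta", "WFO-EXAMPLE-2")}
    darwinion = {"Genus alpha": "Genus alpha", "Genus beta": "Othergenus beta"}
    tropicos = {"Othergenus beta": "TROPICOS-EXAMPLE-9"}
    for verbatim in ("Genus alfa", "Genus betta", "Unmatched name"):
        print(verbatim, "->", reconcile_name(verbatim, wfo, darwinion, tropicos))
```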
We checked the accuracy of dates and completed the 'eventDate' term with the format YYYY-MM-DD (e.g. 2020-02-10 for 10 February 2020). If only the year was known, 'eventDate' was represented as YYYY and, if only the year and month were known, as YYYY-MM.
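The 'eventDate' convention (full date, year and month, or year only) can be illustrated with a short helper. This is a sketch with an assumed function name, not code from the authors' repository.

```python
from datetime import date


def format_event_date(year, month=None, day=None):
    """Format a possibly partial date as YYYY, YYYY-MM or YYYY-MM-DD."""
    if year is None:
        return None  # no date information available
    if month is None:
        return f"{year:04d}"
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month: {month}")
    if day is None:
        return f"{year:04d}-{month:02d}"
    # date() raises ValueError for impossible dates such as 30 February.
    return date(year, month, day).isoformat()


if __name__ == "__main__":
    print(format_event_date(2020, 2, 10))  # full date
    print(format_event_date(2020, 2))      # year and month only
    print(format_event_date(2020))         # year only
```

Validating through `datetime.date` means a mistyped day or month fails loudly instead of producing a malformed 'eventDate' string.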
We filtered out records occurring outside Uruguay's continental territory and checked for inaccuracy and incompleteness in georeferences. The data accessed via GBIF were filtered by keeping records with coordinate uncertainty values of less than 10 km and discarding those records with country centroid as georeference protocol. This hard filter was applied to reduce processing time and avoid location inaccuracy in subsequent analyses. For the data extracted from literature, when coordinates were missing, we georeferenced point localities from map figures using Google Earth Pro 2020 and marked them as requiring further verification. For the data provided by members of Biodiversidata or collated from literature, when geographic coordinates were presented either as degrees, minutes and seconds or as degrees and decimal minutes, we converted them to decimal degrees, following georeferencing best practices (Chapman and Wieczorek 2020, Zermoglio et al. 2020), including datum, uncertainty, precision, georeferencing protocol and georeferencing date values for all these records. We also included the higher geography terms 'continent' and 'country' and the 'stateProvince' term for all the records in the database through the GeoNames Gazetteer database using the R package 'geonames' (Rowlingson 2019).
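The conversion of coordinates given as degrees, minutes and seconds (or degrees and decimal minutes) to decimal degrees follows a standard formula, sketched below. The function name and the example coordinates are illustrative assumptions; the sketch covers only the arithmetic, not the datum, uncertainty and protocol metadata that the best-practice guides also require.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    Degrees and decimal minutes are handled by passing a fractional
    'minutes' value and leaving 'seconds' at zero.
    South and West hemispheres yield negative values.
    """
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


if __name__ == "__main__":
    # Illustrative coordinates, not taken from the database.
    print(dms_to_decimal(34, 54, 0, "S"))   # degrees, minutes, seconds
    print(dms_to_decimal(56, 9.6, 0, "W"))  # degrees and decimal minutes
```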
Finally, we generated a unique 'occurrenceID' for every record in our database, except the data accessed from GBIF for which we kept the original ID.
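One way to implement this step is sketched below. The text does not state how the identifiers were built, so the use of random UUIDs is an assumption, as are the function name and the record layout (a list of dictionaries keyed by Darwin Core terms).

```python
import uuid


def assign_occurrence_ids(records):
    """Return copies of the records, each with a unique 'occurrenceID'.

    Records that already carry an ID (e.g. those accessed from GBIF) keep it;
    all others receive a random UUID, which is an assumed ID scheme.
    """
    seen = set()
    result = []
    for record in records:
        updated = dict(record)  # leave the input records untouched
        if not updated.get("occurrenceID"):
            updated["occurrenceID"] = str(uuid.uuid4())
        if updated["occurrenceID"] in seen:
            raise ValueError(f"duplicate occurrenceID: {updated['occurrenceID']}")
        seen.add(updated["occurrenceID"])
        result.append(updated)
    return result


if __name__ == "__main__":
    sample = [
        {"occurrenceID": "gbif-example-123"},  # placeholder GBIF identifier
        {"scientificName": "Genus alpha"},     # placeholder literature record
    ]
    for row in assign_occurrence_ids(sample):
        print(row)
```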

Geographic coverage
Description: The database covers extant species of vascular plants reported for locations within the borders of Uruguay. The occurrence records are spatially biased (Fig. 1a), as larger numbers of records are concentrated in areas around the borders of the country, whilst the central regions of the territory have lower levels of sampling. The most sampled area of Uruguay is in Cerro Largo (central-eastern part of the country, at the frontier with Brazil), followed by the surroundings of some cities on the Atlantic coast (Fig. 1b). As previously observed in tetrapods (Grattarola et al. 2019a) and woody flora (Haretche et al. 2012), some areas of the country remain systematically neglected. It is currently unclear whether these disparities in sampling are due to the lack of explorations, the lack of digitisation or georeferencing of existing occurrences (e.g. GBIF records discarded in the preparation of the database), the high taxonomic complexity of some vascular plant families, which prevents records from reaching species level, or a combination of all the above. Additionally, the lack of explorations could have multiple causes: either the difficulty of accessing certain areas (see in Fig. 1c the distribution of urban areas, main routes and rivers) or the preference of botanists for certain landscapes over others (Haretche et al. 2012). Maps were created in R and figures prepared using ArcGIS 10.5. Sampling effort was evaluated as the number of records in each cell (see scripts in github.com/bienflorencia/rBiodiversidata for a working example). For the Biodiversidata project, Uruguay's territory is divided into grid-cells of three different sizes: 50 x 50, 25 x 25 and 12.5 x 12.5 km; here we present sampling effort values with the mid-size unit of 25 km.
Figure 1. Distribution in Uruguay of (a) the total 12,470 occurrence records of vascular plants in Biodiversidata, (b) sampling effort with 25 × 25 km grid-cell resolution (the mid-size resolution used for Biodiversidata's analyses) and (c) urban areas (orange dots with size relative to surface in km²), routes (international, primary and secondary) and main rivers. Projection WGS1984.
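Sampling effort, defined above as the number of records in each grid cell, can be sketched as a simple binning operation. This minimal sketch assumes that record coordinates have already been projected to a metric coordinate system (metres), which is not what the database stores; the function name and the grid origin are also assumptions. The authors' working example is in their R repository.

```python
import math
from collections import Counter


def sampling_effort(points_m, cell_size_m=25_000, origin=(0.0, 0.0)):
    """Count records per square grid cell.

    points_m:    iterable of (x, y) coordinates in metres (assumed projected)
    cell_size_m: cell side length; 25 km is the mid-size unit from the text
    origin:      lower-left corner of cell (0, 0), an assumed reference point
    Returns a Counter mapping (column, row) cell indices to record counts.
    """
    counts = Counter()
    for x, y in points_m:
        col = math.floor((x - origin[0]) / cell_size_m)
        row = math.floor((y - origin[1]) / cell_size_m)
        counts[(col, row)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative coordinates in metres, not taken from the database.
    records = [(1_000, 1_000), (24_999, 0), (25_000, 0), (60_000, 30_000)]
    for cell, n in sorted(sampling_effort(records).items()):
        print(cell, n)
```

Changing `cell_size_m` to 50,000 or 12,500 gives the other two resolutions mentioned in the text.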

Taxonomic coverage
Description: The database includes 1,362 native species, 271 introduced and 15 species of yet unknown establishment means. According to Andrade et al. (2018), in Uruguay there are 167 families of vascular plants, comprising 2,431 native species. Therefore, our database covers 56.0% of the native species and 95.8% of the families that have been recorded in the country. The taxonomic coverage amongst groups is uneven (Fig. 2), largely reflecting the greater species richness of some taxonomic groups relative to others. See on top of the bars in Fig. 2

Notes:
The records included in the database cover samples reported in Uruguay during the period 1877-2020 (Fig. 3). A large proportion of the records (89.2%) has information about the date of collection/observation. We observed that occurrence records have been collected irregularly within groups: some families, such as Poaceae and Piperaceae, are represented over longer time periods, yet most were collected exclusively or more intensely in the last 20-30 years.

Additional information
Biodiversidata is a collaborative association of experts with the aim of assembling a constantly-growing database for Uruguay's biodiversity. The initiative was launched in 2018 under the direction of Florencia Grattarola as part of her PhD project at the University of Lincoln in partnership with the MacroBiodiversity Lab at Queen's University Belfast (UK), led by Daniel Pincheira-Donoso. Its open-access platform (https://biodiversidata.org/) aims to make the biodiversity data of Uruguay openly available by integrating a broad range of resources including databases, publications, maps, reports and infographics, derived from the work of the team of expert scientific members. Current funds for developing Biodiversidata are conditional upon Grattarola's PhD project concluding in December 2020.
The database presented in this study will continue to be improved and updated periodically with new records (yearly updates are expected); check the Zenodo repository for the latest version: doi.org/10.5281/zenodo.3954406