Biodiversity Data Journal :
Research Article
Corresponding author: A. Townsend Peterson (town@ku.edu)
Academic editor: Vincent Smith
Received: 21 May 2018 | Accepted: 17 Oct 2018 | Published: 07 Nov 2018
© 2018 A. Townsend Peterson, Alex Asase, Dora Canhos, Sidnei de Souza, John Wieczorek
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Townsend Peterson A, Asase A, Canhos D, de Souza S, Wieczorek J (2018) Data Leakage and Loss in Biodiversity Informatics. Biodiversity Data Journal 6: e26826. https://doi.org/10.3897/BDJ.6.e26826
The field of biodiversity informatics is in a massive, “grow-out” phase of creating and enabling large-scale biodiversity data resources. Because perhaps 90% of existing biodiversity data nonetheless remains unavailable for science and policy applications, the question arises as to how these existing and available data records can be mobilized most efficiently and effectively. This situation led to our analysis of several large-scale biodiversity datasets regarding birds and plants, detecting information gaps and documenting data “leakage” or attrition, in terms of data on taxon, time, and place, in each data record. We documented significant data leakage in each data dimension in each dataset. That is, significant numbers of data records are lacking crucial information in terms of taxon, time, and/or place; information on place was consistently the least complete, such that geographic referencing presently represents the most significant factor in degradation of usability of information from biodiversity information resources. Although the full process of digital capture, quality control, and enrichment is important to developing a complete digital record of existing biodiversity information, payoffs in terms of immediate data usability will be greatest with attention paid to the georeferencing challenge.
biodiversity data, usability, fitness for use, time, place, taxon, informatics, geographic referencing, georeferencing, digitization
NOTE: responses to longer-form commentaries from reviewers are provided in Suppl. material
Biological diversity is the variety of life on Earth, and provides or sustains, at least in an ultimate sense, all raw materials for human well-being (food, water, shelter). Biodiversity also supports a series of ecosystem services that, although perhaps less tangibly, maintain all natural and human systems (
Primary biodiversity data—i.e., data records that document the occurrence of a particular species at a place at a point in time—represent a central element in the universe of data documenting biodiversity. Primary biodiversity data have many applications, including documenting basic biodiversity patterns (
Still, total numbers of primary biodiversity data records that are openly available as digital accessible knowledge (DAK;
Even with more than a billion biodiversity specimen and observational data records existing and available in digital format (as of 22 July 2018), many of those records are compromised by missing, partial, or incomplete information, such that they are not usable in many science applications. We term this process data leakage, or data attrition, to emphasize how an initially large data resource is reduced massively via a series of seemingly minor factors (this view of leakage contrasts with a more temporal sequence of degradation or loss;
Schematic summarizing the translation between biodiversity and biodiversity data, and how those data “leak,” and get lost and degraded, such that only a small subset is available as usable data for science and policy applications. Note that the particular sequence of steps is not set, and may indeed vary significantly from region to region, taxon to taxon, or source to source.
In this contribution, we explore the dimensions and magnitude of these data leaks. Using a diverse suite of plant and bird collections as examples, we assess numbers of data records for which information on time, place, and taxon is missing or incomplete, distinguishing between data that are simply lacking and those that can be added or rescued. We also explore joint effects that relate directly to two typical uses of such data: place x taxon, for ecological niche modelling and species distribution modelling (
Our analysis sequence is outlined in a protocol file. Briefly, though, we downloaded full institutional datasets for ornithological collections from VertNet (
Each record from each data set was analyzed with respect to time (i.e., in Darwin Core terms, day, month, year, verbatimEventDate), taxon (genus, subgenus, specificEpithet, infraspecificEpithet, taxonRank), and place (country, stateProvince, county, municipality, locality, verbatimLocality, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, coordinatePrecision, verbatimCoordinateSystem, georeferenceProtocol). We evaluated each data record with respect to four categories of completeness and fitness for use: information missing completely (accorded value 0), information partial (value 1), information incomplete but sufficient that it could be “rescued” and brought to completeness (value 2), and information complete and ready for use (value 3). We deemed information “rescuable” when it can be improved or corrected, such as by georeferencing textual geographic information quantitatively, or by correcting a scientific name that is not a standard name. However, we take a somewhat restrictive view of potential for rescue: we do not include as rescuable those specimens that could be reexamined physically to obtain information not in the digital record; rather, we focus on rescue in the sense of the data record per se.
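The four-level scale described above can be sketched in a few lines of Python. This is only an illustration, not the protocol file used in the study: the names `Completeness` and `summarize` are hypothetical, and the example scores are invented.

```python
from collections import Counter
from enum import IntEnum


class Completeness(IntEnum):
    """Four-level completeness scale, using the values given in the text."""
    MISSING = 0    # information missing completely
    PARTIAL = 1    # information partial
    RESCUABLE = 2  # incomplete, but could be brought to completeness
    COMPLETE = 3   # complete and ready for use


def summarize(scores):
    """Return the fraction of records falling in each category.

    `scores` is an iterable of integers 0-3 for one dimension
    (time, taxon, or place) of one dataset.
    """
    counts = Counter(Completeness(s) for s in scores)
    total = sum(counts.values())
    if total == 0:
        return {level: 0.0 for level in Completeness}
    return {level: counts[level] / total for level in Completeness}


if __name__ == "__main__":
    # Invented place scores for eight records, for illustration only.
    place_scores = [3, 2, 2, 0, 1, 2, 3, 2]
    for level, fraction in summarize(place_scores).items():
        print(f"{level.name:<10}{fraction:.1%}")
```

Keeping the categories as an ordered integer scale makes it easy to tabulate proportions per dimension and per dataset, which is the form in which the results are summarized in the figures.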
Data on time were considered to be partial when information on day, month, year, or their equivalent in eventDate was missing; time was considered as rescuable when full information appeared to be present in verbatimEventDate, but was not parsed appropriately into day, month, and year, or eventDate. For taxonomic information, names were considered as missing if no genus- or species-level information existed, partial if identified to genus but not to species, and rescuable if not a name listed in at least one taxonomic authority (ornithological authorities checked included
Data on place were considered as missing when geographic coordinates were lacking and textual geographic descriptions lacked information more precise than state. These data were considered as partial when information was available at the level of county/municipality, but not to the level of a specific locality. Data on place were considered as rescuable when the locality was described fully in textual terms but geographic coordinates were missing, or when geographic coordinates were not completely documented with appropriate metadata (
To provide a broader perspective on these data leaks, beyond single datasets, we included overview information parallel to the information for individual datasets for two major, large-scale biodiversity information networks. Specifically, we assessed the Brazilian Virtual Herbarium (5,547,394 records as of 17 February 2017) and VertNet (19,623,087 records as of 17 February 2017). Queries by the information managers of these two networks (authors on this paper) replicated the single-collection analyses described above, to create broad-scale overviews of information completeness across two massive information portals.
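As a rough illustration of how the time and place rules described above might be applied to a single Darwin Core record, consider the following sketch. It is not the authors' code, and it rests on several simplifying assumptions: records are plain dictionaries keyed by Darwin Core term names; any non-empty `verbatimEventDate` is treated as potentially holding a full date; any non-empty `locality` or `verbatimLocality` is treated as a full textual description; and `coordinateUncertaintyInMeters` stands in for the "appropriate metadata" on coordinates. The function names are hypothetical.

```python
from datetime import date
from enum import IntEnum


class Completeness(IntEnum):
    MISSING = 0
    PARTIAL = 1
    RESCUABLE = 2
    COMPLETE = 3


def has(record, field):
    """True if the field is present and not blank."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def has_full_event_date(record):
    """True if eventDate begins with a full ISO date (YYYY-MM-DD)."""
    if not has(record, "eventDate"):
        return False
    try:
        date.fromisoformat(str(record["eventDate"]).strip()[:10])
        return True
    except ValueError:
        return False


def score_time(record):
    parsed = all(has(record, f) for f in ("year", "month", "day"))
    if parsed or has_full_event_date(record):
        return Completeness.COMPLETE
    # Assumption: a non-empty verbatim date may hold the full date unparsed.
    if has(record, "verbatimEventDate"):
        return Completeness.RESCUABLE
    if any(has(record, f) for f in ("year", "month", "day", "eventDate")):
        return Completeness.PARTIAL
    return Completeness.MISSING


def score_place(record):
    has_coords = has(record, "decimalLatitude") and has(record, "decimalLongitude")
    if has_coords:
        # Assumption: uncertainty is the metadata required for completeness.
        if has(record, "coordinateUncertaintyInMeters"):
            return Completeness.COMPLETE
        return Completeness.RESCUABLE
    # Assumption: any locality text counts as a georeferenceable description.
    if has(record, "locality") or has(record, "verbatimLocality"):
        return Completeness.RESCUABLE
    if has(record, "county") or has(record, "municipality"):
        return Completeness.PARTIAL
    # Nothing more precise than state: place information is missing.
    return Completeness.MISSING


if __name__ == "__main__":
    # Invented record, for illustration only.
    record = {
        "year": "1950",
        "verbatimEventDate": "12 March 1950",
        "stateProvince": "Kansas",
        "county": "Douglas",
        "locality": "3 mi W of Lawrence",
    }
    print(score_time(record).name, score_place(record).name)
```

A real implementation would need stricter checks on the verbatim fields (for example, attempting to parse the date string or judging whether a locality description is specific enough to georeference), which is where most of the judgment in such an assessment lies.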
For all of the data sets described above, data were summarized in terms of usability for time, taxon, and place separately. We also considered two common applications of primary biodiversity data records. First, for ecological niche modeling and species distribution modeling, a researcher requires information on place and taxon (
All data analyzed in this study are freely and openly available via online data resources, particularly from VertNet and GBIF. Specific working datasets are available as Suppl. material
Of the three dimensions of the data that we assessed (time, taxon, and place; Figs
Summary of patterns of completeness and incompleteness of information for 8 herbarium collections, in terms of time, taxon, place, taxon x place, and time x taxon x place. Note that, for lack of a global plant names list that is fully available, we considered rescuable and full taxonomic information together here.
Summary of data leaks in time, place, and taxon information for two major biodiversity informatics initiatives: the Brazilian Virtual Herbarium and VertNet. Note that, for Brazilian Virtual Herbarium, county-level automated georeferencing was included as full georeferencing because it includes information on datum and coordinate uncertainty, although those data records could be georeferenced more finely based on the specific collecting locality. Color scheme follows the key of Figs
We examined data readiness for use in ecological niche modeling and biodiversity inventory analysis (Figs
The analyses presented herein showed that all of the datasets examined suffered some amount of leakage or attrition. That is, for diverse reasons, some information got lost along the way. In some cases, the information loss had occurred at the time of collection of the specimen: i.e., a key data field was not recorded. In such situations, the data record may remain forever without that information. In other cases, however, information loss occurred later, such that some potential exists for rescue and recovery of the information. This potential for rescue with intelligent analysis and hard work is illustrated for the case of date information in a recent analysis (
In cases in which the data record may be incomplete, but the data are rescuable, possibilities exist for rapid improvement of DAK resources. For specimen-based biodiversity records, almost always, the specimen can be reexamined and reassessed, perhaps even using new techniques such as DNA barcoding (
Place information is clearly the dimension in which the greatest need for data rescue exists; that is, biodiversity records almost always hold some spatial information, but the translation of that information into carefully derived and documented geographic coordinates is a complex process (
Indeed, some exploration of place-related data leakage patterns is in order. Of the total of 1,011,708,052 records available via the GBIF data portal as of 22 July 2018, 921,414,317 have geographic coordinates. This total of 91.1% georeferenced is impressive, but is also somewhat deceptive—that is, in the first place, most of those georeferenced records do not include the full metadata to document uncertainty (especially coordinateUncertaintyInMeters), even though this information is crucial to applications such as ecological niche modeling (
A further consideration is the interaction between time and data leakage. That is, the specimen record generally is seen as providing the deepest-time view into biodiversity distributions, yet data leakage certainly is more frequent as the age of the specimen increases, as has been documented in previous analyses (
Finally, dimensions of leakage exist that may not be so important for assuring use of the data record. That is, most uses of biodiversity information focus on time, place, and taxon, so other data fields may be less crucial to use of the data in actual analysis; although still important, data sharing and use do not have to await full checking of the full set of fields, as the need for access to such information is immediate and crucial (
The explorations presented in this paper lead us to a series of insights into how the field of biodiversity informatics can best move forward towards maximizing its information resources. That is, just investing enormous effort may not be the optimal way forward: rather, “smart” effort may yield much greater pay-offs. Analysis of data leakage, as has been illustrated above, offers ways of thinking about these strategies.
If the goal is to maximize the availability of DAK for analysis and interpretation, one can take into account the sequence of information flow and data leakage (Fig.
This insight can guide time investment in biodiversity informatics initiatives. Analyses such as those we have developed identify immediately the limiting dimensions of DAK usability, thereby focusing immediate investments of time and energy. The clearest signal from our analyses is that detailed and well-documented georeferencing is a crucial aspect of biodiversity informatics, although particular situations can and will differ significantly from this generality. Other insights derive from the data flow and leakage analogy: some biodiversity informatics activities, although clearly important, may not pay off in usable information as immediately. For instance, basic digitization is a major emphasis in the field, and is important for collections management, but digitization in an institutional framework that does not foster data sharing will not increase the availability of information for science and policy.
In previous analyses and assessments of biodiversity data in biodiversity information portals around the world, the concept of Digital Accessible Knowledge has been proposed and explored (
Finally, these data leakage phenomena are not in any way unique to specimen-based biodiversity data. Observation-based biodiversity data, which are becoming massively numerous, have their own leaks, such as misidentifications, which create irreparable problems in records; observational data, nonetheless, may not suffer from some of the major leaks that affect specimen data, such as inconsistent taxonomies, given controlled vocabularies in data entry portals. Recent years have seen the assembly of large-scale data resources from heterogeneous sources: e.g., GenBank and GLOBIS-B. These data infrastructures must reconcile different formats and norms, which at times results in some data records being unusable or less useful in particular analyses. As such, data leakage is not unique to biodiversity data, but rather is a general consequence of data sets becoming large.
We thank most fundamentally the biodiversity science community for its large-scale and increasing commitment to sharing openly the important data resources that they have developed over years, decades, and centuries. Luis Osorio Olvera and Ali Khalighifar provided invaluable help with processing large data sets. The idea for this manuscript started at a workshop at Entebbe in Uganda funded by the JRS Biodiversity Foundation, which we thank for its continued support in the area of biodiversity informatics.
All authors contributed to data analysis. ATP drafted the manuscript, which was then edited and commented on by all authors.
The authors declare that they have no conflicts of interest.
This file offers detailed responses to two reviewers' comments, which were presented as very long, multipoint comments on the manuscript.
These data are the summaries of data leaks in each of three dimensions for each of the bird and herbarium datasets that are depicted in Figures 2 and 3.
Comma-delimited ASCII data corresponding to specimens held in a series of museum collections
Comma-delimited ASCII data corresponding to herbarium specimens in several collections