Biodiversity Data Journal :
Methods
Corresponding author: Brandon Kwee Boon Seah (brandon.seah@thuenen.de)
Academic editor: Lyubomir Penev
Received: 12 Oct 2023 | Accepted: 06 Nov 2023 | Published: 24 Nov 2023
© 2023 Brandon Seah
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Seah BKB (2023) Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers. Biodiversity Data Journal 11: e114076. https://doi.org/10.3897/BDJ.11.e114076
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility. At the same time, the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
data curation, biodiversity informatics, data integration
Biodiversity science has seen a proliferation of databases and checklists (
End-users can match taxa either by their names or taxon identifiers. This task is a subset of data reconciliation or data matching (
How can we avoid duplicated effort in data curation? Ideally, users of taxonomic data would share in building and improving community resources, as they are often also the subject-matter experts. Building yet another database is clearly not the answer. Nonetheless, large aggregator projects, such as WoRMS and ITIS, tend to be centrally organised and may not have a formal avenue for user contributions. Wikidata (https://www.wikidata.org/) (
Graphs of database identifiers have been used instead of name-matching to link over a hundred thousand entries in Wikidata with the Global Biotic Interactions Database (GloBI) (
Here, we describe how we match taxon names and identifiers between the Global Biodiversity Information Facility (GBIF) Backbone Taxonomy (
Our aims are twofold: to identify issues commonly encountered during data-matching, in particular the actual impact of homonymy and synonymy on name-matching, and to make concrete suggestions for how to troubleshoot and improve community resources as part of the data-cleaning process, as a form of crowdsourcing.
The three databases model the relationships between taxa and taxon names differently. Taxon names are formally governed by rules of nomenclature, which decide whether they are validly published. However, the circumscription and classification of the taxon concept referred to by a given name can be a matter of legitimate scientific (taxonomic) disagreement. The GBIF Backbone and NCBI Taxonomy each explicitly designate a preferred taxonomy: when more than one name is thought to represent the same taxon, they choose one name as accepted and mark the others as synonyms. In GBIF, each name has a distinct taxon identifier (taxonID) and a taxonomic status (e.g. “accepted”, “synonym”), whereas, in NCBI, records for synonyms are merged into the taxonID of the accepted name and the former taxonIDs of the synonyms are deprecated. In Wikidata, each taxon name is a distinct item, as in GBIF, but there is no preferred taxonomy. Items representing synonyms can be linked by the property “taxon synonym” (P1420), ideally citing a reference where this relationship is asserted.
The dataset (https://doi.org/10.15468/0fxsox) comprises 7209 taxon names of vascular plants from Germany (5876 at species rank) and their associated GBIF taxonIDs which we wished to link to equivalent NCBI Taxonomy taxonIDs. The file was downloaded from GBIF as a “species list”, which lists taxa in a tab-separated text file, containing the taxon name as supplied by the data provider, the taxonID for that name, the “accepted” taxon in the GBIF Backbone Taxonomy to which it was matched when the dataset was imported, its taxonomic status and taxon rank and the names and taxonIDs of the higher taxa to which it belongs (kingdom, phylum etc.).
For reproducibility, we used flatfiles of the latest available versions of the GBIF Backbone Taxonomy (26 Nov 2021) and the NCBI Taxonomy (01 Dec 2022) instead of live online queries, so that the analysis could be pinned to a specific version as these databases are continuously updated. For Wikidata, we directly queried the online API instead of downloading a versioned flatfile, because database dump files are large (23 Jun 2023 version over 136 GB) and contain data on all entities, not just biological taxonomy; queries can also be submitted via the web interface, either in the SPARQL query language or with the interactive query builder and the results exported as a table.
GBIF taxonIDs in the dataset were matched against the GBIF Backbone Taxonomy to filter out records that have been marked as “doubtful” or problematic and to find currently accepted names and taxonIDs within the GBIF Backbone Taxonomy, as the latter may have been updated after the dataset was originally imported. This resulted in a table of taxon names (with authors) and taxonIDs of interest. Only taxa of species rank (5721 names) were retained to simplify the search, as the higher taxa can be derived from the list of species. From the NCBI Taxonomy, scientific names (including authors where available) and taxonIDs at species rank classified to Viridiplantae (NCBI:txid33090) were retrieved, to reduce the number of names to be searched and to avoid hemihomonyms.
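The filtering step described above can be illustrated with a minimal Python sketch. It is not the code from the published workflow. The column names (taxonID, acceptedNameUsageID, taxonomicStatus, taxonRank) are assumed to follow the Darwin Core terms used in the GBIF Backbone flatfile, and the taxon names and identifiers in the toy table are invented purely for illustration.

```python
import csv
import io


def resolve_accepted_species(dataset_ids, backbone_rows):
    """Map each dataset taxonID to its currently accepted species-rank taxonID.

    Records that are missing from the backbone, flagged as doubtful, or whose
    accepted taxon is not of species rank are left out of the result.
    """
    by_id = {row["taxonID"]: row for row in backbone_rows}
    resolved = {}
    for taxon_id in dataset_ids:
        row = by_id.get(taxon_id)
        if row is None or row["taxonomicStatus"].lower() == "doubtful":
            continue
        # Assumption: synonyms point to their accepted taxon, while accepted
        # names leave this field empty.
        accepted_id = row["acceptedNameUsageID"] or taxon_id
        accepted = by_id.get(accepted_id)
        if accepted is None or accepted["taxonRank"].lower() != "species":
            continue
        resolved[taxon_id] = accepted_id
    return resolved


# Toy backbone excerpt; all names and identifiers here are hypothetical.
TOY_BACKBONE = (
    "taxonID\tacceptedNameUsageID\ttaxonomicStatus\ttaxonRank\tscientificName\n"
    "1\t\taccepted\tspecies\tAus bus L.\n"
    "2\t1\tsynonym\tspecies\tAus cus Mill.\n"
    "3\t\tdoubtful\tspecies\tAus dus L.\n"
    "4\t\taccepted\tgenus\tAus L.\n"
)

rows = list(
    csv.DictReader(io.StringIO(TOY_BACKBONE), delimiter="\t", quoting=csv.QUOTE_NONE)
)
# ID "5" is absent from the toy backbone, mimicking a deleted record.
print(resolve_accepted_species(["1", "2", "3", "4", "5"], rows))
```

Keeping the original taxonID as the dictionary key preserves the name of interest alongside the accepted taxon, so that later curation decisions can be traced back to the input.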
The GBIF taxon names were matched against the Viridiplantae taxon names from NCBI with Gndiff v.0.2.0 (https://github.com/gnames/gndiff) (Fig.
Simplified diagram of identifier linking through name matching. A Match accepted taxon names in GBIF against names in NCBI Taxonomy using gndiff, then check if the respective identifiers are also linked in Wikidata; B If an accepted name had no matches, retrieve synonyms for a second round of name matching.
For GBIF names without matches in NCBI Taxonomy, synonyms according to the GBIF Backbone were retrieved and then used for a second round of name-matching (Fig.
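The two rounds of matching can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm implemented in gndiff: it assumes that every name is a binomial followed by an optional author string, it ignores fuzzy matching, and it returns the first hit only. The function names are hypothetical, and the reference list is a toy example.

```python
def split_name(name):
    """Split 'Genus epithet Author' into (canonical binomial, author string)."""
    parts = name.split()
    return " ".join(parts[:2]), " ".join(parts[2:])


def normalise_author(author):
    """Lower-case the author string and drop spaces and punctuation."""
    return "".join(ch for ch in author.lower() if ch.isalnum())


def classify_match(query, reference):
    """Return 'none', 'noauthor', 'exact' or 'author mismatch'."""
    q_canon, q_auth = split_name(query)
    r_canon, r_auth = split_name(reference)
    if q_canon.lower() != r_canon.lower():
        return "none"
    if not q_auth or not r_auth:
        return "noauthor"
    if normalise_author(q_auth) == normalise_author(r_auth):
        return "exact"
    return "author mismatch"


def match_two_rounds(accepted, synonyms, references):
    """Round 1 tries the accepted name; round 2 falls back to its synonyms."""
    for round_no, candidates in ((1, [accepted]), (2, synonyms)):
        for candidate in candidates:
            for ref in references:
                kind = classify_match(candidate, ref)
                if kind != "none":
                    return {"query": candidate, "reference": ref,
                            "match": kind, "round": round_no}
    return {"query": accepted, "reference": None, "match": "none", "round": None}


# Toy reference list standing in for the Viridiplantae names from NCBI.
reference_names = ["Rosa elliptica Tausch", "Rubus gracilis Roxb."]

print(match_two_rounds("Rosa inodora Fr.", ["Rosa elliptica Tausch"], reference_names))
print(classify_match("Rubus gracilis C.Presl & J.S.Presl", "Rubus gracilis Roxb."))
```

Recording the round in which a match was found keeps synonym-derived links distinguishable from direct matches, which matters because a synonym designation is itself a taxonomic opinion.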
We queried Wikidata via its SPARQL API (https://query.wikidata.org/) for taxon items with the GBIF taxonIDs from our dataset (property P846). If they were linked to an NCBI taxonID (property P685), the linked NCBI taxonID was added to our table. If a taxon name was not linked to a Wikidata item via its GBIF taxonID, but the earlier name matching had found an NCBI taxonID, then the NCBI taxonID was used to query Wikidata to find linked Wikidata items and their associated GBIF taxonIDs, if available.
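A query of this kind might look like the following Python sketch, which uses only the standard library. The helper names, the User-Agent string and the batching of identifiers into a single VALUES clause are assumptions made for illustration, not details taken from the published workflow. The endpoint URL is that of the public Wikidata Query Service.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(gbif_ids):
    """Build a SPARQL query for items with the given GBIF taxonIDs (P846),
    returning the linked NCBI taxonID (P685) where one exists."""
    for gbif_id in gbif_ids:
        if not str(gbif_id).isdigit():
            raise ValueError(f"unexpected GBIF taxonID: {gbif_id!r}")
    values = " ".join(f'"{gbif_id}"' for gbif_id in gbif_ids)
    return (
        "SELECT ?item ?gbif ?ncbi WHERE {\n"
        f"  VALUES ?gbif {{ {values} }}\n"
        "  ?item wdt:P846 ?gbif .\n"
        "  OPTIONAL { ?item wdt:P685 ?ncbi . }\n"
        "}"
    )


def run_query(query, user_agent="taxon-linking-sketch/0.1 (illustrative example)"):
    """Send the query to the public endpoint and return the result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def bindings_to_links(bindings):
    """Collect, for each GBIF taxonID, the set of NCBI taxonIDs linked on Wikidata."""
    links = {}
    for binding in bindings:
        ncbi_ids = links.setdefault(binding["gbif"]["value"], set())
        if "ncbi" in binding:
            ncbi_ids.add(binding["ncbi"]["value"])
    return links


# Offline example in the shape of a SPARQL JSON result, using identifiers
# quoted elsewhere in this article (Rosa elliptica and Rosa inodora).
sample_bindings = [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q9325795"},
     "gbif": {"type": "literal", "value": "3003248"},
     "ncbi": {"type": "literal", "value": "323240"}},
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q15844731"},
     "gbif": {"type": "literal", "value": "3002258"}},
]
print(bindings_to_links(sample_bindings))
# A live call would be: bindings_to_links(run_query(build_query(["3003248"])))
```

An empty set in the result marks a Wikidata item that carries the GBIF taxonID but no NCBI taxonID, which is the situation addressed by batch-adding identifiers.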
The identifier links on Wikidata were then used to categorise the pairs of matched names for further action (Table
Possible outcomes of data-linking steps, further curation steps to be taken and the number of cases identified in this example dataset.
Name match type | GBIF ID linked to Wikidata item? | Wikidata links GBIF to NCBI ID? | NCBI ID from name matching same as on Wikidata? | Wikidata links NCBI to GBIF ID? | Taxonomic status of name on GBIF | Curation action to take | Count
none | no | - | - | no | - | (a) No matches, including synonyms | 1310
none | no | - | - | yes | | Other | 8
exact | yes | yes | yes | - | - | (b) Match ok, accept automatically | 3130
exact | yes | yes | no | - | - | (c) Verify and update NCBI taxonID in Wikidata item | 11
exact | yes | no | - | no | - | (d) Batch-add NCBI taxonID to Wikidata item | 177
exact | yes | no | - | yes | - | Other | 52
exact | no | - | - | yes | “accepted” | (e) Batch-update GBIF taxonID in Wikidata item | 245
exact | no | - | - | yes | not “accepted” | (f) Verify if synonym listed in GBIF is valid before linking identifiers | 89
noauthor | yes | yes | yes | - | - | (g) Verify if authorships match before linking identifiers | 211
noauthor | yes | yes/no | no | - | - | (h) Possible homonym, investigate further | 224
author mismatch | yes | yes | yes | - | - | (g) Verify if authorships match before linking identifiers | 271
author mismatch | yes | yes/no | no | - | - | (h) Possible homonym, investigate further | 217
fuzzy | - | - | - | - | - | Other | 100
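The decision table above can be expressed as a small function, sketched here in Python. The function name and argument layout are hypothetical, and any combination of inputs that does not appear in the table falls through to "other". It encodes the table only, not the scripts of the published workflow.

```python
def categorise(match_type, gbif_linked, wikidata_ncbi_id,
               matched_ncbi_id, ncbi_links_to_gbif, gbif_status):
    """Return the curation category ('a' to 'h', or 'other') for one taxon name.

    match_type:         'none', 'exact', 'noauthor', 'author mismatch' or 'fuzzy'
    gbif_linked:        is the GBIF taxonID attached to a Wikidata item?
    wikidata_ncbi_id:   NCBI taxonID on that Wikidata item, or None
    matched_ncbi_id:    NCBI taxonID found by name matching, or None
    ncbi_links_to_gbif: does the Wikidata item found via the NCBI taxonID
                        carry a GBIF taxonID?
    gbif_status:        taxonomic status of the name in GBIF
    """
    if match_type == "none":
        if not gbif_linked and not ncbi_links_to_gbif:
            return "a"  # no matches, including synonyms
        return "other"
    if match_type == "exact":
        if gbif_linked:
            if wikidata_ncbi_id is not None:
                # (b) accept automatically, or (c) verify and update NCBI ID
                return "b" if wikidata_ncbi_id == matched_ncbi_id else "c"
            # (d) batch-add the NCBI ID unless it already sits on another item
            return "other" if ncbi_links_to_gbif else "d"
        if ncbi_links_to_gbif:
            # (e) batch-update GBIF ID, or (f) verify the synonym first
            return "e" if gbif_status == "accepted" else "f"
        return "other"
    if match_type in ("noauthor", "author mismatch"):
        if not gbif_linked:
            return "other"
        same = wikidata_ncbi_id is not None and wikidata_ncbi_id == matched_ncbi_id
        # (g) verify authorships, or (h) possible homonym
        return "g" if same else "h"
    return "other"  # fuzzy matches and anything unexpected


# Placeholder identifiers "111" and "222" are invented for the example.
print(categorise("exact", True, "111", "111", False, "accepted"))
print(categorise("author mismatch", True, None, "222", False, "accepted"))
```

Writing the table as code makes the fall-through cases explicit, so that rows which do not fit any planned action are set aside for manual inspection rather than silently accepted.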
We identified straightforward cases of missing or outdated information in Wikidata, which can be updated through batch edits (Table
To understand the underlying causes for these erroneous links, we further investigated the cases where name-matching and Wikidata disagree on the GBIF taxonID (Table
The remaining cases were then tabulated for manual curation. This requires some knowledge of taxonomy and nomenclature rules to be able to evaluate whether two names are equivalent or not, as well as cross-checking against additional databases.
Here, we describe what issues can be found during manual curation and what concrete action users can take to improve the database resources. In brief: Wikidata can be edited directly to fix errors or add missing information, preferably after creating a user account; issues with the GBIF Backbone Taxonomy can be reported via the website feedback dialogue, by email or via GitHub; issues with the NCBI Taxonomy should be reported by email.
Error modes in name matching have been extensively discussed (
Example: Name-matching errors may also appear in the source databases. The original dataset listed Ammophila Kirby, 1798 (GBIF taxonID 1346141), a genus of wasps, instead of the grass genus Ammophila Host (GBIF 2703794). Both names are valid under their respective, independent nomenclatural codes, i.e. they are hemihomonyms. Here, the error appears to have occurred during import of the data from the original provider into GBIF.
Action: Accept or reject the linked identifiers after verification.
If the results of name matching disagree with database identifiers, it is possible that one or more of the source databases have incomplete or erroneous information.
1. GBIF taxonID has been deprecated or merged.
The GBIF Backbone Taxonomy is continually revised and records may be deleted if they are, for example, doubtful names, orthographic errors or duplicates. However, the deprecated GBIF taxonIDs may still be linked in Wikidata. In some cases, the accepted taxon in GBIF may also be in error (see point 6 below).
Example: The Wikidata record for Helianthus annuus (Q171497) was linked to the GBIF taxonID 3119195, which was deleted on 01 Feb 2018. The currently accepted GBIF record for this species is 9206251.
Action: When unambiguous, edit the Wikidata entry to add the currently accepted GBIF taxon, after checking that it is not a homonym. Record the access date in the reference with the property “retrieved” (P813), which will help future editors troubleshoot if the GBIF record changes again. The outdated GBIF identifier value can be explicitly marked with a "deprecated" rank, with a qualifier stating that the reason (P2241) is that the identifier was deprecated in the source database (Q67125514). See
2. NCBI taxonID has been deprecated or merged.
Unlike GBIF, the NCBI Taxonomy merges synonyms under the same taxonID, which can be problematic if there is disagreement about whether two taxa are truly synonymous.
Example: Calamagrostis stricta, formerly NCBI:txid497295, has been merged as a synonym of Calamagrostis neglecta NCBI:txid395286 in the NCBI Taxonomy. Furthermore, the GBIF Backbone accepted C. stricta (2704899) while designating C. neglecta (4104731) as a synonym of Achnatherum calamagrostis (4142326).
Action: Searching the NCBI website for a merged taxonID or entering its URL will auto-redirect to the current accepted one. However, the ENA Taxonomy API (https://www.ebi.ac.uk/ena/taxonomy/rest/), which, in principle, uses the same NCBI Taxonomy, usually returns no result for merged taxonIDs, indicating that merged taxonIDs may cause problems with downstream tools that do not take them into account. The currently accepted NCBI taxonID can be added to the Wikidata entry, but the old taxonID may help disambiguate the record and should not be deleted, but, instead, explicitly marked as a deprecated value (see point 1 above).
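For downstream tools that work from the NCBI Taxonomy flatfiles, deprecated taxonIDs can be resolved locally. The sketch below assumes the `merged.dmp` file distributed with the NCBI taxonomy dump, in which each line holds an old and a new taxonID separated by tab-pipe-tab delimiters. The function names are hypothetical and the single example line reproduces the Calamagrostis case described above.

```python
def parse_merged_dmp(lines):
    """Parse lines of merged.dmp into a dict of {old taxonID: new taxonID}."""
    merged = {}
    for line in lines:
        if not line.strip():
            continue
        # Fields are delimited by '|' with surrounding tabs.
        old_id, new_id = (field.strip() for field in line.split("|")[:2])
        merged[old_id] = new_id
    return merged


def resolve_taxid(taxid, merged):
    """Follow merges until a taxonID is reached that has not itself been merged."""
    seen = set()
    while taxid in merged and taxid not in seen:
        seen.add(taxid)  # guard against accidental cycles
        taxid = merged[taxid]
    return taxid


# One line in the assumed merged.dmp layout: Calamagrostis stricta (497295)
# merged into Calamagrostis neglecta (395286).
example_lines = ["497295\t|\t395286\t|\n"]
merged = parse_merged_dmp(example_lines)
print(resolve_taxid("497295", merged))
print(resolve_taxid("395286", merged))
```

Resolving identifiers in this way before querying other services avoids silent misses where a tool does not follow merged taxonIDs, while the old taxonID can still be kept in a separate column for disambiguation.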
3. Incorrect species linked on Wikidata.
The Wikidata record may be linked to an identifier for a different species. These cases are usually homonyms, which can be recognised by the different taxon author.
Example: The Wikidata record for Rubus gracilis C.Presl & J.S.Presl (Q17248013) was linked to identifiers for the homonymous Rubus gracilis Roxb. in GBIF (2990660) as well as another database, GRIN-Global (32332, explicitly annotated as “non J.S.Presl & C.Presl 1822”).
Action: When unambiguous, edit the Wikidata entry to remove the incorrect statement, or point to the correct identifier, if available. Record the access date using the Wikidata property “retrieved” (P813). Different Wikidata items for homonymous taxa can be disambiguated with the property “different from” (P1889).
4. Ambiguous entry in Wikidata: Conflicting taxon authors.
Some cases may require taxonomic/nomenclatural expertise or additional information to resolve.
Example: The Wikidata record for Willemetia stipitata (Q1362051) stated that the taxon author (property P405) is Karl Wilhelm von Dalla Torre (Q79155), but the linked GBIF entry (5389300) for W. stipitata (Jacq.) Dalla Torre was annotated as “doubtful” in GBIF, whereas the linked NCBI entry (NCBI:txid519273) represented the homonym W. stipitata Cass. Linked records in other Wikis were also inconsistent: German-language Wikipedia – W. stipitata (Jacq.) Dalla Torre (https://de.wikipedia.org/wiki/Kronenlattich); Wikispecies – W. stipitata Cass. (https://species.wikimedia.org/wiki/Willemetia_stipitata).
Action: The Wikidata entity may need to be split into separate entities for each homonym. Start a thread on the corresponding discussion/talk page in Wikidata or Wikispecies to alert other users to the issue. For one’s own research, make a judgement call and document it. Both the GBIF and NCBI records have subsequently been changed, but still disagree on which name should be accepted.
5. Ambiguous entry in Wikidata: No taxon authors.
Some taxon names on Wikidata may lack the “taxon author” (P405) or “taxon author citation” (P6507) properties.
Action: As above. These should probably be split into separate entities if they are indeed homonyms, but it would then be unclear how the linked identifiers should be distributed between them.
6. Error in accepted taxon in GBIF Backbone Taxonomy.
These can often be traced back to errors in the source datasets used to populate the GBIF Backbone. The following example was found because the Wikidata entry was linked to both GBIF and NCBI taxonIDs and agreed with the name-matching with Gndiff, but the author names conflicted.
Example: “Primula matthioli K.Richt.” was an accepted taxon in the GBIF Backbone Taxonomy (5640570); GBIF’s source dataset for this name is “Synonymic checklists of the vascular plants of the world” (
Given the corroboration from IPNI, the author names in GBIF records 5640570 (“K.Richt.”) and 9781637 (“(L.) J.A.Richt.”) are likely to be typographical errors for 9764749 (“(L.) V.A.Richt.”).
Action: Report errors or issues via the feedback system on the GBIF website (must be logged in with a GBIF user account). Feedback reports are handled via the issue tracker on GitHub and can also be submitted directly there or by email. The issue opened for the above example is here: https://github.com/gbif/portal-feedback/issues/4673. If their data curators can trace the issue to an upstream data source, the report is passed upwards. Curators can also apply “patches” to the GBIF Backbone Taxonomy, where the upstream source cannot be updated in a timely manner. The GBIF record with the correct authorship (9764749) now has status "accepted".
7. Error in accepted taxon in NCBI Taxonomy.
Example: Carex binervis Sm. (Wikidata Q160245) was an accepted taxon in the GBIF Backbone Taxonomy (2723521), but the NCBI record had different authors “Gren. & Godr.” (NCBI:txid372257) (this has now been corrected).
IPNI listed four homonyms for the name Carex binervis, but none with “Gren. & Godr.” as authors. Only C. binervis Sm. was validly published (https://www.ipni.org/?q=carex%20binervis). The remainder were either nom. inval. (C. binervis Wahlenb. ex Kunth) or nom. illeg. (C. binervis Willd. ex Kunth and C. binervis Dewey), the latter according to Plants of the World Online (https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:77237975-1).
“Carex binervis Gren. & Godr.” turned out to be a chresonym, where the authors after the binomen are not the authors of the name itself, but a reference to a usage of the name in some other publication. The Tropicos database had an entry for “C. binervis Gren. & Godr.” with a citation to the publication Flore de France by Grenier & Godron (1855) (http://legacy.tropicos.org/Name/9900008). This allowed us to find a digital copy online (https://bibdigital.rjb.csic.es/idviewer/10272/430) where the name “C. binervis Sm.” was listed, showing that this was, indeed, the intended name.
Where did the NCBI Taxonomy find this chresonym “Carex binervis Gren. & Godr.”? The NCBI web interface listed two references: Monocot Checklist (http://www.kew.org/wcsp/home.do, accessed 01 Nov 2010) and a research paper (
Action: Report errors and updates to the NCBI helpdesk by email (
8. Disagreements in taxon concepts between databases.
The “same” taxon may appear under different names, classifications or even be split or lumped into different taxa, depending on the source consulted. One name may hence represent different taxonomic concepts. When data aggregators designate accepted names or use a particular classification, they gloss over potentially valid taxonomic conflicts (
Example: The species Rosa inodora Fr. (GBIF taxonID 3002258, Wikidata Q15844731) in our dataset did not have an NCBI taxonID, i.e. no sequence data were available. However, Rosa elliptica Tausch (GBIF taxonID 3003248, Wikidata Q9325795), listed as a synonym of Rosa inodora by GBIF, did have an NCBI taxonID (NCBI:txid323240).
Action: For our own analyses, we may accept a particular taxonomic opinion and link these taxa that were designated as synonyms by GBIF or NCBI. However, in Wikidata, the NCBI taxonID of Rosa elliptica should not be linked from the Rosa inodora item, but from Rosa elliptica. Designation of a synonym is a taxonomic theory which is subject to potential disagreement and future revision. Therefore, the original name of interest, accepted names and synonyms are kept in separate data columns in our workflow. In Wikidata, synonymous taxa can be represented by the “taxon synonym” property (P1420), whereas homonyms can be disambiguated with the “different from” property (P1889).
The above workflow is available from https://github.com/monagrland/taxo-harmo (archived version: https://doi.org/10.5281/zenodo.10074668). Software dependencies are specified in a definition file for the Conda environment manager, using packages distributed via the open-source conda-forge and bioconda channels (
The state of biodiversity identifier linking is patchy, even across well-resourced, heavily used databases and for well-studied sets of species like the German vascular plant flora. As expected, naive name-matching alone is problematic and can cause linking errors, affecting at least 2.8% of Wikidata entries for the species names in the dataset examined here. Ironically, better studied groups and more comprehensive databases may contain more historical names and homonyms that need to be accounted for. Most such linking errors are easily caught by using author names and higher taxa to disambiguate taxa, allowing us to focus manual curation efforts on the most challenging cases.
Existing recommendations and workflows for taxon name harmonisation (
Generally, though, databases are presented as resources to be accepted as-is, over which the user has no influence. Apart from simply filtering out problematic records, what more can be done? We, therefore, suggest the following additional recommendations for users to be active participants and help “pay it forward” in the community:
Name-matching and identifier-linking are receiving renewed attention from database maintainers. A recent symposium at the TDWG2023 conference touched upon issues raised in this case study from their perspective, such as the importance of identifier mapping to data integration (
As a user, why take the trouble to edit Wikidata and send feedback? Curation of biodiversity data is labour-intensive and requires a highly specialised skill-set, so updating community resources will reduce duplicated effort and have a positive, compounding effect (“virtue propagation”). Wikidata, in particular, is increasingly integrated into the biodiversity informatics infrastructure, a de facto recognition of its practical usefulness: the database cross-references displayed on species pages on the GBIF website (https://www.gbif.org/species/search) are sourced from Wikidata and the iNaturalist citizen-science app uses Wikidata to link species pages to their respective Wikipedia articles in various languages (
The workflow presented here still relies on ad hoc scripting, which is, to some extent, unavoidable because the point of manual curation is to handle what automation cannot deal with, but it is desirable to minimise this to improve reproducibility, as well as the reusability of code. A promising alternative is OpenRefine (https://openrefine.org/), a dedicated tool for data reconciliation, which records all data-cleaning steps in a given project, allowing them to be shared and re-run on new data. It also supports querying and editing Wikidata within the software, as well as URL-based queries (e.g. calls to the GBIF name parser API). The Simple Standard for Sharing Ontological Mappings (SSSOM,
Routine sharing of curation workflows by researchers, coupled with the transparent handling of issue reports by database maintainers, will foster more community buy-in and faster adoption of useful practices, improving the quality of downstream analyses.
I thank Christian Levers and Wiebke Sickel for feedback on a draft of this manuscript, Dmitry Mozzherin for help with gndiff, GBIF and NCBI Taxonomy curators for their responses to my queries and feedback and the peer reviewers and editor for their suggestions. The Biodiversity Community Integrated Knowledge Library (BiCIKL) project, funded by the European Union Horizon 2020 Research and Innovation Action under grant agreement No 101007492, has supported the publication of this work.