Biodiversity Data Journal :
Forum Paper
Corresponding author: Roderic Page (roderic.page@glasgow.ac.uk)
Academic editor: Anne Thessen
Received: 15 Jun 2018 | Accepted: 11 Jul 2018 | Published: 23 Jul 2018
© 2018 Roderic Page
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Page R (2018) Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature. Biodiversity Data Journal 6: e27539. https://doi.org/10.3897/BDJ.6.e27539
Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names and identifiers for the taxonomic articles that published those names.
datasette, linked data, data publishing, biodiversity knowledge graph, taxonomic names
A venerable tradition in taxonomy is compiling and publishing lists of scientific names, whether in printed form or as online databases (
In an ideal world, each taxonomic name would be linked to a detailed bibliographic record of where that name was published and that publication would be available in digital form, as would any subsequent taxonomic revisions (
Motivated by this lack of links, I have spent the last few years obsessively collecting digital identifiers for taxonomic publications and linking them to taxonomic names. This project is far from complete, nor is it likely to be in the near future given the continuing discovery of new species and the increasing number of taxonomic works that are becoming available online. One consequence of this Sisyphean task is that it becomes tempting to simply continue to accumulate links in a local database, forever postponing publishing them. This is unlikely to be a particularly successful career strategy, nor is it helpful to people who might make use of these links. However, publishing sets of links is not necessarily a straightforward task.
One option for publication is to create a custom interface to the links, to make them both discoverable and interesting. Examples include links between the NCBI taxonomy (
Rather than expend effort on developing idiosyncratic solutions, one could simply publish the data to an existing platform. I adopted this approach for the names in the Plant List http://www.theplantlist.org, for which I linked a subset of names to publications with Digital Object Identifiers (DOIs) or with a link to JSTOR. This dataset was uploaded to the Global Biodiversity Information Facility (GBIF) (
GBIF is a domain-specific data publisher. An alternative may be to publish in a venue with broader scope, such as Wikidata (
A particularly appealing route for publishing links would be to treat each link as a "nanopublication" (
If we find the three options discussed so far (custom web site, existing data publisher and nanopublications) unsatisfactory, then it seems that the only remaining approach is simply to deposit the dataset as a “dumb” file in a repository such as Datadryad or Zenodo, minting a DOI to make it citable and then hope that somebody makes use of it. However, multi-megabyte data files are often not the easiest for users to work with and it might not be obvious to a potential user why the data would be worth investing time in discovering whether it was useful.
However, other possibilities are emerging. For example,
In this paper, I describe the creation of a datasette for a longstanding but mostly unpublished project on linking plant names in the International Plant Names Index (IPNI) to the taxonomic literature.
The International Plant Names Index (IPNI, http://www.ipni.org) is an international register of published plant names based at the Royal Botanic Gardens, Kew but which has contributions from the Harvard Gray Index and the Australian Plant Name Index. Both new taxonomic names (e.g. for newly described species) and new combinations (e.g. reflecting transfers of species from one genus to another) are recorded in IPNI, together with a citation to the scientific work which published that name. These citations typically comprise an abbreviation for the publication (such as a journal or a book), a description of the location of the name within that publication, such as a combination of volume number and page number and the year of publication. One or more of these items may be missing, different journal abbreviations may be used in data sourced from different datasets and the volume and pagination may be in either Roman or Arabic numerals. For some records, the IPNI curators have added a link to the corresponding page in the Biodiversity Heritage Library (BHL) and, for some recently added records, the IPNI web site may give the DOI for a publication, but the majority of IPNI records are not linked to a digital identifier for the publication associated with each name.
In much the same way as for BioNames (
A CSV file containing basic metadata for a plant name, such as IPNI LSID, scientific name, bibliographic details and any identifiers found, was generated from the current IPNI LSID to literature identifier mapping. To retain fidelity with the original IPNI data, the column names are those used in the output of the IPNI API - no effort has been made to standardise them using, for example, terms from the Darwin Core (
datasette package -t <username>/ipni ipni.db
where <username> is your username at https://docker.com. The container can be run locally or can be pushed to an online repository where others can access it, such as Docker Hub. To push to the Hub, the commands are:
docker login -u <username> -p <password>
docker push <username>/ipni
A container for this project is available at https://hub.docker.com/r/rdmpage/ipni/.
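The step from the CSV file to the ipni.db file that `datasette package` expects can be sketched as follows. This is a minimal illustration using only the Python standard library, not necessarily the tool used for this project. The column names are those that appear in the queries in this paper; the helper name `csv_to_sqlite` and every row in `SAMPLE_CSV` are hypothetical placeholders, not real IPNI records.

```python
import csv
import io
import sqlite3

# Placeholder rows: names and identifiers are invented for illustration only.
SAMPLE_CSV = """Id,Full_name_without_family_and_authors,Genus,doi,jstor
example-1,Begonia exemplum,Begonia,10.9999/example.1,
example-2,Tiquilia exemplum,Tiquilia,,9999999
example-3,Begonia alterum,Begonia,,
"""


def csv_to_sqlite(csv_text, conn, table="ipni"):
    """Load CSV text into a SQLite table, storing empty cells as NULL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames
    # Every column is stored as TEXT, keeping the header names unchanged.
    column_defs = ", ".join('"{}" TEXT'.format(c) for c in columns)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, column_defs))
    placeholders = ", ".join("?" for _ in columns)
    # Empty strings become None so that "is not null" filters behave as expected.
    rows = [tuple(row[c] or None for c in columns) for row in reader]
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows
    )
    conn.commit()
    return len(rows)


# An in-memory database keeps the sketch self-contained;
# sqlite3.connect("ipni.db") would write a file instead.
conn = sqlite3.connect(":memory:")
loaded = csv_to_sqlite(SAMPLE_CSV, conn)
```

Storing empty cells as NULL rather than as empty strings matters here, because queries that test whether an identifier is present rely on `is not null`.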
The datasette generated here can be seen online at https://ipni.sloppy.zone. If this demo is offline, the reader can deploy a copy of the container from the Docker repository https://hub.docker.com/r/rdmpage/ipni/. The interface is simple and generic (Fig.
Some simple queries include finding the DOIs for publications of new names in a given genus, such as Begonia :
select Id, Full_name_without_family_and_authors, doi from ipni where Genus='Begonia' and doi is not null;
JSTOR has digitised many botanical journals, so for some taxa such as the genus Tiquilia, it is an excellent source of taxonomic literature:
select Id, Full_name_without_family_and_authors, jstor from ipni where genus='Tiquilia' and jstor is not null;
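Queries like these need not be typed into the web form: a datasette also returns query results as JSON, so the same SQL can be sent from a script. The sketch below only builds the request URL and makes no network call. It assumes that the database is exposed under the name `ipni` (datasette derives the name from the file ipni.db) and that the demo host is reachable; the helper name `datasette_query_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def datasette_query_url(base_url, database, sql):
    """Build the JSON endpoint URL for an arbitrary SQL query."""
    # "_shape=array" asks datasette for a plain JSON list of row objects.
    query = urlencode({"sql": sql, "_shape": "array"})
    return "{}/{}.json?{}".format(base_url.rstrip("/"), database, query)


SQL = (
    "select Id, Full_name_without_family_and_authors, doi "
    "from ipni where Genus='Begonia' and doi is not null"
)

# Assumed host and database name; adjust both for a locally deployed container.
url = datasette_query_url("https://ipni.sloppy.zone", "ipni", SQL)
```

The resulting URL could be fetched with `urllib.request.urlopen(url)` and decoded with `json.load`, against either the online demo or a container running locally.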
Although the primary goal of the name-to-literature mapping is to find digital versions of the descriptions for each species, the datasette enables queries that might address other questions. For example, the database includes information on the agency that registers the DOI for a publication. For most publications, this is CrossRef, but there are other agencies, such as DataCite, the multilingual European Registration Agency (mEDRA) and the Airiti Incorporation (華藝數位). Table
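A count of names per DOI registration agency can be obtained with a single aggregate query. The sketch below is illustrative only: the column name `doi_agency` is an assumption (the name of the column that holds the agency is not given here), and the rows are invented placeholders rather than real IPNI data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "doi_agency" is a hypothetical column name used only for this sketch.
conn.execute("create table ipni (Id text, doi text, doi_agency text)")
conn.executemany(
    "insert into ipni values (?, ?, ?)",
    [
        ("example-1", "10.9999/a", "crossref"),
        ("example-2", "10.9999/b", "crossref"),
        ("example-3", "10.9999/c", "datacite"),
        ("example-4", None, None),  # a name with no DOI is excluded below
    ],
)

AGENCY_SQL = """
select doi_agency, count(*) as names
from ipni
where doi is not null
group by doi_agency
order by names desc, doi_agency
"""

# One (agency, number of names) tuple per agency, most frequent first.
counts = conn.execute(AGENCY_SQL).fetchall()
```

The same `select ... group by` statement could be pasted into the datasette query form once the actual column name is substituted.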
Links between taxonomic names and the scientific literature have many possible uses. One is simply to be able to read the description of a new species or discover the reasoning behind subsequent changes in name. Given that many of these sources are available in machine-readable text, the links could be used to generate a corpus for text mining to extract information on the species being described (
The use of global bibliographic identifiers also enables queries that can span multiple databases. For example, knowing the DOI for a paper that changes the taxonomy of a plant genus, we could ask whether the evidence for that is supported by phylogenetic analysis by seeing whether that DOI also occurs in TreeBASE (https://treebase.org). We could ask to what extent the discovery of new plant species is being driven by molecular data by seeing whether the DOI for the species description also occurs in sequence databases such as GenBank. However, these examples all require the existence of links between these databases, which are often incomplete (
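At its simplest, a query that spans two databases is an intersection of the DOIs each one holds. The following sketch assumes that both DOI lists have already been exported from their sources. The function names and all DOIs are hypothetical placeholders. DOIs are compared case-insensitively and with common resolver prefixes removed, since the same DOI is often written in several forms.

```python
# Common ways a DOI is prefixed when written as a link or a CURIE.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(value):
    """Return a bare, lower-case DOI so that variant spellings compare equal."""
    doi = value.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def shared_dois(left, right):
    """Return the sorted DOIs that occur in both collections."""
    return sorted(
        {normalise_doi(d) for d in left} & {normalise_doi(d) for d in right}
    )


# Placeholder DOIs standing in for exports from two different databases.
name_dois = ["10.9999/Example.1", "https://doi.org/10.9999/example.2"]
tree_dois = ["doi:10.9999/EXAMPLE.2", "10.9999/example.3"]
overlap = shared_dois(name_dois, tree_dois)
```

Any DOI in `overlap` identifies a publication that both sources cite, which is the starting point for the kinds of question posed above.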
In the absence of an existing knowledge graph and the lack of a centralised infrastructure supporting its development, datasettes provide an easy mechanism for publishing links that places minimal burden on the researcher or curator doing the mapping, but also provides an interface that is potentially useful to users, even as we wait for the knowledge graph itself to coalesce.
Any work augmenting lists of taxonomic names builds on the efforts of cataloguers and biocurators, in this case the many people involved in the International Plant Names Index.