Biodiversity Data Journal
Data Paper (Biosciences)
Corresponding author: Mathias Dillen (mathias.dillen@plantentuinmeise.be)
Academic editor: Anne Thessen
Received: 21 Nov 2018 | Accepted: 04 Feb 2019 | Published: 08 Feb 2019
© 2019 Mathias Dillen, Quentin Groom, Simon Chagnoux, Anton Güntsch, Alex Hardisty, Elspeth Haston, Laurence Livermore, Veljo Runnel, Leif Schulman, Luc Willemse, Zhengzhe Wu, Sarah Phillips
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Dillen M, Groom Q, Chagnoux S, Güntsch A, Hardisty A, Haston E, Livermore L, Runnel V, Schulman L, Willemse L, Wu Z, Phillips S (2019) A benchmark dataset of herbarium specimen images with label data. Biodiversity Data Journal 7: e31817. https://doi.org/10.3897/BDJ.7.e31817
More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons.
To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
Herbarium specimens are a research tool, an archive and a reference for plant sciences. They provide data and verifiability to disciplines such as phytogeography, taxonomy and ecology (
As digital imaging of the world’s herbaria continues, there is a recognition that large amounts of information can be extracted from these images. This information includes data concerning the specimen's origin on the labels, such as location, date and collector, but also traits and the identity of the plant itself (
Herbarium specimens are far from homogeneous. They vary in the language, location and style of the labels, in whether they are typed or handwritten and in the quality and quantity of information on the labels (
Not all herbarium digitisation projects are the same. They vary in aspects, such as the imaging methodology, the resolution of the digital image created and their approach to quality control (
For all these reasons, we feel it is useful to provide a benchmark dataset of digitised herbarium specimens, made openly available for the development of tools and workflows for data extraction. This dataset has been placed in the public domain specifically to act as a test dataset for research and a benchmark to compare different methods. We have also provided transcribed data, where available, associated with each image, which can be used for comparison or for training systems. In addition, for 250 of the specimens, we have provided image overlays that identify the position of labels. These can be used for segmentation analysis of the specimen.
The images have been released under a Creative Commons Zero licence waiver (https://creativecommons.org/licenses), to ensure that there are no limitations that could hinder or discourage anyone from using them. However, the authors expect users to follow the norms of scientific citation. Each upload of images and data about a specimen has been assigned a DOI (Digital Object Identifier), which uniquely and persistently identifies it and allows citation. Data and media are provided as they were when the dataset was assembled and will not be kept in sync with new developments at the collection level after publication. Stable identifiers for the collection specimens themselves can always be found as 'Alternative identifiers' at each upload's landing page (
Curators from nine European herbaria volunteered to provide a sample of their digitally imaged herbarium sheets. Herbarium curators were requested to select specimens following a set of guidelines that were chosen to ensure a representative cross-section of specimen characteristics. The aim was to provide specimens that could answer questions related to the language, condition, age and geography of the specimen and, at the same time, provide a sufficient sample size for statistical analysis (Table
The guidelines given to herbaria to select specimens for the test dataset. The goal was not to have a representative sample of all specimens, but to have comparable subsets, which will have labels written in different languages; will be printed or handwritten; will cover a wide range of dates; will be both type specimens and general collections and will provide specimens from different families and different parts of the world.
Number of specimens | Type status | Date of collection | Geography |
---|---|---|---|
25 | Type | < 1970 | Any country |
25 | Type | > 1970 | Any country |
25 | non-Type | < 1970 | From the country where the herbarium is located |
25 | non-Type | > 1970 | From the country where the herbarium is located |
100 | non-Type | Any | non-Type specimens from one other country or region of which the herbarium possesses a substantial number of specimens |
Contributions of 9 different institutes to the dataset. Availability of JPG and TIFF images is indicated, as well as the source of label data. Most institutes were able to follow the template in Table
Institute | Institution Code | Data Source | Composition (with ISO 3166-1 alpha-2 Country Codes) |
---|---|---|---|
Meise Botanic Garden | BR | 10.15468/wrthhx | As Table 1; 100 from AU, CA, NZ, US |
Royal Botanic Gardens, Kew | K | 10.15468/ly60bx | As Table 1; 100 from BR |
Natural History Museum, London | BM | 10.5519/0002965 | As Table 1; 100 from AU, CA, NZ, US |
Botanic Garden and Botanical Museum, Berlin | B | | As Table 1; 100 from AU, BR, CN, ID, TZ, US |
Royal Botanic Garden Edinburgh | E | 10.15468/ypoair | As Table 1; 100 from CN |
National Museum of Natural History, Paris | P | 10.15468/nc6rxy | 50 type, 50 non-Type FR, 100 non-Type not FR |
Natural History Museum, University of Tartu | TU | 10.15156/bio/587444 | 100 < 1970, 100 > 1970 |
Naturalis Biodiversity Center | L | 10.15468/ib5ypt | As Table 1; 100 from ID; no selection on date |
Finnish Museum of Natural History LUOMUS, University of Helsinki | H | | As Table 1; 14 FI, 36 ET instead of 50 FI; 100 from AU, BR, CN, ID, US |
Where possible, images were collected in JPG and lossless TIFF formats. Data were collated as a Darwin Core (DwC) Archive (
There are no clear indications in any data as to whether a specimen is completely or partially transcribed. The labels may contain a variety of information, so availability of the information in the data is not a good guide to whether it is present on the label. Nevertheless, of the 1,800 specimens, 94% had a collector listed and 56% had either a collector number or an explicit indication that there was no number. A total of 85% had either a verbatim or interpreted date and 90% had either a date or collector number. Hence, most of the specimens have some level of transcription.
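The completeness figures above can be recomputed from the compiled data file. The following is a minimal sketch, assuming that collector, collector number and dates are stored under the Darwin Core terms recordedBy, recordNumber, eventDate and verbatimEventDate; the sample records are made up for illustration and the explicit "no number" indication is not modelled.

```python
import csv
import io

# Darwin Core terms assumed to carry the transcribed label data.
FIELD_GROUPS = {
    "collector": ["recordedBy"],
    "date": ["eventDate", "verbatimEventDate"],
    "date_or_number": ["eventDate", "verbatimEventDate", "recordNumber"],
}


def completeness(rows, groups):
    """Share of rows with at least one non-empty field, per named group."""
    rows = list(rows)
    shares = {}
    for name, fields in groups.items():
        filled = sum(
            1 for row in rows
            if any((row.get(field) or "").strip() for field in fields)
        )
        shares[name] = filled / len(rows) if rows else 0.0
    return shares


# Made-up records standing in for the real CSV file.
SAMPLE = """recordedBy,recordNumber,eventDate,verbatimEventDate
A. Collector,123,1950-06-01,
,,,June 1890
B. Collector,77,,
,,,
"""

shares = completeness(csv.DictReader(io.StringIO(SAMPLE)), FIELD_GROUPS)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Grouping several terms under one name mirrors the "either ... or" phrasing of the figures: a record counts as soon as any field in the group is filled.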
All specimens were analysed by a polyglot to determine the primary language of the label. In some cases the label had no dominant single language or no other text beyond the scientific Latin name of the specimens. However, the language of the label could be identified for 90% of the dataset (Fig.
A classification of the languages used on labels of the different specimens. EN = English, FR = French, LA = Latin, ET = Estonian, DE = German, NL = Dutch, PT = Portuguese, ES = Spanish, SV = Swedish, RU = Russian, FI = Finnish and IT = Italian. ZZ indicates a single language could not be determined: either there were multiple languages used on the label, there was no obvious use of a certain language (i.e. only scientific Latin terms) or the language was not readily identifiable. Different herbaria are identified by their Index Herbariorum codes (Institution Code in Table
The distribution of collection dates (by year, if known) of the specimens in the dataset for each providing institution. The heat colour indicates the number of specimens for each 10 year time period. Year data were extracted from Darwin Core eventDate and verbatimEventDate if these were in ISO 8601 standard. Codes for the herbaria follow Table
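The year extraction and ten-year binning described in this caption could look roughly like the sketch below. It is not the authors' R code: the regular expression, the function names and the sample records are assumptions, and only the extended ISO 8601 form (YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by a time or an interval) is recognised.

```python
import re
from collections import Counter

# A leading four-digit year in extended ISO 8601 notation.
ISO_YEAR = re.compile(r"^(\d{4})(?:-\d{2}(?:-\d{2})?)?(?:[T/].*)?$")


def extract_year(event_date, verbatim_event_date):
    """Return the year from eventDate, else verbatimEventDate, else None."""
    for value in (event_date, verbatim_event_date):
        match = ISO_YEAR.match((value or "").strip())
        if match:
            return int(match.group(1))
    return None


def decade_counts(records):
    """Count specimens per (institution code, decade start year)."""
    counts = Counter()
    for institution, event_date, verbatim in records:
        year = extract_year(event_date, verbatim)
        if year is not None:
            counts[(institution, year // 10 * 10)] += 1
    return counts


# Made-up records: (institution code, eventDate, verbatimEventDate).
records = [
    ("BR", "1953-04-02", ""),
    ("BR", "", "1957"),
    ("E", "", "1890-05"),
    ("E", "", "s.d."),
]
counts = decade_counts(records)
```

Dates that are not in ISO 8601 form, such as free-text verbatim dates, are left out of the counts rather than guessed at.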
A stacked pie chart generated using Krona (
As detailed in Table
For seven of the institutions, the data associated with these images were downloaded using the GBIF API (accessed 2018-07-12). One of the other two (B) provided these data in DwC format directly. The other (H) had no method to export all data in DwC format, so data were extracted in JSON format using their API (https://api.laji.fi) (accessed 2018-07-09). These data were subsequently mapped to DwC in the R language using the package jsonlite (
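Mapping nested JSON records to flat Darwin Core terms can be illustrated with the sketch below. It is written in Python rather than R and is not the authors' script. The source field paths in FIELD_MAP are hypothetical placeholders, not the actual FinBIF field names; only the target terms (recordedBy, eventDate, country, recordNumber) are real Darwin Core terms.

```python
# Hypothetical source paths -> Darwin Core terms.
# The real FinBIF JSON structure may differ from these placeholders.
FIELD_MAP = {
    "gathering.collector": "recordedBy",
    "gathering.date": "eventDate",
    "gathering.country": "country",
    "unit.number": "recordNumber",
}


def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; None if any key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_dwc(record, field_map=FIELD_MAP):
    """Flatten one nested record into a dict keyed by Darwin Core terms."""
    row = {}
    for source, term in field_map.items():
        value = get_path(record, source)
        if value is None:
            continue
        if isinstance(value, list):
            # Multiple values (e.g. several collectors) are joined into one string.
            row[term] = "; ".join(str(item) for item in value)
        else:
            row[term] = str(value)
    return row


# Made-up record for illustration.
example = {
    "gathering": {"collector": ["A. Collector", "B. Collector"], "date": "1950-06-01"},
    "unit": {},
}
dwc_row = to_dwc(example)
```

Keeping the mapping in a single table makes dataset-specific transformations easy to review, which matters because such mappings are rarely generic.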
For each of a subset of 250 specimens with labels in English, two PNG overlays were manually made in GIMP (
Locations were mapped using their country code and decimal coordinates (Fig.
The location of geolocated specimens within the dataset and the number of specimens from each country. A total of 267 (15%) specimens have coordinates associated with them and 1,695 (94%) are located to a country. Both categories may overlap. The map uses a Mollweide equal-area projection.
The higher taxonomy of specimens was determined from the GBIF backbone taxonomy when those data came from GBIF. For the data that did not originate from GBIF, we matched the family (B specimens) or genus (H specimens) in the data to the backbone (
Although we aimed at incorporating 25% nomenclatural type specimens within the dataset, according to their data, only 19% are types. This lower value is because some collections are not created primarily for taxonomy and they therefore do not hold many types. Non-type specimens were selected as specimens without any type status. Hence, some specimens listed as non-types could actually be types, if they had not been identified as such in their digital publication.
Regarding the specimen collector names, there are 1,170 different names associated with the dataset. However, it is likely there are duplicates amongst those 1,170, as some names will not be exact textual matches. Only 6% of the specimens had no collector information.
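One simple way to look for such near-duplicates is to compare names after normalisation. The sketch below is an illustration only and was not part of the analysis reported here; the names are made up, and the approach does not catch reordered names (for example "J. Müller" versus "Müller, J.") or abbreviations.

```python
import re
import unicodedata
from collections import defaultdict


def normalise_name(name):
    """Strip accents, case and punctuation so that trivial variants coincide."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace punctuation with spaces, then collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", without_marks.casefold())
    return " ".join(cleaned.split())


def candidate_duplicates(names):
    """Group distinct spellings that share the same normalised form."""
    groups = defaultdict(set)
    for name in names:
        groups[normalise_name(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Made-up collector strings for illustration.
names = ["Müller, J.", "Muller J", "MULLER, J.", "Smith, A."]
duplicates = candidate_duplicates(names)
```

Groups returned this way are only candidates: two different collectors can share a normalised name, so each group still needs manual checking.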
A broad temporal coverage of the dataset was promoted by forcing a separation at 1970 (Table
This landing page contains a CSV file compiling all data associated with herbarium specimens that are part of this dataset, as they could be found on GBIF, JACQ or FinBIF.
In addition, DOIs of the individual specimens uploaded to Zenodo and direct links to the different files (JPEG, TIFF, JSON, PNG) are also included. Index of these added variables:
- persistentID: Persistent Identifier of the collection specimen. The persistent identifier is maintained by each institution and should always lead to the most up-to-date version of a digital specimen record. Apart from the persistent identifier, other data are liable to being amended in institutional databases. Data uploaded as part of this dataset will not be updated with changes at the collection's repository, but this persistent URI will always point to the up-to-date information in the institutional system.
- jpegURL, tiffURL, jsonURL: URLs pointing straight to the respective image and data files themselves, to facilitate (selective) batch downloads.
- pngSegAllURL and pngSegSelURL: Segmented overlays of the herbarium specimens indicating the location of different labels and reference material on the sheet ("All") and their content ("Sel"). More information can be found in the paper (in prep.) associated with this data publication and the individual depositions themselves.
- DOI: The DOI of the deposition of images and data of these specimens on Zenodo. DOIs point to the most up-to-date version of these depositions at the time of the publication of this CSV file. As a rule, this CSV file will be updated should any changes happen to any of the depositions.
- jpegURL2, tiffURL2: A few herbarium sheets had labels on the back and consisted therefore of two scans. As a rule, the label scans are in this category.
Column label | Column description |
---|---|
Data and links.csv | Supplementary Info 5 |
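The link columns listed above are meant to support selective batch downloads. A minimal sketch of selecting links from the CSV is given below, assuming the column names jpegURL, tiffURL and DOI as listed and an institutionCode column from the Darwin Core data; the sample rows, DOIs and URLs are placeholders, and the actual download step is left out.

```python
import csv
import io


def select_urls(rows, column="jpegURL", **filters):
    """Return (DOI, URL) pairs for rows that match every filter and have a URL."""
    selected = []
    for row in rows:
        if all(row.get(key) == value for key, value in filters.items()):
            url = (row.get(column) or "").strip()
            if url:
                selected.append((row.get("DOI", ""), url))
    return selected


# Placeholder rows standing in for "Data and links.csv".
SAMPLE = """institutionCode,DOI,jpegURL,tiffURL
BR,doi-placeholder-1,https://example.org/a.jpg,https://example.org/a.tif
BR,doi-placeholder-2,,https://example.org/b.tif
K,doi-placeholder-3,https://example.org/c.jpg,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
br_jpegs = select_urls(rows, column="jpegURL", institutionCode="BR")
all_tiffs = select_urls(rows, column="tiffURL")
```

Returning the DOI alongside each URL keeps the citation information attached to whatever subset is downloaded.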
As an increasing number of herbarium specimens are digitally imaged, the possibility of automated analysis becomes more attractive. However, simply providing access to the digital images does not enable full use of the resource. The data associated with the image also need to be accessible for most analyses and this requires these data to be digitised, categorised and standardised (
The digitisation of label data is one of the most significant bottlenecks to the full digitisation of herbaria (
Another approach to data extraction is automation. This might involve optical character recognition of text or other forms of pattern recognition (
Digital images of herbarium specimens may also be used for other purposes, for example, to extract trait data from plants or to identify the species in question (
Some analysis techniques may only be suitable for certain types of specimen, for example, when ML algorithms are trained only in one language or the handwriting of one collector. Here, we have provided a wide variety of test images from which subsamples can be selected for different purposes. However, in selecting the images, we have not attempted to provide a random subsample of specimens, but have tried to provide a good cross-section of the different kinds. This means that some countries, languages and scripts are not represented at all in the collection and the collection will be biased geographically and taxonomically. However, for those countries and languages represented in the set, there will be multiple specimens.
The whole dataset has been archived to the Zenodo research data repository (https://zenodo.org, Suppl. material
We gratefully acknowledge all the collectors of the specimens in our test dataset. We also wish to thank Nuno Veríssimo Pereira for identifying the language(s) used on the labels of each specimen and Phillippe Hendricx for creating the segmented overlays. We thank curators, technical collection staff and the management of the herbaria that provided digitised specimens for their support. Funding for this project was received from the EC H2020 programme (ICEDIG: RIA 777483).
Interactive version of the taxonomic coverage chart, Figure 2 in the article. Rendered using Krona (https://github.com/marbl/Krona).
This R script was used to obtain metadata for the specimens from H in Darwin Core format, using the FinBIF API. Certain transformations depend on what was present in this specific dataset and might not be generically applicable.
This ZIP contains the CSV files necessary for the R script which retrieved and joined the metadata of the dataset and produced most of the graphs.
In addition to seven files with 200 barcodes each for BR, BM, E, K, L, P and TU and two files containing all metadata for B and H, it also contains a file listing the label language for each specimen, a summary table for languages in the dataset and a file mapping DwC terms to their overarching categories.
This R script file contains the scripts used to obtain the metadata, join it, export it and produce the paper's graphs. The exception is the taxonomic graph, which was made from data exported from R into the Krona Excel macro template, available on GitHub. The CSV files needed for this script are in a separate ZIP file.
This file contains data for the 1,800 digitised specimens that make up this paper's dataset. The joined data originate from different sources as described above and have also been filtered for a few repository-specific variables, such as GBIF taxon keys. DwC extensions are encoded in JSON.
This file also contains a list of DOIs and Zenodo file URIs (jpegURI, tiffURI...) for the images of each specimen in this dataset. Using these links and DOIs, it should be easy to retrieve and cite any portion of this dataset as needed.
This R script was used to convert data in a CSV format to single JSON-LD files. The ZIP file also contains the original CSV file.
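The idea of turning one CSV row into one JSON-LD document can be sketched as follows. This is a Python illustration, not the R script itself; it assumes that the columns are Darwin Core terms, that the persistentID column serves as the record identifier, and that dwc:Occurrence is an acceptable type. The sample row is made up.

```python
import csv
import io
import json

# Darwin Core terms namespace.
DWC = "http://rs.tdwg.org/dwc/terms/"


def row_to_jsonld(row, id_column="persistentID"):
    """Build a JSON-LD document from one CSV row, skipping empty cells."""
    doc = {
        "@context": {"dwc": DWC},
        "@id": row[id_column],
        "@type": "dwc:Occurrence",  # assumed type, see the note above
    }
    for key, value in row.items():
        if key != id_column and value:
            doc[f"dwc:{key}"] = value
    return doc


# Made-up row for illustration; the identifier is a placeholder.
SAMPLE = """persistentID,recordedBy,recordNumber,eventDate
https://example.org/specimen/1,A. Collector,,1950-06-01
"""

documents = [row_to_jsonld(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
serialised = json.dumps(documents[0], indent=2, ensure_ascii=False)
```

Each serialised string could then be written to its own file, giving one JSON-LD file per specimen.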
This Python script was used to upload the dataset to the Zenodo platform through their API.