Comprehensive leaf size traits dataset for seven plant species from digitised herbarium specimen images covering more than two centuries

Abstract
Background
Morphological leaf traits are frequently used to quantify, understand and predict plant and vegetation functional diversity and ecology, including environmental and climate change responses. Although morphological leaf traits are easy to measure, their coverage for characterising variation within species and across temporal scales is limited. At the same time, there are about 3100 herbaria worldwide, containing approximately 390 million plant specimens dating from the 16th to 21st century, which can potentially be used to extract morphological leaf traits. Globally, plant specimens are rapidly being digitised and images are made openly available via various biodiversity data platforms, such as iDigBio and GBIF. Based on a pilot study to identify the availability and appropriateness of herbarium specimen images for comprehensive trait data extraction, we developed a spatio-temporal dataset on intraspecific trait variability containing 128,036 morphological leaf trait measurements for seven selected species.
New information
After scrutinising the metadata of digitised herbarium specimen images available from iDigBio and GBIF (21.9 million and 31.6 million images for Tracheophyta; accessed date December 2020), we identified approximately 10 million images potentially appropriate for our study. From the 10 million images, we selected seven species (Salix bebbiana Sarg., Alnus incana (L.) Moench, Viola canina L., Salix glauca L., Chenopodium album L., Impatiens capensis Meerb. and Solanum dulcamara L.), which have a simple leaf shape, are well represented in space and time and have high availability of specimens per species. We downloaded 17,383 images. Out of these, we discarded 5779 images due to quality issues. We used the remaining 11,604 images to measure the area, length, width and perimeter on 32,009 individual leaf blades using the semi-automated tool TraitEx. The resulting dataset contains 128,036 trait records.
We demonstrate its comparability to trait data measured in natural environments following standard protocols by comparing it with trait values from the TRY database. We conclude that the herbarium specimens provide valuable information on leaf sizes. The dataset created in our study, by extracting leaf traits from the digitised herbarium specimen images of seven selected species, is a promising opportunity to improve ecological knowledge about the adaptation of size-related leaf traits to environmental changes in space and time.


Introduction
Plant traits, the morphological, anatomical, physiological, biochemical and phenological characteristics of plants measurable at the individual plant level (Violle et al. 2007), are vital to quantify, understand and predict plant and vegetation functional diversity and ecology (Grime 1974, McGill et al. 2006). Leaf attributes are amongst the most important traits as they provide relevant information about plant and ecosystem function (Funk et al. 2017, Poorter and Bongers 2006). Leaf area (the one-sided projected area of a leaf) is key to understanding the leaf energy balance, which affects photosynthesis and respiration rates. Leaf area is amongst the commonly sampled quantitative plant attributes and has more than 200,000 records in the TRY plant trait database. Nevertheless, coverage is still limited, especially for characterising variation within species across geographical space and longer time-scales. The paucity of data and their limited representativeness restrict the scientific community's ability to understand and predict species and ecosystem responses to environmental and climate change (König et al. 2019, Tautenhahn et al. 2020).
Approximately 390 million plant specimens are stored in 3100 herbaria worldwide (Thiers 2019). These specimens provide good biogeographical and temporal coverage, dating back to the 16th century and offering a "window into the past" (Meineke et al. 2018, Lang et al. 2019). However, useful observations from before 1850 are very few (Groom 2015). Globally, many herbaria are undertaking digitisation campaigns and are making digitised specimen images openly accessible via various biodiversity data platforms, such as the Global Plants Database (2.6 million images), Natural History Museum Paris (8 million images) (Kirchhoff et al. 2018) and iDigBio (35.2 million images); most of them are published through the GBIF network (41 million images; accessed date December 2020), with considerable overlap. Due to the increasing availability of digitised herbarium specimens, efforts to extract species' phenological and trait information from these images, for example, using machine learning approaches, have increased (Carranza-Rojas et al. 2017, Willis et al. 2008, Younis et al. 2018, Weaver et al. 2020, Younis et al. 2020).
To characterise the variation of leaf size within species across time and space, we need specimen images with consistent information about sampling location and date. This information allows characterising the environment in which the specimens had grown using external data, for example, from gridded climate or soil databases (Fick and Hijmans 2017, Hengl et al. 2017, Taylor et al. 2012). Georeference and sampling date are often available with the digital images of the herbarium specimens: 9.1 million out of 35.2 million images from iDigBio and 11.1 million out of 41 million images from GBIF provide georeference and sampling date. Given the significant number of herbarium specimens and the increasing numbers of digitised herbarium specimen images, including metadata information, we here evaluate the potential to use this information to overcome data limitations for size-related leaf traits in space and time. First, we identified the relevant biodiversity data platforms and analysed their metadata for images of Tracheophyta species with suitable leaves and sufficient additional information, i.e. sampling date and georeference. We selected seven species that were well represented in space and time. We downloaded the pertinent images and tested the applicability of a semi-automated tool, TraitEx (Gaikwad et al. 2019), to extract the leaf size traits: length, width, perimeter and area of individual leaf blades. This article describes the workflow to identify, select and download appropriate herbarium specimen metadata and images and extract leaf traits using the TraitEx software. We provide a comprehensive dataset of leaf size traits for seven species as an outcome of this approach. Finally, we compare the extracted measurements with data from the global plant trait database TRY.

Sampling methods
Study extent: Apart from the biodiversity data platforms mentioned earlier, there are several other institutions, libraries and herbaria that store digital herbarium specimen images, such as the Utah Valley State College Digital Herbarium, Moscow University Herbarium, vPlants A Virtual Herbarium of the Chicago Region, The Virtual Herbarium of The New York Botanical Garden, WTU Herbarium Image Collection and OSU Type Specimen Images and Original Descriptions. Our assessment of the publicly available metadata from these resources revealed that iDigBio and GBIF harvest the data from several institutions, libraries and herbaria worldwide to make the data openly available to the scientific community and society through their respective data platforms. Therefore, we decided to focus on extracting metadata information for the digital herbarium specimen images from iDigBio and GBIF (21.9 million and 31.6 million images only for Tracheophyta, respectively; accessed date December 2020).
Sampling description: Downloading a large number of available herbarium specimen images from the repositories takes a substantial amount of time. As a consequence, the images for trait extraction were selected in five steps (Figs 1, 2): (1) identification of specimens from GBIF and iDigBio with sufficient metadata information; (2) harmonising names to species level, based on the GBIF backbone taxonomy; (3) selecting appropriate species for our study; (4) acquisition of image URLs and exclusion of duplicates; (5) download of images and final selection for trait measurements.

Figure 1.
Flowchart for processing the metadata from iDigBio and GBIF on digitised herbarium specimens (apart from the identification of data sources, all steps are automated using Python scripts).
The workflows for metadata extraction, downloading digital herbarium specimens and trait measurements using TraitEx are shown in Fig. 1 and Fig. 2.
Records from the taxonomic groups Polypodiopsida, Poales, Marattiopsida, Pinopsida, Lycopodiopsida and Equisetopsida were excluded after the download of metadata information because their leaf sizes or shapes were considered problematic for trait measurements.
This search resulted in 9,998,299 specimen images for 182,409 species (2,426,902 images from iDigBio and 7,571,397 images from GBIF). The spatial coverage of preselected images is global across all continents. However, the geo-points located in the oceans indicate problems with georeferences (Fig. 3). The temporal domain of the images mainly covers from 1900 to 2019, with few specimens collected before 1900 (Fig. 4).

Figure 3.
Spatial distribution of metadata for 9,998,299 digital herbarium specimen images from iDigBio and GBIF for Tracheophyta (excluding Polypodiopsida, Poales, Marattiopsida, Pinopsida, Lycopodiopsida and Equisetopsida) with georeference and sampling date available (for more details, refer to section 'Sampling methods').

Figure 4.
Temporal distribution of metadata for 9,998,299 digital herbarium specimen images from iDigBio and GBIF for Tracheophyta (excluding Polypodiopsida, Poales, Marattiopsida, Pinopsida, Lycopodiopsida and Equisetopsida) with georeference and sampling date available (for more details, refer to section 'Sampling methods').
Harmonising names to species level, based on the GBIF backbone taxonomy: We consolidated the scientific names of species (given by authors of the specimen; see columns 'iDigBio scientificName (given)' and 'GBIF scientificName (given)'), as well as the corresponding accepted scientific names (see columns 'iDigBio scientificName (accepted)' and 'GBIF scientificName (accepted)' in "Digital Herbarium Specimen data", refer to section 'Data resources') provided by iDigBio and GBIF, respectively.
Since the scientific names provided by iDigBio and GBIF for the same specimen sometimes differ, we additionally provide the corresponding scientific name of the GBIF backbone taxonomy for both the given scientific names of iDigBio and those of GBIF (see columns 'GBIF Backbone Taxonomy scientific name for iDigBio records' and 'GBIF Backbone Taxonomy scientific name for GBIF records').
To allow for grouping specimen images per species, we further simplified the scientific names from the GBIF backbone taxonomy to binomial names including only genus and species information and ignoring, for example, varieties or subspecies (see column 'Binomial species name for aggregation'). We excluded images for which no species name according to the GBIF backbone taxonomy could be identified or where only genus or even broader information was available.
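The name simplification described above can be illustrated with a minimal Python sketch. The function name `to_binomial` and the regular expression are hypothetical and are not taken from the scripts used in this study; the sketch assumes that a backbone name starts with a capitalised genus followed by a lower-case species epithet and does not handle hybrid markers or other special cases.

```python
import re
from typing import Optional

# Hypothetical pattern: capitalised genus, then a lower-case species epithet.
_BINOMIAL = re.compile(r"^([A-Z][a-z-]+)\s+([a-z][a-z-]+)\b")


def to_binomial(scientific_name: str) -> Optional[str]:
    """Reduce a scientific name to 'Genus species'.

    Returns None when only genus-level (or broader) information is present,
    so that such records can be excluded.
    """
    if not scientific_name:
        return None
    match = _BINOMIAL.match(scientific_name.strip())
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


if __name__ == "__main__":
    for name in ["Salix bebbiana Sarg.", "Alnus incana (L.) Moench", "Salix L."]:
        print(name, "->", to_binomial(name))
```

Authorship strings and infraspecific ranks are simply dropped because only the first two tokens are kept; a name such as 'Salix L.' yields no binomial because its second token is not a lower-case epithet.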
Selecting appropriate species for our study: The distribution of images per species has the characteristics of a long-tail distribution: few species with many images, but many species with few images. However, for about 400 of the 182,409 preselected species, iDigBio and GBIF provide more than 2000 images (Fig. 5).

Figure 5.
Number of digital herbarium specimen images per species available from iDigBio and GBIF, based on the 9,998,299 images for Tracheophyta (excluding Polypodiopsida, Poales, Marattiopsida, Pinopsida, Lycopodiopsida and Equisetopsida) with georeference and sampling date available (for more details, refer to section 'Sampling methods').
We selected the most promising species for trait data extraction, based on the number of preselected records per species, also considering that sampling sites and dates should be well spread across the species distribution range and in the temporal domain. In addition, the species should have a leaf size and a visible petiole that make the leaves easy to measure on the specimen images. Based on these conditions, we selected Salix bebbiana Sarg., Alnus incana (L.) Moench, Viola canina L., Salix glauca L., Chenopodium album L., Impatiens capensis Meerb. and Solanum dulcamara L. Table 1 provides the attribution of species and subspecies names received from iDigBio and GBIF to the accepted names in the GBIF taxonomic backbone for the selected species.

Table 1.
Attribution of (given) scientific names to accepted species names, based on the GBIF backbone taxonomy for the seven species of interest: Salix bebbiana Sarg., Alnus incana (L.) Moench, Viola canina L., Salix glauca L., Chenopodium album L., Impatiens capensis Meerb. and Solanum dulcamara L.

Acquisition of image URLs and exclusion of duplicates:
For the selected species, we extracted the Uniform Resource Locators (URLs) from iDigBio and GBIF, under which the herbarium specimen images are stored. Based on the institution code, catalogue number and URL of the specimen images, we identified duplicates within species and excluded them.

Download of images and final selection for trait measurements:
Each herbarium specimen has a unique combination of institution code and catalogue number to track the specific specimen in different herbaria. We used these two codes to create a unique ID for each specimen (see column 'SpecimenID' in "Digital Herbarium Specimen data", refer to section 'Data resources'). Additionally, we enumerated the SpecimenIDs by image to provide unique ImageIDs in case multiple images were provided for the same specimen. These ImageIDs served as unique image names while downloading the digital herbarium specimen images. As the ImageIDs are based on institution code and catalogue number, they are also helpful for tracking the specific digital herbarium specimen images in the future.
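As an illustration of this ID scheme, the following Python sketch derives a SpecimenID from institution code and catalogue number and enumerates it per image to obtain an ImageID. The separator, the field names of the input records and the exact ID layout are assumptions made for this example only; the actual format used in the dataset may differ.

```python
from collections import defaultdict


def assign_ids(records):
    """Add 'SpecimenID' and 'ImageID' to each record.

    records: list of dicts with the (assumed) keys 'institutioncode',
    'catalognumber' and 'binomial'. One record corresponds to one image.
    """
    counters = defaultdict(int)  # number of images seen so far per specimen
    result = []
    for rec in records:
        specimen_id = f"{rec['institutioncode']}_{rec['catalognumber']}"
        counters[specimen_id] += 1
        # Hypothetical layout: SpecimenID, running image number, binomial name.
        binomial = rec["binomial"].replace(" ", "_")
        image_id = f"{specimen_id}_{counters[specimen_id]}_{binomial}"
        result.append({**rec, "SpecimenID": specimen_id, "ImageID": image_id})
    return result
```

Two images of the same specimen therefore share one SpecimenID but receive different ImageIDs, which can serve directly as file names for the downloaded images.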
We downloaded 17,383 digital herbarium specimen images for the seven species of interest, based on their URLs using automated Python routines. For each species, it took approximately 4 to 8 hours to download the images. The specific details of the downloaded digital herbarium specimen images are provided in Suppl. material 2 and column descriptions in Table 3.

Table 3.
Description of the columns provided in the file Suppl. material 2, which explains the problems that caused us to discard several specimen images.

RowID
Each entry in the data file.

ImageID
Unique identity for each digital herbarium specimen (in case of multiple entries, measurements were made on different leaves within the same digital herbarium specimen).

Image
If there was no image in the digital herbarium specimen, the column 'Image' was updated as 'No'; all other possibilities were updated as the string 'NA'.

Number of leaves measured
Contains the number of measured leaves in each digital herbarium specimen; if not measured, updated as 'NA'.

Remarks_1
Contains the remarks 'Juvenile leaves' and 'Saplings'; all other possibilities were updated as 'NA'. A plant produces juvenile leaves in its earlier years (ordinarily small compared to adult leaves). A sapling is a young tree. We excluded juveniles and saplings to avoid bias in the data.

Remarks_2
Contains the remarks 'No leaves', 'No measurable leaves', 'No measurable leaves tape' and 'photograph'; all other possibilities were updated as 'NA'. No leaves: the digital herbarium specimen has no leaves (only stem). No measurable leaves: the digital herbarium specimen has no measurable leaves, for example, only overlapping leaves, which are not measurable with TraitEx. No measurable leaves tape: the digital herbarium specimen has no measurable leaves because the leaves are covered with tape. Photograph: the downloaded image is a photograph and not a herbarium specimen.

Ruler
If there was no ruler or no appropriate ruler (ruler shorter than 10 cm or pixelated ruler), the column 'Ruler' was updated as 'No'; all other possibilities were updated as 'NA'.

Binomial species name for aggregation
Binomial name for aggregating the scientific names at species level, i.e. genus and species only (based on the columns 'GBIF Backbone Taxonomy scientific name for GBIF records' and 'GBIF Backbone Taxonomy scientific name for iDigBio records').

In addition to removing duplicates from the metadata, based on the institution code, catalogue number and URLs, we found 65 duplicates amongst the downloaded images (for example, where the digital image of the same herbarium specimen was stored in the same or different repositories with different catalogue numbers or institution codes). To systematically identify these duplicates, we used the duplicate-file finder fslint, which checks the file size, ensures that files are not hard-linked and compares digital signatures such as md5sum and sha1sum; the duplicates identified in this way were excluded.
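The principle of this content-based duplicate check can be sketched in Python with the standard library only. This is not the fslint implementation: it is a simplified illustration that groups files by size first and then by their MD5 and SHA-1 digests, and it omits the hard-link check. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, algorithm):
    """Return the hex digest of a file, read in 1 MiB chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(paths):
    """Return groups (sorted lists) of files with identical content."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[Path(path).stat().st_size].append(Path(path))

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique file size cannot belong to a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            key = (file_digest(path, "md5"), file_digest(path, "sha1"))
            by_hash[key].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing file sizes before hashing avoids reading most files at all, which matters for collections of several thousand high-resolution images. Note that such a check only finds byte-identical files; re-encoded or resized copies of the same image would not be detected.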
Some of the remaining images had other problems, such as containing only juvenile leaves (Fig. 6a), incomplete leaf (Fig. 6b), no ruler (Fig. 6c), presence of specimen as a sapling (Fig. 6d), only overlapping leaves (Fig. 6e), no petiole (Fig. 6e) or live photographs (Fig. 6f). In addition to these problematic images, we identified digital herbarium specimens with contorted shapes, as shown in Fig. 7. These images were considered not suitable for measuring leaf traits and discarded (Fig. 2).
The specific details of the 17,383 digital herbarium specimen images are provided in the file Suppl. material 2 and the column descriptions in Table 3.

Figure 7.
An additional example for discarded digital herbarium specimens: potentially contorted images. The example shows a good horizontal image and a contorted vertical image; we repeated a similar process, as discussed below, where the horizontal image was contorted and the vertical image was good.
a: Horizontal orientation of the digital herbarium specimen: we considered only images with horizontal orientation for trait measurements.
b: Vertical orientation of the digital herbarium specimen: the image represents the same specimen and image, but in a different orientation with contorted shapes. We removed all images with the vertical orientation of herbarium specimens because the images were potentially contorted by the resize process while downloading.

Figure 8.
Numbers of images downloaded per species ('total number of images') and finally used for trait measurements ('leaf trait extracted images') for the seven species of interest.

We measured the leaf traits with TraitEx, a semi-automated tool to measure size-related traits on digitised herbarium specimen images. We first uploaded each image into TraitEx and calibrated the length ruler of TraitEx against the ruler bar on the image (Fig. 9, lower-left corner of the specimen image) since the unit length of the ruler varies from image to image, depending on the image resolution. After calibrating the ruler, the leaf to be measured was selected and an approximate boundary line was drawn 'by hand' around the selected leaf (Fig. 9). The exact values of the different traits for the selected leaf within the determined boundary are then measured automatically. TraitEx identifies the exact mask of the identified leaf (Fig. 10) and measures the size-related traits on this mask. The results are displayed on the screen (Fig. 9) and saved as a CSV file. The cropped image (inside the boundary line in Fig. 9) and the exact mask of the measured leaf (Fig. 10) are saved for reproducibility. Masks of the leaves contain information on the location of leaves on the digital image; vector data on the boundary lines drawn around the leaf are also saved. Data on masks could be used as training data for leaf segmentation in the future and are available from the corresponding authors upon request.
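The role of the ruler calibration can be illustrated with a short Python sketch: a known length on the ruler defines a scale in pixels per centimetre, which converts pixel measurements into physical units. This is a generic illustration of the unit conversion and not the TraitEx implementation; all function and parameter names are hypothetical.

```python
import math


def pixels_per_cm(ruler_start_px, ruler_end_px, ruler_length_cm):
    """Scale factor from two points marked on the ruler bar of an image."""
    distance_px = math.dist(ruler_start_px, ruler_end_px)
    if ruler_length_cm <= 0 or distance_px == 0:
        raise ValueError("ruler length and marked distance must be positive")
    return distance_px / ruler_length_cm


def to_physical_units(length_px, width_px, perimeter_px, area_px, scale):
    """Convert pixel measurements to cm (lengths) and cm^2 (area)."""
    return {
        "length_cm": length_px / scale,
        "width_cm": width_px / scale,
        "perimeter_cm": perimeter_px / scale,
        # Area scales with the square of the linear scale factor.
        "area_cm2": area_px / scale ** 2,
    }


if __name__ == "__main__":
    scale = pixels_per_cm((100, 50), (1100, 50), 10)  # 1000 px span 10 cm
    print(to_physical_units(500, 250, 1200, 90000, scale))
```

Because the scale enters the area quadratically, a calibration error of 1% in the ruler translates into an error of about 2% in leaf area, which is one reason why images without an appropriate ruler were discarded.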

Figure 9.
A typical herbarium specimen image in TraitEx. A boundary line (red) has been drawn 'by hand' to identify the leaf of interest ('cropped leaf'). The morphological trait values of that leaf as measured by TraitEx are provided in the upper right corner.

Figure 10.
The exact mask of the measured leaf in Fig. 9 from the TraitEx workflow.
This measurement process was repeated for each leaf to be measured on a specimen image. We measured the traits on 1 to 5 well-developed leaves per image, depending on the suitability of leaves. On average, it took 10 minutes to measure five leaves on an individual specimen image. This includes the time to import the specimen image into TraitEx, mark up and measure the leaves of interest and visually check the measurements. TraitEx saves the measured leaf trait records as individual CSV files in the folder where the herbarium specimen image is stored. The CSV files were concatenated after all measurements were finished for a specific species. A detailed description of TraitEx and the measurement process is provided on the TraitEx website.
Finally, we combined the measured leaf trait values with the metadata downloaded from iDigBio and GBIF, based on their ImageIDs (see Suppl. material 1 and section 'Data resources').
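The concatenation of the per-image CSV files and the join with the metadata can be sketched as follows, using only the Python standard library. The folder layout, the column names other than 'ImageID' and the function names are assumptions for illustration; they do not describe the actual file structure written by TraitEx.

```python
import csv
from pathlib import Path


def read_rows(csv_path):
    """Read one CSV file into a list of dicts (one dict per row)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def concatenate_measurements(folder):
    """Concatenate all CSV files of one species folder, in file-name order."""
    rows = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        rows.extend(read_rows(csv_path))
    return rows


def join_metadata(measurements, metadata, key="ImageID"):
    """Attach the metadata record with the same ImageID to each measurement.

    Measurements without a matching metadata record are dropped.
    """
    lookup = {row[key]: row for row in metadata}
    joined = []
    for row in measurements:
        meta = lookup.get(row[key])
        if meta is not None:
            joined.append({**meta, **row})
    return joined
```

Since several leaves are measured per image, the join is one-to-many: the metadata of one image is repeated for each of its leaf records.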

Uncertainties of trait measurements:
Leaf trait measurements from herbarium specimens are associated with uncertainties due to: i) the shrinking of leaves in the preservation process, ii) imaging the herbarium specimen, iii) manual digitisation within the semi-automated workflow of TraitEx and iv) the automated trait measurement of TraitEx.
The uncertainty due to shrinkage during drying is about 3.5-15.2% (Tomaszewski and Górzkowska 2016, Kozlov et al. 2021). To provide estimates for the uncertainties associated with the processes (ii) scanning and (iv) automated trait extraction with TraitEx, the authors of the TraitEx software (Triki et al. 2021) selected 20 herbarium specimens of 19 species. As a reference measurement, they measured two leaves per specimen on average using a vernier scale and measured the same leaves with TraitEx on the corresponding digitised herbarium specimen images. The authors then compared the leaf trait values between TraitEx and the manual measurements. The correlation between manual trait measurements and TraitEx measurements was very high (0.998 for leaf length and 0.997 for leaf width) and did not show a bias towards smaller or larger values. However, the uncertainty scales with the reference trait values (heteroscedasticity of in-situ measurements). The standard error ratio of TraitEx to reference measurements is approximately 1% (leaf length 1.02% and leaf width 0.75%). To estimate uncertainties associated specifically with (iii), the manual digitisation within TraitEx, we measured a single leaf 10 times on a herbarium specimen using TraitEx and repeated the same process on seven different digital herbarium specimens, covering all leaf sizes. The resulting uncertainties were very small (Table 4).

Table 4.
The standard error for leaf area, leaf length, leaf width and leaf perimeter of a single leaf measured 10 times on the same herbarium specimen, with the same process repeated for seven different digital herbarium specimens with TraitEx.

Trait             Standard error
Leaf area         0.039663 cm²
Leaf length       0.012555 cm
Leaf width        0.005268 cm
Leaf perimeter    0.035444 cm

Therefore, we rounded trait values ("Digital Herbarium Specimen data", refer to section 'Data resources') for leaf width, leaf length and leaf perimeter to a precision of 0.1 cm and leaf area to a precision of 0.1 cm², which corresponds to uncertainties of approximately 1% for leaf length and width for the trait values of the seven species we are providing here.
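For readers who wish to repeat such a repeatability test, the standard error of repeated measurements of one leaf can be computed as the sample standard deviation divided by the square root of the number of repeats. The Python sketch below shows this textbook calculation and the subsequent rounding; the measurement values are invented for illustration, and how the values in Table 4 were aggregated across the seven specimens is not specified here, so the sketch covers a single leaf only.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    if len(values) < 2:
        raise ValueError("at least two repeated measurements are required")
    return statistics.stdev(values) / math.sqrt(len(values))


if __name__ == "__main__":
    # Hypothetical leaf length (cm) of one leaf, digitised 10 times.
    repeats = [5.00, 5.02, 4.98, 5.01, 4.99, 5.00, 5.03, 4.97, 5.00, 5.00]
    print("standard error (cm):", standard_error(repeats))
    # Reported trait values are rounded to a precision of 0.1 cm.
    print("reported length (cm):", round(statistics.mean(repeats), 1))
```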

Geographic coverage
Description: Fig. 11 shows the spatial distribution of specimen sampling sites for measured images for the seven species of interest (green dots). We plotted the sampling sites for respective leaf trait measurements in the TRY database (red dots) for comparison. The latitudes and longitudes are provided by iDigBio and GBIF; any errors are not the authors' responsibility.
Notes: Fig. 12 shows the distribution of specimen sampling years in time for the seven species of interest. Specimen sampling dates back into the 18th century. The data package contains 128,036 trait records for leaf-blade area, length, width and perimeter from 32,009 leaves on 11,604 specimen images for the species Salix bebbiana Sarg., Alnus incana (L.) Moench, Viola canina L., Salix glauca L., Chenopodium album L., Impatiens capensis Meerb. and Solanum dulcamara L., including the respective metadata.
In the supplementary materials, we provide additional information for each of the 17,383 downloaded images: (1) the number of leaves measured on each image or the reason(s) for exclusion of the image from trait measurements (Suppl. material 2); (2) the metadata received from iDigBio and GBIF for each image (Suppl. material 1). For images received via GBIF, we also provide a table with the references (Table 2).

RowID
Unique identifier for each entry in the data file.

Leaf length in cm
Leaf length of specific entry in cm.

Leaf width in cm
Leaf width of specific entry in cm.

Leaf area in cm²
Leaf area of specific entry in cm².

Leaf perimeter in cm
Leaf perimeter of specific entry in cm.

ImageID
Unique identity for each digital herbarium specimen (in the case of multiple entries, measurements are made on different leaves within the same digital herbarium specimen). The binomial name for aggregation is added at the end of the ImageID to ensure each ImageID is unique across the species.

SpecimenID
Unique ID for each sample (a combination of Institutioncode and Catalognumber). Multiple occurrences of a SpecimenID are possible if herbarium specimens are collected from the same sample; to obtain unique identifiers in such cases, the ImageID is created by enumerating the SpecimenID.

Institutioncode
Code of the institution from which the specimen came.

Catalognumber
Unique identifier of specific specimen in the respective herbarium.

Phylum
Phylum of the species.

Class
Class of the species.

Order
Order of the species.

Latitude (from iDigBio and GBIF)
Latitude of the collected specimen (extracted from iDigBio and GBIF metadata).

Longitude (from iDigBio and GBIF)
Longitude of the collected specimen (extracted from iDigBio and GBIF metadata).

Sampling date
Sampling date of the collected specimen (extracted from iDigBio and GBIF metadata).

Source
From where the digital herbarium specimen was extracted (iDigBio or GBIF or iDigBio and GBIF). If the source is only iDigBio, the metadata come only from iDigBio, which means the corresponding GBIF entries are updated with the string 'NA', and vice versa.

UUID
Universally Unique IDentifier (UUID) is a unique identifier in iDigBio (this id can be used in the future to request the same data from iDigBio).

GBIFID
GBIFID is a unique identifier in GBIF (this id can be used in the future to request the same data from GBIF).

AccessURL
Link where the digital herbarium specimen is stored.

Additional information
Comparison to trait records from plants in natural environments
To test the suitability of leaf trait measurements from digital herbarium specimens, we compared the density distributions of four traits (leaf blade area, length, width and perimeter) measured on the herbarium specimen images against records based on standard measurement protocols from the TRY database on plant traits (Kattge et al. 2011). In this context, leaf area is defined as the projected area of an individual leaf (Pérez-Harguindeguy et al. 2013), which matches the measurements on the specimen images. However, the standard measurement protocol recommends measuring about ten mature leaves from the sunlit part of the canopy for each site in the natural environment and during the full flowering period (Pérez-Harguindeguy et al. 2013; Figs 13, 14, 15, 16). The spatial distribution of the records from the TRY database was very limited compared to the spatial range of the measured herbarium specimens (see Fig. 11). The density distributions of trait records, based on herbarium specimens, follow an approximately normal distribution (after log-transformation) for all measured leaf traits and species. The range of trait values from the TRY database (where available) overlaps the range of trait values from the herbarium specimen images for all trait-species combinations (see Figs 13, 14, 15, 16). However, where the number of trait records derived from the TRY database was sufficient for a more detailed comparison, the density distributions, based on specimen images, show a small, but consistent bias to smaller values (e.g. leaf area, one of the best covered continuous traits in the TRY database). We tend to explain this by the differences in the measurement protocols: the standard protocol recommends selecting ten mature leaves from the sunlit canopy, while we selected up to five leaves from a specimen image. Selecting several leaves from a specimen image carries a higher risk of sampling smaller, not fully mature leaves; in addition, leaf sizes might become smaller in the drying process during the preparation of herbarium specimens (Tomaszewski and Górzkowska 2016). However, this small, but rather consistent bias still needs to be addressed, based on comprehensive sampling across more species.

Figure 13.
Comparison of the density distributions of leaf blade area (mm², log-transformed) from herbarium specimen images to trait records, derived from the TRY database (representing trait measurements from live individuals by standard protocols) for the seven species of interest.

Figure 14.
Comparison of the density distributions of leaf blade length (mm, log-transformed) from herbarium specimen images to trait records, derived from the TRY database (representing trait measurements from live individuals by standard protocols) for the seven species of interest.

Figure 15.
Comparison of the density distributions of leaf blade width (mm, log-transformed) from herbarium specimen images to trait records, derived from the TRY database (representing trait measurements from live individuals by standard protocols) for the seven species of interest.

Figure 16.
Comparison of the density distributions of leaf blade perimeter (mm, log-transformed) from herbarium specimen images to trait records, derived from the TRY database (representing trait measurements from live individuals by standard protocols) for the seven species of interest.

Discussion
Millions of digitised herbarium specimen images have become available during recent years and the numbers are expected to rise. Based on the metadata provided via the iDigBio and GBIF data portals, we were able to filter the images by taxonomy and by the required additional information (georeference and sampling date). This enabled us to constrain the 21.9 million and 31.6 million herbarium specimen images available via iDigBio and GBIF (accessed date December 2020) for plants in the Phylum Tracheophyta to about 10 million images with sufficient additional information. Based on this preselection, we identified seven species most promising for data extraction and analysis in the context of this pilot study. We finally downloaded 17,383 images. We had to discard 5779 of the downloaded images (about 1/3) because of duplications or other problems not visible in the metadata. Nevertheless, we finally retained about 900 to 2500 images per species, covering broad species distribution ranges and dating back to the 19th century.
With trait values extracted from about three leaves per image on average, the final trait dataset contains 128,036 records for leaf area, length, width and perimeter from 32,009 leaves. Separate uncertainty analyses and the comparison to leaf traits measured in natural environments following standard protocols indicate the validity of the trait values extracted from the herbarium specimen images. The dataset provided here increases the number of trait records for the seven selected species compared to other available trait data by up to three orders of magnitude and justifies hope for substantially improved analyses of trait variation within species and across space and time.
However, this pilot study also identified two bottlenecks towards extracting trait records for a more comprehensive number of species. The first bottleneck is the time needed to download the digitised herbarium specimen images. Even though this process was automated using Python scripts, it took approximately 5 hours for 2000 images. This was acceptable for our pilot study, based on seven species, but may be a problem for measurement campaigns across a more comprehensive number of species and images. Every improvement that helps to select the images appropriate for measurements without downloading them and/or speeds up the download of individual images will therefore substantially improve the opportunity for comprehensive trait data extraction from millions of herbarium specimen images. The other bottleneck is the time and the human input needed to measure trait values using the TraitEx software. TraitEx is a semi-automated tool and it takes about 10 minutes to extract, check and save the trait measurements per specimen image. This was manageable for our use case with a constrained number of 11,604 images suitable for trait measurements (out of 17,383 downloaded images). However, for comprehensive measurement campaigns across all appropriate images, potentially covering millions of images, a fully automated tool is needed, which seamlessly combines robust automated detection of suitable leaves with the precise measurement of size-related traits.
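The scale of both bottlenecks can be made explicit with a back-of-the-envelope calculation based on the figures quoted above (5 hours per 2000 images, 10 minutes per measured image, 11,604 measured images and about 10 million candidate images). The Python sketch below assumes strictly sequential processing by a single download process and a single operator, which is a simplifying assumption and not a description of how a larger campaign would be organised.

```python
# Figures taken from the text; variable names are chosen for this example.
DOWNLOAD_HOURS_PER_BATCH = 5
IMAGES_PER_BATCH = 2000
MEASUREMENT_MINUTES_PER_IMAGE = 10
MEASURED_IMAGES = 11_604
CANDIDATE_IMAGES = 10_000_000

# Download time per image in seconds.
seconds_per_download = DOWNLOAD_HOURS_PER_BATCH * 3600 / IMAGES_PER_BATCH

# Operator time spent on the images measured in this study, in hours.
measurement_hours = MEASURED_IMAGES * MEASUREMENT_MINUTES_PER_IMAGE / 60

# Extrapolation to all candidate images (sequential processing assumed).
download_days_all = CANDIDATE_IMAGES * seconds_per_download / 86_400
measurement_years_all = (
    CANDIDATE_IMAGES * MEASUREMENT_MINUTES_PER_IMAGE / 60 / 24 / 365
)

if __name__ == "__main__":
    print(f"download: {seconds_per_download:.1f} s per image")
    print(f"measurement in this study: {measurement_hours:.0f} h")
    print(f"download of all candidates: {download_days_all:.0f} days")
    print(f"measurement of all candidates: {measurement_years_all:.0f} years")
```

Under these assumptions, the download amounts to about 9 seconds per image and the measurements of this study to roughly 1934 operator hours. Extrapolated to all 10 million candidate images, sequential download alone would take almost three years and manual measurement about 190 years of uninterrupted work, which underlines the need for full automation.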