Biodiversity Data Journal :
Research Article
Corresponding author: Eric Schuettpelz (schuettpelze@si.edu)
Academic editor: Vincent Smith
Received: 21 Sep 2017 | Accepted: 21 Oct 2017 | Published: 02 Nov 2017
© 2017 Eric Schuettpelz, Paul Frandsen, Rebecca Dikow, Abel Brown, Sylvia Orli, Melinda Peters, Adam Metallo, Vicki Funk, Laurence Dorr
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Schuettpelz E, Frandsen P, Dikow R, Brown A, Orli S, Peters M, Metallo A, Funk V, Dorr L (2017) Applications of deep convolutional neural networks to digitized natural history collections. Biodiversity Data Journal 5: e21139. https://doi.org/10.3897/BDJ.5.e21139
Natural history collections contain data that are critical for many scientific endeavors. Recent efforts in mass digitization are generating large datasets from these collections that can provide unprecedented insight. Here, we present examples of how deep convolutional neural networks can be applied in analyses of imaged herbarium specimens. We first demonstrate that a convolutional neural network can detect mercury-stained specimens across a collection with 90% accuracy. We then show that such a network can correctly distinguish two morphologically similar plant families 96% of the time. Discarding the most challenging specimen images increases accuracy to 94% and 99%, respectively. These results highlight the importance of mass digitization and deep learning approaches and reveal how they can together deliver powerful new investigative tools.
convolutional neural networks, deep learning, machine learning, mass digitization, natural history collections
Deep learning can greatly surpass conventional machine learning by incorporating multi-layered neural networks capable of processing natural data in their raw form (
Deep learning might ultimately be leveraged in many ways for many different types of NHCs. Here, we focus on the digitized portion (currently 1.2 million specimens) of the United States National Herbarium. Our analyses, focused on the detection of specimens previously treated with mercury and the discrimination of superficially similar plant families, are complementary to those recently published on species identification (
To assess the potential of using CNNs to classify specimen images obtained from NHCs, we assembled two distinct datasets. Both datasets contained two image categories, with an approximately equal number of images in each category. Some specimen images were obtained with a traditional light box, but most were acquired via a conveyor system managed by the Smithsonian Digitization Program Office.
In the past, mercuric chloride was sometimes used by collectors or repositories to prevent insect damage to specimens. Unfortunately, this substance is also toxic to humans, so knowing the number and locations of contaminated specimens in a collection is important. One can test for mercury vapor in herbarium cabinets (
The automated identification of specimens could make a valuable contribution to biological research (
CNNs were built in Mathematica version 11.1 (Wolfram Research Inc.) and trained on NVIDIA K80 GPUs. For each dataset (stained/unstained and clubmoss/spikemoss), we randomly partitioned the images into three non-overlapping sets before each training run: 70% were used for training the model, 20% were used for validation, and 10% were reserved as our test dataset (i.e., the images used to train the CNNs were not used to assess their accuracy). We resized the color images to 256×256 pixels, creating a 3×256×256 tensor for our input layer (the first dimension indexing the red, green, and blue channels), and explored the performance of a variety of CNNs for each dataset. For the stained/unstained dataset, the best CNN included four convolutional and four pooling layers (Table
Constitutive layers and tensor/vector shapes for the stained/unstained CNN.
Layer | Type | Shape |
Input | 3-tensor | 3×256×256 |
ConvolutionLayer | 3-tensor | 16×252×252 |
BatchNormalizationLayer | 3-tensor | 16×252×252 |
Ramp (ReLU) | 3-tensor | 16×252×252 |
PoolingLayer | 3-tensor | 16×126×126 |
ConvolutionLayer | 3-tensor | 32×122×122 |
BatchNormalizationLayer | 3-tensor | 32×122×122 |
Ramp (ReLU) | 3-tensor | 32×122×122 |
PoolingLayer | 3-tensor | 32×61×61 |
ConvolutionLayer | 3-tensor | 64×57×57 |
BatchNormalizationLayer | 3-tensor | 64×57×57 |
Ramp (ReLU) | 3-tensor | 64×57×57 |
PoolingLayer | 3-tensor | 64×28×28 |
ConvolutionLayer | 3-tensor | 48×26×26 |
BatchNormalizationLayer | 3-tensor | 48×26×26 |
Ramp (ReLU) | 3-tensor | 48×26×26 |
PoolingLayer | 3-tensor | 48×13×13 |
FlattenLayer | vector | 8112 |
DropoutLayer | vector | 8112 |
LinearLayer | vector | 500 |
Ramp (ReLU) | vector | 500 |
LinearLayer | vector | 2 |
SoftmaxLayer | vector | 2 |
Output | class |
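The tensor shapes listed in the layer tables follow from simple arithmetic on the input size. The Python sketch below (not the Mathematica code used in the study) traces those shapes under stated assumptions: the kernel sizes are not reported in the text and are inferred here from the shape changes (5×5 for convolutions that shrink each side by 4, 3×3 for the one that shrinks it by 2), and each pooling layer is assumed to use a 2×2 window with stride 2 and no padding. The names `trace_shapes`, `STAINED_LAYERS`, and `CLUBMOSS_LAYERS` are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)

def trace_shapes(input_shape: Shape, layers: List[tuple]) -> List[Shape]:
    """Return the output shape after each convolution or pooling layer.

    Assumes unpadded convolutions with stride 1 and unpadded pooling whose
    stride equals its window size (assumptions, not stated in the source).
    """
    channels, height, width = input_shape
    shapes = []
    for layer in layers:
        kind = layer[0]
        if kind == "conv":
            _, out_channels, kernel = layer
            channels = out_channels
            height = height - kernel + 1
            width = width - kernel + 1
        elif kind == "pool":
            _, window = layer
            # Floor division drops a trailing row/column when the size is odd.
            height = (height - window) // window + 1
            width = (width - window) // window + 1
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((channels, height, width))
    return shapes

def flattened_size(shape: Shape) -> int:
    """Length of the vector produced by flattening a 3-tensor."""
    channels, height, width = shape
    return channels * height * width

# Inferred layer stacks (kernel and window sizes are assumptions).
STAINED_LAYERS = [
    ("conv", 16, 5), ("pool", 2),
    ("conv", 32, 5), ("pool", 2),
    ("conv", 64, 5), ("pool", 2),
    ("conv", 48, 3), ("pool", 2),
]
CLUBMOSS_LAYERS = [
    ("conv", 10, 5), ("pool", 2),
    ("conv", 40, 5), ("pool", 2),
]

stained_shapes = trace_shapes((3, 256, 256), STAINED_LAYERS)
clubmoss_shapes = trace_shapes((3, 256, 256), CLUBMOSS_LAYERS)
```

Batch normalization and ReLU layers are omitted because they leave the shape unchanged. Under these assumptions the final pooled shapes flatten to the vector lengths given in the two tables (8112 and 148840).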
Constitutive layers and tensor/vector shapes for the clubmoss/spikemoss CNN.
Layer | Type | Shape |
Input | 3-tensor | 3×256×256 |
ConvolutionLayer | 3-tensor | 10×252×252 |
BatchNormalizationLayer | 3-tensor | 10×252×252 |
Ramp (ReLU) | 3-tensor | 10×252×252 |
PoolingLayer | 3-tensor | 10×126×126 |
ConvolutionLayer | 3-tensor | 40×122×122 |
BatchNormalizationLayer | 3-tensor | 40×122×122 |
Ramp (ReLU) | 3-tensor | 40×122×122 |
PoolingLayer | 3-tensor | 40×61×61 |
FlattenLayer | vector | 148840 |
DropoutLayer | vector | 148840 |
LinearLayer | vector | 500 |
Ramp (ReLU) | vector | 500 |
LinearLayer | vector | 2 |
SoftmaxLayer | vector | 2 |
Output | class |
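The random 70%/20%/10% partition used for both datasets can be illustrated with a short Python sketch. This is not the authors' Mathematica code; the function name `partition_images`, the integer-percentage arguments, and the `seed` parameter are assumptions made for illustration. Passing a different seed for each training run would give a fresh partition each time, as described in the methods.

```python
import random
from typing import List, Sequence, Tuple

def partition_images(
    paths: Sequence[str],
    train_pct: int = 70,
    val_pct: int = 20,
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle image paths and split them into train/validation/test lists.

    The three lists are non-overlapping; whatever remains after the
    training and validation slices (10% by default) becomes the test set.
    """
    if train_pct < 0 or val_pct < 0 or train_pct + val_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical file names, used only to demonstrate the split.
example_paths = [f"specimen_{i:04d}.jpg" for i in range(1000)]
train_set, val_set, test_set = partition_images(example_paths, seed=42)
```

Because the test images are sliced from the same shuffled list after the training and validation images, no image used to fit the model can appear in the set used to assess its accuracy.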
Our best performing CNNs were remarkably effective in distinguishing stained from unstained specimens, as well as clubmosses from spikemosses (Fig.
Results of our CNN analyses of test herbarium specimen images.
The present study demonstrates two different ways in which CNNs can be applied to NHCs. The mercury staining analysis has practical implications for collections management, while the analysis centered on distinguishing families is interesting from both collections management and research perspectives. Our stained vs. unstained network could theoretically be applied to digitized specimens in other herbaria to help identify mercury hotspots for potential remediation. Likewise, our family discrimination network has the potential to be further developed into a universal tool to identify unknowns or to flag specimens in need of additional study, in the United States National Herbarium and in other NHCs.
Our work highlights the importance of proper metadata curation when approaching a machine learning project. Assembling the training dataset for the mercury analysis required many person-hours of visually inspecting images for staining, whereas clubmoss and spikemoss images were easily compiled using specimen metadata alone. Nascent digitization efforts in NHCs must carefully consider the acquisition and curation of metadata, because these determine how quickly machine learning tools can be applied to digitized museum collections.
Computation was performed on the Smithsonian Institution High Performance Cluster (SI/HPC), Hydra. We thank Nathan Anderson, Robert (Bort) Edwards, and Carol Kelloff for help classifying images and Martin Taheri and Sylvain Korzennik for IT support.
Conceptualization: ES, PBF, RBD, AB, AM & LJD. Methodology: ES, PBF, RBD & AB. Software: PBF & AB. Validation: PBF & RBD. Formal analysis: PBF, RBD & AB. Investigation: ES, PBF, RBD, SO, MP, AM, VAF & LJD. Resources: SO & AM. Data curation: ES, PBF, RBD, SO, MP, AM, VAF & LJD. Writing (original draft): ES, PBF & RBD. Writing (review and editing): ES, PBF, RBD, MP, VAF & LJD. Visualization: PBF & RBD. Supervision: ES. Project administration: ES, PBF, RBD, SO, AM & LJD. Funding acquisition: LJD.