Biodiversity Data Journal
Research Article
Corresponding author: Jeremy R. deWaard (dewaardj@uoguelph.ca)
Academic editor: Rodolphe Rougerie
Received: 16 Jan 2023 | Accepted: 20 Mar 2023 | Published: 10 May 2023
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Levesque-Beaudin V, Miller ME, Dikow T, Miller SE, Prosser SW.J, Zakharov EV, McKeown JT.A, Sones JE, Redmond NE, Coddington JA, Santos BF, Bird J, deWaard JR (2023) A workflow for expanding DNA barcode reference libraries through ‘museum harvesting’ of natural history collections. Biodiversity Data Journal 11: e100677. https://doi.org/10.3897/BDJ.11.e100677
Natural history collections are the physical repositories of our knowledge on species, the entities of biodiversity. Making this knowledge accessible to society – through, for example, digitisation or the construction of a validated, global DNA barcode library – is of crucial importance. To this end, we developed and streamlined a workflow for ‘museum harvesting’ of authoritatively identified Diptera specimens from the Smithsonian Institution’s National Museum of Natural History. Our detailed workflow includes both on-site and off-site processing through specimen selection, labelling, imaging, tissue sampling, databasing and DNA barcoding. This approach was tested by harvesting and DNA barcoding 941 voucher specimens, representing 32 families, 819 genera and 695 identified species collected from 100 countries. We recovered 867 sequences (> 0 base pairs) with a sequencing success of 88.8% (727 of 819 sequenced genera gained a barcode > 300 base pairs). While Sanger-based methods were more effective for recently-collected specimens, the methods employing next-generation sequencing recovered barcodes for specimens over a century old. The utility of the newly-generated reference barcodes is demonstrated by the subsequent taxonomic assignment of nearly 5000 specimen records in the Barcode of Life Data Systems.
DNA barcoding, Diptera, museum harvesting, COI, arthropods, digitisation, National Museum of Natural History, USNM, Centre for Biodiversity Genomics
Digitally capturing biological data is an ongoing challenge, as classification, description, digitisation and collation of data can be tedious and time-consuming processes (
‘Museum harvesting’ refers to the selection, digitisation and sampling of identified voucher specimens held in NHCs, for the purpose of isolating and sequencing one or more barcode markers – a short fragment of the cytochrome c oxidase I (COI) gene in the case of animals (see
The Smithsonian Institution’s National Museum of Natural History (NMNH, USNM; https://naturalhistory.si.edu/) in Washington, D.C., maintains one of the largest arthropod collections in the world, holding over 35 million insect specimens alone (
The present study focused on the museum harvesting of true fly (Diptera) specimens held at the USNM, a collection that comprises over 3,200,000 pinned specimens and over 55,000 identified species from 162 families (
Museum harvesting can be completed through on-site and off-site processing workflows (Fig.
Staff from CBG completed two visits to the USNM, Department of Entomology in 2017 (2-6 October 2017 and 4-12 December 2017). Prior to the first research visit, Orthorrhapha (Diptera) was selected as a target taxonomic group. CBG staff prepared a list of Orthorrhapha genera and species lacking representation in BOLD to assist with on-site specimen selection.
To streamline subsequent processing of specimens, museum harvesting was completed using Schmitt insect boxes arrayed with 8 x 12 grid squares matching a 96-well microplate layout used in the sequencing laboratory (columns numbered from 1 to 12 and rows labelled from A to H). Each Schmitt box accommodates 95 pinned specimens, with the 96th square reserved for a negative control. Ten Schmitt boxes were assigned a unique alphanumeric barcode label received from the Canadian Centre for DNA Barcoding (CCDB; http://ccdb.ca/; e.g. CCDB-31120). The same unique alphanumeric barcode was used to create a unique sample ID for each of the 95 squares of the array (e.g. CCDB-31120-A01). Placeholder labels (“removal labels”) for all sample IDs were pinned in each square matching the corresponding sample ID (Fig.
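The array layout described above maps directly onto sample IDs. The following is a minimal sketch, not the authors' tooling: the function name `plate_sample_ids` is hypothetical, and placing the negative control in well H12 is an assumption (the text only says the 96th square is reserved).

```python
from string import ascii_uppercase

ROWS = ascii_uppercase[:8]   # rows A to H
COLUMNS = range(1, 13)       # columns 1 to 12


def plate_sample_ids(plate_label, control_well="H12"):
    """Return (sample_ids, control_well) for one 96-square array.

    sample_ids maps each specimen well (e.g. "A01") to its sample ID
    (e.g. "CCDB-31120-A01"). The control well is left out of the mapping.
    Assumption: the negative control occupies H12 unless stated otherwise.
    """
    wells = [f"{row}{col:02d}" for row in ROWS for col in COLUMNS]
    if control_well not in wells:
        raise ValueError(f"unknown well: {control_well}")
    sample_ids = {w: f"{plate_label}-{w}" for w in wells if w != control_well}
    return sample_ids, control_well


ids, control = plate_sample_ids("CCDB-31120")
print(len(ids), ids["A01"], control)
```

Keeping the box grid and the microplate grid identical means a specimen's square, its removal label and its tissue well all share one identifier, so no lookup table is needed when moving between the collection and the laboratory.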
Specimens were selected in the museum by moving systematically through each adjacent row, cabinet and insect drawer of the target families within the insect collection to search for genera on the target list (Fig.
For each specimen selected and removed from its cabinet/drawer location and placed into a square in a Schmitt box array, the corresponding removal label was placed into the unit tray within the cabinet/drawer and replaced when the specimen was returned to the USNM (Fig.
During the first research visit in October 2017, after specimen selection was completed for the first four arrays, a report of the taxonomy, country of collection, sample ID and specimen cabinet/drawer locations was provided to the USNM curator (T.D.) for use in preparing the specimen loan (Fig.
Multiple habitus photos of each specimen were taken in the CBG imaging lab using a Canon EOS 70D camera (Fig.
During the second research visit in December 2017, after specimen selection for the remaining six arrays was completed (Fig.
Tissue sampling was completed by removing two legs (a middle leg and a hind leg) from each specimen, placing one into an assigned microplate for each array and the second into a tissue archiving plate (Fig.
Once tissue sampling was completed at USNM, specimens were returned to their original locations in the collection using a prepared list of cabinet locations and the removal labels associated with each specimen. Label data were transcribed from the label images, entered into the BOLD submissions spreadsheet and submitted to BOLD (in the ASILO project) (Fig.
The 941 tissue samples were lysed and extracted following the silica-based protocol outlined in
PCR amplification and sequencing were first completed using Sanger sequencing and analysis (Fig.
Each PCR reaction consisted of 2 µl of DNA template added to the appropriate well in pre-made, 96-well PCR plates, with 6.25 µl of 10% D-(+)-trehalose dihydrate (ThermoFisher Scientific), 2 µl of Hyclone ultra-pure water (ThermoFisher Scientific), 1.25 µl of 10× PlatinumTaq buffer (Invitrogen by ThermoFisher Scientific), 0.625 µl of 50 mM MgCl2 (Invitrogen by ThermoFisher Scientific), 0.125 µl of each primer, 0.0625 µl of 10 mM dNTP (KAPA Biosystems) and 0.060 µl of 5 U/µl PlatinumTaq DNA Polymerase (Invitrogen by ThermoFisher Scientific) for a total reaction volume of 12.5 µl. Thermal cycling conditions were 94°C for 1 min, 5 cycles at 94°C for 40 s, 45°C for 40 s, 72°C for 1 min, followed by 35 cycles at 94°C for 40 s, 51°C for 40 s, 72°C for 1 min and a final extension at 72°C for 5 min. All amplicons were visualised on a 2% agarose E-gel 96 pre-cast gel (ThermoFisher Scientific). Following consolidation into 384-well plates, PCR clean-up was completed with CleanSeq bead-based purification (Agencourt Biosciences) and cycle sequencing was performed using a modified BigDye 3.1 Terminator (Applied Biosystems, ThermoFisher Scientific) protocol (
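As a quick arithmetic check, the listed component volumes can be summed and scaled to a plate-sized master mix. This is an illustrative sketch only: "each primer" is read here as one forward and one reverse primer (an assumption that is consistent with the stated 12.5 µl total), and the 10% pipetting overage and the function name `master_mix` are hypothetical.

```python
# Per-reaction volumes in µl, as listed in the text.
# Assumption: "each primer" means two primers (forward and reverse).
REACTION_UL = {
    "DNA template": 2.0,
    "10% trehalose": 6.25,
    "ultra-pure water": 2.0,
    "10x PlatinumTaq buffer": 1.25,
    "50 mM MgCl2": 0.625,
    "forward primer": 0.125,
    "reverse primer": 0.125,
    "10 mM dNTP": 0.0625,
    "PlatinumTaq polymerase (5 U/µl)": 0.060,
}


def master_mix(n_reactions, overage=0.10):
    """Scale every component except the template for n_reactions.

    The template is excluded because it is added to each well separately.
    The overage fraction is a hypothetical allowance for pipetting loss.
    """
    factor = n_reactions * (1 + overage)
    return {
        name: round(volume * factor, 3)
        for name, volume in REACTION_UL.items()
        if name != "DNA template"
    }


total_ul = sum(REACTION_UL.values())
print(f"per-reaction total: {total_ul:.4f} µl")
for name, volume in master_mix(96).items():
    print(f"{name}: {volume} µl")
```

The components sum to 12.4975 µl, which rounds to the 12.5 µl reaction volume given in the text.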
All specimens that failed to gain a sequence (N = 418) were selected for next-generation sequencing (NGS)-based failure-tracking, utilising the method of
PCR amplification involved three rounds (in 96-well plates unless otherwise stated): PCR1 to produce a spectrum of COI amplicons from each DNA extract; PCR2 to generate short, overlapping amplicons flanked by PacBio “PB1” adapters; and PCR3 to add UMIs to the amplicons from each specimen so multiple samples could be pooled for sequencing. The first round of PCR consisted of two reactions per sample (PCR1.1, PCR1.2), with each reaction containing three forward primers spanning the barcode region and 5–6 reverse primers (all untagged primers; see Fig 1A in
PCR2 consisted of six reactions (PCR2.1, PCR2.2, PCR2.3, PCR2.4, PCR2.5, PCR2.6) with three (PCR2.1, PCR2.3, PCR2.5) using PCR1.1 as template, while the others (PCR2.2, PCR2.4, PCR2.6) used PCR1.2. PCR2 reaction cocktails were the same as those employed for PCR1, except the primers were tailed with PB1 adapters, providing universal primer binding sites for subsequent fusion of the UMIs. The PCR regime consisted of 94°C for 2 min, 40 cycles of 94°C for 40 s, 48°C for 40 s and 72°C for 30 s and a final extension of 72°C for 5 min. Following thermocycling, all six PCR2 reactions were pooled for each sample and a 12.5 μl aliquot of each pool was bead-purified as after PCR1. The purified products were then used for PCR3 which added sample-specific UMI tags to the amplicons recovered from each specimen. Asymmetrical dual-tagging was employed by using 96 different forward primers and 96 different reverse primers; the UMI-tagged fusion primers were complementary to the PB1 adapters of the PCR2 primers. The PCR regime consisted of 94°C for 2 min, 20 cycles of 94°C for 40 s, 64°C for 40 s and 72°C for 1 min, with a final extension of 72°C for 5 min. After thermocycling, the PCR3 amplicons were pooled for preparing libraries for SMRT sequencing.
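The capacity of the asymmetrical dual-tagging scheme follows from the primer counts: 96 forward and 96 reverse UMI-tagged primers give 96 × 96 = 9,216 distinct tag pairs. The sketch below shows one way such pairs could be assigned to samples and looked up again after sequencing. The assignment order, the integer tag numbering and the function names are all hypothetical; the text does not describe how pairs were allocated.

```python
from itertools import product

N_FORWARD = 96  # forward UMI-tagged primers, from the text
N_REVERSE = 96  # reverse UMI-tagged primers, from the text


def assign_tag_pairs(sample_ids):
    """Map each (forward_tag, reverse_tag) pair to one sample ID.

    Pairs are handed out in a fixed order, (1, 1), (1, 2), ..., which is
    an illustrative choice rather than the authors' scheme.
    """
    sample_ids = list(sample_ids)
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample IDs must be unique")
    if len(sample_ids) > N_FORWARD * N_REVERSE:
        raise ValueError("more samples than available tag pairs")
    pairs = product(range(1, N_FORWARD + 1), range(1, N_REVERSE + 1))
    return dict(zip(pairs, sample_ids))


def demultiplex(tag_pair, lookup):
    """Return the sample ID for a tag pair, or None if the pair is unused."""
    return lookup.get(tag_pair)


# 418 placeholder IDs, matching the number of specimens sent to NGS.
lookup = assign_tag_pairs(f"sample-{i:03d}" for i in range(418))
print(N_FORWARD * N_REVERSE, demultiplex((1, 1), lookup))
```

Because the pair (forward, reverse) is ordered, the same tag sequences can serve in both positions without ambiguity, which is what lets a modest primer set index thousands of samples in one pooled run.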
Template preparation was performed following PacBio recommendations for SMRT sequencing. Purification involved adding 400 μl of the library to 480 μl of AMPure-PB beads (all subsequent purifications were carried out using the same 1.2× bead-to-sample ratio). End-repair, SMRTbell adapter ligation, primer annealing and polymerase binding were all completed using PacBio instructions. The polymerase-bound products were loaded onto a SMRT cell (1M v.2) via diffusion loading without prior enrichment at a concentration of 18 pM. Sequencing run parameters were set using SMRTLink version 5.0 and sequencing was completed on a PacBio Sequel system. Default run settings were used with a few exceptions: insert size was set to 500, movie time to 480 min, immobilisation time to 120 min and pre-extension time to 20 min. Following sequencing, the raw data were analysed using the CCS algorithm under the SMRT Analysis module of SMRTLink. Default settings were used with the following exceptions: the maximum and minimum subread lengths were set to 500 bp and 100 bp, respectively.
The raw sequence data were used to generate circular consensus sequences (CCS) on SMRTLink v.7 using a minimum predicted accuracy of 99%. The short CCS reads (downloaded in FASTA format) were then assembled (de novo) into longer COI barcode sequences by custom bash and R scripts (made accessible by
To assess the impact of a museum harvesting-based reference library on the identification of BINs or records on BOLD, data from a large-scale collecting effort from CBG, the Global Malaise Program (GMP; http://www.globalmalaise.org;
All sequences uploaded to BOLD that matched criteria outlined in
All specimen data, which were formatted for the USNM EMu Collection Management System, as well as all specimen and label images, were provided to USNM staff for data submission and archiving (Fig.
A complete list of the 941 USNM Diptera specimens (including USNMENT catalogue numbers, collection date, country of origin, taxonomy, BOLD process ID, BIN, sequence length, GenBank accession number and NMNH Biorepository number) is provided in Suppl. material
After sequencing using the Sanger-based method (
Analysis of the relationship between specimen age and sequence length for A) specimens sequenced using the Sanger-based protocol and B) specimens sequenced using the NGS-based failure-tracking protocol. Flagged records were excluded from these analyses. Note that all specimens that failed using the Sanger-based protocol were then attempted with the NGS-based failure-tracking protocol [and only appear in B)].
NGS-based failure-tracking was conducted on 418 specimens that did not gain a sequence during Sanger analysis. Of the 418 specimens, 366 gained a sequence (87.6%), bringing sequence recovery to 92.1% (867 of 941 total specimens > 0 bp). Of the 867 specimens, 824 had acceptable barcodes recovered (> 300 bp), resulting in an overall Sanger- and NGS-based sequence success rate of 87.6%. Of the 819 sequenced genera, 727 had acceptable barcodes recovered (88.8% success rate). For NGS-based failure-tracking, the relationship between sequence length and the collection age of the specimen was weaker, but still significant (Fig.
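The recovery and success rates quoted above can be reproduced from the reported counts. This is a small worked check using only numbers stated in the text; the helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts as reported in the text.
rates = {
    "NGS failure-tracking recovery (366 of 418)": percent(366, 418),
    "overall sequence recovery, > 0 bp (867 of 941)": percent(867, 941),
    "overall barcode success, > 300 bp (824 of 941)": percent(824, 941),
    "genus-level success, > 300 bp (727 of 819)": percent(727, 819),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```

Note that two different ratios, 366 of 418 and 824 of 941, both round to 87.6%, so the repeated figure in the text is a coincidence of rounding rather than a duplicated value.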
After NGS-based failure-tracking, of the 941 sequenced specimens, 41 records resulted in a contaminated barcode and were flagged on BOLD (17 at the time of sequencing using the Sanger-based protocol; 16 after the NGS-based protocol and eight flags were added after final data review) (Fig.
Using the taxonomically identified barcodes gained from the ASILO project that were greater than 400 bp, BOLD assigned (or could have assigned) genus- or species-level taxonomy to 4,999 specimens from the GMP project, through BIN taxonomy matches and BOLD ID Engine results (Fig.
Global Malaise Program Records (GMP) that gained or could have gained taxonomy at the genus and species level using BIN taxonomy match and BOLD ID Engine approaches. These numbers are inclusive of older and newer Malaise trap projects that could fall under the large GMP campaign (see Materials and Methods for more details). *Covered by the BIN taxonomy match.
| | Gained genus assignment (records) | Gained species assignment (records) | Total |
| BIN taxonomy match | 1,263 | 2,403 | 3,666 |
| BOLD ID Engine | 1,333 | * | 1,333 |
| Total | 2,596 | 2,403 | 4,999 |
Capturing biological data from natural history collections is critical for providing a comprehensive record of Earth’s biodiversity – both historical and contemporary. In our study, we aimed to develop and streamline a workflow for ‘museum harvesting’ of taxonomically identified voucher specimens held in NHCs. The workflow was then assessed through a pilot project that harvested and DNA barcoded 941 Diptera specimens archived in the Entomology collection of the Smithsonian National Museum of Natural History (USNM). Secondary objectives were to refine the museum workflow to be applicable to future projects at other NHCs and to demonstrate the utility of the newly-generated barcodes for the identification of previously unidentified specimens within the BOLD reference library. Utilising Sanger sequencing for initial DNA barcoding, followed by failure tracking using an NGS-based approach, 867 barcode sequences were recovered from the specimens with an overall sequencing success of 88.8% (727 of 819 sequenced genera gained a barcode > 300 bp).
Both on-site and off-site workflows were employed in the harvesting and barcoding of NHC specimens, each of which possesses advantages and poses challenges during various stages of voucher specimen processing. For on-site specimen processing, there is less risk of damage to fragile and often invaluable vouchers, as there is limited handling and no transport to an off-site location – only the tissue material for DNA extraction/sequencing must be moved off-site. The transport required for the off-site workflow poses a risk of specimen damage (and potentially specimen loss) and can be a time-consuming step if there is significant distance between the two facilities, using either shipping or hand-carrying. On-site processing can also facilitate the harvesting of restricted specimens (e.g. from primary or secondary type series) that are not permitted to leave the collection, allow taxonomic curators to work closely with technicians throughout the entire process and enable the voucher specimens to remain accessible as reference material. Conversely, the on-site workflow is significantly less cost- and time-effective, due to the longer time required within the NHC to complete the labelling, imaging, databasing and tissue sampling. This extra time adds supplemental costs, such as requiring additional technician hours to complete the work at the NHC and/or additional travel/accommodation expenses to compensate for the additional processing time. These tasks can be completed more efficiently at an off-site facility that has a dedicated team to accomplish each task and is better equipped to complete these steps in a shorter time period (e.g. optimised workspaces, superior imaging equipment, improved computational capacity for intensive processes, such as image stacking). In addition to more efficient completion of these crucial steps, off-site processing may also provide a more sterile environment for sampling, reducing the risk of contamination by exogenous DNA (
The dipteran voucher specimens were DNA barcoded using two approaches: a Sanger-based method targeting two overlapping amplicons (
The DNA barcodes generated from the USNM voucher specimens were used to assign 4,999 records on BOLD to genus or species, through matching with an existing BIN or querying the new sequence through the BOLD ID Engine. This demonstrates the further utility of harvesting and barcoding authoritatively-identified museum specimens in the construction of reference barcode libraries: the addition of these records often enables more taxonomic assignments, expanding and refining the library further. These results reinforce the view that building reference libraries for many taxa can rely on a combination of museum harvesting (or other approaches where taxonomic assignments occur prior to barcode analysis) and the barcoding of freshly-collected, unidentified material that is assigned taxonomy after barcoding, through morphological assessment by an expert.
While this study was conducted on a small scale, with fewer than 1,000 voucher specimens, this workflow has formed the basis for larger-scale museum harvesting projects at the Smithsonian National Museum of Natural History and other institutions, using both on- and off-site processing. For example,
This study was enabled by funding from the Smithsonian Institution Barcode Network Award (FY17 Award Cycle) entitled ‘Harvesting of diverse Orthorrhapha and other dipteran genera at USNM’ awarded to T.D., V.L.B., S.E.M., E.V.Z. and J.R.d. The CBG and BOLD are supported by a number of funding sources (awarded to Paul Hebert), including the Canada Foundation for Innovation, Genome Canada through Ontario Genomics, the Natural Sciences and Engineering Research Council of Canada, the Ontario Ministry of Research, Innovation and Science, the Gordon and Betty Moore Foundation, Ann McCain Evans and Chris Evans. This paper also contributes to the University of Guelph’s Food from Thought research programme supported by the Canada First Research Excellence Fund. We are grateful to colleagues at the USNM and CBG for their support, including Katie Barker, Mike Trizna, Ashton Smith, Gergin Blagoev, Stephanie deWaard, Liuqiong Lu, Margarita Miklasevskaja, Renee Miskie, Norm Monkhouse, Suresh Naik, Nadya Nikolova, Mikko Pentinsaari, Sujeevan Ratnasingham, Angela Telfer and Paul Hebert.
T.D., V.L.B., S.E.M., E.V.Z. and J.R.d. secured the funding. J.R.d., M.E.M., S.E.M. and V.L.B. contributed to the initial organisation of the research project. M.M. and J.R.d. facilitated the two research visits and completed all specimen processing at the USNM. T.D. was the ASILO project lead and curator of the sampling materials at the USNM. M.M. completed all specimen labelling, tissue sampling and digitisation at CBG. J.M. completed all imaging at CBG and processed all images taken at the USNM. S.W.J.P. and E.V.Z. performed laboratory analysis and contributed to the laboratory section of the manuscript. M.M. analysed data, converted BOLD data into the EMu format and wrote the final project report for USNM staff. V.L.B. analysed the data, created the figures and tables and contributed to sections of the manuscript. M.M. and J.R.d. drafted, edited and contributed to sections of the manuscript. N.R. and J.B. added all data into the USNM EMu Collection Management System. All authors read and approved the final manuscript.
Summary of specimen, sequence and voucher information for the 941 USNM specimens of Diptera analysed in the study.