Biodiversity Data Journal :
Research Article
Corresponding author: Bernardo F. Santos (bernardofsantos@gmail.com)
Academic editor: Rodolphe Rougerie
Received: 23 Jan 2023 | Accepted: 30 Mar 2023 | Published: 24 Apr 2023
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Santos BF, Miller ME, Miklasevskaja M, McKeown JTA, Redmond NE, Coddington JA, Bird J, Miller SE, Smith A, Brady SG, Buffington ML, Chamorro ML, Dikow T, Gates MW, Goldstein P, Konstantinov A, Kula R, Silverson ND, Solis MA, deWaard SL, Naik S, Nikolova N, Pentinsaari M, Prosser SWJ, Sones JE, Zakharov EV, deWaard JR (2023) Enhancing DNA barcode reference libraries by harvesting terrestrial arthropods at the Smithsonian's National Museum of Natural History. Biodiversity Data Journal 11: e100904. https://doi.org/10.3897/BDJ.11.e100904
The use of DNA barcoding has revolutionised biodiversity science, but its application depends on the existence of comprehensive and reliable reference libraries. For many poorly known taxa, such reference sequences are missing even at higher-level taxonomic scales. We harvested the collections of the Smithsonian’s National Museum of Natural History (USNM) to generate DNA barcoding sequences for genera of terrestrial arthropods previously not recorded in one or more major public sequence databases. Our workflow used a mix of Sanger and Next-Generation Sequencing (NGS) approaches to maximise sequence recovery while ensuring affordable cost. In total, COI sequences were obtained for 5,686 specimens belonging to 3,737 determined species in 3,886 genera and 205 families distributed in 137 countries. Success rates varied widely according to collection data and focal taxon. NGS helped recover sequences of specimens that failed a previous run of Sanger sequencing. Success rates and the optimal balance between Sanger and NGS are the most important drivers to maximise output and minimise cost in future projects. The corresponding sequence and taxonomic data can be accessed through the Barcode of Life Data System, GenBank, the Global Biodiversity Information Facility, the Global Genome Biodiversity Network Data Portal and the NMNH data portal.
COI, cox1, dark taxa, OTUs, BINs, natural history collection, museum harvesting, National Museum of Natural History, USNM, Centre for Biodiversity Genomics, CBG
The use of DNA barcoding has revolutionised how biodiversity can be surveyed and identified, with applications in fields as broad as biodiversity assessment, invasive species monitoring, agricultural pest control, identification of disease vectors, integrative taxonomy and evolutionary studies (reviewed in
In the face of these challenges, one of the most promising avenues for building comprehensive reference libraries is directly harvesting museum specimens that are authoritatively determined (
The Smithsonian Institution’s National Museum of Natural History (USNM) comprises the largest natural history collection in the world, with a large portion of its holdings represented by terrestrial invertebrates. For many taxa, the USNM holds the most complete inventory of species of any collection in the world and the vast majority of invertebrate orders have a complete inventory of the holdings at species level. These qualities make it ideally suited to contribute to the general effort of building a global reference library for DNA barcodes, especially for taxa not otherwise represented in repositories such as GenBank (
Herein we report results of the project “Barcoding NMNH terrestrial invertebrate genera”, which aims to generate DNA barcoding sequences for genera not previously represented on GenBank, BOLD or GGBN and to initiate the long-term preservation of publicly-accessible genomic DNA extracts and high-resolution images to accompany the physical USNM vouchers. In a companion paper released simultaneously with this one (
In 2018 and 2019, staff from the Centre for Biodiversity Genomics (CBG) completed six visits (46 days total) to the Smithsonian Institution’s National Museum of Natural History, Department of Entomology (USNM). Prior to each visit, a number of target taxa, such as families or superfamilies, were defined, based on the number of available specimens, level of curation and physical location in the museum. Taxon selection aimed to cover most major insect orders, except for Diptera, which were the subject of a pilot project in the development of this methodological workflow (
At the end of each visit, specimens were transferred to CBG for processing. Each specimen was assigned a sample ID, accession number and labelled with a Barcode of Life Data Systems (BOLD) (
From the initial set of specimens, 950 samples were selected for NGS processing; in addition, the NGS pipeline was used for a subset of the specimens that failed to yield sequences using the Sanger protocols. In both cases, the same set of laboratory methods and protocols was adopted. The NGS failure tracking (NGSFT) proceeded as follows: first, a list of genera sampled in Year 1 (Fig.
The complete NGS protocol can be found in
All sequences underwent taxonomic validation by matching to existing records using the BOLD ID engine, followed by sequence discordance detection using Neighbour-joining trees of similar taxa (
The specimen data, images and sequencing data for all 8,549 specimen records are available on BOLD in the public dataset DS-NMNHALL (http://dx.doi.org/10.5883/DS-NMNHALL) and searchable in the Public Data Portal on BOLD (www.boldsystems.org/index.php/Public_BINSearch) or downloadable by utilising BOLD’s API (www.boldsystems.org/index.php/resources/api).
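As a minimal sketch of how the public dataset might be requested programmatically, the snippet below only builds a query URL and makes no network call. The base path, the `combined` endpoint name and the `container` and `format` parameters are assumptions about BOLD's public API and should be checked against the API documentation linked above; `bold_query_url` is a hypothetical helper, not part of any BOLD client library.

```python
from urllib.parse import urlencode

# Assumed base path of the BOLD public API; verify against the API documentation.
BOLD_API_BASE = "http://www.boldsystems.org/index.php/API_Public"


def bold_query_url(endpoint: str, container: str, fmt: str = "tsv") -> str:
    """Build a query URL for a BOLD project or dataset code (hypothetical helper)."""
    query = urlencode({"container": container, "format": fmt})
    return f"{BOLD_API_BASE}/{endpoint}?{query}"


# DS-NMNHALL is the dataset code given in the text.
url = bold_query_url("combined", "DS-NMNHALL")
print(url)
```

The resulting URL could then be passed to any HTTP client; for a dataset of this size, downloading once and caching the file locally is likely preferable to repeated requests.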
Specimen records include taxonomy, collection date and location, USNM ENT identifiers, EZID reference numbers (corresponding to EMu-minted records that have globally-unique identifier status), BINs and any additional voucher specimen details. All specimen images are publicly available under the Creative Commons No Rights Reserved (CC0 1.0) licence. All data were submitted and stored in the USNM EMu collection management system and individual records are accessible at https://collections.nmnh.si.edu/search/ento/. Specimen data and DNA storage information were submitted to the Global Genome Biodiversity Network (GGBN) Data Portal (
All sequences have been submitted to GenBank; the dataset can be accessed through NCBI’s BioProject PRJNA81359 (https://www.ncbi.nlm.nih.gov/bioproject/81359). All specimen data have also been uploaded to the Global Biodiversity Information Facility (GBIF; http://www.gbif.org) in the ‘NMNH Extant Specimen Records (USNM, US)’ occurrence dataset (https://doi.org/10.15468/hnhrg3). DNA extracts derived from sequenced specimens are held in the CBG DNA Archive (as specified in
A complete list of the 8,549 specimens (including USNM ENT IDs, Process IDs, BOLD IDs, COI sequence length, country of origin, collection date and taxonomy) is provided in Suppl. material
Of the 4,508 selected genera, 882 genera were represented by one specimen, 3,421 genera were represented by two specimens, 103 genera were represented by three specimens, 75 genera were represented by four specimens and the remaining 27 genera were represented by five or more specimens. At the time of specimen selection (Table A in Suppl. material
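The sampling tally above can be checked with a short calculation. This sketch uses only the counts stated in the text; the variable names are illustrative, and the specimen figure is a lower bound because genera in the "five or more" class are counted as exactly five.

```python
# Number of genera represented by 1, 2, 3 or 4 specimens (counts from the text).
genera_by_specimen_count = {1: 882, 2: 3421, 3: 103, 4: 75}
genera_five_or_more = 27

total_genera = sum(genera_by_specimen_count.values()) + genera_five_or_more

# Lower bound on specimens: each "five or more" genus contributes at least 5.
min_specimens = (
    sum(n * g for n, g in genera_by_specimen_count.items())
    + 5 * genera_five_or_more
)

share_two_specimens = 100 * genera_by_specimen_count[2] / total_genera

print(total_genera)                   # 4508
print(min_specimens)                  # 8468
print(round(share_two_specimens, 1))  # 75.9
```

The lower bound of 8,468 is compatible with the 8,549 specimen records reported, and roughly three quarters of the genera were sampled with exactly two specimens.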
Initial sequencing results by sequencing method for 8,549 USNM specimen records prior to NGS Failure Tracking. A total of 675 genera gained at least one sequence using both the Sanger and NGS protocols during initial sequencing.
Initial Sequencing Method | Total Specimens | > 500 bp | 300–499 bp | 200–299 bp | 1–199 bp | 0 bp | Contaminated Sequences
Sanger Protocol | 7,599 | 2,246 | 1,609 | 239 | 53 | 3,306 | 146
NGS Protocol | 950 | 445 | 120 | 63 | 84 | 234 | 4
TOTAL | 8,549 | 2,691 | 1,728 | 198 | 89 | 3,693 | 150
(% of Total) | | 31.48% | 20.21% | 2.32% | 1.04% | 43.20% | 1.75%
NGS-based failure-tracking was conducted in two stages (Fig.
NGS Failure Tracking sequencing results. A total of 145 specimens failed (0 bp) on the first round of NGS failure tracking and were, therefore, included again in the second round. In total, NGSFT was performed on 1,343 specimens.
Sequencing Method | Total Specimens | > 500 bp | 300–499 bp | 200–299 bp | 1–199 bp | 0 bp | Contaminated Sequences
NGSFT (Round 1) | 475 | 231 | 69 | 3 | 7 | 161 | 4
(% of Total) | | 48.63% | 14.53% | 0.63% | 1.47% | 33.89% | 0.84%
NGSFT (Round 2) | 1,013 | 356 | 145 | 60 | 113 | 332 | 7
(% of Total) | | 35.10% | 14.30% | 5.90% | 11.20% | 32.80% | 0.70%
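The figure of 1,343 specimens in the caption follows from the two round totals once the specimens repeated in round 2 are subtracted. The sketch below reproduces that count and, as an illustration only, derives the share of attempts in each round that yielded a sequence longer than 300 bp; the function name is hypothetical and the inputs are the table values.

```python
# Attempts per round and repeats, taken from the NGSFT table and its caption.
round1_attempts = 475
round2_attempts = 1013
repeated_in_round2 = 145  # round-1 failures (0 bp) that were run again in round 2

# Specimens counted once, even if attempted in both rounds.
unique_specimens = round1_attempts + round2_attempts - repeated_in_round2


def share_over_300bp(over_500: int, bp_300_499: int, attempts: int) -> float:
    """Percentage of attempts that yielded a sequence longer than 300 bp."""
    return 100 * (over_500 + bp_300_499) / attempts


round1_share = share_over_300bp(231, 69, round1_attempts)
round2_share = share_over_300bp(356, 145, round2_attempts)

print(unique_specimens)        # 1343
print(round(round1_share, 1))  # 63.2
print(round(round2_share, 1))  # 49.5
```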
After NGS-based failure tracking, overall sequence recovery by specimen was 66.5% (5,686 of 8,549 records gained a sequence (> 0 bp) (Table
Sequencing results by taxonomic group for 8,549 USNM specimens. Other Orders: Mecoptera, Megaloptera, Neuroptera, Odonata, Plecoptera, Raphidioptera and Trichoptera.
Order | Total Specimens | > 500 bp | 300–499 bp | 200–299 bp | 1–199 bp | 0 bp | Contaminated Sequences
Araneae | 95 | 42 | 12 | 1 | 13 | 26 | 1
Coleoptera | 3,257 | 1,284 | 689 | 79 | 41 | 1,095 | 69
Diptera | 103 | 44 | 17 | 0 | 1 | 37 | 4
Hemiptera | 2,042 | 776 | 542 | 30 | 58 | 596 | 40
Hymenoptera | 2,017 | 563 | 493 | 133 | 80 | 736 | 12
Lepidoptera | 454 | 281 | 46 | 4 | 13 | 104 | 6
Other Orders* | 581 | 288 | 143 | 11 | 2 | 119 | 18
Total | 8,549 | 3,278 | 1,942 | 258 | 208 | 2,713 | 150
(% of Total) | | 38.30% | 22.70% | 3.00% | 2.40% | 31.70% | 1.80%
Success length for COI sequencing by specimen collection date (given in percentage values at each bar) for the 8,549 USNM specimens selected in 2018 and 2019. The green bar represents the percentage of specimens collected per decade with recovered sequences (> 300 bp) and orange represents specimens with failed sequences (0 - 299 bp) or flagged sequences.
Of the 4,508 selected genera, 3,886 gained a sequence > 0 bp (86.2%), with 3,638 genera gaining a sequence that was an acceptable barcode (> 300 bp), resulting in a success rate of 80.7% (Table
Order | Total Genera | % Success (> 300 bp) | > 500 bp | 300–499 bp | 200–299 bp | 1–199 bp | 0 bp | Contaminated Sequences
Araneae | 54 | 64.5% | 29 | 6 | 1 | 8 | 10 | 0
Coleoptera | 1,655 | 83.1% | 951 | 425 | 29 | 30 | 214 | 6
Diptera | 53 | 83.0% | 32 | 12 | 0 | 1 | 7 | 1
Hemiptera | 1,123 | 80.6% | 581 | 325 | 14 | 45 | 152 | 6
Hymenoptera | 1,068 | 73.2% | 449 | 333 | 58 | 43 | 184 | 1
Lepidoptera | 256 | 85.9% | 197 | 23 | 0 | 13 | 21 | 2
Mecoptera | 7 | 100% | 6 | 1 | 0 | 0 | 0 | 0
Megaloptera | 12 | 75.0% | 6 | 3 | 1 | 0 | 2 | 0
Neuroptera | 121 | 92.6% | 91 | 21 | 2 | 0 | 6 | 1
Odonata | 135 | 96.3% | 83 | 47 | 1 | 0 | 3 | 1
Plecoptera | 8 | 62.5% | 3 | 2 | 0 | 1 | 2 | 0
Raphidioptera | 5 | 40.0% | 2 | 0 | 0 | 1 | 2 | 0
Trichoptera | 11 | 90.9% | 5 | 5 | 0 | 0 | 1 | 0
Total | 4,508 | 80.7% | 2,435 | 1,203 | 106 | 142 | 604 | 18
(% of Total) | | | 54.02% | 26.69% | 2.35% | 3.15% | 13.40% | 0.40%
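The two genus-level rates quoted in the text (any sequence versus an acceptable barcode) can be recovered from the "Total" row of the table. This is a small worked example under the assumption, consistent with the reported totals, that "gained a sequence" means neither 0 bp nor contaminated; the variable names are illustrative.

```python
# Genus-level counts from the "Total" row of the by-order table.
total_genera = 4508
over_500 = 2435
bp_300_499 = 1203
zero_bp = 604
contaminated = 18

# Genera with any sequence (> 0 bp): everything except failures and contaminants.
any_sequence = total_genera - zero_bp - contaminated

# Genera with an acceptable barcode (> 300 bp).
acceptable = over_500 + bp_300_499

any_rate = 100 * any_sequence / total_genera
acceptable_rate = 100 * acceptable / total_genera

print(any_sequence, round(any_rate, 1))     # 3886 86.2
print(acceptable, round(acceptable_rate, 1))  # 3638 80.7
```

The gap between the two rates (248 genera) corresponds to genera whose best sequence was shorter than 300 bp.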
Sequence recovery by genera (> 0 bp) for all selected insect orders was between 60.0% and 100.0% (Fig.
Sequencing results by taxonomic group for 4,508 USNM genera. Inner pie chart shows the proportion of sampled taxa in each taxonomic group and the outer chart shows the distribution of sequencing success within each taxonomic group. Other Orders: Mecoptera, Megaloptera, Neuroptera, Odonata, Plecoptera, Raphidioptera and Trichoptera.
Hymenoptera specimens were sequenced using a sample of leg tissue (1,542/2,017 specimens, representing 818 Hymenoptera genera) or using the whole voucher (475/2,017 total specimens, representing 253 Hymenoptera genera), (Table
Tissue type and sequencing method for 2,017 Hymenoptera specimens prior to NGS Failure tracking.
Initial Sequencing | Total Specimens | > 500 bp | 300–499 bp | 200–299 bp | 1–199 bp | 0 bp | Contaminated Records
Sanger (leg tissue) | 1,347 | 260 | 268 | 93 | 31 | 686 | 9
NGS (leg tissue) | 195 | 68 | 24 | 10 | 25 | 68 | 0
TOTAL | 1,542 | 328 | 292 | 103 | 56 | 754 | 9
(% of Total) | | 21.27% | 18.94% | 6.68% | 3.63% | 48.90% | 0.58%
Sanger (Whole Voucher) | 380 | 57 | 91 | 32 | 0 | 197 | 3
NGS (Whole Voucher) | 95 | 3 | 29 | 20 | 8 | 35 | 0
TOTAL | 475 | 60 | 120 | 52 | 8 | 232 | 3
(% of Total) | | 12.63% | 25.26% | 10.95% | 1.68% | 48.84% | 0.63%
Tissue type and sequencing method for 2,017 Hymenoptera specimens after NGS Failure tracking.
Tissue Type | Total Specimens | > 500 bp | 300–499 bp | 200–299 bp | 1–199 bp | 0 bp | Contaminated Records
Leg Tissue | 1,542 | 487 | 353 | 87 | 72 | 534 | 9
(% of Total) | | 31.58% | 22.89% | 5.64% | 4.67% | 34.63% | 0.58%
Whole Voucher | 475 | 76 | 140 | 46 | 8 | 202 | 3
(% of Total) | | 16.00% | 29.47% | 9.68% | 1.68% | 42.53% | 0.63%
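To make the effect of failure tracking on the two Hymenoptera tissue types easier to compare, the sketch below computes the share of specimens with a sequence longer than 300 bp before and after NGSFT, using the counts from the two tables above. The `Outcome` class and its field names are illustrative and not part of the original workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Counts for one tissue type (hypothetical container for table values)."""
    total: int
    over_500: int
    bp_300_499: int

    def rate_over_300bp(self) -> float:
        """Percentage of specimens with a sequence longer than 300 bp."""
        return 100 * (self.over_500 + self.bp_300_499) / self.total


# Values from the "TOTAL" rows before NGSFT and the tissue rows after NGSFT.
before = {"leg": Outcome(1542, 328, 292), "whole": Outcome(475, 60, 120)}
after = {"leg": Outcome(1542, 487, 353), "whole": Outcome(475, 76, 140)}

# Gain in percentage points attributable to failure tracking.
gain = {
    tissue: after[tissue].rate_over_300bp() - before[tissue].rate_over_300bp()
    for tissue in before
}

for tissue in before:
    print(
        tissue,
        round(before[tissue].rate_over_300bp(), 1),
        round(after[tissue].rate_over_300bp(), 1),
        round(gain[tissue], 1),
    )
```

On these counts, leg-tissue samples gained roughly 14 percentage points from failure tracking and whole-voucher samples roughly 8.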
The persistent scarcity of reliable reference libraries for many poorly-known invertebrate taxa has been a growing concern, reflected in the recent emergence of projects and initiatives aimed specifically at such groups, such as “GBOL III: Dark Taxa” by the German Barcode of Life Initiative (
Using authoritatively identified material from one of the most prominent natural history collections in the world, we were able to provide novel DNA barcoding data for thousands of genera which had not yet been sequenced and for 3,743 determined species of terrestrial arthropods. This data release not only represents an important advance in the availability of species-level reference barcodes for several taxa, but also has the potential to assist genus-level identifications for groups for which reference sequences are sorely lacking. These results were attained by using a workflow that combines on-site sampling with off-site processing of specimens and DNA extracts (
The laboratory protocol used for this study was primarily based on Sanger sequencing, with an NGS pipeline used as an alternative method to recover sequences for very old or small taxa or to specifically target samples that had failed to sequence using the Sanger-based methodology. In our case, this increased overall success, mostly due to the change in amplification strategy (i.e. use of nested PCR targeting smaller fragments; see
As costs associated with NGS processing continue to decline (
In our case, NGS was only attempted for specimens that were either unlikely to be successfully sequenced with Sanger approaches (i.e. very small or old) or included as part of failure tracking; hence, our success rates for NGS cannot be used as a baseline for the overall success that would be expected had the whole project been conducted under this approach. Overall, our data and those of
We wish to thank the numerous USNM curators and other staff members who contributed directly or indirectly to this work: Thomas Henry, Stuart McKamey, Charyn J. Micheli, Allen Norrbom, Robert Robbins, Ted Schultz, Floyd W. Shockley and Hannah M. Wood. Funds for this project, including a postdoctoral fellowship to BFS, were provided by the Smithsonian’s Global Genome Initiative and the Smithsonian Institution Barcode Network (FY18, FY19 and FY20 Award Cycles). The CBG receives funding support from a number of sources, including the Canada Foundation for Innovation, Genome Canada through Ontario Genomics, the Natural Sciences and Engineering Research Council of Canada, the Ontario Ministry of Research, Innovation and Science, the Gordon and Betty Moore Foundation, Ann McCain Evans and Chris Evans. This article also contributes to the University of Guelph’s Food from Thought research programme supported by the Canada First Research Excellence Fund. We would also like to thank colleagues at the CBG for their contributions to this research, including Allison Brown, Gergin Blagoev, Tyler Elliott, Liuqiong Lu, Renee Miskie, Norm Monkhouse, Crystal Sobel, Angela Telfer, Connor Warne and Paul Hebert. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. USDA is an equal opportunity provider and employer. The authors have not detected any conflict of interest to declare.
Specimen selection visits by CBG staff to the Smithsonian Institution National Museum of Natural History, Department of Entomology (NMNH) and corresponding BOLD project on the Barcode of Life Data Systems (BOLD) (Ratnasingham & Hebert 2007).