Documenting decapod biodiversity in the Caribbean from DNA barcodes generated during field training in taxonomy

Abstract DNA barcoding is a useful tool to identify the components of mixed or bulk samples, as well as to determine individuals that lack morphologically diagnostic features. However, the reference database of DNA barcode sequences is particularly sparsely populated for marine invertebrates and for tropical taxa. We used samples collected as part of two field courses, focused on graduate training in taxonomy and systematics, to generate DNA sequences of the barcode fragments of cytochrome c oxidase subunit I (COI) and mitochondrial ribosomal 16S genes for 447 individuals, representing at least 129 morphospecies of decapod crustaceans. COI sequences for 36% (51/140) of the species and 16S sequences for 26% (37/140) of the species were new to GenBank. Automatic Barcode Gap Discovery identified 140 operational taxonomic units (OTUs) which largely coincided with the morphospecies delimitations. Barcode identifications (i.e. matches to identified sequences) were especially useful for OTUs within Synalpheus, a group that is notoriously difficult to identify and rife with cryptic species, a number of which we could not identify to species, based on morphology. Non-concordance between morphospecies and barcode OTUs also occurred in a few cases of suspected cryptic species. As mitochondrial pseudogenes are particularly common in decapods, we investigate the potential for this dataset to include pseudogenes and discuss the utility of these sequences as species identifiers (i.e. barcodes). These results demonstrate that material collected and identified during training activities can provide useful incidental barcode reference samples for under-studied taxa.


Introduction
A shortage of taxonomic expertise is one of the current challenges facing those engaged in identifying, classifying, utilising and conserving the world's biodiversity (Giangrande 2003, Vernooy et al. 2010, Cardoso et al. 2011, Ebach et al. 2011, Wägele et al. 2011, Sluys 2013). The recent application of DNA techniques, the compilation of large taxonomic databases (Costello et al. 2013) and the use of bioinformatics approaches like GIS have rejuvenated interest in taxonomic data. Unfortunately, this increase in relevance and interest has been counteracted by the gradual loss of integrative taxonomic expertise (Drew 2011, Coleman 2015). This recent decline has limited attempts to document the world's biodiversity and limits the rate at which high-profile initiatives, such as the Census for Marine Life, Barcode of Life (BOLD) and WoRMS, can generate, agglomerate and synthesise biodiversity knowledge. There is a particular need for new experts specialising in marine biodiversity, where it is estimated that 30-60% (Appeltans et al. 2012) or even 90% (Mora et al. 2011) of eukaryotic species remain to be described or discovered. There is also a particular shortage of taxonomists and data from developing countries and countries with economies in transition.
DNA barcoding is a useful tool for the identification of samples that cannot be identified based on traditional morphological methods (Bucklin et al. 2011, Bucklin et al. 2016, Geiger et al. 2016). Short, easily amplifiable fragments that vary amongst closely related species are sequenced from specimens identified by experts and then used as a reference set to compare with sequences from unidentified or unidentifiable samples (Hebert and Gregory 2005, Bucklin et al. 2011, Bucklin et al. 2016, Geiger et al. 2016). For animals, the most widely used barcode is cytochrome c oxidase subunit I (COI; Hebert et al. 2003), followed by the 16S large subunit ribosomal RNA (16S), another mitochondrial marker (see Mantelatto et al. 2017, Collin et al. 2018, Morín et al. 2019). This approach is useful in a variety of contexts, including identifying components of gut contents and bulk environmental samples. However, the global DNA barcode database (BOLD; Ratnasingham and Hebert 2007) is still sparsely populated, specifically for invertebrates and especially for tropical taxa. In many cases, even common, relatively easily identified and well-known taxa do not yet have sequences of the DNA barcode fragment of COI publicly available in GenBank. For example, Raupach and Radulovici (2015) showed a particular lack of DNA barcode studies for crustaceans from the Caribbean, amongst other regions. Therefore, activities that can aid in generating DNA reference barcodes for commonly encountered species, even without a comprehensive effort to exhaustively document species in a particular group or fauna, can make a significant contribution to the utility of the barcode database. In addition, improved taxonomic coverage may assist in narrowing down the possible identities of unknown samples that do not match any reference sample.
Here, we used material collected as part of graduate taxonomy training workshops in Bocas del Toro, Panama to generate a reference set of DNA barcodes of common shallow-water decapods of the Caribbean coast of Panama. Decapods are one invertebrate group that, despite its importance and high diversity, still has low DNA barcode coverage in the tropics (Raupach and Radulovici 2015). Our hope was that by combining the two, not only could trainees become familiar with processing material for subsequent DNA extraction, but that the contribution of biodiversity data to global databases would help garner continued support and help make training activities more sustainable (Cancian de Araujo et al. 2018). Due to the wide geographic ranges of many taxa throughout the Caribbean and the potential for gene flow between Bocas del Toro and other parts of the region via dispersal of planktonic larvae on ocean currents (Cowen et al. 2007, Cowen and Sponaugle 2009, Schill et al. 2015), our barcode library may be useful in other zones of the Caribbean Sea.

Collection
Specimens for DNA barcoding were collected during two workshops of the Training in Tropical Taxonomy programme run by the Smithsonian Tropical Research Institute in Bocas del Toro, Panama. The "Shrimp Taxonomy (Caridea, Dendrobranchiata and Stenopodidea)" course in 2008 included 13 students from eight countries (Mexico, UK, US, Colombia, Slovenia, Brazil, Australia and Costa Rica) and the "Taxonomy and Biology of Decapod Crustaceans" course included 13 students from five countries (US, Colombia, Brazil, Argentina and Costa Rica) in 2011. These 2-week workshops, each led by one of us (SDG and DF, respectively) and co-instructed by A. Anker and F. Mantelatto, respectively, were aimed at graduate student training but also included undergraduate students and post-doctoral professionals seeking training in systematics and identification of the focal groups. During these two weeks, students collected and identified specimens as part of their training and, when the animals were intact and well-enough preserved to make useful vouchers, tissue samples were taken by us for DNA barcoding (see section on DNA sequencing). Therefore, unlike other studies that included training sessions in DNA barcoding (e.g. Harris and Bellino 2013), our study simply made use of specimens collected and identified under the supervision of taxonomic experts during taxonomy training workshops. All specimens were collected from Bahia Almirante, especially from sites in and around Isla Colon and Isla Bastimentos. No documentation of sampling effort was made, as collections were opportunistic and arranged around the other activities of the courses. The marine invertebrate diversity of this area has been documented in some detail (e.g. Collin et al. 2005, Bonnet and Rocha 2011, Goodheart et al. 2016). A detailed checklist is available for shrimps (De Grave and Anker 2017), but no checklists or comprehensive surveys are available for brachyurans or anomurans.
Vouchers from the decapod course are deposited in the decapod collection of the University of Louisiana at Lafayette (ULL) which is currently being transferred to the Smithsonian Natural History Museum (USNM). Reference numbers from the ULL collection are provided in the dataset associated with this research (dx.doi.org/10.5883/DS-CRUSTACE and Table 1). Many vouchers from the shrimp course are currently stored in the Zoological Collection of the Oxford University Museum of Natural History (OUMNH.ZC, see Table 1). However, a number of shrimp samples were transferred to the Museu Nacional de Brazil and were subsequently lost in the fire that destroyed the museum in 2018. This lost material is listed without voucher numbers (Table 1).

DNA Sequencing
Specimens from the shrimp course were extracted in Panama using a Biosprint 96 and a DNA Blood Kit (Qiagen) and the DNA extracts were shipped to the Smithsonian's Laboratories of Analytical Biology (LAB) for PCR and sequencing. For the decapod course, small pieces of tissue were preserved in 150 μl of M2 extraction buffer (AutoGen), stored frozen and shipped to LAB for extraction and sequencing. Samples were extracted using an AutoGenprep 965 extraction robot after overnight digestion in AutoGen buffer with proteinase-K. We sequenced two gene fragments. The DNA barcode fragment of the cytochrome c oxidase subunit I (COI) was amplified using primarily the primer pair jgLCO1490/jgHCO2198 (Geller et al. 2013), although the pair dgLCO1490/dgHCO2198 (Meyer et al. 2005) was also used. The 10 μl PCR mix included 1 μl Biolase Taq (Promega), 0.1 μl BSA and 0.3 μl of each 10 mM primer. For amplification and sequencing of 16S, the primer pair 16S AR/16S BR (Palumbi et al. 1991) was used. The mix for 16S was the same as for COI with the addition of 0.5 μl 50 mM MgCl2. The annealing temperature for nearly all reactions for all three gene regions was 50°C, although occasionally it was raised to 52°C when it appeared that co-amplification was occurring. Sequencing followed the methods described in

Analysis
Sequences were screened for quality and contigs of forward and reverse sequences were produced using Sequencher 5.4.6 (Gene Codes). Only sequences with a length of more than 90% of the expected length and with a Phred quality score of at least 30 for more than 85% of the bases were combined into contigs and used for analyses. To check for potential contamination, sequences were compared within the BOLD workbench (www.boldsystems.org; Ratnasingham and Hebert 2007) to all taxa sequenced in our project; likewise, sequences were compared to publicly available sequences using BLASTn searches in GenBank. The few sequences with > 95% identity to non-decapods were eliminated from subsequent analyses. COI sequences were also checked with the methods of Song et al. (2008) to determine whether they displayed detectable pseudogene traits (Buhay 2009).
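The length and quality criteria above can be expressed as a simple filter. The following is a minimal sketch in Python, not the Sequencher workflow itself; the function name, its parameters and the example expected length of 658 bp are illustrative assumptions that do not come from the study.

```python
def passes_quality(phred_scores, expected_length,
                   min_length_fraction=0.90, min_phred=30,
                   min_high_quality_fraction=0.85):
    """Return True if a read meets the length and base-quality thresholds.

    phred_scores: one Phred score per base of the read.
    expected_length: expected fragment length in bases (an assumption
    supplied by the caller; the text does not state it).
    """
    length = len(phred_scores)
    # Length criterion: more than 90% of the expected fragment length.
    if length <= min_length_fraction * expected_length:
        return False
    # Quality criterion: Phred >= 30 for more than 85% of the bases.
    high_quality = sum(1 for q in phred_scores if q >= min_phred)
    return high_quality / length > min_high_quality_fraction


# Hypothetical reads checked against an assumed 658 bp fragment.
good_read = [40] * 650
short_read = [40] * 500
noisy_read = [40] * 500 + [10] * 150
print(passes_quality(good_read, 658),
      passes_quality(short_read, 658),
      passes_quality(noisy_read, 658))
```

Keeping the thresholds as keyword arguments makes it easy to see how many reads each criterion rejects on its own.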
As DNA barcoding is usually a distance-based approach, we constructed a neighbour-joining tree (BIONJ, Gascuel 1997) with Jukes-Cantor distances to preliminarily recognise distinct OTUs. Neighbour-joining trees with Kimura's two-parameter distances were also constructed and produced the same results as the Jukes-Cantor distances. The tree nodes were further verified with non-parametric bootstrapping, using Felsenstein's method (Felsenstein 1985, Efron et al. 1996). OTUs were then delimited with Automatic Barcode Gap Discovery (Puillandre et al. 2011) using the following parameters: Pmin = 0.001; Pmax = 0.1 for COI and 0.05 for 16S; X = 1.125 for COI and 1.5 for 16S; Steps = 10. Pmin and Pmax were chosen with the help of a histogram of distances and X was smaller in COI because the default 1.5 value did not provide enough sensitivity to partition the data (see Puillandre et al. 2011).
Whenever an OTU differed between COI and 16S, the OTU was accepted only if it diverged from every other sequence by at least 0.05 substitutions per site in COI or 0.03 in 16S. If the discrepancy remained unresolved, then we accepted the option producing fewer OTUs. The final consensus OTUs were compared to the system of Barcode Index Numbers (BINs) assigned in BOLD (Ratnasingham and Hebert 2013) and to our morphological identifications, in order to detect potentially cryptic species or previously unrecognised diversity.
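One possible reading of the divergence rule above is sketched below. The function name and the idea of summarising each candidate OTU by its smallest distance to any sequence outside it are assumptions made for illustration; only the two thresholds come from the text.

```python
COI_THRESHOLD = 0.05  # substitutions per site, from the text
S16_THRESHOLD = 0.03  # substitutions per site, from the text


def accept_discordant_otu(min_coi_distance=None, min_16s_distance=None):
    """Decide whether an OTU that differs between markers is accepted.

    Each argument is the smallest distance between the candidate OTU and
    any sequence outside it for that marker, or None if the marker was
    not sequenced (hypothetical inputs).
    """
    coi_ok = min_coi_distance is not None and min_coi_distance >= COI_THRESHOLD
    s16_ok = min_16s_distance is not None and min_16s_distance >= S16_THRESHOLD
    # Either marker reaching its threshold is enough to keep the OTU.
    return coi_ok or s16_ok


print(accept_discordant_otu(0.06, 0.01))
print(accept_discordant_otu(0.02, 0.01))
```

Cases that remain unresolved after this check would fall back on the option producing fewer OTUs, as stated above.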

Data resources
The DNA sequences associated with this paper are deposited in the Barcode of Life Database (dataset dx.doi.org/10.5883/DS-CRUSTACE) (Ratnasingham and  and GenBank (www.ncbi.nlm.nih.gov/genbank) (accession numbers MN183805-MN184218 for COI and MK971234-MK971659 for 16S).

Results and Discussion
A total of 447 individuals, morphologically identified to 129 species, were successfully sequenced for at least one marker, including 47 species of shrimps, 57 brachyuran crabs, one achelate lobster, four axiid mudshrimp, one gebiid mudshrimp and 19 anomuran crabs (Table 1, Figs 1, 2). Shrimps included the infra-orders Caridea, Dendrobranchiata and Stenopodidea. Amongst successfully sequenced individuals, 99 were identified to genus, but could not be confidently assigned to a species based on morphology. The Automatic Barcode Gap Discovery method delimited 141 OTUs with COI and 140 OTUs with 16S; likewise, our COI sequences were assigned to 146 Barcode Index Numbers (BINs) in BOLD. The larger number of OTUs and BINs suggests there are ~10 potentially cryptic species or species with unusually high levels of genetic diversity in this dataset.
Eighty-seven of our consensus OTUs matched COI sequences already in GenBank with an identity of > 95% (see Table 1). Of these, our identification and the name on the GenBank sequence were concordant for 77 of the OTUs, including seven cases in which our identification provided better taxonomic resolution than the GenBank sequence. In many cases, these represent samples of the same taxa from other Caribbean regions, confirming the conspecific status of animals from different parts of the same biogeographic region. In the ten cases where our identification was not concordant with the name of a COI GenBank sequence > 95% identical, the discrepancy typically occurred at the species level while the higher taxonomic ranks remained concordant. Two OTUs did not have sequences in COI: one was a singleton identified as Leander paulensis and the other included two specimens identified as Pilumnus reticulatus and P. pannosus. The remaining 51 OTUs for which we have COI sequences were < 95% identical to any other sequence in GenBank and were therefore considered to be new additions.
The results for the 16S analysis were relatively similar: 99 consensus OTUs were > 97% identical to 16S sequences available in GenBank and for 89 of them, the morphological identification coincided with the name of the GenBank sequence, whereas the other ten OTUs showed discrepancies at the species level with the GenBank sequence, but remained concordant at higher taxonomic ranks. Four singleton OTUs, identified as Stenopus scutellatus, Inachoides sp., Austinixa aidae and Speleophorus nodosus did not have sequences in 16S. The remaining 16S sequences (belonging to 37 OTUs) were < 97% similar to other sequences in GenBank; thus these were considered new additions.
Figure 1. Pie chart indicating the number of individuals present in this study for each decapod family.
Our dataset contributed 56 new BINs to BOLD and provided 38 new species for at least one marker in GenBank (Table 1). One hundred and thirty-seven of our 140 OTUs are associated with only one morphospecies name. This coincides with our visual observations of the COI and 16S neighbour-joining trees, which showed that our morphospecies identifications are largely concordant with clusters of very similar sequences. These clusters differed from other such clusters by ~0.10 substitutions per site in COI and ~0.05 in 16S (Fig. 3). Such concordance between morphospecies and OTUs, and the magnitude of the observed interspecific divergence are similar to those reported by Costa et al. (2007) and Matzen da Silva et al. (2011). Nevertheless, there are several cases where animals could not be identified to species or in which a single species/species-complex name appeared in different OTUs. These included the following:
Figure 3. Neighbour-Joining trees for cytochrome c oxidase subunit I (COI) and 16S ribosomal RNA (16S) from specimens identified in this study and GenBank as Tozeuma carolinense. The accession number is provided for the GenBank sequences. Likewise, reference numbers are provided for specimens deposited in the University of Louisiana at Lafayette (ULLZ#) and the Oxford University Museum of Natural History (OUMNH). The Jukes-Cantor distance between specimens is proportional to the length of the branches separating them, as indicated in the scale bars at the bottom-left. a: Neighbour-Joining tree for cytochrome c oxidase subunit I; b: Neighbour-Joining tree for 16S ribosomal RNA.
For Synalpheus spp., 42 individuals fell into eight OTUs. Seven OTUs matched GenBank sequences from in-depth studies of these taxa (Duffy et al. 2000, Morrison et al. 2004, Hultgren et al. 2014, Chak et al. 2017). These were S. hoetjesi, S. paraneptunus, S. yano, S. aff. longicarpus, S. elizabethae and S. cf. rathbunae. The final OTU did not match anything in GenBank (identity < 90% in both markers). Our failure to detect more than one additional species in this group suggests that it has been well-sampled in Bocas del Toro.
Specimens of Alpheus spp. were split into 18 OTUs. Thirteen OTUs were identified to species, including two OTUs assigned to the same species name (Alpheus paracrinitus); one of these OTUs had four individuals, whereas the other was a singleton. A. paracrinitus has long been considered an unresolved species complex, including at least four species (Knowlton et al. 1993, Anker 2001, Leray and Knowlton 2015). Most of the OTUs identified as members of the A. packardii complex are < 95% and < 97% identical to COI and 16S sequences in GenBank, respectively. One other OTU, identified as Alpheus sp., matched GenBank sequences that were also identified only to genus. Clearly, we are far from having a complete barcode database for this speciose taxon.
The shrimp Tozeuma carolinense (Fig. 2C) fell into two OTUs, one with eight specimens and the other with one. Both of these OTUs were new for BOLD, adding two new BINs to the database. The 16S sequences for one of these OTUs matched a T. carolinense sequence in GenBank with > 99% identity. However, neither of our T. carolinense OTUs matched the available COI sequences in GenBank at > 95% identity (Fig. 3, Table 1), suggesting that this morphologically distinctive species may include several cryptic species. A similar situation occurred for Sicyonia laevigata (Fig. 2A), Pagurus criniticornis and Alpheus paracrinitus; each of these names was assigned to specimens that grouped in two OTUs: one with multiple specimens and the other a singleton.
OTUs with multiple species names: One OTU included sequences from two specimens, morphologically identified as different species (Pilumnus dasypodus and P. caribaeus). This OTU matched P. dasypodus sequences in GenBank with > 99% identity for both markers. Another OTU identified as P. caribaeus in our dataset matched GenBank sequences of that species. One other OTU comprised specimens morphologically identified as two species (P. pannosus and P. reticulatus).

Pseudogenes in Decapod Barcoding
We found no indels in our COI sequences and no stop codons in the corresponding amino acid sequence. Both the COI and 16S sequences showed a range of GC content (GC%) from 23.97-46.48% (Fig. 4), which is similar to the findings of other studies (e.g. Costa et al. 2007, Matzen da Silva et al. 2011). As mitochondrial genes are expected to show a different AT bias than pseudogenes (Bensasson 2001, Song et al. 2008, Matzen da Silva et al. 2011, Liu et al. 2016), those sequences with significantly deviant GC% are potentially more likely to be pseudogenes; nevertheless, a careful examination of all our sequences failed to detect any strong evidence of pseudogenes. The overall concordance between our morphological identifications and the molecular identification of OTUs, based on 16S and COI, further supports the conclusion that pseudogenes were rare or absent in this dataset. Much has been made of the problem with pseudogenes in decapods (Schneider-Broussard and Neigel 1997, Williams and Knowlton 2001, Song et al. 2008, Schubart 2010, Matzen da Silva et al. 2011, Raupach and Radulovici 2015, but see Schizas 2012). They are undoubtedly more common in decapods than in some other groups of marine invertebrates, where they have seldom been reported. It is also clear that pseudogenes can cause significant problems in phylogenetic reconstructions (Schubart 2010). However, we argue that the problems they pose for DNA barcoding are limited and that the difficulty in determining whether a sequence is a pseudogene without resorting to cloning means that barcode datasets for decapods may never be entirely free of pseudogene sequences unless mitochondrial DNA is directly targeted during DNA extraction with a mitochondrial DNA isolation kit.
Figure 4. Histogram of the GC content for COI and 16S in three major groups of decapods evaluated in this study. Shrimps included the infra-orders Caridea, Dendrobranchiata and Stenopodidea.
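Two of the checks described above, GC content and the absence of stop codons, are straightforward to compute. The sketch below is an illustration in Python rather than the procedure of Song et al. (2008); the function names are hypothetical, and it assumes that COI is read under the invertebrate mitochondrial genetic code (NCBI translation table 5), in which only TAA and TAG are stop codons.

```python
# Stop codons under the invertebrate mitochondrial genetic code
# (NCBI translation table 5): TGA encodes Trp and AGA/AGG encode Ser.
INVERT_MITO_STOPS = {"TAA", "TAG"}


def gc_percent(seq):
    """Percentage of G and C amongst unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence has no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def frames_without_stops(seq):
    """Return the forward reading frames (0, 1, 2) with no in-frame stop codon."""
    seq = seq.upper()
    open_frames = []
    for frame in range(3):
        # Step through complete codons only, starting at the frame offset.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in INVERT_MITO_STOPS for codon in codons):
            open_frames.append(frame)
    return open_frames


# Made-up sequences for illustration only.
print(gc_percent("GGCCAATT"))
print(frames_without_stops("ATGTAAGGC"))
```

A COI barcode for which `frames_without_stops` returns an empty list has a stop codon in every forward frame and would merit a closer look, although an open reading frame alone does not prove that a sequence is mitochondrial.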
If AT/GC bias is similar across the mitochondrion, one might expect that the GC% of COI and 16S are correlated and deviations from the trend-line could be another way to identify potential pseudogenes. Our data show an overall positive correlation between the two, considerable scatter around the trend-line and a small cluster of carideans that fall somewhat below the other carideans, but still within the overall variation for the decapods (Fig. 5).
Figure 5. Scatterplot of the GC content (GC%) in COI versus 16S for every individual of this study, successfully sequenced for both markers. In general, the GC% of the two markers appears to be as correlated as expected for two mitochondrial genes and none of the specimens seems to deviate enough to be considered a suspect pseudogene. Major groups are indicated with colours: Brachyura: red, Anomura: blue, Shrimps (Caridea, Dendrobranchiata, and Stenopodidea): green, Axiidea: light-grey, Gebiidae: black, Achelata: yellow.
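The idea of using deviations from the COI versus 16S trend-line could be made quantitative along the following lines. This is a hedged sketch, not an analysis performed in the study, where the comparison appears to have been visual: the function names, the least-squares fit and the cut-off expressed in residual standard deviations are all assumptions.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept, plus Pearson's r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def flag_outliers(x, y, n_sd=3.0):
    """Indices whose residual from the trend-line exceeds n_sd residual SDs.

    n_sd is a hypothetical cut-off chosen for illustration.
    """
    slope, intercept, _ = fit_line(x, y)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = math.sqrt(sum(res ** 2 for res in residuals) / (len(residuals) - 2))
    return [i for i, res in enumerate(residuals) if abs(res) > n_sd * sd]


# Made-up GC% values; the sixth specimen is deliberately displaced.
coi_gc = [30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0]
s16_gc = [28.0, 29.0, 30.0, 31.0, 32.0, 43.0, 34.0, 35.0, 36.0, 37.0]
print(flag_outliers(coi_gc, s16_gc, n_sd=2.0))
```

With small samples a single displaced point inflates the residual standard deviation, so a strict cut-off may flag nothing; any flagged specimen would only be a candidate for closer inspection, not a confirmed pseudogene.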
Our limited ability to identify pseudogene sequences without cloning indicates that pseudogenes are likely to infiltrate metabarcoding datasets generated by high-throughput sequencing, as well as datasets generated by Sanger sequencing. One concern about the inclusion of pseudogenes in these kinds of biodiversity studies is that they may overestimate the number of OTUs reported (Schneider-Broussard and Neigel 1997, Williams and Knowlton 2001, Song et al. 2008, Schubart 2010, Matzen da Silva et al. 2011, Raupach and Radulovici 2015). This could certainly happen, but a more common occurrence is that either co-amplification of the gene and its pseudogenes reduces the quality of the reads, resulting in an unusable sequence, or a single sequence significantly out-amplifies the other, resulting in a single sequence from each species. If this is the pseudogene, the results could complicate phylogenetic analyses, but are unlikely to impact the results of DNA barcoding studies. One situation where pseudogenes certainly will not impact the efficacy of a DNA barcoding approach is in the identification of unknown samples through comparisons with sequences from carefully identified material. If a sequence is known to come from a specific species, whether or not it is a pseudogene, that sequence can be used to generate a positive identification of unknown material. Therefore, rather than being discarded, potential pseudogene sequences should be included in barcode databases as a potentially informative resource (see also the arguments by Schizas 2012).

Taxonomy Training and DNA Barcoding
The present study, along with Cancian de Araujo et al. (2018), demonstrates that DNA barcoding of common species encountered during field training in tropical biodiversity can contribute useful data to the effort to barcode metazoans. Such data, despite being collected from a single site, may be relevant throughout the Caribbean as connectivity is considered high within this sea for some decapods [e.g. Panulirus argus (Naro-Maciel et al. 2011, Kough et al. 2013)]. With only a moderate collecting effort (i.e. incidental collections over 4 weeks total), we obtained a barcode dataset of a similar number of decapod OTUs as the exhaustive decapod DNA barcode dataset for the North Sea (Raupach et al. 2015). Of these, 32% of the COI sequences and 24% of the 16S sequences were new to GenBank and 39% of the BINs were new to BOLD. It has previously been noted that crustacean sequences are poorly represented in BOLD (Raupach and Radulovici 2015). The number of new OTUs is perhaps significantly lower than what would be expected in locations that have previously received less intensive systematic study than Bocas del Toro, which has been the focus of alpheid shrimp systematics for over 15 years (Williams and Knowlton 2001, Anker et al. 2007, Anker et al. 2008, Mathews and Anker 2009, Anker 2010, Anker et al. 2012). Nevertheless, a significant portion of the sequenced species generated new species records in GenBank and required little additional effort in the field over and above the collection and identification exercises already underway. Of course, as with any vouchered material, additional curatorial effort was necessary compared to typical field courses where the material is not usually vouchered.