Development of mitochondrial DNA cytochrome c oxidase subunit I primer sets to construct DNA barcoding library using next-generation sequencing

Abstract
Insects are one of the most diverse eukaryotic groups on the planet, with one million or more species present, including those yet undescribed. The DNA barcoding system has aided in the identification of cryptic and undescribed species, and the mitochondrial cytochrome c oxidase I region (mtDNA COI) has been utilised for the barcoding analysis of insect taxa. Subsequently, next-generation sequencing (NGS) technology was developed, allowing for rapid acquisition of massive amounts of sequence data for genetic analyses. Although PCR primers designed to amplify the mtDNA COI region for NGS have been developed, their target regions cover only part of the COI region and/or they show taxonomic bias in PCR amplification. As the mtDNA COI region is a traditional DNA marker for the DNA barcoding system, modified primers for this region would greatly contribute to taxonomic studies. In this study, we redesigned previously developed PCR primer sets that targeted the mtDNA COI barcoding region to improve amplification efficiency and to enable sequencing analysis on NGS platforms. As a result, the redesigned primer sets achieved a high success rate (> 85%) for the species examined in this study in four insect orders (Coleoptera, Lepidoptera, Orthoptera and Odonata). Thus, by combining these primers with previously developed primer sets for the 12S or 16S rRNA regions, we can conduct more detailed taxonomic, phylogeographic and conservation genetic studies using NGS.


Introduction
Biodiversity can be categorised into three levels: species diversity (richness), genetic diversity and ecosystem diversity. Species richness provides a straightforward method for describing community and regional diversity (Magurran 1988, Gotelli and Colwell 2001). An estimated 10 million species live on Earth (May 1988, May 1990, Stork 1993, Gaston 1991, Costello et al. 2011, Costello et al. 2013), although the exact number is unknown. In recent years, anthropogenic climate change has increased the risk of extinction for many living species. In addition, many species have likely become extinct before being described, even if the number of living organisms is underestimated. Insects are amongst the most diverse eukaryotic groups on the planet and at least one million species have been described (Stork 2018, Takenaka et al. 2023). Nevertheless, 1-2% of insect species may be cryptic (Stork 2018) and many new and cryptic species have been described recently (e.g. Takenaka et al. (2023)).
The DNA barcoding system was developed to identify species through DNA sequencing (Hebert et al. 2003, Hebert and Gregory 2005). This innovative system facilitates rapid, accurate, automatable species identification using short standardised gene regions as internal species tags (Hebert and Gregory 2005). DNA barcoding and DNA sequencing approaches have contributed to the detection of cryptic and undescribed species (e.g. Burns et al. (2008), Rebijith et al. (2013)). Hebert et al. (2003) identified the mitochondrial cytochrome c oxidase I (mtDNA COI) region as the core target region for DNA barcoding because it robustly detects moderate genetic differences amongst species as a marker in taxonomic and phylogenetic studies. The 12S and 16S ribosomal RNA loci are also used to identify specimens (e.g. Marquina et al. (2018), Takenaka et al. (2023)). Using genetic differences in the COI region, phylogeographic and conservation genetic studies of many insect species have been conducted (e.g. Buckley et al. (2009), Nakahama et al. (2018), Çıplak et al. (2022)), allowing access to substantial reference sequences in the International Nucleotide Sequence Database Collaboration (INSDC), which involves the DNA Data Bank of Japan (DDBJ), European Molecular Biology Laboratory (EMBL) and the National Center for Biotechnology Information (NCBI). The COI region is more suitable for DNA barcoding than other mtDNA loci, specifically the 658-bp sequence amplified by the primer pair LCO1490 and HCO2198 established by Folmer et al. (1994), which has been widely used for various insect taxa. With the development of next-generation sequencing (NGS), it has become possible to obtain large numbers of short sequence reads (ca. 300 bp) rapidly. Therefore, modifying primer sets to amplify the COI region efficiently across a diverse range of insect taxa and to allow NGS sequencing would enhance the reference sequences available for DNA barcoding in this region.
Although primer sets have been developed to amplify the COI region for NGS-based analysis (i.e. DNA metabarcoding; for example, Leray et al. (2013); Meier et al. (2015); Elbrecht and Leese (2017)), contributing to the identification of insect species, these primer sets are not suitable for enhancing reference sequences in the COI region because they amplify only part of the COI region. Furthermore, the insect species for which these primer sets are suitable have not been identified. Although Zhou et al. (2013), Liu et al. (2017) and Yang et al. (2020) developed new NGS-based pipelines for the mtDNA COI region, the specimens sampled in each analysis showed taxonomic bias, i.e. specimens were selected only from Lepidoptera, Diptera and Neuroptera, respectively. Therefore, there is a need to develop primer sets for NGS-based analysis that can amplify the full length of the COI barcoding region and are applicable to many insect taxa. Novel primer sets for NGS-based analysis that amplify the 12S and 16S ribosomal RNA loci have been developed and used to detect cryptic species in some insect taxa, including 11 orders, 42 families and 70 species (Takenaka et al. 2023). More recently, multiplexed phylogenetic marker sequencing (MPM-seq) was developed (Suyama et al. 2021), enabling the simultaneous detection of multi-locus sequences. Thus, the development of primer sets targeting insect mtDNA could contribute to research on taxonomy, phylogeography and conservation genetics. In this study, we redesigned previously developed primer sets that targeted the entire COI barcoding region, which will enhance COI reference sequences for insect taxa.

Development of primer sets
To identify polymorphic sites in the primer annealing regions of different insect taxa, we downloaded COI sequences from the NCBI database (Suppl. material 1; 33 species, 29 families and 14 orders). These 33 sequences were aligned using MAFFT v.7.310-1 (Katoh 2002, Katoh and Standley 2013). Polymorphic sites within the COI region across the 33 species were visualised using MEGA-X (Kumar et al. 2018), which provided information for the modification of the primer pair LCO1490 and HCO2198 to include mixed bases (e.g. A/G: R, A/T/C: H). Given that the total sequence amplified by this primer pair exceeds 500 bp, the resultant sequence reads are unsuitable for NGS-based sequence analysis. To address this problem, we integrated an intermediate primer pair, mlCOIintF and mlCOIintR (Leray et al. 2013), to delimit the first half of the COI region in this study (1-319 bp, Fig. 1, Table 1). An approximately 240-bp portion of the COI region, which is moderately conserved amongst species (Leray et al. 2013), was selected as the annealing site for the forward primer of the second half of the COI region (262-658 bp, Fig. 1, Table 1). The sites were also modified as described above, giving rise to two primer pairs; thus, modified LCO1490 and mlCOIintR amplify the first half of the COI region and the new forward primer COmfd_F and modified HCO2198 amplify the second half of the COI region (Table 1), with expected amplification products of approximately 350 bp.
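The way polymorphic alignment columns translate into mixed bases can be illustrated with a short sketch. It collapses each column of an aligned primer annealing site into the standard IUPAC ambiguity code for the bases observed there. The function name and the toy alignment are hypothetical and are not taken from Suppl. material 1; the actual primers in this study were designed manually from the MEGA-X visualisation.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned_sites):
    """Collapse aligned primer-site sequences into one IUPAC-coded string."""
    length = len(aligned_sites[0])
    if any(len(seq) != length for seq in aligned_sites):
        raise ValueError("all sequences must have the same aligned length")
    consensus = []
    for column in zip(*aligned_sites):
        # Ignore gaps and any non-ACGT symbols when collecting observed bases.
        bases = frozenset(b.upper() for b in column if b.upper() in "ACGT")
        consensus.append(IUPAC[bases] if bases else "N")
    return "".join(consensus)


# Hypothetical toy alignment of one annealing site in four taxa.
toy_sites = [
    "GGTCAACAAATCA",
    "GGTCAACAGATCA",
    "GGACAACAAATCA",
    "GGCCAACAAATCA",
]
print(degenerate_consensus(toy_sites))
```

In this toy case the third column contains A, C and T and is coded H, while the ninth column contains A and G and is coded R, matching the two examples of mixed bases given in the text.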

Sample collection
Between April and October 2022, we collected 96 specimens comprising 96 species, 48 families and 11 orders (Table 2). Each specimen was preserved by freezing.

Table 1. Information on the newly modified primer sets in this study.
Figure 1. Illustrated positions of the primers in this study.
Table 2. Specimen samples and the results of sequencing analysis.

PCR amplification and sequencing analysis
Genomic DNA was extracted using DNeasy Blood & Tissue Kits (QIAGEN, Hilden, Germany) and the total DNA concentration was quantified with a NanoDrop ND-1000 (Thermo Scientific, Waltham, USA). Polymerase chain reaction (PCR) was performed for each specimen according to the manufacturer's protocol, in a final volume of 10 µl that included 5-10 ng of DNA, 1.0 µl Ex Taq Buffer, 0.2 µmol/l of primers, 0.8 µl of dNTP mixture (2.5 mM of each dNTP), 2 U of Takara Ex Taq polymerase (Takara Bio, Otsu, Japan) and sterile distilled water up to 10 µl. The PCR thermal cycling conditions were an initial 1-min denaturation at 94°C; 35 cycles of 94°C for 30 s, 52°C for 30 s and 72°C for 1 min; and a final 20-min extension at 72°C. The PCR product was verified using a MultiNA microchip electrophoresis system (SHIMADZU, Kyoto, Japan). Before sequencing, the PCR products (i.e. LCO1490-COmfd_R and COmfd_F-HCO2198) were pooled for each specimen sample and all 96 samples were prepared for sequencing. Subsequent paired-end sequencing was conducted with a 2 × 250 bp run on an Illumina MiSeq Sequencer (Illumina, San Diego, USA) using the MiSeq Reagent Nano Kit v.2 (500 cycles).
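As a rough check on why 2 × 250 bp reads suit amplicons of this size, the expected overlap between the forward and reverse reads can be estimated from the read length and the amplicon length. The sketch below is illustrative only: the function name is hypothetical, and the 350-bp value is the approximate size quoted for the expected amplification products, not a measured fragment length.

```python
def expected_overlap(read_length, amplicon_length):
    """Number of bases covered by both the forward and the reverse read."""
    overlap = 2 * read_length - amplicon_length
    # No overlap if the reads do not meet; at most the whole amplicon
    # if each read is longer than the amplicon itself.
    return max(0, min(overlap, amplicon_length))


# 2 x 250 bp reads on an amplicon of approximately 350 bp.
print(expected_overlap(250, 350))
```

Under these assumptions the two reads share roughly 150 bases, which leaves ample room for merging. An amplicon longer than 500 bp would leave no overlap at all with this read length.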
As paired-end sequencing was used, it was possible to identify overlaps between forward and reverse reads. The clconcatpair command with the --mode=OVL argument was used to generate concatenated reads from the forward and reverse sequences. Any low-quality reads were filtered out using the clfilterseq command with settings --maxplowequal=0.1 --minqual=27, to remove positions with quality lower than Q27.
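One plausible reading of these two thresholds is that a read is discarded when more than 10% of its positions fall below Q27. The sketch below implements that interpretation for a single FASTQ quality string. It is an assumption made for illustration and does not reproduce the behaviour of clfilterseq itself; the function and parameter names are hypothetical, and Phred+33 quality encoding is assumed.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(quality_string, min_qual=27, max_low_fraction=0.1):
    """Keep a read only if the fraction of positions below min_qual is small."""
    scores = phred_scores(quality_string)
    if not scores:
        return False
    low = sum(1 for q in scores if q < min_qual)
    return low / len(scores) <= max_low_fraction


# Hypothetical quality strings: 'I' encodes Q40 and '#' encodes Q2.
print(passes_filter("IIIIIIIIII"))  # no low-quality positions
print(passes_filter("IIIIIIII##"))  # 20% low-quality positions
```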
Overlap was detected and the sequences from the two loci were merged using the EMBOSS programme (Rice et al. 2000). The merged sequences were aligned using MAFFT v.7.310-1, and phylogenetic analysis was conducted using the Maximum-Likelihood method in IQ-TREE (Nguyen et al. 2014). The best substitution model was selected using the ModelFinder Plus option (-m MFP) and the GTR+F+I+G4 model was identified as giving the best fit according to the Bayesian Information Criterion (BIC). Additionally, the ultrafast bootstrap approximation and Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT) were set to 1000 replicates to assess branch reliability (-bb 1000 and -alrt 1000). Three Collembola sequences (accession nos.: JN970939.1, MF916630.1 and KY829298.1) were included as outgroups. Phylogenetic analysis using the neighbour-joining method (Saitou and Nei 1987) was also conducted with MEGA-X under the Jukes-Cantor substitution model. The consensus trees were visualised and edited using FigTree v.1.4.3 (Rambaut 2016). A BLAST search was also conducted using reference sequences in GenBank.
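For readers unfamiliar with the Jukes-Cantor model, it corrects the observed proportion of differing sites, p, for multiple substitutions at the same site using d = -(3/4) ln(1 - (4/3)p). The following is a minimal sketch of that formula for one pair of aligned sequences. The function names and toy sequences are hypothetical, and the code is not the MEGA-X implementation; sites with gaps or ambiguous bases are simply skipped here.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring gaps and ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once p reaches saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences differing at 1 of 10 sites (p = 0.1).
print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT"), 4))
```

The corrected distance is slightly larger than the raw proportion of differences, and the gap between the two grows as sequences become more divergent.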

Results and Discussion
We redesigned two primer sets (LCO1490-COmfd_R and COmfd_F-HCO2198) that amplified the DNA barcoding region of the mtDNA COI region of insect taxa (see Table 1).
The original barcoding primers for the mtDNA COI region are unsuitable for NGS-based analysis due to their excessive sequence length. Therefore, we modified the original primers (Folmer et al. 1994) and designed internal primers such as mlCOIintF/R (Leray et al. 2013) to amplify lengths appropriate for NGS-based analysis. The sequence reads were divided into two parts: a 319-bp fragment from the PCR amplification products of LCO1490-COmfd_R and a 397-bp fragment from COmfd_F-HCO2198. Our PCR amplification trial and sequence analysis indicated that our modified primer sets successfully amplified the mtDNA COI region, demonstrating the effectiveness of these primer sets for taxonomic, phylogeographic and conservation genetic studies of insect taxa.

Table 3. Success rate of sequence analysis.
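The 319-bp and 397-bp fragment lengths follow directly from the coordinates given for the two halves of the barcoding region (1-319 bp and 262-658 bp). The short sketch below, with hypothetical variable and function names, checks that arithmetic and the size of the region shared by both fragments.

```python
# Coordinates (1-based, inclusive) of the two halves of the COI barcoding region.
first_half = (1, 319)
second_half = (262, 658)


def span_length(span):
    """Length of an inclusive coordinate span."""
    return span[1] - span[0] + 1


def overlap_length(a, b):
    """Number of positions shared by two inclusive spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


shared = overlap_length(first_half, second_half)
combined = span_length(first_half) + span_length(second_half) - shared

print(span_length(first_half), span_length(second_half), shared, combined)
```

The two fragments share 58 positions (262-319), and together they cover the full 658-bp barcoding region, which is what allows them to be merged into a single sequence.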

Universality and efficiency of the modified primer sets
Using the modified COI primer sets for DNA barcoding, we conducted PCR to amplify samples from 96 species, encompassing 48 families and 11 orders and performed NGS sequencing analysis.These primers successfully amplified and sequenced the target mtDNA COI regions of 80 species from 41 families in 11 orders.Notably, the primer sets had high success rates for Coleoptera, Lepidoptera, Orthoptera and Odonata (Table 3).Despite moderate sequencing success rates and limited specimen samples, the primer sets also showed promise as effective barcoding primers for Hymenoptera, Hemiptera and Diptera.
There was a pronounced bias in the number of reads between LCO1490-COmfd_R and COmfd_F-HCO2198 (Table 2). The high degree of variability in the mtDNA COI region (Leray et al. 2013) suggests that this bias may be due to sequence variability at the primer annealing sites.
To assess whether the modified barcoding primer sets could differentiate various insect taxa, we conducted phylogenetic analysis. The resulting phylogenetic tree showed that related insect taxa clustered within the same lineages (Fig. 2, Suppl. material 3). However, three orders were paraphyletic: Coleoptera, Hemiptera and Orthoptera (Fig. 2). The phylogenetic relationships amongst orders cannot be fully elucidated using a single locus, particularly when only short fragment sequences are available. Takenaka et al. (2023) also reported paraphyly of Coleoptera and Hemiptera. Therefore, we conclude that these phylogenetic results are not major issues within the scope of this study. Moreover, our results suggest that we obtained accurate sequences, as related species were identified as candidate sequences in BLAST searches (Suppl. material 4), and the primer sets appear to be suitable for insect barcoding analyses. We also directly compared additional Chironomid NGS assemblies, whose DNA template libraries were similar, with the sequence data from the Chironomid DNA Barcode Database (https://www.nies.go.jp/yusurika/en/contents/search.php). Of the 16 Chironomid samples subjected to NGS sequencing analysis, we obtained complete mtDNA COI assemblies from 13 (Suppl. material 5). Comparing these 13 Chironomid NGS assemblies against the database, we detected no assembly errors (Suppl. material 5).

Future utilisation of the modified COI primer sets
The goal of our research was to modify the existing mtDNA COI primer set established by Folmer et al. (1994) for use in NGS sequencing analysis to enhance DNA barcoding reference databases. The original primer set of Folmer et al. (1994) is foundational to the DNA barcoding system and has been widely used in taxonomic, phylogeographic and conservation genetic studies. Recently, Leese et al. (2020) developed new primer sets for the mtDNA COI region tailored for NGS-based analysis. However, the sequences obtained with these new primers were shorter than those produced by the original primer set, leading to concerns that they might not be as effective for enhancing DNA barcoding references. In this study, we modified the original primer set of Folmer et al. (1994). The modified primer sets are anticipated to greatly enhance DNA barcoding references, although these primers are not compatible with some insect taxa, as indicated in Table 2. Shokralla et al. (2015) also developed NGS-based universal primer sets for the mtDNA COI region; however, their success rate was ca. 70% for all insect taxa. Although the current study examined somewhat limited specimen samples, the success rate was 80% (Table 3). Notably, the success rate for Coleoptera was significantly higher than that reported by Shokralla et al. (2015), whereas our success rate was similar to that of Liu et al. (2013). However, because the primer sequences differ, the modified primer sets should be useful as supplemental primer sets for those of Liu et al. (2013) and Shokralla et al. (2015).
Advances in NGS systems have led to various NGS applications in ecological studies. Suyama et al. (2021) introduced the multiplexed phylogenetic marker sequencing (MPM-seq) technique, which enables the simultaneous acquisition of genetic information using multiple primer sets. Takenaka et al. (2023) developed innovative primer sets targeting the mtDNA 16S and 12S rRNA regions, contributing to the discovery of cryptic species and previously undescribed species. As these primer sets generate short sequences, they are also suited for NGS-based analysis (Takenaka et al. 2023). Thus, using our modified primer sets for mtDNA COI together with the primer sets for 16S rRNA and 12S rRNA in conjunction with MPM-seq allows more comprehensive taxonomic, phylogeographic and conservation genetic studies.
The mtDNA COI region is considered challenging for designing new NGS-based primer sets due to the high polymorphism rate (Deagle et al. 2014, Takenaka et al. 2023). This complexity has led to low sequencing success rates for insect taxa such as Hymenoptera and Hemiptera (Table 3). In this study, we designed primers manually by visualising primer annealing sites for 33 insect taxa (Suppl. material 1), without performing in silico analysis. We anticipate that further modified primer sequences will enhance the success rates of PCR amplification and sequencing analysis. Nevertheless, due to the presence of mixed bases in the primer sequences, which may lead to the amplification of non-target loci, the primer sequences must be redesigned with caution.
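The caution about mixed bases can be made concrete by counting how many distinct oligonucleotides a degenerate primer represents: each mixed base multiplies the number of sequences in the primer pool. The sketch below computes this degeneracy and lists the expanded sequences. The function names and the example primer are hypothetical and do not correspond to any primer in Table 1.

```python
from itertools import product

# Bases represented by each standard IUPAC nucleotide code.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    fold = 1
    for base in primer.upper():
        fold *= len(IUPAC_BASES[base])
    return fold


def expand(primer):
    """List every non-degenerate sequence encoded by the primer."""
    choices = (IUPAC_BASES[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical primer with one H (A/C/T) and one R (A/G).
example_primer = "GGHCAACARATCA"
print(degeneracy(example_primer))
print(expand(example_primer))
```

Even two mixed bases yield a pool of six different oligonucleotides in this example. Degeneracy grows multiplicatively with each added mixed base, which is one reason why adding further mixed bases can increase the risk of off-target amplification.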

Figure 2. The Maximum-Likelihood (ML) phylogenetic tree based on the mtDNA COI region, constructed from sequence reads obtained with the modified primer sets. The numbers on the major branches represent support values from the ultrafast bootstrap replications and SH-aLRT methods, respectively. The horizontal scale bar under the tree represents evolutionary distance between specimen taxa. Headers of scientific names represent abbreviations of the order names (e.g. Col. and Hym.).
