The first complete mitochondrial genome of Matsucoccidae (Hemiptera, Coccoidea) and implications for its phylogenetic position

Abstract The mitochondrial genome (mitogenome) has been extensively used to better understand the phylogenetic relationships within the hemipteran suborder Sternorrhyncha, but sequenced mitogenomes remain unavailable for the entire family Matsucoccidae to date. To address this, here we sequenced the complete mitogenome of Matsucoccusmatsumurae; the first for this family. The mitogenome is 15,360 bp in size and comprises the typical set of 37 mitochondrial genes and a large non-coding region (AT-rich region). Gene order, nucleotide composition and codon usage of protein-coding genes (PCGs) of M.matsumurae differ considerably from those of the other two sequenced Coccidae species. All PCGs were initiated by the ATN start codons and ended with the TAA/G or single T-- stop codons. Nine transfer RNA genes could be folded into typical clover-leaf secondary structures. The length and AT content of the ribosomal RNA genes are highly conserved in the Coccoidea mitogenomes. In contrast, the AT-rich control region is highly variable in size and in the number of tandem repeats. The sliding window analysis showed that the cox1 gene is the most conserved amongst the 13 PCGs, while the ratios of non-synonymous to synonymous substitution rates indicated that the evolution of this mitogenome has been dominated by positive selection. Phylogenetic analyses, based on nucleotide sequence data of 37 mitochondrial genes and amino acid sequence data of 13 PCGs using Bayesian Inference and Maximum Likelihood methods, showed that Matsucoccidae diverged before the Coccidae.


Introduction
The family Matsucoccidae was erected in 1984 by Koteja (Koteja 1984, Koteja 1990) within the superfamily Coccoidea. It belongs to the Archaeococcoidae clade of the infraorder Coccomorpha, suborder Sternorrhyncha, order Hemiptera. So far, a total of 38 species in two genera of Matsucoccidae have been recorded worldwide (García Morales et al. 2018). Except for six fossil species described from Baltic amber, all 32 extant species exclusively occur on Pinus spp. in the Holarctic and Neotropical Regions (García Morales et al. 2018). Amongst them, some species are amongst the most destructive pests of pine trees, causing significant economic losses (McKenzie et al. 1948, Furniss and Carolin 1977, Foldi 2004, Lim et al. 2013, Liu et al. 2014. Based on the fossil evidence and morphological characteristics, Matsucoccidae is considered to be one of the most primitive families amongst the archaeococcoids and may even represent the most ancient of all scale insects (Coccoidea) (Beardsley 1968, Foldi 2004, Koteja and Azar 2008, Hodgson and Hardy 2013). However, this lacks the support of molecular data.
In insects, the mitochondrial genome (mitogenome) is typically a covalently closed circular double-stranded DNA molecule, usually 15 ~ 18 kb in length, that encodes 37 genes, including 13 protein-coding genes (PCG), two ribosomal RNA genes (rRNA) and 22 transfer RNA genes (tRNA) (Boore 1999, Cameron 2014). In addition, the mitogenome usually includes a non-coding region of variable length that plays a regulatory role in the transcription and replication, namely, the mitochondrial control region (AT-rich region) (Clayton 1982, Boore 1999. The mitogenome, in whole or part, has been widely used as a molecular marker to study population genetics, phylogeny and genetic evolution of insects (Hu et al. 2019).
Currently (23 April 2022), there are no available mitogenomes for Matsucoccidae. In this study, we sequenced, annotated and analysed the mitogenome of Matsucoccus matsumurae (the type species of Matsucoccidae) in detail. Furthermore, we sampled all sternorrhynchan superfamilies to investigate the phylogenetic position of Matsucoccidae and also provide insight into the superfamily-level phylogenetic relationships within the suborder Sternorrhyncha.

Sample collection and DNA extraction
Adult specimens of M. matsumurae (Fig. 1) were collected from Pinus massoniana in Guizhou Province, China (Suiyang County, 107°2′52.06″E, 27°52′10.02″N, September 2021). All fresh specimens were preserved in 100% ethyl alcohol immediately after collection in the field and deposited at -20°C in the laboratory of Guizhou Academy of Forestry, Guiyang, Guizhou. Identification of the specimens was based on morphological characteristics (Li 2016). The total DNA was extracted from thoracic muscles using the Biospin Insect Genomic DNA Extraction Kit (BioFlux) following the manufacturer's instructions. Voucher specimens are stored in the entomological collection of the Guizhou Academy of Forestry.

Mitogenome sequencing and assembly
The whole genomic DNA of M. matsumurae was sequenced using next-generation sequencing (Illumina HiSeq X10, Biomarker Technologies Corporation, Beijing, China). About 2.13 Gb clean data were assembled into a complete circular mitogenome using NOVOPlasty v.4.3.1 (Dierckxsens et al. 2017) with the cox1 gene of Saissetia coffeae ) as the seed sequence. The first complete mitochondrial genome of Matsucoccidae (Hemiptera, Coccoidea) ...

Mitogenome annotation and analyses
The annotation of mitogenome was conducted using Geneious Prime v.2022.0.1 (Biomatters, Auckland, New Zealand). The locations and sequences of tRNA genes were determined by the MITOS Web Server (http://mitos2.bioinf.uni-leipzig.de/index.py) (Donath et al. 2019) and comparison with homologous mitogenome sequences. Secondary structures of tRNAs were plotted using Adobe Illustrator CC2017 according to MITOS results and manual predictions (following the expected cloverleaf secondary structure of tRNA genes). Two rRNA genes and 13 PCGs were identified, based on their alignments with the other scale insect (Coccoidea) mitogenome sequences. The mitogenomic circular map was depicted with the help of OrganellarGenomeDRAW (OGDRAW) (https:// chlorobox.mpimp-golm.mpg.de/OGDraw.html) (Greiner et al. 2019).
The organisation tables, nucleotide composition and relative synonymous codon usage (RSCU) of the mitogenomes of Coccoidea species were calculated and produced using PhyloSuite v.1.2.2 . The nucleotide diversity (Pi) analyses of 13 PCGs and two rRNA genes of the Coccoidea species and a sliding window analysis (a sliding window of 200 bp and step size of 20 bp), were conducted by DnaSP v.6.0 (Rozas et al. 2017). The non-synonymous substitution rates (Ka), synonymous substitution rates (Ks) and ratios of non-synonymous to synonymous substitution (Ka/Ks) for each of the concatenated 13 PCGs of the Coccoidea mitogenomes were also calculated by DnaSP v. 6.0. Tandem repeats of the control region were identified with the Tandem Repeats Finder server (https://tandem.bu.edu/trf/trf.html) (Benson 1999).

Phylogenetic analysis
A total of 34 mitogenomes from five superfamilies of Hemiptera were used for the phylogenetic analyses (Table 1). Of these, 32 species belong to all four superfamilies of Sternorrhyncha (the ingroup), while the remaining two species from the superfamily Fulgoroidea were chosen as outgroups. Nucleotide sequences (without stop codons) for the 13 PCGs were aligned using MAFFT v.7 (Katoch and Standley 2013) with the L-INS-i (accurate) strategy and codon alignment mode (Code table: Invertebrate mitochondrial genetic codon), rRNA and tRNA gene sequences were aligned using MAFFT v.7 (Katoch and Standley 2013) with the G-INS-I algorithm (which takes account of the secondary structure of rRNA and tRNA genes) and normal alignment mode and amino acid sequences of 13 PCGs were aligned using the -auto strategy and normal alignment mode. Ambiguously aligned areas were removed using Gblocks v.0.91b (Talavera and Castresana 2007), respectively. Individual gene alignments were concatenated using PhyloSuite v.1.2.2 . For amino acid sequence data, Bayesian Inference (BI) phylogenetic analysis were conducted using PHYLOBAYES MPI v.1.5c (Lartillot et al. 2013) in the CIPRES Science Gateway (Miller et al. 2010) which employs the siteheterogeneous model CAT + GTR. Two independent Markov Chain Monte Carlo (MCMC) chains were run and the analysis was stopped when the two runs had satisfactorily converged (maxdiff. fell below 0.3). A consensus tree was computed from the remaining trees combined from two runs after the initial 25% trees from each MCMC chain run were discarded as burn-in. For nucleotide sequence data, the best partitioning scheme and nucleotide substitution models for Maximum Likelihood (ML) and BI phylogenetic analyses were selected with PartitionFinder v.2 (Lanfear et al. 2017) using the Bayesian Information Criterion (BIC) (Suppl. materials 1, 2). ML analyses were conducted using IQ-TREE v.1.6.3 (Nguyen et al. 2015) under the ultrafast bootstrap (UFB) approximation approach (Minh et al. 2013) with 5,000 replicates. BI analysis was performed using MrBayes v.3.2.7a (Ronquist et al. 2012) in the CIPRES Science Gateway server (Miller et al. 2010) with four chains (one cold chain and three hot chains). Two independent runs of 5,000,000 generations were carried out with sampling every 1,000 generations. The first 25% of trees were discarded as burn-in. After the average standard deviation of split frequencies fell below 0.01, stationarity was assumed. Mitogenomes of the 34 Hemiptera insects used in this study.

Mitogenome organisation and nucleotide composition
The complete mitogenome of M. matsumurae is a closed circular double-stranded DNA molecule (Fig. 2, Suppl. material 3), 15,360 bp in length. This is medium-sized amongst the available mitogenomes of Coccoidea: from 15,143 bp for Didesmococcus koreanus to 15,389 bp for Saissetia coffeae , Xu et al. 2021). The newly-sequenced mitogenome encodes 37 genes (13 PCGs, 22 tRNAs and two rRNAs) and an A+T-rich region (control region), which is a typical architecture for bilaterian animal mitogenomes (Boore 1999). The gene order of M. matsumurae (Matsucoccidae) differs considerably from that of the other two Coccidae species , Xu et al. 2021), but it is comparatively similar to gene orders of most Cimicomorpha, Fulgoroidea, Membracoidea, Psylloidea and Aphidoidea species (only the tRNA cluster of trnW-trnC-trnY was rearranged to trnW-trnY-trnC) (Fig. 3). Amongst the three scale insect species, twentythree genes (nine PCGs and 14 tRNAs) are encoded on the majority strand (H strand) and the remaining 14 genes (four PCGs, eight tRNAs and two rRNAs) are encoded on the minority strand (L strand) (Fig. 3). A total of 85 overlapping nucleotides were found in nine pairs of neighbouring genes, with the longest identified overlap between the trnN and trnS1 (22 bp). Furthermore, there are 193 intergenic nucleotides dispersed across 16 gene boundaries. The longest intergenic region (103 bp) is located between cox3 and trnG.
The overall nucleotide composition of the M. matsumurae mitogenome is 47.2% A, 6.4% C, 0.6% G and 45.8% T. It therefore exhibits a strong AT bias of 91.1%, which is higher than in other Coccoidea insects (82.5% for D. koreanus and 84.7% for S. coffeae) ( Table 2) (Lu et al. 2020, Xu et al. 2021). The PCGs have the lowest AT content (90.3%) and the control region has the highest (92.9%), which differs from all previously-sequenced scale insects (Table 2) , Xu et al. 2021). The new mitogenome exhibits negative GCskews (-0.267) and positive AT-skews (0.056), which is common for scale insects (Table 2) and the Hemiptera in general (Wang et al. 2015.

Nucleotide diversity and selection pressures
Nucleotide diversity (Pi value) of 13 PCGs and two rRNAs amongst the three scale insect species is shown in Fig. 4 (Fig. 4). The lowest nucleotide diversity in cox1 is in agreement with observations in most sequenced insect mitogenomes (Ma et al. 2019, Ge et al. 2021, Zhang et al. 2022, which presumably accounts for its ability to provide reliable species identification. The non-synonymous/synonymous (Ka/Ks) substitution ratio can be used to estimate whether a sequence is undergoing purifying, neutral or positive selection. The rates of nonsynonymous (Ka) and synonymous substitutions (Ks) and their ratio (Ka/Ks) were calculated for the 13 PCGs of each of the three scale insect species using Aphis craccivora as the reference sequence (Fig. 5). A value of Ka/Ks greater than 1 implies that a gene is evolving predominantly under positive selection. This indicates that non-synonymous mutations are favoured by the Darwinian selection, i.e. that they are retained at a rate greater than synonymous mutations. All Ka/Ks values were above 1, which strongly suggests the presence of positive selection in these species.

Protein-coding genes
The 13 PCGs (length: 10,626 bp) account for 69.2% of the complete mitogenome of M. matsumurae. All PCGs were initiated by the typical start codon ATN (ATA/T/G/C) and ended with the TAA/G stop codon or their incomplete form T-. This is almost identical to D. koreanus (Xu et al. 2021) and S. coffeae . Such incomplete stop codons are common in insects and believed to be completed by post-transcriptional polyadenylation (Ojala et al. 1981). The AT-skews of PCGs were similar (-0.115 to -0.072) amongst the three available scale insects (Table 2). In M. matsumurae, the 13 PCGs encode a total of 3,531 amino acids, amongst which the most frequently used are Leu, Met, Ile and Phe Gene orders in Hemiptera mitogenomes. With the exception of Coccoidea (Didesmococcus koreanus, Matsucoccus matsumurae and Saissetia coffeae), each superfamily (plus the infraorder Cimicomorpha) is represented by one species. Genes shown with "-" signs are located on the minor strand (L-strand), while others are located on the major strand (H-strand). Table 2.
(accounting for 56.65% of the total amount), whereas Cys is the least frequently used (0.34%) (Fig. 6). These amino acid usage patterns are very consistent in Coccoidea species , Xu et al. 2021. Relative synonymous codon usage (RSCU) is summarised in Fig. 7. Results indicate that the four most frequently used codons are UUA (Leu), AUU (Ile), UUU (Phe) and AUA (Met). All of them are composed solely of A or U, which is reflected in the high A+T content of PCGs. Excluding TAA and TAG, 61 codons were observed in D. koreanus (no CGC), 46 in M. matsumurae (no CUG, GUC, GUG, CCC, CCG, ACG, GCC, GCG, CAG, UGC, CGC, CGG, AGC, AGG, GGC and GGG), and 56 in S. coffeae (no CUC, ACG, GCG, UGC, CGC and GGC) , Xu et al. 2021.

Transfer RNA genes
All three sequenced scale insect mitogenomes encode 22 tRNA genes (Suppl. material 3) , Xu et al. 2021. The AT content of tRNA genes is slightly higher than that of the PCGs, ranging from 87.9% (D. koreanus) to 92.3% (M. matsumurae) ( Table 2). Furthermore, the AT-skew values of tRNAs were all greater than zero (0.043 to 0.054). The length of the 22 tRNA genes ranged from 46 bp (trnS1 of D. koreanus) to 78 bp (trnS1 of M. matsumurae) (Suppl. material 3) , Xu et al. 2021. Most tRNAs lack either the DHU or TψC arm and some even lost both DHU and TψC arms (Fig. 8) , Xu et al. 2021. Only ten tRNAs of D. koreanus, nine of M. matsumurae and 11 of S. coffeae could be folded into the common clover-leaf secondary structures (Fig. 8) , Xu et al. 2021. This suggests that the reduction of DHU or TψC arms of tRNA genes could be a very common phenomenon in the mitogenomes of scale insects. The anticodons of tRNA genes are identical amongst the three scale insects (Fig. 8) , Xu et al. 2021, except for the trnE of D. koreanus and trnK of S. coffeae, which employ UCG and UUU instead of UUC and CUU respectively.  Percentages of each amino acid within the three scale insects. The stop codon is not included.

Ribosomal RNA genes
The AT nucleotide content of rrnS and rrnL genes of the three scale insects ranges from 86.2% to 92.7% (Table 2). Two rRNAs in these three mitogenomes show a negative ATskew (-0.099 to -0.027) ( Table 2). The rrnL gene, located between trnL1 and trnV, ranged in length from 1,161 bp to 1,177 bp and rrnS gene, located between trnV and trnI (M. matsumurae) or trnM (D. koreanus and S. coffeae), ranged from 713 bp to 804 bp in length (Fig. 3, Suppl. material 3) , Xu et al. 2021. Therefore, the length and AT content of rRNAs are conserved in the Coccoidea.

AT-rich region
The AT-rich region is believed to be involved in regulating the transcription and replication of DNA in insects (Clayton 1982, Cameron 2014. The AT-rich region of M. matsumurae is located between rrnS and trnI, whereas the AT-rich regions of D. koreanus and S. coffeae are located between rrnS and trnM and the size of three scale insects ranges from 1,271 bp to 2,261 bp (Fig. 3, Suppl. material 3) , Xu et al. 2021. Analyses of the AT-rich regions by the Tandem Repeats Finder indicated that M. matsumurae and S. coffeae have different numbers of absolute tandem repeat units. Two types of absolute tandem repeats were present in M. matsumurae (nucleotide positions: 7 to 1,023 and 1,026 to 1,373 of the AT-rich region). The AT-rich region of S. coffeae had only one kind of absolute tandem repeat, located between the nucleotide positions 1,096 to 1,533 (Fig. 9). Like in most insect mitogenomes, tandem repeats are common and the size of tandem repeat regions varies depending on the number of copies of the repeating units (Garey and Wolstenholme 1989). Tandem repeats are thought to play an important role in the control of DNA methylation, gene transcription and replication (Zhang et al. 1995, Huang et al. 2013.

Phylogenetic relationships
Phylogenetic analyses of 32 species of Sternorrhyncha, including two outgroups, based on ML and BI analyses of nucleotide sequence data of 37 mitochondrial genes, yielded largely congruent topologies, with most branches receiving strong support (Figs 10, 11). As proposed by previous studies , Xu et al. 2021, the relationships amongst sternorrhynchan superfamilies are inferred as ((Coccoidea + Aphidoidea) + (Aleyrodoidea + Psylloidea)).