Genomic microsatellite characteristics analysis of Dysomma anguillare (Anguilliformes, Dysommidae), based on high-throughput sequencing technology

Abstract Microsatellite loci were screened from the genomic data of Dysomma anguillare and their composition and distribution were analysed by bioinformatics for the first time. High-throughput sequencing yielded 4,060,742 scaffolds with a total length of 1,562 Mb, and MISA screening identified 1,160,104 microsatellite loci distributed on 770,294 scaffolds. The occurrence frequency and relative abundance were 28.57% and 743/Mb, respectively. Amongst the six perfect microsatellite types, dinucleotide repeats accounted for the largest proportion (592,234, 51.05%), the highest occurrence frequency (14.58%) and the largest relative abundance (379.27/Mb). A total of 1,488 distinct repeat motifs were detected in the genome of D. anguillare, amongst which hexanucleotide motifs were the most numerous (608), followed by pentanucleotide (574), tetranucleotide (232), trinucleotide (59), dinucleotide (11) and mononucleotide motifs (4). The abundance of microsatellites of the same repeat type decreased as the copy number increased. Amongst the six types of nucleotide repeats, the predominant motifs were A (191,390, 43.77%), CA (150,240, 25.37%), AAT (13,168, 14.05%), CACG (2,649, 8.14%), TAATG (119, 19.16%) and CCCTAA (190, 7.65%), respectively. The data obtained in this study on the number, distribution and abundance of different types of microsatellites in the genome of D. anguillare lay a foundation for the future development of high-quality microsatellite markers for this species.


Introduction
The shortbelly eel (Dysomma anguillare Barnard, 1923) is a small warm-water eel that is widely distributed in the Indian Ocean and the western Pacific Ocean (Nelson et al. 2016). In China, it is also one of the dominant bycatch species in the offshore waters of the southern East China Sea (Zhao et al. 2016). As an intermediate to high trophic-level species in coastal food webs, it is of great significance to the offshore marine ecosystem and its biodiversity. However, the limited studies of D. anguillare have mainly focused on nutrition and feeding habits (Zhang and Tang 2003), the spatial-temporal pattern of community structure (Liu and Xian 2009) and the effects of lipid removal on stable isotopes (Yang et al. 2020).
Well-characterised germplasm genetics of fishery species are considered an indispensable prerequisite for effective fisheries management (Hemmer-Hansen et al. 2018). However, the available genetic data for this species are still scarce and only partial mitochondrial and nuclear gene sequences have hitherto been reported and analysed (Chen et al. 2014, Chang et al. 2016, Wang et al. 2019). Microsatellite DNA, also named simple sequence repeats (SSRs), consists of short tandem duplications (typically 1-6 nucleotide repeats and mostly less than 100 bp in length) that occur ubiquitously in eukaryotic organisms. In addition, the number of repetitions varies drastically amongst different genotypes of the same species (Tautz and Renz 1984). Co-dominant microsatellite molecular markers, based on polymerase chain reaction (PCR) techniques, have clear advantages in high polymorphism, good repeatability, simple operation and low experimental cost. They are therefore of considerable applied value in gene mapping and QTL analysis, population genetics and evolutionary research, as well as molecular marker-assisted breeding (Messier et al. 1996, Schlötterer 2000). At present, the conventional strategies for developing representative microsatellite loci mainly include the anchored-PCR-based method, the selective hybridisation enrichment method, database searching and selection from related species (Sun et al. 2009). Nevertheless, these approaches are not only time-consuming and expensive, but also capture an incomplete picture of microsatellite distribution and yield a limited number of molecular markers.
In recent years, with the rapid progress of high-throughput sequencing (HTS) technology and the reduction of sequencing cost, developing numerous highly polymorphic SSR markers from multi-omics data has become increasingly convenient. In this study, genome-wide sequences of Dysomma anguillare were obtained for the first time using the HiSeq 4500 platform, and the distribution and characteristics of SSR loci were analysed with bioinformatics tools. The findings provide useful references and basic information for germplasm resource conservation, population genetic evaluation and phylogenetic analysis amongst related species of Anguilliformes.

Sample collection and genomic DNA extraction
Fifty-three specimens of Dysomma anguillare were collected by trawling in the coastal waters of Zhoushan, Zhejiang Province, in September 2022. After preliminary morphological identification, muscle tissues from five male and five female individuals were randomly selected for genomic DNA extraction by the traditional Tris-saturated phenol method (Maniatis et al. 1982). Subsequently, DNA barcoding, based on the mitochondrial COI sequence, was conducted to confirm the species identification. The integrity and purity of the genomic DNA were checked by 1% agarose gel electrophoresis and with a NanoDrop 2000 ultraviolet spectrophotometer (Thermo Fisher Scientific, USA), respectively. The DNA samples were stored at -20°C for further analysis.

Library construction and high-throughput sequencing
Equal amounts of DNA (2 μg each) were mixed for library construction and next-generation sequencing by Onemore Technology (Wuhan) Co., Ltd. The genomic DNA was randomly fragmented into 200 to 350 bp fragments using a Covaris Ultrasonic Processor. Two paired-end DNA libraries were constructed through terminal repair, addition of poly-A tails and sequencing adapters, purification and PCR amplification, and were then sequenced using the Illumina HiSeq 4500 sequencing technology.

Sequence cleaning and genome assembly
Raw data output from the Illumina platform were first transformed into sequence reads by base calling and recorded in FASTQ format. Subsequently, clean reads were obtained after filtering adaptor sequences and low-quality reads with Cutadapt v.1.16 (Martin 2011). SOAPdenovo v.2.04, which employs a de Bruijn graph-based assembly strategy (Kajitani et al. 2014), was used to assemble the clean data with the parameters "-K 53 -R -M 3 -d 1". First, reads sequenced from the small-fragment library were divided into smaller substrings (K-mers) to construct a preliminary de Bruijn graph. Then, a simplified de Bruijn graph was obtained by removing low-coverage branches and branches that could not be extended further because of sequencing errors, and the sequences were truncated at every bifurcation locus to obtain the initial contigs. Finally, by mapping the paired-end reads back to the contigs, the connectivity between reads and the insert-size information were used to assemble the contigs into scaffolds and obtain the primary genomic sequence.
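The graph-based steps described above (splitting reads into K-mers, removing low-coverage branches and truncating at bifurcations) can be illustrated with a minimal Python sketch. It is a toy illustration only and not SOAPdenovo's implementation: the function names are hypothetical, reverse complements and paired-end scaffolding are ignored, and a small k is used instead of the K = 53 applied in this study.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Count k-mer edges between (k-1)-mer nodes across all reads."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def prune_low_coverage(edges, min_cov):
    """Drop edges seen fewer than min_cov times (likely sequencing errors)."""
    return {edge: cov for edge, cov in edges.items() if cov >= min_cov}


def contigs_from_graph(edges):
    """Spell out non-branching paths, stopping at every bifurcation.

    Isolated cycles made only of non-branching nodes are not reported.
    """
    out_nbrs = defaultdict(list)
    in_deg = defaultdict(int)
    for u, v in edges:
        out_nbrs[u].append(v)
        in_deg[v] += 1
    nodes = sorted(set(out_nbrs) | set(in_deg))

    def is_simple(node):
        # Exactly one incoming and one outgoing edge: safe to walk through.
        return in_deg[node] == 1 and len(out_nbrs[node]) == 1

    contigs = []
    for u in nodes:
        if is_simple(u):
            continue  # paths only start at branch points or path ends
        for v in list(out_nbrs[u]):
            path = u + v[-1]
            while is_simple(v):
                v = out_nbrs[v][0]
                path += v[-1]
            contigs.append(path)
    return contigs
```

For example, `contigs_from_graph(build_de_bruijn(["ATGGCGT", "ATGGCAA"], 3))` splits the two reads at the node where they diverge; neighbouring contigs overlap by k-1 bases at that node.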

Screening and identification of SSRs
The MicroSatellite identification tool (MISA) (http://pgrc.ipk-gatersleben.de/misa/), written in Perl, was used to scan the assembled scaffolds to identify genome-wide microsatellite repeat units and to analyse the length, location and quantity of the SSRs (Thiel et al. 2003). The occurrence frequency of SSR loci, the average distribution distance and density of microsatellites, and the type and length of repeat motifs were calculated using Microsoft Excel 2019. The default parameters of MISA were set as follows: the repeat motif length was from 1 to 6 nucleotides and the minimum thresholds of repeat counts were 1-10, 2-6, 3-5, 4-5, 5-5 and 6-5, meaning that mononucleotide motifs had to be repeated at least 10 times, dinucleotide motifs at least 6 times and all remaining motif types at least 5 times. In addition, the number of bases interrupting two SSRs in a compound microsatellite had to be less than 100. Considering Watson-Crick complementarity and differences in base arrangement, repeat sequences and their complementary sequences were grouped together. For example, (AC)n, (CA)n, (TG)n and (GT)n were treated as the same SSR repeat type.
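The thresholds and the motif grouping described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not MISA itself: the names `find_ssrs` and `canonical_motif` are hypothetical, only perfect (non-compound) SSRs are reported, and the choice of the alphabetically smallest string as the group label is an arbitrary convention for this sketch.

```python
import re

# Minimum repeat counts per motif length, as listed in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeat_count) for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ACAC");
            # those loci are reported under the shorter motif length instead.
            if any(
                motif == motif[:d] * (unit_len // d)
                for d in range(1, unit_len)
                if unit_len % d == 0
            ):
                continue
            repeats = (match.end() - match.start()) // unit_len
            hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


def canonical_motif(motif):
    """Group a motif with its rotations and reverse-complement rotations."""
    rev_comp = motif.translate(_COMPLEMENT)[::-1]
    candidates = [s[i:] + s[:i] for s in (motif, rev_comp) for i in range(len(s))]
    return min(candidates)
```

With this grouping, `canonical_motif` maps AC, CA, TG and GT to the same label, so counts for these motifs can be pooled as in the text.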

Genome sequencing and assembly
The contig and scaffold statistics of the Dysomma anguillare genome are listed in Table 1. A total of 11,805,379 contigs with a total length of 1,960 Mb were obtained after splicing, with an average GC content of about 42.2%. The SOAPdenovo v.2.04 assembly produced 4,060,742 scaffolds with a total length of 1,561 Mb and an average GC content of 39.6%. The N50 value is a widely used metric for measuring the quality of the sequences output by assembly algorithms. It refers to the contig or scaffold length at which the accumulated fragment length (from long to short) exceeds 50% of the total length of all contigs or scaffolds for the first time. The greater the N50 value, the fewer the fragments and the better the assembly quality. In this study, the N50 values of the contig and scaffold assemblies were 272 bp and 709 bp, respectively. Compared with the assembled genomes of the related species Anguilla japonica (Henkel et al. 2012), A. anguilla (Jansen et al. 2017) and A. rostrata (Pavey et al. 2017), the assembly of Dysomma anguillare was relatively good, and microsatellite markers developed from it could reflect the genome-wide characteristics of SSRs.
Table 1. The contig and scaffold assembly results statistics.
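The N50 definition given above can be written as a short Python function. This is a minimal sketch with a hypothetical function name; it follows the common convention of returning the length at which the running total reaches or exceeds half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Accumulate fragment lengths from longest to shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

For a toy assembly with fragment lengths 10, 20, 30 and 40 bp, the two longest fragments already cover more than half of the 100 bp total, so the N50 is the length of the second-longest fragment.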

SSR repeat types and distribution
A total of 1,160,104 microsatellites with 1-6 bp nucleotide motifs were detected on 770,294 scaffolds, 234,959 of which contained more than one SSR locus, giving an occurrence frequency (total number of SSRs detected/total number of scaffolds) of 28.57%. The density of distribution (total length of scaffolds/total number of SSRs screened) was on average one SSR per 1.35 kb and the relative abundance (total number of SSRs screened/total length of scaffolds) was 743/Mb.
Dinucleotide repeats had the highest occurrence frequency (14.58%) and hexanucleotide repeats the lowest (0.06%). The relative abundance of dinucleotide repeats reached 379.27/Mb, with an average of one SSR locus per 2.64 kb, followed by mononucleotide repeats (280.00/Mb). By comparison, the relative abundance of hexanucleotide repeats was the lowest (1.59/Mb) (Table 2).
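The three summary statistics defined above follow directly from the counts reported in the text. The sketch below recomputes them in Python, with a hypothetical function name and the scaffold total taken as the rounded 1,561 Mb; per-type values computed this way may differ in the last digit from the reported ones because of that rounding.

```python
def ssr_summary(n_ssrs, n_scaffolds, total_length_mb):
    """Occurrence frequency (%), mean spacing (kb per SSR) and abundance (SSRs/Mb)."""
    return {
        "frequency_pct": 100 * n_ssrs / n_scaffolds,
        "kb_per_ssr": total_length_mb * 1000 / n_ssrs,
        "abundance_per_mb": n_ssrs / total_length_mb,
    }


# Genome-wide values reported in this study.
overall = ssr_summary(n_ssrs=1_160_104, n_scaffolds=4_060_742, total_length_mb=1561)
```

Rounded to the precision used in the text, `overall` reproduces the reported 28.57%, one SSR per 1.35 kb and 743 SSRs/Mb.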

Repeat numbers of different SSRs
The number of repeats at SSR loci mainly ranged from 5 to 24. The predominant repeat number was 10, accounting for 17.52% of all SSR loci.
In general, the number of SSRs of each repeat type decreased as the repeat number increased (Fig. 2). Mononucleotide, dinucleotide and trinucleotide repeats were mainly distributed in the ranges of 10-19 repeats (96.83%), 6-15 repeats (95.15%) and 5-9 repeats (85.34%), respectively. By contrast, the tetranucleotide, pentanucleotide and hexanucleotide repeats all had no more than 13 repeats and were mainly in the range of 5-8 repeats, which accounted for 92.40%, 96.70% and 99.56% of each type, respectively (Table 3).
In summary, the repeat numbers of SSR loci were mainly concentrated in the ranges of 10-15 and 5-8 repeats, with a total of 1,016,359 loci (87.61%). Few SSR loci with more than 25 repeats were identified and they consisted only of mononucleotide repeats.

Copy numbers of repeat units
Amongst the 1,488 repeat motifs detected, hexanucleotide repeats had the most motif types and pentanucleotide repeats the second most. Mononucleotide repeats had the fewest motif types, being limited by the number of bases (Table 4). The dominant repeat motifs in mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats were A (191,390, 43.77%), CA (150,240, 25.37%), AAT (13,168, 14.05%), CACG (2,649, 8.14%), TAATG (119, 19.16%) and CCCTAA (190, 7.65%), respectively.
Table 2. Proportions of each SSR repeat type in the genome of D. anguillare.

SSR length distribution and polymorphism evaluation
The sequence length of the different SSR types varied widely, from 10 to 54 bp (Fig. 4). The minimum and maximum variations in length were detected in hexanucleotide and mononucleotide repeats, respectively. The former ranged from 30 to 54 bp with a total length of 1,774 bp, while the latter ranged from 10 to 51 bp with a total length of 379,455 bp, which constituted approximately 49.14% of the total length of SSRs. Amongst the six types of nucleotide repeat, dinucleotide and trinucleotide were dominant in the distribution of microsatellites from the perspective of sequence length, totalling 677,805 bp and accounting for 87.78% of all SSRs. The length of a microsatellite is one of the main factors affecting its polymorphism. By length, microsatellites can be divided into the highly polymorphic type I (length ≥ 20 bp) and the moderately polymorphic type II (12 bp ≤ length < 20 bp). Microsatellites shorter than 12 bp have lower polymorphism and lower mutation potential. In the present study, there were 21,347 type I SSRs (19%) and 294,373 type II SSRs (54%). SSR loci with low mutation potential accounted for 27%.
Table 3. Distribution interval of the copy number in different microsatellite motifs for D. anguillare.
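The length classes used above (type I at 20 bp or longer, type II from 12 bp to under 20 bp, and shorter loci with low polymorphism) can be expressed as a small Python helper. This is an illustrative sketch; the function names and the "low" label are hypothetical and not taken from the study.

```python
from collections import Counter


def classify_ssr(length_bp):
    """Assign an SSR to a polymorphism class by its total length in bp."""
    if length_bp >= 20:
        return "type I"   # highly polymorphic
    if length_bp >= 12:
        return "type II"  # moderately polymorphic
    return "low"          # shorter than 12 bp


def class_shares(lengths):
    """Return the percentage of SSRs falling into each length class."""
    counts = Counter(classify_ssr(length) for length in lengths)
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}
```

Applied to the SSR lengths reported by a microsatellite scan, `class_shares` gives the kind of type I, type II and low-polymorphism percentages summarised in the text.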

Discussion
Number and relative abundance of microsatellites in the genome of Dysomma anguillare
Bioinformatics software was used to search and analyse the types and numbers of the six classes of perfect microsatellites in the genome of Dysomma anguillare. A total of 1,160,104 microsatellite loci were revealed across the 1.56 Gb genome sequence, with a total length of 24,707,980 bp (occupying 1.58% of the full genome length). In contrast to other published genomes of bony fishes, it was higher than that of Takifugu rubripes. Hancock (1996) speculated that the number of microsatellites increased with chromosome length, and the disproportional relationship between genome size and microsatellite number was also confirmed in our study.

Distribution characteristics of microsatellites in the genome of Dysomma anguillare
Microsatellites composed of 1-6 nucleotide repeats were found in the genome of Dysomma anguillare. Dinucleotide repeats were the most frequent, followed by mononucleotide repeats, while the percentages of SSRs containing 3-6 nucleotide repeats were no more than 10%. Therefore, priority should be given to dinucleotide repeats when designing SSR primers for D. anguillare. Mononucleotide and dinucleotide repeats are regarded as the most abundant types of SSRs in most species. It has been reported that mononucleotide repeats tend to dominate in the genomes of higher organisms (Gao and Kong 2005). However, dinucleotide repeats make up higher proportions in fish genomes, which is probably related to differences in gene expression and regulation.
The CA repeat motif was the most abundant amongst dinucleotide repeats, accounting for 25.37% of them, which was consistent with Scophthalmus maximus (Ruan 2009) and pufferfishes (Cui et al. 2006, Xu et al. 2021), but different from Ictalurus punctatus (Tang et al. 2022). GC repeat motifs were the least numerous. Base slippage might generate microsatellites more easily at a low melting temperature (Tm). The two hydrogen bonds between A-T base pairs are more likely to be broken than the three hydrogen bonds between G-C base pairs, resulting in fewer GC repeats (Huang et al. 2020). Other scholars have pointed out that the methylation of CpG might cause the spontaneous deamination of cytosine to thymine in order to maintain the thermodynamic stability of the DNA molecule. In this study, the proportion of GC repeat motifs was only 0.1% and, from this aspect, the lower GC content in the whole genome was also reflected in the small number of GC repeats (Schorderet and Gartler 1992).
The structural instability and composition of trinucleotide repeats are closely related to some genetic diseases in humans (Sinden et al. 2002). The AAT repeat motif was the most numerous of the trinucleotide repeats in the Dysomma anguillare genome, the same as in humans and other primates (Kelkar et al. 2008). Therefore, in-depth analysis of trinucleotide repeats would help to predict some gene loci associated with human diseases and thereby to reduce the occurrence of certain illnesses by changing gene expression.

Copy numbers and length variations in the genome of Dysomma anguillare
The repeat unit length was inversely proportional to the copy number of microsatellite DNA (Harr and Schlötterer 2000). Commonly, a higher SSR copy number means more alleles and richer polymorphism. The number of microsatellite repeats in the Dysomma anguillare genome was mainly in the range of 5 to 25. Motifs with more than 25 reiterations were very rare (only 2,712 SSRs) and all of them were mononucleotide repeats. Previous studies showed that the mutation rate of microsatellites was positively correlated with the copy number of the repeat motif (Wierdl et al. 1997) and that longer microsatellites were expected to have higher mutation rates owing to more chances of replication slippage (Calabrese and Sainudiin 2005). The results demonstrated that the number of SSRs decreased as the repeat number increased. In addition, tetranucleotide, pentanucleotide and hexanucleotide microsatellites might have higher mutation rates than mononucleotide, dinucleotide and trinucleotide microsatellites.
The length of microsatellites in the Dysomma anguillare genome was generally 10-18 bp and the number of microsatellites was inversely proportional to the repeat motif length. An analysis of microsatellite structure and characteristics in the parthenogenetic gastropod Melanoides tuberculata concluded that the longer the repeat sequence, the greater the selection pressure it experienced and the lower the number of repeats (Samadi et al. 1998).

Conclusions
In conclusion, MISA software was used for the first time to search and analyse six types of perfect microsatellite loci in the whole-genome survey data of Dysomma anguillare. The results showed that both the relative abundance and the density of the various microsatellite types were very high. Amongst the 1,160,104 SSR loci, the numbers of the different repeat types followed the order: dinucleotide > mononucleotide > trinucleotide > tetranucleotide > pentanucleotide > hexanucleotide. Their dominant repeat motifs were A, CA, AAT, CACG, TAATG and CCCTAA, respectively. The results supplement the genetic marker database of marine fishes and provide valuable information resources for further genetic analysis of D. anguillare.

Figure 1. Distribution of SSR repeat types in the genome of D. anguillare.

Figure 3. The distribution of microsatellite repeats in the genome of D. anguillare.

Figure 4. Length distribution of SSRs in D. anguillare. A SSR length distribution; B Distribution of SSR types (type I and type II).

Table 4. Dominant base types and their proportions in the genome of D. anguillare.