Biodiversity Data Journal :
Research Article
Corresponding author: Tianyan Yang (hellojelly1130@163.com)
Academic editor: Yahui Zhao
Received: 09 Jan 2023 | Accepted: 31 Mar 2023 | Published: 07 Apr 2023
© 2023 Ziyan Zhu, Yuping Liu, Shufei Zhang, Sige Wang, Tianyan Yang
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Zhu Z, Liu Y, Zhang S, Wang S, Yang T (2023) Genomic microsatellite characteristics analysis of Dysomma anguillare (Anguilliformes, Dysommidae), based on high-throughput sequencing technology. Biodiversity Data Journal 11: e100068. https://doi.org/10.3897/BDJ.11.e100068
Microsatellite loci were screened from the genomic data of Dysomma anguillare and their composition and distribution were analysed using bioinformatics for the first time. High-throughput sequencing yielded 4,060,742 scaffolds with a total length of 1,562 Mb, and MISA screening identified 1,160,104 microsatellite loci distributed on 770,294 scaffolds. The occurrence frequency and relative abundance were 28.57% and 743/Mb, respectively. Amongst the six perfect microsatellite types, dinucleotide repeats accounted for the largest proportion (592,234, 51.05%), the highest occurrence frequency (14.58%) and the largest relative abundance (379.27/Mb). A total of 1,488 microsatellite repeat motifs were detected in the genome of D. anguillare, amongst which hexanucleotide repeat motifs were the most numerous (608), followed by pentanucleotide repeat motifs (574), tetranucleotide repeat motifs (232), trinucleotide repeat motifs (59), dinucleotide repeat motifs (11) and mononucleotide repeat motifs (4). The abundance of microsatellites of the same repeat type decreased as the copy number increased. For the six repeat types, the predominant motifs were A (191,390, 43.77%), CA (150,240, 25.37%), AAT (13,168, 14.05%), CACG (2,649, 8.41%), TAATG (119, 19.16%) and CCCTAA (190, 7.65%), respectively. This study provides data on the number, distribution and abundance of the different types of microsatellites in the genome of D. anguillare, which lay a foundation for the future development of high-quality microsatellite markers for this species.
Dysomma anguillare, genome, microsatellite, high-throughput sequencing
Shortbelly eel (Dysomma anguillare Barnard, 1923) is a small-sized warm water eel that is widely distributed in the Indian Ocean and the western Pacific Ocean (
The explicit germplasm genetic characteristics of fishery species are considered to be the indispensable prerequisite for effective fisheries management (
In recent years, with the rapid progress of high-throughput sequencing (HTS) technology and the falling cost of sequencing, developing large numbers of highly polymorphic SSR markers from multi-omics data has become increasingly convenient. In this study, genome-wide sequences of Dysomma anguillare were obtained for the first time using the HiSeq™ 4500 platform, and the distribution and characteristics of the SSR loci were analysed with bioinformatics tools. The findings provide useful reference and baseline information for germplasm resource conservation, population genetic evaluation and phylogenetic analysis amongst related species of Anguilliformes.
Fifty-three samples of Dysomma anguillare were collected by trawling in the coastal waters of Zhoushan, Zhejiang Province in September 2022. After preliminary morphological identification, muscle tissues from five male and five female individuals were randomly selected for the genomic DNA extraction by the traditional Tris-saturated phenol method (
Equal amounts of DNA (2 μg each) were pooled for library construction and next-generation sequencing by Onemore Technology (Wuhan) Co., Ltd. The genomic DNA was randomly sheared into fragments of 200 to 350 bp using a Covaris ultrasonic processor. Two paired-end DNA libraries were constructed through end repair, addition of poly-A tails and sequencing adapters, purification and PCR amplification, and were then sequenced using Illumina HiSeq™ 4500 sequencing technology.
Raw data output from the Illumina platform were first converted into sequence reads by base calling and recorded in FASTQ format. Subsequently, clean reads were obtained after filtering out adaptor sequences and low-quality reads with Cutadapt v.1.16 (
The MIcroSAtellite identification tool (MISA) (http://pgrc.ipk-gatersleben.de/misa/), which is written in Perl, was used to scan the assembled scaffolds to identify genome-wide microsatellite repeat units and to analyse the length, location and number of the SSRs (
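The kind of scan described above can be illustrated with a short Python sketch. It is not MISA itself and does not reproduce its Perl implementation; it only shows how perfect repeats with 1-6 bp motifs can be located with regular expressions. The minimum repeat numbers (10 for mononucleotides, 6 for dinucleotides, 5 for the other types) are an assumption inferred from the smallest repeat numbers reported for each type in this study, and the names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
import re

# Assumed minimum repeat counts per motif length (not stated in the text).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """Return True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip e.g. "AA" or "ACAC", which belong to a shorter repeat type.
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    demo = "GGC" + "A" * 12 + "TCG" + "CA" * 7 + "GC" + "AAT" * 5 + "C"
    for start, motif, repeats in find_ssrs(demo):
        print(start, motif, repeats)
```

Running such a function over every scaffold and pooling the hits would give raw per-type counts of the sort summarised in the Results. Compound and imperfect repeats are not handled in this sketch.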
The information of contigs and scaffolds of the Dysomma anguillare genome was listed in the Table
| Assembly level | Total length (bp) | Number of sequences | Number of sequences ≥ 2 kb | Maximum length (bp) | N50 (bp) | N90 (bp) | GC content (%) |
|---|---|---|---|---|---|---|---|
| Contig | 1,960,673,378 | 11,805,379 | 30,667 | 9,646 | 272 | 60 | 42.2 |
| Scaffold | 1,561,530,495 | 4,060,742 | 95,727 | 23,878 | 709 | 134 | 39.6 |
N50 is a widely used metric of assembly quality. It is the contig or scaffold length at which the accumulated length, summed from the longest sequence to the shortest, first exceeds 50% of the total length of all contigs or scaffolds. The greater the N50 value, the fewer the sequences needed to cover half of the assembly and the better the assembly quality. In this study, the N50 values of the contig and scaffold assemblies were 272 bp and 709 bp, respectively. Compared with the assembled genomes of related species Anguilla japonica (
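As an illustration of the N50 definition given above, the following minimal Python sketch computes it from a list of sequence lengths. The function name `nx` and the toy lengths are hypothetical, and the sketch uses the common convention that the threshold is met when the running total reaches (rather than strictly exceeds) the chosen fraction of the total length.

```python
def nx(lengths, fraction=0.5):
    """Return the length at which the running total, accumulated from the
    longest sequence downwards, first reaches `fraction` of the total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total = 54
    print("N50:", nx(toy_lengths))        # 10 + 9 + 8 = 27 reaches half of 54
    print("N90:", nx(toy_lengths, 0.9))
```

The same function gives N90 when called with `fraction=0.9`, which is how the two columns in the assembly table relate to each other.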
A total of 1,160,104 microsatellites with 1-6 bp nucleotide motifs were detected in 770,294 scaffolds, 234,959 of which contained more than one SSR locus, giving an occurrence frequency (total number of SSRs detected/total number of scaffolds) of 28.57%. The distribution density (total length of scaffolds/total number of SSRs screened) averaged one SSR per 1.35 kb and the relative abundance (total number of SSRs screened/total length of scaffolds) was 743/Mb.
These SSR loci can be classified into six repeat types: mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide. The most abundant repeat type was dinucleotide, accounting for 51.05% of all SSR loci, followed by mononucleotide (37.69%), trinucleotide (8.08%), tetranucleotide (2.71%) and pentanucleotide (0.25%), while hexanucleotide was the least abundant (0.21%) (Fig.
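The three summary statistics defined in this paragraph are simple ratios, and the sketch below reproduces them from the totals reported in this study (1,160,104 SSRs, 4,060,742 scaffolds, 1,561,530,495 bp). The function name `ssr_summary` and the dictionary keys are hypothetical and used only for illustration.

```python
def ssr_summary(n_ssrs, n_sequences, total_length_bp):
    """Compute occurrence frequency, density and relative abundance of SSRs."""
    return {
        # SSRs detected per 100 assembled sequences
        "occurrence_frequency_pct": 100.0 * n_ssrs / n_sequences,
        # average sequence length (kb) per SSR
        "kb_per_ssr": total_length_bp / n_ssrs / 1000.0,
        # SSRs per megabase of assembled sequence
        "ssrs_per_mb": n_ssrs / (total_length_bp / 1e6),
    }


if __name__ == "__main__":
    stats = ssr_summary(1_160_104, 4_060_742, 1_561_530_495)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

With these inputs the ratios round to 28.57%, 1.35 kb per SSR and 742.93 SSRs per Mb, in line with the values given in the text.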
| Repeat type | Number | Occurrence frequency (%) | Relative abundance (per Mb) | Average length (bp) | Total length (bp) |
|---|---|---|---|---|---|
| Mononucleotide | 437,234 | 10.77 | 280.00 | 0.87 | 379,455 |
| Dinucleotide | 592,234 | 14.58 | 379.27 | 0.50 | 298,350 |
| Trinucleotide | 93,734 | 2.31 | 60.03 | 0.72 | 67,533 |
| Tetranucleotide | 31,481 | 0.78 | 20.16 | 0.72 | 22,680 |
| Pentanucleotide | 2,936 | 0.07 | 1.88 | 0.82 | 2,409 |
| Hexanucleotide | 2,485 | 0.06 | 1.59 | 0.71 | 1,774 |
| Total | 1,160,104 | 28.57 | 742.93 | 4.35 | 772,201 |
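The share of each repeat type quoted in the text follows directly from the per-type counts in the table above. A minimal Python sketch of that arithmetic is given here; the names `TYPE_COUNTS` and `type_shares` are hypothetical, while the counts themselves are copied from the table.

```python
# Number of SSR loci per repeat type, as reported in the table above.
TYPE_COUNTS = {
    "mononucleotide": 437_234,
    "dinucleotide": 592_234,
    "trinucleotide": 93_734,
    "tetranucleotide": 31_481,
    "pentanucleotide": 2_936,
    "hexanucleotide": 2_485,
}


def type_shares(counts):
    """Return the percentage of all SSR loci contributed by each repeat type."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    shares = type_shares(TYPE_COUNTS)
    # Print the types from most to least abundant.
    for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name}: {pct:.2f}%")
```

Sorting by share gives the order dinucleotide, mononucleotide, trinucleotide, tetranucleotide, pentanucleotide, hexanucleotide.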
The number of repeats of the SSR loci mainly ranged from 5 to 24. The most common repeat number was 10, accounting for 17.52% of all SSR loci. In general, the number of loci of each repeat type decreased as the repeat number increased (Fig.
Distribution of repeat numbers amongst the different microsatellite repeat types in D. anguillare.
| Repeat number | Mononucleotide | Dinucleotide | Trinucleotide | Tetranucleotide | Pentanucleotide | Hexanucleotide | Total | Proportion (%) |
|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0 | 32,413 | 17,143 | 2,071 | 972 | 52,599 | 4.53 |
| 6 | 0 | 162,916 | 18,834 | 7,300 | 498 | 394 | 189,942 | 16.37 |
| 7 | 0 | 101,359 | 13,353 | 3,184 | 176 | 270 | 118,342 | 10.20 |
| 8 | 0 | 72,287 | 9,325 | 1,460 | 94 | 838 | 84,004 | 7.24 |
| 9 | 0 | 57,111 | 6,070 | 895 | 46 | 11 | 64,133 | 5.53 |
| 10 | 152,127 | 46,524 | 3,962 | 594 | 51 | 0 | 203,258 | 17.52 |
| 11 | 90,414 | 38,631 | 2,619 | 422 | 0 | 0 | 132,086 | 11.39 |
| 12 | 58,458 | 30,798 | 1,896 | 430 | 0 | 0 | 91,582 | 7.89 |
| 13 | 40,161 | 23,987 | 1,488 | 53 | 0 | 0 | 65,689 | 5.66 |
| 14 | 27,543 | 17,686 | 1,203 | 0 | 0 | 0 | 46,432 | 4.00 |
| 15 | 18,717 | 12,237 | 1,451 | 0 | 0 | 0 | 32,405 | 2.79 |
| 16 | 13,469 | 8,578 | 1,069 | 0 | 0 | 0 | 23,116 | 1.99 |
| 17 | 9,925 | 5,919 | 51 | 0 | 0 | 0 | 15,895 | 1.37 |
| 18 | 7,271 | 3,962 | 0 | 0 | 0 | 0 | 11,233 | 0.97 |
| 19 | 5,276 | 2,745 | 0 | 0 | 0 | 0 | 8,021 | 0.69 |
| 20 | 3,800 | 2,077 | 0 | 0 | 0 | 0 | 5,877 | 0.51 |
| 21 | 2,605 | 1,480 | 0 | 0 | 0 | 0 | 4,085 | 0.35 |
| 22 | 1,889 | 1,344 | 0 | 0 | 0 | 0 | 3,233 | 0.28 |
| 23 | 1,297 | 1,670 | 0 | 0 | 0 | 0 | 2,967 | 0.26 |
| 24 | 853 | 878 | 0 | 0 | 0 | 0 | 1,731 | 0.15 |
| 25 | 697 | 45 | 0 | 0 | 0 | 0 | 742 | 0.06 |
| >25 | 2,712 | 0 | 0 | 0 | 0 | 0 | 2,712 | 0.23 |
In summary, the repeat numbers of the SSR loci were mainly concentrated in the ranges 10-15 and 5-8, which together accounted for 1,016,359 loci (87.61%). Few SSR loci with more than 25 repeats were identified, and these consisted only of mononucleotide repeats.
Amongst the 1,488 repeat motifs detected, hexanucleotide repeats had the most motif types, followed by pentanucleotide repeats. Mononucleotide repeats had the fewest motif types, being limited by the number of bases (Table
| Repeat type | Number of types | Maximum: repeat motif | Maximum: number | Maximum: proportion (%) | Minimum: repeat motif | Minimum: number | Minimum: proportion (%) |
|---|---|---|---|---|---|---|---|
| Mononucleotide | 4 | A | 191,390 | 43.77 | G | 43,065 | 9.85 |
| Dinucleotide | 12 | CA | 150,240 | 25.37 | GC | 604 | 0.1 |
| Trinucleotide | 59 | AAT | 13,168 | 14.05 | ACG | 26 | 0.03 |
| Tetranucleotide | 232 | CACG | 2,649 | 8.41 | ACCC/ACTT/AGGT/CCAC/CCGA/CGAT/TACG/TGGG | 1 | 0.00 |
| Pentanucleotide | 574 | TAATG | 110 | 19.16 | - | 1 | 0.17 |
| Hexanucleotide | 608 | CCCTAA | 190 | 7.65 | - | 1 | 0.04 |
SSR sequence length varied considerably amongst the different repeat types, from 10 to 54 bp (Fig.
The length of the microsatellite was one of the main factors affecting its polymorphism.
Bioinformatics software was used to search for and analyse the types and numbers of the six classes of perfect microsatellites in the genome of Dysomma anguillare. A total of 1,160,104 microsatellite loci were found across the 1.56 Gb genome sequence, with a total length of 24,707,980 bp (1.58% of the full genome length). Compared with other published genomes of bony fishes, this proportion was higher than that of Takifugu rubripes (0.77%) (
Relative abundance is an important measure of microsatellite richness. It was calculated to be 743/Mb in Dysomma anguillare, which was much higher than that of other marine fishes, such as Scatophagus argus (653/Mb) (
Microsatellites consisting of 1-6 nucleotide repeat units were found in the genome of Dysomma anguillare. Dinucleotide repeats were the most frequent, followed by mononucleotide repeats, while SSRs with 3-6 nucleotide repeat units each accounted for no more than 10%. Therefore, priority should be given to dinucleotide repeats when designing SSR primers for D. anguillare. Mononucleotide and dinucleotide repeats are regarded as the most abundant types of SSRs in most species. It has been reported that mononucleotide repeats tend to dominate in the genomes of higher organisms (
The CA repeat motif was the most abundant amongst dinucleotide repeats and occupied 25.37% of them, which was consistent with Scophthalmus maximus (
The structural instability and composition of trinucleotide repeats were closely related to some genetic diseases in humans (
The repeat unit length was in inverse proportion to the copy number of microsatellite DNA (
The length of microsatellites in the Dysomma anguillare genome was generally 10-18 bp and the number of microsatellites was inversely related to the repeat motif length. An analysis of microsatellite structure and characteristics in the parthenogenetic gastropod Melanoides tuberculata concluded that the longer the repeat sequence, the greater the selection pressure it undergoes and the lower the number of repeats (
In conclusion, MISA software was used for the first time to search for and analyse six types of perfect microsatellite loci in whole-genome survey data of Dysomma anguillare. The results showed that both the relative abundance and the density of the various microsatellite types were very high. Amongst the 1,160,104 SSR loci, the numbers of the different repeat types followed the order: dinucleotide > mononucleotide > trinucleotide > tetranucleotide > pentanucleotide > hexanucleotide. Their dominant repeat motifs were A, CA, AAT, CACG, TAATG and CCCTAA, respectively. The results supplement the genetic marker database of marine fishes and provide a valuable information resource for further genetic analysis of D. anguillare.
We are grateful to the National Innovation and Entrepreneurship Training Program for College Students (202210340023); Science and Technology Innovation Project of College Students in Zhejiang Province (2023R411006); Science and Technology Planning Project of Zhoushan (2022C41022); and Fund of Guangdong Provincial Key Laboratory of Fishery Ecology and Environment (FEEL-2021-7).
Tianyan Yang conceived and designed the study. Shufei Zhang collected and provided samples. Yuping Liu performed the DNA extraction and bioinformatics analysis. Ziyan Zhu wrote and edited the manuscript. All authors contributed to the preparation of the manuscript.