Biodiversity Data Journal :
Research Article
Corresponding author: Dinh Duy Vu (duydinhvu87@gmail.com), Cui Bei (cuibei@nwsuaf.edu.cn)
Academic editor: Elton John de Lirio
Received: 19 Mar 2024 | Accepted: 30 May 2024 | Published: 17 Jun 2024
© 2024 Mai-Phuong Pham, Dinh Duy Vu, Cui Bei, Thi Tuyet Xuan Bui, Dinh Giap Vu, Syed Noor Muhammad Shah
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Pham M-P, Vu DD, Bei C, Bui TTX, Vu DG, Shah SNM (2024) Characterisation of the Cinnamomum parthenoxylon (Jack) Meisn (Lauraceae) transcriptome using Illumina paired-end sequencing and EST-SSR markers development for population genetics. Biodiversity Data Journal 12: e123405. https://doi.org/10.3897/BDJ.12.e123405
Cinnamomum parthenoxylon is an endemic and endangered species with significant economic and ecological value in Vietnam. A better understanding of the genetic architecture of the species will be useful when planning management and conservation. We aimed to characterize the transcriptome of C. parthenoxylon, develop novel molecular markers, and assess the genetic variability of the species. First, transcriptome sequencing of root, leaf, and stem tissues from five C. parthenoxylon trees was performed for functional annotation and the development of novel molecular markers. The transcriptomes were sequenced on an Illumina HiSeq™ 4000 system. A total of 27,363,199 reads were generated, and de novo assembly yielded 160,435 unigenes (average length = 548.95 bp). A total of 51,691 unigenes were annotated against the COG, GO, KEGG, KOG, Pfam, Swiss-Prot, and NR databases. Furthermore, 12,849 EST-SSRs were identified. Of 134 primer pairs, 54 were randomly selected for testing, and 15 amplified successfully across nine populations of C. parthenoxylon. We uncovered medium levels of genetic diversity (PIC = 0.52, Na = 3.29, Ne = 2.18, P = 94.07%, Ho = 0.56 and He = 0.47) within the studied populations. Genetic differentiation among populations was low (Fst = 0.06), with an estimated gene flow of Nm = 2.16. Analysis of molecular variance (AMOVA) revealed higher genetic variation within populations (90%) than among populations (10%). A reduction in population size was detected in the VP population using BOTTLENECK. The structure analysis suggested two optimal genetic clusters related to gene flow among the populations, whereas the UPGMA approach and DAPC divided the nine populations into three main clusters.
Our findings reveal a significant fraction of the transcriptome sequences, and the newly developed EST-SSR markers are an efficient tool for germplasm evaluation, genetic diversity analysis and molecular marker-assisted selection in C. parthenoxylon. This study provides comprehensive genetic resources for the breeding and conservation of different varieties of C. parthenoxylon.
endangered species, genetic diversity, genetic structure, Illumina HiSeq™ 4000, species conservation, SSR markers
Cinnamomum parthenoxylon (Jack) Meisn. (Lauraceae) is an evergreen, broad-leaved, and diploid (2n = 24) tree species restricted to Vietnam, India, and China (
The past two decades have witnessed extensive applications of molecular techniques to analyze genetic diversity within and among populations of threatened species of Lauraceae, for instance, RAPD and SRAP markers to assess genetic diversity in Sri Lankan Cinnamomum species (
Molecular markers are used to assess the genetic diversity and genetic structure of rare and threatened species populations. Microsatellites (SSRs; simple sequence repeats) are highly polymorphic, widely distributed across genomes and highly reproducible, which make them a powerful means to characterise the genetic make-up of plant populations. SSRs developed from expressed sequence tags (EST-SSRs) are gene-based markers (
To identify the characteristics of the comprehensive transcriptome of C. parthenoxylon and to develop a large number of expressed sequence tag-SSR (EST-SSR) markers, root, leaf, and stem tissue (Fig.
Sampling locations and individuals of C. parthenoxylon used for this study in Vietnam. Map showing the collection sites (a), adult plant (b), leaves, stem, roots (c) and sampling plant (d). Different symbols in (a) show genetic clustering into two clusters, as revealed in population structure analyses based on microsatellite data.
To quantify genetic variation within and among the populations, young leaves were sampled from a total of 179 trees in nine wild populations of C. parthenoxylon (Fig.
Sampling locations and genetic diversity within C. parthenoxylon populations at 15 SSR loci

| Population code | Information | Longitude (N) / Latitude (E) | N | Na | Ne | P% | Ho | He | Fis | FisIIM | Bottleneck P value (TPM) | Bottleneck P value (SMM) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DL | Kon Chu Rang National Reserve, Gia Lai Province | | 40 | 4.13 | 2.91 | 93.33 | 0.49 | 0.58 | 0.14 | 0.027 | ns | ns |
| GL | Chu Yang Sin National Park, Dak Lak Province | | 12 | 4.40 | 2.71 | 100.00 | 0.54 | 0.55 | 0.09 | 0.010 | 0.01 | ns |
| TH | Xuan Lien National Reserve, Thanh Hoa Province | | 20 | 3.93 | 2.40 | 100.00 | 0.53 | 0.54 | 0.01 | 0.018 | ns | ns |
| QN | Tay Giang District, Quang Nam Province | | 20 | 3.80 | 2.21 | 100.00 | 0.50 | 0.49 | -0.03 | 0.017 | ns | ns |
| VP | Tam Dao National Park, Tam Dao Province | | 19 | 2.40 | 1.85 | 80.00 | 0.71 | 0.40 | -0.71 | 0.022 | 0.02 | 0.05 |
| HB | Pa Co National Reserve, Hoa Binh Province | | 8 | 2.40 | 1.84 | 86.67 | 0.39 | 0.39 | 0.00 | 0.038 | ns | ns |
| PY | Song Hinh District, Phu Yen Province | | 22 | 2.67 | 1.85 | 100.00 | 0.74 | 0.41 | -0.64 | 0.015 | ns | ns |
| YT | Yen Tu National Forests, Quang Ninh Province | | 19 | 2.73 | 1.84 | 86.67 | 0.49 | 0.39 | -0.19 | 0.018 | ns | ns |
| PT | Xuan Son National Park, Phu Tho Province | | 19 | 3.13 | 2.00 | 100.00 | 0.69 | 0.45 | -0.40 | 0.019 | ns | ns |
| Species level | | | | 3.29 | 2.18 | 94.07 | 0.56 | 0.47 | -0.19 | | | |

Note: N, sample size (number of individuals); Na, mean number of alleles per locus; Ne, mean number of effective alleles; P%, percentage of polymorphic loci; Ho and He, mean observed and expected heterozygosities, respectively; Fis, inbreeding coefficient; FisIIM, inbreeding coefficient corrected for null alleles; *P < 0.05, **P < 0.01, ***P < 0.001; ns, not significant; TPM, two-phase mutation model; SMM, stepwise mutation model.
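The quantities defined in the table note (Na, Ne, Ho, He, Fis) can be illustrated for a single locus. The following is a minimal sketch, assuming diploid genotypes coded as pairs of allele sizes; the function name `locus_stats` and the toy genotypes are hypothetical, and the formulas are the standard textbook definitions without small-sample correction, not necessarily the exact estimators implemented in the software used for this study.

```python
from collections import Counter

def locus_stats(genotypes):
    """Return Na, Ne, Ho, He and Fis for one locus.

    genotypes: list of (allele1, allele2) tuples, one per diploid individual.
    """
    n = len(genotypes)
    # Count every allele copy (two per individual).
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = [c / (2 * n) for c in allele_counts.values()]

    homozygosity = sum(p * p for p in freqs)        # expected homozygosity
    na = len(allele_counts)                          # number of alleles
    ne = 1.0 / homozygosity                          # effective number of alleles
    he = 1.0 - homozygosity                          # expected heterozygosity
    ho = sum(1 for a, b in genotypes if a != b) / n  # observed heterozygosity
    fis = 1.0 - ho / he if he > 0 else float("nan")  # inbreeding coefficient
    return {"Na": na, "Ne": ne, "Ho": ho, "He": he, "Fis": fis}

# Hypothetical genotypes (allele sizes in bp) for four individuals.
toy = [(150, 154), (150, 154), (150, 150), (154, 158)]
stats = locus_stats(toy)
print(stats)
```

In this toy example Ho exceeds He, so Fis is negative, which is the same pattern (heterozygote excess) seen in the populations of the table where Ho is larger than He.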
Total RNA was extracted from the samples using a plant RNA kit (Omega Bio-tek Inc.) following the manufacturer’s instructions and treated with DNase I after extraction. RNA integrity was evaluated by electrophoresis on a 1.2% agarose gel and with an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA), and purity was assessed using a NanoDrop ND-2000 spectrophotometer (NanoDrop Technologies, USA). mRNA was purified using oligo(dT) magnetic beads. The cDNA libraries were constructed following the manufacturer’s instructions. The Illumina HiSeq™ 4000 platform was used to sequence the transcriptome of C. parthenoxylon (Beijing Nuoheyuan Technology Co., Ltd.).
The quality of the raw reads was checked by Trimmomatic v3.0 (
To determine predicted functions, all unigenes were used and compared with the NCBI non-redundant (NR) protein sequences (
The DNA sequence of each nonredundant gene was obtained by sequencing. All unique sequences were used to determine the composition, frequency, and distribution of SSRs in MISA v1.0. For the search criteria of MISA, the minimum number of repetitions of each corresponding unit size was defined as follows: 1~10, 2~6, 3~5, 4~5, 5~5, and 6~5 (where, for example, 1~10 refers to a single nucleotide as the repeat unit with at least 10 repeats and 2~6 refers to a dinucleotide as the repeat unit with at least 6 repeats). ESTs containing SSRs were deposited in GenBank. SSR primers were designed using Primer v5.0 software. The major parameters for primer design were as follows: primer length (18-24 bp), with an optimum of 20 bp, PCR product sizes of 100-300 bp, annealing temperature between 55 and 65°C with an optimum of 60°C and a GC content of 40-65%, with 50% being optimal. The SSR primer pairs were synthesized by Breeding Biotechnologies Co., Ltd.
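The minimum-repeat criteria described above (1~10, 2~6, 3~5, 4~5, 5~5, 6~5) can be sketched as a simple regular-expression search. This is an illustrative approximation only, not the MISA implementation: it finds perfect repeats, does not handle compound SSRs, and may report a shifted motif in some overlapping edge cases. The names `MIN_REPEATS`, `is_primitive` and `find_ssrs`, and the example sequence, are hypothetical.

```python
import re

# Minimum number of repeats per motif length, as listed in the search criteria.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def is_primitive(motif):
    """False if the motif is itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    return motif not in (motif + motif)[1:-1]

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, repeats))
    return sorted(hits)

# Hypothetical sequence with a mono-, a di- and a trinucleotide repeat.
example = "GGC" + "A" * 10 + "CCGT" + "AG" * 6 + "TTG" + "ATC" * 5 + "G"
ssrs = find_ssrs(example)
print(ssrs)
```

Filtering out non-primitive motifs prevents a mononucleotide run from being reported again as a di- or trinucleotide repeat.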
Total genomic DNA was extracted from fresh leaves using a DNA extraction kit (Norgenbiotek, Canada). An MM 400 mixer (Retsch) was used to crush the samples in liquid nitrogen. The total DNA purity and integrity were measured by a fluorimeter and then the DNA was diluted to 10 ng µL⁻¹. A total of 54 primer pairs were randomly selected among 134 SSR primer pairs for PCR identification and 15 were successfully amplified for assessing the nine wild populations of C. parthenoxylon (Suppl. material
Null alleles and other genotyping errors were detected using MICRO-CHECKER v.2.0 software (
The FST (
Transcriptome sequencing of Cinnamomum parthenoxylon produced 27,363,199 high-quality, clean paired-end reads with a total output of 8.17 Gb, from which 160,435 unigenes (235 Mb) were assembled. The GC content was 46.2%, the Q20 score was 99.15% with 100% cycles and the Q30 score was 97.05%. Trinity produced 3,274,271 contigs with a mean length of 78.66 bp and an N50 of 110 bp. In total, 3,167,763 (96.75%) contigs were between 200 and 300 bp in length; 53,394 (1.63%) ranged from 301-500 bp; 32,240 (0.98%) ranged from 501-1000 bp; 14,348 (0.44%) ranged from 1001-2000 bp; and 6,526 (0.2%) contigs were longer than 2000 bp (Suppl. material
| Length range (bp) | Unigene | Contigs | Transcripts |
|---|---|---|---|
| 200-300 | 71,571 (44.61%) | 3,167,763 (96.75%) | 81,432 (30.96%) |
| 300-500 | 44,960 (28.02%) | 53,394 (1.63%) | 58,136 (22.11%) |
| 500-1000 | 25,658 (15.99%) | 32,240 (0.98%) | 47,630 (18.11%) |
| 1000-2000 | 11,776 (7.34%) | 14,348 (0.44%) | 43,206 (16.43%) |
| > 2000 | 6,470 (4.03%) | 6,526 (0.2%) | 32,595 (12.39%) |
| Total Number | 160,435 | 3,274,271 | 262,999 |
| Total Length | 88,071,463 | 257,547,156 | 237,355,325 |
| N50 Length | 711 | 110 | 1,682 |
| Mean Length | 548.95 | 78.66 | 902.49 |
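For readers unfamiliar with the assembly summary statistics reported above, the following minimal sketch shows how N50 and mean length are computed from a list of sequence lengths. The function name `n50` and the toy lengths are hypothetical; only the unigene total length and count in the last lines are taken from the table.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical sequence lengths (bp).
toy_lengths = [200, 250, 300, 450, 700, 1100]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
print(toy_n50, toy_mean)

# Mean unigene length recomputed from the table: total length / total number.
unigene_mean = round(88_071_463 / 160_435, 2)
print(unigene_mean)
```

The recomputed mean agrees with the reported mean unigene length of 548.95 bp.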
A total of 51,691 sequences among the blasted unigenes had matches in the public databases, i.e. COG, GO, KEGG, KOG, Pfam, Swiss-Prot and NR (Table
| Annotated database | Annotated_No. | Percentage (%) | 300–1000 (bp) | ≥ 1000 (bp) |
|---|---|---|---|---|
| COG | 14,629 | 9.12 | 4,230 | 5,119 |
| GO | 25,352 | 15.80 | 8,776 | 7,345 |
| KEGG | 10,701 | 6.67 | 3,553 | 3,250 |
| KOG | 26,823 | 16.72 | 9,584 | 8,365 |
| Pfam | 29,514 | 18.39 | 10,044 | 11,310 |
| Swissprot | 25,202 | 15.71 | 9,431 | 9,098 |
| Nr | 49,383 | 30.78 | 18,276 | 13,831 |
| All | 50,691 | 31.59 | 18,704 | 13,879 |
A total of 25,352 (15.8%) unigenes were assigned GO annotations and could be divided into three ontologies (biological process, molecular function and cellular component) with 51 subcategories (Fig.
Gene Ontology (GO) classifications of unigenes of C. parthenoxylon. A total of 25,352 unigenes were categorized into three main categories: biological process, cellular component and molecular function. The x-axis indicates the subgroups in GO annotation, while the y-axis indicates the percentage of specific categories of genes in each main category.
For the development of new molecular markers, the 160,435 assembled unigenes were used to explore the potential microsatellites, which were defined as di- to hexanucleotide motifs. A total of 12,849 potential EST-SSRs were identified by the SSRIT tool. A total of 8,687 and 2,917 sequences had one and more than one microsatellite locus, respectively, while the EST-SSR frequency was 7.5% and the distribution density of one EST-SSR was 4.16 kilobases (kb) among the unigenes. The most common repeat motif was mononucleotide (7,183; 55.90%), followed by dinucleotide (3,331; 25.92%), trinucleotide (2,186; 17.01%), tetranucleotide (115; 0.9%), hexanucleotide (17; 0.13%) and pentanucleotide (17; 0.13%) repeats (Table
| Number of repeat units | Mono- | Di- | Tri- | Tetra- | Penta- | Hexa- | Total | Percentage (%) |
|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0 | 1285 | 97 | 14 | 8 | 1404 | 10.93 |
| 6 | 0 | 871 | 609 | 18 | 3 | 7 | 1508 | 11.74 |
| 7 | 0 | 589 | 272 | 0 | 0 | 1 | 862 | 6.71 |
| 8 | 0 | 589 | 19 | 0 | 0 | 1 | 609 | 4.74 |
| 9 | 0 | 699 | 1 | 0 | 0 | 0 | 700 | 5.45 |
| 10 | 2041 | 455 | 0 | 0 | 0 | 0 | 2496 | 19.43 |
| >10 | 5142 | 128 | 0 | 0 | 0 | 0 | 5270 | 41.01 |
| Total | 7183 | 3331 | 2186 | 115 | 17 | 17 | 12849 | 100 |
| Percentage (%) | 55.90 | 25.92 | 17.01 | 0.90 | 0.13 | 0.13 | 100 | |
A total of 54 primer pairs were randomly selected among 134 SSR primer pairs for PCR identification and 15 were successfully amplified for the study of the wild population of C. parthenoxylon at nine different locations (Suppl. material
| Primers | Na | Ne | Null allele | PIC | Ho | He | Fis | Fit | Fst | Nm | P (HWE) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VDD01 | 3.78 | 2.81 | No | 0.75 | 0.78 | 0.62 | -0.29 | 0.03 | 0.22 | 0.89 | ND |
| VDD02 | 3.78 | 2.50 | No | 0.77 | 0.72 | 0.59 | -0.21 | -0.16 | 0.04 | 5.68 | *** |
| VDD03 | 2.56 | 1.93 | No | 0.56 | 0.65 | 0.47 | -0.37 | -0.29 | 0.08 | 3.06 | *** |
| VDD04 | 3.56 | 2.36 | No | 0.42 | 0.66 | 0.57 | -0.14 | 0.03 | 0.16 | 1.34 | *** |
| VDD05 | 3.33 | 1.98 | No | 0.61 | 0.61 | 0.47 | -0.28 | -0.16 | 0.11 | 2.07 | *** |
| VDD06 | 4.89 | 2.93 | No | 0.46 | 0.77 | 0.63 | -0.27 | -0.09 | 0.11 | 2.00 | NS |
| VDD07 | 2.44 | 1.38 | 0.18 | 0.67 | 0.17 | 0.23 | 0.20 | 0.49 | 0.28 | 0.64 | *** |
| VDD08 | 3.33 | 2.05 | 0.12 | 0.30 | 0.32 | 0.36 | 0.11 | 0.30 | 0.21 | 0.95 | *** |
| VDD09 | 3.11 | 2.16 | No | 0.46 | 0.65 | 0.53 | -0.24 | -0.10 | 0.11 | 2.12 | *** |
| VDD10 | 2.89 | 1.94 | No | 0.52 | 0.63 | 0.46 | -0.35 | -0.22 | 0.11 | 1.95 | ** |
| VDD11 | 2.00 | 1.24 | 0.17 | 0.43 | 0.08 | 0.14 | 0.35 | 0.50 | 0.17 | 1.22 | NS |
| VDD12 | 4.56 | 3.09 | No | 0.18 | 0.86 | 0.64 | -0.37 | -0.23 | 0.07 | 3.25 | ND |
| VDD13 | 4.44 | 3.11 | No | 0.61 | 0.79 | 0.66 | -0.24 | -0.08 | 0.11 | 2.08 | *** |
| VDD14 | 2.33 | 1.43 | 0.09 | 0.71 | 0.18 | 0.21 | 0.00 | 0.31 | 0.22 | 0.90 | *** |
| VDD15 | 2.33 | 1.75 | No | 0.27 | 0.59 | 0.42 | -0.38 | -0.34 | 0.06 | 4.23 | ND |

Note: Na, number of alleles per locus; Ne, mean number of effective alleles; Null allele, average null allele frequency; PIC, polymorphism information content; Ho, observed heterozygosity; He, expected heterozygosity; Fis, fixation index; Fit, coefficient of total inbreeding; Fst, genetic differentiation index of Weir and Cockerham (1984); Nm, gene flow; NS, not significant; ND, not determined; *P < 0.05, **P < 0.01, ***P < 0.001.
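Two of the per-locus quantities in the table, Nm and PIC, can be illustrated with a short sketch. It assumes that gene flow is derived from Fst with Wright's island-model approximation, Nm = (1 - Fst) / (4 Fst), and that PIC follows the commonly used definition of Botstein et al. (1980); the text here does not state which formulas the authors applied, so both are assumptions, and the function names `nm_from_fst` and `pic` are hypothetical.

```python
from itertools import combinations

def nm_from_fst(fst):
    """Gene flow under Wright's island model: Nm = (1 - Fst) / (4 * Fst)."""
    if not 0 < fst < 1:
        raise ValueError("Fst must lie strictly between 0 and 1")
    return (1.0 - fst) / (4.0 * fst)

def pic(freqs):
    """Polymorphism information content from allele frequencies summing to 1."""
    homozygosity = sum(p * p for p in freqs)
    cross_terms = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_terms

# Fst values reported for loci VDD01 and VDD07.
nm_vdd01 = round(nm_from_fst(0.22), 2)
nm_vdd07 = round(nm_from_fst(0.28), 2)
print(nm_vdd01, nm_vdd07)

# Hypothetical locus with two equally frequent alleles.
pic_two_alleles = pic([0.5, 0.5])
print(pic_two_alleles)
```

Under this assumption the tabulated Nm values for VDD01 and VDD07 are reproduced to two decimals; for other loci the result from the rounded Fst can differ slightly from the tabulated Nm, presumably because the table values were computed from unrounded Fst.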
Genetic diversity was recorded at the population level (Table
Analysis of molecular variance (AMOVA) was performed based on 1000 permutations and showed that the molecular variation was attributable to differentiation within the populations of C. parthenoxylon (Table
Analysis of molecular variance in C. parthenoxylon from nine populations.
| Source of variation | Degree of freedom | Sum of squares | Variance components | Total variation (%) | P value |
|---|---|---|---|---|---|
| Amongst populations | 8 | 177.142 | 0.493 | 10 | 0.001 |
| Amongst individuals within populations | 170 | 504.179 | 0.000 | 0 | |
| Within individuals | 179 | 774.500 | 4.327 | 90 | |
| Total | 357 | 1455.821 | 4.820 | 100 | |
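The percentages of total variation in the AMOVA table follow directly from the variance components. As a minimal sketch (the dictionary keys and variable names are illustrative, and the values are copied from the table):

```python
# Variance components from the AMOVA table.
components = {
    "amongst populations": 0.493,
    "amongst individuals within populations": 0.000,
    "within individuals": 4.327,
}

total_variance = sum(components.values())

# Share of the total variance attributed to each hierarchical level.
percent = {
    source: round(100 * value / total_variance)
    for source, value in components.items()
}
print(total_variance, percent)
```

The among-population share (about 10%) and the within-individual share (about 90%) match the rounded values reported in the table.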
Pairwise Fst between populations (below the diagonal) and significance of each comparison (above the diagonal; +, P < 0.05).

| | GL | DL | TH | QN | VP | HB | PY | YT | PT |
|---|---|---|---|---|---|---|---|---|---|
| GL | - | + | + | + | + | + | + | + | + |
| DL | 0.07 | - | + | + | + | + | + | + | + |
| TH | 0.09 | 0.05 | - | + | + | + | + | + | + |
| QN | 0.09 | 0.06 | 0.04 | - | + | + | + | + | + |
| VP | 0.14 | 0.10 | 0.07 | 0.07 | - | + | + | + | + |
| HB | 0.13 | 0.05 | 0.05 | 0.06 | 0.08 | - | + | + | + |
| PY | 0.13 | 0.11 | 0.08 | 0.05 | 0.03 | 0.09 | - | + | + |
| YT | 0.14 | 0.08 | 0.07 | 0.06 | 0.03 | 0.06 | 0.04 | - | + |
| PT | 0.12 | 0.08 | 0.05 | 0.05 | 0.01 | 0.06 | 0.03 | 0.03 | - |
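To make the pairwise matrix easier to query, the lower-triangle Fst values can be loaded into a small lookup structure. The sketch below is illustrative only: the variable names are hypothetical, the values are copied from the matrix above, and the unweighted mean it computes is a simple summary of the 36 pairwise values, not the global Fst estimate reported by the authors.

```python
# Population codes in the order used by the matrix.
POPS = ["GL", "DL", "TH", "QN", "VP", "HB", "PY", "YT", "PT"]

# Lower triangle of the pairwise Fst matrix, one row per population.
LOWER = [
    [],
    [0.07],
    [0.09, 0.05],
    [0.09, 0.06, 0.04],
    [0.14, 0.10, 0.07, 0.07],
    [0.13, 0.05, 0.05, 0.06, 0.08],
    [0.13, 0.11, 0.08, 0.05, 0.03, 0.09],
    [0.14, 0.08, 0.07, 0.06, 0.03, 0.06, 0.04],
    [0.12, 0.08, 0.05, 0.05, 0.01, 0.06, 0.03, 0.03],
]

# Map each pair (column population, row population) to its Fst value.
pairs = {
    (POPS[j], POPS[i]): value
    for i, row in enumerate(LOWER)
    for j, value in enumerate(row)
}

highest = max(pairs.values())
lowest = min(pairs.values())
most_differentiated = sorted(p for p, v in pairs.items() if v == highest)
least_differentiated = sorted(p for p, v in pairs.items() if v == lowest)
mean_pairwise = sum(pairs.values()) / len(pairs)

print(most_differentiated, least_differentiated, round(mean_pairwise, 2))
```

In this matrix the largest values all involve GL, while the smallest value is between VP and PT.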
Without any prior information, discriminant analysis of principal components (DAPC) also uncovered three genetic groupings for C. parthenoxylon (Fig.
The Bayesian analysis of individual assignments, based on the likelihoods, showed that the highest ∆K value (405.5) for 179 individuals was associated with K = 2 as the optimum number of genetic groups and showed that all individuals exhibited admixture among the three groups (Fig.
The distinct genetic groups of the populations were identified by the unweighted pair group method average cluster analysis (UPGMA) method based on Nei’s distance by using POPTREE2 (Fig.
In the present study, the first transcriptomic analysis of C. parthenoxylon was performed via Illumina HiSeq™ 4000 sequencing technology. Transcriptome sequencing of C. parthenoxylon provided a valuable resource for the development of SSR markers to study the genetic diversity, molecular marker-assisted breeding and evolution of the Lauraceae family. After the uni-transcripts were assembled, the GC content (46.2%) of C. parthenoxylon was found to be lower than that of Arundo donax L. [49%;
To uncover the prevailing genetic diversity and evolution patterns of C. parthenoxylon, databases such as COG, GO, KEGG, KOG, Pfam, Swiss-Prot and NR were utilized to investigate matching sequences. Cinnamomum parthenoxylon unigenes were annotated among GO categories. The metabolic process term in the biological process category and the cell term in the cellular component category were the largest groups in this study, indicating the importance of cellular and metabolic activities. These results are similar to those of previous studies on Panicum miliaceum L. (
Illumina sequencing technology is an efficient method for obtaining large amounts of transcriptome data for the identification of novel genes and the development of molecular markers. The current study suggested that the transcriptome sequencing of C. parthenoxylon provided a valuable resource for the development of EST-SSR markers to study genetic diversity, molecular marker-assisted breeding and evolution in the Lauraceae family. A total of 160,435 transcriptome unigenes were obtained using the Illumina HiSeq™ 4000 platform for C. parthenoxylon, with an N50 of 711 bp and a mean length of 548.95 bp. Cinnamomum parthenoxylon has shorter unigenes than Cinnamomum longepaniculatum (Gamble) N.Chao ex H.W.Li (N50 = 1387 bp; average length = 879.43 bp) (
In the present study, the mean PIC value was 0.52 for 15 polymorphic EST-SSR markers, showing a high level of informativeness (
Differences in the genetic structure of populations are very important for exploring genetic diversity and Fst is an effective metric with which to study genetic differentiation and gene flow among populations (
In this study, for the first time, Illumina HiSeq™ 4000 sequencing technology was applied for transcriptomic analysis of C. parthenoxylon in Vietnam. A total of 12,849 EST-SSRs were recorded and 134 SSR loci were deposited in GenBank (OR536813-OR536946). The present study shows that C. parthenoxylon currently maintains medium levels of genetic diversity and shows low genetic differentiation among populations. The EST-SSRs generated in this study for C. parthenoxylon will aid in further exploration of allopatric speciation, adaptive divergence in genetic and lineage diversity and the migration history of C. parthenoxylon. Our study may also contribute to taxonomic studies as well as evolutionary research. The current study also provides a platform for breeding and conservation of C. parthenoxylon as well as related species.
We thank the Joint Vietnam - Russia Tropical Science and Technology Research Center (Hanoi, Vietnam), the Graduate University of Science and Technology (GUST), Vietnam Academy of Science and Technology (Hanoi, Vietnam) and local people for their expertise and support of this research.
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 106.06-2021.02. The first author was funded by the Ph.D. Scholarship Programme of Vingroup Innovation (VINIF), code VINIF.2023.TS.088.