Characterisation of the Cinnamomumparthenoxylon (Jack) Meisn (Lauraceae) transcriptome using Illumina paired-end sequencing and EST-SSR markers development for population genetics

Abstract Cinnamomumparthenoxylon is an endemic and endangered species with significant economic and ecological value in Vietnam. A better understanding of the genetic architecture of the species will be useful when planning management and conservation. We aimed to characterize the transcriptome of C.parthenoxylon, develop novel molecular markers, and assess the genetic variability of the species. First, transcriptome sequencing of five trees (C.parthenoxylon) based on root, leaf, and stem tissues was performed for functional annotation analysis and development of novel molecular markers. The transcriptomes of C.parthenoxylon were analyzed via an Illumina HiSeqTM 4000 sequencing system. A total of 27,363,199 bases were generated for C.parthenoxylon. De novo assembly indicated that a total of 160,435 unigenes were generated (average length = 548.954 bp). The 51,691 unigenes were compared against different databases, i.e. COG, GO, KEGG, KOG, Pfam, Swiss-Prot, and NR for functional annotation. Furthermore, a total of 12,849 EST-SSRs were identified. Of the 134 primer pairs, 54 were randomly selected for testing, with 15 successfully amplified across nine populations of C.parthenoxylon. We uncovered medium levels of genetic diversity (PIC = 0.52, Na = 3.29, Ne = 2.18, P = 94.07%, Ho = 0.56 and He = 0.47) within the studied populations. The molecular variance was 10% among populations and low genetic differentiation (Fst = 0.06) indicated low gene flow (Nm = 2.16). A reduction in the population size of C.parthenoxylon was detected using BOTTLENECK (VP population). The structure analysis suggested two optimal genetic clusters related to gene flow among the populations. Analysis of molecular variance (AMOVA) revealed higher genetic variation within populations (90%) than among populations (10%). The UPGMA approach and DAPC divided the nine populations into three main clusters. Our findings revealed a significant fraction of the transcriptome sequences and these newlydeveloped novel EST-SSR markers are a very efficient tool for germplasm evaluation, genetic diversity and molecular marker-assisted selection in C.parthenoxylon. This study provides comprehensive genetic resources for the breeding and conservation of different varieties of C.parthenoxylon.


Introduction
Cinnamomum parthenoxylon (Jack) Meisn.(Lauraceae) is an evergreen, broad-leaved, and diploid (2n = 24) tree species restricted to Vietnam, India, and China (MOST and VAST 2007).The species is found at elevations above 200 m in the secondary forests of Vietnam in deep, well-drained soil developed from limestone, granite, and alluvial sediment (Nguyen 2010).Cinnamomum species are recalcitrant to natural regeneration.Older trees are light-demanding, but young trees may tolerate shade.The tree can reach up to 45 m in height and 85 cm in diameter at breast height.They are a natural source of high-quality essential oil and timber.In addition, Cinnamomum is an economically important tree in Vietnam.The excessive use/exploitation of the species by the timber and pharmaceutical industries in recent decades has caused a sharp population decline in C. parthenoxylon.The population decline has been exacerbated by the poor regeneration potential and slow growth of the species (Hai et al. 2016).For these reasons, the species has been classified as endangered (CR A1a,c,d) in the Vietnam Red Data Book (MOST and VAST 2007).Human activities in tropical forests mostly reduce the distribution area and increase isolation for many plant species (Yan et al. 2024), which affect genetic variation, gene flow between populations and resistance to environmental stochasticity (Liu et al. 2013, Sebbenn et al. 2017).Genetic diversity in a species contributes greatly to the adaptation of TM the species to environmental changes for long-term survival (Reed 2002).The genetic diversity within and between populations is a potential evolutionary trait that can lead to the formulation of strategies for species conservation and restoration (Hamrick and Godt 1996).
The past two decades have witnessed extensive applications of molecular techniques to analyze genetic diversity within and among populations of threatened species of Lauraceae, for instance, RAPD and SRAP markers to assess genetic diversity in Sri Lankan Cinnamomum species ( Abeysinghe et al. 2014), RAPD in Cinnamomum zeylanicum Blume (Sandigawad and Patil 2013), EST-SSRs developed on the basis of next-generation sequencing technology in Neolitsea sericea (Blume.)Koidz (Chen et al. 2015), SSRs developed in Cinnamomum camphora (L.) Siebold (Kameyama et al. 2012, Kameyama et al. 2017) and SSRs in Cinnamomum balansae Lecomte (Cui et al. 2022).However, the genetic diversity and genetic structure of C. parthenoxylon have not been explored, which are important prerequisites for formulating and implementing conservation and restoration measures.
Molecular markers are used to assess the genetic diversity and genetic structure of rare and threatened species populations.Microsatellites (SSRs; simple sequence repeats) are highly polymorphic, widely distributed across genomes and highly reproducible, which make them a powerful means to characterise the genetic make-up of plant populations.SSRs developed from expressed sequence tags (EST-SSRs) are gene-based markers (Sahu et al. 2012), with increased identification potential for polymorphisms in woody species (Kaur et al. 2012).Transcriptome sequencing is a simple and effective method for developing numerous SSRs for a comprehensive analysis of the plant genome (Wang et al. 2010, Yan et al. 2015, Hu et al. 2016).Recently, ESTs have become available for a large number of plant species; however, there are currently no ESTs for C. balansae in GenBank.Studies on C. parthenoxylon in Vietnam have focused on the morphological and ecological characteristics of the species (Tai andThin 2014, Hai et al. 2016).This study focuses on the transcriptome characterisation, functional analysis, classification and metabolic pathway identification of C. parthenoxylon using the Illumina HiSeq 4000 platform.A set of EST-SSR markers is developed and used to assess the genetic diversity and genetic structure of the endangered C. parthenoxylon population.

Plant material
To identify the characteristics of the comprehensive transcriptome of C. parthenoxylon and to develop a large number of expressed sequence tag-SSR (EST-SSR) markers, root, leaf, and stem tissue (Fig. 1) were collected from five C. parthenoxylon trees in a wild population in Tam Dao National Park, Vinh Phuc Province, Vietnam.All samples were immediately frozen individually in liquid nitrogen and stored at -80°C until RNA extraction.

TM
To quantify genetic variation within and among the populations, young leaves were sampled from a total of 179 trees in nine wild populations of C. parthenoxylon (Fig. 1, Table 1).The samples were immediately placed in liquid nitrogen upon collection and stored at -30°C.

RNA isolation and transcriptome sequencing
Total RNA was extracted from the samples using a plant RNA kit (Omega Bioteck Inc.) following the manufacturer's instructions and treated with DNase-I after extraction.The integrity of the RNA samples was evaluated by gel electrophoresis with 1.2% agarose gel and an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA) and the purity was analyzed using a NanodropND-2000 spectrophotometer (NanoDrop Technologies, USA).
Total mRNA was purified using oligo magnetic beads.The cDNA libraries were constructed following the manufacturer's instructions.The Illumina HiSeq 4000 platform was used to sequence the transcriptome of C. parthenoxylon (Beijing Nuoheyuan Technology Co., Ltd.).
The quality of the raw reads was checked by Trimmomatic v3.0 (Bolger et al. 2014).The reads that had more than 15% low-quality bases, were shorter than 36 bp or were contaminated with adaptors were excluded.Trinity v2.4 (Grabherr et al. 2011) with the default parameters (K-mers = 25) was used for transcriptome assembly.Contigs without ambiguous bases were obtained by conjoining the K-mers in an unambiguous path.The reads were mapped back to contigs by Trinity to build unigenes with the paired-end information.The program identified contigs from the same transcript and the distances between these contigs.The contigs were assembled with Trinity and those that did not extend on either end were sequenced.Such sequences are defined as unigenes.Finally, the overlapping unigenes from the libraries were assembled into a continuous sequence by the overlapping ends of different sequences and redundant sequences were removed to acquire nonredundant unigenes with the maximum possible length by the TIGR Gene Indices clustering tool (TGICL) v2.1 (Pertea et al. 2003).

Functional annotation analysis
To determine predicted functions, all unigenes were used and compared with the NCBI non-redundant (NR) protein sequences (Deng et al. 2006), Swiss-Prot (Apweiler et al. 2004), Gene Ontology (GO) (Ashburner et al. 2000), Clusters of Orthologous Groups (COG) (Tatusov et al. 2000), Eukaryotic Orthologous Groups (KOG) of proteins (Koonin et al. 2004) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2004) databases using the BLAST (E-value 10 ) program.HMMER software was used to determine protein sequence similarities with the protein family database (Pfam) (Finn 2005).

Detection of SSR loci and development of SSR primers
The DNA sequence of each nonredundant gene was obtained by sequencing.All unique sequences were used to determine the composition, frequency, and distribution of SSRs in MISA v1.0.For the search criteria of MISA, the minimum number of repetitions of each corresponding unit size was defined as follows: 1~10, 2~6, 3~5, 4~5, 5~5, and 6~5 (where, for example, 1~10 refers to a single nucleotide as the repeat unit with at least 10 repeats and 2~6 refers to a dinucleotide as the repeat unit with at least 6 repeats).ESTs containing SSRs were deposited in GenBank.SSR primers were designed using Primer v5.0 software.The major parameters for primer design were as follows: primer length (18-24 bp), with an optimum of 20 bp, PCR product sizes of 100-300 bp, annealing temperature between 55 and 65°C with an optimum of 60°C and a GC content of 40-65%, with 50% being optimal.The SSR primer pairs were synthesized by Breeding Biotechnologies Co., Ltd.

DNA extraction and the identification of SSR polymorphism
Total genomic DNA was extracted from fresh leaves by a DNA extraction kit (Norgenbiotek, Canada).An MM 400 mixer (Retsch) was used to crush the samples in liquid nitrogen.The total DNA purity and integrity were measured by a fluorimeter and then the DNA was diluted to 10 ng µl .A total of 54 primer pairs were randomly selected among 134 SSR primer pairs for PCR identification and 15 were successfully amplified for assessing the nine wild populations of C. parthenoxylon (Suppl.material 2).The PCR amplification reaction was performed in a 25 µl volume comprising 2.5 µl of template DNA, 12.5 µl of 2X Taq Master mix, 1 µl of each primer and 8 µl of deionized water.The settings for running the PCR were as follows: initial denaturation at 94°C for 3 minutes; 40 cycles of 1 minute at 94°C, 30 s at a 54-56°C annealing temperature for primer pairs, 1 minute of extension at 72°C and 10 minutes at 72°C for the final cycle; and holding at 4°C.The PCR products were sized and subjected to relative quantification between samples on a 5300 Fragment Analyzer (Agilent) with an Agilent DNF-905 dsDNA Kit (1-500 bp) (Agilent).

Genetic diversity
Null alleles and other genotyping errors were detected using MICRO-CHECKER v. Variables representing genetic diversity per locus and population, including the number of alleles (Na), number of effective alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), inbreeding coefficient (Fis) and gene flow (Nm), calculated using the program GenAlEx v.6.5 (Peakall and Smouse 2010).The individual inbreeding model was performed to evaluate the Fis index for null allele frequency F IIM using INEst (Chybicki and Burczyk 2009).BOTTLENECK v.1.2was used to examine each population for deviations in heterozygosity from the expected level for that population given mutation-drift equilibrium (Piry et al. 1999).Two different mutational and drift models, SMM (the stepwise mutation model) and TPM (the two-phased model of mutation) were considered.For TPM, 70% single-step mutation and 10,000 repeats were used.A one-tailed Wilcoxon signed rank test was utilised to determine the significance of the findings.

Genetic differentiation and population structure
The F ( Weir and Cockerham 1984) and G' values ( Hedrick 2005) were used to quantify the genetic divergence between the populations (GenAlEx v.6.5).Using ARLEQUIN v.3.1, the significance of the F values for population pairs was assessed at a significance level of 0.05.The results of this testing were considered significant (Excoffier et al. 2005).Analysis of molecular variance (AMOVA) was carried out with the assistance -1

ST ST
ST of ARLEQUIN v.3.1, which was used to perform significance testing for the variance components involved.The outcomes of 10,000 different permutations served as the basis for this testing.The genetic association among populations was determined by Poptree2 using the UPGMA approach (Takezaki et al. 2010).Analysis of population structure was performed using STRUCTURE v.2.3.4 and the Bayesian clustering method was used (Pritchard et al. 2000).Establishing the admixture model with correlated allele frequencies required ten distinct runs for each number of groups in the dataset (K).These runs were carried out with 500,000 Markov Chain Monte Carlo (MCMC) iterations and a burn-in time of 100,000 iterations.K ranged from 1 to 15. STRUCTURE HARVESTER ( Earl and Vonholdt 2012) was used to detect the number of groups that best fit the dataset based on the K that was determined by Evanno et al. (2005) to determine the ideal value of K.This was done so that the optimal value of K could be determined.When the optimal K value had been determined, the duplicated findings were aligned with CLUMPP v1.1.2(Jakobsson and Rosenberg 2007) and the allocated cluster membership bar plots were created with DISTRUCT v.1.1.(Rosenberg 2004).A discriminant analysis of principal components (DAPC) was also carried out by utilising the adegenet package in the R v4.0.2 program to discover clusters of individuals with a common genetic ancestor (Jombart et al. 2010).The DAPC was implemented without any previous information on the population's origin.The Bayesian information criterion was used to establish the "ideal" cluster size and the number of clusters, denoted by the letter "K," ranged from one to twenty in the experiments (BIC).The DAPC was also responsible for performing the previous information analysis to determine how individuals should be assigned to populations.To obtain a better visual representation of the genetic clusters, the complot function of adegenet was applied.The xvalDAPC function was used to retain the top fourteen principal components from the principal component analysis (explaining 98.5% of the variance), as well as the seven discriminant eigenvalues.
Distribution of species search of unigenes against the Nr database.
The distribution of EST-SSRs based on the number of repeat units.

Population genetic diversity and structure
A total of 54 primer pairs were randomly selected among 134 SSR primer pairs for PCR identification and 15 were successfully amplified for the study of the wild population of C. parthenoxylon at nine different locations (Suppl.material 2).Null allele frequencies were determined at four loci (VDD07, VDD08, VDD11 and VDD14) for C. parthenoxylon at a significance level of 0.05.Ten of the 15 loci displayed heterozygote deficits under the Hardy-Weinberg Equilibrium in C. parthenoxylon (Table 5).We found 97 different alleles of 104 to 260 bp at the 15 loci from the 179 sampled trees of nine C. parthenoxylon populations.Table 4 shows that Na = 3.29 (2.00-4.89),Ne = 2.18 (1.24-3.11),PIC = 0.52 (0.18-0.77),Ho = 0.56 (0.08-0.86) and He = 0.47 (0.14-0.66).The fixation index (F ) was -0.145.Positive Fis values were found at all four loci (VDD07, VDD08, VDD11 and VDD14) under investigation, suggesting an excessive number of homozygotes and inbreeding.The inbreeding coefficient of the populations (Fit = -0.002)suggests an excess of heterozygotes in the populations (Table 5).Genetic diversity was recorded at the population level (Table 1) and the data showed 3.29 mean alleles per locus (Na) and 2.18 mean effective alleles (Ne).The proportion of polymorphic loci was 94.07%, where Ho and He were 0.56 and 0.47, respectively.The values of the inbreeding coefficient (Fis) were negative for the six populations (QN, VP, HB, PY, YT and PT), which proved that these populations are predominantly allogamous, further suggesting outcrossing and reflecting the population structure.The values of Fis were positive for three populations (DL, GL and TH), suggesting a departure from Hardy-Weinberg quilibrium with excess homozygosity.The F IIM values varied from 0.015 (PY) to 0.038 (HB), with an average of 0.02.The bottleneck analysis showed significant heterozygosity deficits in the VP population (p<0.05)under the SMM (stepwise mutation model) and TPM (two-phase model) (Table 1).This suggests that the VP population recently experienced a bottleneck.
Analysis of molecular variance (AMOVA) was performed based on 1000 permutations and showed that the molecular variation was attributable to differentiation within the populations of C. parthenoxylon (Table 6).The results revealed that the total variation within populations was 90%, which was highly significant (P<0.001).Genetic variation between the populations was recorded as 0.06 (0.01-0.14) (P < 0.05), indicating moderate genetic differentiation, while gene flow was 2.16.The highest differentiation value (Fst = 0.14) was observed within the YT and GL populations and the lowest (Fst = 0.01) was observed within the PT and VP populations (Table 7).

Sum of squares
Variance components Without any prior information, discriminant analysis of principal components (DAPC) also uncovered three genetic groupings for C. parthenoxylon (Fig. 7b).Cluster 1 included two populations from the Central Highlands area (GL and DL), Cluster 2 included two populations from the South Central Coast area (QN and PY) and Cluster 3 included five populations from the Northwest and Northeast areas (HB, VP, PT, YT, TH and TQ).DAPC performed with prior information on population origin showed individuals within and between populations (Fig. 7a).The high overlap indicated low genetic differentiation between populations.
The Bayesian analysis of individual assignments, based on the likelihoods, showed that the highest ∆K value (405.5) for 179 individuals was associated with K = 2 as the optimum number of genetic groups and showed that all individuals exhibited admixture among the three groups (Fig. 1, Fig. 7c and d).The proportion of ancestry linked with each genetic group can be deduced from the color of each individual, as measured by the percentage of the population's total color.At K = 2, the C. parthenoxylon populations were divided into two groups (Central Highlands and South Central Coast; Northwest and Northeast).The percentage of the color of 179 individuals showed the fraction of ancestry associated with each genetic group.The blue group contained four populations (GL, DL, QN and PY), while the orange group included five populations (HB, VP, PT, YT, TH and TQ).At K = 4, the C. parthenoxylon populations were differentiated into five groups.
The distinct genetic groups of the populations were identified by the unweighted pair group method average cluster analysis (UPGMA) method based on Nei's distance by using POPTREE2 (Fig. 8).Three different groups were generated; the first group was composed of six populations from the Northwest and Northeast areas (HB, VP, PT, YT, TH and TQ) with a bootstrap value of 65%, the second group was composed of two populations from the South Central Coast area (QN and PY) with a bootstrap value of 59% and the last group was composed of two populations from the Central Highlands area (GL and DL) with a bootstrap value of 64%.The populations VP/PT, QN/PY and GL/DL had a close relationship with each other, with high bootstrap values of 62%, 59% and 64%, respectively.

Discussion
In the present study, the first transcriptomic analysis of C. parthenoxylon was performed via Illumina HiSeq™ 4000 sequencing technology.Transcriptome sequencing of C. parthenoxylon provided a valuable resource for the development of SSR markers to study the genetic diversity, molecular marker-assisted breeding and evolution of the Lauraceae family.After the uni-transcripts were assembled, the GC content (46.2%) of C. parthenoxylon was found to be lower than that of Arundo donax L. [49%; Evangelistella et The GC content provides valuable information regarding the stability of genes and genomic composition for the evolution and genetic structure of the species.These results can be attributed to the adaptability of species to different environmental conditions.
To uncover the prevailing genetic diversity and evolution patterns of C. parthenoxylon, databases such as COG, GO, KEGG, KOG, Pfam, Swiss-Prot and NR were utilied to investigate matching sequences.Cinnamomum parthenoxylon unigenes were annotated among GO categories.The metabolic process term in the biological process category and the cell term in the cellular components category were the largest group in this study, indicating the importance of cellular and metabolic activities.For this purpose, KEGG resources were used to study the biological functions and interactions of genes.Some of the dominant pathways were the ribosome (876 members), oxidative phosphorylation (456), protein processing in the endoplasmic reticulum (445), glycolysis/gluconeogenesis (417), spliceosome (324), RNA transport (323) and purine metabolism (306).
Illumina sequencing technology is an efficient method for obtaining large amounts of transcriptome data for the identification of novel genes and the development of molecular markers.The current study suggested that the transcriptome sequencing of C. parthenoxylon provided a valuable resource for the development of EST-SSR markers to study genetic diversity, molecular marker-assisted breeding and evolution in the Lauraceae family.A total of 160,435 transcriptome unigenes were obtained using the Illumina HiSeq 4000 platform for C.  et al. 2015).Among the targeted populations, three (DL, GL and TH) showed Fis values ranging from 0.01 to 0.14, indicating a deficit of heterozygotes due to inbreeding.These results suggest a small neighbourhood size and matings between siblings within populations.Further results revealed a sign of a bottleneck for the GL population (P < 0.005).The heterozygosity deficit detected by bottleneck analysis showed a reduction in the population size of C. parthenoxylon.Small populations may be the result of inbreeding, which may reduce genetic diversity.However, bottlenecks were also detected in the VP population, implying that population sizes were rapidly reduced in the 1980s and 1990s (MOST and VAST 2007, Vu et al. 2017, Nguyen et al. 2021) Its pollen grains can be dispersed through the wind and insects (bees).Due to the fact that dispersal depends on insects, an increase in geographic distance will decrease the transmission of pollen grains and seeds between populations.Therefore, habitat fragmentation can lead to increasingly isolated populations and then decrease gene exchange among the populations.This can affect the genetic structure of C. parthenoxylon.Our study showed low levels of genetic difference in the same areas, such as GL and DL (Fst = 0.07) in Central Highlands, QN and PY (Fst = 0.05) in the South Central Coast VP and PT (Fst = 0.01) in the Northeast area.High genetic differences were recorded between populations located in different areas, such as Fst = 0.14 between GL/VP and GL/YT, Fst = 0.13 between GL/HB and GL/PY, Fst = 0.12 between GL/PT and Fst = 0.11 between DL/PY all of which are separated by large geographic distances.Moreover, the dispersal of seeds through the wind may be restricted due to their heavy weight.In our study, the UPGMA tree and DAPC revealed closer relationships among the C. parthenoxylon populations and two different genetic clusters were identified by structure analysis based on the Bayesian model, with a ΔK value at K = 2.However, the structural analysis showed that the 179 genotypes of C. parthenoxylon were divided into three clusters in the UPGMA tree and DAPC (Central Highlands and South Central Coast; Northwest and Northeast areas of Vietnam).This reflects the existence of genetic structure as a consequence of genetic exchange, regarding the populations in the identified genetic clusters.The Bayesian clustering approach performed by Structure was more suitable than the UPGMA tree based on genetic distances between the populations.To prevent inbreeding depression and increase the variability of progenies through outcrossing in the future, all the targeted populations could be considered for both in-situ and ex-situ conservation activities.The PY and VP populations maintained high genetic diversity and can be considered high priorities for in-situ conservation to promote the sustainable utilisation of genetic resources.All populations should be considered for germplasm collection to maximise genetic diversity.

Conclusions
In this study, for the first time, Illumina HiSeq™ 4000 sequencing technology was applied for transcriptomic analysis of C. parthenoxylon in Vietnam.A total of 12,849 EST-SSRs were recorded and 134 SSR loci were deposited in GenBank (OR536813-OR536946).The present study shows that C. parthenoxylon currently maintains medium levels of genetic diversity and shows low genetic differentiation among populations.The EST-SSRs generated in this study for C. parthenoxylon will aid in further exploration of allopatric speciation, adaptive divergence in genetic and lineage diversity and the migration history of C. parthenoxylon.Our study may also contribute to taxonomic studies as well as evolutionary research.The current study also provides a platform for breeding and conservation of C. parthenoxylon as well as related species.

Figure 1 .
Figure 1.Sampling locations and individuals of C. parthenoxylon used for this study in Vietnam.Map showing the collection sites (a), adult plant (b), leaves, stem, roots (c) and sampling plant (d).Different symbols in (a) show genetic clustering into two clusters, as revealed in population structure analyses based on microsatellite data.
2.0 software(Van Oosterhout et al. 2004), with 1000 bootstrap iterations over loci to generate the expected homozygote and heterozygote frequencies.Deviations from Hardy-Weinberg Equilibrium (HWE) for loci within populations were tested in Cervus(Kalinowski et al. 2007)   based on 1000 permutations of alleles among individuals.CERCUS(Kalinowski et al.  2007) was used to estimate each locus' PIC (polymorphism information content) value.

Figure 3 .
Figure 3. Gene Ontology (GO) classifications of unigenes of C. parthenoxylon.A total of 25,352 unigenes were categoried into three main categories: biological process, cellular component and molecular function.The x-axis indicates the subgroups in GO annotation, while the y-axis indicates the percentage of specific categories of genes in each main category.

Figure 4 .
Figure 4. Classification of the COG by functional categories of C. parthenoxylon.One-letter abbreviations for the functional categories (A-Z).

Figure 5 .
Figure 5. Clusters of orthologous groups based on KEGG classification of C. parthenoxylon (Cellular process; Environmental information processing; Genetic information processing; Metabolism; Organismal systems).

Figure 7 .
Figure 7. Analysis of population structure using DAPC and STRUCTURE for nine C. parthenoxylon populations.Scatterplot of the DAPC with prior information (a), Scatterplot of the DAPC without prior information (b), Distribution of DeltaK over K=1-15 (c) and Bar plots at K = 2 and K = 4 (d).

Figure 8 .
Figure 8. Unweighted pair group method average cluster analysis tree based on Nei's chord distance of genetic relationship among nine C. parthenoxylon populations.

Table 1 .
Sampling location and genetic diversity within C. parthenoxylon populations at 15 SRR loci Note: N, population size; Na, mean number of alleles per locus; Ne, mean number of effective alleles; P %, percentage of polymorphic loci; Ho and He, mean observed and expected heterozygosities, respectively; Fis, inbreeding coefficient; F IIM, corrected inbreeding coefficient for null alleles; *p < 0.05.P < 0.01, P < 0.001; ns, not significant; SMM (stepwise mutation model) and TPM (two-phased model of mutation).

Table 2 .
Overview of de novo transcriptome assembly for C. parthenoxylon.

Table 6 .
Analysis of molecular variance in C. parthenoxylon from nine populations.
These results are similar to those of previous studies on Panicum miliaceam L.(Yue et al. 2016)and Iris lactea Pall.var.chinensis(Fisch.)Koidz(Guet al. 2018).A total of 14,629 (9.12%) unigenes were assigned to 25 COG categories.Among these categories, ''general function prediction only'' and "translation, ribosomal structure and biogenesis'' were dominant.The current COG database results are parallel to those previously reported for Camellia taliensis.
parthenoxylon, with an N50 of 711 bp and a mean length of 548.95 bp.Yan et al. 2017).The mono-and di-nucleotide repeat motifs were the most 1M abundant repeat types, followed by tri-, tetra-, hexa-and pentanucleotides.These results in C. parthenoxylon were in line with those previously reported for N. sericea( Chen et al.  2017) and C. longepaniculatum (Yan et al. 2017).The A/T motif was most frequent than C/ G, accounting for 99.2% of the mono-nucleotide SSR markers.Among the four types of dinucleotide motifs in C. parthenoxylon, AG/CT was recorded at the highest frequency (73.73%).These values are higher than those in L. glauca (37.5%)(Zhu et al. 2016), N. sericea (30.02%) (Chen et al. 2017) and C. longepaniculatum (Yan et al. 2017), where CG/ CG repeat motifs were the least frequent (0.12%).CG motifs were very rare in C. parthenoxylon, which is similar to the previously reported results in N. sericea (Chen et al. 2017) and C. longepaniculatum (Yan et al. 2017).The most abundant trinucleotide repeat motifs were AAG/CTT (38.52%).Similar results were also reported in C. longepaniculatum (Yan et al. 2017), L. glauca (Zhu et al. 2016) and N. sericea(Chen et al. 2017).Thus, A/T, AG/CT and AAG/CTT were the major SSR markers in C. parthenoxylon.SSRs can be used to analyse differences between the C. parthenoxylon genome and the genomes of other plants within the same family, which can reveal genetic diversity and provide a set of genetic markers.In the present study, the mean PIC value was 0.52 for 15 polymorphic EST-SSR markers, showing a high level of informativeness(Botstein et al. 1980).These markers could contribute to genetic analysis and aid in the accumulation of genetic information on C. parthenoxylon origins.In this study of C. parthenoxylon, the 15 SSR markers showed a moderate level of genetic diversity (Ho = 0.56 and H e = 0.47).However, high genetic diversity levels were reported in Machilus pseudokobu Koidz.(He = 0.47) (Tsuneki et al. , leading to a lack of gene exchange in the population and a sharp decrease in allele frequency.Differences in the genetic structure of populations are very important for exploring genetic diversity and Fst is an effective metric with which to study genetic differentiation and gene flow among populations(Liu et al. 2013, Peng et al. 2017).Genetic differentiation is strongly affected by gene flow and genetic drift(Hamrick and Godt 1996).Among the tested C. parthenoxylon populations, low levels of genetic differentiation (Fst = 0.06) and gene flow (Nm = 2.16) were found.Strong gene flow (Nm > 1) leads to a high number of migrants in each generation and may prevent genetic differentiation among populations due to genetic drift(Slatkin 1985).A barrier to gene flow is reflected by significant differentiation (10%; p<0.001) in C. parthenoxylon.Gene flow, genetic drift, selection, mutations and long-term evolution have a strong effect on genetic variations among different populations(Finkeldey and Hattemer 2007, Brutting et al. 2012, Yan et al. 2024).Literature reports lower variations within populations and mating systems and strong gene flow in C. camphora(Kameyama et al. 2017, Kameyama et al. 2012, Li et al. 2018)and N. sericea( Chen et al. 2017).High genetic exchange in these species, probably via the dispersal of pollen grains and seeds, may contribute to preserving genetic variation by avoiding local gene drift.Pollen grains dispersed through insects might be a key contributor to gene flow(Brutting et al. 2012).As a woody species, C. parthenoxylon has large longevity and is predominately outcrossing.