Biodiversity Data Journal :
Software Description
|
Corresponding author: Cyril Dutech (cyril.dutech@inrae.fr)
Academic editor: Christian Wurzbacher
Received: 20 Apr 2022 | Accepted: 23 Jun 2022 | Published: 24 Aug 2022
© 2022 Benjamin Penaud, Benoit Laurent, Marine Milhes, Camille Noüs, François Ehrenmann, Cyril Dutech
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Penaud B, Laurent B, Milhes M, Noüs C, Ehrenmann F, Dutech C (2022) SNP4OrphanSpecies: A bioinformatics pipeline to isolate molecular markers for studying genetic diversity of orphan species. Biodiversity Data Journal 10: e85587. https://doi.org/10.3897/BDJ.10.e85587
|
For several decades, an increase in disease or pest emergences due to anthropogenic introduction or environmental changes has been recorded. This increase leads to serious threats to the genetic and species diversity of numerous ecosystems. Many of these events involve species with poor or no genomic resources (called here "orphan species"). This lack of resources is a serious limitation to our understanding of the origin of emergent populations, their ability to adapt to new environments and to predict future consequences to biodiversity. Analyses of genetic diversity are an efficient method to obtain this information rapidly, but require available polymorphic genetic markers.
We developed a generic bioinformatics pipeline to rapidly isolate such markers with the goal for the pipeline to be applied in studies of invasive taxa from different taxonomic groups, with a special focus on forest fungal pathogens and insect pests. This pipeline is based on: 1) an automated de novo genome assembly obtained from shotgun whole genome sequencing using paired-end Illumina technology; 2) the isolation of single-copy genes conserved in species related to the studied emergent organisms; 3) primer development for multiplexed short sequences obtained from these conserved genes. Previous studies have shown that intronic regions of these conserved genes generally contain several single nucleotide polymorphisms within species. The pipeline's functionality was evaluated with sequenced genomes of five invasive or expanding pathogen and pest species in Europe (Armillaria ostoyae (Romagn.) Herink 1973, Bursaphelenchus xylophilus Steiner & Buhrer 1934, Sphaeropsis sapinea (fr.) Dicko & B. Sutton 1980, Erysiphe alphitoides (Griffon & Maubl.) U. Braun & S. Takam. 2000, Thaumetopoea pityocampa Denis & Schiffermüller, 1775). We successfully isolated several pools of one hundred short gene regions for each assembled genome, which can be amplified in multiplex. The bioinformatics pipeline is user-friendly and requires little computational resources. This easy-to-set-up and run method for genetic marker identification will be useful for numerous laboratories studying biological invasions, but with limited resources and expertise in bioinformatics.
amplicon, biological invasion, forest diseases and pests, single-copy genes, whole-genome sequencing
Pest and disease emergences are an important threat to the biodiversity and functioning of world ecosystems (
Evolutionary and demographic inferences may be obtained from population genetic analyses (
Out of the methods used for SNP genotyping, genome-wide sequencing (GWS) allows us to obtain thousands of markers by using high-throughput DNA sequencing on parts or whole genomes studied (
Therefore, the objective of the present software was to develop an automatic bioinformatics pipeline usable for a large number of species, especially focusing on emerging forest pest and pathogen species that are often orphan species (i.e. with no or poorly published genomic resources). The pipeline hereafter called “SNP4OrphanSpecies” is designed to be easily installed and used by biologists and based on limited genomic resources (i.e. one single shotgun whole genome sequencing) in order to provide useful genetic markers for assignments to genetic lineages, identification of the origin of invading or expanding populations and estimates of population diversity and structure.
SNP4OrphanSpecies
The method to isolate single-copy genes for orphan species is based on an automated de novo genome assembly without the step of manual curation, using sequence data generated by paired-end Illumina sequencing technology. The assembly quality is checked by looking at some summary statistics (i.e. genome size, degree of assembly fragmentation, completeness of the genome). For isolating hundreds of SNPs, we focused on the single-copy genes conserved in genomes at a given taxonomic level (i.e. genus, family or order). The focus on these conserved genes allows us: 1) to control for the taxonomic assignment of the analysed genomic regions, 2) to remove duplicated genes in the genome which can produce possible false positive SNPs, 3) to isolate several SNPs generally present in the introns of these genes and 4) to yield several pools of pairs of primers for amplification of around 400 bp sequences which can be amplified together in one multiplex (100 sequences per pool). By automation of these steps, the method decreases the time of genomic analysis, while it selects DNA sequences specific to the studied species (discarding sequences due to, for example, laboratory or field DNA contamination;
The basic pipeline steps are illustrated in Fig.
The second step of this pipeline evaluates the completeness of the assembly and identifies the genes which will be used to isolate short sequences (400 bp, hereafter called “amplicon”). In this objective, BUSCO v.4.1.4 (
The last step of the pipeline is the isolation of amplicons to be amplified in pools. For this step, the amplicons are chosen to encompass at least one intron in the sequence. For each amplicon, a pair of primers is designed using a home-made Perl script integrating the programme Primer3 (v.2.5
Keeping in mind biologist users, we implemented this pipeline with Snakemake (
For running the pipeline, users only must: 1) produce a paired-end Illumina whole-genome sequencing of the species of interest and 2) set parameters (i.e. taxonomy, filtering, number of amplicons), in the file “Snakemake_Config_SNP4OrphanSpecies.yaml”. A README available on https://doi.org/10.15454/GWKRKY gives more details about these different steps and parameters.
We consider that the time of bioinformatics analyses to isolate and to develop new markers is seriously reduced thanks to this pipeline, easily installed on a personal computer, without the need to access the internet after this setting. Then, this method is especially dedicated to research teams, governmental agencies or organisations, which have limited human and financial resources. With the short sequences provided by this pipeline, the possibility of obtaining the first genetic information on recently emerging populations without the high cost of genome sequencing should help to identify the origin of emergence and the risk of adaptation to new ecosystems and define the best practices to manage new disease or pest species.
We assessed the performance of the pipeline in a new de novo genome assembly of Diplodia sapinea isolate CBS117911. Diplodia sapinea or Sphaeropsis sapinea (fr.) Dicko & B. Sutton 1980 is a worldwide emergent fungal pathogen infecting many host trees, especially pine species (
After filtering and trimming Illumina raw reads, new de novo genome assemblies were produced by the pipeline for each species (details of each assembly are given in Table
Description of the genome assemblies obtained for the five tested species. Data into brackets are from the original publication.
Species |
Diplodia sapinea |
Erysiphe alphitoides |
Armillaria ostoyae |
Thaumetopoea pityocampa |
Bursaphelenchus xylophilus |
class |
Dothideomycetes |
Leotiomycetes |
Agaricomycocetes |
Insecta |
Secermentea |
order |
Botryosphaeriales |
Erysiphales |
Agaricales |
Lepidoptera |
Aphelenchida |
family |
Botryosphaeriaceae |
Erysiphaceae |
Physalacriaceae |
Notodontidae |
Parasitaphelenchidae |
Reference |
This study (reference genome: |
|
|
|
|
Sequencing |
Illumina Hiseq3000 |
Illumina Hiseq2000 |
Illumina Hiseq2000 |
Illumina Hiseq2000 |
Illumina Hiseq2000 |
Strain |
CBS117911 |
MS_42D |
C18 |
PE300i -PE600i |
Ka4C1 |
Number of Reads |
10,544,224 |
369,262,818 |
116,828,130 |
462,786,916 |
58,326,120 |
Number of Reads Used to construct the assembly |
9,044,726 |
313,340,218 |
103,921,206 |
381,071,842 |
55,197,190 |
Total length |
37,650,182 (36.97 Mb) |
316,911,737 (308.4 Mb) |
57,720,627 (60.9 Mb) |
536,111,310 (537 Mb) |
70,264,222 (74.6 Mb) |
Nbcontigs > 500 bp |
1,793 (2,194) |
131,582 (555,289) |
7,119 (106) |
289,399 (68,292) |
10,373 (5,527) |
Nbcontigs > 1000 bp |
1,387 |
79,253 |
4,666 |
185,303 |
7,823 |
Nbcontigs > 50000 bp |
200 |
68 |
215 |
1 |
76 |
Largest contig |
324,688 |
102,030 |
563,590 |
63,395 |
148,994 |
GC(%) |
56.71 |
49.73 |
48.32 |
38.08 |
40.38 |
N50 (kb) |
48.5 (37.7) |
3.4 (1.7) |
34.3 (2800) |
2.3 (163.6) |
15.1 (949) |
L50 (number) |
218 (NA) |
17,657 (NA) |
371 (9) |
67,374 (728) |
1,341 (22) |
Summary of the genes and primers isolated by SNP4Orphanspecies pipeline for the five tested species. * one per gene and not duplicated in the genome.
Species |
Diplodia sapinea |
Erysiphe alphitoides |
Armillaria ostoyae |
Thaumetopoea pityocampa |
Bursaphelenchus xylophilus |
Nb of Busco genes |
3,786 |
3,234 |
3,870 |
5,286 |
3,131 |
Nb of Complete single-copy |
3,733 |
2,353 |
3,787 |
2,219 |
2,068 |
Nb with the validated genus |
3,557 |
987 |
3,765 |
NA |
NA |
Nb of defined amplicons |
6,962 |
1,829 |
20,991 |
3,163 |
13,256 |
Nb of genes with amplicons |
2,760 |
685 |
3,438 |
887 |
1,955 |
Nb of pairs of primers |
6,095 |
1,408 |
20,617 |
1,916 |
10,938 |
Nb of conserved pairs of primers* |
2,570 |
614 |
3,426 |
672 |
1,928 |
% gene duplication in pools |
20.8 |
57.4 |
19 |
52.8 |
23 |
The analysis of genomes from different phyla suggested that the method can be used for a large number of invasive or emergent insect and pathogen species for which genetic markers are searched. Some limitations may occur for large genomes (i.e. several hundreds Mb), since the assembly, even without any curation steps, requires a minimum of computation resources. For example, the analysis of the T. pityocampa genome for which the size was estimated to be more than 500 Mb, generated 268 GB (only 134 GB for the trimmed fastaQ files) and took more than 18 hrs on a Linux cluster using 20 CPUs. For such a large genome, it could be interesting to assess the method with a reduced whole-genome sequencing (i.e. a lower sequencing coverage or randomly amplified genome). Based on our results obtained from the highly fragmented and contaminated E. alphitoides genome assembly, we speculate that several hundreds of the conserved single-copy genes can be generally isolated by the present method, even from a low-quality or partial genome assembly. Another limitation would be the use of contaminated genome assemblies which may be quite frequent in whole-genome sequencing (
Two strategies can then be developed after the isolation of these amplicons. The first one would be sequencing hundreds of samples using one of the designed pools and next-generation sequencers. The combination of SNPs identified in each amplicon can be treated as microhaplotypes (i.e. multi-SNP loci), potentially giving more power for population genetic analyses than analysing independent bi-allelic SNP loci (
We are aware that SNPs isolated from conserved genes may be under selection. It may seriously affect the inferences of demographic dynamics of populations and should be carefully considered if historical scenarios are tested (
We thank Pedro Crous (Westerdijk Fungal Biodiversity Institute, The Netherlands) for providing us with the Diplodia sapinea isolate and G. Sypos and C. Simang (University of Sopron, Hungary) for having sent us the raw sequencing data of the C17 Armillaria isolate genome. We also thank O. Lepais and the two reviewers for their comments on the previous versions of this manuscript that have greatly improved the quality of this manuscript. The genome sequencing of the Diplodia isolate was performed in collaboration with the GeT core facility, Toulouse, France (http://get.genotoul.fr). GeT was supported by France Génomique National infrastructure, funded as part of “Investissement d'avenir” programme managed by Agence Nationale pour la Recherche (contract ANR-10-INBS-09). Preliminary tests were performed at the Genome Transcriptome Facility of Bordeaux (grants from the Conseil Regional d’Aquitaine n°20030304002FA and 20040305003FA, the European Union, FEDER n°2003227 and Investissements d’avenir, N°ANR-10-EQPX-16-01). Parts of the computational resources and infrastructure used in the present publication were provided by the Bordeaux Bioinformatics Center (CbiB). This research has benefitted from the European HOMED project and received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement N°771271.
BP designed the pipeline and wrote the informatic code and the manuscript, BL produced the Diplodia genome, helped in the development of the pipeline and wrote the manuscript, MM produced the Diplodia genome, CN supported all this collective work, FE helped to design and develop the pipeline, CD designed the pipeline and wrote the manuscript.