SNP4OrphanSpecies: A bioinformatics pipeline to isolate molecular markers for studying genetic diversity of orphan species

Abstract Background For several decades, an increase in disease or pest emergences due to anthropogenic introduction or environmental changes has been recorded. This increase leads to serious threats to the genetic and species diversity of numerous ecosystems. Many of these events involve species with poor or no genomic resources (called here "orphan species"). This lack of resources is a serious limitation to our understanding of the origin of emergent populations, their ability to adapt to new environments and to predict future consequences to biodiversity. Analyses of genetic diversity are an efficient method to obtain this information rapidly, but require available polymorphic genetic markers. New information We developed a generic bioinformatics pipeline to rapidly isolate such markers with the goal for the pipeline to be applied in studies of invasive taxa from different taxonomic groups, with a special focus on forest fungal pathogens and insect pests. This pipeline is based on: 1) an automated de novo genome assembly obtained from shotgun whole genome sequencing using paired-end Illumina technology; 2) the isolation of single-copy genes conserved in species related to the studied emergent organisms; 3) primer development for multiplexed short sequences obtained from these conserved genes. Previous studies have shown that intronic regions of these conserved genes generally contain several single nucleotide polymorphisms within species. The pipeline's functionality was evaluated with sequenced genomes of five invasive or expanding pathogen and pest species in Europe (Armillariaostoyae (Romagn.) Herink 1973, Bursaphelenchusxylophilus Steiner & Buhrer 1934, Sphaeropsissapinea (fr.) Dicko & B. Sutton 1980, Erysiphealphitoides (Griffon & Maubl.) U. Braun & S. Takam. 2000, Thaumetopoeapityocampa Denis & Schiffermüller, 1775). We successfully isolated several pools of one hundred short gene regions for each assembled genome, which can be amplified in multiplex. The bioinformatics pipeline is user-friendly and requires little computational resources. This easy-to-set-up and run method for genetic marker identification will be useful for numerous laboratories studying biological invasions, but with limited resources and expertise in bioinformatics.


Introduction
Pest and disease emergences are an important threat to the biodiversity and functioning of world ecosystems (Fisher et al. 2012) and human well-being (Diagne et al. 2021).A dramatic increase in these events has been recorded for several decades (Anderson et al. 2004, Fisher et al. 2012, Santini et al. 2012, Lips 2016, Rohr et al. 2019).Specifically, in the context of invasions or recent expansions of pests and parasites, there is an urgent need to identify methods for avoiding, stopping or, at least, reducing their spread and deleterious effects on ecosystems, species and genetic diversity (Filipe et al. 2012, Gonthier et al. 2014, Hall et al. 2016) .For many of the emerging pests and pathogens, however, taxonomic and biological knowledge is insufficient, with no or little information about their geographical origin, routes of colonisation and ability to adapt to newlycolonised environments and environmental change, for example, induced by human activities (Lavergne and Molofsky 2007, Fraimout et al. 2017, Gross et al. 2021).
Evolutionary and demographic inferences may be obtained from population genetic analyses (Estoup andGuillemaud 2010, Beichman et al. 2018).For emerging species that are generally little studied before causing significant effects on ecosystems or native species, we need to develop new genetic markers because of no or few published genetic resources.This development is sometimes challenging because, to obtain correct genetic estimates, markers should only amplify DNA of the species under investigation.Population genetic studies are often based on samples collected in the field.This sampling may cause, especially for micro-organisms, DNA contamination due to host or other associated micro-organisms in the samples (Ballenghien et al. 2017) or possible mistakes of identification due to the occurrence of cryptic species in the sampled area (i.e.species apparently identical morphologically, but incapable of producing hybrids) (Queloz et al. 2011, Mapondera et al. 2012, Gross et al. 2021).In some cases, the situation is even more challenging when hybridisation may occur amongst sympatric species or between invasive and native species (Sillo et al. 2015, Soghigian et al. 2020).However, the genotyping of multiple polymorphic markers may sometimes allow the identification of the different genetic lineages and their potential hybrids (Altermann et al. 2014, López-Vinyallonga et al. 2015).As late as the first decade of the 21 century, microsatellite markers (also called simple sequence repeats, SSR) were extensively used for population genetic analyses (Selkoe and Toonen 2006).With the emergence of next-generation sequencing (NGS) technologies, single nucleotide polymorphism markers (SNP) have been increasingly used for population genetic studies.
Out of the methods used for SNP genotyping, genome-wide sequencing (GWS) allows us to obtain thousands of markers by using high-throughput DNA sequencing on parts or whole genomes studied (Elshire et al. 2011).These methods can be powerful because they generate a huge quantity of genetic information.However, they may be costly in time and money, since they generally need good DNA quality for all the samples analysed, high sequencing coverage to identify duplicated genomic regions and several computational steps for removing putative false positive SNPs (Ribeiro et al. 2015).All these requirements can be difficult for some species having a genome with numerous repetitive elements, for species for which extraction of good quality DNA is not easy to obtain (Dutech et al. 2020) or for research teams without bioinformatics expertise.In addition, the large genetic information generated by GWS methods may be beyond the need of only a few genetic markers to first estimate reproductive mode, gene flow, spatial structure or origin of the expansion of emerging species (Peccoud et al. 2008, Brodde et al. 2019).Alternatively to GWS, several studies have shown that it is generally possible to isolate a few SNPs within introns of single-copy genes conserved in the genus or the family of targeted species (Feau et al. 2011, Ilves andLópez-Fernández 2014) or in ultraconserved elements (UCEs) (Blaimer et al. 2015, Bossert andDanforth 2018).The method developed by Feau et al. (2011) has been successfully applied to several fungal pathogens and it also allowed us to investigate population genetic structure, gene flow and reproductive mode (Dutech et al. 2017, Tsykun et al. 2017, Dutech et al. 2020).Based on detecting a few single-copy genes in the genomes, this method has yielded less than fifty unlinked SNPs per study and it has not been automated to reduce analysis time.
Therefore, the objective of the present software was to develop an automatic bioinformatics pipeline usable for a large number of species, especially focusing on st emerging forest pest and pathogen species that are often orphan species (i.e. with no or poorly published genomic resources).The pipeline hereafter called "SNP4OrphanSpecies" is designed to be easily installed and used by biologists and based on limited genomic resources (i.e. one single shotgun whole genome sequencing) in order to provide useful genetic markers for assignments to genetic lineages, identification of the origin of invading or expanding populations and estimates of population diversity and structure.

Title: SNP4OrphanSpecies
Design description: The method to isolate single-copy genes for orphan species is based on an automated de novo genome assembly without the step of manual curation, using sequence data generated by paired-end Illumina sequencing technology.The assembly quality is checked by looking at some summary statistics (i.e.genome size, degree of assembly fragmentation, completeness of the genome).For isolating hundreds of SNPs, we focused on the single-copy genes conserved in genomes at a given taxonomic level (i.e.genus, family or order).The focus on these conserved genes allows us: 1) to control for the taxonomic assignment of the analysed genomic regions, 2) to remove duplicated genes in the genome which can produce possible false positive SNPs, 3) to isolate several SNPs generally present in the introns of these genes and 4) to yield several pools of pairs of primers for amplification of around 400 bp sequences which can be amplified together in one multiplex (100 sequences per pool).By automation of these steps, the method decreases the time of genomic analysis, while it selects DNA sequences specific to the studied species (discarding sequences due to, for example, laboratory or field DNA contamination; Ballenghien et al. 2017).Based on the first polymorphic sequences obtained from this method in Feau et al. (2011) andDutech et al. (2016), we expect that the isolated sequences with intron regions are polymorphic within species and valuable candidates for future SNPs detection.
The basic pipeline steps are illustrated in Fig. 1.The first step of the pipeline is the wholegenome de novo assembly using paired-end short reads obtained from Illumina technology.Although not tested in this study, minimum coverage of 10X is recommended for correcting sequencing error, and probably 20X minimum to obtain a correct de novo assembly (Jiang et al. 2019).This step starts with a quality analysis of the raw data using FastQC v.0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).Reads are trimmed by using a sliding window and filtered using a minimum length with the software trimmomatic v.0.39 (Bolger et al. 2014).The parameters for trimming are defined in the parameter file of the pipeline (Snakemake_Config_SNP4OrphanSpecies.yaml).Then a de novo assembly is performed using IDBA-UD v.1.1.3(Peng et al. 2010).A basic statistics report is then generated on the final assembly using Quast v.5.0.2 (Gurevich et al. 2013).For fungi, bacteria and viruses only, detection for DNA contamination can be performed by an automatic assignation of the assembled contigs, using Kaiju v.1.7.4 (Menzel et al. 2016).Additionally, for fungal species only, isolation of the internal transcribed spacer (ITS) can be performed from the de novo genome assembly with ITSx v.1.1b(Bengtsson-Palme et al. 2013).The second step of this pipeline evaluates the completeness of the assembly and identifies the genes which will be used to isolate short sequences (400 bp, hereafter called "amplicon").In this objective, BUSCO v.4.1.4 (Seppey et al. 2019) runs on the de novo assembly obtained in the previous step to identify single-copy genes conserved at the taxonomic level defined in the the configuration file (Snakemake_Config_SNP4 OrphanSpecies.yaml).The more narrowly the taxonomic information is defined, the more specific are the isolated markers.Then, only complete and not duplicate genes are kept for the definition of amplicons.Optionally, a taxonomic assignment of the selected genes can be done using Kaiju.Genes assigned to taxon other than the taxon set in the configuration file are removed from the final selection.This verification step requires large disc spaces (ideally 125 GB for the nr_euk database) and is possible only for fungi, bacteria and viruses.
The last step of the pipeline is the isolation of amplicons to be amplified in pools.For this step, the amplicons are chosen to encompass at least one intron in the sequence.For each amplicon, a pair of primers is designed using a home-made Perl script integrating the programme Primer3 (v.2.5 Koressaar and Remm 2007), with stringent parameters favouring the possibility to be amplified jointly in a single multiplex PCR.All the designed primers are subjected to BLASTn (v.2.10, Altschul et al. 1990) against the de novo genome assembly to test for the specificity of the targeted sequences.Each pair of primers for which one of the two primers was found in at least two copies in the genome, was removed.One pair of primers is finally randomly selected per BUSCO gene to amplify a maximum of physically unlinked sequences.The validated primers are dispatched in several pools for which the primer dimer formation compatibility during a multiplex DNA amplification is tested in silico, using Primer Pooler (v.1.71, Brown et al. 2017).

Technical specification
Programming language: snakemake v.6 or later and singularity v.3 or later.

Implements specification
Keeping in mind biologist users, we implemented this pipeline with Snakemake (Köster and Rahmann 2012) and Singularity (Kurtzer et al. 2017).These softwares allow us to organise the whole bioinformatics workflow within a container, with all software and libraries needed and the automatic achievement of each step of the analysis from the initial input data.This pipeline is easy to install, easy to use and can be run on all Linux machines, including highperformance clusters.The pipeline, its associated notice and parameter files can be downloaded from the Portail Data INRAE (https://doi.org/10.15454/GWKRKY).
For running the pipeline, users only must: 1) produce a paired-end Illumina whole-genome sequencing of the species of interest and 2) set parameters (i.e.taxonomy, filtering, number of amplicons), in the file "Snakemake_Config_SNP4OrphanSpecies.yaml".A README available on https://doi.org/10.15454/GWKRKYgives more details about these different steps and parameters.

Audience
We consider that the time of bioinformatics analyses to isolate and to develop new markers is seriously reduced thanks to this pipeline, easily installed on a personal computer, without the need to access the internet after this setting.Then, this method is especially dedicated to research teams, governmental agencies or organisations, which have limited human and financial resources.With the short sequences provided by this pipeline, the possibility of obtaining the first genetic information on recently emerging populations without the high cost of genome sequencing should help to identify the origin of emergence and the risk of adaptation to new ecosystems and define the best practices to manage new disease or pest species.

Pipeline assessment
We assessed the performance of the pipeline in a new de novo genome assembly of Diplodia sapinea isolate CBS117911.Diplodia sapinea or Sphaeropsis sapinea (fr.)Dicko & B. Sutton 1980 is a worldwide emergent fungal pathogen infecting many host trees, especially pine species (Brodde et al. 2019).A genomic library was constructed for this isolate using the Illumina TruSeq Nano DNA kit, following the company procedure.A total of 10,544,224 raw 150 bp paired-end reads were generated by an Illumina HiSeq3000 sequencer at the Get-Plage Genotoul facility (INRAE, France).In addition, we used this pipeline for analysing four other invasive species from different phyla, for which the genome assembly has been already published (Table 1).Erysiphe alphitoides (Griffon & Maubl.)U. Braun & S. Takam.2000, (Ascomycota) infects many host plants worldwide and was likely introduced to Europe at the beginning of the 20 century (Gross et al. 2021).Genomic resources have recently been published for this obligate biotroph species (i.e.non-culturable on axenic media), for which DNA contamination was detected in the genome assembly (Dutech et al. 2020).Armillaria ostoyae (Romagn.)Herink 1973 (Basidiomycota) is distributed throughout the Northern Hemisphere, infecting numerous conifer species, causing large loss of wood (Heinzelmann et al. 2019) and is associated with some expanding populations in planted forests, as assumed in south-western France ( Labbé et al. 2017).The whole genome sequencing, published by Sipos et al. (2017), was used for this study.Thaumetopoea pityocampa Denis & Schiffermüller 1775 (Lepidoptera) is expanding in Europe due to climatic changes and causes significant defoliation in pine plantations and human health concerns (Battisti and Larsson 2015).The genome sequencing used to assess the pipeline has been published by Gschloessl et al. (2018).Bursaphelenchus xylophilus Steiner & Buhrer 1934, a pine wood nematode, infects several pine species and was introduced from its native North America area to Asia and Europe, where it causes dramatic mortality in forests of the invaded areas (Vicente et al. 2011).The genome used for this study was published by Dayi et al. (2020).
After filtering and trimming Illumina raw reads, new de novo genome assemblies were produced by the pipeline for each species (details of each assembly are given in Table 1).These genome assemblies were strongly fragmented with small L50 and large N50 values (Table 1).As expected for contaminated DNA extraction (Dutech et al. 2020), the E. alphitoides genome assembly was one of the most fragmented genomes together with T. pityocampa.It confirmed the originally published surprisingly large estimate of the genome size for a powdery mildew species (317 Mb vs. less than 110 Mb for other published powdery mildew genome assemblies; Frantzeskakis et al. 2019).For fungal species, identifying the ITS1 sequences using the Kaiju nr_euk database confirmed that at least a part of the genome assembly may be assigned to the expected genus for each sequenced species (Suppl.material 1).The software ITSx used for this identification detected several ITS1 haplotypes in the E. alphitoides genome, which is congruent with the detection of several contigs of the genome assigned to different phyla or fungal families (Suppl.material 2).Between 98.6% (D. sapinea) and 42% (T.pityocampa) of conserved singlecopy genes listed in the Busco database were isolated from the genome assemblies (Table 2).For the analysed fungal species and using Kaiju, a variable proportion of genes was actually identified as different from the targeted genus, leading to the removal of between 70% (E.alphitoides) and 0.6% (A.ostoyae) of the initial list of single-copy genes.In the last steps of the analysis, the pipeline defined in each species between 20,991 (A.ostoyae) and 1,829 (E.alphitoides) short 400 bp sequences (i.e.amplicons), encompassing at least one intron region.The design of the primer pairs for DNA amplification for each amplicon (only one per gene) and the control for their potential duplication in the genomes yielded a final set of between 614 (E.alphitoides) and 3,426 (A.ostoyae) primer pairs (Table 2).All these primer pairs are pooled in five pools for multiple DNA amplification.Depending on th the final number of designed primer pairs, the redundancy rates of primer pairs amongst the primer pools for each species varied between 19% (A.ostoyae) and 57.4% (E.alphitoides) (Table 2).This redundancy can be manually optimised amongst pools when rates are too high.Primer Pooler was not designed to build several pools of primers simultaneously and it may be useful to sequentially remove the pairs of primers used in the first pools to build the next pools.

Perspectives for evolutionary genetic studies of non-model invasive and emergent insect and pathogen species
The analysis of genomes from different phyla suggested that the method can be used for a large number of invasive or emergent insect and pathogen species for which genetic markers are searched.Some limitations may occur for large genomes (i.e.several hundreds Mb), since the assembly, even without any curation steps, requires a minimum of computation resources.For example, the analysis of the T. pityocampa genome for which the size was estimated to be more than 500 Mb, generated 268 GB (only 134 GB for the trimmed fastaQ files) and took more than 18 hrs on a Linux cluster using 20 CPUs.For such a large genome, it could be interesting to assess the method with a reduced wholegenome sequencing (i.e. a lower sequencing coverage or randomly amplified genome).Based on our results obtained from the highly fragmented and contaminated E. alphitoides genome assembly, we speculate that several hundreds of the conserved single-copy genes can be generally isolated by the present method, even from a low-quality or partial genome assembly.Another limitation would be the use of contaminated genome assemblies which may be quite frequent in whole-genome sequencing (Ballenghien et al. 2017), especially for not easily cultivated micro-organisms.The smallest number of validated sequences was obtained for E. alphitoides for which such contamination was assumed (Dutech et al. 2020) and confirmed in this study.When it involves genetically related species (for example, between fungal species), such contamination may be difficult to identify and remove from the genomic data, because of the genetic similarity amongst the sequences.Using the Kaiju nr_euk database to assign the identified genes is then useful to detect this DNA contamination and discard the sequences with the incorrect taxonomic assignment.For the identification of conserved single-copy genes, a taxonomic determination is a sensitive parameter.An incorrect taxonomic identification at the genus level or higher of the emergent species may lead us to discard several BUSCO genes present in the genome assembly, but too different from the chosen reference genome.Conversely, if BUSCO genes are searched at a taxonomical level that includes the emergent species and a contaminating related species, some BUSCO genes non-specific to the emergent species may be selected despite the control performed by Kaiju.To avoid the selection of these unspecific genes, the isolated BUSCO genes can be evaluated again before the step of primers design.This evaluation can be done using the fasta files of the isolated genes (available in the outputs of the pipeline) and a phylogenetic reconstruction for each gene by extracting BUSCO genes of the taxonomical group studied from a database (for example, Mycocosm; Grigoriev et al. 2013) and the use of one of the numerous published phylogenetic software (see, for example, Dereeper et al. 2008).
Two strategies can then be developed after the isolation of these amplicons.The first one would be sequencing hundreds of samples using one of the designed pools and nextgeneration sequencers.The combination of SNPs identified in each amplicon can be treated as microhaplotypes (i.e.multi-SNP loci), potentially giving more power for population genetic analyses than analysing independent bi-allelic SNP loci (Baetscher et al. 2017).Microhaplotypes defined on short sequences, as those isolated by this pipeline, may be well adapted to identify fine population genetic structures, especially for field samples for which the quality of DNA extraction is often poor (Morin et al. 2021) Figure 1.

Table 1 .
Description of the genome assemblies obtained for the five tested species.Data into brackets are from the original publication.

Table 2 .
Summary of the genes and primers isolated by SNP4Orphanspecies pipeline for the five tested species.* one per gene and not duplicated in the genome.
(Altermann et al. 2014)0)les to populations and estimates of population admixture(McKinney et  al. 2017).In a preliminary PCR test on 47 samples of D. sapinea, the majority of the 100 pairs of primers tested in multiplex allowed us to obtain, on average, more than ten sequencing reads per sample and per amplicon.This last result is encouraging new tests on other pools of primers and on other orphan species.A second option would be the genotyping of selected SNPs (one per locus), combined in different pools (i.e.plex), on several hundreds of samples, using, for example, Mass-ARRAY technology by Sequenom(Chancerel et al. 2013, Dutech et al. 2017).Resequencing about ten genomes or partial genomes can first identify variations within each amplicon and then, several SNP-plex can be designed for genotyping hundreds of samples(Dutech et al. 2016).Although the haplotypic information is missed in this case and this strategy of genotyping may introduce ascertainment bias due to the choice of SNPs(Albrechtsen et al. 2010), it allows for genetically characterising many samples from different geographical regions, a central objective when the genetic origin of emergent populations is investigated.We are aware that SNPs isolated from conserved genes may be under selection.It may seriously affect the inferences of demographic dynamics of populations and should be carefully considered if historical scenarios are tested(Beichman et al. 2018).A first study having selected these conserved genes in Armillaria sp., detected two out of 20 tested(Dutech et al. 2016).Notwithstanding this potential bias, we argue that for the first estimation of population genetic structures or identification of the genetic origin of the emergent populations, the methodology presented here remains efficient.No significant difference in the estimates of genetic structures has been observed when comparing SSR and SNP loci isolated using this method in several European populations of A. cepistipes(Tsykun et al. 2017).Basic statistics can also identify loci under selection ( Vitalis 2003).Furthermore, because these loci are chosen in conserved genes at a given taxonomic level, the designed primers can theoretically amplify every species of this level.This robustness for DNA amplification is useful for investigating genetic differentiation amongst closely-related species and their hybrids(Altermann et al. 2014).Some loci, such as SSR loci, are sometimes difficult to transfer even within the same genus(Dutech et al. 2007) and may produce many null alleles and missing data.By contrast, the short sequences obtained by this pipeline would be especially relevant for studying genetic diversity and differentiation amongst closely-related species, because of the expected good repeatability of DNA amplification, the standardisation of genotyping amongst species, experiments or laboratories, as well as the assessment of the sequence orthology within genomes.They may be an efficient alternative to WGS methods for studying genetically related species or comparing or combining, genetic studies produced by different studies or laboratories(see Harvey et al. 2016 for details).Lesur I, Noirot C, Klopp C, Ehrenmann F, Boury C, Provost GL, Label P, Lalanne C, Léger V, Salin F, Gion J, Plomion C (2013) High-density linkage mapping in a pine tree reveals a genomic region associated with inbreeding depression and provides clues to the extent and distribution of meiotic recombination.BMC Biology 11: 50.https://doi.org/10.1186/1741-7007-11-50• Dayi M, Sun S, Maeda Y, Tanaka R, Yoshida A, Tsai IJ, Kikuchi T (2020) Nearly complete genome assembly of the pinewood nematode Bursaphelenchus xylophilus strain Ka4C1.Microbiology Resource Announcements 9 (42): e01002-20.https:// doi.org/10.1128/mra.01002-20• Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard J-F, Guindon S, Lefort V, Lescot M, Claverie J-M, Gascuel O (2008) Phylogeny.fr:robust phylogenetic analysis for the non-specialist.Nucleic Acids Research 36: W465-W469.https://doi.org/10.1093/nar/gkn180 • Diagne C, Leroy B, Vaissière A, Gozlan R, Roiz D, Jarić I, Salles J, Bradshaw CA, Courchamp F (2021) High and rising economic costs of biological invasions worldwide.Nature 592 (7855): 571-576.https://doi.org/10.1038/s41586-021-03405-6• Dutech C, Enjalbert J, Fournier E, Delmotte F, Barrès B, Carlier J, Tharreau D, Giraud T (2007) Challenges of microsatellite isolation in fungi.Fungal Genetics and Biology 44 (10): 933-949.https://doi.org/10.1016/j.fgb.2007.05.003 • Dutech C, Prospero S, Heinzelmann R, Fabreguettes O, Feau N (2016) Rapid identification of polymorphic sequences in non-model fungal species: the PHYLORPH method tested in Armillaria species.Forest Pathology 46 (4): 298-308.https://doi.org/10.1111/efp.12256• Dutech C, Labbé F, Capdevielle X, Lung-Escarmant B (2017) Genetic analysis reveals efficient sexual spore dispersal at a fine spatial scale in Armillaria ostoyae, the causal agent of root-rot disease in conifers.Fungal Biology 121: 550-560.https://doi.org/10.1016/j.funbio.2017.03.001 • Dutech C, Feau N, Lesur I, Ehrenmann F, Letellier T, Li B, Mouden C, Guichoux E, Desprez-Loustau ML, Gross A (2020) An easy and robust method for isolation and validation of single-nucleotide polymorphic markers from a first Erysiphe alphitoides draft genome.Mycological Progress 19 (6): 615-628.https://doi.org/10.1007/s11557-020-01580-w • Elshire R, Glaubitz J, Sun Q, Poland J, Kawamoto K, Buckler E, Mitchell S (2011) A robust, simple Genotyping-by-Sequencing (GBS) approach for high diversity species.PLOS One 6 (5).https://doi.org/10.1371/journal.pone.0019379