Plant diversity in sedimentary DNA obtained from high-latitude (Siberia) and high-elevation lakes (China)

Abstract
Background
Plant diversity in the Arctic and at high altitudes strongly depends on and responds to climatic and environmental variability and is currently strongly affected by climate warming. Therefore, past changes in plant diversity in the high Arctic and high-altitude regions are used to infer climatic and environmental changes through time and to inform future predictions. Sedimentary DNA (sedDNA) is an established proxy for the detection of local plant diversity in lake sediments, but the relationships between environmental conditions and preservation of the plant sedDNA proxy are still far from being fully understood. Studying modern relationships between environmental conditions and plant sedDNA will improve our understanding of the conditions under which sedDNA is well preserved, helping to a.) evaluate suitable localities for sedDNA approaches, b.) provide analogues for preservation conditions and c.) conduct reconstruction of plant diversity and climate change. This study investigates modern plant diversity by applying a plant-specific metabarcoding approach to sedimentary DNA of surface sediment samples from 262 lake localities covering a large geographical, climatic and ecological gradient. Latitude ranges between 25°N and 73°N and longitude between 81°E and 161°E, including lowland lakes and elevated lakes up to 5168 m a.s.l. Further, our sampling localities cover a climatic gradient ranging in mean annual temperature between -15°C and +18°C and in mean annual precipitation between 36 and 935 mm. The localities in Siberia span a large vegetational gradient including tundra, open woodland and boreal forest. Lake localities in China include alpine meadow, shrub, forest and steppe and also cultivated areas. The assessment of plant diversity in the underlying dataset was conducted by a specific plant metabarcoding approach.
New information
We provide a large dataset of genetic plant diversity retrieved from surface sedimentary DNA from lakes in Siberia and China spanning a large environmental gradient. Our dataset encompasses sedDNA sequence data of 259 surface lake sediments and three soil samples originating from Siberian and Chinese lakes. We used the established chloroplast trnL P6 loop marker for plant diversity assessment. The merged, filtered and assigned dataset comprises 15,692,944 reads, resulting in 623 unique plant DNA sequence types which have a 100% match to either the EMBL or the specific Arctic plant reference database. The underlying dataset includes a taxonomic list of identified plants and results from PCR replicates, as well as extraction blanks (BLANKs) and PCR negative controls (NTCs), which were run along with the investigated lake samples. This collection of plant metabarcoding data from modern lake sediments is ongoing and additional data will be released in the future.

Introduction
Arctic and high-elevation ecosystems are very sensitive to natural and anthropogenically induced climate variability. Anthropogenic warming and changes in land use are considered to have shifted vegetation composition and plant richness in these areas during the last centuries and decades (Cui and Graf 2009, Xu and Liu 2007, Pearson et al. 2013). Still, there is a lack of understanding of how environmental variability has shaped local plant diversity in these remote areas. Modern plant sedDNA data can serve as a training set for reconstructing past vegetation types in alpine and arctic ecosystems, where pollen-based reconstructions can be strongly biased by poor modern pollen analogues (Ortu 2010). Over the last decade, plant sedDNA has been applied to lake surface-sediment samples to identify modern plant communities (Niemeyer et al. 2017) and to lake sediment cores to reconstruct local plant diversity changes in the past (Clarke et al. 2019, Giguet-Covex et al. 2019, Parducci et al. 2017). Typically, plant sedDNA is retrieved by a metabarcoding approach that uses the g/h primers, which amplify a short fragment of the trnL P6 loop of the plant chloroplast genome (Taberlet et al. 2007). The P6 loop markers have been widely applied for the identification of plants in lake sediment samples and other environmental samples, such as soils, permafrost (Willerslev et al. 2014, Zimmermann et al. 2017, Zimmermann et al. 2017) and mock samples (Pornon et al. 2016, Lamb et al. 2016). For lake sediments, the use of P6 loop markers allows an identification of plants to a low taxonomic level and the recognition of more taxa from the local lake catchment (Niemeyer et al. 2017). As plant diversity obtained from lake sedDNA reflects local plant communities, it is particularly suitable for reconstructing past local plant diversity. It will help to reconstruct local changes of the environment and improve future predictions.
Until now, we have only a limited understanding of how environmental conditions influence the preservation of the plant sedDNA proxy (Giguet-Covex et al. 2019); investigating this requires larger sedDNA datasets spanning a wide range of environmental conditions. The modern dataset presented here covers a wide range of environmental conditions and provides an openly accessible data resource that can be used to address the effect of the environment on the preservation of the sedDNA proxy, resulting in improved site selection for plant sedDNA studies, analogues for the effects of DNA preservation and refined reconstruction of past plant communities and environmental conditions.

Project description
Title: Plant diversity from sedimentary DNA in Siberian and Chinese lakes
Study area description: Lakes are located in the Siberian Arctic and at low to high elevations in Northern China and on the Tibetan Plateau. Lake sites include large lakes, which were formed during past glacial periods, and smaller lakes formed by thermokarst.

Design description:
The lake sites in Siberia were accessed during field trips conducted by the Alfred Wegener Institute in the years 2005 to 2016. The lake sites in China were visited from 2003 to 2018. This data-set has been established to correlate modern genetic plant diversity with modern vegetation mappings and climate and environmental data. The modern sedDNA data will be used to a.) evaluate suitable localities for sedDNA approaches, b.) provide analogues for preservation conditions and c.) conduct reconstruction of plant diversity and climate change mainly across glacial/interglacial phases in the Late Pleistocene-Holocene and during recent environmental change in the Anthropocene.

Sampling description:
Between 2003 and 2018, several expeditions were undertaken during which surface samples were collected from 262 localities in Siberia and China. Samples were taken with a bottom sampler from a boat, mostly in the centre of the lakes (259 samples), or from soil on a lake's shoreline (three samples).

Quality control:
The first centimetre of the surface sediment (259 samples) and shoreline soil (three samples) was carefully sampled by using gloves and single-use plastic spoons. Samples were transferred into sterile Whirl-Pak® or sterilised Nalgene tubes and were kept dark and cool (+4°C) until further treatment in the laboratories for environmental and ancient DNA at Alfred Wegener Institute Helmholtz Centre for Polar and Marine Research.

Geographic coverage
Description: The lake localities cover a large geographical, climatic and ecological gradient, including elevated lakes from 0 up to 5168 m a.s.l. Latitude ranges between 25°N and 73°N and longitude between 81°E and 161°E. The climatic gradient ranges in mean annual temperature between -15°C and +18°C and in mean annual precipitation between 36 and 935 mm. The localities in Siberia span a large vegetational gradient including tundra, open woodland and boreal forest. Lake localities in China include alpine meadow, shrub, forest and steppe and also cultivated areas.

Taxonomic coverage
Description: The retrieved DNA metabarcoding data provide 623 unique plant sequences which mainly include terrestrial and aquatic vascular plants and a few mosses. Plant DNA sequences are identified to different taxonomic levels. About 78% of sequence types are assigned to species level, 13% to genus level and 9% to higher taxonomic levels (subfamily, family or order).

Figure. Sampling localities in Siberia and China. Lake surface-sediments or soils are indicated with a circle or triangle, respectively.

Number of data sets: 2
Data set name: Environmental data of lake localities
Data format: Table
Description: Compilation of environmental data for the 262 investigated localities, which include additional intra-lake localities sampled within three large lakes, namely 16-KP-01-L02 (nine samples), 16-KP-03-L10 (five samples) and 16-KP-04-L19 (four samples) (Suppl. material 1). The table includes the geographic coordinates, elevation, type of sample material, geographic region, water depth (at which samples were taken), pH, water conductivity, mean annual precipitation (MAP), mean annual temperature (MAT), July and January mean temperature, vegetation type and dominant plant taxa ('dominant_Taxon1' indicates the most frequent taxon; 'Taxon2-10' are taxa listed in descending order by their area of distribution in modern vegetation). If no dominant taxon is listed, the surrounding vegetation is too diverse to determine a dominant taxon. 'NA': data not available.

Column label Column description
No.
Running number of items in the

Data set name: Taxa list of plant species detected
Description: Taxa list of identified plant sequences with either a 100% match to the embl138 or arctborbryo taxonomic database (Suppl. material 2). The table contains the 'ID' -unique identifier for a cluster on the sequencing flow cell, 'best_identity_arctborbryo' -best identity with the arctborbryo database, 'best_identity_embl138' -best identity with the embl138 database, 'scientific_name_by_arctborbryo' -best taxa name with the arctborbryo, 'scientific_name_by_embl138' -best taxa name with the embl138, 'DNA sequence' -DNA sequence of the unique sequence type, 'best_match_in_arctborbryo' -accession number of the best reference entry in the arctborbryo database, 'best_match_in_embl138' -accession number of the best reference entry in the embl138 database, 'count' -total count of sequence type in the total sequencing project, 'occurrence_in_PCR' -total number of occurrences in the PCR samples.

Column label Column description
No.

Assessment of environmental data
Geographic coordinates, elevation, lake water depth, pH and electrical conductivity were measured with appropriate devices during the different field surveys. Geographic coordinates and elevation were measured with Garmin etrex devices. Lake water depth was measured with a handheld echo sounder (Hondex PS-7) and pH and electrical conductivity were measured with a WTW multi-340i device. Annual mean temperature, mean temperature in July and January and mean annual precipitation were downloaded from WorldClim 2 (www.worldclim.org) and are based on the average climate data for the years 1970-2000 at a spatial resolution of 30 arc-seconds (ca. 1 km²). The site-specific climate data were interpolated to the location area by using the R package raster (Hijmans 2020). The radius of the site-specific ring-buffer was measured from the sampling point to the furthest lake shoreline. For some sampling points from the relatively small lakes (radius < 50 m), this buffer was set to 100 m. Dominant plant taxa were extracted and categorised as follows: 'dominant_Taxon1' indicates the most frequent taxon; 'Taxon2-10' are taxa listed in descending order by their area of distribution in modern vegetation. If the extracted vegetation was identified as being too diverse, no dominant taxon ('dominant_Taxon1') could be identified. All environmental data are summarised in Suppl. material 1.
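The buffer rule described above can be illustrated with a minimal Python sketch. It is not the R workflow used for this dataset: the function names, the representation of the shoreline as a list of coordinate pairs and the spherical distance approximation are all assumptions made for illustration. The sketch also applies the 100 m value to every lake with a radius below 50 m, whereas the text states that this was done for some sampling points only.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def buffer_radius_m(sample_point, shoreline_points, lake_radius_m,
                    small_lake_radius_m=50.0, small_lake_buffer_m=100.0):
    """Return the buffer radius for one sampling point.

    sample_point: (lat, lon) of the sampling point (hypothetical input format).
    shoreline_points: list of (lat, lon) pairs tracing the lake shoreline.
    Small lakes get a fixed buffer; otherwise the radius is the distance
    from the sampling point to the furthest shoreline point.
    """
    if lake_radius_m < small_lake_radius_m:
        return small_lake_buffer_m
    lat, lon = sample_point
    return max(haversine_m(lat, lon, s_lat, s_lon) for s_lat, s_lon in shoreline_points)


# Hypothetical example: a sampling point with two shoreline points to the east
shore = [(0.0, 0.001), (0.0, 0.002)]
large_lake_buffer = buffer_radius_m((0.0, 0.0), shore, lake_radius_m=200.0)
small_lake_buffer = buffer_radius_m((0.0, 0.0), shore, lake_radius_m=30.0)
```

In practice the shoreline would come from a lake polygon and the buffer would then be used to extract climate and vegetation values around the site.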

DNA extraction and NGS sequencing
DNA extraction of surface sediment samples was carried out in the molecular genetic laboratories for environmental genetics at Alfred Wegener Institute Helmholtz Centre for Polar and Marine Research. The genetic workflow for modern surface sediment samples includes DNA extraction, amplification, purification and pooling, DNA sequencing and bioinformatic analyses. For DNA extraction, about 3-5 g of wet sediment was used per sample, except for samples of 13-TY, for which about 8-10 g of wet sediment was used. All DNA extractions were carried out by using the DNeasy PowerMax Soil Kit (Qiagen, Germany) and PowerMax Soil DNA Isolation kit (MoBio Laboratories, Inc., USA) following the manufacturer's instructions with the following modifications: 2 mg/ml proteinase K, 0.5 ml dithiothreitol (DTT) and 1.2 ml C1 buffer were added to each sediment sample in the Power Bead solution, which was then vortexed for 10 min and incubated overnight in a rotating shaker at 56 °C. The final elution step was carried out with 1.6 ml C6 buffer. Each DNA extraction batch contained a maximum of nine samples and one extraction blank. Precautions were taken in all experimental steps to avoid potential contamination (Champlot et al. 2010, Epp et al. 2019). We ran separate PCR set-ups for each of the extraction batches, including a PCR negative control (NTC). We produced at least two PCR replicates of each sample, which were set up on different days. For a few samples in the AGAK-5 run, we produced up to six PCR replicates. PCR was performed using the trnL g and h universal plant primers for the short and variable P6 loop region of the chloroplast trnL (UAA) intron (Taberlet et al. 2007). For internal barcoding of the samples and PCR replicates, we used modified g and h primers to which a unique 8 bp barcode and NNN were added. After evaluation of PCR results, PCR products, including PCR NTCs, were purified with the MinElute Kit (Qiagen, Germany).
The DNA concentration of the purified products was measured with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Germany) and the products were then pooled equimolarly. In total, four pools were generated, each of which was sequenced in an independent sequencing run (Table 1) by an external sequencing service (Fasteris SA, Switzerland). The pool including samples from 13-TY was prepared differently from the other PCR pools: here, PCR replicates were performed with the same barcode combination and replicates were pooled before the final equimolar pooling of all PCR products. The data from the four runs were later merged into a single dataset for the final interpretation.
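As a worked illustration of equimolar pooling, the following Python sketch converts fluorometric mass concentrations into molar concentrations and derives the volume of each PCR product needed to contribute the same molar amount. It is a generic calculation, not the protocol used here: the average mass of 660 g/mol per base pair is a common approximation for double-stranded DNA, and the product names, amplicon lengths and target amount are hypothetical.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # approximate mass of one double-stranded base pair


def concentration_nM(conc_ng_per_ul, amplicon_length_bp):
    """Convert a dsDNA mass concentration (ng/µl) to a molar concentration (nM)."""
    return conc_ng_per_ul / (AVG_BP_MASS_G_PER_MOL * amplicon_length_bp) * 1e6


def pooling_volumes_ul(products, target_fmol):
    """Volume (µl) of each product that contributes target_fmol to the pool.

    products: {name: (conc_ng_per_ul, amplicon_length_bp)} (hypothetical layout).
    """
    volumes = {}
    for name, (conc, length) in products.items():
        molar = concentration_nM(conc, length)  # 1 nM equals 1 fmol/µl
        volumes[name] = target_fmol / molar
    return volumes


# Hypothetical purified PCR products: concentration in ng/µl, length in bp
products = {
    "sample_rep1": (6.6, 100),
    "sample_rep2": (13.2, 100),
}
volumes = pooling_volumes_ul(products, target_fmol=50.0)
```

With these made-up values, the product with twice the concentration contributes half the volume, so both add the same number of molecules to the pool.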

Bioinformatic analysis of the genetic data
We analysed a total of 262 lake samples in 688 PCRs, which divide into 553 sample PCRs and 135 PCRs of extraction blanks and NTCs. The analysis of the resulting sequence data and the taxonomic assignments were carried out with the OBITools package (Boyer et al. 2015). Firstly, we used illuminapairedend to align the raw forward and reverse reads and kept only the successfully aligned reads by using obigrep with the option -p 'mode!="joined"', which discards read pairs that could not be aligned and were merely concatenated. Then, we demultiplexed the merged sequences by applying ngsfilter, which sorts the reads into samples according to the given unique combination of internal barcodes and the primer sequences. After demultiplexing, 674 PCR samples were retrieved, while no reads were obtained from 14 PCR samples. Subsequently, we used obigrep to remove sequences shorter than 10 bp and obiuniq to merge identical reads (coverage of 100% required) while keeping the read count for each sample in which the read was detected. Further, we used obiclean to remove probable PCR and sequencing errors. For taxonomic assignment, we used ecotag with two different databases. The first is a sequence reference database generated from the EMBL standard sequences (http://www.ebi.ac.uk/ena), release 138. To use the EMBL database with ecotag, it had to be transformed into an ecoPCR database. Therefore, we first ran a simulated in silico PCR (Ficetola et al. 2010) with the g/h primers on the EMBL Standard Nucleotide Database (release 138), allowing five mismatches between the primers and target sequences. The resulting ecoPCR output was filtered by obigrep to retain only sequences with a taxonomic assignment at the species, genus and family levels. Then, obiuniq was used to de-replicate redundant sequences and obiconvert was applied to transform the ecoPCR output into an ecoPCR database, which is finally used by ecotag.
The second database is a sequence reference database for Arctic and Boreal vascular plants and bryophytes (arctborbryo database), containing 1664 vascular plant taxa and 486 bryophyte species (Soininen et al. 2015, Sønstebø et al. 2010, Willerslev et al. 2014), which is already in the ecoPCR database format. For further analyses, we kept only those sequence reads with a 100% match to one or the other reference database. The total read counts for each sample and its replicate(s) are plotted in Fig. 2.
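Two of the filtering steps described above, the length filter with de-replication and the retention of sequence types with a 100% match, can be sketched in plain Python. This is an illustration of the logic only and does not call OBITools: the function names, the example reads and the example records are hypothetical, and identities are assumed to be stored as fractions (1.0 for 100%) under the column names used in the taxa list.

```python
from collections import defaultdict


def dereplicate(reads, min_length=10):
    """Merge identical sequences while keeping a read count per sample.

    reads: iterable of (sample_id, sequence) pairs after demultiplexing.
    Sequences shorter than min_length bp are dropped first.
    Returns {sequence: {sample_id: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sample_id, sequence in reads:
        if len(sequence) < min_length:
            continue
        counts[sequence][sample_id] += 1
    return {seq: dict(per_sample) for seq, per_sample in counts.items()}


def has_exact_match(record):
    """True if the sequence type matches either reference database at 100%.

    Assumes identities are fractions (1.0 = 100%); missing values are None.
    """
    identities = (record.get("best_identity_embl138"),
                  record.get("best_identity_arctborbryo"))
    return any(value is not None and value >= 1.0 for value in identities)


# Hypothetical demultiplexed reads: (sample_id, sequence)
reads = [
    ("lakeA_rep1", "atcctgttttcccaaaac"),
    ("lakeA_rep1", "atcctgttttcccaaaac"),
    ("lakeA_rep2", "atcctgttttcccaaaac"),
    ("lakeA_rep2", "atcc"),  # shorter than 10 bp, removed
]
table = dereplicate(reads)

# Hypothetical assignment records using the column names of the taxa list
records = [
    {"ID": "seq1", "best_identity_embl138": 1.0, "best_identity_arctborbryo": 0.96},
    {"ID": "seq2", "best_identity_embl138": 0.98, "best_identity_arctborbryo": None},
]
kept = [r["ID"] for r in records if has_exact_match(r)]
```

Keeping the counts per sample rather than a single total is what later allows PCR replicates, BLANKs and NTCs to be compared for each sequence type.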

Table 1. Summary of NGS sequencing runs.
Sample number is equivalent to number of pooled PCR products, which also include PCR replicates of samples and corresponding BLANKs and NTCs.
Additional samples were pooled to the run, but do not belong to this project.
Sample replicates were pooled prior to final pooling of all PCR products.

General results of genetic data analyses
After merging raw paired-end reads and demultiplexing according to the internal barcodes, two sample PCRs from lakes 16-KP-02-L08 and 16-KP-04-L22, as well as seven BLANKs and five NTCs, yielded no reads and were discarded from the dataset. After taxonomic assignment of the remaining 674 PCRs, we identified a total of 15,754,779 reads which had a 100% match to either the EMBL or arctborbryo database. Of these reads, 99.6% (15,692,944 reads) belonged to 621 unique plant sequence types, while 0.39% (61,835 reads) were assigned to non-plant taxa, including bacteria, algae and higher eukaryotic taxa (in total 72 unique sequence types). Further, we identified 340,346 reads (2.1% of the total dataset) in extraction blanks and NTCs, whereof 38.7% (23,975 reads) of reads in the BLANKs and NTCs were of non-plant origin. Amongst the samples, excluding BLANKs and NTCs, we found large differences in read counts, which range between 1 and 718,279.

Figure 2. Compilation of total read counts for each PCR sample. As PCR replicates of 13-TY samples were not sequenced separately, but pooled together, only one total read count per sample is shown.
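The bookkeeping behind these figures can be re-derived with a few lines of Python. The sketch uses only numbers stated in the text; the variable names are illustrative. Subtracting the non-plant reads from the total gives the plant read count reported for the merged dataset in the abstract (15,692,944).

```python
# PCR bookkeeping, as stated in the text
sample_pcrs = 553
control_pcrs = 135            # extraction blanks and NTCs
pcrs_without_reads = 2 + 7 + 5  # sample PCRs + BLANKs + NTCs that yielded no reads

total_pcrs = sample_pcrs + control_pcrs
retained_pcrs = total_pcrs - pcrs_without_reads

# Read bookkeeping for reads with a 100% match to either database
total_reads = 15_754_779
non_plant_reads = 61_835
plant_reads = total_reads - non_plant_reads

plant_share = 100 * plant_reads / total_reads
non_plant_share = 100 * non_plant_reads / total_reads

print(f"PCRs: {total_pcrs} in total, {retained_pcrs} retained")
print(f"Plant reads: {plant_reads:,} ({plant_share:.1f}%)")
print(f"Non-plant reads: {non_plant_reads:,} ({non_plant_share:.2f}%)")
```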