Environmental DNA-based biodiversity profiling along the Houdong River in north-eastern Taiwan

Abstract
Background
This paper describes two datasets: species occurrences determined by environmental DNA (eDNA) metabarcoding, and their associated DNA sequences, both originating from a research project carried out along the Houdong River (猴洞坑), Jiaoxi Township, Yilan, Taiwan. The Houdong River begins at an elevation of 860 m and flows for approximately 9 km before it empties into the Pacific Ocean. Meandering through mountains, hills, plains and alluvial valleys, this short river system is representative of the fluvial systems in Taiwan. The primary objective of this study was to determine eukaryotic species occurrences in the riverine ecosystem using eDNA analysis. The second goal was, based on the current dataset, to establish a metabarcoding eDNA data template that will be useful and replicable for all users, particularly the Taiwan community. The species occurrence data are accessible at the Global Biodiversity Information Facility (GBIF) portal and the associated DNA sequences have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI. A total of 12 water samples from the study yielded an average of 1.5 million reads. The subsequent species identification from the collected samples resulted in the classification of 432 Operational Taxonomic Units (OTUs) out of a total of 2,734. Furthermore, a total of 1,356 occurrences with taxon matches in GBIF were documented (excluding 4,941 incertae sedis, accessed 05-12-2023). These data will be of substantial importance for future species and habitat monitoring within this short river, such as the assessment of biodiversity patterns across different elevations, zonations and time periods and their correlation with water quality, land uses and anthropogenic activities. Further, these datasets will be of importance for regional ecological studies, in particular of the freshwater ecosystem and its status under current global change scenarios.
New information
The datasets are the first description of species diversity in the Houdong River system, using either eDNA or traditional monitoring processes.


Introduction
Environmental DNA (eDNA) metabarcoding is an emerging tool that can provide an accurate and comprehensive representation of biotic communities (Beng and Corlett 2020, Carvalho et al. 2021, Doi et al. 2021) by identifying multiple lineages from a single environmental sample (Taberlet et al. 2012). While the earliest uses of eDNA specifically focused on RNA detection of microbial communities (Ogram et al. 1987), in the last decade, eDNA has grown in scope to include macrobial communities and, as a result, has become an important tool in biodiversity assessments, biomonitoring and identifying temporal shifts in community assemblages of both terrestrial and aquatic ecosystems (Bista et al. 2017, Jeunen et al. 2019, West et al. 2020, Pawlowski et al. 2020, Burian et al. 2021, Gregorič et al. 2022).
Ecosystems are replete with the genetic material of both resident and transient species. This genetic material includes organismal and extra-organismal DNA from microorganisms and macroorganisms in the form of faeces, shed skin or hair, carcasses and living bodies (Stewart 2019). Using genomic approaches to detect these 'genetic footprints' of organism life from environmental samples, for example, water, snow, soil, air or leaf swabs (Stewart 2019), provides an opportunity to increase the detection rate (Goldberg et al. 2016) and the temporal resolution (Ogram et al. 1987, Pawlowski et al. 2020, Taberlet et al. 2012, Alexander et al. 2023) of biological studies. Notably, eDNA analyses surpass conventional methods in terms of precision and resolution (e.g. Nakagawa et al. (2018), Rourke et al. (2021)). Furthermore, they are more efficient in both cost and time than conventional biological monitoring approaches. These factors are crucial in advancing long-term ecological research on a global scale (Evans et al. 2017, Jerde 2019, Larson et al. 2020, Bruel and White 2021).
The application of eDNA for species composition assessments is flexible depending on the organisms of interest and may function as a general or targeted instrument (e.g. Thomas et al. (2019), Beng and Corlett (2020), Burian et al. (2021), Carvalho et al. (2021), Alexander et al. (2023)). Targeted analyses aim to identify the presence or absence of a specific focal species (i.e. "is the targeted species here"), while general analyses aim to identify community composition (i.e. "what species are here"). For both aims, by relying on genetic footprints rather than direct observation, eDNA can be superior to traditional methods by allowing for higher resolution of community assemblage (e.g. Alexander et al. (2022)); the enhanced detection of rare, migratory, elusive and cryptic species (e.g. Barnes et al. (2014), Beng and Corlett (2020), Rourke et al. (2021), Shen et al. (2022)); and the detection of community shifts (e.g. DiBattista et al. (2020)) and small-scale spatial assemblage changeovers (e.g. Jeunen et al. (2019)). Environmental sampling, followed by eDNA analyses, provides a powerful approach for building a more comprehensive picture of the community, which is crucial for researchers.
Beyond these benefits, ecosystem studies that use eDNA also call for the collected information to be exchanged. For instance, there is undeniable value in sharing eDNA data in accordance with the FAIR (Findable, Accessible, Interoperable and Reusable) principles, thereby enriching our understanding of global biodiversity. The efforts of international organisations committed to advancing and promoting instruments for the exchange of eDNA data are noticeable in the discussions that commenced at the TDWG conference (Suominen et al. 2023).
Hence, in this paper, we describe two datasets: (1) a species occurrence dataset derived from the eDNA analysis of surface water from the Houdong River in north-eastern Taiwan and (2) their associated DNA sequences.
We aimed to determine community composition and diversity changes along a 6.5 km stretch of the river as it passes from headwaters (primary subtropical forest) through residential areas, aquaculture farms and agricultural fields (e.g. rice farms), before reaching the estuary/river mouth. This is the first known freshwater eDNA dataset originating from a representative turbulent river in Taiwan and will be important as baseline data for further studies and environmental monitoring of this ecosystem. Given the limited experience with open DNA data within the Taiwan community, the study collaborates with the Taiwan Biodiversity Information Facility (TaiBIF) to establish a data template for the eDNA open data workflow. TaiBIF is amongst the most active data hosting centres and nodes of the Global Biodiversity Information Facility (GBIF). We expect that, through this collaboration, we can promote the dissemination and use of eDNA and DNA metabarcoding datasets from the Taiwan community.

Sampling methods
Description: This was a one-time sampling event of water samples from four sites along the Houdong River in Jiaoxi Township, Yilan County, Taiwan (Fig. 1). The Houdong River is a popular tourist destination running through Jiaoxi Township. The river system originates east of the Sidu Mountains and flows through primary and secondary forests, agricultural lands (rice), aquaculture farms and developed areas (light industrial and residential) until it eventually drains into the Pacific Ocean. Water samples were used for eDNA analysis and measurement of in-situ water quality.
Sampling description: Water samples for eDNA were collected from near the surface of the river. Before the collection, the water containers were rinsed with the local water at each sampling site. Approximately 3 litres of water were collected for eDNA analysis. Some 200 ml of the collected water was used for water quality measurements. Water quality measurements were made using multiple hand-held probes on-site at each sampling site. Temperature, pH and dissolved oxygen were measured with a multi-parameter meter (Multiline® Multi 3620 IDS, WTW, Weilheim, Germany) equipped with an IDS pH electrode (SenTix 940, WTW) and an optical IDS dissolved oxygen sensor (FDO® 925, WTW). Turbidity was measured with a turbidity meter (TUB-430, EZDO, Taiwan). Salinity was determined with a salinity refractometer (2491 MASTER-S/Milla Salinity Refractometer, Atago, Japan). Three replicate measurements of each parameter were taken. Water samples were transported to the Marine Research Station (MRS, Yilan County, Taiwan) of the Institute of Cellular and Organismic Biology for sample filtration. The water was first filtered through a 75 µm pore size sieve to eliminate larger particles. Afterwards, a 1 litre water sample from each site was filtered through a 0.22 µm filter, with the sample retained on top of the filter membrane, under vacuum compression (PC651-0024, GeneDireX, USA). The filter membranes were placed in sterile Petri dishes and stored at -80°C until DNA extraction.
Figure 1. The four water sampling locations along the Houdong riverine system in Yilan County, Taiwan.
Step description: Wet lab process. DNA was extracted at the Biodiversity Research Center (Academia Sinica, Taipei, Taiwan). Each filtered membrane was cut into quarters. Three of the four pieces of filtered membrane were used in the study as three experimental replicates. The final quarter was saved as the sample backup. DNA from each quarter membrane piece was extracted using the Presto™ Stool DNA Extraction Kit (STLD100, Geneaid Biotech Ltd., Taiwan) following the manufacturer's instructions (Instruction Manual Ver. 10.21.17). The quality and quantity of the extracted DNA were assessed using a Nanodrop 2000 (Thermo Fisher Scientific Inc., USA) and the Qubit 4 dsDNA High Sensitivity Assay Kit (Thermo Fisher Scientific Inc., USA).
The MinibarF1 (5'TCCACTAATCACAARGATATTGGTAC) and MinibarR1 (5'GAAAATCATAATGAAGGCATGAGC) primers designed by Meusnier et al. (2008) were used to amplify the 5' region (ca. 120-150 bp) of the mitochondrial Cytochrome c oxidase I (COI) gene. These universal primers are recommended for distinguishing the highly diverse DNA in environmental mixtures. We conducted PCR using a one-step single-indexed approach, with a 13 bp tag attached to the MinibarR1 primer. The PCR reaction volume was 16 μl, which included 8 μl KAPA HiFi HotStart ReadyMix (KK2602, Roche Molecular Systems Inc., USA), 5 μl ddH2O, 1 μl of each primer (10 μM) and 1 μl of DNA template. To optimise the protocol, we performed a preliminary PCR using an annealing temperature gradient and found that 54°C gave the best results. The PCR mixture was denatured at 95°C for 15 minutes, followed by 35 cycles of 94°C for 30 seconds, 54°C for 30 seconds and a final elongation at 72°C for 10 minutes.
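The reaction composition above lends itself to a quick arithmetic check. The following is a minimal sketch, not part of the published protocol: it confirms that the listed components add up to 16 μl, derives the final concentration of each primer and scales the shared reagents for a batch of reactions. The 10% pipetting overage, and the assumption that the tagged reverse primer and the template are added to each tube individually, are illustrative choices that the text does not state.

```python
from decimal import Decimal

# Per-reaction volumes in microlitres, as listed in the text.
SHARED_UL = {
    "KAPA HiFi HotStart ReadyMix": Decimal("8"),
    "ddH2O": Decimal("5"),
    "MinibarF1 (10 uM)": Decimal("1"),
}
# Assumption: the tagged MinibarR1 primer and the DNA template differ
# between samples, so they are pipetted per tube, not into the mix.
PER_TUBE_UL = {
    "tagged MinibarR1 (10 uM)": Decimal("1"),
    "DNA template": Decimal("1"),
}


def master_mix(n_reactions, overage=Decimal("0.10")):
    """Scale the shared reagents for n reactions plus a pipetting overage.

    The default 10% overage is a hypothetical value for illustration.
    """
    factor = Decimal(n_reactions) * (Decimal("1") + overage)
    return {name: volume * factor for name, volume in SHARED_UL.items()}


# Total volume of one reaction: 8 + 5 + 1 + 1 + 1 microlitres.
reaction_volume = sum(SHARED_UL.values()) + sum(PER_TUBE_UL.values())

# Final primer concentration: 1 ul of a 10 uM stock diluted into 16 ul.
final_primer_uM = Decimal("10") * Decimal("1") / reaction_volume

# Shared reagent volumes for the 12 samples of the study.
mix = master_mix(12)

if __name__ == "__main__":
    print(f"reaction volume: {reaction_volume} ul")
    print(f"final primer concentration: {final_primer_uM} uM")
    for name, volume in mix.items():
        print(f"{name}: {volume} ul")
```

With the volumes from the text, each primer ends up at 0.625 μM in the final reaction.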
The PCR products were checked on a 1.5% agarose gel and quantified with the Invitrogen Qubit 4 fluorometer (Thermo Fisher Scientific Inc., USA). Afterwards, all the PCR products were pooled in one tube for next-generation sequencing. Sequencing was performed on the Illumina NovaSeq 6000 platform with 2 × 150 bp paired-end reads by Genomics Co., Taipei, Taiwan.
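Because all tagged PCR products were pooled before sequencing, reads have to be assigned back to their samples by the 13 bp tag. The text does not describe how this was done, so the following is only a minimal sketch of one common approach: exact matching of the first 13 bases of a read against a tag table. The tag sequences and sample labels are hypothetical, and the sketch assumes the tag sits at the very start of the read.

```python
TAG_LENGTH = 13  # tag length stated in the text


def build_tag_index(sample_tags):
    """Map each tag to its sample label, rejecting malformed or duplicate tags."""
    index = {}
    for sample, tag in sample_tags.items():
        tag = tag.upper()
        if len(tag) != TAG_LENGTH:
            raise ValueError(f"tag for {sample} is not {TAG_LENGTH} bp long")
        if tag in index:
            raise ValueError(f"tag {tag} is used by more than one sample")
        index[tag] = sample
    return index


def assign_sample(read_sequence, tag_index):
    """Return the sample whose tag equals the first 13 bases, or None."""
    return tag_index.get(read_sequence[:TAG_LENGTH].upper())


def demultiplex(reads, tag_index):
    """Group read sequences by sample; unmatched reads go to 'unassigned'."""
    bins = {}
    for sequence in reads:
        sample = assign_sample(sequence, tag_index) or "unassigned"
        bins.setdefault(sample, []).append(sequence)
    return bins


# Hypothetical tags and sample labels, for illustration only.
example_tags = {"WF-1": "ACGTACGTACGTA", "RM-1": "TGCATGCATGCAT"}
example_index = build_tag_index(example_tags)

MINIBAR_R1 = "GAAAATCATAATGAAGGCATGAGC"  # reverse primer from the text
example_reads = [
    "ACGTACGTACGTA" + MINIBAR_R1,
    "TGCATGCATGCAT" + MINIBAR_R1,
    "NNNNNNNNNNNNN" + MINIBAR_R1,
]
example_bins = demultiplex(example_reads, example_index)

if __name__ == "__main__":
    for sample, sequences in example_bins.items():
        print(sample, len(sequences))
```

Exact matching is the strictest option; dedicated demultiplexing tools usually also tolerate a small number of mismatches and check both reads of a pair, which this sketch leaves out.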

Open data and code
Two datasets were associated with this study: DNA sequence data and occurrence data (see Data resources). We converted the occurrence data into the Darwin Core Archive standard (Darwin Core Task Group 2009) and validated the datasheet using the GBIF Data Validator (Global Biodiversity Information Facility 2017). We then published the dataset containing one occurrence core (i.e. the foundational part of the dataset with information about each occurrence) and one DNA-derived data extension (Hoh 2023) using the Integrated Publishing Toolkit (IPT) of GBIF installed under the Taiwan Biodiversity Information Facility (TaiBIF). We have included three supplementary files to help describe the dataset. They are the attributes of the sampling event (Suppl. material 1), the water quality measurements (Suppl. material 2) and the relationship of each technical sample to the sampling event (Suppl. material 3). These files were attached as the current GBIF data model schema does not support event core matching with a DNA-derived data extension.
All source code used in the project can be found in the project's GitHub repository.

Geographic coverage
Description: We selected four sites along the Houdong River (猴洞坑), Jiaoxi Township, Yilan County, Taiwan (Table 1) for water sample collection and in-situ water quality measurements: upstream waterfall (WF), downstream river (FR), estuary (ES) and river mouth (RM). These four sites spanned a river length of 6.5 km.

Taxonomic coverage
Description: We detected eukaryotic organisms in the water samples using the COI mitochondrial gene. A total of 2,736 OTUs were identified and 421 of the OTUs were assigned to at least the kingdom level using the MIDORI database (v. GB250; Machida et al. (2017); Fig. 2). On GBIF, this dataset (Hoh 2023) consists of a total of 6,297 occurrences with 22% (1,356 occurrences; last accessed 05-12-2023) having a taxon match on the GBIF Backbone Taxonomy, with the remaining occurrences being assigned as incertae sedis (i.e. taxa unknown). The species occurrence dataset was standardised and presented in the GBIF-annotated Darwin Core Archive (see Data resources), grouped by sampling events (i.e. sites, identifiable via the eventID column).
Table 1. Coordinates of the four sampling sites.
Figure 2. The taxonomic ranking of 432 classified operational taxonomic units (OTUs). The colours represent different kingdoms.
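The proportion of matched occurrences quoted in this section, and the grouping of records by sampling event, can be illustrated with a short sketch. The totals below are the ones reported in the text. The per-record logic assumes that unmatched records carry the literal name "incertae sedis" (or an empty name) in scientificName, which is an assumption about the download, not something the text specifies. The eventID values in the example rows are hypothetical.

```python
from collections import Counter

# Totals reported in the text (last accessed 05-12-2023).
TOTAL_OCCURRENCES = 6297
MATCHED_OCCURRENCES = 1356

unmatched = TOTAL_OCCURRENCES - MATCHED_OCCURRENCES
matched_share = MATCHED_OCCURRENCES / TOTAL_OCCURRENCES


def match_summary(rows):
    """Count matched and incertae sedis records for each eventID."""
    summary = {}
    for row in rows:
        counts = summary.setdefault(row["eventID"], Counter())
        name = row.get("scientificName", "").strip().lower()
        # Assumption: an empty or "incertae sedis" name means no taxon match.
        key = "incertae_sedis" if name in ("", "incertae sedis") else "matched"
        counts[key] += 1
    return summary


# Hypothetical records, for illustration only.
example_rows = [
    {"eventID": "WF", "scientificName": "Chironomidae"},
    {"eventID": "WF", "scientificName": "incertae sedis"},
    {"eventID": "RM", "scientificName": "incertae sedis"},
]
example_summary = match_summary(example_rows)

if __name__ == "__main__":
    print(f"unmatched occurrences: {unmatched}")
    print(f"matched share: {matched_share:.1%}")
    for event_id, counts in example_summary.items():
        print(event_id, dict(counts))
```

The reported totals are internally consistent: 6,297 minus 1,356 leaves the 4,941 incertae sedis records mentioned in the abstract, and the matched share rounds to 22%.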
This was a one-time sampling of water samples and corresponding water quality parameters on 28-04-2022.
Datasets produced by the current work are licensed under a Creative Commons Attribution (CC-BY) 4.0 Licence.

Data resources
Data package title: eDNA along Houdong riverine zonation in Taiwan
Number of data sets: 2

Data set name: eDNA along riverine zonation of Houdong River, Yilan, Taiwan [Project ID: PRJEB60905]
Download URL: https://www.ebi.ac.uk/ena/browser/view/PRJEB60905
Data format: Genomic Standard Consortium MIxS water package
Data format version: mixs6.1.0
Description: DNA sequence data have been deposited on ENA at EMBL-EBI under accession number PRJEB60905 following the Genomic Standard Consortium MIxS standard (Yilmaz et al. 2011). Described below are the nine default columns under the 'Read Files' section on the project (or dataset) page, which can also be obtained by downloading the TSV report on the Project page.
[...] number created by ENA for this submission (PRJEB60905 for this dataset).
sample_accession: The sample accession number created by ENA for this submission. A total of 12 Biosamples (comprising three replicates from each of the four sampling sites) were registered. Each accession from the link https://www.ebi.ac.uk/ena/browser/view/[sample_accession] describes basic information about the sample following the MIxS standard.
experiment_accession: The experiment accession number created by ENA for this submission. A total of 12 Experiments were registered. Each accession from the link https://www.ebi.ac.uk/ena/ [...]
[...] to download DNA reads obtained from each Run. This is the ENA Archived Generated File as described here. The format of the file is a gzip-compressed FASTQ file. Two FTP links were provided that separate the forward and reverse reads from each paired experiment by [run_accession]_1.fastq.gz and [run_accession]_2.fastq.gz, respectively.
submitted_ftp: The FTP link to download DNA reads uploaded to ENA by the submitter before the automated curation resulting in fastq_ftp.
bam_ftp: The FTP link to download the BAM file of each Run.

Data set name: eDNA along Houdong riverine zonation in Taiwan
Download URL: https://www.gbif.org/dataset/2615342d-7349-4e75-ae34-cda6cb403e2e ; https://ipt.taibif.tw/archive.do?r=houdongkeng_water_edna
Data format: Darwin Core Archive
Data format version: 2021-07-15
Description: There are two links in the Download URL. The first links to the download page of the GBIF-annotated Darwin Core Archive and the second links to the Source Darwin Core Archive from the TaiBIF IPT. The second link is provided because the DNA-derived data extension associated with the occurrence datasheet is not available through the GBIF-annotated Darwin Core Archive download option, although the extension is included in the source archive available either through the GBIF webpage for the dataset or directly from the TaiBIF IPT. Downloading from both links gives GZ-compressed files containing the occurrence core and DNA-derived data extension files in TXT format. The table below describes a total of 76 data fields from both the occurrence core and the DNA-derived data extension, sorted alphabetically. The data field descriptions are written as listed in the List of Darwin Core terms (accessed April 2023; Darwin Core Task Group (2009)), but modified where needed to fit the current study context. The occurrence core datasheet can also be downloaded via GBIF API-based tools such as rgbif (Chamberlain et al. 2022) for further analyses.
[...] name of the class in which the taxon is classified.
concentration: Concentration of DNA (weight ng/volume µl).
concentrationUnit: Unit used for concentration measurement.
continent: The name of the continent in which the Location occurs.
coordinateUncertaintyInMeters: The horizontal distance (in metres) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location.
country: The name of the country in which the Location occurs.
countryCode: The standard code for the country in which the Location occurs.
county: The full, unabbreviated name of the smaller administrative region in which the Location occurs.
datasetName: The name identifying the dataset from which the record was derived.
dateIdentified: The date on which the subject was determined as representing the Taxon.
day: The integer day of the month on which the Event occurred.
decimalLatitude: The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location.
decimalLongitude: The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location.
DNA_sequence: The DNA sequence.
env_broad_scale: The major environmental system from which the sample came. ENVO's biome subclasses determined in https://ontobee.org/ontology/ENVO.
env_local_scale: The entity in the sample's local vicinity, at a smaller spatial grain than the entry in env_broad_scale. ENVO's biome subclasses determined in https://ontobee.org/ontology/ENVO.
sampleSizeValue: Total number of reads in the sample.
samplingProtocol: The names of, references to, or descriptions of the methods or protocols used during an Event.
scientificName: The full scientific name in the lowest level taxonomic rank that can be determined. The content in this field was obtained from secondary mapping of the lowest taxonomic rank in verbatimIdentification to the GBIF Backbone Taxonomy ( [...]
[...] locus name for marker gene studies.
tax_ident: The phylogenetic marker(s) used to assign an organism name.
taxonRank: The taxonomic rank of the most specific name in the scientificName.
type: The nature or genre of the resource.
verbatimIdentification: The taxonomic identification of otu_db.
verbatimLocality: The original textual description of the place.
year: The four-digit year in which the Event occurred, according to the Common Era Calendar.
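For readers who work with the source Darwin Core Archive, the occurrence core and the DNA-derived data extension need to be linked record by record. The following is a minimal sketch of that join, assuming both tables are tab-separated text sharing an identifier column named "id". That column name is an assumption; the actual names are declared in the archive's meta.xml and should be checked there. The example rows are invented for illustration and only use field names listed in this paper.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def join_core_extension(core_rows, ext_rows, core_key="id", ext_key="id"):
    """Attach extension fields to each core record via the shared identifier.

    The key names are assumptions; meta.xml in the archive states which
    columns link the core to the extension.
    """
    ext_by_id = {}
    for row in ext_rows:
        ext_by_id.setdefault(row[ext_key], []).append(row)

    joined = []
    for core in core_rows:
        # A core record without extension rows is kept unchanged.
        for ext in ext_by_id.get(core[core_key], [{}]):
            merged = dict(ext)
            merged.update(core)  # core values win on name clashes
            joined.append(merged)
    return joined


# Hypothetical miniature tables, for illustration only.
core_text = (
    "id\teventID\tscientificName\n"
    "occ1\tWF\tincertae sedis\n"
    "occ2\tRM\tChironomidae\n"
)
ext_text = (
    "id\tDNA_sequence\tconcentration\n"
    "occ1\tACGTACGT\t12.5\n"
    "occ2\tTTGACCAA\t8.1\n"
)
example_joined = join_core_extension(read_tsv(core_text), read_tsv(ext_text))

if __name__ == "__main__":
    for record in example_joined:
        print(record["id"], record["eventID"], record["DNA_sequence"])
```

In practice the two text strings would be replaced by the contents of the occurrence and extension files extracted from the downloaded archive.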