Contribution to a reference library of DNA barcodes for Colombian freshwater fishes

DNA barcoding was proposed in response to the mismatch between the small number of taxonomists and the large number of species. The method requires the construction of reference collections of DNA sequences that represent existing biodiversity. Freshwater fishes are key indicators for understanding biogeography around the world. Colombia, with 1610 species of freshwater fishes, is the second richest country in the world in this group. However, genetic information on these species remains limited. This contribution to a reference library of DNA barcodes for Colombian freshwater fishes highlights the importance of biological collections and seeks to strengthen the inventories and taxonomy of such collections in future studies.
This dataset contributes to the knowledge of the DNA barcodes and occurrence records of 96 species of freshwater fishes from Colombia. The species represented in this dataset correspond to an addition of 39 species to the BOLD public databases. Forty-nine specimens were collected in the Atrato Basin and 708 in the Magdalena-Cauca Basin between 2010 and 2020. Two species (Loricariichthys brunneus and Poecilia sphenops) are considered exotic to the Atrato, Cauca and Magdalena Basins and four species (Oncorhynchus mykiss, Oreochromis niloticus, Parachromis friedrichsthalii and Xiphophorus helleri) are exotic to Colombian hydrogeographic regions. All specimens are deposited in the CIUA collection at the University of Antioquia. Their DNA barcodes are publicly available in the Barcode of Life Data System (BOLD) online database and the distribution dataset can be freely accessed through the Global Biodiversity Information Facility (GBIF).


Introduction
Neotropical freshwater fishes constitute the most diverse continental vertebrate fauna on Earth, with more than 6,200 named species compressed into < 0.5% of the total land-surface area, and represent the greatest phenotypic disparity and functional diversity of any continental ichthyofauna (Albert et al. 2020). This fauna is still in a pioneering stage of discovery, with dozens of new species being described each year. The current pace of discovery indicates that the actual diversity of Neotropical freshwater fishes could exceed 9,000 species, meaning that as many as one-third of the species in the wild have yet to be described (Reis et al. 2016).
Colombia has 1610 described species of freshwater fishes (FF), making it the second richest country in the world for this group. In particular, the trans-Andean Basin of the Magdalena and Cauca Rivers presents altitudinal gradients that testify to geological and climatic events; 232 species of fish have been recorded there, of which 57% are endemic. The Atrato Basin, part of the Pacific hydrogeographic region, has 128 described species, 32 of which are endemic to this region (DoNascimiento et al. 2018). Despite the efforts of research bodies in collecting data, much remains to be determined about the diversity of FF in the country. The huge diversity of these organisms, the shortage of specialised taxonomists and the difficulties in identifying many species are the main obstacles to fully addressing this lack of knowledge.
DNA barcoding is an approach that aims at the rapid identification of species within a taxonomic group, based on the amplification of a fragment of the mitochondrial gene cytochrome c oxidase subunit I (COX1) and its comparison with sequences previously obtained from morphologically identified specimens (Kress and Erickson 2012, Hebert et al. 2003). As the efficacy of COX1 in discriminating fish species has previously been demonstrated (Thu et al. 2019, Ude et al. 2020, Ward et al. 2005), this approach can be very useful for the identification of the Colombian ichthyofauna, considering the high diversity of fish in the country and the rate of ten species of freshwater fishes described per year (DoNascimiento et al. 2017).
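The comparison step described above can be illustrated with a minimal sketch. It assumes the query and reference sequences are already aligned to the same length and uses the uncorrected proportion of differing sites (p-distance) as the similarity measure; this is an illustrative choice, not necessarily the procedure used by BOLD or by the authors. The sequences, the labels reference_A and reference_B, and the function names are hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences.

    Sites where either sequence has a character other than A, C, G or T
    (gaps, ambiguity codes) are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def closest_reference(query: str, references: dict) -> tuple:
    """Return (label, distance) for the reference with the smallest p-distance."""
    return min(
        ((label, p_distance(query, seq)) for label, seq in references.items()),
        key=lambda pair: pair[1],
    )


# Toy 12-bp sequences, invented for illustration only (not real COX1 data).
references = {
    "reference_A": "ACGTACGTACGT",
    "reference_B": "ACGTTCGAACGA",
}
query = "ACGTACGTACGA"

label, distance = closest_reference(query, references)
print(label, round(distance, 4))
```

In practice, a match is usually accepted only when the distance falls below a threshold chosen for the taxonomic group, and records identified only to genus level cannot yield a species-level identification.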
This contribution to a reference library of DNA barcodes for Colombian freshwater fishes contains records of 757 specimens collected in three basins of Colombia, all of which were morphologically identified to species level, when possible, for a total of 94 species (673 records to species level) and 63 genera (84 records to genus level). All specimens have their DNA barcodes made publicly available in the Barcode of Life Data System (BOLD). Overall, this paper is a contribution to sharing and publicly disseminating the occurrence records and DNA barcodes of specimens from the reference collection of the University of Antioquia in order to increase the available information on Colombian freshwater fishes.

General description
Purpose: This dataset aims to provide information from a COX1 gene sequence database for freshwater fish species present in Atrato, Cauca and Magdalena Basins in Colombia (Fig. 1), facilitating the identification of species by molecular methods and also being a tool for future metabarcoding studies and monitoring of basins, as well as highlighting the importance of biological collections.

Project description
Title: The title "Contribution to a reference library of DNA barcodes for Colombian freshwater fishes" refers to the publication of sequences of the cytochrome oxidase 1 gene generated in the Ichthyology Collection of the University of Antioquia from the fishes catalogued there that occur in the Atrato, Cauca and Magdalena Basins in Colombia.
Design description: Freshwater fish specimens were collected in the field, using different fishing methods, morphologically identified and DNA barcoded.
Funding: This project was funded by Empresas Públicas de Medellín (EPM) under the "BIO" agreement No. CT-2017-001714 with the University of Antioquia.

Sampling methods
Study extent: The Atrato, Cauca and Magdalena Basins, Colombia.

Sampling description:
The analysed material was collected from 138 different localities in the Atrato, Cauca and Magdalena Basins in Colombia. Sampling was conducted between 2010 and 2020 across a wide range of habitats, using different fishing methods. Collected specimens were fixed and stored in alcohol and a portion of muscle or fin was stored in 96% ethanol for downstream molecular analysis. Morphological identification was based on taxonomic keys and descriptions from the literature.
DNA was extracted using the QIAGEN DNeasy Blood & Tissue kit (Hilden, Germany), following the manufacturer's protocol. The COX1 fragment was amplified using the primers FishF1 (5'-TCAACCAACCACAAAGACATTGGCAC-3') and FishR1 (5'-TAGACTTCTGGGTGGCCAAAGAATCA-3') (Ward et al. 2009). The products were sent to a commercial laboratory to be purified and sequenced by the Sanger method. The forward and reverse sequences were edited and assembled using Geneious Prime (2019) software and inspected manually.
Step description: Specimens were collected from 138 different localities in three Colombian Basins (Atrato, Cauca and Magdalena). Sampling was conducted from 2010 to 2020, using different fishing methods. Collected specimens were stored in 70% ethanol. A tissue sample from muscle or fin was removed and stored in 96% ethanol, from which DNA was extracted and the COX1 DNA barcode fragment was sequenced. The data generated were submitted to BOLD and are part of the data from the Ichthyology Collection of the University of Antioquia in GBIF.
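As a quick sanity check on the primer pair quoted above, the following sketch computes the length, GC content and reverse complement of each primer. The primer sequences are taken from the text; the function names are illustrative and nothing here reproduces the authors' laboratory workflow.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Primer sequences as given in the text (5' to 3').
PRIMERS = {
    "FishF1": "TCAACCAACCACAAAGACATTGGCAC",
    "FishR1": "TAGACTTCTGGGTGGCCAAAGAATCA",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


for name, seq in PRIMERS.items():
    print(name, len(seq), f"GC={gc_content(seq):.3f}", reverse_complement(seq))
```

The reverse complement of the reverse primer is the form in which its binding site appears on the forward strand, which is useful when trimming primer remnants from assembled reads.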

Taxonomic coverage
Description: This dataset is composed of data relating to 757 specimens of freshwater fishes that occur in Colombia; 673 specimens were identified to species level and 84 to genus level. Overall, 94 species in 24 families are represented in the dataset (Suppl. material 1). The families Characidae, Astroblepidae and Loricariidae account for 71% of the total collected specimens. One family (Sciaenidae) is represented by a single sequence, whereas seven families (Apteronotidae, Aspredinidae, Callichthyidae, Cynolebiidae, Erythrinidae, Salmonidae (exotic) and Sciaenidae) are represented by a single species. Fig. 3 illustrates examples of species that are part of the CIUA dataset.
Description: The Barcoding CIUA Database: CIUA01 dataset can be downloaded from the Public Data Portal of BOLD (dx.doi.org/10.5883/DS-CIUA01) in XML format, with sequences as FASTA files, including 18 columns with the information described in Suppl. material 2. Alternatively, BOLD users can log in and access the dataset via the Workbench platform of BOLD. All records are also searchable within BOLD, using the search function of the database.
The Barcoding CIUA project will continue sequencing Colombian freshwater fishes for the BOLD database, with the main objective of achieving representative coverage of Colombian fish species.
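Once the sequences have been downloaded as FASTA, they can be read with a few lines of standard-library Python. The sketch below is a minimal parser run on an in-memory example; the header layout and the record names (record_1, record_2) are hypothetical and do not describe the actual header format exported by BOLD.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; headers and sequences are invented.
example = StringIO(
    ">record_1|Genus sp.\n"
    "ACGTAC\n"
    "GTAC\n"
    ">record_2|Genus sp.\n"
    "ACGT\n"
)

records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

For a downloaded file, the same function can be applied to an open file object, for example `with open(path) as handle: records = list(read_fasta(handle))`, where `path` points to the local copy of the FASTA file.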
All records of the collection have been reported to the SiB database (Jiménez-Segura et al. 2021) in Darwin Core Standard Format (Suppl. material 3), which includes 58 columns with the information shown and described in Table 1.

Column label Column description
basisOfRecord The specific nature of the data record.
institutionCode The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record.
collectionCode The name, acronym, coden or initialism identifying the collection or dataset from which the record was derived.
catalogNumber A unique identifier for the record within the collection.
type The nature or genre of the resource.
language A language of the resource.
institutionID An identifier for the institution having custody of the object(s) or information referred to in the record.
collectionID An identifier for the collection or dataset from which the record was derived.
waterBody The name of the water body in which the Location occurs.
island The name of the island on or near which the Location occurs.
country The name of the country or major administrative unit in which the Location occurs.
countryCode The standard code for the country in which the Location occurs.
stateProvince The name of the next smaller administrative region than country (state, province, canton, department, region etc.) in which the Location occurs.
decimalLatitude The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location.
decimalLongitude The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location.
geodeticDatum The ellipsoid, geodetic datum or spatial reference system (SRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude are based.
coordinateUncertaintyInMeters The horizontal distance (in metres) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location.
kingdom The full scientific name of the kingdom in which the taxon is classified.
phylum The full scientific name of the phylum or division in which the taxon is classified.
class The full scientific name of the class in which the taxon is classified.
order The full scientific name of the order in which the taxon is classified.
family The full scientific name of the family in which the taxon is classified.
genus The full scientific name of the genus in which the taxon is classified.
specificEpithet The name of the first or species epithet of the scientificName.
infraspecificEpithet The name of the lowest or terminal infraspecific epithet of the scientificName, excluding any rank designation.
taxonRank The taxonomic rank of the most specific name in the scientificName.
scientificNameAuthorship The authorship information for the scientificName formatted according to the conventions of the applicable nomenclaturalCode.
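The column definitions above lend themselves to simple automated checks. The sketch below assumes the occurrence records are available as a tab-delimited text table using the Darwin Core column labels; it flags coordinates outside the valid range of decimal degrees and tallies records by taxonRank. The sample rows, the catalogue numbers (example-1, example-2) and the function name are hypothetical and are not taken from the dataset.

```python
import csv
from collections import Counter
from io import StringIO


def coordinate_problems(row: dict) -> list:
    """Return a list of problems found in a record's decimal coordinates."""
    problems = []
    limits = (("decimalLatitude", 90.0), ("decimalLongitude", 180.0))
    for field, limit in limits:
        raw = (row.get(field) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{field} is missing or not numeric")
            continue
        if abs(value) > limit:
            problems.append(f"{field} out of range: {value}")
    return problems


# Hypothetical tab-delimited sample; values are invented for illustration.
sample = (
    "catalogNumber\tdecimalLatitude\tdecimalLongitude\ttaxonRank\n"
    "example-1\t6.25\t-75.56\tspecies\n"
    "example-2\t95.0\t-75.0\tgenus\n"
)

rows = list(csv.DictReader(StringIO(sample), delimiter="\t"))

# Map each problematic catalogue number to its list of problems.
flagged = {}
for row in rows:
    problems = coordinate_problems(row)
    if problems:
        flagged[row["catalogNumber"]] = problems

# Count how many records are identified to each taxonomic rank.
rank_counts = Counter(row["taxonRank"] for row in rows)

print(flagged)
print(dict(rank_counts))
```

A tally of this kind applied to the full dataset would be expected to reproduce the split between species-level and genus-level identifications reported in the taxonomic coverage.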