The InBIO Barcoding Initiative Database: DNA barcodes of Portuguese Diptera 02 - Limoniidae, Pediciidae and Tipulidae

Abstract

Background
The InBIO Barcoding Initiative (IBI) Diptera 02 dataset contains records of 412 crane fly specimens belonging to the Diptera families Limoniidae, Pediciidae and Tipulidae. This dataset is the second release by IBI on Diptera and it greatly increases the knowledge on the DNA barcodes and distribution of crane flies from Portugal. All specimens were collected in Portugal, including six specimens from the Azores and Madeira archipelagos. Sampling took place from 2003 to 2019. Specimens have been morphologically identified to species level by taxonomists and belong to 83 species in total. The species represented in this dataset correspond to about 55% of all the crane fly species known from Portugal and 22% of crane fly species known from the Iberian Peninsula. All DNA extractions and most specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources.

New information
Fifty-three species were new additions to the Barcode of Life Data System (BOLD), with another 18 species' barcodes added from under-represented species in BOLD. Furthermore, the submitted sequences were found to cluster in 88 BINs, 54 of which were new to BOLD. All specimens have their DNA barcodes publicly accessible through the BOLD online database and their collection data can be accessed through the Global Biodiversity Information Facility (GBIF). One species, Gonomyia tenella (Limoniidae), is recorded for the first time from Portugal, raising the number of crane flies recorded in the country to 145 species.


Introduction
Portugal is part of the Mediterranean hotspot of biodiversity, yet Portuguese biodiversity remains poorly studied and genetic data are scarcer still. To tackle this problem, the Research Network in Biodiversity and Evolutionary Biology (InBIO) created the InBIO Barcoding Initiative (IBI), making use of in-house High Throughput Sequencing knowledge to construct a reference collection of morphologically identified Portuguese specimens and corresponding DNA barcodes (Ferreira et al. 2018). Invertebrates, especially insects, were given priority in the IBI due to their share of overall biodiversity and importance in ecosystem functioning (e.g. Weisser and Siemann 2004, Losey and Vaughan 2006, Mata et al. 2016, da Silva et al. 2019) and due to the lack of available DNA barcodes in public databases representative of Portuguese invertebrates (e.g. Corley and Ferreira 2017, Corley et al. 2017, Weigand et al. 2019). DNA barcoding is a molecular biology method for species identification that relies on the comparison of a short mitochondrial DNA sequence of interest to a library of sequences with known species identity (Hebert et al. 2003). The construction of comprehensive reference libraries is therefore essential and these require the morphological identification of vouchers by expert taxonomists (Kress et al. 2015, Ferreira et al. 2018). DNA barcoding has expanded beyond single organism and species identification to broader metabarcoding studies (Porter and Hajibabaei 2020). DNA barcodes are now a ubiquitous tool in ecological and biological conservation studies, as well as, for example, in forensic applications (Pečnikar and Buzan 2013, Kress et al. 2015, DeSalle and Goldstein 2019).
The order Diptera is one of the most diverse, widespread and common of the holometabolic insects, having more than 158,000 described species (Pape et al. 2009, Evenhuis and Pape 2021). Within Diptera, the crane flies (Tipuloidea) are further classified into four families, Cylindrotomidae, Limoniidae, Pediciidae and Tipulidae (Starý 1992; but see Petersen et al. 2010 and Starý 2021) and are one of the most diverse groups, with over 15,630 recognised species (Oosterbroek 2021). Adult Tipuloidea can superficially resemble mosquitoes, with their slender bodies and long antennae, wings, legs and abdomen, but can be identified by the presence of two complete anal veins in the wings, a V-shaped transverse suture on the mesothorax and the absence of ocelli (de Jong et al. 2007). Larvae of Tipuloidea are mainly identified by the presence of a hemicephalous, retractile head capsule (de Jong et al. 2007). Larvae of most species are found in aquatic habitats, from fast-flowing streams to brackish water, or in semi-aquatic habitats, such as organic sludge along the edge of water bodies or saturated mosses and hepatics. Those that are terrestrial are mostly still found in humid habitats, like leaf-litter (Pritchard 1983, de Jong et al. 2007), although a few also live in dry soils. Contrary to immature forms, all adult Tipuloidea, mostly short-lived after emergence, are terrestrial (Pritchard 1983, de Jong et al. 2007). In the larval stage, most species feed on algae or decaying plant material and associated microflora and some groups also feed on mosses and hepatics, though several Limnophilinae and Pediciinae larvae are predatory (Pritchard 1983, de Jong et al. 2007). Most species do not feed after reaching adulthood, although adults generally drink water to offset body evaporation (Pritchard 1983, de Jong et al. 2007).
A few species are known to be important crop pests, as their larvae, when in large numbers, can damage crops by feeding on their roots or seedlings (Alford 2012, Alford 2014, Blackshaw and Coll 1999, de Jong et al. 2007). Furthermore, species of Tipuloidea play important ecological roles in several ecosystems, being well-known components of bird and bat diets (e.g. Alford 2012, Alford 2014, Buchanan et al. 2006, Krüger et al. 2013, Rhymer et al. 2012, Vaughan 1997, Wilson et al. 1999).
In Portugal, knowledge of Tipuloidea is still very incomplete. Of the four families that compose the Tipuloidea, only the Cylindrotomidae has not so far been recorded for the country, although it is known from Spain (Oosterbroek et al. 2020, Oosterbroek 2021). Recently, 33 new species were added to the Portuguese species list, raising its total to 149 (Eiroa and Báez 2002a, Eiroa and Báez 2002b, Oosterbroek et al. 2020, Oosterbroek 2021, Kolcsár et al. 2021). This is certainly an underestimate, as 376 species are already known from the Iberian Peninsula (Oosterbroek et al. 2020). Furthermore, the distribution and ecology of the Portuguese crane flies are also poorly known.
The IBI Diptera 02 dataset contains records of 412 specimens of crane flies collected in Portugal, all morphologically identified to species level, for a total of 83 species, two of which were further identified to subspecies level. This dataset is part of the ongoing IBI database public releases in both the Global Biodiversity Information Facility (GBIF) and the Barcode of Life Data System (BOLD) (e.g. Ferreira et al. 2020a, Ferreira et al. 2020b).

List of taxa that were collected and DNA barcoded within this project. In column Taxa: * indicates taxa without a DNA barcode prior to this study; '' indicates taxa with fewer than 10 sequences available prior to this study.

Study area description: Portugal, including the Autonomous Regions of the Azores and of Madeira (Fig. 2).

Sampling methods
Study extent: Portugal, including the Autonomous Regions of the Azores and of Madeira.

Sampling description:
The studied material was collected in 83 different localities in Portugal, 77 in continental Portugal and six in the Autonomous Regions of the Azores and of Madeira. The Bragança District was the most heavily sampled (21% of total specimens) and was also where most species were recorded, with almost half of the species in the dataset (41%) found there (Fig. 2, Table 2). Sampling was conducted between 2003 and 2019, although the vast majority of specimens were collected in 2018 (32%) and 2019 (50%). Specimens were collected by direct search and individual netting, by sweeping the vegetation or directly at light traps (using both UV and mercury vapour lights) and were stored in 96% ethanol for downstream molecular analysis, unless stated otherwise.

DNA extraction and sequencing followed the general pipeline in use by the IBI. Genomic DNA was extracted from leg tissue using the EasySpin Genomic DNA Tissue Kit (Citomed) according to the manufacturer's protocol. The cytochrome c oxidase I (COI) barcoding fragment was then amplified as two overlapping fragments (LC and BH), using two sets of primers: LCO1490 (Folmer et al. 1994) + Ill_C_R (Shokralla et al. 2015) and Ill_B_F (Shokralla et al. 2015) + HCO2198 (Folmer et al. 1994), respectively. The COI barcode (Folmer region) was then sequenced in a MiSeq benchtop system. OBITools (https://git.metabarcoding.org/obitools/obitools) was used to process the initial sequences, which were then assembled into a single 658 bp fragment using Geneious 9.1.8 (https://www.geneious.com).
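The assembly of the two overlapping fragments described above can be illustrated with a minimal Python sketch. This is not the algorithm used by Geneious; it assumes error-free reads that share an exact overlap, and the function name `merge_overlapping`, the parameter `min_overlap` and the toy sequences are hypothetical and chosen only for illustration.

```python
def merge_overlapping(left: str, right: str, min_overlap: int = 20) -> str:
    """Join two reads where a suffix of `left` exactly matches a prefix of `right`.

    Illustrative only: real assemblers tolerate mismatches and use base
    qualities; this sketch requires an exact match (an assumption).
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    raise ValueError("no exact overlap of at least min_overlap bases found")


# Toy example (invented sequences, not real COI data):
# the 7-base overlap "TTGACCA" is shared by both fragments.
merged = merge_overlapping("ACGTACGTTTGACCA", "TTGACCAGGCT", min_overlap=5)
print(merged)       # ACGTACGTTTGACCAGGCT
print(len(merged))  # 19
```

With real data, the two inputs would be the LC and BH consensus sequences and the merged result would be expected to span the 658 bp Folmer region.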
Quality control: All DNA barcode sequences were compared against the BOLD database and the top 99 hits were inspected to detect possible problems arising from contaminations or misidentifications. The data were checked for errors and inconsistencies with OpenRefine 3.4 (http://openrefine.org) before submission to GBIF.
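The inspection of top hits lends itself to a simple screening aid. The sketch below is one possible way to summarise a hit list for a specimen, not the procedure used by the authors: the function name `summarise_hits`, the 97% similarity cut-off and the placeholder taxon names are all assumptions. A flag only means that the record deserves manual inspection; for species new to BOLD, few or no agreeing hits are expected.

```python
from collections import Counter


def summarise_hits(morph_id, hits, min_similarity=97.0):
    """Summarise how database hits relate to the morphological identification.

    `hits` is a list of (species_name, similarity_percent) tuples, for
    example the top hits returned for one barcode. Only hits at or above
    `min_similarity` (a hypothetical cut-off) are considered.
    """
    close = [name for name, similarity in hits if similarity >= min_similarity]
    counts = Counter(close)
    conflicting = sorted(name for name in counts if name != morph_id)
    return {
        "agreeing_hits": counts.get(morph_id, 0),
        "conflicting_names": conflicting,
        "needs_review": bool(conflicting),
    }


# Placeholder names and invented similarity values, for illustration only.
example_hits = [
    ("Genus alpha", 99.8),
    ("Genus alpha", 99.1),
    ("Genus beta", 98.0),
    ("Genus gamma", 90.0),
]
summary = summarise_hits("Genus alpha", example_hits)
print(summary)
```

In this toy case, two close hits agree with the morphological name and one close hit carries a different name, so the record is flagged for review; the distant hit is ignored.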
Step description:
1. Specimens were collected in 83 different Portuguese localities. Fieldwork was carried out between 2003 and 2019, with 82% of the records made in the years 2018 and 2019.
2. Specimens were collected during fieldwork by direct search and individual netting, by sweeping the vegetation or directly at light traps (using both UV and mercury vapour lights) and preserved in 96% alcohol. The majority of captured specimens were deposited in the IBI reference collection at CIBIO (Research Center in Biodiversity and Genetic Resources).
3. All specimens were morphologically identified using the available literature, except seven that were identified using the BOLD Identification Engine. For some specimens, it was necessary to prepare and then examine their terminalia.
4. All specimens were DNA barcoded. To sequence the 658 bp COI DNA barcode fragment, one leg was removed from each individual, DNA was extracted and then amplified. All DNA extracts were deposited in the IBI collection.
5. All sequences in the dataset were submitted to the BOLD and GenBank databases and, for each sequenced specimen, the morphological identification was contrasted with the results of the BLAST of the newly-generated DNA barcodes in the BOLD Identification Engine.

Geographic coverage
Description: Continental Portugal, Autonomous Regions of the Azores and of Madeira.

Taxonomic coverage
Description: The dataset is composed of data relating to 412 specimens of Diptera, all from the superfamily Tipuloidea. All specimens were morphologically identified to species or subspecies level by Pjotr Oosterbroek and/or Jaroslav Starý, except for seven specimens identified using the BOLD Identification Engine. In total, 83 species and two subspecies are represented in the dataset. These species belong to three families, Limoniidae, Pediciidae and Tipulidae. Limoniidae and Tipulidae account for similar numbers of collected specimens, 197 (48%) and 211 (51%), respectively (Fig. 3A), although Limoniidae have a higher proportion of recorded species in the dataset (45 species, 54% of the total) when compared with Tipulidae (36 species, 43%) (Fig. 3B). At the subfamily level, Tipulinae and Limoniinae accounted for the most collected specimens (50% and 29%, respectively; Fig. 3A) and also for the highest number of recorded species in the dataset (41% and 29%, respectively; Fig. 3B). The species represented in this dataset correspond to about 55% of all the crane fly species known from Portugal (52% Limoniidae, 66% Pediciidae and 57% Tipulidae) and 22% of crane fly species known from the Iberian Peninsula.

[...] in different formats (data as dwc, xml or tsv and sequences as fasta files). BOLD users can also log in and access the dataset through the Workbench platform of BOLD. All records are also discoverable within BOLD, using the platform search function.
The InBIO Barcoding Initiative will continue to sequence crane flies and other Diptera for the BOLD database, with the ultimate objective of achieving a comprehensive coverage of the Portuguese fauna. The version of the dataset, at the time of writing the manuscript, is included as Suppl. materials 1, 2, 3 in the form of two text files with specimen data, as downloaded from BOLD and from GBIF (the latter in Darwin Core Standard format) and one fasta file containing all sequences as downloaded from BOLD.
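For readers who wish to inspect the fasta file mentioned above, a minimal Python sketch of a FASTA reader is given here. It uses only the standard language features and assumes that each record header is unique; the function names `parse_fasta` and `lengths_off_target`, the record labels and the toy sequences are hypothetical. The expected length is a parameter, since not every downloaded sequence necessarily has the full 658 bp.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    Assumes unique headers; a repeated header would overwrite the
    earlier record.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


def lengths_off_target(records, expected=658):
    """Return {header: length} for sequences whose length differs from `expected`."""
    return {h: len(s) for h, s in records.items() if len(s) != expected}


# Toy input with invented labels and sequences (not real barcode data).
toy = ">rec1|Genus alpha\nACGT\nACGT\n>rec2|Genus beta\nACGTA\n"
toy_records = parse_fasta(toy)
print(lengths_off_target(toy_records, expected=8))  # {'rec2|Genus beta': 5}
```

For the real file, the text would be read from disk and `expected` left at its default of 658.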
The BOLD database is not completely compliant with the Darwin Core Standard (DCS) format and, as such, the Darwin Core formatted file (dwc) downloaded from the BOLD platform is not strictly DCS formatted. For a correctly DCS formatted file, see http://ipt.gbif.pt/ipt/resource?r=ibi_diptera_02&v=1.6 (Suppl. material 2).
Column labels below follow the labels in the tsv file downloaded from BOLD. Columns with no content in our dataset are left out of the list below.