The InBIO Barcoding Initiative Database: DNA barcodes of Portuguese Diptera 01

Abstract
Background
The InBIO Barcoding Initiative (IBI) Diptera 01 dataset contains records of 203 specimens of Diptera. All specimens have been morphologically identified to species level, and belong to 154 species in total. The species represented in this dataset correspond to about 10% of continental Portugal dipteran species diversity. All specimens were collected north of the Tagus river in Portugal. Sampling took place from 2014 to 2018, and specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources.
New information
This dataset contributes to the knowledge on the DNA barcodes and distribution of 154 species of Diptera from Portugal and is the first of the planned IBI database public releases, which will make available genetic and distribution data for a series of taxa. All specimens have their DNA barcodes made publicly available in the Barcode of Life Data System (BOLD) online database and the distribution dataset can be freely accessed through the Global Biodiversity Information Facility (GBIF).


Introduction
Diptera is one of the most diverse, abundant and widespread insect orders, with more than 158 000 described species, and many more still to be described (Pape et al. 2009; Evenhuis and Pape 2019). Dipterans are ubiquitous in many terrestrial ecosystems and larval stages of some species can also be found in aquatic ecosystems. They play important ecological roles, including those of pollinators, detritivores and parasites. Some species are also important disease vectors (Merritt et al. 2009) and crop pests (Skuhravá et al. 2010).
In continental Portugal, 1 475 species of Diptera have been recorded, a sizable diversity, but small when compared to the 5 800 species known to occur in continental Spain (Carles-Tolrá 2002). Since the seminal work of Carles-Tolrá (2002), further additions to the Portuguese dipteran fauna have been made (e.g. Andrade and Gonçalves 2014; Ebejer and Andrade 2015; Pollet et al. 2019), but much remains to be learned about its diversity and distribution patterns in the country. The huge diversity of this order, the shortage of specialised taxonomists and the difficulties in identifying many species are the main obstacles to overcoming this lack of knowledge. DNA barcoding is a method that aims to identify organisms based on a short DNA sequence previously sequenced from morphologically identified specimens (Hebert et al. 2003). This requires the construction of comprehensive reference collections of DNA sequences that represent the existing biodiversity (Baird et al. 2011; Kress et al. 2005; Ferreira et al. 2018). DNA barcoding can also be used as a first step in new species discovery and, as such, can help address the taxonomic impediment problem (e.g. Kekkonen and Hebert 2014).
The striking scarcity of genetic data associated with the high biodiversity found in Portugal prompted the creation of a DNA barcoding initiative by the Research Network in Biodiversity and Evolutionary Biology, InBIO (Associate Laboratory). The InBIO Barcoding Initiative (IBI) makes use of Next Generation Sequencing technologies (NGS) to develop a reference collection of DNA barcoding sequences, focusing on Portuguese invertebrate taxa. Within the project, a special focus is afforded to insects, given their relevance to food webs and ecosystem functioning (e.g. Weisser and Siemann 2004; Mata et al. 2016; Silva et al. 2019). Furthermore, for many insect species occurring in Portugal, there are no barcodes available in public databases (Corley et al. 2017; Weigand et al. 2019), and those that exist often show high distances to sequences obtained in Portugal, which may indicate cryptic diversity (Ferreira et al. 2018).
The IBI Diptera 01 dataset contains records of 203 specimens of Diptera collected in continental Portugal, all of which were morphologically identified to species level, for a total of 154 species. This is the first IBI dataset to be released to the Global Biodiversity Information Facility (GBIF) and all specimens have their DNA barcodes made publicly available in the Barcode of Life Data System (BOLD). We have included in this dataset the barcodes of all identified Diptera specimens in IBI up to December 2019, except those from the families Tipulidae and Limoniidae, for which we will provide a more detailed treatment in a future paper, due to the detection of new species and the need for further research. Overall, this paper contributes to sharing and publicly disseminating the distribution records and DNA barcodes of specimens from our reference collection, to increase the available information on the Portuguese Diptera fauna.

General description
Purpose: This dataset aims to provide a first contribution to an authoritative library of DNA barcode sequences for Portuguese Diptera. Such a library should facilitate DNA-based identification of species for both traditional molecular studies and DNA-metabarcoding studies and constitute a valuable resource for taxonomic research on Portuguese Diptera and their distribution.
Additional information: A total of 203 specimens of dipterans were collected and DNA barcoded (Suppl. material 2). Fig. 1 illustrates examples of the diversity of species that are part of the dataset of distribution data and DNA barcodes of Portuguese Diptera 01. All sequences of cytochrome c oxidase I (COI) DNA barcodes are 658 bp long. Of the 154 species barcoded, twenty-nine (19%) from 16 families were new to the DNA barcode database BOLD at the moment of the release (marked with * in the Species field of Table 2). Forty-two additional species (27%) from 24 families were previously represented in BOLD, but with fewer than 10 DNA barcode sequences at the moment of the release (marked with '' in the Species field of Table 2). Therefore, this dataset represents a significant contribution to enhancing the species and genetic diversity of the Diptera fauna represented in public libraries.
Figure 1. Examples of the diversity of species that are part of the dataset of distribution data and DNA barcodes of Portuguese Diptera 01. All photos by Rui Andrade.
Table 2. List of species that were collected and DNA barcoded within this project. * indicates species without a DNA barcode prior to this study; '' indicates species with fewer than 10 sequences prior to this study.
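The uniform barcode length reported above lends itself to a simple automated check. The following is a minimal sketch, not part of the published workflow: it assumes the sequences are available as plain FASTA text, and the function names `parse_fasta` and `wrong_length` are hypothetical helpers introduced here for illustration.

```python
from typing import Dict, Iterable, List

BARCODE_LENGTH = 658  # COI barcode length in bp, as reported in the text


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line
    return records


def wrong_length(records: Dict[str, str],
                 expected: int = BARCODE_LENGTH) -> List[str]:
    """Return the headers whose sequence length differs from `expected`."""
    return [h for h, seq in records.items() if len(seq) != expected]
```

With a file on disk, this could be used as `wrong_length(parse_fasta(open(path)))`, where `path` points to a local copy of the FASTA file; an empty list would mean every sequence has the expected length.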

Study area description: North of the Tagus river in Portugal (Fig. 2).
Design description: Dipteran specimens were collected in the field, morphologically identified and DNA barcoded.

Study extent: North of the Tagus river in Portugal.
Sampling description: The studied material was collected in 59 different localities from the northern half of continental Portugal (Fig. 2, Table 1). Sampling was conducted between 2014 and 2018 across a wide range of habitats, using mainly hand-held sweep-nets or direct search for specimens. Collected specimens were examined both dry and in alcohol using a binocular stereoscopic microscope (Optika ST-30-2LR, 20x-40x) and stored in 96% ethanol for downstream molecular analysis. Morphological identification was based on keys and descriptions from the literature (Suppl. material 1).
Quality control: All DNA barcode sequences were compared against the BOLD database and the 99 top hits were inspected in order to detect possible issues due to contamination or misidentification. Prior to submission to GBIF, data were checked for errors and inconsistencies with OpenRefine 3.2 (http://openrefine.org).
Step description: Specimens were collected in 59 different localities of continental Portugal. Sampling was conducted from 2014 to 2018, and consisted of direct searches for specimens (e.g. Hecamede albicans, Eutropha fulvifrons, Canace nasica) and the use of entomological nets to intercept specimens in flight (e.g. Hemipenthes morio, Sphaerophoria scripta) or to sweep the vegetation (e.g. Opetia nigra, Trigonometopus frontalis). Collected specimens were stored in 96% ethanol. A tissue sample was removed, from which DNA was extracted and the COI DNA barcode fragment was sequenced. The data generated were submitted to BOLD, GenBank and GBIF.
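The inspection of top hits described above could be partly automated. The sketch below is only an illustration under stated assumptions: the `Hit` structure, the function `conflicting_hits` and the 97% similarity threshold are hypothetical and are not taken from the study, which does not state how the inspection was carried out. The similarity values in the usage example are invented.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One database match for a query barcode (hypothetical structure)."""
    species: str
    similarity: float  # percent identity, 0-100


def conflicting_hits(identified_as: str,
                     hits: List[Hit],
                     min_similarity: float = 97.0,  # assumed threshold
                     max_hits: int = 99) -> List[Hit]:
    """Return close matches whose species name differs from the
    morphological identification, among the `max_hits` best hits."""
    top = sorted(hits, key=lambda h: h.similarity, reverse=True)[:max_hits]
    return [h for h in top
            if h.similarity >= min_similarity and h.species != identified_as]


# Usage example with made-up similarity values.
example_hits = [
    Hit("Sphaerophoria scripta", 99.8),
    Hit("Sphaerophoria rueppellii", 98.1),
    Hit("Episyrphus balteatus", 89.0),
]
flagged = conflicting_hits("Sphaerophoria scripta", example_hits)
```

A non-empty result would not by itself prove contamination or misidentification; it only marks the specimen for closer inspection.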

Taxonomic coverage
Description: This dataset is composed of data relating to 203 Diptera specimens. All specimens were determined to species level. Overall, 154 species are represented in the dataset. These species belong to 41 families. Five families, Syrphidae, Muscidae, Tachinidae, Calliphoridae and Chloropidae, account for 56% of the total collected specimens and 54% of the total species represented (Fig. 3).
Figure 3. Distribution of specimens (A) and species (B), in percentage, per Diptera family present in the dataset. Families representing less than 2% of specimens/species were lumped together.
All records are also searchable within BOLD, using the search function of the database.
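The per-family percentages summarised above, including the lumping of families below 2%, can be reproduced from a list of family names with one entry per specimen (or per species). This is a minimal sketch; the function name `family_shares` and the label "Other" for the lumped group are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def family_shares(families: Iterable[str],
                  lump_below: float = 2.0) -> Dict[str, float]:
    """Return the percentage of entries per family, lumping families
    whose share is below `lump_below` percent into "Other"."""
    counts = Counter(families)
    total = sum(counts.values())
    shares: Dict[str, float] = {}
    other = 0.0
    for family, n in counts.most_common():
        pct = 100.0 * n / total
        if pct < lump_below:
            other += pct  # accumulate the small families
        else:
            shares[family] = pct
    if other > 0:
        shares["Other"] = other
    return shares
```

Passing the family of each of the 203 specimens would give the values for panel A, and passing the family of each of the 154 species would give those for panel B.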

Data resources
The InBIO Barcoding Initiative will continue sequencing Diptera for the BOLD database, with the ultimate goal of comprehensive coverage. The version of the dataset at the time of writing the manuscript is included as Suppl. materials 2, 3, 4 in the form of two text files with record information as downloaded from BOLD, one text file with the collecting and identification data in Darwin Core Standard format (downloaded from GBIF) and a FASTA file containing all sequences as downloaded from BOLD.
It should be noted that, as the BOLD database is not compliant with the Darwin Core Standard format, the Darwin Core formatted file (dwc) that can be downloaded from BOLD is not strictly Darwin Core formatted. For a proper Darwin Core formatted file, see http://ipt.gbif.pt/ipt/resource?r=ibi_diptera_i&v=1.0 (Suppl. material 3).
Column labels below follow the labels downloaded in the tsv format. Columns with no content in our dataset are left out in the list below.
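Which columns actually carry content can be determined directly from a tab-separated download. The sketch below assumes a TSV file with a header row; the function name `non_empty_columns` and the column labels in the usage example are illustrative assumptions, not the labels of the published file.

```python
import csv
import io
from typing import List, TextIO


def non_empty_columns(handle: TextIO) -> List[str]:
    """Return the column labels of a TSV file that contain at least
    one non-blank value, in their original order."""
    reader = csv.DictReader(handle, delimiter="\t")
    has_content = {name: False for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Ignore overflow fields (key None) and blank or missing values.
            if name in has_content and value and value.strip():
                has_content[name] = True
    return [name for name, filled in has_content.items() if filled]


# Usage example on an in-memory table with one empty column.
sample = "processid\tfamily\tsubfamily\nX1\tSyrphidae\t\nX2\tMuscidae\t \n"
kept = non_empty_columns(io.StringIO(sample))
```

For a file on disk, the same function can be called with an open text handle, for example one obtained from `open(path, newline="", encoding="utf-8")`.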