The InBIO Barcoding Initiative Database: DNA barcodes of Portuguese Hemiptera 01

Abstract
Background
The InBIO Barcoding Initiative (IBI) Hemiptera 01 dataset contains records of 131 specimens of Hemiptera. Most specimens have been morphologically identified to species or subspecies level and represent 88 species in total. The species of this dataset correspond to about 7.3% of continental Portuguese hemipteran species diversity. All specimens were collected in continental Portugal. Sampling took place from 2015 to 2019 and specimens are deposited in the IBI collection at CIBIO, Research Center in Biodiversity and Genetic Resources.
New information
This dataset increases the knowledge on the DNA barcodes and distribution of 88 species of Hemiptera from Portugal. Six species, from five different families, were new additions to the Barcode of Life Data System (BOLD), and barcodes were added for another twenty-five species from taxa under-represented in BOLD. All specimens have their DNA barcodes publicly accessible through the BOLD online database and the distribution data can be accessed through the Global Biodiversity Information Facility (GBIF). Eutettix variabilis and Fieberiella florii are recorded for the first time for Portugal and Siphanta acuta, an invasive species previously reported from the Portuguese Azores archipelago, is recorded for the first time for continental Portugal.


Introduction
Hemiptera is the most diverse order of non-holometabolan insects, with more than 107,000 described species (Henry 2017, Bartlett et al. 2018, Hardy 2018), being second only to the four so-called "megadiverse" holometabolan orders (Coleoptera, Lepidoptera, Diptera, and Hymenoptera), which include over 150,000 described species each (Zhang 2013). Hemipterans are among the most abundant and widespread insects on land and in freshwater habitats (Andersen 1999). The Hemiptera, or true bugs, have piercing-sucking mouthparts that constrain them to feed on liquid food (Schuh and Slater 1995, Scudder 2017, Panfilio and Angelini 2018). The primary feeding habit of Hemiptera is herbivory, but the order also includes numerous carnivores, scavengers, hematophages and some necrophages (Forero 2008, Gullan and Cranston 2014). As a result, their ecological role is strongly linked to their trophic interaction with plants, and several species are among the most important crop pests (Schuh and Slater 1995, Schaefer and Panizzi 2000, Dietrich 2009, Gullan and Martin 2009, Scudder 2017). A few hematophagous hemipterans in the subfamily Triatominae (Reduviidae) have a direct impact on human health as vectors of Chagas disease (Balczun et al. 2012).
In continental Portugal, knowledge about the order Hemiptera is fragmentary and heterogeneous. The latest diversity estimate was close to 1,100 species (Grosso-Silva 2003), but the description of new species (e.g. Emeljanov and Drosopoulos 2004, Ribes and Baena 2006, Sanchez et al. 2006), as well as the detection of previously unrecorded ones (e.g. Grosso-Silva 2004, Hollier 2005, Goula and Mata 2011, Baena and Zuzarte 2012, Foster 2019, Grosso-Silva and Ferreira 2020), has led to an estimated number of more than 1,200 species to date. However, additional studies are needed to validate the distribution of the species in general. Furthermore, the introduction or expansion of alien species from nearby areas has also occurred regularly (e.g. Valente et al. 2004, Franco et al. 2011, Sánchez 2011, Borges et al. 2013, Garcia et al. 2013, Bella 2014).
DNA barcoding is a standard molecular biology method for species identification based on the sequencing of a short mitochondrial DNA sequence that is then compared to a library of known sequences (Hebert et al. 2003). The construction of such libraries is an essential step in the process that requires the morphological identification of specimens to establish a baseline for comparisons (Kress et al. 2015, Ferreira et al. 2018). Open libraries of DNA barcodes exist, namely the Barcode of Life Data System (BOLD), but they are not yet comprehensive, especially in regions of high diversity or endemicity. Furthermore, regional variation in species genetic variability can confound identification results (Phillips et al. 2019). DNA barcodes can be used as a discovery step in a two-step approach to species delimitation (e.g. Rannala 2015), but also combined with ecological traits (Kress et al. 2015), greatly contributing to the solution of the taxonomic impediment problem in Biology (e.g. Riedel et al. 2013, Kekkonen and Hebert 2014). The usefulness of DNA barcodes has rapidly extended beyond organism and species identification; they are increasingly used in ecological and biological conservation studies, as well as in forensic applications, such as food source identification (Pečnikar and Buzan 2013, Kress et al. 2015, DeSalle and Goldstein 2019). DNA barcoding has been successfully applied to the Hemiptera (e.g. Jung et al. 2011, Park et al. 2011, Raupach et al. 2014, Havemann et al. 2018, Govender and Willows-Munro 2019), with identification success rates of 80% to 100%. It is especially useful for identifying immature and female individuals (e.g. Raupach et al. 2014, Havemann et al. 2018), which may not be reliably identified through morphological characters, or in areas where diversity remains poorly known (e.g. Govender and Willows-Munro 2019). DNA barcoding has also highlighted the existence of cryptic diversity and the need for taxonomic revisions of certain taxa (e.g. Jung et al. 2011, Park et al. 2011, Raupach et al. 2014, Havemann et al. 2018, Govender and Willows-Munro 2019).
In this context, Portuguese biodiversity is still underestimated and undersampled, although Portugal is part of the westernmost portion of the Mediterranean biodiversity hotspot. The paucity of genetic data on Portuguese biodiversity led to the creation of a DNA barcoding initiative by the Research Network in Biodiversity and Evolutionary Biology (InBIO). The InBIO Barcoding Initiative (IBI) makes use of High-Throughput Sequencing technologies to construct a reference collection of morphologically identified Portuguese specimens and respective DNA barcodes. Within IBI, invertebrates, and insects in particular, are prioritised, given their large contribution to overall biodiversity and ecosystems (e.g. Weisser and Siemann 2004, Losey and Vaughan 2006, Mata et al. 2016) and the clear shortage of DNA barcodes available in public databases (e.g. Weigand et al. 2019).
The IBI Hemiptera 01 dataset contains records of 131 specimens of Hemiptera collected in continental Portugal, all of which were identified to species level, mostly through morphological identification, for a total of 88 species and one additional subspecies. This dataset is the first IBI dataset on Hemiptera and is part of the ongoing IBI database public releases in both the Global Biodiversity Information Facility (GBIF) and the Barcode of Life Data System (BOLD) (e.g. Ferreira et al. 2020a, Ferreira et al. 2020b). We have included in this dataset the barcodes of all identified Hemiptera specimens in IBI up to December 2020. Overall, this paper contributes to the open dissemination and sharing of the distribution records and DNA barcodes of Hemiptera specimens that are part of our reference collection, to increase the available public information on a group of Portuguese invertebrates.

General description
Purpose: This dataset aims to provide a first contribution to an authoritative library of DNA barcode sequences for Portuguese Hemiptera. Such a library aims to enable DNA-based identification of species for both traditional molecular studies and DNA-metabarcoding studies. Furthermore, it constitutes an important resource for taxonomic research on Portuguese Hemiptera and their distribution.
Additional information: A total of 131 specimens of hemipterans were collected and DNA barcoded (Suppl. materials 1, 2). Fig. 1 illustrates examples of the diversity of species that are part of the dataset of distribution data and DNA barcodes of Portuguese Hemiptera 01. All sequences of cytochrome c oxidase I (COI) DNA barcodes are 658 base pairs (bp) long, except for one with 418 bp. Of the 88 species barcoded, six (7%) from five families were new to the DNA barcode database BOLD at the moment of the dataset's release (January 2021, marked with * in the Species field of Table 1). Twenty-five additional taxa (28%) from 17 families were already represented in BOLD with fewer than 10 DNA barcode sequences (marked with " in the Species field of Table 1). A few noteworthy species are included in the dataset. The record of Eutettix variabilis Hepner, 1942 is, to the best of our knowledge, the first published record for Portugal. European records for this North American species (Metcalf 1967) exist online (e.g. http://boldsystems.org/index.php/Taxbrowser_Taxonpage?taxon=+Eutettix+variabilis&searchTax=Search+Taxonomy; all European records in BOLD are based on genetic identifications). Fieberiella florii Stål, 1864, a vector of phytoplasmas, is also recorded for Portugal for the first time, with a few records known from Spain (e.g. Aguin-Pombo et al. 2007). Another important result is the record of the invasive species Siphanta acuta (Walker, 1851), recorded here for the first time for continental Portugal, although it has been previously reported from São Miguel Island in the Azores Archipelago (Borges et al. 2013).
Sampling description: The studied material was collected in 60 different localities in continental Portugal, almost half of which (47%) belong to the Bragança District (Fig. 2, Table 2). Two specimens were integrated into the IBI reference collection without further sampling information available besides being collected in Portugal.
Sampling was conducted between 2015 and 2019 in a wide range of habitats, by direct search of specimens or by sweeping the vegetation. Collected specimens were examined using a stereoscopic microscope and stored in 96% ethanol for downstream molecular analysis. Morphological identification was based on keys and descriptions from the literature (Suppl. material 3). DNA extraction and sequencing followed the general pipeline used in the InBIO Barcoding Initiative. Genomic DNA was extracted from leg tissue using the EasySpin Genomic DNA Tissue Kit (Citomed) following the manufacturer's protocol. The mitochondrial cytochrome c oxidase I (COI) barcoding fragment was amplified as two overlapping fragments (LC and BH), using two sets of primers: LCO1490 (Folmer et al. 1994) + Ill_C_R and Ill_B_F (Shokralla et al. 2015) + HCO2198 (Folmer et al. 1994), respectively. The COI gene (Folmer region) was then sequenced on a MiSeq benchtop system. OBITools (Boyer et al. 2015) was used to process the initial sequences, which were then assembled into a single 658 bp fragment using Geneious 9.
Table 2. Number of specimens and species collected per Portuguese District and corresponding percentage.
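The proportions quoted under Additional information and in the Abstract can be checked with a few lines of arithmetic. The sketch below uses only counts given in the text; the denominator of 1,200 is an assumption, taken from the approximate estimate of "more than 1,200 species" in the Introduction, so the national coverage figure is indicative only.

```python
# Counts taken from the text. The 1,200-species total is an approximation
# (the Introduction gives "more than 1,200 species"), so it is an assumption.
SPECIES_BARCODED = 88
NEW_TO_BOLD = 6
UNDER_REPRESENTED = 25
ESTIMATED_PORTUGUESE_SPECIES = 1200


def percentage(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


summary = {
    "new_to_bold": percentage(NEW_TO_BOLD, SPECIES_BARCODED),
    "under_represented": percentage(UNDER_REPRESENTED, SPECIES_BARCODED),
    "national_coverage": percentage(SPECIES_BARCODED, ESTIMATED_PORTUGUESE_SPECIES),
}

for label, value in summary.items():
    print(f"{label}: {value:.1f}%")
```

Rounded to whole numbers, the first two values agree with the 7% and 28% reported above, and the third with the roughly 7.3% coverage stated in the Abstract.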
Quality control: All DNA barcode sequences were compared against the BOLD database and the 99 top results were inspected in order to detect possible problems due to contaminations or misidentifications. Prior to GBIF submission, data were checked for errors and inconsistencies with OpenRefine 3.3 (http://openrefine.org).
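One simple check that complements the quality control described above is to confirm that each barcode has the expected length. The following is a minimal sketch, not part of the IBI pipeline: the function names and record identifiers are hypothetical, and the sequences are synthetic placeholders built only to have lengths of 658 bp and 418 bp.

```python
EXPECTED_LENGTH = 658  # full-length COI barcode reported in the text


def barcode_length(sequence: str) -> int:
    """Length of a sequence after removing any whitespace or line breaks."""
    return len("".join(sequence.split()))


def flag_unexpected_lengths(sequences: dict, expected: int = EXPECTED_LENGTH) -> dict:
    """Return {record_id: length} for sequences whose length differs from `expected`."""
    flagged = {}
    for record_id, sequence in sequences.items():
        length = barcode_length(sequence)
        if length != expected:
            flagged[record_id] = length
    return flagged


# Synthetic placeholder sequences (not real barcodes); IDs are hypothetical.
example_sequences = {
    "record_A": "ACGT" * 164 + "AC",  # 658 bp
    "record_B": "ACGT" * 104 + "AC",  # 418 bp
}

flagged = flag_unexpected_lengths(example_sequences)
print(flagged)
```

A flagged record is not necessarily an error; in this dataset one 418 bp barcode is retained, so the check is best read as a prompt for manual inspection.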
Figure 2. Map of the localities where Hemiptera samples were collected in continental Portugal. Portuguese Districts are also represented, with those referred to in Table 2 numbered.
Step description:
1. Specimens were collected in 60 different localities of continental Portugal. Fieldwork was carried out between 2015 and 2019.
2. Specimens were collected during fieldwork by direct search of specimens or by sweeping the vegetation with a hand-net and preserved in 96% alcohol.
3. Captured specimens were deposited in the IBI reference collection at CIBIO (Research Center in Biodiversity and Genetic Resources).
4. DNA barcodes were sequenced from all specimens. For this, one leg was removed from each individual, DNA was then extracted and a 658 bp COI DNA barcode fragment was amplified and sequenced. For one specimen of Ceraleptus lividus, only a 418 bp fragment was sequenced. DNA extracts were deposited in the IBI collection.
5. All obtained sequences were submitted to BOLD and GenBank databases and, for each sequenced specimen, the morphological identification, when available, was contrasted with the results of the BLAST of the newly-generated DNA barcodes in the BOLD Identification Engine.
6. Prior to submission to GBIF, data were checked for errors and inconsistencies with OpenRefine 3.3 (http://openrefine.org/).
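The comparison in step 5, between the morphological identification and the names returned by a sequence search, can be illustrated with a small concordance check. This is a hedged sketch rather than the procedure actually used: the function names, record identifiers and taxon names are invented placeholders, and the list of hit names is assumed to have been retrieved beforehand by whatever means.

```python
def normalise(name: str) -> str:
    """Lower-case a taxon name and collapse repeated whitespace."""
    return " ".join(name.strip().lower().split())


def flag_discordant(records):
    """Return IDs of records whose morphological ID is absent from the hit names.

    `records` is an iterable of (record_id, morphological_id, hit_names);
    records without a morphological identification (None) are skipped.
    """
    flagged = []
    for record_id, morphological_id, hit_names in records:
        if morphological_id is None:
            continue  # nothing to compare against
        hits = {normalise(hit) for hit in hit_names}
        if normalise(morphological_id) not in hits:
            flagged.append(record_id)
    return flagged


# Invented placeholder data: names and hits do not correspond to real results.
example_records = [
    ("record_1", "Genus alpha", ["Genus alpha", "genus  alpha"]),
    ("record_2", "Genus beta", ["Genus gamma"]),
    ("record_3", None, ["Genus alpha"]),
]

print(flag_discordant(example_records))
```

Flagged records would still need expert review, since a mismatch may reflect contamination, a misidentification, or an error in the reference library itself.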

Geographic coverage
Description: Continental Portugal.

Taxonomic coverage
Description: This dataset is composed of data relating to 131 Hemiptera specimens. All specimens were determined to species level, with three specimens further identified to subspecies level. Overall, 88 species are represented in the dataset. These species belong to 30 families. The Pentatomidae family accounts for 21% of the total collected specimens (Fig. 3A) and no other family accounts for more than 8%. The Pentatomidae and Miridae families combined account for 26% of the total taxa represented (Fig. 3B) and no other family accounts for more than 7%. Eleven families are represented by a single taxon and nine by two taxa.
Column labels below follow the labels downloaded in the tsv format from BOLD. Columns with no content in our dataset are left out in the list below.
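Family-level shares such as those summarised under Taxonomic coverage can be derived from a list with one family name per specimen (or per taxon). The sketch below assumes such a list is available; the function name is hypothetical and the example list is invented for illustration, so its percentages do not reproduce the figures reported for this dataset.

```python
from collections import Counter


def family_shares(families):
    """Return {family: percentage of entries}, ordered from most to least frequent."""
    counts = Counter(families)
    total = sum(counts.values())
    return {family: 100.0 * n / total for family, n in counts.most_common()}


# Invented example: one entry per specimen; not the real composition of the dataset.
example_families = ["Pentatomidae", "Pentatomidae", "Miridae", "Reduviidae"]

shares = family_shares(example_families)
for family, share in shares.items():
    print(f"{family}: {share:.0f}%")
```

Passing one entry per specimen gives shares of specimens, while passing one entry per taxon gives shares of taxa, which is the distinction drawn between the two panels of Fig. 3.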