DNA Barcode library of the endemic-rich avifauna of the oceanic islands of the Gulf of Guinea

Abstract
Background
The BioSTP: DNA Barcoding of endemic birds from oceanic islands of the Gulf of Guinea dataset contains records of 155 bird specimens belonging to 56 species in 23 families, representing over 80% of the diversity of the breeding landbird community. All specimens were collected on Príncipe, São Tomé and Annobón Islands between 2002 and 2021 and morphologically identified to species or subspecies level by qualified ornithologists. The dataset includes all endemic species and 3/4 of the extant endemic subspecies of the islands. This dataset is the second release by BioSTP and it greatly increases the knowledge on the DNA barcodes of Gulf of Guinea birds. All DNA extractions are deposited at Associação BIOPOLIS - CIBIO, Research Center in Biodiversity and Genetic Resources.
New information
The dataset includes DNA barcodes for all 29 endemic bird species and for 11 of the 15 extant endemic bird subspecies from the oceanic islands of the Gulf of Guinea. This is the first major DNA barcode set of African birds. The three endemic subspecies of Crithagra rufobrunnea, an island endemic with three allopatric populations within the Archipelago, are also represented. Additionally, we obtained DNA barcodes for 16 of the 21 non-endemic landbirds and for one vagrant (Sylvia communis). In total, 41 taxa were new additions to the Barcode of Life Data System (BOLD), with another 11 corresponding to under-represented taxa in BOLD. Furthermore, the submitted sequences were found to cluster in 55 Barcode Index Numbers (BINs), 37 of which were new to BOLD. All specimens have their DNA barcodes publicly accessible through the BOLD online database and GenBank.


Introduction
The Gulf of Guinea islands, off the Atlantic coast of central Africa, form the offshore part of the Cameroon Line of Volcanoes, which stretches 1,600 km from Annobón, in the SW, to the Mandara Mountains, in the NE (Ceríaco et al. 2022a). They include three oceanic islands (Príncipe, São Tomé and Annobón) and the land-bridge Island of Bioko (Fig. 1). Príncipe and São Tomé form the Democratic Republic of São Tomé and Príncipe, whereas Annobón and Bioko are part of the Republic of Equatorial Guinea.
The three oceanic islands lie in seas over 1,800 m deep and were never connected to each other or to the mainland (Ceríaco et al. 2022a). The origin of Príncipe, ca. 31 Ma, dates back to the origin of the Cameroon line; the origin of São Tomé is estimated at 15 Ma and that of Annobón at 6 Ma. However, due to a long history of intense volcanic activity, the oldest sub-aerial rocks may have been covered by subsequent lava flows. Volcanic activity persisted until as recently as 36 Ka (Barfod and Fitton 2014).
The islands share an oceanic equatorial climate with extensive rainy seasons and very high humidity throughout most of the year (Ceríaco et al. 2022a). On Príncipe and São Tomé, there is a main dry season from mid-May to late August and a short dry season that lasts for a few weeks, sometime between December and February. On Annobón, south of the Equator, there is a single, but longer, dry season from mid-May to the end of October. The predominant warm and moist SW winds are intercepted by the high relief of the islands, creating a north-south divide in annual precipitation that, on São Tomé, ranges from over 7,000 mm in the southwest to less than 600 mm in the north. Mean annual temperatures are above 25ºC at sea level and decrease with altitude, where mean maximum temperatures can be similar, but the absolute minimums are much lower, falling below 10ºC at 700 m a.s.l.
The three oceanic islands were uninhabited at the time of discovery by the Portuguese in the late 15th century (Muñoz-Torrent et al. 2022). The Gulf of Guinea islands were then almost entirely covered by tropical moist broadleaf forests (Dauby et al. 2022), which are stratified by altitude into lowland, montane and mist forests. The latter type is absent from Príncipe. On Annobón, these altitudinal belts are compressed into an altitudinal range that only goes up to 700 m and a similar phenomenon seems to occur on the southern peaks of São Tomé (Ogonovszky 2003). On São Tomé, species such as Erica thomensis (Ericaceae) and Lobelia barnsii (Campanulaceae), found in clearings close to the highest peak, are representatives of an incipient montane grassland (Figueiredo et al. 2011). On São Tomé and on Annobón, the strong rain-shadow effect in the north creates a distinct dry forest, currently highly altered by human action (Diniz and Matos 2002, Lebigre 2003). The original vegetation also includes coastal sand dune communities and mangroves, even though these have always covered tiny portions of the islands (Diniz and Matos 2002, Dauby et al. 2022).
The three oceanic islands of the Gulf of Guinea host an outstandingly high number of endemic species across taxonomic groups (Ceríaco et al. 2022b, Melo et al. 2022a). Their avifauna is particularly distinctive. With at least 29 endemic bird species in an area of just over 1,000 km², these oceanic islands have the highest concentration of endemic birds in the world (Melo et al. 2022b). Many of the endemics are threatened (IUCN 2023), making the islands a top global priority for conservation. The south-western forests of São Tomé were ranked as the second most important for bird conservation in Africa (Collar and Stuart 1988) and, more recently, the moist lowland forests of the three islands were considered the third most important ecoregion for the conservation of forest birds globally (Buchanan et al. 2011). The international recognition of the global relevance of these islands for biodiversity has led to a steady build-up of conservation efforts, which nevertheless remain insufficient (de Lima et al. 2022). Each island has its own protected area: Príncipe Obô Natural Park, São Tomé Obô Natural Park (both since 2006) and Annobón Nature Reserve (since 2000). Additionally, Príncipe was declared a UNESCO Biosphere Reserve in 2012. On Príncipe and on São Tomé, the parks cover around one third of each island, including most of the remaining native forest, whereas Annobón's protected area covers the whole island. In a global analysis including more than 175,000 protected areas, the natural park of São Tomé came up as the 17th most irreplaceable area for the conservation of threatened vertebrates, whereas the small natural park of Príncipe came in 265th place (Le Saout et al. 2013). The diversity and threat levels of the endemic birds were a major factor behind these high rankings.
The avifauna of the oceanic islands of the Gulf of Guinea comprises 148 confirmed species, of which 66 are resident, six are breeding seabirds and two are feral species (de Lima and Melo 2021). The breeding avian community is characterised by high phylogenetic diversity, with 33 families represented. Endemism is restricted to the resident landbirds, which include 29 endemic species and 16 subspecies from species with mainland populations, most of which are restricted to a single island. This includes the endemic subspecies of the Olive Ibis Bostrychia olivacea from Príncipe (ssp. rothschildi) that became extinct during the 20th century (de Lima and Melo 2021). Additionally, the endemic Príncipe Seedeater Crithagra rufobrunnea has diversified into three endemic subspecies within the Archipelago. Excluding the 15 resident species that are likely non-native, 82% of the confirmed resident native birds are endemic taxa.
The striking scarcity of genetic data associated with the high biodiversity found in São Tomé and Príncipe prompted the creation of a DNA barcoding initiative by the Research Network in Biodiversity and Evolutionary Biology - InBIO (Associate Laboratory). DNA barcoding is a method that aims to identify organisms, based on a short DNA sequence previously sequenced from morphologically identified specimens (Hebert et al. 2003). This requires the construction of comprehensive reference collections of DNA sequences that represent the existing biodiversity (Kress et al. 2005, Baird et al. 2011, Ferreira et al. 2018). DNA barcoding can also be used as a first step in new species discovery and, as such, can help address the taxonomic impediment problem (e.g. Kekkonen and Hebert 2014). The BioSTP: DNA Barcoding for São Tomé and Príncipe makes use of Next Generation Sequencing technologies (NGS) to develop a reference collection of DNA barcoding sequences. The first DNA barcode library has just been released for all amphibians and reptiles of São Tomé and Príncipe (Ceríaco et al. 2023). The bird dataset presented here is the first to cover taxa of the three oceanic islands of the Gulf of Guinea.
The DS-IAAST BioSTP: DNA Barcoding of endemic birds from oceanic islands of the Gulf of Guinea dataset contains records of 155 specimens of birds collected on São Tomé, Príncipe and Annobón islands, all of which were morphologically identified to species or subspecies level, for a total of 56 species. All specimens have their DNA barcodes made publicly available in the Barcode of Life Data System (BOLD) and GenBank. We have included in this dataset the barcodes of all identified bird specimens in DS-IAAST up to October 2021. Overall, this paper is a contribution to sharing and publicly disseminating the DNA barcodes of specimens from our reference collection to increase the available information on the Gulf of Guinea bird fauna.

General description
Purpose: This dataset aims to contribute to the DNA barcode sequence library for the endemic birds of the oceanic islands of the Gulf of Guinea. Such a library should facilitate DNA-based identification of species for both traditional molecular studies and DNA metabarcoding studies and constitute a valuable resource for taxonomic and ecological research in the region.

Additional information:
We obtained the full DNA barcode sequence of cytochrome c oxidase I (COI; 658 bp Folmer region) for 155 specimens (Table 1, Fig. 1). Fig. 2 and Fig. 3 illustrate examples of the diversity of species that are part of the dataset. Sequences are distributed in 55 Barcode Index Numbers (BINs), 37 of which are unique to this dataset. Genetic distances between species of the same genus range from 0.46% between Crithagra rufobrunnea and Crithagra concolor (two endemics) to 15.62% between Columba malherbii and Columba larvata. Although they share the same BINs, DNA barcoding also seems useful for distinguishing the closely related sister species pairs Zosterops leucophaeus (Príncipe) and Zosterops lugubris (São Tomé), and Lamprotornis ornatus (Príncipe) and Lamprotornis splendidus (Príncipe), as well as the subspecies pair Corythornis cristatus nais (Príncipe) and Corythornis cristatus thomensis (São Tomé): despite the small genetic distances observed (p-distances = 0.61%, 0.77% and 0.46%, respectively), each pair exhibited consistent differences. In contrast, the three subspecies of Crithagra rufobrunnea could not be identified using DNA barcoding, based on COI sequences. Nevertheless, there is evidence that justifies their current treatment as distinct subspecies, based on the reduced gene flow between the three allopatric populations and observable phenotypic differentiation (Melo 2007, Melo et al. 2022b). In the case of Columba larvata subspecies, DNA barcodes of two of the three subspecies were sequenced and, although two distinct BINs were attributed, it does not seem possible to distinguish C. l. principalis from C. l. simplex specimens.
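The p-distances quoted above are uncorrected pairwise distances: the proportion of aligned sites at which two sequences differ. The following is a minimal sketch of that calculation, assuming the two barcodes are already aligned to the same length; the function name `p_distance` and the choice to skip gaps and ambiguous bases are illustrative assumptions, not the procedure used to produce the values reported here.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are
    skipped (an assumption of this sketch, not of the source).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in unambiguous or b not in unambiguous:
            continue  # ignore gaps and ambiguity codes
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Hypothetical 658 bp alignment with three mismatches (illustration only)
reference = "A" * 658
variant = "C" * 3 + "A" * 655
print(f"{p_distance(reference, variant) * 100:.2f}%")
```

As a purely arithmetic illustration, three mismatches across a full 658 bp alignment give 3/658, or roughly 0.46%, which shows how few substitutions separate taxa at the lower end of the distances reported.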
Despite repeated attempts, we failed to obtain the full DNA barcode sequence for Treron sanctithomae, of which we could only obtain and sequence a smaller fragment: 325 bp using primers FwhF1

Project description
Title: The name "BioSTP: DNA Barcoding of the endemic birds from oceanic islands of the Gulf of Guinea dataset" refers to the data release of DNA barcodes and distribution data of birds in BOLD Systems.
Nevertheless, only a few areas currently have no forest cover, such as the fire-prone anthropogenic savannahs in the north, the few urban centres spread mostly along the coast and northeast, the horticultural areas at higher altitudes, coconut groves along the coast and oil palm monocultures in the south. Most agricultural areas correspond to agroforestry systems with dense canopy cover, such as cocoa and coffee shade plantations or forest gardens, and extensive portions of the island are also covered by secondary growth.

Personnel
Annobón Island is 17 km² and lies 340 km west of the mainland and 180 km southwest of São Tomé Island (Ceríaco et al. 2022a). The central part of the Island comprises the crater of Quioveo (640 m a.s.l.) and Santamina, the highest point (700 m a.s.l.). Other geologic landmarks include Pico do Fogo, a trachyte plug (450 m a.s.l.) and Lago A Pot, a small crater lake (220 m a.s.l.). Only three valleys contain permanent streams and the north has savannah-like formations and dry bush, which are surrounded by dry lowland forest to the south (de Lima et al. 2022). The south of the Island is covered by taller mist forest laden with epiphytes. The vegetation is reported to have been less modified by humans than on São Tomé and Príncipe and there is little sign of cocoa and coffee plantations, which are now abandoned and colonised by secondary alien-rich regrowth. Nevertheless, the north has been most affected and most flat areas up to the inside of the crater are cultivated.
Design description: A total of 155 specimens were collected in the field, morphologically identified and DNA barcoded.
Funding: The present study was funded by the project TROPIBIO NORTE-01-0145-FEDER-000046, supported by Norte Portugal Regional Operational Programme (NORTE2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF) and was carried out in collaboration with BirdLife International, in the framework of the partnership agreement between BirdLife International and BIOPOLIS association to promote informed, evidence-based biodiversity conservation

Sampling methods
Description: Oceanic islands of the Gulf of Guinea: São Tomé, Príncipe and Annobón
Sampling description: Blood samples were collected non-destructively by puncturing the brachial vein of mist-netted birds, except for Bostrychia bocagei, for which feathers were obtained from a live juvenile captured by a local hunter. Samples were preserved in 96% ethanol and kept at room temperature in the field and at -80ºC in the lab.
DNA was extracted using the QIAamp DNA Micro Kit, which is designed to extract higher concentrations of genetic material from samples with small amounts of DNA. DNA amplification was performed using two different primer pairs that amplify partially overlapping fragments (LC + BH) of the 658 bp barcoding region of the COI mitochondrial gene. We used the primers FwhF1 all modified with Illumina adaptors. PCRs were performed in 10 µl reactions, containing 5 µl of Multiplex PCR Master Mix (Qiagen, Germany), 0.3 µl of each 10 mM primer and 1-2 µl of DNA, with the remaining volume in water. PCR cycling conditions consisted of an initial denaturation at 95ºC for 15 min, followed by 45 cycles of denaturation at 95ºC for 30 sec, annealing at 45ºC for 45 sec and extension at 72ºC for 45 sec, and a final elongation step at 60ºC for 10 min.
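The reaction recipe above leaves the water volume implicit ("the remaining volume"). A minimal sketch of that arithmetic follows; the function name `pcr_mix` is hypothetical, and the sketch assumes each reaction contains exactly two primers (one forward, one reverse), which the text implies but does not state outright.

```python
def pcr_mix(dna_ul, total_ul=10.0, master_mix_ul=5.0, primer_ul=0.3, n_primers=2):
    """Per-reaction volumes in µl, with water making up the remainder.

    Defaults follow the volumes given in the text; n_primers=2 is an
    assumption of this sketch (one forward and one reverse primer).
    """
    water_ul = total_ul - master_mix_ul - primer_ul * n_primers - dna_ul
    if water_ul < 0:
        raise ValueError("components exceed the total reaction volume")
    return {
        "master_mix": master_mix_ul,
        "each_primer": primer_ul,
        "dna": dna_ul,
        "water": round(water_ul, 3),  # rounding hides floating-point noise
    }


# Water needed at the two ends of the 1-2 µl DNA range
for dna in (1.0, 2.0):
    print(dna, pcr_mix(dna)["water"])
```

Under these assumptions the water volume is 3.4 µl when 1 µl of DNA is used and 2.4 µl when 2 µl is used.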
Successful amplification was validated through 2% agarose gel electrophoresis and samples selected for sequencing proceeded to a second PCR, where Illumina P5 and P7 adapters with custom 7 bp long barcodes were attached to each PCR product. The index PCR was performed in a volume of 10 µl, including 5 µl of KAPA HiFi PCR Kit (KAPA Biosystems, USA), 0.5 µl of each 10 mM indexing primer and 2 µl of diluted PCR product (usually 1:4). PCR cycling conditions were as before, except that only 10 cycles were performed and the annealing temperature was 55ºC. The amplicons were purified using AMPure XP beads (New England Biolabs, USA) and quantified using NanoDrop 1000 (Thermo Scientific, USA). Clean PCR products were then pooled equimolarly per fragment. Each pool was quantified with KAPA Library Quantification Kit Illumina® Platforms (KAPA Biosystems, USA) and the 2200 Tapestation System (Agilent Technologies, California, USA) was used for fragment length analysis prior to sequencing (Paupério et al. 2018). DNA sequencing was done at CIBIO facilities on an Illumina MiSeq benchtop system, using a V2 MiSeq sequencing kit (2 x 250 bp).
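Equimolar pooling means adding the same number of molecules from each cleaned PCR product, so more concentrated samples contribute smaller volumes. The sketch below illustrates the usual conversion from mass concentration to molarity for double-stranded DNA, taking about 660 g/mol per base pair as the standard approximation. The function names, the sample concentrations, the 1,000 bp length and the 100 fmol target are all hypothetical; the text does not report the values actually used.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (standard approximation)


def ng_per_ul_to_nm(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a dsDNA concentration from ng/µl to nM."""
    if conc_ng_per_ul <= 0 or length_bp <= 0:
        raise ValueError("concentration and length must be positive")
    return conc_ng_per_ul / (AVG_BP_MASS * length_bp) * 1e6


def equimolar_volumes(concs_ng_per_ul: dict, length_bp: int, target_fmol: float) -> dict:
    """Volume (µl) of each sample that contains target_fmol of amplicon.

    Uses the identity 1 nM = 1 fmol/µl.
    """
    return {
        sample: target_fmol / ng_per_ul_to_nm(conc, length_bp)
        for sample, conc in concs_ng_per_ul.items()
    }


# Hypothetical readings for two samples of the same fragment
readings = {"sample_A": 66.0, "sample_B": 33.0}
for sample, volume in equimolar_volumes(readings, length_bp=1000, target_fmol=100).items():
    print(sample, round(volume, 2))
```

In this made-up case the sample at half the concentration contributes twice the volume, which is the whole point of pooling by molarity rather than by volume.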
Illumina sequencing reads were processed using OBITools (Boyer et al. 2015) and VSEARCH (Rognes et al. 2016). Briefly, paired-end reads were aligned, collapsed into exact sequence variants, filtered by length, denoised and checked for chimeras. The resulting sequences from both LC and BH fragments of each sample were further assembled using CAP3 (Huang and Madan 1999) to produce a single 658 bp contig per sample.
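Two of the steps named above, collapsing reads into exact sequence variants and filtering by length, can be illustrated in a few lines. This is a plain-Python sketch of the idea only: it does not call OBITools or VSEARCH, it omits read merging, denoising and chimera checking, and the function name and length bounds are hypothetical.

```python
from collections import Counter


def collapse_and_filter(reads, min_len, max_len):
    """Collapse identical reads into variants with counts, then keep
    only variants whose length lies within [min_len, max_len].

    Returns (sequence, count) pairs, most abundant first.
    """
    counts = Counter(read.upper() for read in reads)
    kept = [
        (seq, n) for seq, n in counts.items()
        if min_len <= len(seq) <= max_len
    ]
    # Sort by decreasing abundance, then alphabetically for stable output
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


# Toy reads (hypothetical); real amplicons are hundreds of bases long
toy_reads = ["ACGT", "acgt", "ACGTA", "AC"]
print(collapse_and_filter(toy_reads, min_len=4, max_len=5))
```

Sorting by abundance puts the dominant variant of each sample first, which is typically the candidate barcode, while low-count variants are the ones that denoising and chimera checks would scrutinise.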
Quality control: All DNA barcode sequences were compared against the BOLD database and the 99 top hits were inspected in order to detect possible issues due to contamination or misidentification.
Step description: 1. Blood samples were collected non-destructively by puncturing the brachial vein of mist-netted birds, except for Bostrychia bocagei for which feathers were obtained from a live juvenile captured by a local hunter.

Collection data
Collection name: BioSTP: DNA Barcoding of endemic birds from oceanic islands of the Gulf of Guinea.
Curatorial unit: DNA extractions - 1 to 156. All records are also searchable within BOLD, using the search function of the database. The version of the dataset, at the time of writing the manuscript, is included as Suppl. materials 1, 2, 3. Column labels below follow the labels downloaded in the tsv format. Columns with no content in our dataset are left out in the list below.
identification_provided_by: Full name of primary individual who assigned the specimen to a taxonomic group.

Figure 1. Map of the localities where bird samples were collected in the oceanic islands of the Gulf of Guinea: Príncipe (left), São Tomé (centre) and Annobón (right). Top right: location of the Gulf of Guinea islands in Africa.
Usage licence: Creative Commons Public Domain Waiver (CC-Zero)

Data resources
Data package title: BioSTP: DNA Barcoding of endemic birds from oceanic islands of the Gulf of Guinea
Number of data sets: 1
Data set name: DS-IAAST BioSTP: endemic birds
Download URL: https://dx.doi.org/10.5883/DS-IAAST
Data format: dwc, xml, tsv, fasta
Description: The BioSTP: DNA Barcoding of endemic birds from oceanic islands of the Gulf of Guinea dataset can be downloaded from the Public Data Portal of BOLD in different formats (data as dwc, xml or tsv and sequences as fasta files). Alternatively, BOLD users can log in and access the dataset via the Workbench platform of BOLD.

identification_method: The method used to identify the specimen.
voucher_status: Status of the specimen in an accessioning process (BOLD controlled vocabulary).
tissue_type: A brief description of the type of tissue or material analysed.
collectors: The full or abbreviated names of the individuals or team responsible for collecting the sample in the field.
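Since the column labels follow the tsv download, a downloaded file can be summarised with the Python standard library alone. The sketch below tallies the values of one column. The function name `tally_column` and the three sample rows are made up for illustration, and only column labels named in this section are used; a real file would be opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory text shown here.

```python
import csv
import io
from collections import Counter


def tally_column(handle, column):
    """Count the non-empty values of one column in a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column not found: {column}")
    return Counter(row[column] for row in reader if row[column])


# Hypothetical rows, for illustration only (not records from the dataset)
sample_tsv = (
    "sampleid\ttissue_type\tcollectors\n"
    "X001\tblood\tCollector A\n"
    "X002\tblood\tCollector B\n"
    "X003\tfeather\tCollector A\n"
)
print(tally_column(io.StringIO(sample_tsv), "tissue_type"))
```

The same call with `"collectors"` or `"voucher_status"` gives a quick consistency check on the controlled-vocabulary fields before any sequence analysis.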

Table 1. List of bird taxa from the oceanic islands of the Gulf of Guinea collected and DNA barcoded within this project. + indicates endemic species. ++ indicates endemic subspecies. # indicates species with new BINs. NA indicates no data. Taxonomy and nomenclature follow de Lima and Melo (2021).
(Vamos et al. 2017) + C_R (Shokralla et al. 2015) for LC amplification. Likewise, no DNA barcode could be obtained from its sister species, Treron calvus, which has an endemic subspecies on Príncipe. For this taxon, we only managed to obtain a sequence of a nuclear copy amplified with primers BF3 (Elbrecht et al. 2019) + BR2 (Elbrecht and Leese 2017), identified as such by the presence of a stop codon. Other researchers had previously generated DNA sequences of COI fragments of T. calvus:
(Ceríaco et al. 2022a) Gulf of Guinea: Príncipe, São Tomé, Annobón
Príncipe Island is 139 km² and lies 220 km west of the coast of Central Africa and 146 km northeast of São Tomé Island (Ceríaco et al. 2022a). It comprises a relatively flat, low-lying basalt platform in the north, contrasting with a rugged and mountainous south, where the main peaks are Pico do Príncipe (948 m a.s.l.), Mencorne (935 m a.s.l.) and Carriote (830 m a.s.l.). Once completely covered in rainforest, most accessible areas have been cleared and planted, even though some have regrown and are now covered with secondary forest (de Lima et al. 2022). The remaining native forest is mostly restricted to rugged terrain, including some lowland forest in the south and the montane forest around Pico do Príncipe.
São Tomé Island is 857 km² and lies 255 km west of Gabon and 150 km southwest of Príncipe Island (Ceríaco et al. 2022a). The Equator passes through Ilhéu das Rolas, an islet just south of the main island. The Island is cone-shaped, typical of islands with recent volcanism. It reaches its highest point at Pico de São Tomé (2,024 m a.s.l.), even though there are many high peaks and volcanic plugs spread across the Island. Native vegetation is mostly restricted to the rugged centre and southwest (de Lima et al. 2022).
Study area description: São Tomé and Príncipe, which is funded by the European Union through the 'Landscape Management in São Tomé and Príncipe' project (ENV/2020/420-182), has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No 854248 and by the project NORTE-01-0246-FEDER-000063, supported by Norte Portugal Regional Operational Programme (NORTE2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). SF and VM were funded by the Fundação para a Ciência e a Tecnologia through the programme 'Stimulus of Scientific Employment,
sampleid: Identifier for the sample being sequenced, i.e. DS-IAAST catalogue number at CIBIO-InBIO, Porto University. Often identical to the "Field ID" or "Museum ID".