The Mt Halimun-Salak Malaise Trap project - releasing the most species rich DNA Barcode library for Indonesia

Abstract The Indonesian archipelago features an extraordinarily rich biota. However, the actual taxonomic inventory of the archipelago remains highly incomplete and there is hardly any significant taxonomic activity that utilises recent technological advances. The IndoBioSys project was established as a biodiversity information system aiming at, amongst other goals, creating inventories of the Indonesian entomofauna using DNA barcoding. Here, we release the first large-scale assessment of the megadiverse insect groups that occur in the Mount Halimun-Salak National Park, one of the largest tropical rain-forest ecosystems in West Java, with a focus on Hymenoptera, Coleoptera, Diptera and Lepidoptera collected with Malaise traps. From September 2015 until April 2016, 34 Malaise traps were placed in different localities in the south-eastern part of the Halimun-Salak National Park. A total of 4,531 specimens were processed for DNA barcoding and, in total, 2,382 individuals produced barcode compliant records, representing 1,195 exclusive BINs or putative species in 98 insect families. A total of 1,149 BINs were new to BOLD. Of the 1,195 BINs detected, 804 BINs were singletons and more than 90% of the BINs incorporated fewer than five specimens. The astonishing heterogeneity of BINs, with as few as 1.1 specimens successfully processed per exclusive BIN of Diptera, shows that the cost/benefit ratio of discovering new species in those areas is very low. In four families of Chalcidoidea, a superfamily of the Hymenoptera, the number of discovered species was higher than the number of species known from Indonesia, suggesting that our samples contain many species that are new to science. These numbers show how quickly molecular pipelines can contribute substantially to the objective inventorying of the fauna, giving us a good picture of how diverse tropical areas might be.


Introduction
The Indonesian archipelago features an extraordinarily rich biota that is, amongst other factors, derived from its sheer size and geographic position, linking the Oriental and Australian regions. This transition was first described in detail by Wallace (1860), who laid the foundation for the discipline of biogeography in this region. Our understanding of the biogeography of the region has steadily advanced since then, increasingly embracing new technology and interdisciplinary research approaches (see Lohman et al. 2011). However, the actual taxonomic inventory of the archipelago remains highly incomplete (see Schmidt 2015) and there is hardly any significant taxonomic activity that utilises recent technological advances (but see Barlow and Woiwod 1990, Riedel et al. 2013, Riedel et al. 2014, Wibowo et al. 2017, Hubert et al. 2015, Dahruddin et al. 2016, Cancian de Araujo et al. 2018). Large-scale databasing, in particular of hyperdiverse invertebrates of the region, is also in its infancy. To date, GBIF features only 147,463 occurrence records published for Indonesia, covering 13,210 species, which is surprisingly few compared, for example, with Germany with 37,917,568 occurrences and 16,742 species (GBIF, accessed on 1 July 2018). At the same time, vast areas of supposedly high biodiversity disappear every year (Brooks et al. 2002, Curran et al. 2004, Gaveau et al. 2013, Wilcove et al. 2013, Abood et al. 2014, Margono et al. 2014) and with them, possibly thousands of species never formally known to mankind, which also means a significant loss of ecosystem services and of knowledge of potentially useful compounds (see Hooper et al. 2005, Loreau 2009, Norris 2011).
The Indonesian and German ministries of Research and Education have therefore provided funding to establish a biodiversity information system (IndoBioSys) that integrates occurrence databasing, species discovery and species characterisation, using morphology and DNA sequence data, specimen vouchering, as well as integrated tools for the discovery of substances of potential use for society. IndoBioSys is, therefore, a case study and foundation for the large-scale exploration of Indonesian species diversity. Moreover, IndoBioSys could be a foundation for the empirical and objective scientific assessment of species distribution patterns across the archipelago, which is needed, for example, for conservation priority setting.
One work package of the IndoBioSys project was an assessment of the species diversity of the hyperdiverse insect fauna of the Mount Halimun-Salak National Park in West Java, with a focus on sampling with Malaise traps. The area has been recognised as one of the largest tropical rain-forest ecosystems left in Java. It was designated as a National Park in 2003 and currently covers about 113,357 hectares. Malaise trapping (Malaise 1937, Townes 1972) is a method that allows standardised sampling of flying insects, including a number of highly diverse groups of minute species, e.g. in the Diptera and Hymenoptera.
Subsets of the samples obtained were submitted to a well-established pipeline employing DNA barcoding (Hebert et al. 2003, Ivanova et al. 2006) in order to estimate species diversity (see Ratnasingham and Hebert 2007, Ratnasingham and Hebert 2013) and to obtain data for future beta diversity studies with data from other localities.
Here, we release these data with an analysis of their taxonomic content, an approximation of the species diversity encountered and an evaluation of the novelty of the data with respect to publicly available data from the Barcode of Life Data Systems (BOLD).

Materials and Methods
A summary of fieldwork and laboratory procedures employed in the IndoBioSys project was presented by . Methodological steps specific to the work package presented here are described below. The traps were run for about 120 days in total and the collecting bottles were changed monthly. The collecting liquid was 300 ml of 96% ethanol in each bottle.

Fieldwork and samples processing
The samples were taken to the IndoBioSys Indonesian laboratory at the Museum Zoologicum Bogoriense (MZB) in Cibinong, West Java. Using a 3 mm mesh sieve, they were divided into two fractions according to the size of the animals, with the smaller specimens passing through the sieve into a collecting tray.
This fractioning is important for optimising the sorting process as well as for separating the specimens that will be sent entirely for molecular laboratory processing ("voucher recovery pipeline") from the ones that are large enough for a procedure where only one or more legs are removed from the voucher for laboratory use ("leg picking pipeline"). Most of the fractions were sent to the IndoBioSys laboratory at SNSB-ZSM in Munich, where they were sorted to order and family level.
Given the enormous number of specimens (we estimated over 300,000 specimens of invertebrates collected during the project), the orders Coleoptera and Hymenoptera were chosen as the main target groups for the present analysis. Selected groups of Diptera, in particular Syrphidae and Phoridae, will be dealt with in a separate data release. Here, we present results for only a few Diptera specimens randomly picked from the samples. For Coleoptera and Hymenoptera, specimens were taken quantitatively from the samples except in the case of long series of morphologically similar individuals, in which case we took only representatives. In these cases, a smaller number of specimens that represents the morphological diversity of the series was chosen in order to prevent cryptic species bias. The number of specimens taken was determined on a case by case basis.
Figure: Spatial and temporal specimens and molecular access success distribution on Mt Halimun-Salak National Park.
Lepidoptera, another target group of the IndoBioSys project, were collected using a different method, as described earlier . Some Geometridae that were collected using Malaise traps and that were suitable for morphological analysis were processed and included in the present study. A specific release of the geometrid barcode data is currently being prepared (OS in prep.).
All specimens that were not further processed were repatriated to the MZB as ethanol samples. All processed specimens were returned to MZB as dry mounted and labelled voucher specimens (Fig. 2 and Fig. 3).
All specimen data are accessible in BOLD as a single citable dataset (dx.doi.org/10.5883/DS-IDBMTP). The data include collecting locality, geographic coordinates, elevation, collector, one or more digital images, identifier and voucher depository. Sequence data can be obtained through BOLD and include a detailed LIMS report, primer information and access to trace files. The sequences are also available on GenBank (accession numbers MH926363-MH929079).

Data analysis
Locality information and molecular data from the Malaise trapping programme were downloaded from the BOLD IndoBioSys campaign projects. The downloaded records were split by trap and by insect order into separate Excel worksheets for the analysis of spatial distribution and diversity. Here, we focus only on the orders Hymenoptera, Coleoptera, Diptera and Lepidoptera.
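The split of records by trap and by insect order described above can also be reproduced programmatically. The following is a minimal sketch, not the procedure used in the project: the field names `trap`, `order` and `bin_uri` and the toy records are hypothetical and would have to be mapped to the columns of the actual BOLD download.

```python
from collections import defaultdict


def summarise_by_trap_and_order(records):
    """Count specimens and distinct BINs for every (trap, order) pair."""
    specimens = defaultdict(int)
    bins = defaultdict(set)
    for rec in records:
        key = (rec["trap"], rec["order"])
        specimens[key] += 1
        # Assumption: a record without a BIN did not yield a
        # barcode compliant sequence and adds no BIN to the count.
        if rec.get("bin_uri"):
            bins[key].add(rec["bin_uri"])
    return {
        key: {"specimens": n, "bins": len(bins[key])}
        for key, n in specimens.items()
    }


# Toy records, invented for illustration only (not project data).
records = [
    {"trap": "T01", "order": "Hymenoptera", "bin_uri": "BIN-A"},
    {"trap": "T01", "order": "Hymenoptera", "bin_uri": "BIN-A"},
    {"trap": "T01", "order": "Hymenoptera", "bin_uri": "BIN-B"},
    {"trap": "T01", "order": "Coleoptera", "bin_uri": None},
    {"trap": "T02", "order": "Diptera", "bin_uri": "BIN-C"},
]

summary = summarise_by_trap_and_order(records)
for (trap, order), stats in sorted(summary.items()):
    print(trap, order, stats["specimens"], stats["bins"])
```

Keeping the summary keyed by (trap, order) mirrors the one-worksheet-per-trap-and-order layout while leaving all records in a single structure.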

Results
A total of 4,531 specimens were prepared for DNA barcoding. Of these, we obtained cox1-5P sequences from 2,732 individuals. Sequences from 2,598 of these individuals were longer than 300 base pairs. In total, 2,380 individuals produced barcode compliant records (Table 1). The success rate was therefore comparatively low, with only 60.5% on average, varying between the samples from 2.7% to 100% (Fig. 1). These 2,380 individuals represent 1,197 exclusive BINs or putative species. They could be assigned to 98 different insect families (Table 2). Gunung Botol had the highest success rate (80.9%) and Sukamantri the lowest (32.2%) in terms of processed specimens producing barcode compliant sequences. Of those 1,197 exclusive BINs, only 46 BINs (3.8%) are not new to BOLD. Only 15 BINs were recovered with more than 10 specimens each. A total of 804 BINs were singletons and more than 90% of the BINs were recorded with fewer than five specimens (Fig. 4).

Table 1. Specimens and BINs distribution per order.

Table 2. Specimens and BINs distribution per family.
The highest diversity of BINs was found in Hymenoptera (712 BINs), followed by Coleoptera (398), Diptera (53) and Lepidoptera (34). The diversity per order was always high, with two or fewer individuals per BIN on average. The diversity per family was also impressive, with 50% of the families being composed of BINs represented by singletons or doubletons.
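Summary figures of this kind (number of singletons, share of BINs with fewer than five specimens, specimens per BIN) follow directly from a list of BIN assignments, one per barcode compliant specimen. The sketch below illustrates the calculation on invented toy labels; the function name and the labels are hypothetical. Only the per-order BIN counts and the 46 previously known BINs in the final cross-check are taken from the Results.

```python
from collections import Counter


def bin_abundance_summary(bin_per_specimen):
    """Summarise how specimens are distributed over BINs."""
    if not bin_per_specimen:
        raise ValueError("at least one BIN assignment is required")
    counts = Counter(bin_per_specimen)  # specimens recorded per BIN
    n_bins = len(counts)
    return {
        "specimens": len(bin_per_specimen),
        "bins": n_bins,
        "singletons": sum(1 for c in counts.values() if c == 1),
        "share_fewer_than_five": sum(1 for c in counts.values() if c < 5) / n_bins,
        "specimens_per_bin": len(bin_per_specimen) / n_bins,
    }


# Toy BIN labels, invented for illustration only (not project data).
toy = ["BIN-A", "BIN-A", "BIN-B"] + ["BIN-C"] * 5 + ["BIN-D"]
result = bin_abundance_summary(toy)
print(result)

# Cross-check with figures reported in the Results.
bins_per_order = {"Hymenoptera": 712, "Coleoptera": 398, "Diptera": 53, "Lepidoptera": 34}
total_bins = sum(bins_per_order.values())
share_known = round(100 * 46 / total_bins, 1)  # BINs not new to BOLD, in percent
print(total_bins, share_known)
```

The per-order counts sum to the 1,197 exclusive BINs given in the Results, and 46 of 1,197 corresponds to the reported 3.8%.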

Discussion
Given the discrepancy in the sampling effort, it was not possible to compare taxonomic disparities amongst the four sampling areas. The sampling was focused on Cikaniki due to the better conservation of the forest in this area and the presence of the research station that provided better infrastructure to the scientific staff.
Despite collecting at just four locations in one nature reserve, the IndoBioSys Malaise trap project alone has added 1,149 new BINs to BOLD. This shows how quickly molecular pipelines can contribute substantially to objectively inventorying the fauna of megadiverse areas. It also allows us to estimate the enormous diversity of tropical areas like the Halimun-Salak National Park. The astonishing heterogeneity of BINs (see Fig. 5 and Table 2), with as few as 1.1 specimens successfully processed per exclusive BIN of Diptera, shows the magnitude of the diversity that is waiting to be discovered in the tropics. Only 15% of the specimens that produced DNA barcode compliant records belong to putative species that have more than five specimens processed, and 81.7% of all BINs are represented by singletons or doubletons. This makes the cost/benefit ratio of discovering new species in those areas very low, even with the low success rates of the molecular processing that this project has been facing. Such high failure rates have not been encountered in similar projects of the ZSM and we suspect that the poor quality of the ethanol used for the collecting bottles might have been the crucial issue. The supraspecific taxonomic diversity was relatively high considering the number of specimens analysed. As a comparison, Hendrich and collaborators, in their release of a comprehensive DNA barcode database for Central European beetles (Hendrich et al. 2014), sequenced 15,948 specimens to obtain 97 families, meaning that, on average, a family in the database is represented by 164.4 processed specimens. In the present paper, we recorded 39 families of Coleoptera after processing only 788 specimens, corresponding to one family per 20.7 specimens on average.
Therefore, even considering that this discovery process is not linear, it is quite clear that we are far from reaching the plateau of the accumulation curve for families and that there are many more taxa to be discovered at Halimun-Salak National Park, especially at the species level.
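To make the notion of an accumulation curve concrete, the sketch below computes a randomised curve: the mean number of distinct taxa observed after 1, 2, ..., n specimens, averaged over random specimen orderings. It is an illustration under stated assumptions and not an analysis performed in this study; the function name, the number of permutations and the toy family labels are hypothetical.

```python
import random


def accumulation_curve(labels, permutations=200, seed=0):
    """Mean number of distinct labels after 1..n specimens over random orderings."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    totals = [0] * len(labels)
    order = list(labels)
    for _ in range(permutations):
        rng.shuffle(order)
        seen = set()
        for i, label in enumerate(order):
            seen.add(label)
            totals[i] += len(seen)  # distinct labels after i + 1 specimens
    return [t / permutations for t in totals]


# Toy family labels, invented for illustration only (not project data).
toy_labels = ["F1"] * 6 + ["F2"] * 3 + ["F3", "F4", "F5"]
curve = accumulation_curve(toy_labels)
for n, expected in enumerate(curve, start=1):
    print(n, round(expected, 2))
```

A curve that is still rising at its right-hand end indicates that further sampling would keep adding taxa, which is the situation described above for families and, even more so, for species.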
The diversity of Chalcidoidea, a superfamily of Hymenoptera, gives us a clear picture of the diversity uncovered at Halimun-Salak National Park. The Universal Chalcidoidea Database (Noyes 2018) returns records for 17 genera and 302 species from Java. Here, we detected 11 genera and 155 species for this superfamily. For four families (Aphelinidae, Eulophidae, Mymaridae and Torymidae), the diversity detected was higher than the diversity described (Fig. 5), suggesting that those samples contain many species new to science.