The data of the Swedish Malaise Trap Project, a countrywide inventory of Sweden's insect fauna

Abstract
Background
Despite Sweden's strong entomological tradition, large portions of its insect fauna remain poorly known. As part of the Swedish Taxonomy Initiative, launched in 2002 to document all multi-cellular species occurring in the country, the first taxonomically-broad inventory of the country's insect fauna was initiated, the Swedish Malaise Trap Project (SMTP). In total, 73 Malaise traps were deployed at 55 localities representing a wide range of habitats across the country. Most traps were run continuously from 2003 to 2006 or for a substantial part of that time period. The total catch is estimated to contain 20 million insects, distributed over 1,919 samples (Karlsson et al. 2020). The samples have been sorted into more than 300 taxonomic units, which are made available for expert identification. Thus far, more than 100 taxonomists have been involved in identifying the sorted material, recording the presence of 4,000 species. One third of these had not been recorded from Sweden before and 700 have tentatively been identified as new to science.
New information
Here, we describe the SMTP dataset, published through the Global Biodiversity Information Facility (GBIF). Data on the sorted material are available in the "SMTP Collection Inventory" dataset. It currently includes more than 130,000 records of taxonomically-sorted samples. Data on the identified material are published using the Darwin Core standard for sample-based data. That information is divided up into group-specific datasets, as the sample set processed for each group is different and in most cases non-overlapping. The current data are divided into 79 taxonomic datasets, largely corresponding to taxonomic sorting fractions. The orders Diptera and Hymenoptera together comprise about 90% of the specimens in the material and these orders are mainly sorted to family or subfamily. The remaining insect taxa are mostly sorted to the order level. In total, the 79 datasets currently available comprise around 165,000 specimens, that is, about 1% of the total catch. However, the data are now accumulating rapidly and will be published continuously. The SMTP dataset is unique in that it contains a large proportion of data on previously poorly-known taxa in the Diptera and Hymenoptera.


Introduction
Sweden has a long entomological tradition, starting even before Linnaeus's groundbreaking work on the Swedish fauna and flora. Nevertheless, large portions of the insect fauna remain poorly known to this day. The Swedish distribution of numerous species is documented only by scattered occurrence records and their ecology is poorly documented or unknown. It has also been clear for some time that further research on neglected insect groups would expand the known Swedish species stock significantly and lead to the discovery of a number of species new to science. These neglected groups include many taxa in the orders Diptera and Hymenoptera; they also include the lice (Phthiraptera) and even some select groups in more well-known insect orders.
To address these knowledge gaps, an ambitious national insect inventory was started in 2003, the Swedish Malaise Trap Project (Karlsson et al. 2005). Malaise traps were chosen as the trapping method because of their efficiency in catching many of the poorly-known taxa of Diptera and Hymenoptera. A total of 73 Malaise traps were deployed at 55 localities spread out over the country and representing a diverse set of habitats. Most of the traps were run continuously from 2003 to 2006 or for a significant portion of that time period. A detailed description of all aspects of the project, from collection through transfer to taxonomic experts and collection storage, was recently published (Karlsson et al. 2020).
The total SMTP catch is estimated to contain 20 million insects, distributed over 1,919 samples (Karlsson et al. 2020). The samples have been sorted into more than 300 taxonomic units, suitable for further processing by experts. Thus far, more than 100 taxonomists have been involved in identifying the sorted material, recording the presence of 4,000 species. One third of these had not been recorded from Sweden before and 700 have tentatively been identified as new to science.
The SMTP data are published through the Global Biodiversity Information Facility (GBIF). Data are published both on the sorted taxonomic fractions (the "SMTP Collection Inventory" dataset) and on the identified material of individual groups. The latter data are published using the Darwin Core standard for sample-based data to facilitate biodiversity analyses. The SMTP datasets are unique in that they contain a large proportion of data on previously poorly-known, but species-rich taxa in the Diptera and Hymenoptera. Because of this, and because the Swedish insect fauna was better known than most other insect faunas before the start of the inventory, the SMTP data offer biologists one of the best opportunities currently available for detailed analysis of the size and composition of a temperate insect fauna.
In this paper, we provide background data on the SMTP and describe the rationale behind the data publication strategy. We also provide an overview of the currently available datasets, comprising information on more than 130,000 taxonomically-sorted samples and on the species identity of around 165,000 specimens, that is, about 1% of the total SMTP catch. The species-level data are now accumulating rapidly; more than 600,000 specimens are currently on loan to experts for identification. The data will be published continuously through GBIF as they become available over the coming years.

General description
Purpose: The purpose of this paper is to provide an overview of the SMTP data published through GBIF and to describe the data publication strategy.

Additional information:
The SMTP data fall into two different categories: data on the material sorted to taxonomic fractions and data on the specimens identified to species. The data on the sorted material are available in the "SMTP Collection Inventory" dataset. It currently includes more than 130,000 records of taxonomically-sorted samples. As the sorting of the SMTP material to order level is now complete, the changes to this dataset over the coming year or two will reflect only the finer levels of sorting as they occur. As samples are processed by taxonomists, however, they will disappear from the sorted material dataset so that it will continuously reflect the SMTP material that is currently available to taxonomic experts.
Data on the identified material are published using the Darwin Core standard for sample-based data. The goal is to make the data easy to use for biodiversity analyses, such as species richness estimation. Such analyses typically require that all species in a specific taxon set are recorded for the same set of samples. As the set of SMTP samples processed for each taxonomic unit is different, the information is divided into group-specific datasets. The taxon coverage of each dataset usually corresponds to one of the taxonomic fractions used in the sorting process. In some cases, several related taxonomic fractions are combined into a single dataset, but only if the same set of samples has been processed for all fractions. The circumscription of SMTP datasets is subject to change, based on discussions with the taxonomists involved in identifying the material, amongst other things.
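To illustrate why richness analyses need a complete record of all covered species over a fixed set of samples, the following is a minimal sketch of one common incidence-based estimator, the bias-corrected Chao2. It is offered only as an example of such an analysis, not as a method used by the SMTP project; the sample identifiers and species labels are hypothetical placeholders.

```python
from collections import Counter


def chao2(incidence):
    """Bias-corrected Chao2 richness estimate.

    incidence maps each processed sample ID to the set of species found
    in it. Processed samples with no specimens must be included as empty
    sets, because they count towards the number of samples m.
    """
    m = len(incidence)
    if m == 0:
        raise ValueError("need at least one processed sample")
    # Number of samples in which each species was recorded.
    freq = Counter(sp for species in incidence.values() for sp in set(species))
    s_obs = len(freq)                               # observed richness
    q1 = sum(1 for n in freq.values() if n == 1)    # species in exactly one sample
    q2 = sum(1 for n in freq.values() if n == 2)    # species in exactly two samples
    return s_obs + ((m - 1) / m) * q1 * (q1 - 1) / (2 * (q2 + 1))


# Hypothetical incidence data for four processed samples.
samples = {
    "E1": {"sp_a", "sp_b"},
    "E2": {"sp_a", "sp_c"},
    "E3": {"sp_a", "sp_b", "sp_d"},
    "E4": set(),  # processed, but no specimens of the covered group
}
estimate = chao2(samples)
print(estimate)
```

Dropping the empty sample "E4" would change m and therefore the estimate, which is one reason the set of processed samples has to be known exactly.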
The taxonomic coverage of each dataset is described in its metadata, in the taxonomic coverage section. The field generalTaxonomicCoverage is used to provide information about subtaxa that may be excluded. For instance, a genus may be excluded from the dataset of a family because it is very difficult to identify to species or because it is so numerous that identification to species would be too time-consuming. As appropriate for sample-based data, the absence of a species from the Occurrence table (the Extension table of the Darwin Core Archive) is significant. Provided that the species belongs to the covered taxon set, it means that the species was not encountered in the processed samples.
The samples processed for the taxa covered by the dataset are listed in the Event table (the Core table of the Darwin Core Archive). The sampling site location and sampling effort (time period in days) of each sample are specified, as well as the associated TrapID and EventID identifiers that are used consistently throughout the SMTP project. These identifiers facilitate analyses that look at patterns across SMTP datasets. For instance, one may be interested in the overlap in spatial or temporal coverage amongst SMTP datasets. The absence of one of the 1,919 samples from the Event table is significant: it means that the sample has not been processed for the taxa covered in the dataset. If a sample is listed in the Event table, but there are no occurrence records tied to it, it means that it has been verified that there are no specimens of the covered taxa in that sample. Note that the SMTP samples processed for the taxa in the dataset are usually an arbitrary subset of all the available SMTP samples. If the subset has been chosen according to a principled method, this will be noted in the metadata fields Methods:sampling and Methods:samplingDescription.
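The absence rules described above can be summarised in a short sketch. It assumes that the Event and Occurrence tables have already been read into plain Python collections and that taxonomic coverage can be approximated by a set of names; both are simplifications for illustration and all identifiers and names are hypothetical.

```python
def sample_status(event_id, taxon, events, occurrences, covered_taxa):
    """Interpret the presence or absence of a taxon in one SMTP sample.

    events       -- set of eventID values listed in the Event table
    occurrences  -- set of (eventID, scientificName) pairs from the
                    Occurrence table
    covered_taxa -- set of names covered by the dataset (a simplification
                    of the taxonomic coverage given in the metadata)
    """
    if taxon not in covered_taxa:
        return "taxon not covered by dataset"
    if event_id not in events:
        # Sample missing from the Event table: not processed for this group.
        return "sample not processed"
    if (event_id, taxon) in occurrences:
        return "present"
    # Sample listed, taxon covered, no record: a verified absence.
    return "absent (verified)"


def shared_events(events_a, events_b):
    """Samples processed in both of two datasets, matched by eventID."""
    return set(events_a) & set(events_b)


# Hypothetical example data.
events = {"EV1", "EV2", "EV3"}
occurrences = {("EV1", "Aus bus"), ("EV3", "Aus cus")}
covered = {"Aus bus", "Aus cus"}

print(sample_status("EV1", "Aus bus", events, occurrences, covered))
print(sample_status("EV2", "Aus bus", events, occurrences, covered))
print(sample_status("EV9", "Aus bus", events, occurrences, covered))
```

Because the identifiers are shared across SMTP datasets, a set intersection such as `shared_events` is enough to find the samples that two datasets have in common.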
In general, the Occurrence table lists abundance data for the observed species. This may be recorded separately for each sex or as a total number of specimens regardless of sex. In a few cases, the Occurrence table instead lists incidence data or a mix of incidence and abundance data. If so, this is noted in the metadata of the dataset, in the Dataset:additionalInfo field. This field may also contain annotations about the determinations. For instance, in the Phoridae dataset, only the males are determined to species; females are usually determined only to genus.
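When abundance is recorded separately for each sex, or when datasets with abundance and incidence data are to be compared, the records may need to be reduced to a common form. The sketch below assumes that each occurrence row carries the Darwin Core terms eventID, scientificName and individualCount; the actual columns of a given SMTP dataset should be checked against its archive, and the rows shown are made up.

```python
from collections import defaultdict


def total_abundance(rows):
    """Sum individualCount over all rows (e.g. one per sex) for each
    (eventID, scientificName) combination."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["eventID"], row["scientificName"])
        totals[key] += int(row["individualCount"])
    return dict(totals)


def to_incidence(totals):
    """Reduce abundance totals to presence records (count above zero)."""
    return {key for key, count in totals.items() if count > 0}


# Hypothetical occurrence rows, as csv.DictReader would return them.
rows = [
    {"eventID": "EV1", "scientificName": "Aus bus", "sex": "male", "individualCount": "3"},
    {"eventID": "EV1", "scientificName": "Aus bus", "sex": "female", "individualCount": "2"},
    {"eventID": "EV2", "scientificName": "Aus cus", "sex": "male", "individualCount": "1"},
]
totals = total_abundance(rows)
presence = to_incidence(totals)
print(totals)
print(sorted(presence))
```

For a dataset such as Phoridae, where only males are determined to species, totals computed this way describe the identified males only and should not be read as the full abundance of a species.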
As far as possible, the species-level taxonomy used for the SMTP data follows the national Swedish checklist Dyntaxa, also available as a checklist through GBIF (https://doi.org/10.15468/j43wfc). Deviations may occur for several reasons. Manuscript names are used in the SMTP datasets for new species; the ambition is to update these records and match them to new entries in Dyntaxa when the taxa are described. In some cases, specialists involved in identifying SMTP material provide corrections of Dyntaxa species names or concepts. These corrections are forwarded to Dyntaxa curators for review and possible action. The ambition is to synchronise the content in the SMTP datasets and Dyntaxa as soon as the issues have been resolved, but this is currently a manual process and time lags may occur.
The sample-based SMTP datasets are linked through their names; they are named "SMTP X", where "X" refers to the taxon set. For instance, the dataset covering Coleoptera is named "SMTP Coleoptera". The metadata of each dataset include standardised fields describing the SMTP project, further facilitating collective retrieval of all SMTP datasets.

Sampling methods
Sampling description: The project used a standard Townes-style Malaise trap obtained from Sante Traps, Lexington, KY, USA (Fig. 2).
The trap sites are described in more detail in Suppl. material 1. Trap locations were chosen to maximise habitat diversity. To facilitate analyses, the sites have been classified into different habitat types (Suppl. material 1). A somewhat more detailed characterisation of habitat has been done by noting the dominant plants on each site. Suppl. material 2 provides a list of the 1,919 sampling events, each corresponding to a particular range of dates for a particular trap. Within the SMTP project, we maintain a unique set of identifiers for the traps and another set of identifiers for the collecting events. These identifiers are used consistently in all SMTP datasets to facilitate analyses that combine information from several datasets. Individual datasets of identified specimens span different subsets of the 1,919 samples and typically only include some of the trapping sites. Therefore, the geographic coverage varies considerably amongst datasets. As the geographic coverage will increase as data are added to each individual dataset, the actual geographic coverage of the dataset has to be computed from the Event table (the Core table in the Darwin Core Archive). The geographic coverage specified in the metadata of the dataset matches that of the entire project.
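Computing the actual geographic coverage from the Event table can be sketched as follows. The example assumes a tab-separated Event table with the Darwin Core terms decimalLatitude and decimalLongitude; the file layout and the coordinates below are illustrative assumptions, not values taken from an SMTP dataset.

```python
import csv
import io


def bounding_box(event_rows):
    """Return (min_lat, min_lon, max_lat, max_lon) over all event rows."""
    lats, lons = [], []
    for row in event_rows:
        lats.append(float(row["decimalLatitude"]))
        lons.append(float(row["decimalLongitude"]))
    if not lats:
        raise ValueError("Event table contains no rows")
    return (min(lats), min(lons), max(lats), max(lons))


# Hypothetical Event table content (tab-separated, made-up coordinates).
event_tsv = (
    "eventID\tdecimalLatitude\tdecimalLongitude\n"
    "EV1\t55.7\t13.4\n"
    "EV2\t59.3\t18.0\n"
    "EV3\t67.8\t20.2\n"
)
reader = csv.DictReader(io.StringIO(event_tsv), delimiter="\t")
bbox = bounding_box(reader)
print(bbox)
```

A bounding box is only the coarsest summary; since the samples come from a limited number of trap sites, listing the distinct sites present in the Event table may be more informative.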

Taxonomic coverage
Description: The taxonomic coverage of the entire SMTP catch is quite broad. Malaise traps principally target flying insects (especially Hymenoptera and Diptera), but SMTP material also includes large numbers of other terrestrial arthropods (Araneae, Acari, Collembola) and scattered specimens of other invertebrates (Pulmonata, Lumbricidae etc.). There are also isolated cases of unwanted bycatch of vertebrates (several lizards, a bird, a bat), but these specimens have not been preserved as part of the SMTP material.
The taxonomic coverage of each sample-based SMTP dataset is given as part of the metadata published with the dataset. The current data (Table 1, Suppl. material 3) comprise 79 different datasets. The datasets are dominated by groups belonging to the orders Diptera and Hymenoptera, mostly at the family or subfamily level. Together, these orders comprise about 90% of the specimens in the SMTP material and they also dominate amongst the datasets that cover large numbers of samples and specimens (Table 1, Suppl. material 3). The remaining insect taxa in the SMTP material are mostly sorted to the order level; there are also substantial datasets for some of these. In total, the 79 datasets currently available cover around 165,000 specimens, that is, about 1% of the total catch. The data are now accumulating rapidly and will be published continuously, so these numbers are likely to increase considerably over the coming years. The datasets only cover a portion of the available taxonomic fractions from the SMTP material. About 330 taxonomic sorting fractions are currently used (Table 2; see the list at the Station Linné webpage for more details, including older sorting fractions). Around half of these fractions are at present actively being worked on. As new experts become involved in the project, we expect that the taxonomic coverage of the identified material will increase. The taxonomic fractions have changed slightly through the years, reflecting how experts want to work with the material at a given point in time and the sorting expertise available in the SMTP lab. These types of changes are likely to continue over the coming years. Similarly, we expect that the taxonomic circumscription of published datasets will change slowly over time.

Temporal coverage
Trap managers were instructed to empty the traps every two weeks. However, during the most intense summer period, some traps had to be emptied more often. Conversely, during the winter, sample periods were often much longer (Suppl. material 2). Individual datasets of identified material span only a subset of the 1,919 samples and the temporal coverage therefore varies.
The temporal coverage of individual sample-based datasets varies considerably, depending on which samples have been processed. The temporal coverage will increase over time as data are added to the dataset. The actual temporal coverage has to be computed from the Event