Diversity and abundance of soil macroinvertebrates along a contamination gradient in the Central Urals, Russia

Abstract
Background
Since the late 1980s, long-term monitoring of terrestrial ecosystems in metal-contaminated areas near the Middle Ural Copper Smelter has been carried out in the Central Urals. As a part of these monitoring programmes, the data on species diversity, community composition and abundance of soil macroinvertebrates continue to be gathered.
New information
The dataset (available from the GBIF network at https://www.gbif.org/dataset/61e92984-382b-4158-be6b-e391c7ed5a64) includes a 2004 census for soil macroinvertebrates of spruce-fir forests along a pollution gradient in the Central Urals. The dataset describes soil macrofauna’s abundance (the number of individuals per sample, i.e. the density) and community structure (list of supraspecific taxa, list of species for most abundant taxa and supraspecific taxa or species abundance). Seventeen sampling plots differed in the levels of toxic metal (Cu, Zn, Pb, Cd and Fe) soil contamination from air emissions of the Middle Ural Copper Smelter (heavily polluted, moderately polluted and unpolluted areas). The dataset consists of 340 sampling events (= samples corresponding to upper and lower layers of the 170 soil monoliths) and 64658 rows (2907 and 61751 for non-zero and zero density of taxa, respectively). Arachnida (Araneae and Opiliones), Carabidae (imagoes), Elateridae (larvae), Chilopoda, Diplopoda, Gastropoda, Staphylinidae (imagoes) and Lumbricidae were identified to species level. In contrast, Mermithida, Enchytraeidae, Lepidoptera larvae, Diptera larvae, Hemiptera, Hymenoptera and some other insects were identified to family or order levels. In total, 8430 individuals of soil macroinvertebrates were collected in two soil layers (organic and organic-mineral horizons), including 1046 Arachnida (spiders and harvestmen), 45 Carabidae, 300 Elateridae, 529 Myriapoda, 741 Gastropoda, 437 Staphylinidae, 623 Lumbricidae and 4709 other invertebrates.
The presence-absence data on each taxon are provided for each sampling event. An overwhelming majority of such absences can be interpreted as “pseudo-absences” at the scale of sampling plots or study sites. The dataset contains information helpful for long-term ecotoxicological monitoring of forest ecosystems and contributes to studying soil macrofauna diversity in the Urals.


Introduction
Industrial pollution can drastically affect soil macroinvertebrates (Rusek and Marshall 2000). Soil contamination with toxic metal(loid)s caused by non-ferrous smelters is especially hazardous: some taxa disappear (e.g. earthworms, potworms and molluscs) or decrease in abundance (e.g. centipedes and spiders), leading to the radical transformation of the community structure (Bengtsson and Tranvik 1989, Stepanov et al. 1991, Nekrasova 1993, Nahmani and Lavelle 2002, Gongalsky et al. 2007, Tanasevich et al. 2009, Vorobeichik et al. 2012, Vorobeichik et al. 2019). Such changes drive the slowdown of organic matter decomposition and disintegration of soil aggregates (Vorobeichik 2016, Korkina and Vorobeichik 2018), disappearance of some mammals, for example, the common mole (Vorobeichik and Nesterkova 2015, Nesterkova 2019) and the imbalance of mineral nutrients in plants (Sukhareva and Lukina 2014) and birds (Belskii and Grebennikov 2014). Given the considerable role of soil macroinvertebrates in terrestrial ecosystems (Brussaard et al. 2007), they are often used in environmental monitoring and assessment (Cortet et al. 1999, Paoletti et al. 2010). Areas in the vicinity of point polluters (i.e. sources of atmospheric emissions with an incomparably smaller size than the areas polluted by them) provide a convenient model for analysing the response of terrestrial ecosystems to the toxic load. These areas can be considered the result of a long-term, large-scale natural experiment with ecosystems, which began when a factory was launched. The data obtained in the vicinity of point polluters can be used to reveal the mechanisms of ecosystem resistance and resilience to stress factors (Vorobeichik and Kozlov 2012, Vorobeichik 2022).
Thus, the study area is unique in that ecosystem recovery can be investigated in real time, since information is available about the state of the same sites before and after the emission reduction. The state of soil macrofauna in 2004 can therefore be taken as a starting point for analysing its recovery: this census is the last one before the almost complete cessation of emissions in 2010. A partial analysis of this information has already been presented in a study on species diversity changes along the pollution gradient (Vorobeichik et al. 2012). In addition, an analysis of soil macrofauna recovery at the supraspecific taxa level was carried out (Vorobeichik et al. 2019). The advantage of the presented dataset is that it allows such an analysis at the species level. In addition, metadata on metal concentrations in forest litter (as an index of toxic load) make it possible to analyse dose-response dependences and estimate macrofauna resistance thresholds to soil pollution.
We present a sampling-event dataset that reports the outcomes of multi-species sampling in the field. Currently, most of the datasets in GBIF are occurrence-based and describe point records of species. In contrast, the contribution of sampling-event datasets remains very low, about 3% of all published datasets (Ivanova and Shashkov 2021, Shashkov et al. 2021). Sampling-event datasets for multi-species communities can contain more zeros than non-zeros, since only a tiny part of the regional species pool may be present in each specific sample. Undoubtedly, the overwhelming majority of such zeros can be regarded as "pseudo-absences" at the scale of sampling plots or, a fortiori, at the scale of study sites.
Nevertheless, such "pseudo-absences" are not superfluous. These data, given for each species in each sample, provide the most detailed original information about community structure at all investigated spatial scales, from samples to the whole area. Such information is helpful for many research tasks. For example, we can easily estimate the frequency of occurrence at different spatial scales (i.e. within the sampling plot, study site, pollution zone or whole area) for each species or combination of species pooled in ecological groups or supraspecific taxa. Collapsing the published data, for example, to the scale of sampling plots, would lead to an irreversible loss of information and is therefore not appropriate. Moreover, if the sampling-event dataset did not contain zeros for the "pseudo-absences" of species, they would have to be added "artificially" for such calculations.
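The frequency-of-occurrence calculation mentioned above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the procedure used to prepare the dataset: the records are invented toy values, and the tuple fields (plot, event, taxon, status) are hypothetical stand-ins for the corresponding columns of a sampling-event table.

```python
from collections import defaultdict

# Toy presence-absence records (invented values, not taken from the dataset).
# Each tuple: (sampling plot, sampling event, taxon, occurrence status).
records = [
    ("plot_A", "A-1", "Dendrobaena octaedra", "present"),
    ("plot_A", "A-2", "Dendrobaena octaedra", "absent"),
    ("plot_A", "A-3", "Dendrobaena octaedra", "present"),
    ("plot_A", "A-4", "Dendrobaena octaedra", "present"),
    ("plot_B", "B-1", "Dendrobaena octaedra", "absent"),
    ("plot_B", "B-2", "Dendrobaena octaedra", "absent"),
]

def frequency_of_occurrence(rows):
    """Return {(plot, taxon): share of sampling events with the taxon present}."""
    present = defaultdict(int)
    total = defaultdict(int)
    for plot, _event, taxon, status in rows:
        key = (plot, taxon)
        total[key] += 1                 # every row counts towards the denominator
        if status == "present":
            present[key] += 1
    return {key: present[key] / total[key] for key in total}

freq = frequency_of_occurrence(records)
for (plot, taxon), value in sorted(freq.items()):
    print(f"{plot}  {taxon}: {value:.2f}")
```

The explicit "absent" rows supply the denominator. Without them, plot_B would not appear in the result at all, and the number of events examined there would have to be reconstructed from other sources.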
In a pollution gradient study, information about the absence of species is more important than information about their presence, assuming that the study sites did not differ considerably before the factory began operating. The presence-absence data make it possible to assess which species or supraspecific taxa disappear as soil contamination increases. Because most zeros in individual samples are "pseudo-absences", such an analysis must use taxon absences at least at the scale of sampling plots (after collapsing the data), not at the scale of samples.
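Collapsing sample-level records to the plot scale before interpreting absences can be sketched as follows. The records are invented toy values, the plot and sample identifiers are hypothetical, and the rule "present on a plot if present in at least one sample" is an assumption made for this illustration rather than a procedure stated in the text.

```python
from collections import defaultdict

# Toy sample-level records (invented values for illustration only).
# Each tuple: (pollution zone, sampling plot, sample, taxon, status).
records = [
    ("unpolluted", "P1", "P1-1", "Lumbricidae", "present"),
    ("unpolluted", "P1", "P1-2", "Lumbricidae", "absent"),
    ("unpolluted", "P1", "P1-1", "Staphylinidae", "absent"),
    ("unpolluted", "P1", "P1-2", "Staphylinidae", "present"),
    ("heavily polluted", "P2", "P2-1", "Lumbricidae", "absent"),
    ("heavily polluted", "P2", "P2-2", "Lumbricidae", "absent"),
    ("heavily polluted", "P2", "P2-1", "Staphylinidae", "present"),
    ("heavily polluted", "P2", "P2-2", "Staphylinidae", "absent"),
]

def collapse_to_plots(rows):
    """A taxon counts as present on a plot if it occurs in at least one sample."""
    presence = defaultdict(bool)
    for zone, plot, _sample, taxon, status in rows:
        presence[(zone, plot, taxon)] |= (status == "present")
    return dict(presence)

def taxa_absent_from_zone(plot_presence, zone):
    """Taxa that were not recorded on any plot of the given zone."""
    found = defaultdict(bool)
    for (z, _plot, taxon), present in plot_presence.items():
        if z == zone:
            found[taxon] |= present
    return sorted(taxon for taxon, present in found.items() if not present)

plot_presence = collapse_to_plots(records)
print(taxa_absent_from_zone(plot_presence, "heavily polluted"))
print(taxa_absent_from_zone(plot_presence, "unpolluted"))
```

In the toy data, each taxon is missing from some individual sample in both zones, but only one taxon is absent at the plot scale. The sample-level zeros that vanish after collapsing are the "pseudo-absences".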
It is important to distinguish the actual disappearance of a species from a decline in its abundance below the detection limit of the sampling method. Although these two cases are interpreted quite differently, telling them apart is challenging, and special research is needed to detect the pollution-induced elimination of species. For example, we discovered that earthworms and molluscs could inhabit decaying logs in heavily polluted areas near the smelter, although they had been eliminated from the topsoil at these sites. The presented dataset does not distinguish the elimination of species in polluted areas from declines in their abundance below the detection limits. Nevertheless, the dataset enables us to assess relative changes in species composition and the community structure along a pollution gradient because we used a rigorous sampling design: the number of samples, size of soil monoliths and collecting method were the same in each sampling plot.

Project description
Study area description: The study area is situated in the lowest uplands of the Urals (altitudes are 150-400 m a.s.l.) and belongs to the southern taiga subzone. Primary coniferous forests (Picea abies, Abies sibirica and Pinus sylvestris) and secondary deciduous forests (Betula pendula, Betula pubescens and Populus tremula) prevail. Spruce and fir forests with nemoral flora on loam or heavy loam soils dominate on the western slope of the Urals and pine forests on sandy loam or light loam soils prevail on the eastern side. Study sites are located in spruce-fir forests. The ground vegetation layer is dominated by Oxalis acetosella, Aegopodium podagraria, Gymnocarpium dryopteris, Dryopteris carthusiana, Asarum europaeum, Maianthemum bifolium, Cerastium pauciflorum and Stellaria holostea. Soil cover is formed mainly by soddy-podzolic soils (Albic Retisols, Stagnic Retisols and Leptic Retisols), burozems (Haplic Cambisols) and grey forest soils (Retic Phaeozems). Zoogenically-active humus form (Dysmull) prevails (Korkina and Vorobeichik 2018).
The average annual air temperature is +2.0°C; the average annual precipitation is 550 mm; the warmest month is July (+17.7°C) and the coldest month is January (-14.2°C) (mean values for the last 40 years, 1975-2015, according to the data of the nearest meteorological station in Revda). The snowless period is about 215 days (from April to October), the maximum depth of the snow cover being about 40-60 cm.
In the moderately polluted area, emissions have suppressed the tree stand and ground vegetation layer (decreasing species diversity and productivity). Only fragments of the spruce-fir forests have persisted in the heavily polluted area. Near the Middle Ural Copper Smelter (MUCS), ground-layer vegetation consists of several pollution-tolerant species (Equisetum sylvaticum, Deschampsia caespitosa, Tussilago farfara, Agrostis capillaris) and the moss layer is formed by only one species (Pohlia nutans). Apart from metal accumulation and increased acidity, soil transformation manifests itself in the enhancement of the eluvial-gleying process, degradation of soil aggregates, a decrease in exchangeable potassium and magnesium, an increase in forest litter thickness and shifts from zoogenically-active Mull humus forms to Eumor humus forms without any signs of soil macrofauna activity.

Sampling methods
Study extent: Study sites (Fig. 2) were located on gentle slopes of ridges in spruce-fir forest. A total of nine study sites (= locationID) were established, corresponding to areas with different pollution levels. The number of sampling plots within each study site ranged from one to three; 20 samples were collected from each sampling plot (Table 1).
Soil macrofauna was hand-sorted out of soil monoliths 20 × 20 cm in area and 25-30 cm in depth, depending on the presence of macroinvertebrates. The time interval for extracting one soil monolith from the sampling plot was approximately 5 minutes. Ten monoliths were collected from each plot randomly, excluding areas within 0.5-1 m of the trunks of large trees (more than 30 cm in diameter) and any visible pedoturbations. During sampling, monoliths were divided into two layers: the O horizon (organic) and A horizon (organic-mineral). Monoliths were placed in plastic bags (separately for the layers), delivered to the laboratory and stored before processing at 12°C for no more than five days (as a rule, 1-2 days). The collected invertebrates were wet-preserved in 70% ethanol; earthworms were carefully washed with water, fixed with 10% formalin and then wet-preserved in 70% ethanol. Ants and relatively large micro-arthropods (springtails, oribatid mites) were not counted. A total of 340 samples and 8430 individuals of soil macroinvertebrates were collected.
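Because each monolith covers 20 × 20 cm (0.04 m²), a count per monolith can be converted to a density per square metre by multiplying by 25. The sketch below shows this arithmetic for one taxon on one plot. The counts are invented, and summing the O and A horizon samples of a monolith before averaging is an assumption of this example rather than a step prescribed by the text.

```python
# Monolith surface area from the sampling description: 20 x 20 cm.
MONOLITH_AREA_CM2 = 20 * 20      # 400 cm2
CM2_PER_M2 = 100 * 100           # 10 000 cm2 in 1 m2

# Hypothetical counts of one taxon in the ten monoliths of one plot,
# given as (O-horizon sample, A-horizon sample) pairs. Invented values.
layer_counts = [(2, 1), (0, 0), (3, 2), (1, 1), (1, 0),
                (2, 2), (0, 0), (1, 1), (1, 0), (2, 0)]

def density_per_m2(pairs):
    """Mean density (ind./m2) from per-layer counts of each monolith."""
    per_monolith = [o_count + a_count for o_count, a_count in pairs]
    mean_count = sum(per_monolith) / len(per_monolith)
    return mean_count * CM2_PER_M2 / MONOLITH_AREA_CM2

density = density_per_m2(layer_counts)
print(f"{density:.1f} ind./m2")
```

Working in whole square centimetres keeps the conversion factor exact (10 000 / 400 = 25) and avoids the rounding noise that squaring 0.2 m in floating point would introduce.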
When preparing the dataset, we assumed that each species recorded in the investigated area could be found in each sample. Based on this assumption, zero densities of a species in a sample are recorded as zero and, correspondingly, as dwc:occurrenceStatus=absent.
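The zero-filling rule described above can be illustrated with a short Python sketch. It expands a sparse table of non-zero counts into one row per combination of sampling event and taxon. The event identifiers and counts are invented, and the use of `individualCount` as the count field is an assumption of this example; only `occurrenceStatus` is named in the text.

```python
from itertools import product

# Hypothetical sampling events and taxa (identifiers invented for illustration).
events = ["plot1-monolith1-O", "plot1-monolith1-A", "plot1-monolith2-O"]
taxa = ["Dendrobaena octaedra", "Enchytraeidae", "Staphylinidae"]

# Sparse non-zero counts as they might come out of sorting: (event, taxon) -> count.
nonzero = {
    ("plot1-monolith1-O", "Dendrobaena octaedra"): 3,
    ("plot1-monolith1-A", "Enchytraeidae"): 5,
}

def zero_fill(counts, event_ids, taxon_names):
    """Build one row per event-taxon pair, marking missing pairs as absent."""
    rows = []
    for event, taxon in product(event_ids, taxon_names):
        count = counts.get((event, taxon), 0)   # missing pair means zero density
        rows.append({
            "eventID": event,
            "scientificName": taxon,
            "individualCount": count,           # assumed field name
            "occurrenceStatus": "present" if count > 0 else "absent",
        })
    return rows

rows = zero_fill(nonzero, events, taxa)
absent = sum(1 for row in rows if row["occurrenceStatus"] == "absent")
print(len(rows), "rows,", absent, "absent")
```

Even in this small example, absent rows outnumber present ones, which mirrors the proportion reported for the dataset (61751 zero rows against 2907 non-zero rows).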
To study the metal contents, we collected five pooled samples of forest litter in August 2004 at each sampling plot (85 samples in total, Table 3). Dried samples were ground and sieved (2 mm). The pH was measured potentiometrically (the soil-to-water ratio was 1:25 w/v). We used acid-soluble forms of the potentially toxic metals (Cu, Pb, Cd, Zn and Fe) to approximate their total content and as an index of toxic loads. Metal concentrations were determined by an atomic absorption spectrophotometer AAS 6 Vario (Analytik Jena, Germany) after extraction with 5% nitric acid (HNO3) (the soil-to-acid ratio was 1:10 w/v) following USEPA Method 7000B (USEPA 2007).
The community's core in unpolluted and moderately polluted areas is formed by Lumbricidae and Enchytraeidae (30-60% of the total abundance). The earthworm density reached 260 ind./m² (considering cocoons, up to 1000 ind./m²). In total, eight species of earthworms were recorded: two Ural endemics (Riphaeodrilus diplotetratheca (Perel, 1967) and Perelia tuberosa (Svetlov, 1924) (Koch, 1836)).
Amongst insects, species identification has been made only for some Coleoptera (Carabidae, Staphylinidae and Elateridae). A total of nine species of ground beetles, 54 species of rove beetles and seven species of click beetles were recorded.

phylum: The full scientific name of the phylum or division in which the taxon is classified. A variable. Example: "Annelida".
class: The full scientific name of the class in which the taxon is classified. A variable.
order: The full scientific name of the order in which the taxon is classified. A variable. Example: "Crassiclitellata".
family: The full scientific name of the family in which the taxon is classified. A variable. Example: "Lumbricidae".
genus: The full scientific name of the genus in which the taxon is classified. A variable. Example: "Dendrobaena".
specificEpithet: The name of the first or species epithet of the scientificName. A variable. Example: "octaedra".
occurrenceStatus: Examples: "present", "absent". An overwhelming majority of "absences" can be interpreted as "pseudo-absences" at the scale of sampling plots or study sites.
year: The four-digit year in which the Event occurred, according to the Common Era