Ecological data in Darwin Core: the case of earthworm surveys

Abstract
Background
This sampling-event dataset provides primary data about species diversity, age structure, abundance (in terms of biomass and density) and seasonal activity of earthworms (Lumbricidae). The study was carried out in old-growth broad-leaved and young forests of two protected areas ("Kaluzhskiye Zaseki" Nature Reserve and Ugra National Park) of Kaluga Oblast (Russia).
New information
The published dataset provides new data about earthworm communities in European Russia. We propose a new schema according to Darwin Core for the standardisation of soil invertebrate survey data.


Introduction
Earthworms occur in soils almost across the whole world, preferring moist habitats of moderate temperature. They are amongst the major components of terrestrial ecosystems, dominating the biomass of soil invertebrates in non-acidic soils (Lee 1985, Edwards and Bohlen 1996). Through burrowing, casting and mixing of litter and soil (bioturbation), they influence aggregate stability, soil structure, infiltration of water, aeration of deeper soil layers, nutrient cycling, microbial biomass and other soil invertebrates (Eisenhauer et al. 2007, Eisenhauer 2010) and are thus linked with the development of sustainable forest ecosystems (Lavelle et al. 1997, Blouin et al. 2013). Despite this, the amount of available data on the distribution of earthworms in the world is very limited. Recent studies (Cameron 2018, Phillips 2019, Phillips 2021) have highlighted many gaps in our knowledge of the distribution of Lumbricidae, amongst which the territory of Russia is characterised by an extremely low amount of available data. For example, GBIF.org provides only 9602 occurrences of the family Lumbricidae for the Russian territory (GBIF.org 2021), in contrast to the extensive scientific heritage accumulated by Soviet and Russian researchers (Baluev 1950, Malevich 1954, Horizonova et al. 1957, Malevich and Perel 1958, Malevich 1959, Vsevolodova-Perel 1997, Striganova and Porjadina 2005, Berman et al. 2009, Makarova and Kolesnikova 2019, Shekhovtsov et al. 2020 and many others).
In our opinion, this situation can be explained by two reasons. The first is that field data collection is time-consuming and labour-intensive (Coja et al. 2008), which does not allow continuous gathering of material from many locations. Several methods are used to extract earthworms from the soil. The most widely used technique for quantitative sampling of earthworms is hand-digging and hand-sorting (Satchel 1969, Edwards and Lofty 1977, Lee 1985); others include the formalin extraction method (Raw 1959), the electrical octet method (Rushton and Luff 1984, Bohlen et al. 1995, Eisenhauer et al. 2008), hot mustard extraction (Gunn 1992, Eisenhauer et al. 2008) and onion extraction (Steffen et al. 2013).
The second barrier to earthworm data exchange and integration is that the data standardisation process is unclear, because earthworms are usually collected according to a sampling-event design. Nowadays, Darwin Core (Wieczorek et al. 2012) is the key data standard for biodiversity data mobilisation through the GBIF portal. This specimen-based standard was developed for describing species point records and has served as the basis for the interoperability of taxonomic and occurrence-based datasets. However, it has its origin in the natural history collections community and was not initially intended to capture metadata about multi-species sampling processes (Wiser et al. 2011, Guralnick et al. 2018). Although recent efforts have begun to develop an 'Event Core', as well as new terms that relate to ecological data mobilisation (dwc:samplingProtocol, dwc:sampleSizeValue, dwc:sampleSizeUnit, dwc:samplingEffort), the contribution of sampling-event datasets to GBIF remains low (3.1% of all published datasets). The Humboldt Core TDWG task group is working to develop a new standard for biodiversity inventory data sharing. However, ecological data and data collection protocols often differ considerably, even between studies of the same taxonomic group. For example, species data in earthworm censuses may be available for the whole census, for each soil sample in the survey or for each layer within a soil sample. In this case, it is not always clear what an event is. At the same time, it is essential to make it possible to combine surveys from different datasets.
Here, we provide the sampling-event dataset of long-term earthworm surveys (Shashkov and Ivanova 2021), carried out in protected areas of Kaluga Oblast (Russia) (Shashkov 2014, Shashkov 2016), with detailed data on species in soil sample layers, as well as a schema for representing these data in Darwin Core.

General description
Purpose:
1. Provide high-quality soil biodiversity data.
2. Suggest a schema for standardising earthworm survey data according to Darwin Core.

[Figure: Soil sample locations on the study site.]
Additional information: We used data collected by the hand-sorting method in our example. During each survey (usually taken during one day), soil samples of fixed size were randomly collected within the sampling plot (in similar tree and herb cover and soil type). Each soil monolith was hand-sorted by layers for earthworms (see details in the Sampling description section). An example of sampling design is shown in Fig. 1. Geographic coordinates were recorded for the sampling plot, not for each soil sample. Some sampling plots were studied once, others several times during a year or over several years.
Thus, our primary data included information for each individual in the soil sample (species, biomass and life stage) and earthworm density (number of individuals) for the survey.
During the data standardisation process, we considered three types of events (Fig. 2), connected hierarchically. The highest-level event is a survey: a set of soil samples collected at one location during one sampling period. The second level is a soil sample, which is part of a survey; every soil sample collected during the survey is included, including empty samples. The third level is a soil sample layer, which is part of a soil sample.
Thus, the dataset includes occurrences at two levels (Table 1): individual specimen occurrences assigned to a soil sample layer (with individual biomass and life stage) and occurrences assigned to a survey (with total density).
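The three event levels can be sketched as plain records linked through dwc:parentEventID. The Python sketch below is only an illustration under stated assumptions: the identifier scheme (`S1`, `S1:1`, `S1:1:1`), the helper name `build_events` and the `eventLevel` key stored in dwc:dynamicProperties are hypothetical and do not reproduce the identifiers or encoding of the published dataset.

```python
import json

# Layers into which each soil sample was hand-sorted (from the sampling description).
LAYERS = ["litter (A0)", "0-10 cm", "10-20 cm", ">20 cm"]


def build_events(survey_id, n_samples):
    """Return event records for one survey, its soil samples and their layers."""
    events = [{
        "eventID": survey_id,
        "parentEventID": None,  # a survey is the top of the hierarchy
        # Assumption: the event level is stored as JSON in dynamicProperties.
        "dynamicProperties": json.dumps({"eventLevel": "survey"}),
    }]
    for i in range(1, n_samples + 1):
        sample_id = f"{survey_id}:{i}"
        events.append({
            "eventID": sample_id,
            "parentEventID": survey_id,  # soil sample belongs to the survey
            "dynamicProperties": json.dumps({"eventLevel": "soil sample"}),
        })
        for j, layer in enumerate(LAYERS, start=1):
            events.append({
                "eventID": f"{sample_id}:{j}",
                "parentEventID": sample_id,  # layer belongs to the soil sample
                "dynamicProperties": json.dumps(
                    {"eventLevel": "soil sample layer", "layer": layer}
                ),
            })
    return events


events = build_events("S1", n_samples=8)
print(len(events))  # 1 survey + 8 soil samples + 32 layers
```

In such a layout, specimen-level occurrences would point to a layer event and the density occurrences to the survey event, so the soil sample events serve only as links between the two.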
The event hierarchy used allowed us to maintain data consistency and completeness. Nevertheless, our method has some limitations. Firstly, it is not common practice to combine events of different levels in one dataset, and each event level has to be described in the dataset. This information requires a dedicated Darwin Core term, which is currently absent; as a temporary solution, we used the general term dwc:dynamicProperties in this work. Secondly, the event hierarchy includes 338 events (the soil sample level) that are not assigned to any occurrences. These events are empty not because no species were registered, but because we used this event level only to relate surveys to soil sample layers.
Possibly, another data standardisation design would be easier to understand. It would be simpler to use the soil sample as the event and to bind samples from one sampling plot via dwc:locationID and different surveys via dwc:parentEventID. This scheme avoids empty events not related to occurrences. However, its implementation is not possible due to technical limitations of the IPT. We cannot assign different depths to occurrences within one event because dwc:verbatimDepth, dwc:minimumDepthInMeters and dwc:maximumDepthInMeters belong to the Event Core.
On the other hand, events of different levels made it possible to provide traits at different levels. In our dataset, we provided life stage and biomass for each specimen and density for the survey. This is an essential advantage for ecological data re-analysis.
Overall, our solution is not optimal. This approach is a trade-off between the need to provide as complete data as possible, the current state of the Darwin Core standard and the technical limitations of the IPT. We believe that further development of biodiversity data standards and data publishing protocols will optimise the process of ecological sampling-event data mobilisation and facilitate data reuse.
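Events that carry no occurrences of their own, such as the soil sample level discussed above, can be listed by comparing the event and occurrence tables. The following is a minimal sketch with made-up records; the function name `events_without_occurrences` and the identifiers are hypothetical, and only the dwc:eventID link between the two tables is taken from Darwin Core.

```python
def events_without_occurrences(events, occurrences):
    """Return the eventIDs that no occurrence record refers to directly."""
    used = {occ["eventID"] for occ in occurrences}
    return [ev["eventID"] for ev in events if ev["eventID"] not in used]


# Toy data: one survey, one soil sample, one soil sample layer.
events = [
    {"eventID": "S1", "parentEventID": None},
    {"eventID": "S1:1", "parentEventID": "S1"},
    {"eventID": "S1:1:1", "parentEventID": "S1:1"},
]
# Density is attached to the survey, the specimen to the layer.
occurrences = [
    {"occurrenceID": "o1", "eventID": "S1"},
    {"occurrenceID": "o2", "eventID": "S1:1:1"},
]

print(events_without_occurrences(events, occurrences))  # ['S1:1']
```

A data user could apply such a check to tell linking events apart from genuinely empty samples, provided the event level is also recorded for each event.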

Study extent:
The study area was located in the central part of the East European Plain. Earthworms were collected in 13 locations of old-growth broad-leaved forests and young birch forests in the "Kaluzhskiye Zaseki" Nature Reserve and Ugra National Park. There were 10 sampling plots in old-growth broad-leaved forests at a late successional stage or subclimax (Fig. 3). All but one of them (Val) were located either on the watershed or on the watershed slope. Two more sites were sampled in 30-year-old birch forests with broad-leaved regrowth at an early stage of reforestation succession (Fig. 4): one on former tillage and the other on former pasture. One more sampling plot represented black alder forest in the floodplain (Table 2).
The second investigated group of forest stands comprises young forests established on abandoned arable fields and pasture. The stands of young forest are predominantly composed of Betula spp. and Salix caprea L. Sampling plots were located on abandoned farmlands. The distance to the edge of old-growth forests was about 30-50 metres.
Sampling description: At each sampling plot, 8-24 randomly located soil samples (25 cm × 25 cm) were dug to a depth of 35 cm for earthworm collection (Ghilarov 1975). Soil monoliths were taken, if possible, under the middle of the crown projection of a large tree, between the crown edge and the trunk, to reduce the possible influence of differences in microsite conditions. Earthworms were separated from the soil by hand-sorting on site (Fig. 5 and Fig. 6) by layers: litter (A0), 0-10 cm, 10-20 cm and >20 cm. Collected earthworm specimens were preserved in 4% formaldehyde, transferred to the laboratory and, if possible, identified to species level. Specimens were identified by Maxim Shashkov using the key of Vsevolodova-Perel (1997). Most of the juvenile specimens were identified to species level, except those belonging to the genus Lumbricus. Identification of some specimens was confirmed by T.S. Vsevolodova-Perel personally.
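Because each soil sample covers 25 cm × 25 cm (0.0625 m²), survey totals can be scaled to values per square metre by dividing by the total sampled area. The sketch below shows this arithmetic only; the function name `per_square_metre` and the example numbers are hypothetical, and we assume, rather than know from the text, that survey-level density and biomass were derived in this way.

```python
SAMPLE_SIDE_M = 0.25  # side of one soil sample in metres (25 cm)


def per_square_metre(total, n_samples, side_m=SAMPLE_SIDE_M):
    """Scale a survey total (count or mass) to a value per square metre."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    sampled_area_m2 = n_samples * side_m ** 2  # 0.0625 m2 per sample
    return total / sampled_area_m2


# Hypothetical survey: 8 soil samples, 40 individuals, 25 g of worms in total.
print(per_square_metre(40, 8))    # 80.0 individuals per square metre
print(per_square_metre(25.0, 8))  # 50.0 g per square metre
```

With the minimum of 8 samples per plot, the sampled area is 0.5 m², so a single additional individual changes the estimated density by 2 individuals per square metre.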

Traits coverage
The dataset provides three trait types.

Life stage
Earthworms were assigned to three ontogenetic stages (juvenile, subadult and adult) based on the development of the clitellum, the reproductive gland used for cocoon production by mature earthworms, which generally forms an obvious band around the mid-section segments. Earthworms were considered adult if they had a fully developed clitellum, subadult if they had any signs of tubercula pubertatis but no clitellum (Sims and Gerard 1999) and juvenile if they had neither tubercula pubertatis nor clitellum. Cocoons were not collected, as the washing method is more suitable for cocoon collection, but takes more time than hand-sorting (Singh et al. 2015). Cocoons found occasionally were not included in the dataset because they could not be identified by morphological features.
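The staging rule described above can be written as a small decision function. This is a sketch for illustration only; the function name `life_stage` and the two boolean inputs are hypothetical and simply encode the criteria stated in the text.

```python
def life_stage(has_clitellum, has_tubercula_pubertatis):
    """Assign an ontogenetic stage from two morphological observations."""
    if has_clitellum:
        return "adult"  # fully developed clitellum
    if has_tubercula_pubertatis:
        return "subadult"  # tubercula pubertatis present, no clitellum
    return "juvenile"  # neither structure present


print(life_stage(True, True))    # adult
print(life_stage(False, True))   # subadult
print(life_stage(False, False))  # juvenile
```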

Biomass
Preserved specimens were weighed with a portable Ohaus SPU 123 balance to determine earthworm biomass. This device weighs with a precision of 0.001 g and an accuracy of 0.003 g. All the worms were weighed under laboratory conditions in a preserved state. No corrections were made for gut content or dehydration in formaldehyde. Individual biomass was in the range of 2 to 5220 mg. The largest worms were specimens of Aporrectodea caliginosa (max. 1630 mg) and Lumbricus terrestris. The total biomass was highest in old-growth forests on Phaeozems (61.4-110.5 g/m²) and Luvisols grey (45.9-104.0 g/m²), as well as in the young forest on former pasture (97.3-135.9 g/m²). The lowest values were recorded for the young forest on former arable land (4.4-43.5 g/m²) and the alder forest experiencing seasonal flooding (17.9-25.1 g/m²).

Density
Some worms were damaged during soil excavation with a shovel. A fragment was counted as a specimen only if it had an anterior end, but every fragment was included in the biomass. The highest relative density of earthworms (individuals per square metre) was found in the old-growth forest on Phaeozem (R1) and in the young forest on former pasture. The lowest values were observed in the young forest on former arable soil.

sampleSizeValue (Darwin Core Event): A numeric value for a measurement of the size of a sample in a sampling event (number of soil samples for the 'plot survey' event, size of the soil sample for the 'soil sample' event and area of sampling for the 'soil sample layer' event). https://dwc.tdwg.org/terms/#dwc:sampleSizeValue Constant for the soil sample and layer levels: "25×25×35" and "0.0625", respectively.
sampleSizeUnit (Darwin Core Event): The unit of measurement of the size of a sample in a sampling event. https://dwc.tdwg.org/terms/#dwc:sampleSizeUnit Constant for each level: "soil samples", "centimetres" and "square centimetres" (survey, soil sample and layer, respectively).
The full scientific name of the kingdom in which the taxon is classified.