Bryophytes Occurrences Dataset Based On SYKO Herbarium Moss Collection

Abstract
Background
The dataset with 49,726 bryophyte occurrences (49,261 moss occurrences and 465 liverwort occurrences), located predominantly in the territory of European north-east Russia, is described in this data paper. The dataset is based on the digitised moss labels from the herbarium of the Institute of Biology of Komi Scientific Center of the Ural Branch of the Russian Academy of Sciences (SYKO). The information from the labels was recognised, cleaned and brought into compliance with the Darwin Core. More than 99.9% of occurrences were georeferenced with a precision of at least 3 km. For each occurrence, the original label image URL is given. The dataset contains occurrences of 539 moss and liverwort taxa (species and lower ranks) belonging to 190 genera and 75 families.
New information
Information about 49,726 bryophyte occurrences was published in GBIF. The dataset is based on label data of 94% of the SYKO herbarium moss collection specimens. Most of the occurrences are described with the following fields: occurrenceID, institutionID, collectionCode, catalogNumber, basisOfRecord, scientificName, taxonRank, kingdom, phylum, class, order, family, genus, recordedBy, identifiedBy, associatedMedia, day, month, year, country, countryCode, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, georeferencedBy.


Introduction
The herbarium of the Institute of Biology of Komi Science Centre of the Ural Branch of the Russian Academy of Sciences (SYKO) is one of the largest herbaria in European north-east Russia, with more than 309,800 specimens. It was established by the famous Russian botanist A. Tolmachev in Syktyvkar city in 1941. There are five subdivisions in the SYKO Herbarium: vascular plants (205,000 specimens), bryophytes (58,184 specimens), algae (17,420 specimens), lichens (26,000 specimens) and fungi (3,207 specimens).
The second largest SYKO herbarium subdivision, after the vascular plants subdivision, is the bryophytes subdivision, organised by I. Kildjushevskij in 1969. There are two collections in this subdivision: the moss collection and the liverworts collection. These collections were based on the specimens collected during the Komi Republic vegetation exploration in . It should be noted that there are some liverwort samples in the moss collection. These liverwort samples are stored not as separate storage units, but as part of mixed specimens containing several species collected at one point.
There are exsiccata from other herbaria (LE, KPABG) in this SYKO subdivision. The exsiccata originated from the territories of European Russia, Caucasus, Western and Southern Siberia, Russian Far East, Ukraine, the Republic of Kazakhstan, the Republic of Azerbaijan, the Republic of Tajikistan, Mongolia and USA (Alaska). The exsiccata labels were not planned for digitising as it was thought better for the community to have this information published by the original herbaria.
The bryophytes subdivision of the SYKO herbarium is an important reference source for the study of the moss flora of European north-east Russia and, in particular, for such a large region (416,774 km²) as the Komi Republic (Zheleznova 1994, Degteva et al. 2001, Shubina and Zheleznova 2002).
All rare and protected moss species (43 species, 289 occurrences) are represented by specimens in the bryophyte collection. These specimens were used for the preparation of three editions of the Komi Republic Red Data Book (Taskaev 1998, Taskaev 2009, Degteva 2019). The SYKO herbarium bryophyte collection is increasing by approximately 700 curation units annually as a result of fieldwork.

Sampling methods
Study extent: The bryophytes subdivision of SYKO is divided into two collections: mosses and liverworts. The labels of the liverworts collection have not yet been digitised. However, some liverwort occurrences were added to the dataset because the liverworts were kept in the same specimen packets as mosses. The labels of the liverworts collection are planned for digitisation in the near future.
According to the SYKO bryophytes subdivision register (maintained manually since 1969), there were 58,184 specimens (45,198 mosses and 12,986 liverworts) at the beginning of August 2020. The label data of 42,698 unique moss samples (94 percent of moss collection) have been digitised to that date. The 1,697 moss storage units have duplicates (specimens that have the same label data as original specimens). We stored these duplicates in the main collection and used them for exchange with other institutions. The duplicates were not used for the described occurrence dataset preparation. The collection of mosses is characterised by the frequent presence of more than one species in one specimen (from 1 to 9 species per specimen, 1.2 on average). Some parts of the digitised labels were excluded from the described dataset. A total of 2,754 labels were used for updating the dataset "Moss occurrences in Yugyd Va National Park, Subpolar and Northern Urals, European North-East Russia" published earlier (Zheleznova et al. 2020a). The images for 3,452 occurrences published earlier were added in the field "associatedMedia" for the Yugyd Va dataset.
Thus, the dataset, described in this paper, was based on 39,916 of 42,698 digitised moss labels which allowed us to publish 49,726 occurrences (Zheleznova et al. 2020b).
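The totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative and are not taken from the authors' scripts.

```python
# Figures quoted in the text (register state at the beginning of August 2020).
moss_specimens = 45_198      # moss specimens in the register
digitised_labels = 42_698    # unique moss labels digitised
labels_in_dataset = 39_916   # labels used for the described dataset
occurrences = 49_726         # occurrences published from those labels

# Share of the moss collection that has been digitised.
digitised_share = digitised_labels / moss_specimens

# Mean number of occurrences obtained from one label in the dataset.
occurrences_per_label = occurrences / labels_in_dataset

print(f"Digitised share of the moss collection: {digitised_share:.1%}")
print(f"Mean occurrences per label: {occurrences_per_label:.2f}")
```

The resulting ratio of occurrences to labels is broadly in line with the average of about 1.2 species per specimen reported for the collection.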
Sampling description: Bryophyte herbarium samples were collected during two main types of fieldwork: floristic explorations and vegetation studies. During species identification, field samples were separated into storage specimens so that each specimen contained the minimum number of bryophyte species. Two copies of the label were generated for each sample. One copy was fixed on a bag with the dried moss sample; the second was kept in a separate label storage (a library card catalogue cabinet was used). The labels and the moss specimens themselves were arranged in alphabetical order of species names. Each moss sample was assigned a catalogue number. The catalogue numbers have been increasing since the organisation of the bryophytes subdivision in the SYKO herbarium. Information about the label catalogue number, date of collection, name of the collection place, species name, field number and habitat was entered in the register.
The labels from the label storage were used for digitisation. The label images were obtained with a digital camera. Images were uploaded to the server and their filenames were recorded in the label database. The database web interface, written specifically for this project, was used for manual label data recognition and interpretation. The following minimum set of data was deciphered (in Darwin Core terms): scientificName, recordedBy, identifiedBy, day, month, year, catalogNumber, decimalLatitude, decimalLongitude.
The digitisation of most of the moss collection labels showed that the names of 139 collectors appeared on the labels and that 38 botanists were engaged in species identification. The most productive collector and the botanist principally engaged in species identification was the same person, G. Zheleznova (Tables 1, 2).
Label images quality. Each label image was checked for readability by the operators who deciphered the label data. Images that were out of focus or had extraneous objects in the frame were deleted from the database. It was possible to recapture bad label images only if the catalogue number of the label was detectable on the discarded images. In other cases (about 6% of the total number of labels in the moss collection), a second round of label image capturing will be performed later (after forming a list of missing labels with the help of the label register).
Table 1. Top ten collectors for the SYKO herbarium moss collection dataset.
Table 2. Names of botanists who identified species for most of the occurrences of the SYKO herbarium moss collection dataset.

Check of georeferencing. Occurrence locations were added to a map with the OpenStreetMap layer and with polygon layers of Russian region borders in QGIS software (Open Source Geospatial Foundation Project 2020). The names of regions were assigned to each occurrence with the help of the "Point Sampling Tool" QGIS plugin. Occurrences located outside the land border of any Russian region and occurrences located far from the borders of the Komi Republic were subject to verification.
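The region check was done in QGIS; the same idea can be sketched in plain Python as a point-in-polygon test. This is only an illustration under stated assumptions: the function names are hypothetical, and the two rectangles stand in for real region border polygons, which they do not resemble.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in decimal degrees


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray going east crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_region(point: Point, regions: Dict[str, List[Point]]) -> Optional[str]:
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical, highly simplified rectangles (NOT real administrative borders).
REGIONS = {
    "Region A": [(45.0, 59.0), (66.0, 59.0), (66.0, 68.0), (45.0, 68.0)],
    "Region B": [(43.0, 68.0), (66.0, 68.0), (66.0, 71.0), (43.0, 71.0)],
}

# Example points as (longitude, latitude); the last one lies far to the east.
occurrences = [(50.8, 61.7), (59.82, 69.75), (160.3714, 55.72222)]

# Points that fall in no region are the ones that would need verification.
to_verify = [p for p in occurrences if assign_region(p, REGIONS) is None]
```

With real border polygons, a dedicated GIS library would normally be preferable to a hand-written test; the sketch only shows the logic of flagging points that fall outside every region.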
Text recognition quality. All label data recognised by operators were checked visually against each label image. Special boolean-like fields were added to the database table with the main label information: the check was carried out (yes / no), data clarification is required (yes / no). The label data to be checked were divided into two groups: 1) the collection date and catalogue number, 2) the names of taxa indicated on the label and the names of the people who collected the sample and who identified the species.
Additional verification of collection dates and collectors' names was carried out during label georeferencing. One collector could not have been, on the same day, at points located more than several kilometres from each other. Once most of the labels had been digitised and recognised, it became possible to compare series of labels to identify and correct obvious errors, not only those made during image data recognition, but also those made by laboratory technicians when filling out label blanks by hand. In the latter case, the corrected information was added to the database and the label was marked for replacement in the near future.
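The same-day consistency check described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the records are dictionaries keyed by Darwin Core terms, and the 10 km threshold is an assumed stand-in for "several kilometres".

```python
import math
from collections import defaultdict
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def suspicious_pairs(labels, max_km=10.0):
    """Flag pairs of labels by one collector on one day that lie too far apart.

    max_km is an assumed threshold, not a value given in the paper.
    """
    groups = defaultdict(list)
    for label in labels:
        date = (label["year"], label["month"], label["day"])
        if None in date:
            continue  # incomplete dates cannot be compared reliably
        groups[(label["recordedBy"],) + date].append(label)

    flagged = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            distance = haversine_km(
                a["decimalLatitude"], a["decimalLongitude"],
                b["decimalLatitude"], b["decimalLongitude"],
            )
            if distance > max_km:
                flagged.append((a["catalogNumber"], b["catalogNumber"], round(distance, 1)))
    return flagged


# Hypothetical records: labels 1 and 2 are close together, label 3 is far away.
labels = [
    {"catalogNumber": "1", "recordedBy": "A. Collector", "year": 1975, "month": 7, "day": 14,
     "decimalLatitude": 61.67, "decimalLongitude": 50.81},
    {"catalogNumber": "2", "recordedBy": "A. Collector", "year": 1975, "month": 7, "day": 14,
     "decimalLatitude": 61.68, "decimalLongitude": 50.82},
    {"catalogNumber": "3", "recordedBy": "A. Collector", "year": 1975, "month": 7, "day": 14,
     "decimalLatitude": 65.00, "decimalLongitude": 57.00},
]
flagged = suspicious_pairs(labels)
```

A flagged pair does not show which label is wrong; it only identifies labels whose date, collector or coordinates deserve a second look against the label image.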
Taxonomy validation. In many cases, the verbatim taxon names indicated on the labels were out of date and no longer valid. Only professional bryologists acted as operators for taxon name recognition, so verbatim names were corrected on the fly during data entry. The next step of taxon name checking was normalising species names against the GBIF backbone (https://www.gbif.org/tools/species-lookup). The species names and higher taxonomy normalised against the GBIF backbone were then updated manually by our bryologists to bring the taxon name usage into concordance with the latest moss checklists (Ignatov et al. 2006, Hodgetts et al. 2020).
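The authors used the GBIF species lookup web tool. A comparable check could be scripted against GBIF's public species match service, which is an assumption here and not the method described in the paper. In the sketch below, the helper names are hypothetical, and the response keys `matchType` and `status` should be verified against the current GBIF API documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GBIF species match endpoint (assumed; check the GBIF API documentation).
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name: str, kingdom: str = "Plantae") -> str:
    """Build the query URL for one taxon name."""
    return GBIF_MATCH_URL + "?" + urlencode({"name": name, "kingdom": kingdom})


def fetch_match(name: str) -> dict:
    """Query the service for one name (requires network access)."""
    with urlopen(build_match_url(name), timeout=30) as response:
        return json.load(response)


def needs_review(match: dict) -> bool:
    """Flag names that are not an exact match to an accepted backbone name."""
    return match.get("matchType") != "EXACT" or match.get("status") != "ACCEPTED"
```

Names flagged by `needs_review` would still need a decision by a bryologist, as in the manual step described above.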
Dataset validation. The publication-ready Darwin Core compliant dataset was generated as a CSV file by a Python script which included SQL queries to the database. This file was checked for errors manually with the data filtering function of spreadsheet software and automatically with the GBIF Data Validator service (https://www.gbif.org/tools/datavalidator).
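A minimal sketch of such an export step is shown below, assuming a hypothetical two-table schema (`labels`, `label_taxa`) in SQLite. The real database engine, schema, occurrenceID pattern and field list are not described in the paper, so every name and value here is illustrative.

```python
import csv
import io
import sqlite3

# Subset of Darwin Core terms used as CSV column names in this sketch.
DWC_FIELDS = ["occurrenceID", "catalogNumber", "scientificName",
              "decimalLatitude", "decimalLongitude", "year", "month", "day"]


def export_dwc_csv(conn: sqlite3.Connection, out) -> int:
    """Write one CSV row per (label, taxon) pair and return the row count."""
    query = """
        SELECT l.catalog_number, t.scientific_name,
               l.latitude, l.longitude, l.year, l.month, l.day
        FROM labels AS l
        JOIN label_taxa AS t ON t.label_id = l.id
        ORDER BY l.catalog_number, t.scientific_name
    """
    writer = csv.DictWriter(out, fieldnames=DWC_FIELDS)
    writer.writeheader()
    count = 0
    for cat, name, lat, lon, year, month, day in conn.execute(query):
        count += 1
        writer.writerow({
            "occurrenceID": f"example:{cat}:{count}",  # hypothetical ID pattern
            "catalogNumber": cat,
            "scientificName": name,
            "decimalLatitude": lat,
            "decimalLongitude": lon,
            "year": year,
            "month": month,
            "day": day,
        })
    return count


# Hypothetical in-memory database: one label bearing two moss species.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE labels (id INTEGER PRIMARY KEY, catalog_number TEXT,
                         latitude REAL, longitude REAL,
                         year INTEGER, month INTEGER, day INTEGER);
    CREATE TABLE label_taxa (label_id INTEGER, scientific_name TEXT);
    INSERT INTO labels VALUES (1, '1001', 61.67, 50.81, 1975, 7, 14);
    INSERT INTO label_taxa VALUES (1, 'Pleurozium schreberi');
    INSERT INTO label_taxa VALUES (1, 'Hylocomium splendens');
""")
buffer = io.StringIO()
row_count = export_dwc_csv(conn, buffer)
```

Joining labels to taxa in the query is what turns one label with several species into several occurrence rows, which matches the way the dataset has more occurrences than labels.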
2. Batches of label images were captured per box (drawer) of the label catalogue, with strict adherence to the order of labels in each box. Labels in boxes are kept in alphabetical order of taxon names. Labels of samples collected on the same dates by the same collectors were often grouped together within a box. Capturing label images in the order in which the labels were kept significantly simplified the data recognition process for operators. Images were taken with a digital camera with a minimum frame size of 4000 × 3000 pixels.
3. Batches of up to several thousand JPEG label images were processed simultaneously. Each image was cropped to remove most of the background, so that the image size became approximately 2000 × 1500 pixels. The white balance of all images was automatically adjusted with Fred Weinhaus's 'autowhite' script for ImageMagick software (http://www.fmwconcepts.com/imagemagick/autowhite).
4. Cropped images were uploaded to the server and their file paths were added to the label database.
5. An operator deciphered the label data with a web application. Different web forms were used for different types of data: entering the catalogue number and collection date; entering the names of taxa; entering the names of the collectors and of the persons who identified the taxa; entering geographic coordinates. Dates were entered as three separate numbers: day, month and year. This storage format made it possible to process labels with the day or month omitted from the collection date. Qualified bryologists entered the names of taxa, the names of the collectors and the names of the persons who identified the moss species. Georeferencing of labels was performed by an engineer with cartographic skills. In some cases, it was possible to question the collector of the sample to determine the coordinates more accurately.
6. All entered data (excluding geographic coordinates) were checked with special forms in the web application. Label images were compared with the entered data, and errors were either corrected immediately or marked for later correction.
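Storing the date as three separate numbers, as in step 5, makes incomplete dates easy to handle. The sketch below shows one way to turn such a triple into an ISO 8601 string of reduced precision; the function name is hypothetical, and the paper does not state that such a combined date field was produced.

```python
import datetime
from typing import Optional


def to_iso_date(year: Optional[int],
                month: Optional[int] = None,
                day: Optional[int] = None) -> Optional[str]:
    """Format a possibly incomplete date as YYYY, YYYY-MM or YYYY-MM-DD."""
    if year is None:
        return None  # nothing usable on the label
    if month is None:
        return f"{year:04d}"  # a day without a month is ignored
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if day is None:
        return f"{year:04d}-{month:02d}"
    # Constructing a date raises ValueError for impossible days (e.g. 30 February).
    datetime.date(year, month, day)
    return f"{year:04d}-{month:02d}-{day:02d}"
```

For example, a label with only the year and month recorded yields a value such as `1975-07`, while a complete date yields `1975-07-14`.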

Geographic coverage
Description: Most of the dataset occurrences were located in the territory of European north-east Russia. Only one occurrence was located far from this region, on the Kamchatka Peninsula (55.72222°N, 160.3714°E). The convex hull of most of the occurrences (the enclosing polygon with the shortest perimeter) had an area of approximately 820,000 square kilometres (Fig. 1). In total, the dataset contained 3,918 collection sites for bryophyte specimens with unique geographic coordinates. The point with the largest number of occurrences (1,564) was located on Vaygach Island (69.75°N, 59.82°E).
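The paper does not say how the convex hull was computed. As an illustration only, the sketch below builds a convex hull with Andrew's monotone chain algorithm and measures its area with the shoelace formula. It works on planar coordinates, so real longitude and latitude values would first have to be projected to an equal-area coordinate system to obtain an area in square kilometres; the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # planar (x, y) coordinates


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def polygon_area(polygon: List[Point]) -> float:
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical planar points: four corners of a rectangle and one interior point.
sites = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
hull = convex_hull(sites)
area = polygon_area(hull)
```

An outlier such as the single Kamchatka occurrence would have to be removed before the hull is built, since one distant point enlarges the hull considerably.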
Most of the published occurrences were located in the territory of the Komi Republic (86% of all occurrences) and the Nenets Autonomous District (12%). The remaining occurrences (2%) were collected mainly in the territory of seven regions neighbouring the Komi Republic (Table 3).
Most of the moss species sampled in the herbarium were sufficiently abundant to be collected in hundreds and sometimes thousands of samples. Seven moss species were represented by more than 1000 occurrences (Table 5). These are the most widespread mosses in European north-east Russia. They account for 24% of all moss finds in the published dataset.
Table 5. Moss species with more than 1000 occurrences.

Coordinates

Figure 2. Taxonomic diversity of moss families in the dataset. The figure was prepared with the "treemap" package in R (Tennekes 2017).
Table 6. Top ten families with most numerous occurrences.

taxonRank: The taxonomic rank of the most specific name in the scientificName.
kingdom: The full scientific name of the kingdom in which the taxon is classified.
phylum: The full scientific name of the phylum or division in which the taxon is classified.
class: The full scientific name of the class in which the taxon is classified.
order: The full scientific name of the order in which the taxon is classified.
family: The full scientific name of the family in which the taxon is classified.
genus: The full scientific name of the genus in which the taxon is classified.
year: The ordinal year in which the sample was collected.
country: The name of the country or major administrative unit in which the Location occurs.
countryCode: The standard code for the country in which the Location occurs.
decimalLatitude: The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location.
decimalLongitude: The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location.
geodeticDatum: The ellipsoid, geodetic datum or spatial reference system (SRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude are based.
coordinateUncertaintyInMeters: The horizontal distance (in metres) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location.
scientificName: An identifier for the nomenclatural (not taxonomic) details of a scientific name.
nameAccordingTo: An identifier for the source in which the specific taxon concept circumscription is defined or implied.
specificEpithet: The name of the first or species epithet of the scientificName.
infraspecificEpithet: The name of the lowest or terminal infraspecific epithet of the scientificName, excluding any rank designation.
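The georeferencing precision reported for the dataset (more than 99.9% of occurrences within 3 km) is the kind of figure that can be recomputed from the coordinate uncertainty column. The sketch below assumes the Darwin Core column name `coordinateUncertaintyInMeters` and uses a few hypothetical rows, not records from the dataset; the function name is illustrative.

```python
import csv
import io


def share_within(rows, max_uncertainty_m: float = 3000.0) -> float:
    """Fraction of rows whose coordinate uncertainty is known and within the limit."""
    total = 0
    within = 0
    for row in rows:
        total += 1
        value = (row.get("coordinateUncertaintyInMeters") or "").strip()
        # Rows with an empty uncertainty are counted as not meeting the limit.
        if value and float(value) <= max_uncertainty_m:
            within += 1
    return within / total if total else 0.0


# Hypothetical sample rows (values invented for illustration only).
SAMPLE = (
    "catalogNumber,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters\n"
    "1,61.67,50.81,500\n"
    "2,61.68,50.82,3000\n"
    "3,65.00,57.00,5000\n"
    "4,66.10,58.20,\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
share = share_within(rows)
```

Treating empty values as failing the limit is a conservative design choice; a different convention would change the reported share.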