Biodiversity Data Journal :
Methods
Corresponding author: Andriy Novikov (novikoffav@gmail.com)
Academic editor: Patricia Mergen
Received: 05 Feb 2025 | Accepted: 21 Mar 2025 | Published: 28 Mar 2025
© 2025 Andriy Novikov, Viktor Nachychko
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Novikov A, Nachychko V (2025) The digitisation workflow of the herbarium of the State Museum of Natural History of the NAS of Ukraine (LWS). Biodiversity Data Journal 13: e148861. https://doi.org/10.3897/BDJ.13.e148861
The digitisation workflow currently applied at the Herbarium of the State Museum of Natural History of the National Academy of Sciences of Ukraine (LWS) differs from other similar workflows in its cascade ('object-to-data-to-image') multilevel organisation. Its application is dictated by the need to preselect specimens by taxon and region, as well as by batched digitisation, which occurs with significant interruptions. Focusing on certain taxonomic groups from specific regions allows us to prioritise specimens that could be more valuable for early scientific processing. At the same time, the herbarium benefits from such a digitisation model by revising the existing collection classification and keeping the initial ID system. The presented digitisation workflow can be easily reproduced in any herbarium with a limited budget. The purpose of this paper is to provide a detailed description and schemas of the principal digitisation stages applied at the LWS Herbarium and to briefly discuss the steps crucial for a successful result. The information provided should help to maintain the digitisation and choose appropriate equipment and materials. We conclude that, despite its general complexity, the described workflow has proven viable and relevant due to its robust design and focus on data quality. Despite a focus on specialists' involvement, it maintains flexibility that allows combining volunteers and, if needed, outsourced efforts. Moreover, its modularity promotes independence of principal digitisation stages and allows long interruptions between the digitisation batches.
herbarium, natural history collections, digitisation workflow, imaging, data mobilisation
Herbarium digitisation is a crucial modern task providing numerous benefits such as biodiversity data mobilisation and remote access to collections (
The Herbarium of the State Museum of Natural History of the NAS of Ukraine (LWS) was established in the city of Lviv in 1832. It is the third oldest and seventh richest herbarium in Ukraine (
Digitisation is a complex and multi-level process that is organised differently in different institutions, depending on financial capabilities, technical support, qualified personnel, collection size and time frames. Usually, the complete digitisation cycle includes several key stages: specimen selection (optional), pre-digitisation curation (cleaning, mounting, barcode placing etc.), databasing, data enhancement (e.g. georeferencing; optional), imaging, publishing and archiving (
Different digitisation workflows and their stages can be separated both in time and space and have irregular or linear order, determining two principal digitisation approaches: 'object-to-image-to-data workflow' and 'object-to-data-to-image workflow' (
The 'object-to-image-to-data workflow' assumes a linear organisation of digitisation when the specimen operation, imaging and data processing occur successively. It is most widely applied and focuses on obtaining the maximum number of digital images within a limited time. In such a case, data are extracted directly from the specimen image and then processed. Due to high automatisation (sometimes with conveyor belt), often this approach implements fast extraction of a minimum amount of data about the specimen (e.g. barcode and/or specimen ID, taxon name, country and/or local region and, occasionally, collector name and collection date) with the idea that these data can be supplemented or corrected later (
The linear digitisation approach is probably the most cost-effective, but it also has its drawbacks. Automatisation of the processes and the involvement of unqualified personnel can lead to mistakes, which can potentially remain uncorrected or incomplete for an indefinite time. The second disadvantage of such an approach is the potential delay in obtaining even minimal data since they are obtained only after the production of digital images. In the case of using automatic text recognition technologies (i.e. optical character recognition - OCR or handwritten text recognition - HTR), such a delay may be insignificant (
At the LWS Herbarium, we applied the second digitisation approach, the 'object-to-data-to-image workflow' (
The main disadvantage of the cascade approach is its complexity. Requiring qualified personnel makes digitisation more labour-intensive, expensive and time-consuming (
Current digitisation at the LWS Herbarium is focused on the priority group comprising specimens of endemic, rare and relict taxa of the Ukrainian Carpathians (
Ideally, for the digitisation of an entire herbarium, the labels of all specimens should be captured folder by folder and processed accordingly. At the same time, all new incoming specimens should also be digitised before entering the collection. However, where funding is limited and staff rotate, prioritising certain specimens for digitisation could be a more viable strategy, allowing us to digitise small portions of the collection case by case. This requires extra physical handling of the specimens, but it avoids producing a surplus of files that can remain unprocessed for years or become outdated (e.g. if the specimen has been re-identified).
Volunteers or technical staff with moderate experience in the taxonomy and geography of the region can be involved at this stage. The labels and IDs can be photographed using any digital camera or smartphone. In the LWS Herbarium, labels are usually located in the bottom right corner and accession numbers (six-digit IDs, sometimes with the leading zeros omitted) are in the bottom left corner of the herbarium sheet. Therefore, only the bottom part of the herbarium sheet needs to be photographed (Fig.
The common view and organisation of the pictures captured during the labels imaging. Examples of horizontal (A, B) and vertical (C) images containing elements of the herbarium sheet required for the databasing: 1 - accession number; 2 - label(s); 3 - field number. Example of the files organisation (D) representing the key elements of the folder with labels images: 4 - readme file with a list of species/infraspecies represented in the current folder and supporting notes; 5 - images of empty table captured after the last specimen of each species/infraspecies helping to navigate between them quickly; 6 - the label images of the first species/infraspecies; 7 - the label images of the second species/infraspecies; 8 - the label images of the third species/infraspecies.
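Because the accession numbers written on the sheets sometimes lack their leading zeros, transcribed IDs may need to be restored to the six-digit form before they are matched against the dataset or the barcodes. The following Python sketch shows one way this could be done; the function name `normalise_accession_number` and the rejection rules are illustrative assumptions and are not part of the LWS protocol.

```python
import re


def normalise_accession_number(raw: str, width: int = 6) -> str:
    """Return the accession number zero-padded to `width` digits.

    Hypothetical helper: '4521' becomes '004521'. Anything that is not a
    plain run of ASCII digits, or that is longer than `width`, is rejected
    so that a specialist can inspect it manually.
    """
    digits = raw.strip()
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError(f"not a numeric accession number: {raw!r}")
    if len(digits) > width:
        raise ValueError(f"accession number longer than {width} digits: {raw!r}")
    return digits.zfill(width)


if __name__ == "__main__":
    for value in ("4521", " 123456 ", "77"):
        print(repr(value), "->", normalise_accession_number(value))
```

Rejecting unexpected values rather than guessing keeps doubtful IDs visible for manual checking, which matches the workflow's emphasis on specialist quality control.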
The flow chart of imaging the labels of specimens of predefined taxa from a particular region. The herbarium folders are carefully checked for the specimens corresponding to the predefined checklist. All potential synonyms must be checked and respective notes must be made in a checklist and/or readme file. The labels and IDs of all specimens meeting the checklist criteria and of doubtful ones (requiring additional examination and/or clarification of the collection region) are photographed. All the specimens that do not meet the predefined checklist are omitted. There is no need to take highly detailed photos at this stage (except for doubtful specimens), but the quality of the resulting picture must be sufficient to read the text. Consistent illumination and orientation of the pictures are preferable, as this later helps to process the images faster.
The image files should preferably be sorted into folders named after the processed taxa (species or infraspecies). However, in the case of intensive digitisation with many different taxa processed daily, placing the files in folders named by working date could be more convenient, with the prospect of sorting the files by taxa later or not sorting them at all.
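Sorting by working date can be scripted. The Python sketch below moves label images into subfolders named after the date on which each file was last modified; the function name, the `YYYY-MM-DD` folder naming and the list of image suffixes are assumptions made for illustration, and the file modification time is used only as a stand-in for the working date.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Assumed set of label-image formats; adjust to the camera or phone in use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sort_by_working_date(source: Path) -> dict[str, list[str]]:
    """Move images in `source` into subfolders named YYYY-MM-DD.

    Non-image files (e.g. a readme) are left where they are.
    Returns a mapping of folder name to the file names moved into it.
    """
    moved: dict[str, list[str]] = {}
    # sorted() materialises the listing before any file is moved.
    for path in sorted(source.iterdir()):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        day = datetime.fromtimestamp(path.stat().st_mtime).strftime("%Y-%m-%d")
        target_dir = source / day
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved.setdefault(day, []).append(path.name)
    return moved
```

A later pass could move the same files into taxon-named folders once the labels have been transcribed.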
During the bulk photographing of the specimens' labels in the LWS Herbarium, the contrasting pictures (e.g. of an empty table or blank sheet) are made after the last specimen of each species or infraspecies (Fig.
The LWS Herbarium contains ca. 147,000 specimens, many of which were collected in the second half of the 1800s and in the 1900s by a few dozen principal collectors. These specimens mostly have handwritten labels, some of which are partly damaged. They were primarily written in Polish, Ukrainian and Russian. However, many labels were also written in Latin, Slovak, Romanian, French, German and Hungarian. Few recent specimens were annotated in English. Some specimens combine labels in several languages. Therefore, processing such specimens requires at least basic skills in multiple languages and palaeography.
The data from the labels at the LWS Herbarium are extracted only by qualified persons directly from the herbarium labels using the protocol depicted in Fig.
The flow chart of data mobilisation from the herbarium labels and data enhancement. Initially, the data are stored in a Microsoft Excel table. This table is formatted following the DarwinCore standard (
The LWS Herbarium does not use OCR or HTR technologies to parse the data from the herbarium labels due to lack of expertise and the significant variation of handwriting and languages presented on labels. Unfortunately, the test application of Transkribus (
However, we found it helpful for data processing to create an additional list of the collectors (at the moment, it comprises 374 records), which contains standardised collectors' names in English and shortenings following the IPNI database (
Pre-imaging preparation of the specimens is a multi-level process that includes different tasks, most of which can be done by technicians, but specialists must still be involved in quality control. Our experience revealed that, even after qualified transcription of the labels, ca. 5-7% of specimens require additional data modifications and clarifications at the stage of pre-imaging preparation (Fig.
The flow chart of pre-imaging preparation of the specimens. Based on the dataset developed during the label transcription, required specimens are selected from the collection and placed on a separate working table for further pre-imaging preparation. The initial preparation of the specimen before imaging includes mounting unbound parts of the specimen and labels to the herbarium sheet, packing the small plant parts into the envelope attached to the sheet, restoring labels or preparing new ones and changing damaged herbarium sheets and/or covers. The second part of the specimen preparation includes placing the stamp of the herbarium, checking and stamping (if absent) the ID (specimen accession number), attaching the barcode label, attaching the nota critica with re-identification (if needed) and stamping the sign 'Digitised'. The barcode can potentially become detached, so stamping or writing the accession number directly on the herbarium sheet is required. If the specimen lacks an accession number or has a duplicate one, a new accession number is assigned to it and recorded in the dataset. At the LWS Herbarium, barcodes with six-digit accession numbers as locally unique IDs (LUIDs) are applied. Such barcodes are prepared using Zint Barcode Studio software (
Colour reference charts (colour checkers) are generally required for digitisation, including the herbarium specimens digitisation (
A scale bar, crucial for estimating the size of the specimen, is present on many colour reference charts. However, many herbaria use branded scale bars placed beside the colour reference chart. Such scale bars can be printed on special transparent non-reflective film (e.g. biaxially-orientated polyethylene terephthalate (BoPET) film DuPont Mylar), which allows them to be placed over the specimen. At the LWS Herbarium, the X-Rite ColorChecker Classic Mini and ISA Golden Thread Object-Level Target charts, each containing scale bars, are used for digitising specimens of vascular plants. Therefore, we do not apply an additional branded scale bar, since it would require additional space and handling. However, for digitising specimens of non-vascular plants, we use a smaller colour reference chart, Charttu Nano (without a preprinted scale bar), supplemented by an additional branded scale bar.
Herbarium digitisation can require introducing a new specimen ID system, which can either be used in parallel with the existing one or replace it. However, replacing old IDs with new ones can create errors when specimens have already been cited in publications under their old IDs. Therefore, the new ID system must have unique IDs that do not overlap with the old ones. This can be achieved by adding a unique prefix to the new IDs or by applying GUIDs. The image must contain a machine-readable barcode or QR code for better automatisation. Sometimes, specimen IDs can be mistakenly duplicated or missing in the collection. The preselection of specimens, printing and placing the barcodes (Code 128, ISO 15417) or QR codes (ISO 18004) corresponding to specimens' original IDs, as well as identification of erroneous IDs, are time-consuming. Therefore, a new ID system is often introduced in which barcoded or QR-coded IDs are printed and placed consecutively or randomly on the specimens. A good practice in such a case is using globally unique identifiers (GUIDs) to ensure effective digital data curation and databasing (
Keeping the same order of the elements on the herbarium sheet is useful. It helps to navigate over the specimen elements on its image. At the LWS Herbarium, the specimen label is typically placed in the right bottom corner, while the specimen accession number (LUID) is written or typed in the left bottom corner (Fig.
The final image of the entire specimen. The key elements represented on the herbarium sheet during the imaging: 1 - principal label; 2 - handwritten or stamped accession number (LUID); 3 - barcode corresponding to the accession number; 4 - herbarium stamp; 5 - stamp confirming the digitisation of the specimen; 6 - colour reference chart (ISA Golden Thread object-level target); 7 - supplementary label (indicating inclusion of the specimen in a local research programme).
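Since the workflow keeps the original accession numbers as LUIDs, missing and duplicated numbers have to be found before barcodes are printed. The Python sketch below illustrates such an audit on transcribed records; the `(row, accession_number)` input structure and the function name are hypothetical, and the zero-padding to six digits is an assumption based on the six-digit IDs described for the collection.

```python
from collections import defaultdict
from typing import Iterable, Optional


def audit_accession_numbers(
    records: Iterable[tuple[int, Optional[str]]],
) -> tuple[list[int], dict[str, list[int]]]:
    """Report rows with missing IDs and IDs shared by several rows.

    `records` yields (row, accession_number) pairs. IDs are zero-padded
    to six digits first, so '4521' and '004521' count as the same number.
    """
    missing: list[int] = []
    rows_by_number: dict[str, list[int]] = defaultdict(list)
    for row, number in records:
        cleaned = (number or "").strip()
        if not cleaned:
            missing.append(row)
        else:
            rows_by_number[cleaned.zfill(6)].append(row)
    duplicates = {
        number: rows for number, rows in rows_by_number.items() if len(rows) > 1
    }
    return missing, duplicates


if __name__ == "__main__":
    sample = [(1, "004521"), (2, "4521"), (3, ""), (4, None), (5, "000777")]
    print(audit_accession_numbers(sample))
```

The rows reported here would still need to be resolved by a curator, for example by assigning a new accession number and recording it in the dataset.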
After pre-imaging preparation, the imaging can be done by technicians, volunteers or outsourced to a company. This stage is entirely related to image production, adjustment and file organisation (Fig.
The flow chart of imaging and image processing. Before imaging, attention is paid to several important steps. In particular, the presence of all key elements (i.e. principal label, ID, barcode, herbarium stamp, digitisation stamp, colour reference chart and supplementary labels) must be checked on the herbarium sheet. The illumination must be adjusted according to the indicated preferences and its uniformity across the specimen must be checked. The camera lens must be kept clean; any dirt or dust must be carefully removed with the tools provided for this purpose (brush and microfibre cloth). The free space on the SD card and the predefined camera preferences are also checked. The herbarium specimens must be handled carefully, one by one, avoiding mixing and damage. During the processing of the images, it is important to keep RAW files intact; only renaming is allowed. The images in distribution format (i.e. JPEG) are subject to further manipulations (i.e. cropping, rotation, colour adjustment etc.). Sharpening of the images is not permitted.
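The rule that RAW files may only be renamed can be checked mechanically by comparing a checksum taken before and after the renaming. The Python sketch below is one possible way to do this; naming the file after the accession number and the function names are assumptions made for illustration, not the documented LWS file-naming scheme.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rename_raw_intact(raw_path: Path, accession_number: str) -> Path:
    """Rename a RAW file (hypothetical scheme: <accession number><suffix>)
    and confirm that its content is byte-for-byte unchanged."""
    before = sha256_of(raw_path)
    target = raw_path.with_name(f"{accession_number}{raw_path.suffix}")
    if target == raw_path:
        return target  # already named as requested
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    raw_path.rename(target)
    if sha256_of(target) != before:
        raise RuntimeError(f"content of {target} changed during renaming")
    return target
```

Storing the digest alongside the archived master file would also allow the same check to be repeated when storage media are renewed.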
The final imaging of the herbarium specimens can be realised using different technical solutions, i.e. scanners (including flatbed and planetary - see
Quality is crucial for images of herbarium specimens, as they can be examined at different magnifications.
It is worth noting that image sharpening is not allowed for the archiving master files (
The publishing stage includes publishing the data, publishing the images and crosslinking the data and images (Fig.
The flow chart of final data verification, cross-linking with images and publishing on GBIF. Special attention is paid to data uniformity and consistency. Data from different rows must not be mixed. Data verification is realised in the OpenRefine (
Biodiversity data should be published and permanently archived in appropriate, trusted, general or domain-specific repositories (
The LWS Herbarium does not have its virtual herbarium platform and Ukraine does not have a joint specialised national platform or portal for publishing data on digitised collections. Only two independent Ukrainian platforms allow the publication of biodiversity data, including those obtained from the digitised collections, i.e. the Ukrainian Biodiversity Information Network (
Considering this, the LWS Herbarium data are published online on several platforms simultaneously, i.e.
In archiving, the application of storage media (e.g. optical discs) produced by different manufacturers is advisable to avoid possible manufacturing defects and potentially low production quality (
Similarly to the case of data publishing, the stored materials must be well-structured, well-annotated, clearly identifiable and represented in open and raw formats (
Simultaneous use of several different types of storage media (e.g. magnetic tapes and external hard drives) for the herbarium digitisation is advised by
The flow chart of the data and images archiving. At the LWS Herbarium, the data are archived on three types of storage media: (a) on the internal server using the service of the National Academy of Sciences of Ukraine; (b) on SD memory cards (128 GB SanDisk Extreme Pro cards, which can store ca. 1,024 image files in RW2 format or 6,400 in JPEG format); (c) on Blu-ray MABL discs (25 GB Verbatim SL BD-R, which can store ca. 200 image files in RW2 format or 1,250 in JPEG format). Such a combination of storage media (magnetic, electronic and optical) provides long-term preservation of the digitised data and images. Besides this, datasets are also archived on Zenodo (
However, combining several different archiving media has drawbacks: controlling the preservation conditions and data persistence can be complicated. Moreover, the State Museum of Natural History of the NAS of Ukraine has no common archiving and data management strategy yet. Developing a common archiving strategy would help resolve questions about file formats, priority long-term storage media, cyclic media renewal terms and the distribution of personnel roles in data archiving.
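The capacities quoted for the SD cards and Blu-ray discs correspond to average file sizes of roughly 125 MB per RW2 file and 20 MB per JPEG file, if decimal units are assumed (128,000 MB / 1,024 and 25,000 MB / 200 both give 125 MB). The Python sketch below uses these implied values to estimate how many media a digitisation batch would need; storing each specimen's RW2 and JPEG files together on the same medium is an assumption, and real usable capacity is somewhat lower than the nominal one.

```python
import math

# Average file sizes implied by the quoted capacities (decimal MB, assumed).
RW2_MB = 125
JPEG_MB = 20

# Nominal capacities of the two removable media types, in decimal MB.
CAPACITY_MB = {"sd_128gb": 128_000, "bdr_25gb": 25_000}


def files_per_medium(capacity_mb: int, file_mb: int) -> int:
    """Number of whole files of a given size that fit on one medium."""
    return capacity_mb // file_mb


def media_needed(
    n_specimens: int, capacity_mb: int, per_specimen_mb: int = RW2_MB + JPEG_MB
) -> int:
    """Number of media needed if each specimen's files stay on one medium."""
    if n_specimens <= 0:
        return 0
    specimens_per_medium = capacity_mb // per_specimen_mb
    return math.ceil(n_specimens / specimens_per_medium)


if __name__ == "__main__":
    for name, capacity in CAPACITY_MB.items():
        print(name, media_needed(1_000, capacity))
```

Such an estimate could feed into the planning of media renewal cycles once a common archiving strategy is in place.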
The digitisation of the herbarium specimens at the LWS Herbarium is a multilevel process organised in a cascade manner. This so-called 'object-to-data-to-image' workflow prioritises data extraction and enhancement to meet MIDS levels 2 and 3. Such digitisation is complicated, but involves accessing the physical specimens twice (first during the label photographing and again during the capture of the entire specimens), resulting in an extra step of quality control. In contrast, in the case of a linear approach, the physical specimen is usually accessed only once and the later processing is carried out using its digital copy (the so-called 'digital specimen'). Additional access to the specimens increases the negative load on the collection as a result of extra handling. However, it allows errors and gaps in data (e.g. incorrect identifications, duplicated numbers, misplaced specimens etc.) to be revealed and fixed early, whereas they can be overlooked when only the digital specimen is used. At the same time, such an approach allows early preselection of the specimens for digitisation and their extra filtering before the imaging of the entire specimens. Such filtering reveals specimens that fall outside the digitisation priority but were mistakenly considered to belong to it, for example, a specimen initially considered to have been collected in the Ukrainian Carpathians that, after detailed examination, turned out to have been collected outside this geographic range. Hence, the cascade approach principally focuses on thematic (i.e. focused on specific taxa and certain regions) digitisation in parallel to data mobilisation, enhancement and publishing (e.g. through GBIF). Similarly to the linear approach, it allows the involvement of volunteers and non-qualified staff at certain stages. However, the data-intensive strategy applied in a cascade approach requires extra involvement of specialists familiar with plant biology (especially nomenclature and taxonomy) and regional geography.
The digitisation workflow described in this article and illustrated by the detailed schemas is intended to help other herbaria with limited budgets to effectively organise the digitisation and virtual publishing of the data regarding their collections. Clarifications and notes on each digitisation stage are provided to help choose appropriate equipment, materials, archiving media and other technical solutions.
This paper has been prepared as part of the project “Digitisation of natural history collections damaged as a result of hostilities and related factors: development of protocols and implementation on the basis of the State Museum of Natural History of the National Academy of Sciences of Ukraine” (Nr 2022.01/0013), financed by the National Research Foundation of Ukraine in the grant programme “Science for the Recovery of Ukraine in the War and Post-War Periods”.
The supplement contains the list of the equipment, materials, software and online resources applied during the digitisation of the specimens at the LWS Herbarium.