Biodiversity Data Journal :
Methods
Corresponding author: Abraham Nieva de la Hidalga (nievadelahidalgaa@cardiff.ac.uk)
Academic editor: James Macklin
Received: 04 Oct 2019 | Accepted: 11 Feb 2020 | Published: 26 Mar 2020
© 2020 Abraham Nieva de la Hidalga, Paul Rosin, Xianfang Sun, Ann Bogaerts, Niko De Meeter, Sofie De Smedt, Maarten Strack van Schijndel, Paul Van Wambeke, Quentin Groom
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nieva de la Hidalga A, Rosin PL, Sun X, Bogaerts A, De Meeter N, De Smedt S, Strack van Schijndel M, Van Wambeke P, Groom Q (2020) Designing an Herbarium Digitisation Workflow with Built-In Image Quality Management. Biodiversity Data Journal 8: e47051. https://doi.org/10.3897/BDJ.8.e47051
Digitisation of natural history collections has evolved from creating databases for the recording of specimens’ catalogue and label data to include digital images of specimens. This has been driven by several important factors, such as a need to increase global accessibility to specimens and to preserve the original specimens by limiting their manual handling. The size of the collections pointed to the need for high-throughput digitisation workflows. However, digital imaging of large numbers of fragile specimens is an expensive and time-consuming process that should be performed only once. To achieve this, the digital images produced need to be useful for the largest set of applications possible and have a potentially unlimited shelf life. The constraints on digitisation speed need to be balanced against the applicability and longevity of the images, which, in turn, depend directly on the quality of those images. As a result, the quality criteria that specimen images need to fulfil influence the design, implementation and execution of digitisation workflows. Different standards and guidelines for producing quality research images from specimens have been proposed; however, their actual adaptation to suit the needs of different types of specimens requires further analysis. This paper presents the digitisation workflow implemented by Meise Botanic Garden (MBG). This workflow is relevant because of its modular design, its strong focus on image quality assessment, its flexibility, which allows combining in-house and outsourced digitisation, processing, preservation and publishing facilities, and its capacity to evolve by integrating alternative components from different sources. 
The design and operation of the digitisation workflow is provided to showcase how it was derived, with particular attention to the built-in audit trail within the workflow, which ensures the scalable production of high-quality specimen images and how this audit trail ensures that new modules do not affect either the speed of imaging or the quality of the images produced.
Data capture, digitisation workflow, image quality control, herbarium sheets, natural history collections, digital specimen
Digital imaging of large numbers of fragile specimens is an expensive and time-consuming process that is likely to be done only once. Consequently, the digital images produced need to be useful for the largest set of applications possible and have a potentially infinite shelf life (in theory). The applicability and longevity of the images depend directly on their quality. As a result, the quality criteria that specimen images need to fulfil must influence the design, implementation and execution of the digitisation workflows. All aspects of the workflow are affected, including selection of equipment, definition of image and data formats, digital curation practices, image processing software and definition of operational constraints. In response to this challenge, the digitisation team at Meise Botanic Garden (MBG) has designed, implemented and operated a modular digitisation workflow which can support in-house and outsourced digitisation campaigns. The workflow is operated normally to support the continuous digitisation of specimens using in-house facilities handling hundreds of specimens per day. However, it can scale-up to support mass digitisation campaigns which process thousands of specimens daily. This article reports on the design and operation of the digitisation workflow and is provided to showcase how this workflow was designed and implemented, particularly looking at the built-in audit trail within the workflow, which ensures the production of high-quality specimen images. Mass digitisation of the world’s specimens is required for two main reasons. Firstly, digitisation can provide a permanent record of a specimen even if the original eventually deteriorates or becomes unavailable (lost or destroyed). Digitisation also reduces wear on specimens because they are not handled manually every time they are consulted. Secondly, perhaps more importantly, digitisation will significantly enhance the accessibility of specimens. 
For many applications, a digital image of a specimen can replace the original, with the added benefits of being endlessly shared, duplicated, edited and printed. Accordingly, a fully digitised herbarium is a useful research tool for scientists, not only locally, but also globally. For example, digitisation can be relevant for countries in the tropics and southern hemisphere, given that many important collections from those countries are held in institutions located in Europe and North America. Considering that the ideal is to digitise collections once, it is critical that the digitisation process does not limit the eventual uses of the images. Potential applications include basic ones, such as determining the identity of the specimen and reading the label details but they might also include automated extraction of character traits using pattern recognition or information extraction from labels and annotations through optical character recognition (
Specimen digitisation workflows run at different paces depending on the degree of automation, type of collection and specimen handling protocols (e.g.
The paper is structured as follows. The first section describes the context in which the original workflow was implemented and the drivers for modularising and extending the workflow to allow the participation of external providers, as well as the quality criteria observed during its design and evolution. The second section describes the digitisation workflow, describing the tasks performed, the actors participating in its execution and the products derived from it. The third section elaborates on the implementation of image quality management within the workflow, by describing the images' audit trail. The fourth section discusses the results obtained with the operation of the MBG workflow and compares it to similar efforts. Finally, the fifth section describes the conclusions and further work.
The world’s 3,095 active herbaria contain an estimated 387 million specimens (
Until 2015, MBG had digitised over 100,000 specimens, mainly funded by the Andrew W. Mellon Foundation’s Global Plants Initiative (
The definition of quality criteria can serve to manage the expectations of digitisation processes, guide the acquisition of equipment, the selection of processing software and the selection of storage and publishing infrastructures. This section describes recommended quality criteria of research quality images. The criteria are derived from practical experience of the MBG while implementing and improving their digitisation practices, in line with established standards and recommendations.
The Global Plants Initiative guideline for herbarium specimen digitisation (
Table
Image Use | Resolution | Bit Depth | Grey Scale Factors | Colour Accuracy** |
Web Publishing | 72 PPI | 24-bit colour | | ΔE < 5 |
Printing | 300 PPI | 24-bit colour | | ΔE < 5 |
OCR Labels | 400 PPI | 8-bit grey scale | Min: 28 steps; Min: 5.5 f-stops; Y channel noise <= 5% | |
Identify Specimen Features | 400 PPI | 24-bit colour | | ΔE < 5 |
Research on Specimen | 600 PPI* | 24-bit colour | | ΔE < 5 |
Preservation | 600 PPI* | 24-bit colour | | ΔE < 5 |
* Minimum resolution recommended; if digitisation devices available allow for higher resolution, that resolution should be used. ** ΔE (Delta E, dE) is a metric for understanding how the human eye perceives colour difference. The term delta comes from mathematics, meaning change in a variable or function. The suffix E references the German word Empfindung, which broadly means sensation. |
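As an illustration of the ΔE < 5 criterion in the table above, the following minimal sketch computes a colour difference between a measured colour chart patch and its reference value. It assumes the CIE76 formula (Euclidean distance in CIELAB space); the source does not state which ΔE formula is used, and the function names and the sample Lab values are hypothetical.

```python
import math


def delta_e_cie76(lab1, lab2):
    """Colour difference as the Euclidean distance between two CIELAB colours.

    CIE76 is assumed here; later formulas (CIE94, CIEDE2000) give different values.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))


def colour_accuracy_ok(measured_lab, reference_lab, threshold=5.0):
    """Return True when the patch meets the 'ΔE < 5' criterion from the table."""
    return delta_e_cie76(measured_lab, reference_lab) < threshold


# Hypothetical (L*, a*, b*) values for one patch of a colour chart.
reference_patch = (50.0, 0.0, 0.0)
measured_patch = (52.0, 1.0, -2.0)

print(delta_e_cie76(measured_patch, reference_patch))       # 3.0
print(colour_accuracy_ok(measured_patch, reference_patch))  # True
```

In practice the measured values would come from sampling the colour chart in the image after converting from the image's colour space to CIELAB, a step this sketch leaves out.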
MBG determined that a set of three images needs to be produced for every specimen: a high definition uncompressed archive quality master image (TIFF 450 PPI) and two derivatives, namely a high definition lossless image for derivation of other images (JPEG2000 420 PPI) and a lossy image for web publishing/online inspection (JPEG 420 PPI). The master image is intended for long-term preservation, the lossless derivative provides a working image which is easy to store, transfer and process and the lossy derivative is intended for online publishing. Any other image derivatives required can be produced as needed.
The DPI value is largely meaningless, because it relates to an arbitrary print size of the object. The Global Plants Initiative (GPI) project (
In addition to the image quality requirements described above, herbarium sheet images must include a set of image elements. Image elements refer to visual elements which appear next to the herbarium sheet specimen and which are intended to help in the identification, processing and quality control. There are five elements recommended by the Global Plants Initiative (
Examples of herbarium sheets and the required elements to capture. The left image corresponds to a specimen digitised during the GPI project and the one on the right is a specimen digitised during DOE!. The elements are (1) Colour Chart, (2) Scale Bar, (3) Barcode, (4) Labels and (5) Institution Name. As the images show, some elements may be combined, for instance the scale bar and institution name on the left and colour chart and scale on the right.
The colour chart is recommended for helping with quality control and post-processing; this can help in verifying the lighting, white balance and colour accuracy of the image. The Federal Agencies Guideline for Digitisation, the Library of Congress and Synthesys 3, recommend the use of the colour chart, referenced as colour target or colour checker (
A scale bar is recommended to enable the calculation of the dimensions of the specimen (
Herbarium name (with or without logo) is required to quickly identify the institution holding the specimen (
Labels are commonly placed next to the specimen attached to the herbarium sheet. Clear capture of labels is important for further processing and documentation of the specimens (
Barcodes are identifiers used for cataloguing specimens which are also useful for linking them to digital specimens. Synthesys 3 and GPI recommend the use of barcodes as internal identifiers which are important for further documentation and linking of the physical and digital specimens (
The MBG workflow is designed to handle in-house and outsourced imaging. The in-house mode of the workflow is run entirely by MBG, while the outsourced mode is designed to seamlessly integrate the outputs from a digitisation line run by an external contractor, so it is managed jointly by the contractors and MBG. When operating in internal mode, the MBG digitisation workflow can produce images at a rate of 5000 image sets*
The workflow consists of eleven tasks performed during digitisation. Table
Task | Sub-tasks | Quality Concerns |
1 Pre-digitisation curation | | Specimens are selected and prioritised for digitisation by collection curators. Some sheets may be damaged or fragile or specimens may need to be remounted to display relevant features. |
2 Imaging | | Equipment should be calibrated to minimise image postprocessing after digitisation. Identification, digitisation and [meta]data capture, so that images are correctly linked to the corresponding specimen records. |
3 Image processing | | Verification of master image resolution format. Verification that derivatives adhere to quality standards. |
4 Imaging (alternate) | | Same as those for 2 and 3 above. |
5 Image processing (alternate) | | The task is simpler. However, the load increases considerably, from 5,000 to 25,000 weekly specimen image sets to process (400% increase). |
6 Store images | | Verify that master and derivative files are not corrupted in transfer to storage. |
7 Archive images | | Verify master is not corrupted in transfer and images are recoverable. |
8 Data transcription | | Verify readability of image data for transcription. Verification against reference image and recorded data before publishing. |
9 Data transcription (alternate) | | Verify readability of image data for transcription. |
10 Data transcription validation | | Verification against reference image and recorded data before publishing. |
11 Publish digital specimen | | Data, metadata, persistent identifiers and links are used to build stable long-lasting specimens which adhere to FAIR data principles. |
MBG digitisation workflow diagram. The circle shapes at the top and bottom indicate the start and end of the workflow. The rounded corner boxes represent workflow tasks (described in Table
The final output from this digitisation process is not a set of records, it is a set of digital specimens. The digital specimen concept is intended to define a representation (digital object) that brings together an array of heterogeneous data types, which are themselves alternative physical specimen representations. In this case, the digital specimen (DS) holds references to specimen data from a collection management system, images, 3D models, research articles, DNA sequences, collector information, amongst many other data types (
Pre-digitisation Curation (1), Store Images (6), Archive Images (7) and Publish Digital Specimen (11) are performed during both internal and outsourced digitisation, while Imaging (2), Image Processing (3), Imaging Alternate (4), Image Processing Alternate (5), Data Transcription (8), Data Transcription Alternate (9) and Data Transcription Validation (10) vary depending on the decision to perform imaging internally or to outsource it. The main difference is that, in outsourced mode, the contractor digitises the specimens and produces the image derivatives. For this reason, there are fewer sub-tasks to perform as part of Image Processing Alternate (5). The differences between these tasks are analysed further in the following section, as the details of the variation become clearer when describing the quality management activities.
Fig.
Quality management methods can be subdivided into two main areas: Quality Assurance (QA) and Quality Control (QC) (
Table
num | sub-task | type | dataset | start state | success state | fail state |
1 | Check file name (Table 5) | AT, QA | TIFF set | | names_ok | names_error |
2 | Check tiff file size, image dimensions and resolution (Table 6) | AT, QA | TIFF set | names_ok | fssr_ok | fssr_error |
3 | Generate JPEG 2000 derivatives | AT, IH | TIFF set | fssr_ok | jp2_gen | jp2_gen_err |
 | | | JP2 set | | jp2_gen | jp2_gen_err |
4 | Generate jpeg derivatives | AT, IH | TIFF set | jp2_gen | jpg_gen | jpg_gen_err |
 | | | JPG set | | jpg_gen | jpg_gen_err |
5 | Check metadata file structure (Table 7) | AT, QC | TIFF set | jpg_gen | md5_ok | md5_error |
6 | Check duplicates (Table 8) | AT, QA | TIFF set | md5_ok | unique | duplicate |
7 | Check structure and file size (Table 9) | AT, QA | TIFF set | unique | fss_ok | fss_error |
 | | | JP2 set | jp2_gen | fss_ok | fss_error |
8 | Visual qc tiff files (Table 10) | MT, QC | TIFF set | fss_ok | vqc_ok | vqc_error |
9 | Check filename (Table 5) | AT, QA, IH | JPG set | jpg_gen | jpgn_ok | jpgn_error |
Sub-task Type: AT automated task, MT manual task, QA quality assurance task, QC quality control task, IH sub-task performed in-house only. |
num | sub-task | type | dataset | start state | success state | fail state |
1 | Remove duplicates and bad crops (Table 11) | MT, QA | TIFF set | vqc_ok | dup_rmv | |
 | | | JP2 set | fss_ok | dup_rmv | |
 | | | JPG set | jpgn_ok | dup_rmv | |
2 | Copy files to archive | AT | JP2 set | dup_rmv | stg_ok | stg_error |
 | | | JPG set | dup_rmv | stg_ok | stg_error |
3 | Generate image viewers | AT | JP2 set | stg_ok | vwrg_ok | |
4 | Copy files to ftp server | AT | TIFF set | stg_ok | svrc_ok | svrc_error |
5 | Copy files to external archive | AT | TIFF set | svrc_ok | arc_ok | arc_error |
6 | Check jp2 and jpg sets stored (Table 12) | AT, QA | JP2 set | vwrg_ok | stgv_ok | stgv_err |
 | | | JPG set | stg_ok | stgv_ok | stgv_err |
7 | Clear buffer server (Table 13) | AT, QA | TIFF set | arc_ok | bufc_ok | bufc_err |
8 | Clear buffer server | AT | JP2 set | stgv_ok | bufc_ok | bufc_err |
 | | | JPG set | stgv_ok | bufc_ok | bufc_err |
Sub-task Type: AT automated task, MT manual task, QA quality assurance task. |
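The state columns in the two tables above can be read as a small state machine: each sub-task only runs on a dataset that is in the expected start state and leaves it in either the success or the fail state. The sketch below encodes the TIFF set transitions of the image processing table and records each step in a list acting as the audit trail. The state names are taken from the table; the dictionary keys, the function name and the list-based trail are illustrative assumptions, not the MBG implementation.

```python
# TIFF set transitions from the image processing table:
# sub-task -> (required start state, state on success, state on failure).
# The keys are hypothetical labels; None means the table gives no start state.
TIFF_TRANSITIONS = {
    "check_file_name": (None, "names_ok", "names_error"),
    "check_size_resolution": ("names_ok", "fssr_ok", "fssr_error"),
    "generate_jp2": ("fssr_ok", "jp2_gen", "jp2_gen_err"),
    "generate_jpg": ("jp2_gen", "jpg_gen", "jpg_gen_err"),
    "check_md5": ("jpg_gen", "md5_ok", "md5_error"),
    "check_duplicates": ("md5_ok", "unique", "duplicate"),
    "check_structure": ("unique", "fss_ok", "fss_error"),
    "visual_qc": ("fss_ok", "vqc_ok", "vqc_error"),
}


def apply_subtask(current_state, subtask, succeeded, trail):
    """Advance the dataset state and append the step to the audit trail."""
    start, on_success, on_fail = TIFF_TRANSITIONS[subtask]
    if current_state != start:
        raise ValueError(
            f"{subtask} expects state {start!r}, but dataset is in {current_state!r}"
        )
    new_state = on_success if succeeded else on_fail
    trail.append((subtask, current_state, new_state))
    return new_state


# Run a batch through every sub-task, assuming each one succeeds.
trail = []
state = None
for name in TIFF_TRANSITIONS:
    state = apply_subtask(state, name, succeeded=True, trail=trail)

print(state)       # vqc_ok
print(len(trail))  # 8
```

Because each sub-task refuses to run from an unexpected state, a skipped or failed check cannot be silently bypassed, which is the property the audit trail is meant to provide.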
The following subsections will elaborate on the description and technical details of the sub-tasks that are directly related to quality management. The sub-tasks are presented in order of occurrence in the workflow. Additionally, each subsection includes a table with the technical details of each sub-task describing the agent that performs the sub-task, a brief description of the sub-task, the dependencies of the sub-task (required software, hardware) and the target entity of the sub-task (the specific datasets to be affected). This organisation is specifically designed to allow, in future, mapping the sub-tasks of the audit trail with a standard provenance model (such as PROV-O
In the MBG collection, herbarium sheets specimens are identified by a barcode label. These labels conform either to the UPC-A or Code 128 format. In line with current practice in herbarium management (
Agent | Check-barcode (script). |
Function | Verify that image file names structure is formed using the corresponding barcode. |
Dependencies | ZBAR open source library for reading barcodes from image files (http://zbar.sourceforge.net/). |
Target(s) |
|
Criteria | Each file name must conform to the format:
|
Success | Filenames are correctly formed (names_ok). |
Fail | Filenames are incorrect (names_err). |
Example | Valid file names for the images in the three sets corresponding to specimens with barcode from the example shown on Fig.
|
Exceptions | Herbarium sheets can contain more than one specimen and more than one barcode. These sheets may be flagged as incorrect and require manual processing. Additionally, herbarium sheets can have legacy barcodes from previous cataloguing efforts and, consequently, may have more than one barcode even when having only one specimen. If this is the case, the legacy barcode is removed, the image is deleted and the specimen is sent back for re-imaging. |
* Owing to the need to image collections of other herbaria and various subcollections, other filename formats have had to be accommodated. |
At MBG, curators follow the recommendation of placing barcodes as close to the bottom right corner of the sheet as possible. If the location of the barcode is known beforehand, ZBAR can be configured to read just that part of the image, speeding up processing time considerably. However, this would only work in a collection where there cannot be two specimens on the same sheet.
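The file name check can be sketched as a comparison between the barcode values decoded from an image and the stem of the image file name. The sketch assumes that decoding has already been done upstream (for example with ZBar) and that a valid file name is simply the barcode followed by the extension. That naming rule is an assumption, since the exact format is not reproduced here, and the barcode value shown is hypothetical.

```python
from pathlib import Path


def check_file_name(image_path, decoded_barcodes):
    """Return 'names_ok' or 'names_error' for one image.

    decoded_barcodes is the list of barcode strings read from the image by an
    upstream barcode reader. Sheets with no barcode or with several barcodes
    are reported as errors so that they can be processed manually.
    """
    if len(decoded_barcodes) != 1:
        return "names_error"
    stem = Path(image_path).stem  # file name without directory or extension
    return "names_ok" if stem == decoded_barcodes[0] else "names_error"


# Hypothetical barcode value and file names.
print(check_file_name("batch01/XX0000000000001.tif", ["XX0000000000001"]))  # names_ok
print(check_file_name("batch01/XX0000000000002.tif", ["XX0000000000001"]))  # names_error
print(check_file_name("batch01/XX0000000000001.tif",
                      ["XX0000000000001", "XX0000000000009"]))              # names_error
```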
Most herbarium sheets have a standard size. Consequently, herbarium sheet imaging produces images whose size falls within a predictable range. Based on this observation, MBG established a heuristic rule correlating file size, cropping and image resolution. This has helped in establishing the accepted file size limits for the different types of image files handled and generated during image processing. The technical details of this sub-task are summarised in Table
Agent | Check-tif-resol-and-size (script). |
Function | Utilise image file size to detect resolution and cropping. |
Dependencies | JHOVE: a file format identification, validation and characterisation tool ( |
Target(s) | Master images of TIFF set (Image Processing sub-task 2). |
Criteria | Each file size must be above 88 MB (average minimum file size, which is a consistent indicator of image dimensions). Additionally, the smallest and largest files of each batch are verified manually. |
Success | Correct file size indicates that cropping and resolution are within the acceptable range (fssr_ok). |
Fail | Incorrect file size may indicate cropping or resolution issues (fssr_err). The images need to be flagged for manual verification. |
Exceptions | Some specimens can be preserved in non-standard size sheets, like the one shown on Fig. |
This process also includes verifying the width and height of the image, which are dependable indicators for detecting bad crops and malformed images. The check file size and resolution sub-task relies on a script that uses statistical data from past digitisation campaigns. The script collects the size of each image file and verifies that it falls within the expected range. Images outside this range are flagged as examples of bad cropping. Large files tend to be under-cropped and small files over-cropped. This automated size check can be relied on to test cropping on a large image set, but it is less sensitive than a visual cropping check. A visual cropping check can identify small cropping problems on images, but it is impractical for large datasets. Similarly, the resolution of the image in pixels is related to the file size. In a previous version, the script tested the resolution of the image in dots per inch (DPI), because it was part of the contract terms with the external contractors. However, testing the dimensions of the image in pixels is a simpler and more reliable test of resolution.
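A minimal sketch of this heuristic is shown below. The 88 MB lower limit comes from the criteria table; treating it as decimal megabytes, the pixel ranges and the function names are assumptions for illustration only. File sizes and pixel dimensions are passed in as plain numbers, so reading them from the TIFF files (for example with JHOVE) is outside the sketch.

```python
# Lower limit from the criteria table; decimal megabytes are an assumption.
MIN_TIFF_BYTES = 88 * 1000 * 1000


def check_size_and_dimensions(size_bytes, width_px, height_px,
                              width_range, height_range, max_bytes=None):
    """Return 'fssr_ok' or 'fssr_error' for one master image."""
    if size_bytes < MIN_TIFF_BYTES:
        return "fssr_error"  # small files tend to be over-cropped
    if max_bytes is not None and size_bytes > max_bytes:
        return "fssr_error"  # large files tend to be under-cropped
    if not width_range[0] <= width_px <= width_range[1]:
        return "fssr_error"
    if not height_range[0] <= height_px <= height_range[1]:
        return "fssr_error"
    return "fssr_ok"


def extremes_for_manual_check(sizes_by_name):
    """Pick the smallest and largest files of a batch for manual verification."""
    smallest = min(sizes_by_name, key=sizes_by_name.get)
    largest = max(sizes_by_name, key=sizes_by_name.get)
    return smallest, largest


# Hypothetical pixel ranges and batch; real limits would be derived from the
# statistics of past digitisation campaigns.
WIDTH_RANGE = (4500, 5500)
HEIGHT_RANGE = (7000, 8500)
batch = {"a.tif": 95_000_000, "b.tif": 120_000_000, "c.tif": 60_000_000}

print(check_size_and_dimensions(95_000_000, 5000, 7500, WIDTH_RANGE, HEIGHT_RANGE))  # fssr_ok
print(check_size_and_dimensions(60_000_000, 5000, 7500, WIDTH_RANGE, HEIGHT_RANGE))  # fssr_error
print(extremes_for_manual_check(batch))  # ('c.tif', 'b.tif')
```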
The MD5 algorithm is a hash function used to create a checksum for a file. It can be used to ensure that an image has not been corrupted. The hash value is computed soon after the image is made and needs to be stored for later use. The check TIFF Metadata File Structure sub-task relies on a script that computes the MD5 values from the image files and verifies that they match the values generated for the original files shortly after digitisation (md5deep and hashdeep software packages, Table
Agent | Check-md5-meta (script). |
Function | Utilise md5 checksum to verify the integrity of images after transmission, storage and recovery operations. |
Dependencies | md5deep and hashdeep software packages to verify the match between stored and computed md5 hash values (http://md5deep.sourceforge.net/). |
Target(s) | Master images of TIFF set, only a subset is verified (Image Processing sub-task 5). |
Criteria | Calculated md5 hashset values must coincide with stored hashset values. |
Success | The image file has not changed, the copy is consistent with the original (md5_ok). |
Fail | The image file has been corrupted since its creation, original archive file is required to restore it (md5_err). |
Exceptions | If errors are detected in a sample, the process can be reverted to verify the full batch. |
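The integrity check described above can be illustrated with Python's standard `hashlib` module in place of the md5deep and hashdeep packages named in the table. This is a sketch of the technique, not the MBG script; the function names and the temporary file standing in for a TIFF master are assumptions.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5(path, stored_hash):
    """Return 'md5_ok' if the file still matches the hash stored at creation."""
    return "md5_ok" if md5_of_file(path) == stored_hash.lower() else "md5_error"


# Demonstration with a temporary file standing in for a master image.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tif")
    with open(path, "wb") as handle:
        handle.write(b"placeholder bytes, not a real image")
    stored = md5_of_file(path)              # hash recorded shortly after imaging
    state_before = check_md5(path, stored)  # unchanged copy
    with open(path, "ab") as handle:
        handle.write(b"\x00")               # simulate corruption in transfer
    state_after = check_md5(path, stored)

print(state_before, state_after)  # md5_ok md5_error
```

Reading in chunks keeps memory use constant, which matters for master images of around 100 MB each.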
The large numbers of specimens being processed in mass digitisation operations increases the risk of duplicate imaging. Additionally, as the description of the Check File Name sub-task explains, when a herbarium sheet contains more than one specimen, the image file is duplicated manually as many times as needed, to produce at least one image for each barcoded specimen on it.
MBG created a script that searches and flags image duplicates. The script uses barcode verification and md5deep (Table
Agent | check-dups (script). |
Function | Verify barcodes in a new batch against the ones already in the archive database. |
Dependencies | None. |
Target(s) | Master images of TIFF set (Image Processing sub-task 6). |
Criteria | Checking eventual duplicates is done by a script which verifies that the barcodes in the batch have not been already used by looking up in the archive database. |
Success | The set does not contain duplicate images (unique). |
Fail | The set contains duplicate images which need to be further analysed to determine if they are valid duplicates or need to be flagged for removal from the set (duplicate). |
Exceptions | Some types of duplicates are allowed, but require the intervention of a human operator. |
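The barcode lookup in the criteria above amounts to a set membership test. In the sketch below, the archive database is represented by a plain collection of barcodes, and repeats within the same batch are also flagged; both choices, as well as the function name and barcode values, are assumptions made for illustration.

```python
def check_duplicates(batch_barcodes, archived_barcodes):
    """Return ('unique' | 'duplicate', list of flagged barcodes) for a batch."""
    archived = set(archived_barcodes)
    seen_in_batch = set()
    flagged = []
    for barcode in batch_barcodes:
        # Flag barcodes already archived or already seen earlier in this batch.
        if barcode in archived or barcode in seen_in_batch:
            flagged.append(barcode)
        seen_in_batch.add(barcode)
    return ("duplicate" if flagged else "unique"), flagged


# Hypothetical barcode values.
archive = ["XX001", "XX002", "XX003"]
print(check_duplicates(["XX004", "XX005"], archive))           # ('unique', [])
print(check_duplicates(["XX002", "XX006", "XX006"], archive))  # ('duplicate', ['XX002', 'XX006'])
```

Flagged barcodes are only candidates: as the exceptions row notes, some duplicates are valid and need a human operator to decide.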
For long-term archiving, the specimen images need to conform to well-known standards, in order to improve the chances of recovery in the future. The Tagged Image File Format (TIFF) is the recommended standard for long-term storage (
Agent | check-jp2-and-size (script). |
Function | Verify that the images conform to the standards selected by MBG for long-term storage (TIFF) and high-definition production images (JP2). |
Dependencies | JHOVE for analysing and checking that the images are well-formed (consistent with the basic requirements of the format) and valid (http://jhove.openpreservation.org/). Jpylyzer verifies whether a JP2 (JPEG 2000) image really conforms to the format’s specifications (validation). It also reports the image’s technical characteristics (http://jpylyzer.openpreservation.org/). |
Target(s) | Master images of TIFF set (Image Processing sub-task 7). Production images on JPG and JP2 sets (Image Processing sub-task 7). |
Criteria | TIFF images must conform to the TIFF 6.0 Specification. JP2 images must conform to the JPEG 2000 image compression standard (ISO/IEC 15444-1). |
Success | The image files conform to the corresponding standard (fss_ok). |
Fail | The image files do not conform to the corresponding standard (fss_err). |
Exceptions | Legacy scans prior to the implementation of the audit trail procedures may not conform to the current standards selected. |
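Full format validation is the job of JHOVE and jpylyzer. As a much weaker illustration of what "conforming to a format" starts with, the sketch below only inspects the leading signature bytes of a file: the TIFF byte-order mark plus the number 42, and the 12-byte JP2 signature box. It does not replace either validator, it does not recognise BigTIFF, and the function name is hypothetical.

```python
# Classic TIFF headers: byte order mark ('II' or 'MM') followed by the number 42.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")
# JP2 files begin with a 12-byte signature box.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def sniff_format(header):
    """Guess the container format from the first bytes of a file.

    Returns 'tiff', 'jp2' or None. A matching signature does not prove that
    the rest of the file is well-formed or valid.
    """
    if header[:4] in TIFF_MAGIC:
        return "tiff"
    if header[:12] == JP2_SIGNATURE:
        return "jp2"
    return None


print(sniff_format(b"II*\x00" + b"\x08\x00\x00\x00"))  # tiff
print(sniff_format(JP2_SIGNATURE + b"\x00" * 8))       # jp2
print(sniff_format(b"\xff\xd8\xff\xe0"))               # None (a JPEG header)
```

A check like this is only useful as a cheap pre-filter before running the full validators on a large batch.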
There is still a need to visually verify a subset of the images for quality control. This is particularly important when the digitisation process has been outsourced as this is the means for verifying that the provider meets the service level agreements. Table
Agent | Quality Manager (person). | |
Function | Verify image quality by visually inspecting a sample of the images in the batch. | |
Dependencies | Calibrated high pixel density display (e.g. Retina 5K Apple); image editing programme (e.g. GIMP); validation checklist describing the steps of the inspection for selected images. | |
Target(s) | Master images of TIFF set, only a subset is verified (Image Processing sub-task 8). | |
Criteria | focus | Edges of the elements (specimen, labels, charts) are well defined, the text is readable. |
cropping | All elements of the specimen are visible in the image frame, i.e. no parts seem to extend beyond the edge of the image. | |
exposure | Verify white balance using the white box of the colour chart and verify its average value:
The digitisation team established the limits by considering that, outside the colour chart, no details are visible if the level is lower than 12 for black or higher than 250 for white; such areas are equivalent to 'holes' without data. The complementary values 225 and 18 are provided for reference. |
|
barcode | Verify that the name of the file is the same as the barcode on the sheet. | |
Success | Images meet visual quality criteria (vqc_ok). | |
Fail | Images do not meet visual quality criteria (vqc_err). In this case, the operator needs to verify another sample to determine if the whole batch should be rejected. | |
Exceptions | Reference values need to be verified depending on the colour chart used. Specimens have been photographed with two types of colour chart: Standard CIE D50 Illuminant D50, Macbeth ColorChecker and ISA Golden Thread target. |
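Although this sub-task is performed by a person, the exposure limits quoted in the criteria (levels below 12 or above 250 carry no detail) lend themselves to a simple pre-check. The sketch below counts the fraction of 8-bit levels falling outside those limits in a list of pixel values. The function name is hypothetical, and how pixel values are sampled from the image is left out.

```python
LOW_LIMIT = 12    # levels below this carry no visible detail (from the criteria)
HIGH_LIMIT = 250  # levels above this carry no visible detail (from the criteria)


def clipped_fraction(levels):
    """Fraction of 8-bit levels outside the [LOW_LIMIT, HIGH_LIMIT] interval."""
    levels = list(levels)
    if not levels:
        raise ValueError("no pixel levels supplied")
    clipped = sum(1 for value in levels if value < LOW_LIMIT or value > HIGH_LIMIT)
    return clipped / len(levels)


# Hypothetical sample of grey levels from one image region.
sample = [0, 12, 128, 250, 255]
print(clipped_fraction(sample))  # 0.4
```

What fraction of clipped pixels would justify rejecting an image is not stated in the source, so the sketch reports the fraction and leaves the decision to the operator.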
Removing potential duplicates and bad crops is a process that needs to be verified visually. When some images are flagged, the processing of a batch is not stopped, i.e. the batch is not rejected if errors are detected, unless the errors are found in more than 5% of the images. Instead, the erroneous images detected are removed from the batch before storing them locally or in the long-term repository. The details of this sub-task are shown in Table
Agent | Quality Manager (person). |
Function | Remove images flagged as bad crops or duplicates. |
Dependencies | Error log report with list of non-compliant images. |
Target(s) | Master images of TIFF set (Store Image sub-task 1). Production images on JPG and JP2 sets (Store Image sub-task 1). |
Criteria | If an image in one of the sets is flagged (TIFF, JP2 or JPG), that image is removed from the set and all corresponding images in the other sets are also removed. |
Success | Flagged images have been removed (dup_rmv). |
Fail | Flagged images have not been removed (dup_err). |
Exceptions | If flagged images are part of the production set, the corresponding image from the master set needs to be validated to determine if the error was generated when the derivatives were produced or it is an imaging error. |
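The removal rule can be sketched as follows: a batch is rejected only when more than 5% of its images are flagged; otherwise every flagged specimen is removed from all three sets together. Each set is represented here by the barcodes it contains. The `rejected` return value, the function name and the data layout are assumptions, not states or structures named in the source.

```python
def remove_flagged(sets_by_name, flagged, max_error_rate=0.05):
    """Apply the 5% rule, then drop flagged barcodes from every image set.

    sets_by_name maps a set name ('TIFF', 'JP2', 'JPG') to a set of barcodes.
    Returns (state, cleaned sets); 'rejected' is a hypothetical state name.
    """
    batch = set().union(*sets_by_name.values())
    flagged = set(flagged) & batch
    if len(flagged) / len(batch) > max_error_rate:
        return "rejected", sets_by_name
    cleaned = {name: members - flagged for name, members in sets_by_name.items()}
    return "dup_rmv", cleaned


# Hypothetical batch of 100 specimens present in all three sets.
barcodes = {f"XX{number:03d}" for number in range(100)}
image_sets = {"TIFF": set(barcodes), "JP2": set(barcodes), "JPG": set(barcodes)}

state, cleaned = remove_flagged(image_sets, ["XX001", "XX002", "XX003"])
print(state, len(cleaned["TIFF"]), len(cleaned["JPG"]))  # dup_rmv 97 97

state, _ = remove_flagged(image_sets, [f"XX{number:03d}" for number in range(6)])
print(state)  # rejected
```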
The production set consists of a set of high definition JP2 images and a set of lightweight JPG images. During processing and validation, these sets are temporarily stored on a buffer server. Once the production set has been completely processed and validated, the operator executes a task to verify that all compliant image derivatives have been copied to the image repository before the buffer is cleared in preparation for processing the next batch (Table
Agent | check-if-archived (script). |
Function | Verify that the production set images have been copied to the image repository and the back-up server. |
Dependencies | Logs containing the paths to the servers where the image sets are stored. Read access to server for verification of file paths. |
Target(s) | Production images on JPG and JP2 sets (Store Image sub-task 6). |
Criteria | File paths for the images in the production sets need to be valid and non-empty. |
Success | Image files are stored and backed up (stgv_ok). |
Fail | Error in the image files store/back up process (stgv_err). Verify if storing procedure was performed and terminated with no errors. |
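A minimal sketch of this verification, assuming that "valid and non-empty" means that each logged path points to an existing file with a size greater than zero. The function name, the log format (a plain list of paths) and the temporary files used in the demonstration are assumptions.

```python
import os
import tempfile


def check_if_archived(paths):
    """Return ('stgv_ok' | 'stgv_err', list of missing or empty paths)."""
    problems = [
        path for path in paths
        if not (os.path.isfile(path) and os.path.getsize(path) > 0)
    ]
    return ("stgv_ok" if not problems else "stgv_err"), problems


# Demonstration with temporary files standing in for the image repository.
with tempfile.TemporaryDirectory() as repository:
    stored = os.path.join(repository, "XX001.jp2")
    empty = os.path.join(repository, "XX002.jp2")
    missing = os.path.join(repository, "XX003.jp2")
    with open(stored, "wb") as handle:
        handle.write(b"placeholder bytes")
    open(empty, "wb").close()  # zero-byte file, e.g. an interrupted copy

    all_good = check_if_archived([stored])
    with_problems = check_if_archived([stored, empty, missing])

print(all_good[0])                                # stgv_ok
print(with_problems[0], len(with_problems[1]))    # stgv_err 2
```

Only when the check returns `stgv_ok` for both the repository and the back-up copy would it be safe to clear the buffer server.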
The master set contains the high definition uncompressed TIFF images which are intended for long-term storage preservation and future recovery purposes. During processing and validation, this set is temporarily stored on a buffer server. Once the master set has been completely processed and validated, the operator must verify that all compliant images have been copied to the external archive repository, before the buffer is cleared in preparation for processing the next batch (Table
Agent | del-dir-viaa (script). |
Function | Delete master set copy from the buffer server, once reception and archiving is confirmed. |
Dependencies | Confirmation from contractor of archiving of TIFF set. |
Target(s) | Master images of TIFF set (Store Image sub-task 7). |
Criteria | The acknowledgement code from the contractor indicates that the master set has been received and archived. |
Success | The image files are archived and the buffer has been cleared (bufc_ok). |
Fail | Archiving of image files is not confirmed (bufc_err). Verify if copy files to archive task was performed and terminated with no errors. |
Exceptions | Retry copy files to archive sub-task. |
Initially, the MBG digitisation workflow supported the digitisation of 100,000 specimens (2.5% of the collection). The successful implementation and continuous operation of the workflow during the period from 2015 to 2018 has allowed digitising a further 1,300,000 specimens, raising the total number of specimens digitised to 1,400,000 (35% of the collection). Of these, 1,200,000 specimens were digitised in a mass digitisation campaign conducted over a year (2016-2017) with the collaboration of an external contractor. The other 100,000 specimens have been digitised as part of the continuous digitisation operations which have become part of MBG curation processes. The addition of external partners included the tendering and selection of an experienced digitisation company (Picturae), which supported the mass digitisation campaign. Similarly, a working relationship with VIAA (Flemish Institute for Archiving) for the long-term archiving of master TIFF images was established and integrated as part of the continuous digitisation efforts. At the moment, MBG is in the midst of a second mass digitisation campaign, which will digitise an additional 1,400,000 specimens and which will get MBG closer to the target of a fully digitised herbarium, reaching 70% of the collection by 2020. The modular nature of the workflow has also allowed the testing of the inclusion of different providers. In this case, as part of the ICEDIG project, a pilot study analysed the requirements for using European Open Science Cloud infrastructures for long-term storage (
The images created in-house and those outsourced are of equivalent quality. Except for the metadata establishing which process was used for digitising, the quality criteria, including image dimensions, resolution and image elements, are the same. The images in Fig.
Example of results from in-house and outsourced digitisation. Image "a" corresponds to a specimen digitised in-house and image "b" corresponds to a specimen digitised by the contractor. Images "c" and "d" correspond to close-ups of the sections highlighted in "a" and "b", respectively, presented at 100% size (5x5 cm square).
Fig.
The digitisation workflow of the Royal Botanic Garden Edinburgh (RBGE) is an example of an in-house digitisation workflow. The main reasons for designing and developing this workflow in-house were the variation in funding and in the scale of digitisation campaigns (
The Natural History Museum, London (NHM) has also developed a workflow for the digitisation of microscope slides, which significantly improved their digitisation efficiency, as compared to previous semi-automated methods. They also found that their automated workflow reduced the number of errors (
Picturae is a digitisation provider with more than ten years of experience digitising libraries and museum collections (including cultural and natural history collections). The digitisation workflows, developed by Picturae (Fig.
The outsourced approach is effective in mass digitisation projects. There are similar examples of such projects involving other providers, such as the digitisation of the Moscow University Herbarium (
Image quality control is an essential aspect of the digitisation of biological collections. The imaging workflow combines physical movement of specimens with complex digital workflows, and delays or acceleration of any part of the process can have negative consequences. In a workflow with so many time-dependent steps, it is easy to overlook quality control or to cut corners on it. However, if care is not taken at this stage, the long-term usefulness of the images may suffer. Care needs to be taken at every stage of the process, from the photography, through the depositing of files on the image servers, to the long-term archiving. The workflow and quality management tasks described in this article illustrate the process of creating high-quality digital specimen collections which are close to the ideal of the digital herbarium. The workflow tasks are constantly revised and updated to improve their speed and accuracy.
The authors thank the Vlaamse Regering for the digitisation funding grant entitled ‘Digitale Ontsluiting Erfgoedcollecties’ and the ICEDIG project (Horizon 2020 Framework Programme of the European Union – Grant Agreement No. 777483).
Images obtained from MBG data portal: http://www.botanicalcollections.be/specimen/BR0000013305871 and http://www.botanicalcollections.be/#/en/details/BR0000010148600
As stated above, an image set consists of at least three images per specimen: Archive Master Image (TIFF), High-Definition Production Image (JP2) and Lightweight Publishing Image (JPG). If a specimen requires more than one image, for instance including booklets attached to herbarium sheets, the image set grows proportionally.
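The composition of an image set can be sketched as a short function. This is an illustration only: the three formats come from the text, while the function name, the file-naming pattern and the `captures` parameter are hypothetical and do not describe the naming scheme used by MBG.

```python
# Formats named in the text: archive master, production and publishing images.
IMAGE_FORMATS = ("tif", "jp2", "jpg")


def expected_image_set(barcode: str, captures: int = 1) -> list[str]:
    """List the files expected for one specimen.

    The file-naming pattern is a hypothetical convention for illustration.
    `captures` is the number of images the specimen needs, for example 2
    when a booklet attached to the herbarium sheet is imaged separately.
    """
    if captures < 1:
        raise ValueError("a specimen needs at least one capture")
    # One file per format for every capture, so the set grows proportionally.
    return [
        f"{barcode}_{index:02d}.{ext}"
        for index in range(1, captures + 1)
        for ext in IMAGE_FORMATS
    ]
```

For a specimen with a single capture, the function returns three file names; with two captures it returns six.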
Updated version of the workflow included in
The commercial price is provided for reference, using the publicly available storage pricing list for Amazon Web Services (Amazon S3) as advertised during December 2019 (https://aws.amazon.com/s3/pricing/). Prices are converted from United States Dollars (USD), using an exchange rate of 0.91. In addition to storage prices, Amazon also charges fees for data access requests, data transfer, management and replication, all of which add to the total cost of storing data. Other providers, such as Google and Microsoft, offer similar pricing schemes for storage (i.e. storage + transaction + management costs, which vary according to size and access type).
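The conversion described in this note can be sketched as follows. Only the exchange rate of 0.91 comes from the text. The function name and parameters are hypothetical, and the per-gigabyte price and any additional fees must be supplied by the caller from the provider's current price list; no actual prices are assumed here.

```python
USD_EXCHANGE_RATE = 0.91  # exchange rate stated in the text


def storage_cost_converted(
    size_gb: float,
    storage_usd_per_gb: float,
    other_fees_usd: float = 0.0,
) -> float:
    """Convert a storage cost quoted in USD using the stated exchange rate.

    storage_usd_per_gb and other_fees_usd (requests, transfer, management,
    replication) are caller-supplied values, not figures from the article.
    """
    total_usd = size_gb * storage_usd_per_gb + other_fees_usd
    return total_usd * USD_EXCHANGE_RATE
```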