iCollections methodology: workflow, results and lessons learned

Abstract The Natural History Museum, London (NHMUK) has embarked on an ambitious programme to digitise its collections. The first phase of this programme was to undertake a series of pilot projects to develop the workflows and infrastructure needed to support mass digitisation of very large scientific collections. This paper presents the results of one of these pilot projects, iCollections. This project digitised all the lepidopteran specimens usually considered as butterflies: 181,545 specimens representing 89 species from the British Isles and Ireland. The data digitised include species name, georeferenced location, collector and collection date: the what, where, who and when of specimen data. In addition, a digital image of each specimen was taken. A previous paper explained the way the data were obtained and the background to the collections that made up the project. The present paper describes the technical, logistical and economic aspects of managing the project.


Initiative are roughly analogous to the Metamorfoze Extra Light standard and should be followed in herbarium sheet digitisation.

Natural history objects and 3D artefacts
The document deals primarily with 3D natural history objects and artefacts from the NHM's collections. Unlike for cultural materials, no standards currently exist for imaging natural history objects. Standards should therefore be established on a per-project basis, following current recommendations and taking into account the intended use of the images, the necessary equipment and techniques, and the limiting factors of the project. For each project, the minimum necessary quality criteria must be established and followed, and the highest level of each quality criterion that does not compromise the objectives of the project should be selected.

Use of images
The use of images may include, but is not limited to, the following:

Quality levels
For each post-processing operation below, speed (S), perceptual quality (P) and research integrity (R) are rated: -: not acceptable; *: acceptable under certain circumstances; **: good; ***: excellent; n: irrelevant.

Master files
Kept safe and unchanged; for all manipulations use a copy of the data. S**/P***/R***

Lens distortion
Correction is acceptable on a copy of the original file. S*/P***/R**

Global adjustments of brightness and contrast, levels and gamma
Acceptable on a copy of the original file as long as no information is lost (the intensity histogram is not overstretched), S**/P***/R**; acceptable on the original file as long as the editing is non-destructive, S**/P**/R**.

Cropping
Acceptable as long as it does not affect software-estimated scales. S*/P***/R**

Local adjustments (dodging or burning)
Not acceptable. S*/P***/R-

Local edits (clone, eraser)
Not acceptable unless for a strong reason; must be documented. S*/P***/R-

Scaling (changing the size of the image in pixels)
Not acceptable. S**/P***/R-

Artificial colouring
Acceptable for greyscale images; must be documented. S*/P***/R*

Image quality criteria

General principles
In order to provide acceptable and repeatable image quality, two main principles must be followed: consistency and documentation.

Consistency
Image acquisition conditions, post-processing protocols, metadata schemas and data management processes must be as identical as possible through the lifetime of the project, unless there is a strong reason otherwise.

Documentation
Initial protocols of image acquisition, processing, metadata collection and data handling must be documented and kept with project data. Any changes to protocols must be recorded and original documentation amended to reflect this.

Composition
If possible, the whole sensor (frame) should be used to capture the object, and empty space should be minimised. However, for the sake of speed it may be practical to capture images at the same magnification (zoom) even when the objects vary slightly in size. In digitisation projects it is good practice to capture the specimen together with its collection label(s). When this is not possible, labels must be captured separately and linked to the object images using special workflows.

Focusing
It is recommended to use manual focusing; autofocus should be avoided, except when no other options are available (automatic slide scanning, for example). When the depth of focus of an instrument is inadequate, a series of images must be captured at different, slightly overlapping (~15%) focal planes and submitted for post-processing to produce an extended depth of field image (EDF, see below).
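The number of focal planes required can be estimated from the object's depth and the instrument's depth of field. A minimal sketch in Python; the function name and the example figures are illustrative, not project values:

```python
import math

def num_focal_planes(object_depth_um, depth_of_field_um, overlap=0.15):
    """Slices needed to cover an object when each slice's depth of field
    overlaps its neighbour by the given fraction (~15% here)."""
    step = depth_of_field_um * (1.0 - overlap)   # effective advance per slice
    return max(1, math.ceil(object_depth_um / step))

# Illustrative figures: a 5 mm deep specimen, 200 um depth of field per shot.
print(num_focal_planes(5000, 200))   # 30 slices
```

In practice the step size is set on the motorised stage or focusing rail, but the same arithmetic determines how many exposures the EDF series will need.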

Illumination
Preferably, daylight or light simulating the D50/D65 illuminants as closely as possible (colour temperature 5003 K, horizon light, or 6504 K, midday light in Western Europe) should be used for photography. Full-spectrum daylight fluorescent bulbs and high-CRI (colour rendering index) daylight LEDs are good choices.
Mixing different light sources must be avoided.

White balance
White balance must be set up manually, except when this function is not available and accurate colour rendering is not critical. White balance and colour calibration (if applicable) must be re-adjusted each time the illumination changes.

ISO
Native sensor sensitivity (ISO) (Canon: 100; Nikon: 160; Fuji and most mirrorless cameras: 200) must be used unless it compromises image capture (e.g. when imaging fast-moving objects or in very low light). Higher or lower ISO values are the result of in-camera processing of the original data (the opto-electronic conversion function) and may lead to a lower signal-to-noise ratio. Capture at higher sensitivity, or the use of other methods to increase gain (e.g. binning), may introduce digital artefacts.

Exposure
Exposure must be estimated correctly; overexposed or underexposed images are not acceptable, particularly if not saved in RAW format, since clipping leads to irretrievable information loss. The histogram is the most convenient and accurate tool for evaluating exposure.
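A histogram-based clipping check can flag badly exposed frames automatically. A minimal sketch, assuming 8-bit pixel values and an illustrative 1% clipping tolerance (both the function names and the threshold are assumptions for demonstration):

```python
def clipping_fractions(pixels, lo=0, hi=255):
    """Fractions of pixel values clipped to the bottom and top of the range."""
    n = len(pixels)
    under = sum(1 for p in pixels if p <= lo) / n
    over = sum(1 for p in pixels if p >= hi) / n
    return under, over

def exposure_ok(pixels, tolerance=0.01):
    """Flag an image whose histogram piles up at either extreme."""
    under, over = clipping_fractions(pixels)
    return under <= tolerance and over <= tolerance

print(exposure_ok([10, 120, 200, 240]))       # True
print(exposure_ok([255] * 50 + [120] * 50))   # False (half the pixels clipped)
```

A real implementation would read pixel arrays from the image file, but the decision rule is the same: a histogram stacked against either end of the range signals irretrievable loss.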

Scales and targets
Scales and targets (such as a colour checker) should be visible on the image next to the object whenever possible. However, when the size of the object prevents this, the scale should be calculated by capturing a separate image of a physical scale at the same magnification. Microscopes, as a rule, are calibrated, and capture software can insert virtual scales automatically. See also notes on Extended Depth of Field.
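Calculating a scale from a separately photographed physical scale reduces to a single calibration factor, valid as long as the magnification is unchanged. A sketch with illustrative values:

```python
def mm_per_pixel(scale_length_mm, scale_length_px):
    """Calibration factor from a separately photographed physical scale."""
    return scale_length_mm / scale_length_px

def measure_mm(distance_px, calibration):
    """Convert a pixel distance on a specimen image into millimetres."""
    return distance_px * calibration

# Illustrative: a 10 mm scale bar spans 2,000 pixels at this magnification.
cal = mm_per_pixel(10.0, 2000)
print(f"{measure_mm(350, cal):.2f} mm")   # 1.75 mm
```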

Colour space
Colour space is a particular organisation of colours that allows translation between different devices (a reproducible representation of colours across devices). A wider colour space does not necessarily mean a better-quality image.

sRGB
Used on most displays, cameras, scanners and printers, as well as on the Internet. The sRGB colour space is well specified and is designed to match typical home and office viewing conditions, rather than the darker environment typically used for commercial colour matching. If the colour space of an image is unknown, sRGB is the safe choice.

Adobe RGB (1998)
The Adobe RGB (1998) colour space was designed to encompass most of the colours achievable on CMYK printers and as such is the best choice for images that will be printed on photo inkjet printers. It encompasses roughly 50% of the visible colours, making it wider than sRGB.

ProPhoto RGB
The ProPhoto RGB colour space encompasses over 90% of possible surface colours in the CIE L*a*b* colour space, and 100% of likely occurring real-world surface colours. It is important to keep in mind, however, that ProPhoto RGB should not be used for 8-bit per channel images.

Bit depth
JPEG files are 8-bit per channel (24-bit counting the three colour channels). Raw data from a camera or scanner sensor can be 12- or 14-bit. To be used, raw images have to be converted into 8-bit JPEG or TIFF images, or into 16-bit (48-bit) TIFFs. While 16-bit images may contain more colour and dynamic-range information, the difference is, as a rule, imperceptible to the human eye, and an accurate conversion algorithm will not compromise the quality of an image used for visualising object features. If images are to be used for quantitative analysis, however, reducing bit depth is not recommended.
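The bit-depth reduction itself is a linear rescaling. A pure-Python sketch for clarity (real converters work on whole arrays and also apply gamma and colour transforms):

```python
def to_8bit(values, src_bits=16):
    """Linearly rescale 12-, 14- or 16-bit samples to the 8-bit range.
    Shows only the bit-depth reduction step, nothing else."""
    src_max = (1 << src_bits) - 1
    return [round(v * 255 / src_max) for v in values]

print(to_8bit([0, 32768, 65535]))      # [0, 128, 255]
print(to_8bit([0, 8192, 16383], 14))   # [0, 128, 255]
```

Note that many distinct 16-bit values map to one 8-bit value, which is exactly why the reduction is harmless for viewing but unwelcome for quantitative analysis.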

File formats
JPEG files from the camera must be saved at the highest available quality. 8-bit TIFF files can be compressed using the LZW or ZIP algorithms; this saves up to 60% of disk space and does not affect image quality, although it may increase processing times slightly. For 16-bit TIFF files, ZIP produces much smaller files than LZW. Layered TIFF files should be avoided: the files are very large, saving is slow, and most printers will have trouble handling them. The Adobe Photoshop PSD file format is more efficient for layers.
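The "ZIP" in TIFF compression is the Deflate algorithm, and its losslessness can be demonstrated with Python's standard zlib module. The synthetic gradient data below stands in for the flat areas of a real image, so the exact ratio is illustrative only:

```python
import zlib

# Synthetic raster: smooth gradients, like the flat areas of a real photograph.
row = bytes(range(256)) * 8      # one 2,048-byte row
raw = row * 512                  # ~1 MB of image-like data

packed = zlib.compress(raw, level=6)   # Deflate, the "ZIP" used in TIFF
assert zlib.decompress(packed) == raw  # lossless: every byte comes back
print(f"compressed to {len(packed) / len(raw):.1%} of original size")
```

Because the round trip is bit-exact, choosing ZIP (or LZW) compression is purely a trade of disk space against processing time, never image quality.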

Aperture
Aperture, NA (numerical aperture), f-stop, etc. are related concepts referring to the size of the opening of the image-forming lens. The wider it is, the more light reaches the sensor and the easier it is to record the image. However, a larger opening (high NA or low f-stop) results in a shallow depth of field and increased vignetting. High f-stop values increase the depth of field (see below) but may degrade resolution due to diffraction. As a good compromise it is recommended to keep the f-stop between 5.6 and 11; for best results, check the MTF (modulation transfer function) charts for the particular lens.
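The diffraction penalty of small apertures can be quantified with the standard Airy-disk approximation (blur diameter ~ 2.44 x wavelength x f-number); a quick sketch for green light:

```python
def airy_disk_um(f_number, wavelength_nm=550):
    """Approximate diffraction blur: Airy disk diameter = 2.44 * lambda * N,
    returned in micrometres, for green light (550 nm) by default."""
    return 2.44 * f_number * wavelength_nm / 1000.0

for f in (5.6, 8, 11, 16):
    print(f"f/{f}: blur spot ~{airy_disk_um(f):.1f} um")
```

At f/11 the blur spot (~15 um) already exceeds the 4-6 um pixel pitch of typical sensors, which is why stopping down much beyond the recommended range costs resolution.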

Depth of field
Depth of field, or focus range, is the distance between the closest and furthest details that appear acceptably sharp. Depth of field is roughly inversely proportional to the aperture and the size of the sensor (the larger the aperture and sensor, the shallower the depth of field). Magnification and focal length effectively change the numerical aperture, and are therefore also inversely proportional to the depth of field. The extended depth of field technique (focus stacking, z-stacking, EFI) can be used if the necessary depth of field cannot be achieved optically (see below).
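For close-up work these proportionalities can be made concrete with a standard approximation, DOF ~ 2*N*c*(m+1)/m^2, where N is the f-number, c the circle of confusion and m the magnification. A sketch with illustrative values (the approximation ignores pupil magnification):

```python
def depth_of_field_mm(f_number, coc_mm, magnification):
    """Close-up approximation: DOF ~ 2*N*c*(m+1)/m**2, with N the f-number,
    c the circle of confusion and m the magnification."""
    m = magnification
    return 2 * f_number * coc_mm * (m + 1) / m ** 2

# Illustrative: 1:1 macro at f/8 with a 0.03 mm circle of confusion.
print(f"{depth_of_field_mm(8, 0.03, 1.0):.2f} mm")   # 0.96 mm
```

Doubling the magnification roughly quarters the usable depth, which is why focus stacking becomes unavoidable at higher magnifications.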

Resolution
Resolution, the ability of an instrument to capture fine detail, must not be confused with pixel count (which is what is popularly called "resolution") or with sampling rate (printing/scanning resolution). Resolution is determined by the optical properties of the instrument's image-forming lens (mainly its numerical aperture) and the physical qualities of the sensor (size of the matrix, pixel count, presence and structure of the colour filter array [Bayer or other types], etc.). For a detail to be discernible it has to span ~2.3 pixels (see the Nyquist-Shannon sampling theorem for details). This means that a sensor 6,000 pixels wide would record only ~3,000 black-and-white line pairs even if a perfect optical system were used.
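The Nyquist bound above reduces to a one-line calculation:

```python
def max_line_pairs(pixels_across, pixels_per_line_pair=2.0):
    """Upper bound on resolvable black-and-white line pairs (Nyquist limit)."""
    return int(pixels_across / pixels_per_line_pair)

print(max_line_pairs(6000))        # 3000 with ideal 2-pixel sampling
print(max_line_pairs(6000, 2.3))   # 2608 with the practical ~2.3 px figure
```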

Pixel array
The pixel array is the physical number of individual sensing elements in the sensor matrix. The number of pixels has only a secondary effect on resolution: if the optical system cannot resolve fine details, they will not be recorded by the sensor. On the other hand, the sampling rate (at least 2 pixels per resolved detail) must be sufficient to take full advantage of the optical system.

Sampling rate (scanners)
When scanning (or printing) an object, the ability of the imaging instrument to resolve fine detail does not depend on the distance between the object and the instrument, and thus can be expressed in absolute figures (ppi or dpi, pixels or dots per inch). In photography and photomicrography the distance to the object is not fixed, so only the angular size of details tells us about the resolving power of the instrument; in these cases it is more intuitive to think of resolution as the minimum resolved distance. It is important to distinguish between the real resolution of the instrument (data recorded by the sensor) and interpreted resolution (the result of post-processing of raw data). Although absolute numbers for the latter may be higher, it may not reveal additional detail compared with the former.

Ethical guidelines for post-processing
It is recommended to follow ethical guidelines for post-processing when dealing with research images. In short, the following principles should be adhered to:
• Minimise manipulation
• Edit non-destructively
• Processing should be consistent and documented

Natural and artificial colour
Artificial colouring is generally acceptable only for images taken in greyscale (SEM, CT, laser scanning etc.). All cases of colouring must be documented.

Focus stacking
Focus stacking is necessary when the depth of an object exceeds the depth of field of the selected imaging method and instrument and all details of the object must be sharp. In such cases multiple images (up to several hundred) are taken at consecutive focal planes, from which a completely focused picture can be reconstructed in special software (see Appendix 1). For best results the depths of field of neighbouring images should overlap slightly (~10-15%). During reconstruction, images are aligned and rescaled if necessary; this can happen automatically. Helicon Focus and other photography-oriented software does not take perspective distortion into account (it does not rescale images to ensure that objects in the foreground and background are the same size). EDF images from photography-oriented programs (Helicon Focus, Zerene Stacker, CombineZ, Photoshop, etc.) are therefore more pleasing aesthetically but less accurate scientifically. Physical scales, if used, must be located at the bottom focal plane of the specimen. If precise measurements are to be obtained from stacked images, the EDF algorithms bundled with scientific acquisition software may be more reliable. A thorough comparison of the performance of the various focus stacking packages available has not yet been carried out.

Tiling

3D reconstruction
3D reconstruction methods may include photogrammetry, reconstruction from stereo pairs (or tilting), CT, confocal microscopy, thin-section reconstruction and surface scanning (or laser scanning).

Metadata quality criteria
The operator should record metadata at the time of acquisition, or as soon as possible afterwards (e.g. at the end of the day). Three categories of metadata are necessary:

Audubon Core
The Audubon Core Multimedia Resource Metadata Schema (simply "Audubon Core" or "AC") is a representation-free vocabulary for the description of biodiversity multimedia resources and collections. Its implementation in the NHM will depend on future development of KE EMu and MAMS.

Data management principles
Data management conventions are not set in stone, and can be changed from project to project if necessary. However, it is vital to agree on them beforehand and to maintain them through the life cycle of the project, unless a change would significantly improve productivity. Below are some examples of data management agreements from current DCP projects.

Naming conventions
iCollections
Files are named IMGXXXX_NAME (where XXXX is a sequence number assigned by the camera and NAME is the operator). A script then renames the files to BMNHE_XXXXXXX for specimen images and BMNHE_XXXXXXX_label for cut-out images of labels.
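A renaming step of this kind can be sketched as follows; the function name and the 7-digit zero padding are assumptions for illustration, not the project's actual script:

```python
from pathlib import Path

def barcode_name(original, barcode, label=False):
    """Illustrative rename: IMGXXXX_OPERATOR.jpg -> BMNHE_XXXXXXX[_label].jpg.
    The 7-digit zero padding is an assumption, not the project's spec."""
    suffix = Path(original).suffix
    tag = "_label" if label else ""
    return f"BMNHE_{barcode:07d}{tag}{suffix}"

print(barcode_name("IMG0042_JSMITH.jpg", 1234567))         # BMNHE_1234567.jpg
print(barcode_name("IMG0042_JSMITH.jpg", 1234567, True))   # BMNHE_1234567_label.jpg
```

Keeping the mapping from camera file to barcode in a script (rather than renaming by hand) is what makes the later automatic ingest reliable.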

Archiving
Images from digitisation projects should be stored until successfully ingested into KE EMu. Research images (including original RAW files) should be archived on UNIFIED storage. Files can be stored there temporarily (for the duration of the project, if immediate access from multiple locations is absolutely necessary) upon consultation with TS.

Ingest
If file naming and folder structure conventions are maintained consistently, automatic ingest to KE EMu and MAMS is possible. Images must be moved to KE EMu staging area, and the administrator will initiate ingest. Multimedia records for each image will be created and linked to the corresponding Catalogue record, which in turn is linked to the Taxon record and Location record. In case the Catalogue record does not exist, it will be created at this point.

Images in KE EMu and the Data Portal
Images ingested into KE EMu will be published on the Museum's Data Portal; typically they appear within one week. If publication is not desirable, for example because active research is being conducted on the data, an embargo period of up to 12 months can be granted.

QA/QC procedures
Quality assurance should be implemented by following the above standards consistently. It is necessary to select required quality levels for each criterion at the planning stage of the project and to adhere to them at all times, unless overwhelming evidence is obtained that they should be reviewed. All deviations must be documented immediately. Regular device-level calibration is important to ensure correct functioning of imaging equipment.
Quality control for larger projects should be tier-based:
• Immediate control is performed by an operator at the end of the day or stage (preferably on another operator's results, when multiple operators are available).
• Weekly random checks are performed by a supervisor or independent colleague. The percentage of checked images should be agreed in the project specifications.
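Once the percentage is agreed, the weekly random check can be drawn reproducibly; a sketch (the 5% figure and function names are illustrative):

```python
import random

def weekly_sample(image_ids, fraction, seed=None):
    """Draw the agreed percentage of a week's images for independent checking.
    A fixed seed makes the draw reproducible for audit purposes."""
    rng = random.Random(seed)
    k = max(1, round(len(image_ids) * fraction))
    return sorted(rng.sample(list(image_ids), k))

week = [f"BMNHE_{i:07d}" for i in range(1, 1001)]
print(len(weekly_sample(week, 0.05, seed=42)))   # 50
```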