Biodiversity Data Journal
Forum Paper
Corresponding author: Quentin Groom (quentin.groom@plantentuinmeise.be)
Academic editor: Daniel Mietchen
Received: 12 Jul 2023 | Accepted: 24 Oct 2023 | Published: 30 Nov 2023
© 2023 Quentin Groom, Mathias Dillen, Wouter Addink, Arturo H. Ariño, Christian Bölling, Pierre Bonnet, Lorenzo Cecchi, Elizabeth R. Ellwood, Rui Figueira, Pierre-Yves Gagnier, Olwen Grace, Anton Güntsch, Helen Hardy, Pieter Huybrechts, Roger Hyam, Alexis Joly, Vamsi Krishna Kommineni, Isabel Larridon, Laurence Livermore, Ricardo Jorge Lopes, Sofie Meeus, Jeremy Miller, Kenzo Milleville, Renato Panda, Marc Pignal, Jorrit Poelen, Blagoj Ristevski, Tim Robertson, Ana Rufino, Joaquim Santos, Maarten Schermer, Ben Scott, Katja Seltmann, Heliana Teixeira, Maarten Trekels, Jitendra Gaikwad
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Groom Q, Dillen M, Addink W, Ariño AHH, Bölling C, Bonnet P, Cecchi L, Ellwood ER, Figueira R, Gagnier P-Y, Grace OM, Güntsch A, Hardy H, Huybrechts P, Hyam R, Joly AAJ, Kommineni VK, Larridon I, Livermore L, Lopes RJ, Meeus S, Miller JA, Milleville K, Panda R, Pignal M, Poelen J, Ristevski B, Robertson T, Rufino AC, Santos J, Schermer M, Scott B, Seltmann KC, Teixeira H, Trekels M, Gaikwad J (2023) Envisaging a global infrastructure to exploit the potential of digitised collections. Biodiversity Data Journal 11: e109439. https://doi.org/10.3897/BDJ.11.e109439
Tens of millions of images from biological collections have become available online over the last two decades. In parallel, there has been a dramatic increase in the capabilities of image analysis technologies, especially those involving machine learning and computer vision. While image analysis has become mainstream in consumer applications, it is still used only on an artisanal basis in the biological collections community, largely because the image corpora are dispersed. Yet, there is massive untapped potential for novel applications and research if images of collection objects could be made accessible in a single corpus. In this paper, we make the case for infrastructure that could support image analysis of collection objects. We show that such infrastructure is entirely feasible and well worth investing in.
machine learning, functional traits, species identification, biodiversity, specimens, computer vision
Owing to their crucial role in documenting the Earth's biodiversity, global biological collections are likely to contain samples representing most known macro-biodiversity. These collections serve as invaluable assets for various research fields including ecology, conservation, natural history and epidemiology (
To keep up with demand for access to collections, digital imaging of biological collections has progressed at pace (Fig.
Progress in digitising natural history collections. A growing number of images are accessible from the Global Biodiversity Information Facility, iDigBio or BioCASe. To examine the rate and volume of digitisation, we used six snapshots of these databases taken since 2019, using Preston, a biodiversity dataset tracker (
With this increase in digital images, it is not surprising that computer vision techniques are now being applied to them. In recent years, machine learning, in particular, has become mainstream and has been built into workflows that start with digital images and metadata and result in statements about what is shown. Such workflows can extract information about biological specimens from typed or handwritten labels (
Improving online access is important because collections are physically dispersed, yet interconnected (
Unified access to specimen images is particularly important because image files are comparatively large and image analysis pipelines are demanding on processor time. Current internet bandwidth makes transferring large numbers of files a bottleneck, particularly if they need to be moved multiple times. Therefore, it makes sense to store large numbers of images close to where processing will occur. While such infrastructure exists for other data types (e.g. Copernicus for remote sensing and WLCG for the Large Hadron Collider), no such support exists for biological collections-based image processing. Researchers amass images and process them independently, which is unscalable and unsuitable for dynamic image corpora and workflows intended to run multiple times.
The Vision
We envisage a data space for biological collections with a centrally accessible image corpus with built-in processing. This would allow anyone to access digitised images of specimens without having to contend with the logistics of corpus creation and maintenance. Building accessible interfaces would also remove technological barriers that prevent taxonomists, ecologists and others from using advanced analysis tools. Through supervised expert contributions, the system could integrate knowledge from many disciplines. Such a corpus would constantly be furnished with new images from publishing collections and would support citation and reproducibility of workflows and their underlying collections, in alignment with FAIR Data Principles (
The Scope
Images of living organisms are not considered here, nor are other media, such as sounds, though they are undoubtedly useful and deserve attention. Though the AI challenges of images of living organisms are different, their numbers are at least two orders of magnitude larger and increasing more rapidly than those of digitised preserved specimens, and dedicated infrastructures, such as Pl@ntNet and iNaturalist, already exist to process them. The creators of such images are also more varied, as are the relevant licensing requirements. An exception might be images of living organisms in situ before they were preserved. Such images give additional context to the specimen and can potentially be used alongside the preserved specimen for human and computational comparison (
In this paper, we present the purposes for a unified infrastructure of specimen images and envisage what it might look like. We answer the questions: what research could be done with such an infrastructure, who would use it, what functionality would be needed and what are the architectural requirements?
We imagine a future where we can search across global collections for such things as the pattern of a butterfly’s wing, the shape of a leaf, the logo of a specific collection, or for examples of someone’s handwriting.
Infrastructure needs to justify its costs through benefits, not just for science, but wider society. We also need to understand the users and other beneficiaries. Below, we outline some uses and users for an imaging infrastructure for collections; there are undoubtedly more we have yet to imagine.
Species identification
Most experiments with species identification from specimen images have focused on herbarium specimens (
The state of preservation, uniformity and distinctiveness of pollen grains also make them good targets for automated identification, whether they come from preserved collections or fresh material. Indeed, pollen grains are well preserved as fossils and sub-fossils, making them useful targets for analysing evolutionary and ecological change (
The main advantage of automated identification of digital images of preserved specimens is not accuracy, but the potential for high throughput. Accessing large numbers of images in a suitable computational environment remains a critical factor in mainstreaming automatic specimen identification across collections.
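As a sketch of what high-throughput identification could look like on such an infrastructure, the fragment below classifies a batch of specimen images with a fine-tuned network. The checkpoint file name, image folder and number of output classes are illustrative assumptions, not part of any existing system.

```python
# A minimal batch-classification sketch. Assumes a ResNet-50 fine-tuned
# on specimen images has been saved as a state dict; all file and folder
# names are illustrative.
from pathlib import Path

import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(num_classes=1000)  # one output per candidate species
model.load_state_dict(torch.load("herbarium_classifier.pt", map_location="cpu"))
model.eval()

paths = sorted(Path("specimen_images").glob("*.jpg"))
batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])

with torch.no_grad():
    predicted = model(batch).argmax(dim=1)

for path, species_idx in zip(paths, predicted.tolist()):
    print(path.name, species_idx)
```

Because the model runs on whole batches, throughput scales with available GPU memory and nodes, which is precisely what a shared corpus with built-in processing would enable.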
Extracting trait data
Morphological, phenological and colourimetric traits are often visible on specimen images (e.g. Fig.
Zoological specimen labels contain rich data.
Functional traits
Morphological functional traits have been used to predict impacts of climate change on ecosystem functioning (
Leaf morphological traits are particularly amenable to extraction from herbarium sheets, because they are laid flat and do not necessarily require magnification (
In the case of fish, the large number of species globally, enormous number of morphological traits and substantial variation mean we can only hope to fill gaps in our knowledge of traits if preserved specimens are used (
Extracting traits from specimens using well-documented algorithms would be much more efficient if a single large corpus were available for analysis; measurements would also be less prone to error and more reproducible if source code and training data were open and shared (
Collection practices have changed considerably over more than four centuries (
Phenology
A trait of particular interest for climate change studies is phenology. Changes in seasonal temperatures and rainfall affect hatching or emergence of dormant animals and maturation of leaves, flowers and fruits. Such changes may lead to a mismatch in seasonality amongst organisms (
Species interactions
Organisms are in constant conflict with predators, parasites and pathogens. Specimens provide a record of this, revealing long-term changes related to environmental change, such as the introduction of non-native species (
Collections care, curation and management
Information is also needed for the curation, organisation, storage and management of collections. An example is the need to identify specimens treated with toxic substances, such as mercuric chloride, formerly used to prevent insect damage. Over time, mercuric chloride leaves stains on mounting paper.
One can imagine image analysis workflows that detect the type of mounting strategy and the preservation state of specimens. This would help curators triage specimens for remounting or other forms of curatorial care.
Visual features of the specimen
Image segmentation and object separation
Image segmentation is a fundamental image-processing task that facilitates higher-level tasks, such as object detection and recognition (
In an infrastructure built for image analysis, standard segmentation workflows could be run and optimised to avoid researchers repeating these steps, and users could choose whether to analyse the whole image, all segments or specific classes of segment.
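One way such a standard workflow could start is with an off-the-shelf instance segmentation model. The sketch below uses torchvision's Mask R-CNN pretrained on everyday objects; in practice it would be fine-tuned on specimen images for classes such as specimen, label, ruler and colour checker. The file name and score thresholds are illustrative.

```python
# Sketch of a generic segmentation step using a pretrained Mask R-CNN
# (torchvision >= 0.13). The COCO-pretrained weights are a stand-in for
# a model fine-tuned on specimen imagery.
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = convert_image_dtype(read_image("specimen.jpg"), torch.float)
with torch.no_grad():
    output = model([image])[0]

# Keep confident detections; each mask is a soft per-pixel score.
keep = output["scores"] > 0.8
masks = output["masks"][keep] > 0.5   # binary masks, one per segment
boxes = output["boxes"][keep]         # bounding boxes (x1, y1, x2, y2)
```

The resulting masks and boxes are exactly the "segments" a user could then choose to analyse instead of the whole image.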
Labels
Specimens are usually annotated with information on labels. In the case of plants, these labels are on the mounting paper; for insects, they are on the mounting pin; while for larger zoological and plant specimens, labels might be tied to the specimen or on, or in, specimen jars. Therefore, as images of specimens often contain text, it is useful to provide printed and handwritten text recognition as part of an image processing pipeline. If text can be recognised, these additional metadata can be used to enrich items of the collection and automatically perform cross-collection linking. Furthermore, recognised text can aid in the digitisation process and validation of metadata, reducing manual input and improving data quality (
Although state-of-the-art text recognition performs well on printed text, accurately recognising handwritten text is still a challenge. Older handwriting can be highly idiosyncratic in style, but even such cases can provide valuable information; for example, text written by the same author could be automatically clustered, based on visual similarity, and used to identify the collector and reduce manual validation.
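For printed labels, off-the-shelf optical character recognition already goes a long way. A minimal sketch using the open-source Tesseract engine via pytesseract is shown below; the cropped label file name is illustrative, and handwritten labels would instead require a dedicated handwritten text recognition model.

```python
# Minimal OCR sketch for a printed label crop. Requires the Tesseract
# binary to be installed; the file name is illustrative.
from PIL import Image
import pytesseract

label = Image.open("label_crop.png")
text = pytesseract.image_to_string(label, lang="eng")
print(text)
```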
Besides text, secondary data hidden in handwriting, ink colour, mounting paper, label shape and printed label decorations (Figs
Labels of specimens from Meise Botanic Garden contain secondary data features, such as handwriting, ink colour, label shape and label decorations.
Embossed crests and stamps on herbarium specimens. A Lion and crown signifying ownership by the Botanical Garden of Brussels, on specimen BR0000013433048 of the BR Herbarium (CC-BY-SA 4.0). B Stamp of the A.C. Moore Herbarium at the University of South Carolina on specimen USCH0030719 (image in public domain). C Stamp of the Watson Botanical Exchange Club on specimen E00809288 of the Royal Botanic Garden Edinburgh Herbarium (public domain). D Stamp of the A. C. Moore Herbarium at the University of South Carolina, USCH0030719 (public domain). E Stamp of the Botanical Exchange Club of the British Isles on specimen E00919066 of the Royal Botanic Garden Edinburgh Herbarium (public domain). F Stamp with handwriting is evidence of a loan from the BR Herbarium to the Herbarium Musei Parisiensis, P, on specimen BR0000017682725 of Meise Botanic Garden (CC-BY-SA 4.0). G Printed crest, P00605317, held by the Muséum National d'Histoire Naturelle (CC-BY 4.0). H A stamp on specimen LISC036829 held by the LISC Herbarium of the Instituto de Investigação Científica Tropical. I A crest used by the Muséum National d'Histoire Naturelle (MNHN - Paris), on specimen PC0702930 (CC-BY 4.0). J A stamped star with unknown meaning on the same specimen as (B). K A stamp belonging to the Herbarium I. Thériot, on specimen PC0702930 at the Herbarium of the Muséum National d'Histoire Naturelle (CC-BY 4.0). L A stamp belonging to the Universidad Estatal Amazónica, now housed in the Missouri Botanical Garden Herbarium under catalogue number 101178648 (CC-BY-SA 4.0).
Rulers and colour checkers
Other elements often seen on digitised specimen images are rulers, scale bars and colour checkers. These vary greatly, for example in size, and are often customised for particular imaging campaigns. Colour checkers are used to validate the colour fidelity of specimen images, while rulers provide a reference for the actual specimen size. Especially when digitising with a digital camera, it can be complex to calculate the actual dimensions of the specimen, as they depend on the lens and individual camera parameters. Therefore, detecting rulers and colour checkers on digital images can prove useful for estimating actual sizes and correcting colour balance. A generic object detection or instance segmentation model can be trained to detect these common objects. If all rulers in a collection are of a fixed size, the length of the detected ruler can be used to calculate a transformation from pixels to the ruler's unit of measurement (e.g. cm, mm). This can then be combined with specimen segmentation models to automatically extract dimensions and specimen traits (
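The pixel-to-millimetre transformation itself is simple arithmetic once a ruler of known length has been detected. A minimal sketch, assuming a detector has returned the ruler's bounding box and that rulers in the collection are 150 mm long (both illustrative assumptions):

```python
# Convert pixel measurements to millimetres using a detected ruler of
# known physical length. Coordinates and lengths are illustrative.

RULER_LENGTH_MM = 150.0              # known physical length of the ruler

ruler_box = (120, 40, 1620, 80)      # (x1, y1, x2, y2) from the detector
ruler_length_px = ruler_box[2] - ruler_box[0]

mm_per_pixel = RULER_LENGTH_MM / ruler_length_px

# Apply the scale to a measured trait, e.g. the width of a leaf segment.
leaf_width_px = 430
leaf_width_mm = leaf_width_px * mm_per_pixel
print(f"{leaf_width_mm:.1f} mm")     # 43.0 mm with these numbers
```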
Finding stamps and signatures
Specimens are often stamped, printed or embossed with crests that indicate provenance or ownership (Fig.
Many specimens are signed, either by their collector, determiner or both (Figs
Unsupervised learning
The stacked layers of deep neural networks can be regarded as a set of transformations that learn useful representations of the starting data. Using representations of specimen images learned by neural networks, rather than extracted metadata, would allow content-based interaction with and comparison between images. Such interaction is useful for tasks where a high-quality labelled dataset does not currently exist or where the characteristics of a specimen that are important to a task are not well-defined. For instance,
Some tasks require researchers to inspect and compare specimen images individually. The reduced dimensionality of deep representations in combination with scalable nearest-neighbour search (
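A minimal sketch of such scalable nearest-neighbour search is shown below, using the FAISS library on placeholder embeddings; in practice the vectors would be the deep representations of specimen images produced by a neural network.

```python
# Similarity search over image embeddings with FAISS. The random
# vectors stand in for network-derived embeddings of specimen images.
import numpy as np
import faiss

d = 512                                    # embedding dimension
embeddings = np.random.rand(100_000, d).astype("float32")

index = faiss.IndexFlatL2(d)               # exact L2 nearest-neighbour search
index.add(embeddings)

query = np.random.rand(1, d).astype("float32")
distances, neighbours = index.search(query, 10)
print(neighbours[0])                       # indices of the 10 most similar images
```

For corpora of tens of millions of images, the exact index could be swapped for one of FAISS's approximate, compressed indexes without changing the calling code.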
Recently, interest in learning useful representations from unlabelled data has surged (
Unlocking the potential for machine learning in natural history collections is contingent on technical infrastructure which is easy-to-use, interoperable with regional and global biodiversity data platforms and accessible to the global scientific community. Here, we present a conceptual framework conceived as a roadmap for building such infrastructure. Although the infrastructure could be implemented in different ways (e.g. distributed or centralised), we describe three core technical components, coordinated by the orchestration logic: (1) the repository to index data and metadata; (2) the storage of images, models and data; and (3) the processing of images to generate new data, annotations and models (Fig.
Component 1: The Repository
A dedicated repository is needed which will reference and index information, such as specimen metadata, image metadata and annotations, alongside machine-learning models with their performance metrics and outputs (Fig.
Image metadata in the repository will include a reference to the image object located in the storage layer (Component 2), along with annotated training image data. Different kinds of image annotations will be supported, including geometric-based regions of interest (ROI), taxonomic or ecological traits and textual representations of label data. For interoperability, data standards supporting machine readability of these annotations are required. As different standards exist for these annotations and not all are equally suitable for every model, the platform should ensure support for multiple standards, such as COCO (JSON), Pascal VOC (XML) and image masks (rasterised or vectorised images). Multiple annotations can be made on a single specimen record, making persistent record identifiers vital. Metadata indexed in the repository will facilitate findability of suitable annotations, for instance, to serve as training data. A feedback mechanism may be implemented to correct and/or update annotations.
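For illustration, a minimal COCO-style record for a single region of interest might look like the following; the identifiers, category and box coordinates are illustrative, not prescribed by the platform.

```python
# A minimal COCO-style annotation record sketched as a Python dict.
import json

coco = {
    "images": [
        {"id": 1, "file_name": "BR0000013433048.jpg",
         "width": 4000, "height": 6000}
    ],
    "categories": [
        {"id": 1, "name": "label"}
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         # Bounding box in the COCO convention: [x, y, width, height]
         "bbox": [310, 5200, 900, 400],
         "area": 360000, "iscrowd": 0}
    ],
}

print(json.dumps(coco, indent=2))
```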
Pre-trained machine-learning models will be stored in the repository and made available for reuse, along with accuracy metrics and model outputs, such as segmented features or species metadata. To ensure findability, models should be classified by use-case through the use of keywords, since they are often trained for very specific use-cases, but could later be reused in other contexts. As part of the metadata, suitability scores will facilitate comparison of models in terms of their efficacy, possibly through community feedback or by analytics that take standardised model performance metrics into account. These results should be linked to the original images used in the training of the model (on the platform) and also to the images that were analysed in the use case. Some of this might be achieve by implementing the International Image Interoperability Framework (IIIF); for example, a IIIF compliant server could provide the segments of images dynamically (
Persistent identifiers, such as Digital Object Identifiers (DOIs) or hash-based content identification (e.g. Software Heritage PIDs for code or simple SHA-256 hashes for images), will be assigned to digital objects produced during the use of the infrastructure, to make them citable. It will also be possible to assign persistent identifiers to versions, reflecting any subsequent updates to the digital objects. The repository will display citations of the persistent identifiers, including links to publications in which they are included, as well as any instances of their reuse in other projects within the repository. It is important not only to make the digital objects or outcomes openly available, but also to release them under appropriate licences (e.g. Creative Commons), as indicated by the FAIR for research software (FAIR4RS) working group and
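As a sketch of hash-based content identification, the fragment below derives a SHA-256 digest from an image file and formats it in the hash URI style used by tools such as Preston; the file name is illustrative. Because the digest changes whenever the bytes change, it doubles as a version-sensitive identifier.

```python
# Hash-based content identifier for an image file.
import hashlib

def content_id(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so arbitrarily large images fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return f"hash://sha256/{digest.hexdigest()}"

print(content_id("specimen.jpg"))
```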
Managed through the orchestration logic, the repository is connected to the storage system and the processing unit, while offering features, such as a content-based search engine, for browsing not only the traditional human-annotated metadata (e.g. date and place of observation, taxonomy and others), but also information extracted from the images themselves. Advanced features can be built into the system, such as the ability for users to upload an image and search the catalogue by similarity (e.g. similar handwritten signatures), or to query and filter collections of data using indexed metadata extracted from observations, whether annotated by humans or automatically. In general terms, such functionality can be summarised as the ability to aggregate, on each specimen media record, all the information extracted from it, manually or automatically, and to index that information so that it is available to query.
Some good examples of similar content-based systems exist in production today. Pl@ntNet, BeeMachine and iNaturalist provide species identification of living organisms from photographs. Results can be refined by providing the user’s location, limiting possible results to the most likely matches. A more general example is Google Image Search, where anyone can search images using either a keyword (e.g. dog) or an image as the search term. This function is also available on Google Photos, where a user can search their personal photos for specific people, different types of objects, places, ceremonies and so on. Although different, all these systems share similar logic: (1) they include models trained for specific tasks (e.g. object detection) that have been created offline using massive datasets in large GPU clusters (e.g. Model Zoo and COCO dataset); (2) when a new image is added to the collection (or possibly all, when new models are deployed), in addition to the submitted user tags, the images are processed with these models (inference/prediction pipeline) and tags are extracted; (3) the extracted information is saved, indexed and made available as searchable data. The envisioned system should provide similar functionality, with the added complexity of the myriad of different models and images illustrated by the use cases in the previous section.
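That shared three-step logic can be sketched schematically: registered models are run over each newly-ingested image and the extracted tags are indexed for search. Everything below (the model registry, tag index and example detector) is an illustrative toy, not a proposed design.

```python
# Toy schematic of the ingest-then-index pattern described above.
from typing import Callable

MODELS: dict[str, Callable[[bytes], list[str]]] = {}  # registered models
INDEX: dict[str, set[str]] = {}                       # tag -> image ids

def register(name: str):
    def wrap(fn: Callable[[bytes], list[str]]):
        MODELS[name] = fn
        return fn
    return wrap

@register("stamp-detector")
def detect_stamps(image_bytes: bytes) -> list[str]:
    return ["stamp"]          # placeholder for a real inference call

def ingest(image_id: str, image_bytes: bytes, user_tags: list[str]) -> None:
    tags = set(user_tags)
    for model in MODELS.values():
        tags.update(model(image_bytes))               # inference pipeline
    for tag in tags:
        INDEX.setdefault(tag, set()).add(image_id)    # searchable index

ingest("BR0000013433048", b"...", ["herbarium sheet"])
print(INDEX)
```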
Component 2: The Storage
The storage component (Fig.
Whether images are mirrored from their original source on to the platform or only downloaded temporarily on to the platform when needed, is a technical design question that should be answered during implementation. While this choice has no functional impact, it does have profound technical implications, as well as budgetary consequences. Locally mirroring all images referenced in the repository guarantees availability and predictable speed of access, but will also require extensive management to accurately reflect changes made to the source material and will take up an increasingly large storage volume. On the other hand, while downloading images on-the-fly greatly diminishes the required storage volume, it implies less control over availability and carries the risk of images becoming unavailable over time.
Scientists are already used to large communal storage infrastructures, such as Dryad and Zenodo. Zenodo was developed at the European Organisation for Nuclear Research (CERN) and supports open science by providing a platform for researchers to share and archive their data and other research outputs.
Storage of training images
Images to use in training are discovered through the repository component, which functions as a central index of images, metadata, models and results. Actual image files might be hosted on the platform or remotely, on the servers of associated parties. In the latter case, because of the technical requirements (i.e. high throughput, guaranteed availability, low latency), these images must be downloaded to the platform and made available locally for use in the training of models. Image selection is done in the repository and the orchestration logic functions as a broker between the repository and remote hosting facilities, taking care of downloading images. The storage component is responsible for the local storage of these files. This includes facilitating access control (i.e. keeping track of which images belong with which training jobs) and making images available to the processing component, where the actual training takes place. In the scenario where the local storage of training images is temporary, the images will be deleted once the training cycle of a model has been completed, while only the references in the repository to those images are retained with the resulting model. The handling of images while stored in the system, including their accessibility and deletion policies, is subordinate to the platform’s governance policies.
Storage of models
Once a model is deemed suitable for use, it may be published as such in the repository. The repository functions as a central index that allows researchers to find suitable models, while the actual code that makes up a model will be stored in the storage component. Once a model has been selected for use (see also next section), it is retrieved from storage and copied to the processing component. A similar scenario applies when a stored model is used as the basis from which to further train a new model or a new version of the same model (transfer learning). Since there are no specific performance requirements for storing a model, they will be stored in the archive section of the media storage component. Besides models that have been trained locally, the platform can also host and publish models that were trained elsewhere. From the point of view of storage, these models are treated as identical to ones trained locally. As with images, availability of and access to models stored on the platform is subject to governance policies.
Storage of images for analysis
Another function of the processing component is using ‘finished’ models for image analysis, resulting in the annotation of newly-uploaded images with new metadata (such as classifications or identified regions of interest). For this purpose, images will be uploaded by researchers, after having selected a model or models from the repository to run on the images. Uploaded images will be stored in the storage component and kept there for the duration of the experiment. Responsibility for running these experiments, including the loading and execution of the selected models, lies with the processing component. Making the images available to the models is facilitated by the orchestration logic.
Once experiments have been completed, these images will be moved to a low-performance part of the media storage component (archive storage), where they are stored with the newly-acquired metadata, in line with relevant governance policies. These archived images and their annotations are registered in the repository component, so as to make them findable. If, at a later stage, someone wants to perform further analysis on them, these images can be moved back to the active storage area.
The technical requirements for analysis processes are far less demanding than those of training processes, especially with regard to the need for constant high throughput. It is, therefore, conceivable that the platform will allow access to stored models through an API, in which case no images are stored locally.
Storage of model results
User value is gained from access to results derived from the models on the platform. These results might be produced as described above or by use of a model remotely, either via API access or even by running a model entirely remotely. The results can take many forms; besides previously mentioned examples, such as classification or the identification of regions of interest, they can also include more generalised performance characteristics of a model, such as the average recall and precision for a given set of images in the case of a classification experiment. Uploading such results, in whatever format they might take, and associating them with the models that generated them is the responsibility of the repository component, while the physical storage of data is taken care of by the storage component. Negotiation between the two components, both when storing and when retrieving, is performed by the orchestration logic. Again, all handling of these results follows the platform’s governance policies.
Component 3: The Processing
The processing component encompasses all the services and pipelines to compute tasks on batches of data, incoming or already existing in the system, such as those stored in the repository and storage components (Fig.
This component requires a considerable amount of computing power to handle all the scheduled tasks in the system; this capacity may even be elastic (i.e. following cloud principles), given the fluctuating demand. These tasks are delegated by the orchestration logic component, a set of services responsible for handling external requests, such as those from front-end applications or other external services using public APIs, and serving as both gateway and manager for the main internal components – repository, storage and processing (Fig.
The processing component, and the tasks and services supporting it, should be able to scale vertically, that is, to handle more tasks by adding more RAM, more CPU cores or a better GPU to a cluster node, but preferably also horizontally, namely by adding more nodes, and hence be able to process multiple independent tasks in parallel.
The processing component can be organised into sub-components, amongst which are: (1) Data ingestion; (2) Machine-learning models and analytics services (such as image segmentation, object detection and image classification); (3) Analytics pipelines (processes or programming scripts built to provide analytical services); (4) Data integration; and (5) Data export, which help to deal with any given use case, such as depositing new images and metadata, annotating the images and depositing trained deep-learning models.
Data ingestion
Data ingestion is the process of adding new data to the system, encompassing tasks, such as crawling, parsing, validating and transforming information to be indexed. This process includes several data types, including metadata, images, annotations, analytics pipelines (which includes services and models) and so on. To this end, specific tools should handle incoming data to the infrastructure, following different paths depending on the data’s source and type.
When a new dataset is submitted, each entry undergoes a series of tasks to parse, validate and transform the information to facilitate a standardised entry. This may include crawling additional data from external services like Wikidata, computing metrics, validating geographic coordinates and mapping them to locations. Additionally, this process will check for duplicate entries, based on the existing data in the infrastructure.
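One such validation step, checking geographic coordinates before a record is indexed, might look like the following sketch; the Darwin Core field names are a plausible choice here, not a fixed requirement of the platform.

```python
# Sketch of an ingestion check: validate geographic coordinates on a
# submitted record before indexing. Field names follow Darwin Core.
def validate_coordinates(record: dict) -> list[str]:
    errors = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        errors.append("missing coordinates")
    else:
        if not -90 <= lat <= 90:
            errors.append(f"latitude out of range: {lat}")
        if not -180 <= lon <= 180:
            errors.append(f"longitude out of range: {lon}")
    return errors

print(validate_coordinates({"decimalLatitude": 50.93,
                            "decimalLongitude": 4.33}))   # []
```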
Image annotations
One of the key features of the system will be the ability to provide annotations for the existing images. When a set of annotations is supplied, these need to be ingested, validated and transformed into standard data types and structures, depending on the problem (e.g. classification, object detection, natural language processing and optical character recognition). After preprocessing, the annotations will be further validated to check whether they duplicate existing annotations, whether the attached labels make sense, whether the tagged region falls inside the image and so on. This information will then be indexed and provided by the repository component and can be included in datasets, which will serve to improve existing inference tools and develop new ones.
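For example, the check that a tagged region falls inside the image reduces to a simple bounds test; the sketch below assumes COCO-style [x, y, width, height] boxes, which is one of several conventions the platform could support.

```python
# Bounds check for an annotated region, assuming COCO-style boxes.
def region_inside_image(bbox, image_width, image_height) -> bool:
    x, y, w, h = bbox
    return (x >= 0 and y >= 0 and w > 0 and h > 0
            and x + w <= image_width and y + h <= image_height)

assert region_inside_image([310, 5200, 900, 400], 4000, 6000)
assert not region_inside_image([3500, 100, 900, 400], 4000, 6000)
```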
Machine-learning models and analytics services
The same applies to other tasks, such as submitting a new analysis pipeline. New pipelines include data and metadata; machine-learning models; source code; service containers; automated workflow and service provisioning information as code; results and others. Each of these must be verified and tested, before being included as part of the analytics toolset.
An analytics pipeline sub-component, comprising a set of services and functionalities, is responsible for processing images or other media, to automatically infer information that would otherwise be manually tagged, for example, identifying a specific trait. To this end, each service provides specific functionality and comprises a sequence of instructions, from using multiple pre-trained models, to image transformations or other solutions, depending on the problem at hand. For instance, when ingesting a dataset, for each given specimen image, various analytics pipelines will be scheduled to run, each made of different steps and deep-learning models trained for specific tasks (e.g. detecting mercuric chloride stains, identifying specific traits, extracting label information).
Build machine-learning models and services
Analytics pipelines are built of pre-trained models, as well as containerised applications and services previously built. The most computationally intensive part of the infrastructure will be training, building and updating these. It should be possible to schedule the execution of these heavy tasks, including data preparation (e.g. resize, augmentation), configuring the environment and parameters, training the models, assessing the performance and building, testing and packaging the services.
The system must allow the definition of service workflows as code, from the infrastructure, to model training and application packaging. This requires two parts. First, fully documenting modelling experiments to guarantee reproducibility, such that anyone can rerun the experiment and obtain the exact model and results. This involves the system indexing the data (i.e. linking to the exact dataset) and code with the exact environment (e.g. by using conda and venv under Python or renv in R), the pre-trained models and all the required parameters, hyperparameters and similar, as well as controlling the randomness of such models (e.g. initialising seed state); a minimal sketch of this seeding step is shown after this two-part description.
Secondly, the entire analytics pipeline should be documented as code, from infrastructure to application level. This allows for the exact replication of the build, test, package and deployment steps. Over the last decade, several technologies and sets of practices have appeared to attain such goals, normally linked to software development concepts, such as DevOps, MLOps and GitOps. GitHub provides Actions for continuous integration and deployment, allowing the automation of the entire workflow of a software service, from building to testing and deploying, based on simple text files (YAML). On the other hand, Docker images and similar solutions allow services to be containerised using similarly simple definitions and shared across various environments, enhancing consistency and portability, while simplifying deployment and scaling processes. Going a step further, it is nowadays possible to define both the infrastructure and how services interact as code too (e.g. with Docker Compose, Terraform or Kubernetes).
Such concepts must be exploited by the processing component, allowing submission of novel analytics pipelines. As the number of annotated datasets grows over time, the system might schedule the retraining of models and associated pipelines, reporting results and, if desired, replacing the existing analytics pipelines. Moreover, all the details, code and pre-trained models can be provided, so anyone can reuse them anywhere. Given the computation power needed, possibly requiring several GPUs for bursts of work, hybrid solutions offloading part of this work to cloud providers could be implemented, as an alternative to hosting and managing GPU clusters.
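Returning to the first requirement above, controlling randomness typically means fixing the seed of every random number generator a training run touches, so that a rerun with the same data and code reproduces the same model. A minimal sketch for a Python/PyTorch environment follows; the exact set of calls depends on the libraries actually used.

```python
# Fix all random seeds a typical PyTorch training run depends on.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```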
Data integration
Data integration will push the data generated by the above-mentioned sub-components to the respective parts of the system - the repository (e.g. metadata registry of the trained models and images, datasets, annotated data etc.) and the storage (e.g. image files and their derivatives, pre-trained models, metadata packages etc.).
Data export
The system will catalogue millions of specimens, each with variable amounts of metadata. These data can be filtered with complex queries, based on several parameters and fields. As an example, a user might want to search for records of a specific species that have images annotated for the presence of signatures within a specific timespan. Requesting the generation of an image dataset, based on the result of such a query, requires scheduling several processing tasks, from the extraction and merging of the relevant metadata into the desired format, to resizing images if needed, assigning a persistent identifier, generating a dataset page and notifying the user. Moreover, if images and annotations for the same search criteria are updated in the following months, the user might request the dataset to be updated, generating a second version and assigning a new or versioned persistent identifier. Part of this functionality is already demonstrated by GBIF, which uses background jobs to export datasets on user request (excluding images and DOIs, but allowing the export of metadata, based on queries). Moreover, this sub-component may also be responsible for exporting machine-learning datasets to public platforms, such as the Registry of Open Data on AWS or Google Datasets, allowing users to easily mount them on external cloud solutions.
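Schematically, such an export job filters the catalogue, packages the matching records and assigns a versioned, citable identifier. The sketch below is an illustrative toy in which a content hash stands in for a persistent identifier; all field names are assumptions.

```python
# Toy sketch of a query-driven, versioned dataset export.
import hashlib
import json

def export_dataset(records: list[dict], query: dict, version: int) -> dict:
    matches = [r for r in records
               if r.get("species") == query.get("species")
               and r.get("hasImage")
               and r.get("hasSignatureAnnotation")]
    payload = json.dumps(matches, sort_keys=True).encode()
    return {
        "query": query,
        "version": version,
        "recordCount": len(matches),
        # A content hash can serve as (part of) a persistent identifier:
        # a changed result set yields a new identifier for the new version.
        "contentId": "hash://sha256/" + hashlib.sha256(payload).hexdigest(),
        "records": matches,
    }
```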
The 21st century is already seeing catastrophic changes in global biodiversity. The resources needed to monitor and address these changes are far greater than the cadre of professional ecologists and taxonomists can provide. Machine learning promises to dramatically increase our collective capacity and, in complementary fashion, prioritise the attention of human taxonomists where it is most needed.
There are direct benefits of our envisaged infrastructure to biodiversity and research into artificial intelligence, but there are also positive impacts for society, the economy, the environment and for collection-holding institutions, for example, in support for more evidence-based environmental policy; improved pest detection and biosecurity; better monitoring of endangered species and better environmental forecasting to name just a few (also see
Making images accessible in a common infrastructure is an opportunity for collections with limited resources to gain access to tools that would otherwise be unavailable to them. Indeed, Open Access for all researchers, including those from the Global South, is critical to ensure that collections fulfil their obligations to access and benefit sharing. As a large percentage of the world’s natural history specimens are housed in the Global North, scientists from the Global South are excluded from data on their own countries unless suitable access is provided (
Such an infrastructure aligns with the European Strategy for Data (
Opportunity, obstacles and risks to realising a shared infrastructure for natural history collections
Given the many use cases, the large number and diversity of stakeholders and the potential for innovative services and research, what is holding us back from creating the proposed infrastructure? One clear issue is that experts in machine learning are not always aware of the needs or potential of biological collections. These communities should be brought together to find the areas where collections can benefit from generalised approaches. A lack of standardisation and consequent lack of interoperability further impedes progress (
We suggest that the most intractable obstacles to a shared, global infrastructure are socio-political. We envisage an infrastructure without institutional and national borders, in which people, organisations and nations are co-beneficiaries of a system, in which knowledge, skills, financing and other resourcing are acknowledged (
Experiments so far lack scalability, often have manual bottlenecks and experience significant time lag in production of results due to limited access to computational and physical resources and to human resources to create and curate training datasets (
The establishment of a new paradigm in research on collections impacts the frameworks and workflows currently used in collection curation and the research based on them and can, therefore, be disruptive. One of the greatest risks is introducing inherent errors and biases that are derived from the algorithms and prejudices that may be embedded unknowingly in training data (
The institutions that hold collections have safeguarded this rich resource of information about biodiversity and natural history. They are major stakeholders in preserving these materials and in making the associated data available to researchers and society. Paradoxically, making the data accessible digitally might create the illusion that there is no need to maintain the collections physically. In fact, the more information we can extract and link, the more valuable physical collections become for any future technology that can be applied to them. It is, therefore, critical to guarantee the link between the digital and physical specimen to ensure neither becomes obsolete, risking the real value attached to both.
The future
Objects in natural history collections represent one of the most important tools to understand life on our planet. Mobilising the capacity to analyse billions of objects with the help of machine learning is essential to meet the challenge of conserving and sustainably using biodiversity. This paper is written to emphasise the huge potential and the challenges. The main limitation to achieving our vision is not the software for machine learning, nor the ideas for using it, but the accessibility of data and images of specimens in a computational environment where they can be processed efficiently.
Many additional uses can be imagined for the analysis of non-specimen data, that is, the additional information that is linked to the physical object, either when directly written on attached labels or linked to inventories, catalogues or spreadsheets (
This work was supported by European Cooperation in Science and Technology (COST) as part of the Mobilise Action CA17106 on Mobilising Data, Experts and Policies in Scientific Collections. Heliana Teixeira was supported by CESAM - FCT/MCTES UIDB/50017/2020+UIDP/50017/2020. Renato Panda was supported by Ci2 - FCT/MCTES UIDP/05567/2020. Elizabeth Ellwood is supported by the National Science Foundation (DBI 2027654). This work was also facilitated by the Research Foundation – Flanders research infrastructure under grant number FWO I001721N, the BiCIKL (grant agreement No 101007492) and SYNTHESYS+ (grant agreement No 823827) projects of the European Union’s Horizon 2020 Research and Innovation action.
QG conceived the paper; QG, MD, AA, WA, CB, PB, LC, RF, PG, AG, PH, RH, AJ, VK, LL, RL, SM, JM, KM, RPa, MPi, BR, AR, TR, JS, MS, BS, HT, MT, JG conducted the initial analysis of the subject; QG, JS found funding to resource the research; all authors contributed to writing the text; all authors reviewed the manuscript; QG, HH, OG, IL and JG conducted the final editing; JP designed the method and wrote the software to visualise Figure 1; PH created Figures 2-4; JG, MD, MS, BS, RPa, KM, AJ, VK conceived and designed Figure 5 and the model it presents.