A system for automated analysis of subsea movies using citizen science and machine learning

This paper describes a data system to analyse large amounts of subsea movie data for marine ecological research. The system consists of three distinct modules for data management and archiving, citizen science, and machine learning in a high performance computation environment. It allows scientists to upload underwater footage to a customised citizen science website hosted by Zooniverse, where volunteers from the public classify the footage. Classifications with high agreement among citizen scientists are then used to train machine learning algorithms. An application programming interface allows researchers to test the algorithms and track biological objects in new footage. We tested our system using recordings from remotely operated vehicles (ROVs) in a Marine Protected Area, the Kosterhavet National Park in Sweden. Results indicate a strong decline of cold-water corals in the park over a period of 15 years, showing that our system allows effective extraction of valuable occurrence and abundance data for key ecological species from underwater footage. We argue that the combination of citizen science tools, machine learning, and high performance computers is key to successfully analysing large amounts of image data in the future, suggesting that these services should be consolidated and interlinked by national and international research infrastructures.
 Novel information system to analyse marine underwater footage.


Introduction
Biological observation techniques in the marine environment need to improve radically to serve our understanding of marine ecosystems under long-term global change and under the influence of multiple stressors (Benedetti-Cecchi et al. 2018). Today biologists increasingly have access to autonomously operated technologies for data collection, offering the opportunity to generate enormous volumes of data. This is especially the case for high-definition optical imagery recorded by ROVs (remotely operated vehicles), AUVs (autonomous underwater vehicles), drop-cameras, video plankton recorders, and drones (Bean et al. 2017, Danovaro et al. 2016). Although such image-based observations may revolutionize the fields of marine biology and biodiversity monitoring, these methods also impose completely new demands for data management and processing on researchers. Hence, such in-situ monitoring systems need to be coupled to data services that allow for swift exploration, processing, and long-term storage (Guidi et al. 2020). Some of these services already exist, including the EcoTaxa application for analysis of large amounts of plankton imagery (M et al. 2017), the CoralNet tool, which allows researchers to analyse coral images, and FathomNet, a system offering machine learning algorithms and training data to analyse deep sea footage. Although these platforms have pioneered the daily use of image analysis tools in marine science, they cannot provide all functions needed by the fast-growing community of users. Archiving functions, for example, often fall under national responsibility and cannot be provided by a single global system alone. Also, training datasets need to be specifically developed for the region of interest and for the scientific question at hand. For this reason, it is important to establish local image analysis services, which can interoperate with larger and global platforms in the future.
Here we present a system for managing, processing, and analysing large amounts of subsea movie data for marine ecological research. The system allows scientists to upload underwater movies to a customised citizen science website hosted by Zooniverse, where volunteers from the public classify the footage. Classifications with high agreement among citizen scientists are used to train machine learning algorithms. These algorithms can be accessed through an application programming interface (API), allowing users to extract species observations from new footage and test the performance of the algorithms under different confidence and overlap thresholds.

Project description
Title: Analysis of spatio-temporal trends in deep water habitat builders (Use case)
Study area description: We piloted our data system to extract data on spatial distribution and relative abundance of habitat-building species from deep-water recordings in a Marine Protected Area, the Kosterhavet National Park in Sweden. The park contains a highly diverse and unique marine ecosystem that has been under active protection and management since 2009. The seafloor in the deeper waters of the park has direct oceanographic connections to the open Atlantic Ocean and hence contains much of the bottom-dwelling fauna that is otherwise only found in deep oceanic waters (Lavaleye et al. 2009). This fauna includes large habitat-building species (Costello et al. 2005) such as sponges (e.g. Geodia barretti, Phakellia ventilabrum) and cold-water corals (e.g. Lophelia pertusa), as well as other large species that can be easily identified from camera footage (e.g. the starfish Porania pulvillus and Crossaster papposus, and the sea urchin Echinus esculentus).
Design description: Our data system is divided into three main modules: data management, citizen science, and machine learning with high performance computing (Fig. 1).

Module 1: Data management (Anton et al. 2019)
In the data management module researchers store and process the data in ways that maximise efficiency, convenience, and opportunities for sharing and collaboration.
To store and access the raw data we use long-term and short-term storage servers. The long-term storage server, or cold storage, archives large numbers of files that do not need to be accessed frequently. These include recordings from Remotely Operated Vehicles (ROVs) managed by the Tjärnö Marine Biological Laboratory, Sweden. The movies (mp4 and mov formats) are on average 1-2 hours long and have been systematically collected from all expeditions since the late 1990s (Fig. 1).
The short-term storage server, or hot storage, stores a small proportion of files that are frequently used. Here, we transferred 60 movies from the cold storage to the project-specific short-term storage server (Fig. 1). The number of movies we selected was a compromise between selecting a representative sample and efficiently using the limited storage of our server. This "hot server" was Linux-based and hosted by Chalmers University of Technology, Gothenburg. This High Performance Computing server was equipped with a GTX2080Ti GPU, 2 x 8 core Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz (total 16 cores), and 2GB DDR4 RAM.
We created a SQLite database to link all information related to the movies and the classifications provided by both citizen scientists and machine learning algorithms (Fig. 1). The database has seven interconnected tables (Fig. 2). The "movies", "sites", and "species" tables hold project-specific information from the underwater movie metadata, as well as the species choices available for citizen scientists to annotate the clips, retrieved from Zooniverse. The "agg_annotations_frame" and "agg_annotations_clip" tables contain information related to the annotations provided by citizen scientists. The "subjects" table has information related to the clips and frames uploaded to the Koster Seafloor Observatory. The "model_annotations" table holds information related to the annotations provided by the machine learning algorithms. The database follows the Darwin Core (DwC) standards to maximise the sharing, use and reuse of open-access biodiversity data.
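As an illustration of how the seven tables could be linked, the following minimal sketch creates them with Python's built-in sqlite3 module. Only the table names are taken from the text. Every column name and foreign key is a hypothetical placeholder and does not reproduce the schema shown in Fig. 2.

```python
import sqlite3

# Table names follow the text; ALL columns and keys are hypothetical placeholders.
SCHEMA = """
CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies (
    id INTEGER PRIMARY KEY, filename TEXT,
    site_id INTEGER REFERENCES sites(id));
CREATE TABLE species (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE subjects (
    id INTEGER PRIMARY KEY, subject_type TEXT,  -- 'clip' or 'frame'
    movie_id INTEGER REFERENCES movies(id));
CREATE TABLE agg_annotations_clip (
    id INTEGER PRIMARY KEY,
    subject_id INTEGER REFERENCES subjects(id),
    species_id INTEGER REFERENCES species(id),
    first_seen REAL);
CREATE TABLE agg_annotations_frame (
    id INTEGER PRIMARY KEY,
    subject_id INTEGER REFERENCES subjects(id),
    species_id INTEGER REFERENCES species(id),
    x REAL, y REAL, w REAL, h REAL);
CREATE TABLE model_annotations (
    id INTEGER PRIMARY KEY,
    movie_id INTEGER REFERENCES movies(id),
    species_id INTEGER REFERENCES species(id),
    frame_number INTEGER,
    x REAL, y REAL, w REAL, h REAL, confidence REAL);
"""


def create_database(path=":memory:"):
    """Create an empty database with the seven tables named in the text."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def table_names(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping citizen science annotations and model annotations in separate tables that point to the same "movies" and "species" records would make it straightforward to compare the two sources with a join.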

Module 2: Citizen science (Anton et al. 2019)
In the citizen science module researchers and citizen scientists work together to efficiently and accurately annotate raw data. To identify the species recorded in our footage we created a citizen science website, the Koster Seafloor Observatory. The Koster Seafloor Observatory is hosted on Zooniverse, the largest citizen science platform in the world. The website contains rich supporting material (e.g. background, tutorials, field guides) and features two workflows that help citizen scientists to correctly classify biological objects in video (workflow 1) and locate these in still images (workflow 2).
Workflow 1 (species identification): Citizen scientists are presented with 10s clips of underwater footage and need to select at least one of the 27 available choices (Fig. 3). The choices include species of scientific importance, animals grouped at different taxonomic levels (e.g. "gastropods" or "fish"), as well as a few miscellaneous options ("Nothing here", "Human objects"). If citizen scientists select a species or animal, they also need to specify the number of individuals of the taxon selected and the time (in seconds) when any of the individuals fully appears on the screen. Each clip is annotated by at least three different citizen scientists. If the annotations provided by the different citizen scientists match, the clip is "retired" from the website, meaning that it is no longer displayed to new users. If the annotations do not match, the clip is annotated by more citizen scientists (up to nine citizen scientists per clip).
Workflow 2 (object location): Citizen scientists are presented with a still image of a species of interest. To annotate the image, citizen scientists need to draw rectangles around the individuals of the species of interest (Fig. 4). If citizen scientists cannot identify any individual of the species of interest in the frame, they do not draw any rectangle. Each still image is annotated by five different citizen scientists before it is "retired" from the website.
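The retirement rules of the two workflows can be sketched as follows. The limits (three, nine, and five annotations) come from the text. The function names and the interpretation of "match" as all volunteers selecting exactly the same set of choices are assumptions made for illustration, and the sketch is not the retirement logic configured on Zooniverse.

```python
MIN_CLIP_ANNOTATIONS = 3  # from the text (workflow 1)
MAX_CLIP_ANNOTATIONS = 9  # from the text (workflow 1)
FRAME_ANNOTATIONS = 5     # from the text (workflow 2)


def should_retire_clip(annotations):
    """Decide whether a workflow 1 clip can be retired.

    annotations: list with one set of selected choices per citizen scientist.
    Assumption: annotations "match" when all sets are identical.
    """
    n = len(annotations)
    if n < MIN_CLIP_ANNOTATIONS:
        return False
    if n >= MAX_CLIP_ANNOTATIONS:
        return True  # upper limit reached, stop showing the clip
    first = frozenset(annotations[0])
    return all(frozenset(a) == first for a in annotations)


def should_retire_frame(n_annotations):
    """Decide whether a workflow 2 frame can be retired."""
    return n_annotations >= FRAME_ANNOTATIONS
```

For example, three volunteers who all select only "coral" would retire the clip, whereas a disagreement keeps it in circulation until the upper limit of nine annotations is reached.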
We used a four-stage image processing framework to upload clips and still images to the Koster Seafloor Observatory and download the annotations provided by citizen scientists (Fig. 5).
Stage 1: Generate and upload clips (Fig. 5, circle a). In this stage we split the movies, each more than one hour long, into 10s clips. After the clips were created, we randomly selected 5,702 clips from the original 60 movies and uploaded them to workflow 1 of the Koster Seafloor Observatory.
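A minimal sketch of the clip selection in Stage 1 is given below. It only computes clip start times and draws a random sample and does not cut any video. The function names, the fixed random seed, and the decision to drop a trailing clip shorter than 10 s are assumptions.

```python
import random

CLIP_LENGTH_S = 10  # clip length from the text


def clip_start_times(movie_duration_s, clip_length_s=CLIP_LENGTH_S):
    """Start times (s) of consecutive full-length clips in one movie.

    Assumption: a trailing clip shorter than clip_length_s is dropped.
    """
    last_start = int(movie_duration_s) - clip_length_s
    return list(range(0, last_start + 1, clip_length_s))


def sample_clips(movie_durations, n_clips, seed=0):
    """Randomly select clips across all movies.

    movie_durations: dict mapping a movie filename to its duration in seconds.
    Returns a list of (filename, start_time_s) tuples without duplicates.
    """
    candidates = [
        (name, start)
        for name, duration in sorted(movie_durations.items())
        for start in clip_start_times(duration)
    ]
    rng = random.Random(seed)  # seeded so that the selection is reproducible
    return rng.sample(candidates, min(n_clips, len(candidates)))
```

Under these assumptions a one-hour movie yields 360 candidate clips, so 60 movies of 1-2 hours each provide far more candidates than the 5,702 clips that were uploaded.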
Stage 2: Process clip annotations (Fig. 5, circle b). We retrieved the annotations provided by citizen scientists in workflow 1 and aggregated them on a per-clip basis. To aggregate the workflow 1 annotations, we grouped the annotations each clip received and retained only those choices that were selected by at least 80% of the citizen scientists who annotated the clip (Table 1). In our study, there were 194 clips for which cold-water coral was identified by at least 80% of the citizen scientists. We also averaged the answers from citizen scientists to the question "When is the first time the species appears fully in the video?".
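The aggregation in Stage 2 can be illustrated with the short sketch below. The 80% threshold and the averaging of first-appearance times come from the text. The data layout (one dictionary per volunteer, mapping each selected choice to the reported time) and the function name are assumptions.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.8  # from the text


def aggregate_clip(annotations, threshold=AGREEMENT_THRESHOLD):
    """Aggregate the workflow 1 annotations of one clip.

    annotations: list with one dict per citizen scientist, mapping each
        selected choice to the reported first-appearance time in seconds.
    Returns a dict mapping each retained choice to its mean first-appearance time.
    """
    n = len(annotations)
    if n == 0:
        return {}
    counts = Counter(choice for a in annotations for choice in a)
    aggregated = {}
    for choice, count in counts.items():
        if count / n >= threshold:  # keep choices with sufficient agreement
            times = [a[choice] for a in annotations if choice in a]
            aggregated[choice] = sum(times) / len(times)
    return aggregated
```

With five volunteers, a choice selected by four of them meets the threshold exactly, while a choice selected by only three (60%) is discarded.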
Stage 3: Generate and upload frames (Fig. 5, circle c). From each of the 194 clips containing cold-water corals we extracted up to three frames, at a rate of one frame per second, starting from the first time the species fully appeared in the clip. We then uploaded the resulting 533 frames to workflow 2 of the Koster Seafloor Observatory.
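The frame selection in Stage 3 can be sketched as a small function that returns the timestamps at which frames would be extracted. The limit of three frames and the one-second spacing come from the text. Taking the first frame exactly at the averaged first-appearance time and stopping at the end of the 10s clip are assumptions.

```python
def frame_times(first_seen_s, clip_length_s=10, max_frames=3, step_s=1):
    """Timestamps (s, relative to clip start) at which to extract frames.

    Assumptions: the first frame is taken at first_seen_s, and no frame is
    taken at or beyond the end of the clip.
    """
    times = []
    t = first_seen_s
    while len(times) < max_frames and t < clip_length_s:
        times.append(t)
        t += step_s
    return times
```

Under these assumptions a species that first appears late in a clip yields fewer than three frames, which would be consistent with the 194 clips producing fewer than 3 x 194 = 582 frames.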
Stage 4: Process frame annotations (Fig. 5, circle d). We retrieved the workflow 2 annotations provided by citizen scientists and aggregated them on a per-frame basis. To aggregate the workflow 2 annotations, we retained the area of overlap between the rectangles drawn by 80% of the citizen scientists who annotated the frame. Once we had aggregated the annotations, we formatted them to train YOLOv3 algorithms (Redmon and Farhadi 2018; Table 2).
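The two steps of Stage 4 can be sketched as follows: computing the common overlap of a group of rectangles, and writing the result in the label format commonly used to train YOLO models (class index followed by the box centre, width, and height, all normalised by the image size). The sketch assumes that the rectangles belonging to the same individual and meeting the 80% agreement rule have already been selected, since the text does not describe how rectangles from different volunteers were matched. Function names and the pixel-based (x, y, width, height) input layout are assumptions.

```python
def intersect(boxes):
    """Common overlap of rectangles given as (x, y, w, h) in pixels.

    Returns the overlap as (x, y, w, h), or None if the rectangles
    do not all overlap.
    """
    left = max(x for x, y, w, h in boxes)
    top = max(y for x, y, w, h in boxes)
    right = min(x + w for x, y, w, h in boxes)
    bottom = min(y + h for x, y, w, h in boxes)
    if right <= left or bottom <= top:
        return None
    return (left, top, right - left, bottom - top)


def to_yolo_line(box, image_w, image_h, class_id=0):
    """Format one (x, y, w, h) pixel box as a normalised YOLO label line."""
    x, y, w, h = box
    x_center = (x + w / 2) / image_w
    y_center = (y + h / 2) / image_h
    return (
        f"{class_id} {x_center:.6f} {y_center:.6f} "
        f"{w / image_w:.6f} {h / image_h:.6f}"
    )
```

Using the intersection rather than the union gives a conservative box that every selected volunteer agreed on, at the cost of slightly underestimating the extent of the object.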

Module 3: Machine learning and High Performance Computing (Germishuys et al. 2019)
In the machine learning and High Performance Computing module researchers train, test and expose state-of-the-art machine learning models. The aggregated citizen scientist annotations are used to train object detection models to track and identify the species of interest. In our case study we used 544 user-annotated ground-truth frames obtained from workflow 2 (Suppl. material 1) to train an algorithm to identify deep-water corals (Lophelia pertusa). We augmented these data using a frame tracker, which filled subsequent movie frames with the bounding boxes that had the highest probability of containing the object of interest. This typically increased the amount of data by a factor of 10. The frames were then pre-processed to remove background distortion as far as possible, since colours often lose intensity underwater, mainly due to poor visibility. Three datasets were then created: one for training the model, one for validation (used to tune the model hyperparameters), and one for testing. Once the data were prepared, the model was trained until satisfactory values were achieved on evaluation measures including F1, Recall, Precision and mAP@0.5 (Table 3). The trained model could then be used to detect species of interest in new footage, whether from a webcam, a video or a still image.
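For reference, Precision, Recall and F1 can be computed from the counts of true positive, false positive and false negative detections, as in the sketch below. In the mAP@0.5 convention a detection counts as a true positive when its intersection over union (IoU) with a ground-truth box is at least 0.5. The function name is an assumption, the sketch does not compute mAP itself, and it is not the evaluation code used to produce Table 3.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from detection counts.

    tp: detections matched to a ground-truth box (true positives)
    fp: detections without a matching ground-truth box (false positives)
    fn: ground-truth boxes that were not detected (false negatives)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision penalises spurious coral detections, recall penalises missed corals, and F1 summarises both as their harmonic mean.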
We made the trained model available through an application programming interface (API), where it can be used by researchers to run predictions of the species of interest in new recordings (Fig. 1). To this end we used FastAPI (Ramírez 2020), as it provides the speed, scalability and reliability required for multiple users to use the service at the same time. The API was also supplied with a user-friendly front-end built with the Streamlit (Teixeira 2020) framework, allowing a broader audience of scientific users (e.g. ecologists, ROV and AUV pilots, students) to access the service through a web application. The interface allows researchers to browse through already-classified footage or upload their own footage as either images or video. Once the media have been uploaded or selected, users are able to adjust the detection thresholds (IOU threshold, confidence threshold) and interactively see the impact on the model output. The API is described by Germishuys et al. (2019).
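The effect of the two thresholds on the model output can be illustrated with the sketch below. The text does not state how the API applies the IOU threshold. In YOLO-style detectors it typically controls non-maximum suppression, where a detection is discarded if it overlaps a higher-scoring detection too strongly, and the sketch assumes this usage. The function names and the (x1, y1, x2, y2) box layout are assumptions, and the code is independent of the actual API.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, conf_threshold=0.5, iou_threshold=0.5):
    """Apply a confidence threshold followed by greedy non-maximum suppression.

    detections: list of (box, confidence) tuples.
    Returns the retained detections, ordered by decreasing confidence.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, confidence in candidates:
        # Keep the box only if it does not overlap a retained box too strongly.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, confidence))
    return kept
```

Lowering the confidence threshold admits more tentative detections, while raising the IoU threshold allows more overlapping boxes to survive, which matters when coral heads grow close together.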
The last component of this module is a data visualization toolkit that enables researchers to explore and visualise the ecological data extracted from the outputs of the machine learning model (Fig. 1). In our case study, we tested the ability of the model to estimate the relative abundance of cold-water corals over time. To this end, we created 20 standardized recordings (i.e. of the same length) from a revisited area in the Kosterhavet National Park during the years 2000-2015. These movies were analysed with the trained model to obtain a timeline for the relative abundance of coral heads in the area before and after the national park was established (Fig. 6).

Discussion
The system described here has been tested in a scientific case study, estimating abundance levels of the habitat builder Lophelia pertusa in a Marine Protected Area across a period of 15 years (Fig. 6). Results show a steep decline of this key ecological species in the national park during the investigated period. We also found no indication of recovery of corals as a consequence of the establishment of the national park in 2009. These results suggest that physical protection alone may not lead to recovery of the coral stocks, and that external climatic pressures, changes in water quality, and oceanographic connectivity may strongly impact the coral populations in the park.
Our scientific case study exemplifies how the presented system can be used to extract ecological data on abundance and distribution for many benthic species from underwater recordings. Underwater recordings are routinely collected by research institutes, which may allow for a concerted analysis of such data over broad spatial and temporal scales in the future. Such analyses may calculate data products for biological state variables at regional or even global level, so-called Essential Biodiversity Variables or EBVs (Pereira et al. 2013, Hardisty et al. 2019). A recent study by Kissling et al. (2018) suggests that image-based sensor networks are promising candidates for EBVs. Our case study provides empirical support that these data products can indeed be derived from image-based sensors, and importantly from marine environments, which are particularly difficult to access and survey. The case study also shows that hindcasts of species occurrence and abundance are possible if archived video material is available.
To scale up the analysis of underwater imagery in the future and extract ecological data for larger regions, longer time periods, and more species, several technical bottlenecks have to be addressed. Currently, underwater recordings are archived locally and often cannot be discovered. We suggest making underwater recordings discoverable by publishing metadata in national and/or European archives and data portals (e.g. European Marine Data Archive, EMODnet portal). Another important technical bottleneck is the disconnection between many essential data services that need to interact to successfully analyse image data. We suggest that seamless links should be developed, especially between citizen science platforms (for training of machine learning models) and high-performance computation services (for extracting ecological data from large amounts of imagery). National and European research infrastructures should take a leading role in this development.

IP rights notes:
The system is open for use in research, as well as public and academic education for analysis of community composition in marine ecosystems.
processing workflow and model). MO worked with the public contributions to the Zooniverse site. All three authors contributed equally to the writing and revision of the manuscript.

Figure 1.
High-level overview of Koster Seafloor Observatory process.

Figure 2.
Entity relationship diagram of the SQLite database used by the Koster Seafloor Observatory.

Figure 3.
Screenshot of the Zooniverse annotation interface. On the left, display of the clips. On the right, species choices available.

Figure 4.
Example of a frame containing deep water coral displayed to the citizen scientists (left) and the same frame with annotated rectangles provided by a citizen scientist (right).

Figure 5.
Four-stage image processing framework used to identify species of interest.

Table 1.