LifeWatch observatory data: phytoplankton observations in the Belgian Part of the North Sea

Abstract
Background
This paper describes a phytoplankton data series generated through systematic observations in the Belgian Part of the North Sea (BPNS). Phytoplankton samples were collected during multidisciplinary sampling campaigns, visiting nine nearshore stations with monthly frequency and an additional eight offshore stations on a seasonal basis.
New information
The data series contains taxon-specific phytoplankton densities determined by analysis with the Flow Cytometer And Microscope (FlowCAM®) and associated image-based classification. The classification is performed by two separate semi-automated classification systems, followed by manual validation by taxonomic experts. To date, 637,819 biological particles have been collected and identified, yielding a large dataset of validated phytoplankton images. The collection and processing of the 2017–2018 dataset are described, along with its data curation, quality control and data storage. The classification of images from 2019 onwards, using algorithms based on convolutional neural networks (CNNs), is also described. Data are published in a standardised format together with environmental parameters, accompanied by extensive metadata descriptions and labelled with digital identifiers for traceability. The data are published under a CC-BY 4.0 licence, which allows their use provided that the source is cited.


Introduction
Phytoplankton contributes almost half of the Earth's total primary production (Field et al. 1998). It forms the base of the marine food web, and alterations to its composition and abundance often have repercussions on higher trophic levels, including those of economic importance (Richardson and Schoeman 2004). In addition, harmful algal blooms cause economic losses to aquaculture, fisheries and tourism (Hallegraeff 1993, Wells et al. 2020, Anderson et al. 2012). Furthermore, phytoplankton plays an important role in the carbon pump, sequestering carbon dioxide from the surface and sinking it to the deep sea (Buesseler et al. 2007, Hutchins and Fu 2017). Due to their small size, short generation times and large population numbers, phytoplankton are indicators of marine ecosystem change (Margalef 1978).
The availability of long-term phytoplankton observational data for the Belgian Part of the North Sea (BPNS) is limited. In the last decades, several studies have described the Belgian phytoplankton community structure (Muylaert et al. 2006, Muylaert et al. 2009, Gasparini et al. 2000, Breton et al. 2006). The 4DEMON project integrated dispersed phytoplankton abundance data gathered by research projects in the BPNS, going back to 1968 (Nohe et al. 2018). However, as most of the sampling was limited in time and oriented towards single sampling locations, information on the spatial dynamics of the phytoplankton in the BPNS remains scarce (Muylaert et al. 2006).
In general, long-term time series of phytoplankton are hard to come by (Edwards et al. 2001, Suikkanen et al. 2007, Edwards et al. 2010) because species composition and abundance are highly variable (Suikkanen et al. 2007) and characterising them using traditional methods is tedious, time-consuming and expensive (Lund et al. 1958, Zingone et al. 2015). Over the past decades, there has been a proliferation of imaging systems to count and measure plankton in a faster and more efficient manner (e.g. Cytobuoy, FlowCytobot or FlowCAM) (Benfield et al. 2007, Haraguchi et al. 2018, Álvarez et al. 2014). Digital flow cytometry using FlowCAM® (Fluid Imaging Technologies, Scarborough, Maine, U.S.A.) has gained attention as a means of rapid cell counting of phytoplankton since it was first used by Sieracki et al. (1998).

General description
Purpose: In response to the identified data gap for the BPNS and taking into account the availability of the newest imaging technology, a long-term phytoplankton observation effort was initiated as part of the Flemish contribution to LifeWatch. Multidisciplinary sampling campaigns are organised in the BPNS on a regular basis, collecting phytoplankton samples that are processed with a digital imaging flow cytometer (FlowCAM). The procedures put in place for automated processing and manual validation provide a durable approach to generating a long-term, high-quality phytoplankton time series.

Study area description:
The BPNS is located in the Southern Bight of the North Sea. It is characterised by shallow waters (< 40 m) and strong semi-diurnal tidal currents resulting in a vertically homogeneous water column (Lee 1980, Muylaert et al. 2006). Its waters are influenced by freshwater discharges (from the Yzer, Scheldt, Meuse and Seine) and saltwater inflow (Atlantic water, coming in through the English Channel), resulting in an on-offshore gradient (Lancelot et al. 1987, Lacroix et al. 2004). In addition, the BPNS is an area heavily impacted by the introduction of non-indigenous species, industrial and agricultural pollution, overfishing and trawling, dredging, human-induced eutrophication, sand and gravel extraction, offshore construction and heavy shipping traffic (Emeis et al. 2015).
Design description: Stations are visited in the course of one to three-day sampling cruises with the RV Simon Stevin on a monthly or seasonal frequency. Sampling activities onboard are registered in the Marine Information Data Acquisition System (MIDAS). Through MIDAS, scientists can record the metadata of their scientific actions (e.g. time, coordinates, action type, start and stop of the action, station, status of deployment and notes). MIDAS also registers the navigation (heading, current time, latitude, longitude, speed, course over ground, navigation depth and draught), together with meteorological (air temperature, relative humidity, wind direction and speed) and oceanographic data (sea surface water temperature, salinity, chlorophyll-a and sound velocity). This information is synchronised with the VLIZ ICT network every 24 hours and is made available online through the VLIZ website.

Sampling methods
Study extent: A spatial grid of 17 stations, spread over the BPNS, has been sampled since May 2017 (Fig. 1). Nine nearshore stations are sampled on a monthly basis. Eight additional stations, positioned further offshore, are sampled only with seasonal frequency (Fig. 2). The stations are part of the LifeWatch marine observatory (http://www.lifewatch.be) that forms a dense net of sensor networks and observation stations in the Belgian coastal waters and sandbank system, a designated site in the Long Term Ecological Research (LTER) network (Muelbert et al. 2019).
Quality control: The outputs of both classification processes are manually validated by an experienced taxonomist to remove the errors of the automatic prediction. In this step, the taxonomist checks that all the imaged particles have been assigned to the correct category by the automatic classification; if not, the particles are manually moved to the right category. The taxonomist evaluates all particles twice to correct possible misclassifications. The species identification is done with the help of Tomas (1997), Kraberg et al. (2010) and Alfred Wegener Institute for Polar and Marine Research (AWI) (2020). A summary with the morphological description of the categories found in the dataset and example FlowCAM images is available upon request. All manual input to the databases is guided by forms and fields with associated input rules, avoiding the most common editing errors. Taxon names are linked to the corresponding AphiaIDs of WoRMS (WoRMS Editorial Board 2020), thereby linking to the most recent accepted names and authorities.
Step description:
Sampling at sea
The phytoplankton samples are collected with a stainless steel bucket. In total, either 50 or 70 litres of surface water are hauled onboard and poured into an Apstein net (1.2 m long, 55 µm mesh size and 50 cm diameter). The volume of water collected is documented in MIDAS. The sample is concentrated in a plastic jar at the cod-end of the net, where the sample and rinsing water escape through a 55 µm mesh window. Immediately afterwards, the sample is preserved in acid Lugol's solution at a 5% final concentration and stored onboard in dark conditions at 4°C. At the end of the sampling campaign, the samples are transported to the Marine Station Ostende (MSO) and stored there at 4°C until processing. The sample material remaining after processing is available to researchers for re-use.

FlowCAM processing
Within three months after collection, the samples are processed using the FlowCAM VS-4 (Fluid Imaging Technologies, Yarmouth, Maine, U.S.A.) and the software VisualSpreadsheet® Version 4.2.52. FlowCAM combines the technologies of flow cytometry, microscopy and image analysis (Sieracki et al. 1998). It counts and photographs particles moving in a fluid flow. The sample passes through a flow cell, drawn by the syringe pump associated with that flow cell. A digital grey-scale camera captures the particles as they pass in front of the microscope (Álvarez et al. 2011). The output is a collection of pictures, combined in collages that constitute the output of VisualSpreadsheet (Álvarez et al. 2012). In addition, a List File contains the particle properties of each targeted particle (Camoying and Yñiguez 2016).
For this dataset, the 300-µm deep flow cell with the 4X objective and the 5 ml syringe pump are used. This combination maximises the taxonomic resolution for the size range of interest without compromising the running time. Sample preservation with Lugol negates the ability to discriminate cells from detritus through the detection of chlorophyll (Graham et al. 2018). Therefore, samples are processed using the AutoImage working mode, which images particles at a user-defined number of frames per second (FPS) (here, 20 FPS), with a flow rate of 1.7 ml min⁻¹. The setting of choice in VisualSpreadsheet is a Basic Size Acquisition Filter that selects particles based on their equivalent spherical diameter (ESD): 70-300 µm ESD in 2017 and 55-300 µm ESD from 2018 onwards. The focus is set directly on the sample, instead of using focus beads, since this practice is more time-effective. Then, a 1.5 ml subsample is run to obtain information on the particle concentration. If the concentration is too high, the sample is diluted to a concentration of < 600 particles ml⁻¹ to reduce the chance of overlapping particles in the captured frames.
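The size filter and the dilution rule described above can be illustrated with a short Python sketch. It is not the VisualSpreadsheet implementation: the function names are hypothetical, the ESD limits are assumed to be in µm and inclusive, and the choice of a whole-number dilution factor is an assumption made only to keep the example simple.

```python
import math


def esd_bounds(year):
    """Return the assumed (min, max) ESD acceptance window in µm for a sampling year."""
    # 70-300 in 2017; 55-300 from 2018 onwards (values taken from the text).
    return (70.0, 300.0) if year == 2017 else (55.0, 300.0)


def passes_size_filter(esd_um, year):
    """True if a particle's ESD falls inside the window used in that year."""
    low, high = esd_bounds(year)
    return low <= esd_um <= high  # inclusive limits are an assumption


def dilution_factor(particles_per_ml, limit=600.0):
    """Smallest whole-number dilution that brings the concentration below `limit`."""
    if particles_per_ml < limit:
        return 1  # no dilution needed
    return math.floor(particles_per_ml / limit) + 1
```

For instance, a test run measuring 1,500 particles per ml would call for a threefold dilution under these assumptions, giving 500 particles per ml.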
Attachment of diatoms with spines to the flow cell wall (e.g. Chaetoceros Ehrenberg) and aggregation of chain-forming diatoms (e.g. Bellerochea) often interfere with the sample processing. To minimise clogging and to increase the durability of the flow cell, each sample is pre-filtered through a 300-µm mesh-size net (Álvarez et al. 2011, Álvarez et al. 2012). A periodic pinch of the flow cell tubing by the operator reduces clogging, thus assuring a constant flow of particles (Poulton and Martin 2010). To reduce the variability, each sample has three technical replicates, each of them capturing a maximum of 1,500 particles or covering a total Sample Volume Processed of 5 ml in 2017 and 8 ml from 2018 onwards. Once the sample has been processed, the flow cell is cleaned with two cycles of 5 ml of Milli-Q® water, then ethanol (70%), leaving little air in between fluids, and finally Milli-Q® water. To convert from cell counts in the FlowCAM to phytoplankton Abundance (cells l⁻¹), we used the following formula:

$$\text{Abundance} = \frac{\text{Cell count} \times \text{Vol. sample}}{\text{Vol. imaged} \times \text{Vol. filtered}}$$

where Abundance is defined as the number of cells in a litre of the unfiltered water sample, Cell count is the number of cells counted by the FlowCAM, Vol. imaged is the volume in the field of view of each sample, Vol. filtered is the volume poured into the Apstein net and Vol. sample is the remaining sample after the filtration in the Apstein net.
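As a worked illustration of this conversion, the sketch below scales a FlowCAM count up to the concentrated sample and then back to the volume of seawater poured into the net. It follows from the definitions given above rather than from the authors' code: the function and parameter names are hypothetical, the units (ml for the imaged and concentrated volumes, litres for the filtered volume) are assumptions, and any dilution applied before the run is ignored.

```python
def abundance_cells_per_litre(cell_count, vol_imaged_ml, vol_sample_ml, vol_filtered_l):
    """Convert a FlowCAM cell count to cells per litre of unfiltered seawater.

    cell_count     -- cells counted in the imaged volume
    vol_imaged_ml  -- volume that passed through the field of view (ml, assumed unit)
    vol_sample_ml  -- concentrated sample left after net filtration (ml, assumed unit)
    vol_filtered_l -- seawater poured into the Apstein net (litres)
    """
    if min(vol_imaged_ml, vol_sample_ml, vol_filtered_l) <= 0:
        raise ValueError("volumes must be positive")
    # Concentration in the concentrated sample, then total cells it holds.
    cells_per_ml_concentrate = cell_count / vol_imaged_ml
    cells_in_concentrate = cells_per_ml_concentrate * vol_sample_ml
    # Those cells originally came from vol_filtered_l litres of seawater.
    return cells_in_concentrate / vol_filtered_l
```

With made-up numbers, 120 cells counted in 0.5 ml imaged, a 200 ml concentrate and 50 litres filtered give 120 / 0.5 × 200 / 50 = 960 cells per litre.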

Semi-automatic classification with VisualSpreadsheet (2017-2018)
A reference library of phytoplankton images for the Southern Bight of the North Sea was created using the autoclassification tool of VisualSpreadsheet and manual validation. Following software recommendations, the reference library consists of various categories, each containing 10-20 images (regions of interest; ROIs) and covering a species, or a higher taxon group where identification at species level is not possible. A category is called a "class" in VisualSpreadsheet and, based on the images in each category, filters are defined. A category can contain several filters to represent different orientations or developmental stages of the same taxon (e.g. Chaetoceros in valve view or girdle view). The combination of categories and their filters is stored as a learning set that is used to run an Auto Classification and assign the sample particles to different categories and taxon groups. In addition, separate library categories are created for non-phytoplanktonic particles (e.g. crustacea, eggs, detritus). Due to the large diversity of taxa in the samples and the variation in species composition over the year, the combination of categories used in the learning set needs to be adapted regularly. Only the categories of the taxa expected to be present are used. Categories and their filters are applied in order from the most abundant to the least abundant taxa. The obtained classification is validated manually by taxonomic experts.

Semi-automatic classification with CNNs (2019-current)
Since 2019, the classification of our FlowCAM images is facilitated by using deep learning classifiers, more specifically CNNs. One of the prerequisites for allowing the use of deep learning classifiers is the availability of a large training dataset. Once our validated FlowCAM dataset (2017-2018) was sufficiently large, it became possible to shift towards CNNs for class prediction of the images. The main benefit of using CNNs is the increased classification accuracy, reducing the time spent by trained taxonomists to validate the data afterwards. Consequently, this also allows the data to be released to the public sooner.
The current iteration of the CNN in use is the one provided and trained by Instituto de Física de Cantabria (IFCA, Spain) (Lloret et al. 2018). The classifier is trained to detect 53 microplankton classes, comprising 42 genera. The training dataset was sampled from the entire FlowCAM dataset, with the number of images per category capped at 30,000. For every category, 90% of the images were used as training data and 5% each for validation and testing. For each image, the trained model predicts the probability that it belongs to each defined category. By using the prediction with the highest probability, the current CNN approach reaches a classification accuracy of 90.7%. A 99.4% accuracy is reached when the correct label is allowed to be among the five highest-probability predictions. However, there are still difficulties with the classification of rare taxa that have hardly any validated ROIs. These rare taxa prevent the use of this classifier as a fully autonomous classification system. Human validation therefore remains imperative.
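The data split and the two accuracy figures can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the IFCA training code: the function names, the random seed, the use of a shuffle before capping, and the representation of a prediction as a dictionary mapping class name to probability are all hypothetical.

```python
import random


def split_category(images, cap=30_000, seed=0):
    """Cap one category at `cap` images, then split it 90/5/5 (train/validation/test)."""
    rng = random.Random(seed)
    pool = list(images)
    rng.shuffle(pool)          # random subsample when the category exceeds the cap
    pool = pool[:cap]
    n_train = len(pool) * 90 // 100
    n_val = len(pool) * 5 // 100
    train = pool[:n_train]
    validation = pool[n_train:n_train + n_val]
    test = pool[n_train + n_val:]
    return train, validation, test


def top_k_accuracy(probabilities, true_labels, k=1):
    """Fraction of images whose true label is among the k most probable classes."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank class names from most to least probable for this image.
        ranked = sorted(probs, key=probs.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)
```

Calling `top_k_accuracy(predictions, labels, k=1)` corresponds to the accuracy obtained from the single most probable class, while `k=5` corresponds to accepting the correct label anywhere in the five most probable classes.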
Moving towards a new classification methodology also offers opportunities to further automate and standardise our FlowCAM data processing pipeline. In the new setup, raw output files from the FlowCAM are processed directly by a set of Python scripts. The typical "FlowCAM collages" are cropped into separate ROIs, a clean data table describing all ROIs is generated and additional sample processing metadata are incorporated into the output directory. This avoids the use of VisualSpreadsheet, allowing more and easier control over the data, as well as enabling automation of the dataflow. The generated files are uploaded to a MongoDB server where they are classified by the CNN.
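The cropping step can be pictured with the following minimal sketch. The actual scripts are not described in this paper, so everything here is an assumption: the collage is represented as a plain list of pixel rows rather than an image file, and the bounding-box keys (`particle_id`, `x`, `y`, `width`, `height`) are hypothetical stand-ins for whatever the FlowCAM output provides. A real pipeline would more likely rely on an imaging library.

```python
def crop_rois(collage, boxes):
    """Cut each bounding box out of a collage stored as a list of pixel rows.

    collage -- list of rows, each row a list of grey-scale pixel values
    boxes   -- list of dicts with hypothetical keys: particle_id, x, y, width, height
    Returns a dict mapping particle_id to its cropped ROI (list of rows).
    """
    rois = {}
    for box in boxes:
        x, y = box["x"], box["y"]
        w, h = box["width"], box["height"]
        # Select rows y..y+h-1, then columns x..x+w-1 within each row.
        rois[box["particle_id"]] = [row[x:x + w] for row in collage[y:y + h]]
    return rois
```

Each cropped ROI can then be written out alongside a row in the data table that describes it, which is the structure the paragraph above outlines.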

Geographic coverage
Description: Data were collected at 17 stations across the BPNS (Fig. 1).
Coordinates: 51°5'21.5"N and 51°52'34"N Latitude; 3°22'13.4"E and 2°14'8"E Longitude.

Taxonomic coverage
Description: The dataset is composed of 55 categories identified at species level, or as a higher taxon group if identification at species level is not possible. Bacillariophyceae (33 taxa) and Dinophyceae (7 taxa) are the most abundant phytoplankton classes in the dataset, the rest of the dataset being formed by non-phytoplanktonic categories (15).

... May be a global unique identifier or an identifier specific to the dataset.
waterBody: The name of the water body in which the Location occurs.
country: The name of the country or major administrative unit in which the Location occurs.
countryCode: The standard code for the country in which the Location occurs.
minimumDepthInMeters: The lesser depth of a range of depth below the local surface, in metres.
maximumDepthInMeters: The greater depth of a range of depth below the local surface, in metres.
decimalLatitude: The geographic latitude (in decimal degrees, using the spatial reference system given in ...
occurrenceID: An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.
measurementType: The nature of the measurement, fact, characteristic or assertion.

Dataset location and format
Data are made available through the LifeWatch data explorer, where users can access, visualise and download the quality-controlled data table that includes the Trip action ID, Date (Time), Station, Taxon, Abundance (Density) and additional metadata. Each sample, identified by its unique Trip action ID, is represented by several rows, one for the abundance of each Taxon. In the background, all particle data, including cropped pictures, taxonomic annotation and associated sample and particle metadata, are stored in a MongoDB data system; they are not downloadable, but are accessible upon request. This database is replicated as a back-up on servers of Instituto de Física de Cantabria (IFCA) in Santander, Spain. For long-term preservation, the original data files are archived in the Marine Data Archive. The quality-controlled classification files and the cropped pictures of the FlowCAM are archived to a network archive on the VLIZ servers and linked to their metadata in the MIDAS system. This database is uploaded to the IFCA server (Santander, Spain). For further redistribution and exchange with European and global data systems, the data are integrated in the European node of the Ocean Biogeographic Information System (EurOBIS) and the Biology portal of the European Marine Observation and Data Network (EMODnet). The inputs to these networks are currently provided through yearly exports, but procedures enabling higher data exchange frequencies are under development. The data exchange requires reformatting in accordance with the OBIS-ENV-DATA format, which is an adaptation of the Darwin Core Archive (DwC-A) schema, developed for sample-based marine biological data (De Pooter et al. 2017). In the OBIS-ENV-DATA standard, the DwC-A file contains three main structural elements: an Event core linked to an Occurrence extension and an ExtendedMeasurementOrFact (eMoF) extension. The Event core stores information on sampling location, time and depth.
The Occurrence extension stores the presence/absence data of the taxa. The eMoF extension contains the abundance data, the environmental data at the moment of sampling, the sampling equipment and the protocols. The eMoF data are standardised following controlled vocabularies managed by the British Oceanographic Data Centre and the European SeaDataNet project. Fixed versions of the database are distributed annually (e.g. Flanders Marine Institute 2020). A metadata record is created in the dataset catalogue of the Integrated Marine Information System and dataset versions are labelled with a Digital Object Identifier (DOI). The complete data pathway is given in Fig. 4 (pre-2019) and Fig. 5 (post-2019).
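The three-part structure described above can be sketched in Python as follows. This is only an illustration of how one sample might map onto an Event core, an Occurrence extension and an eMoF extension. The keys are standard Darwin Core terms, but the function name, the way the occurrenceID is built, the measurement labels and the example values are assumptions and do not reproduce the actual export procedure or the controlled-vocabulary identifiers.

```python
def to_obis_env_data(trip_action_id, date, station, taxon_abundances):
    """Map one sample onto Event, Occurrence and eMoF records (lists of dicts)."""
    # Event core: one record per sampling event (location, time).
    event = {"eventID": trip_action_id, "eventDate": date, "locationID": station}
    occurrences, emof = [], []
    for taxon, cells_per_litre in taxon_abundances.items():
        # Hypothetical identifier scheme: event identifier plus taxon name.
        occurrence_id = f"{trip_action_id}:{taxon}"
        # Occurrence extension: presence/absence of each taxon in the event.
        occurrences.append({
            "eventID": trip_action_id,
            "occurrenceID": occurrence_id,
            "scientificName": taxon,
            "occurrenceStatus": "present" if cells_per_litre > 0 else "absent",
        })
        # eMoF extension: the abundance measurement tied to that occurrence.
        emof.append({
            "eventID": trip_action_id,
            "occurrenceID": occurrence_id,
            "measurementType": "Abundance",          # illustrative label
            "measurementValue": cells_per_litre,
            "measurementUnit": "cells per litre",    # illustrative label
        })
    return event, occurrences, emof
```

Keeping the abundance in the eMoF records rather than in the Occurrence records mirrors the division of roles stated in the text: the Occurrence extension carries presence or absence, and the measurements are attached separately through shared identifiers.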

Current usage and future perspectives
Monitoring of phytoplankton via the FlowCAM is part of a long-term European Strategy Forum on Research Infrastructures (ESFRI) initiative. Regular updates of the validated data are accessible on the LifeWatch data explorer and a yearly dataset is published in the Marine Data Archive (MDA). Valorisation of these data is ongoing in the framework of the Marine Strategy Framework Directive (MSFD) and in support of blue economy research (for example, fouling management, nature-based solutions and aquaculture); the data are also part of an artificial intelligence application study.

Figure 5. Schematic overview of the data flow from ship to user, since 2019.