Building a database for long-term monitoring of benthic macrofauna in the Pertuis-Charentais (2004-2014)

Abstract

Background
Long-term benthic monitoring is rewarding in terms of science, but labour-intensive, whether in the field, the laboratory, or behind the computer. Building and managing databases requires multiple skills, including consistency over time as well as organisation via a systematic approach. Here, we introduce and share our spatially explicit benthic database, comprising 11 years of benthic data. It is the result of intensive benthic sampling that has been conducted on a regular grid (259 stations) covering the intertidal mudflats of the Pertuis-Charentais (Marennes-Oléron Bay and Aiguillon Bay). Samples were taken by foot or by boat during winter depending on tidal height, from December 2003 to February 2014. The present dataset includes abundances and biomass densities of all mollusc species of the study regions and principal polychaetes as well as their length, accessibility to shorebirds, energy content and shell mass when appropriate and available. This database has supported many studies dealing with the spatial distribution of benthic invertebrates and temporal variations in food resources for shorebird species as well as latitudinal comparisons with other databases. In this paper, we introduce our benthos monitoring, share our data, and present a "guide of good practices" for building, cleaning and using it efficiently, providing examples of results with associated R code.

New information
The dataset has been formatted into a geo-referenced relational database, using the PostgreSQL open-source DBMS. We provide density information, measurements, energy content and accessibility of thirteen bivalve, nine gastropod and two polychaete taxa (a total of 66,620 individuals) for 11 consecutive winters. Figures and maps are provided to describe how the dataset was built and cleaned, and how it can be used.
This dataset can again support studies concerning spatial and temporal variations in species abundance, interspecific interactions as well as evaluations of the availability of food resources for small- and medium-sized shorebirds and, potentially, conservation and impact assessment studies.


Introduction
Genesis: This benthos monitoring was initiated in winter 2003-2004 with the aim of describing feeding resources for overwintering shorebird species (e.g. Curlews, Grey Plovers, Bar-tailed Godwits, Black-tailed Godwits, Red knots, Dunlins, Oystercatchers, Redshanks and one duck species: Shelducks). At first, spatial studies were initiated and led to reference papers in ecosystem comparisons in the domain of molluscan studies (Bocher et al. 2007, Compton et al. 2009) or shorebird ecology (e.g. Quaintenne et al. 2011). Following the example of a long-term program at the NIOZ Institute (Netherlands Institute for Sea Research) through the SIBES program in the Dutch Wadden Sea and a sampling of the Banc d'Arguin (Mauritania) and Roebuck Bay (Australia) mudflats, sampling was continued on an annual basis at 259 stations.

Objectives:
The initial purpose of the monitoring was to study the spatio-temporal variations of the main macrobenthic species as available resources for shorebirds. Knowing individuals to the species level, their length, shell mass, accessibility (the top fraction was analysed separately) and flesh energy content, one can analyse for example:
• the spatial variability of densities of benthos prey, including comparisons with other countries (Bocher et al. 2007);
• the fraction available for Calidris canutus and Calidris alpina (sandpiper species), since the top 4 cm of the cores were analysed separately and shells were measured (Philippe et al. 2016);
• the diets of red knots Calidris canutus, which can be predicted for a given site and year from the quality of the molluscs (flesh to shell ratio) using a digestive rate model derived from a type II functional response curve (van Gils et al. 2004, Quaintenne et al. 2011).

Project description
Title: Long-term molluscs and annelids monitoring in the Pertuis-Charentais, France
Personnel: The monitoring was conducted every year under the responsibility of Pierrick Bocher with constant participation of Philippe Pineau and Nicolas Lachaussée and managers of the National Nature Reserves of Aiguillon Bay and Marennes-Oléron Bay. Additional help in the field was provided by multiple other colleagues and PhD candidates throughout the years.

Study area description:
Sampling was performed on intertidal mudflats located in national nature reserves: RNN Aiguillon Bay and RNN Moëze-Oléron.
Design description: Systematic sampling was performed on four regular 250 m grids, composed of 64 stations each (except for the subsite "Oléron" (OL) which contained 67 stations).
Funding: The monitoring was funded by the University of La Rochelle and the CNRS via laboratory donations and staff time. Financial support was received from the Région Poitou-Charentes through a thesis grant to Anne Philippe (2013-2016). LIENSs laboratory and the DYFEA team also provided help with field work costs. The Office National de la Chasse et de la Faune Sauvage (ONCFS) and the Ligue pour la Protection des Oiseaux (LPO) supported this monitoring via staff time and nautical resources dedicated to sample collections.

Sampling methods
Sampling description: Benthic macrofauna was collected over a predetermined 250 m grid following a proven sampling protocol (Bocher et al. 2007, Bijleveld et al. 2012), see (Fig. 1). Each station was located with a handheld GPS device using the WGS84 geodetic datum. Out of 259 stations sampled, only a minority (46%) was sampled by foot (during low tide), taking sediment cores covering an area of 1/56 m² down to a depth of 20 to 25 cm. The top fraction (first 4 cm of the sediment) was separated from the bottom fraction to be able to segregate the accessible benthos fraction for the two main shorebird species: the red knot Calidris canutus and the dunlin Calidris alpina (Fig. 2). We took an additional core (70 mm diameter) covering 1/263 m² to a depth of 4 cm for sampling exclusively the very abundant mudsnail Hydrobia ulvae (Pennant) (Bocher et al. 2007). When the tide covered the mudflats with water (0.4-2.0 m) and for the very soft and inaccessible lower intertidal areas, sampling was done from boats using inflatable zodiacs or other small vessels. From the boats, two mud cores (100 mm diameter) covering a total of 1/56 m² to a depth of 20 to 25 cm were taken. Only one core was taken into account for H. ulvae, and both were taken into account for any other macrobenthic species. The harvestable fraction is defined relative to the mean bill length of the sandpipers (the dunlin Calidris alpina and the red knot Calidris canutus): it is composed of the benthos that is both accessible (found in the top 4 cm of the mud core) and of ingestible size (not too large and not too small). Ingestible sizes are species specific, and depend on the shape of the mollusc.
Sizes available for C. canutus and C. alpina species are reported in Philippe et al. (2016).
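The definition of the harvestable fraction given above (accessible and of ingestible size) can be sketched as a simple filter. This is only an illustration: the predator names, the prey abbreviation "prey_a" and, above all, the length limits are hypothetical placeholders, not values from this dataset; the actual limits are those reported in Philippe et al. (2016).

```python
from dataclasses import dataclass

# Hypothetical ingestible length limits in mm, keyed by (predator, prey abbreviation).
# Placeholder numbers for illustration only; see Philippe et al. (2016) for real limits.
INGESTIBLE_MM = {
    ("knot", "prey_a"): (5.0, 16.0),
    ("dunlin", "prey_a"): (2.0, 8.0),
}


@dataclass
class Individual:
    abr: str        # species abbreviation
    length: float   # maximum length in mm
    tbh: str        # "top", "bott" or "hyd", as in column "T.B.H"


def is_harvestable(ind, predator, limits=INGESTIBLE_MM):
    """True when the individual is both accessible and of ingestible size."""
    # "hyd" counts as accessible because H. ulvae is sampled in the top 4 cm only.
    accessible = ind.tbh in ("top", "hyd")
    if (predator, ind.abr) not in limits:
        return False  # no size window known for this pair: not counted
    low, high = limits[(predator, ind.abr)]
    return accessible and low <= ind.length <= high
```

With these placeholder limits, a 10 mm individual found in the top fraction would be harvestable for the hypothetical "knot" entry but not for the "dunlin" entry, and never harvestable when found in the bottom fraction.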
The cores were sieved through a 1 mm mesh, except for the mudsnail H. ulvae cores, which were sieved over a 0.5 mm mesh to prevent the loss of individuals smaller than 1 mm (from the apex to the aperture). All living molluscs were collected in plastic bags and frozen (-18°C) until laboratory treatment (Fig. 3). Polychaetes were preserved in 70% ethanol. Living specimens were determined using the identification key described in Hayward and Ryland (1996).

Sample processing:
In the laboratory, molluscs were identified under the supervision of a benthic taxonomist. Individuals were counted, and their maximum length was measured to the nearest 0.1 mm using Vernier calipers; width was also measured for bivalves. Hydrobiid mudsnails were size-categorised from 0 mm up to 6 mm (e.g. size class 2 consisted of individuals with lengths ranging from 2 to 2.99 mm).
The flesh of every mollusc specimen except H. ulvae was detached from the shell and placed individually in crucibles (individuals smaller than 8.0 mm were pooled by size class, flesh and shell together). Crucibles containing molluscs were placed in a ventilated oven at 55 to 60°C for three days until constant mass and then weighed (DM ±0.01 mg). Dried specimens were then incinerated at 550°C for 4 h to determine their ash mass and, from it, a proxy of their energy content: the ash-free dry mass (AFDM). For bivalves larger than 8.0 mm, shells were placed in adequate numbered stalls, dried in a ventilated oven at 55 to 60°C for three days and weighed (DM ±0.01 mg). For H. ulvae individuals, the shell was not separated from the flesh, and shell dry mass was estimated from the total biomass using the following regression: DM = 5.5902 × AFDM
Figure caption: Processing of benthic samples in the laboratory.
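The three small calculations described in this section can be written down as follows. This is a minimal sketch, not the authors' script: the function names are made up for illustration, AFDM is taken as dry mass minus ash mass (the usual definition), and the H. ulvae coefficient is used exactly as printed in the text.

```python
import math

# Coefficient as printed in the text (DM = 5.5902 x AFDM); units follow the dataset.
HYDROBIA_SHELL_FACTOR = 5.5902


def size_class(length_mm):
    """Size class of a mudsnail: class 2 covers lengths from 2.00 to 2.99 mm."""
    return math.floor(length_mm)


def ash_free_dry_mass(dry_mass, ash_mass):
    """AFDM: dry mass minus the mass of ash left after incineration."""
    return dry_mass - ash_mass


def hydrobia_shell_dm(afdm):
    """Shell dry mass of H. ulvae estimated from AFDM with the printed regression."""
    return HYDROBIA_SHELL_FACTOR * afdm
```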
In the present dataset, we removed all epibenthic species (e.g. crustaceans) since they are nearly absent from the feeding regime of shorebirds in our study area (Bocher et al. 2014). Polychaetes were also identified and counted, but their length and AFDM were not determined due to insufficient numbers of entire individuals to build regression equations. Nearly all polychaete species were removed from the database because the precision of sample sorting and determination varied widely between the years depending on the laboratory operator. Only individuals from the Genus Nephtys (more than 90% of which were Nephtys hombergii) and individuals of the Family Nereididae (more than 90% of which were either Alitta succinea or Hediste diversicolor) were kept because they represent more than 80% of polychaetes in our study sites and comprise the biggest annelids and therefore biomass and prey for shorebirds (pers. observation). They are mentioned under the abbreviations "nepsp" (for Nephtys genus) and "neresp" (for Nereididae family) in the database. For these two polychaete taxa, the columns "length", "width", "AFDM", "Biomass_dens" and "NewTBH" are not available; only densities are available.

Geographic coverage
Description: In winter 2003-2004, an extensive benthos sampling (864 stations) was conducted following a grid over the whole intertidal zone of Aiguillon Bay and Marennes-Oléron Bay (Bocher et al. 2007). In the following years, the grid was reduced for practical reasons to a 259-point grid spread over four subsites covering three different types of mudflats: Aiguillon Bay on the Charente-Maritime side (AIC, bare mudflat), Aiguillon Bay on the Vendée side (AIV, bare mudflat), Moëze intertidal area (MO, bare mudflat with runnels system) and Oléron Island intertidal area (OL, sandy mudflat covered with seagrass). The grids were given a square shape when possible to facilitate movements and navigation on the mud and at sea (Fig. 1). Because this sampling effort was designed primarily for the study of shorebird ecology, grids were placed in protected areas (National Nature Reserve of Moëze-Oléron, and National Nature Reserve of Aiguillon Bay) managed by the LPO (Ligue pour la Protection des Oiseaux) and ONCFS (Office National de la Chasse et de la Faune Sauvage), matching the annual shorebird counting areas of Wetlands International. Sampling stations are referenced by spatial coordinates using the WGS84 geodetic datum. The four sampled subsites present contrasted characteristics that are described in Degré et al. (2006) and Bocher et al. (2007).
Due to field complexities or bad weather, some stations were not sampled in some years, namely: in 2004, stations 2131-2133 and 2136-2140; in 2006, stations 2103 and 1953; in 2005, stations 1860, 1391 and 1395; in 2008, station 1300; in 2010, station 1958; in 2012, station 2137; and in 2014, stations 2137 and 1311. To this list, all the stations from Marennes-Oléron (subsites 'OL' and 'MO') in years 2010 and 2011 should be added. The value "NA" (i.e. not available) is given in the dataset so that these cases are not mistaken for true "absence".
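The distinction between "NA" and a true absence matters as soon as densities are averaged. The sketch below uses made-up density values and a hypothetical helper, with Python's None standing in for "NA", to show that unsampled stations should be left out of the mean rather than counted as zeros.

```python
def mean_density(values):
    """Mean over sampled stations only; None marks a station that was not sampled."""
    sampled = [v for v in values if v is not None]
    if not sampled:
        return None  # nothing was sampled, so the mean itself is not available
    return sum(sampled) / len(sampled)


# Made-up densities for four stations: 0.0 is a true absence, None is "NA".
densities = [120.0, 0.0, None, 60.0]

correct = mean_density(densities)                                   # 180 / 3 stations
biased = sum(v or 0.0 for v in densities) / len(densities)          # "NA" wrongly read as 0
```

Here the mean over the three sampled stations is 60.0, whereas treating the unsampled station as an absence would drag it down to 45.0.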

Taxonomic coverage
Description: The dataset includes all occurrences of individuals for the taxa listed in Table 2. Annelids were grouped under the Genus Nephtys or the Family Nereididae. Ruditapes sp. consisted essentially of individuals of the species Ruditapes philippinarum. Scientific names and corresponding AphiaIDs were derived from the World Register of Marine Species (WoRMS). The combination of both scientific name and AphiaID prevents any confusion when names change over time. Indeed, in ten years of collecting data, various operators have used a list of species that was an extract of a taxonomy (with description of phylum, subphylum, class, subclass, infraclass, superorder, order, suborder, infraorder, superfamily, family, genus, species, and authority). However, classification evolves throughout time, and our dataset needed to be backed by an official registry of species linked to the semantic Web. The World Register of Marine Species (WoRMS, World Register of Marine Species 2016b) provides a unique AphiaID for identifying marine species, and maintains a history of changes of naming or classification. Furthermore, it provides semantic export of species descriptions in RDF compatible with the Darwin Core or Dublin Core metadata standards. We used this register in order to disambiguate the identification of species found in the dataset. For that, we have written a Python program querying the Web Service (World Register of Marine Species 2016a) provided by WoRMS, for searching and resolving the AphiaIDs of our species list, using the WSDL interface (World Register of Marine Species 2016c). For each record in our table, the program sends the name of the species as a parameter of the query matchAphiaRecordsByNames and the service returns the matching taxa; the program then asks for the full AphiaRecord of the taxon (getAphiaRecordByID), parses it and checks it against our own classification.
When an AphiaRecord matches our own table record, we retain it; otherwise we log the uncertain case in order to search manually in the web interface. Only a few cases (around 20), identified in the logs of our program, have been searched and solved manually using the web interface of WoRMS. The result is that we provide within our dataset the current taxonomy of our species extracted from WoRMS, and most importantly their corresponding AphiaID in WoRMS.
genus: Genus associated with the individual. In the present datapaper the column refers to the column 'genus'.

English name: English common name associated with the individual. In the present datapaper the column refers to the column 'English name'.
authority: The person credited to have formally named the species for the first time. In the present datapaper the column refers to the column 'authority'.
nr: Total individuals of this species in the database. In the present datapaper the column refers to the column 'nr'.
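The match-then-check loop described for the WoRMS lookup can be outlined as below. This is a sketch of the logic only, not the authors' program: the two service calls are passed in as plain functions so that the example runs offline, the record fields ("AphiaID", "scientificname", "genus") and the list-of-lists return shape of the name-matching call are assumptions about the service, the check is reduced to the genus, and the stub records use invented names and identifiers. In practice the two callables would wrap matchAphiaRecordsByNames and getAphiaRecordByID through a SOAP client.

```python
def resolve_aphia_ids(species_rows, match_by_names, get_record_by_id):
    """Split our species rows into (resolved, uncertain) using the two service calls."""
    resolved, uncertain = [], []
    for row in species_rows:
        # Step 1: ask the service for candidate taxa matching the scientific name.
        candidates = match_by_names([row["scientificname"]])[0]
        kept = None
        for cand in candidates:
            # Step 2: fetch the full record and compare it with our own classification.
            record = get_record_by_id(cand["AphiaID"])
            if record["genus"] == row["genus"]:
                kept = record
                break
        if kept is not None:
            resolved.append({**row, "AphiaID": kept["AphiaID"]})
        else:
            uncertain.append(row)  # to be logged and searched manually
    return resolved, uncertain


# Offline stand-ins for the two service calls, with invented names and identifiers.
FAKE_RECORDS = {
    1001: {"AphiaID": 1001, "scientificname": "Genusa speciesa", "genus": "Genusa"},
}


def fake_match(names):
    return [[r for r in FAKE_RECORDS.values() if r["scientificname"] == n] for n in names]


def fake_get(aphia_id):
    return FAKE_RECORDS[aphia_id]


rows = [
    {"abr": "gensp", "scientificname": "Genusa speciesa", "genus": "Genusa"},
    {"abr": "unksp", "scientificname": "Unknownus taxon", "genus": "Unknownus"},
]
resolved, uncertain = resolve_aphia_ids(rows, fake_match, fake_get)
```

Injecting the two calls keeps the checking logic testable without network access; the row left in `uncertain` corresponds to the cases that were resolved by hand through the web interface.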

Additional information
Steps of database building: The dataset was built via the association of different original tables at different steps of the collection, sorting, measuring and weighing of samples. After each sampling session, we filled in a table "CORE" with columns: "year", "Poskey", "car_mod_id" and "car_top_bottom" as well as the unique identifier "core_id". Subsequently, in the laboratory while determining and measuring the samples, we filled in a table "BENTHOS" with columns: "abr", "number", "length", "width", "class_length", "T.B.H", "AFDM", "Shell_DM" and "core_id". A unique identifier "id" was associated with each row in this table "BENTHOS".
Prior to the tables "CORE" and "BENTHOS", a table "SPECIES" with the scientific name associated with the abbreviation "abr" was built, together with a table "STATIONS" with coordinates, site and subsite associated with the sampling station identifier "Poskey".
All these tables were associated together, in strict respect of integrity criteria specified through 'foreign keys' and 'primary keys' constraints, to form a relational database (Fig. 4) and later the flattened database "benthos" presented in this datapaper. Foreign key "core_id" in table "BENTHOS" had to match primary key "core_id" in table "CORE".
Primary key "Poskey" of table "STATIONS" had to match foreign key "Poskey" of table "CORE". Primary key "abr" in table "SPECIES" had to match foreign key "abr" in table "BENTHOS".
Figure caption: Benthic monitoring database building steps, from data collection to the production of results
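The primary and foreign key constraints listed above can be illustrated with a reduced schema. The sketch uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs anywhere; the table and key names come from the text, while the column types, the subset of columns and the coordinate column names "lon" and "lat" are assumptions made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE STATIONS (
    Poskey INTEGER PRIMARY KEY,
    site TEXT, subsite TEXT, lon REAL, lat REAL
);
CREATE TABLE SPECIES (
    abr TEXT PRIMARY KEY,
    scientificname TEXT
);
CREATE TABLE CORE (
    core_id INTEGER PRIMARY KEY,
    "year" INTEGER,
    Poskey INTEGER NOT NULL REFERENCES STATIONS(Poskey),
    car_mod_id INTEGER,
    car_top_bottom INTEGER
);
CREATE TABLE BENTHOS (
    id INTEGER PRIMARY KEY,
    abr TEXT NOT NULL REFERENCES SPECIES(abr),
    core_id INTEGER NOT NULL REFERENCES CORE(core_id),
    "number" INTEGER,
    "length" REAL
);
"""


def build_database():
    """Create the four linked tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript(SCHEMA)
    return conn


def insert_is_accepted(conn, sql, params):
    """Try an insert and report whether the integrity constraints allowed it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, a "BENTHOS" row pointing to a "core_id" that does not exist in "CORE", or a "CORE" row pointing to an unknown "Poskey", is rejected at insert time, which is the kind of error the integrity checks are meant to catch.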

Cleaning process
Step 1. Check integrity constraints, misspellings, uniqueness: The preparation of the dataset (e.g. checking for data types, misspelling, uniqueness of identifiers and integrity constraints as described in Table 2) was performed through building the relational database under PostgreSQL, following the principles described in Chapman (2005). The work was performed using SQL scripts, during and after the import of all raw CSV tables (BENTHOS, CORE, STATIONS, SPECIES), which at the beginning were neither linked together nor compliant with all constraints (on attribute types, for instance). Some insights into the work carried out during this step are given in Table 3.
Step 2. Outliers and NA, using regression models on data: Once the relational database was built, and integrity criteria met, the database was cleaned in a systematic way, species by species, site by site and year by year. This step was done using R statistical software (R Development Core Team 2015). The first step aimed at unfolding the dataframe (i.e. producing one row for each sampled individual). Then the column "length" was cleaned from its outliers or potential typing errors using common methods (Hawkins 1980). Subsequently, outliers and missing values in the columns "width", "DM" and "AFDM" were re-estimated from the column "length" using extrapolation curves, following the R code (Suppl. material 1), see (Fig. 5). This process led to substantial improvement of the dataset, see Table 3.
Caption: Integrity checks, data cleaning and inferring of missing data
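The idea of re-estimating a column from "length" can be sketched as follows. This is not the authors' R code (Suppl. material 1): it assumes a power-law relationship fitted by least squares on log-transformed values, and it flags a value as an outlier when it differs from the fitted value by more than a factor `max_ratio`, a hypothetical threshold chosen for the example.

```python
import math


def fit_loglog(pairs):
    """Least-squares fit of log(y) = a + b * log(x); returns (a, b)."""
    xs = [math.log(x) for x, _ in pairs]
    ys = [math.log(y) for _, y in pairs]
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def predict(a, b, length):
    """Value expected for a given length under the fitted curve."""
    return math.exp(a + b * math.log(length))


def clean_column(rows, max_ratio=3.0):
    """Replace missing (None) or outlying values by the fitted value.

    rows is a list of (length, value) pairs for one species, site and year.
    """
    known = [(x, y) for x, y in rows if y is not None and y > 0]
    a, b = fit_loglog(known)
    cleaned = []
    for x, y in rows:
        expected = predict(a, b, x)
        is_outlier = y is not None and not (expected / max_ratio <= y <= expected * max_ratio)
        cleaned.append((x, expected if (y is None or is_outlier) else y))
    return cleaned
```

Note that in this simple version the outliers take part in the fit and therefore pull the curve towards themselves; a more careful procedure would refit after removing them or use a robust regression.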
Assigning sampling area to each station in each year, depending on the sampling method: The columns "area" and "area_hyd" were filled in according to the sampling method (either by boat "car_mod_id"=2 or by foot "car_mod_id"=1). By boat, two cores were taken using a corer with a radius of 0.050 m (area= 2 x (PI x 0.050²)); by foot, one core was taken with a radius of 0.075 m (area= PI x 0.075²). Hydrobia ulvae were counted in one of the two cores taken by boat (area= PI x 0.050²), and if sampling was made by foot an additional smaller core was taken, with a radius of 0.035 m (area= PI x 0.035²), see Table 1.
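The area formulas above translate directly into code. The sketch below returns the pair ("area", "area_hyd") in m² for each sampling method; the function names are illustrative, and the density helper simply assumes that an abundance density is a count divided by the sampled area, which is not spelled out in this paragraph.

```python
import math


def sampling_areas(car_mod_id):
    """Return (area, area_hyd) in m2 for sampling by foot (1) or by boat (2)."""
    if car_mod_id == 1:
        # By foot: one core of radius 0.075 m, plus a 0.035 m core for H. ulvae.
        return math.pi * 0.075 ** 2, math.pi * 0.035 ** 2
    if car_mod_id == 2:
        # By boat: two cores of radius 0.050 m, only one of them used for H. ulvae.
        return 2 * math.pi * 0.050 ** 2, math.pi * 0.050 ** 2
    raise ValueError(f"unknown sampling method: {car_mod_id}")


def density_per_m2(count, area):
    """Individuals per square metre, assuming density = count / sampled area."""
    return count / area
```

For a station sampled by foot, one individual found in the main core corresponds to roughly 56.6 individuals per m², in line with the 1/56 m² core area quoted in the sampling description.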
Calculating abundance and biomass densities: The next step consisted of deriving abundance densities ("dens_ind") and biomass densities ("biomass_dens"), see Suppl. material 3 and Table 1.
Extrapolating top and bottom fractions: The last step of data cleaning aimed at assigning the sampled individuals to the top or the bottom fraction of the sediment (i.e. whether or not they were accessible to a sandpiper's bill). When samples were taken by foot ("car_mod_id"=1), the top 4 cm were always separated from the bottom fraction ("car_top_bottom"=TRUE), and the sampled individuals always received a "top", "bott" or "hyd" value in the column "T.B.H", "hyd" corresponding to the top fraction of the sediment since H. ulvae are always accessible to sandpipers. However, when samples were taken from boats ("car_mod_id"=2), the top was not separated from the bottom fraction ("car_top_bottom"=FALSE), and the sampled individuals did not receive a "top" or "bott" value in the column "T.B.H". In that case, a script (Suppl. material 2) was applied to infer for each individual whether it was more likely to belong to the top or the bottom fraction of the sediment based on species-site-year-length specific probabilities (except Hydrobia individuals, which would always be accessible, and annelid species, for which lengths were not available).
Figure caption: Data cleaning using extrapolation curves: an example of shell dry mass extrapolation based on shell length in the bivalve Macoma balthica.
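One way to picture the top/bottom inference for boat samples is sketched below. It is not the authors' script (Suppl. material 2): it assumes that the probability of being in the top fraction is the observed share of "top" individuals among foot samples of the same species, site, year and 1 mm length class, and that each boat-sampled individual is then assigned by a random draw. The dictionary keys and the binning are illustrative choices.

```python
import math
import random
from collections import defaultdict


def group_key(row):
    """Species, site, year and 1 mm length class of an individual."""
    return (row["abr"], row["site"], row["year"], math.floor(row["length"]))


def top_probabilities(foot_rows):
    """Share of 'top' individuals per group, estimated from foot samples."""
    counts = defaultdict(lambda: [0, 0])  # key -> [number in top, total number]
    for row in foot_rows:
        counts[group_key(row)][0] += row["tbh"] == "top"
        counts[group_key(row)][1] += 1
    return {key: n_top / n for key, (n_top, n) in counts.items()}


def assign_fraction(boat_row, probs, rng):
    """Draw 'top' or 'bott' for a boat-sampled individual; None if no reference group."""
    p = probs.get(group_key(boat_row))
    if p is None:
        return None
    return "top" if rng.random() < p else "bott"


# Made-up foot samples: small individuals in the top fraction, a large one below.
foot_rows = [
    {"abr": "sp_a", "site": "MO", "year": 2014, "length": 5.2, "tbh": "top"},
    {"abr": "sp_a", "site": "MO", "year": 2014, "length": 5.8, "tbh": "top"},
    {"abr": "sp_a", "site": "MO", "year": 2014, "length": 12.3, "tbh": "bott"},
]
probs = top_probabilities(foot_rows)
rng = random.Random(1)  # fixed seed so that the assignment is reproducible
```

Fixing the seed of the random generator keeps the inferred "T.B.H" values identical from one run to the next, which matters if the flattened table is regenerated.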

Discussion
Opportunities: There are multiple examples of the potential uses of these data for further exploitation. The multidimensional dataset can be explored for various questions along its spatial or temporal dimensions. For instance, in Fig. 6, the plots exemplify how spatial and temporal dimensions are associated, since we have time series for each station of each site. We can also examine how the harvestable biomass for the sandpipers C. canutus or C. alpina (i.e. the biomass that is accessible because it is found in the "top" fraction and that meets certain criteria for length (Fig. 2)) is distributed amongst the various species, on a site for a given year (Fig. 7). The plot on the left shows that the mean total biomass of the bivalve S. plana was 0.24 g.m⁻² in 2014, varying between 0.10 and 0.25 g.m⁻² across the stations of the site, with a mean value lower than the biomass of M. balthica, but also with a smaller dispersion. The dataset also allows us to analyse how the harvestable mean biomass of S. plana varies over time on this site (Fig. 7, on the right) and allows mapping of densities or species composition (Figs 6, 8).
All the results presented in this datapaper can be obtained running the provided R code associated, see Suppl. material 3.

Limits of data use:
The data was designed to estimate temporal changes and spatial differences in biomass and densities, as well as possible changes in quality, depth or community composition in our study sites. However, for predictions at unsampled locations, additional analyses are required, such as investigating spatial auto-correlation (Bijleveld et al. 2012, Kraan et al. 2010). Regression lines derived from the present dataset are valid only for the particular species, site and year and should not be used to extrapolate missing data in any other context. However, the scripts we provide in supplementary materials can be adapted to other database architectures to clean the data and produce results.

Author contributions
Data collection, sorting and processing of samples as well as database integrity were initiated and managed by Dr. Pierrick Bocher (Senior Scientist at the LIENSs laboratory, University of La Rochelle) in 2003. One assistant engineer (Philippe Pineau) and one technician (Nicolas Lachaussée) helped collect the data every year. Anne Philippe helped in data collection, laboratory work and data uploading, and was responsible for the design and integrity of the database. Transfer of the database into PostgreSQL format and preparation of the figures and scripts accompanying the present datapaper were performed by Dr. Christine Plumejeaud-Perreau (Research engineer at the LIENSs laboratory) with