Kakila database: Towards a FAIR community approved database of cetacean presence in the waters of the Guadeloupe Archipelago, based on citizen science

Abstract

Background
In the French West Indies, more than 20 species of cetaceans have been observed over the last decades. The recognition of this hotspot of biodiversity of marine mammals, observed in the French Exclusive Economic Zone of the West Indies, motivated the French government to create in 2010 a marine protected area (MPA) dedicated to the conservation of marine mammals: the Agoa Sanctuary. Threats that cetacean populations face are multiple, but well-documented. Cetacean conservation can only be achieved if relevant and reliable data are available, starting by occurrence data. In the Guadeloupe Archipelago and in addition to some data collected by the Agoa Sanctuary, occurrence data are mainly available through the contribution of citizen science and of local stakeholders (i.e. non-profit organisations (NPO) and whale-watchers). However, no observation network has been coordinated and no standards exist for cetacean presence data collection and management.

New information
In recent years, several whale watchers and NPOs regularly collected cetacean observation data around the Guadeloupe Archipelago. Our objective was to gather datasets from three Guadeloupean whale watchers, two NPOs and the Agoa Sanctuary, that agreed to share their data. These heterogeneous data went through a careful process of curation and standardisation in order to create a new extended database, using a newly-designed metadata set. This aggregated dataset contains a total of 4,704 records of 21 species collected in the Guadeloupe Archipelago from 2000 to 2019. The database was called Kakila ("who is there?" in Guadeloupean Creole). The Kakila database was developed following the FAIR principles with the ultimate objective of ensuring sustainability. All these data were transferred into the PNDB repository (Pôle National de Données de Biodiversité, Biodiversity French Data Hub, https://www.pndb.fr).
In the Agoa Sanctuary and surrounding waters, marine mammals have to interact with increasing anthropogenic pressure from growing human activities. In this context, the Kakila database fulfils the need for an organised system to structure marine mammal occurrences collected by multiple local stakeholders with a common objective: contribute to the knowledge and conservation of cetaceans living in the French Antilles waters. Much needed data analysis will enable us to identify high cetacean presence areas, to document the presence of rarer species and to determine areas of possible negative interactions with anthropogenic activities.


Introduction
Roughly 40% of the world's human population lives within 100 km of a coast and its growth is putting unprecedented pressure on coastal and marine ecosystems and their organisms (Burke et al. 2001, Halpern et al. 2015). In particular, shipping now accounts for more than 90% of global trade and is constantly increasing, resulting in an expanding consumption of coastal land and a continuous increase in the intensity of maritime traffic and the size of its vessels (Sèbe 2020, UNCTAD 2018, Walker et al. 2019). If we want to mitigate the consequences of these changes, it is essential to monitor our impacts on the oceans and their ecosystems and to collect relevant data for this purpose. In particular, the monitoring of marine mammal populations may contribute to a better understanding of the interactions between the growing pressure of human maritime activities and their environment. Indeed, cetacean populations are considered as sentinel and umbrella species, because their presence testifies to the functional importance of the marine realm for the conservation of the environment (Hooker and Gerber 2004, Jung and Madon In press). However, scientific surveys generally require costly human and financial resources to implement the sampling protocols that are required to estimate robust relative abundance and density of cetacean species at sufficiently fine spatial and temporal scales (Laran et al. 2017, Pennino et al. 2017, Rone et al. 2016). To address these constraints, complementary methods are needed to extend spatial and temporal coverage and to collect additional data. In this context, citizen science, in which part of the research is conducted by volunteer non-professional scientists, represents a highly relevant alternative to scientific surveys to acquire additional data at lower cost and often at larger spatial and temporal scales.
Thus, in many situations and places where scientific data cannot be collected, data provided by citizens are an invaluable source of information. For marine mammals, relevant examples are, for instance, the Monicet platform in the Azores (http://www.monicet.net), the Flukebook catalogue (Levenson et al. 2015), the Gotham Whale project near New York City (https://gothamwhale.org), the Intercet platform in the northern Tyrrhenian Sea (http://www.intercet.it), the network Obsenmer in some places of French waters (https://www.obsenmer.org) or the recently published data obtained in Kenya (Mwango'mbe et al. 2021). Although the data acquired by citizen science can be opportunistic and ultimately heterogeneous, it has been shown that they can reveal the same trends as those highlighted by data obtained through scientific surveys (Harvey et al. 2018, Jung et al. 2009, Stelle 2017, Van Strien et al. 2013).
The Guadeloupe Archipelago is a hotspot of marine biodiversity where understanding the interactions between cetaceans and human activities is essential. This has also led the French government to create a marine protected area dedicated to marine mammals within the French Exclusive Economic Zone of the West Indies: the Agoa Sanctuary. However, adequate cetacean conservation can only be achieved if relevant and reliable data are available. In the Guadeloupe Archipelago, besides a PhD thesis (Gandilhon 2012) and a few scientific observation surveys (Boisseau et al. 2006, Laran et al. 2019), occurrence data are only available thanks to the contribution of dedicated local citizen-science stakeholders (i.e. NPOs and whale-watching companies). These data are highly valuable, often collected by experienced observers able to accurately distinguish between species, and some of them were used for targeted scientific studies (Heenehan et al. 2019, Kennedy et al. 2014, Stevick et al. 2016, Stevick et al. 2018). By their very heterogeneous nature, citizen science data are challenging to analyse (Van Strien et al. 2013). That is why it makes sense to integrate them into a database complying with the FAIR principles (Wilkinson et al. 2016) using a step-by-step community approach and a pragmatic method taking into account the constraints of the stakeholders. All of this is with the aim of promoting their sharing and dissemination within the scientific community interested in marine mammals and marine spatial planning.
This data paper presents the process of structuring heterogeneous multi-source data in order to build a robust and standardised database of cetacean observations around the Guadeloupe Archipelago (Fig. 1). Observations collected over several years by local NPOs or whale watchers (Figs 2, 3) have been integrated into a database named "Kakila" (namely "who is there" in the Guadeloupean Creole language). The data processing steps, their curation protocol, quality assurance processes and the methods and tools that enable the long-term integrity and comprehension of data are presented. The Kakila database has been added into the PNDB repository (Pôle National de Données de Biodiversité, Biodiversity French Data Hub, https://www.pndb.fr).
Figure 1. Area of study. Perimeter of the Agoa Sanctuary, which corresponds to the French Exclusive Economic Zone in the West Indies, and localisation of the Guadeloupe Archipelago (data sources: map base, http://www.caribbeanmarineatlas.net; Agoa protection zone, https://inpn.mnhn.fr).

Project description
Design description: The FAIRification process of the Kakila database (Table 1). The key goal of our project was to group heterogeneous, but scientifically significant datasets of cetacean observations in the Guadeloupe Archipelago into a single database and to make it open access. To achieve this goal, we followed the FAIR guiding principles (Wilkinson et al. 2016). In line with the European and international Open Science dynamic, the French National Plan for Open Science (Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation 2018) aims to ensure that data produced by government-funded research in France are gradually structured to comply with the FAIR Data Principles (Findable, Accessible, Interoperable and Reusable) (Wilkinson et al. 2016). We also followed the "as open as possible, as closed as necessary" principle of the H2020 Programme Guidelines on FAIR Data (Landi et al. 2020), by deleting the observer names from this shared version to avoid the dissemination of personal data. As a consequence, the chosen strategy for the FAIRification process mainly used the recommendations of the Sharing Rewards FAIR assessment decision-tree criteria and lessons learned for the gradual implementation of FAIR criteria.

Table 1. FAIR principles (Wilkinson et al. 2016) and FAIRness assessment criteria used for the Kakila database.

FINDABLE
-Using unique identifiers for each observation occurrence, observer, boat excursion, taxon, collector organisation and geographic sector.
-Making metadata and datasets persistent through deposit in the French Pôle National de Données de Biodiversité (PNDB, https://www.pndb.fr/), which is a national infrastructure data repository.
-Providing a data dictionary to guarantee the reusability of the database.
-Using the internationally recognised Ecological Metadata Language (EML) standard to describe the database metadata and its associated projects, including standardised search keywords.
-Using a metadata format validator through MetaShARK (Arnaud et al. 2020).
-Using a versioning system to allow future updates.
-Generating a Darwin Core Archive from the Kakila database. The Darwin Core Standard (DwC) offers a stable, straightforward and flexible framework for compiling biodiversity data, notably occurrences, from varied and variable sources (Wieczorek et al. 2012).

ACCESSIBLE
-Storing data in the PNDB repository with respect to the guidelines for quality standards (e.g. use of EML).
-Efficient and rich services for various uses and users provided by the PNDB.
-Working to adapt the Kakila database in order to integrate it in the GBIF.

INTEROPERABLE
-Using standard vocabularies for some fields (e.g. Beaufort Wind Scale for the wind speed).
-Using keywords from international thesauri, such as GEMET/INSPIRE (GEMET 2008) and AGROVOC (Food and Agriculture Organization of the United Nations 1980).
-Using a data dictionary including the Darwin Core mapping.
-Associating a Darwin Core archive with the Kakila database. The Darwin Core Standard (DwC) offers a stable, straightforward and flexible framework for compiling biodiversity data from varied and variable sources (Wieczorek et al. 2012).

REUSABLE
-Using an open format for the dataset (Tab Separated Values .tsv and OpenDocument .ods for the original database) and open source software to reuse it.
-Including in the EML metadata the provenance for raw and derived data.
-Explaining in this data paper the data processing steps, the data curation protocol, the data quality assurance processes and the methods and tools that permit long-term integrity and understandability of data.
-Using a time range clearly mentioned in the EML metadata and in this data paper. The same applies to the geographical and taxonomic coverage, the CC-BY licence and the rules for broad reuse.
-Using a Darwin Core Archive to facilitate the reusability of the Kakila database, because it enables publication in the GBIF. This compact package (a ZIP file) contains interconnected text files and enables users to share their data using common terminology.

Sampling description: In a first phase, sampling consisted of a preliminary survey of the different NPOs and professional whale-watchers known to record cetacean observation data around the Guadeloupean Archipelago and whose expertise was previously recognised, for example, through co-authorship of scientific publications (Barragán-Barrera et al. 2019, Stevick et al. 2016, Heenehan et al. 2019, Stevick et al. 2018), participation in a PhD (Gandilhon 2012) and book publication (e.g. Mon école ma baleine 2019). We established contacts to collaborate and to agree on the terms of use and fair sharing of the data in a common database. Following this first survey, an informal invitation to open and contribute their dataset was sent to each organisation. All agreed to share and open the data once the aggregated database was finalised.

Table 3. Data dictionary (metadata repository) of the Kakila DB. Datasets and column labels are also presented in the "Data resources" part. The Darwin Core data standards are described in Wieczorek et al. (2012).
Rows recoverable here: the latitude and longitude fields, both Numeric. The "longitude" field (longitude of the observation expressed in decimal degrees) maps to the Darwin Core term decimalLongitude and the latitude field to decimalLatitude, each in decimal degrees using the spatial reference system given in "Reference system".

Quality control: An effort to centralise and harmonise siloed data was made by controlling the join keys (e.g. "code_observation", "code_sortie" etc.) between linked tables using dynamic pivot tables. Content quality controls were also used, such as controlled dropdown menus for many fields to avoid potential input errors. Geolocations, often transformed into decimal degrees, were verified using the Geographic Information System software QGIS 3.10 (long-term release).
In addition, data were checked for errors: 10% of the entries were randomly selected and checked by two persons. One person carried out the random draw from the "observation" table and the other checked the selected lines in the database against the original datasets provided by the data owners. A data entry was invalidated if it contained an error in any field. The error rate, calculated as the number of data entries containing an error divided by the total number of checked data entries, was estimated at 0.073 in the Kakila database.
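The random-draw check described above can be illustrated with a minimal Python sketch. It is not the procedure actually used by the authors, who worked with two human checkers: the function names, the toy records and the seed are all hypothetical, and the comparison against the original datasets is reduced here to a dictionary equality test.

```python
import random

def draw_sample(record_ids, fraction=0.10, seed=0):
    """Randomly select a fraction of record identifiers, without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(record_ids) * fraction))
    return rng.sample(list(record_ids), k)

def error_rate(checked_ids, database, originals):
    """Entries with an error in any field, divided by the number of checked entries."""
    invalid = sum(1 for rid in checked_ids if database[rid] != originals[rid])
    return invalid / len(checked_ids)

# Toy data (illustrative only, not Kakila records).
originals = {i: {"taxon": "Megaptera novaeangliae", "count": i % 5 + 1}
             for i in range(100)}
database = {i: dict(row) for i, row in originals.items()}
database[3]["count"] = 99                          # simulated input error
database[42]["taxon"] = "Physeter macrocephalus"   # simulated input error

sample = draw_sample(sorted(database), fraction=0.10, seed=1)
sample_rate = error_rate(sample, database, originals)
full_rate = error_rate(sorted(database), database, originals)
```

Here `sample_rate` is the estimate obtained from the 10% draw, while `full_rate` is the true rate in the toy data; the published value of 0.073 is an estimate of the first kind.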
Step description: the structure of the Kakila database was based on the original structures of the datasets and on the functional dependencies between the data. New fields of the Kakila database were defined and approved by the data providers. Then a data dictionary was defined (Table 3). The aim of this dictionary was to produce a precise definition or description of each of the fields, based on validated scientific frameworks. The data dictionary is essential to guarantee the reusability of the database. In particular, the data dictionary ensures a clear definition of fields and limits input errors for future data entry.
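To illustrate how a data dictionary can both define fields and limit input errors, the following Python sketch encodes two dictionary entries and uses them to validate a record and to map it to Darwin Core terms. Only the field "longitude" and the terms decimalLatitude and decimalLongitude come from the data dictionary; the `FieldSpec` structure, the "latitude" label and its description, the range bounds and the helper functions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry (hypothetical structure)."""
    description: str
    dtype: type
    dwc_term: Optional[str] = None
    minimum: Optional[float] = None   # assumed bound, not from the paper
    maximum: Optional[float] = None   # assumed bound, not from the paper

DATA_DICTIONARY = {
    "latitude": FieldSpec("Latitude of the observation expressed in decimal degrees",
                          float, "decimalLatitude", -90.0, 90.0),
    "longitude": FieldSpec("Longitude of the observation expressed in decimal degrees",
                           float, "decimalLongitude", -180.0, 180.0),
}

def validate(record):
    """Return a list of problems found when checking a record against the dictionary."""
    problems = []
    for label, spec in DATA_DICTIONARY.items():
        if label not in record:
            problems.append(f"{label}: missing")
            continue
        try:
            value = spec.dtype(record[label])
        except (TypeError, ValueError):
            problems.append(f"{label}: not a valid {spec.dtype.__name__}")
            continue
        if spec.minimum is not None and value < spec.minimum:
            problems.append(f"{label}: below {spec.minimum}")
        if spec.maximum is not None and value > spec.maximum:
            problems.append(f"{label}: above {spec.maximum}")
    return problems

def to_darwin_core(record):
    """Rename the fields of a record to their Darwin Core terms."""
    return {spec.dwc_term: record[label]
            for label, spec in DATA_DICTIONARY.items()
            if spec.dwc_term is not None and label in record}

good = {"latitude": 16.2, "longitude": -61.5}
bad = {"latitude": "abc", "longitude": -200}
```

Calling `validate(good)` returns an empty list, whereas `validate(bad)` reports one problem per field.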
The overall structure of the Kakila database was then designed to allow the establishment of relationships between the variables within the database. Kakila contains six main tables (Fig. 4):
-The table "observateur" (observer) lists the volunteers and whale watchers who made the observations, together with a level of expertise (from beginner "débutant" to expert "expert") for each of them.
-The table "organisme" (organisation) lists the data providers, NPOs and whale watchers.
-The table "sortie" (field trip) lists the field trips recorded in the Kakila database (n = 3249), and contains information on the date and duration of trips, observer(s) on board, sea state and visibility.
-The table "observation" (observation) lists the observations of marine mammal species recorded during the corresponding field trip. Place and time of the observation are recorded, as well as the taxon identified (see table "code_taxon") and the number of individuals observed. The availability of a picture for the observation is stated.
-The table "taxon" (taxa) lists the marine mammal taxa recorded (e.g. species, genus, family ...), including scientific and common names, as well as the TAXREF code.
-The table "secteur_geog" (geographical place) lists the geographical area that observers used to localise their observation in preference to GPS data. The geographical areas were defined using the initials of the name of the closest town or locality on the sea coast and the direction between the observation site and the locality.
Figure 4. Overall structure of the Kakila database, based on six tables (observateur, organisme, sortie, observation, taxon, secteur_geog; see text for translation and description of each term).
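The six-table structure and the join-key control can be sketched as a relational schema. This is an illustration under stated assumptions: the Kakila database is distributed as .tsv and .ods files and the authors controlled join keys with dynamic pivot tables, not SQL. Apart from the table names and the keys "code_observation", "code_sortie" and "code_taxon", every column name below is hypothetical.

```python
import sqlite3

# Hypothetical columns; only the table names and the three keys cited in the
# paper (code_observation, code_sortie, code_taxon) come from the source.
SCHEMA = """
CREATE TABLE organisme    (code_organisme TEXT PRIMARY KEY, nom TEXT);
CREATE TABLE observateur  (code_observateur TEXT PRIMARY KEY, niveau_expertise TEXT);
CREATE TABLE taxon        (code_taxon TEXT PRIMARY KEY, nom_scientifique TEXT);
CREATE TABLE secteur_geog (code_secteur TEXT PRIMARY KEY, description TEXT);
CREATE TABLE sortie (
    code_sortie    TEXT PRIMARY KEY,
    date_sortie    TEXT,
    code_organisme TEXT REFERENCES organisme(code_organisme)
);
CREATE TABLE observation (
    code_observation TEXT PRIMARY KEY,
    code_sortie      TEXT REFERENCES sortie(code_sortie),
    code_taxon       TEXT REFERENCES taxon(code_taxon),
    code_secteur     TEXT REFERENCES secteur_geog(code_secteur),
    nombre_individus INTEGER
);
"""

def orphan_observations(conn):
    """Return observation codes whose code_sortie matches no field trip."""
    rows = conn.execute(
        """
        SELECT o.code_observation
        FROM observation AS o
        LEFT JOIN sortie AS s ON s.code_sortie = o.code_sortie
        WHERE s.code_sortie IS NULL
        ORDER BY o.code_observation
        """
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
# Enforcement is switched off on purpose so that a broken key can be loaded
# and then detected by the check, as when auditing legacy spreadsheets.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO sortie (code_sortie, date_sortie) VALUES ('S1', '2015-03-01')")
conn.execute("INSERT INTO taxon (code_taxon, nom_scientifique) "
             "VALUES ('T1', 'Megaptera novaeangliae')")
conn.executemany(
    "INSERT INTO observation (code_observation, code_sortie, code_taxon) VALUES (?, ?, ?)",
    [("O1", "S1", "T1"), ("O2", "S9", "T1")],  # 'S9' does not exist in sortie
)
orphans = orphan_observations(conn)
```

The same left-join pattern applies to any other pair of linked tables, for example observation and taxon through "code_taxon".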

Geographic coverage
Description: Our study focuses on the coastal waters surrounding the Guadeloupean Archipelago (Fig. 1). Guadeloupe is a French island located in the West Indies. It is part of the Agoa Sanctuary, which corresponds to the French Exclusive Economic Zone of the West Indies. All observations were recorded from boats, during trips close to the coast (the most distant observation from the coast was located 35 miles (ca. 55 km) off the Island of Marie Galante).

Taxonomic coverage
Description: The observation consisted, whenever possible, in a taxonomic identification at the species level. Twenty-one species of cetaceans have been observed and identified. Some observations did not allow us to identify the species; in these cases, the identification was done at the family level or at the suborder level (Table 2).

Discussion and foresight
Threats that cetacean populations face are multiple, but well-documented (Bedriñana-Romano et al. 2021, Campana et al. 2015, David et al. 2011, de Stephanis et al. 2013, Garcia-Cegarra et al. 2021, Gero and Whitehead 2016, Huntington 2009, Jepson et al. 2016, Jung and Madon In press, Lusseau et al. 2009, Sèbe et al. 2019, Van Waerebeek and Leaper 2008). Citizen science can play an important role in the acquisition of ecological data (e.g. Harvey et al. 2018). This is especially true for the marine megafauna, whose observation and accurate species identification require a huge amount of time spent at sea by researchers and marine biologists. Large-scale scientific surveys dedicated to the study of marine mammals have proved to deliver valuable information, for example, the SCANS or the REMMOA surveys (Laran et al. 2019, SCANS-II 2008). However, the financial costs of such scientific surveys prevent them from being organised at the intervals required to complete and optimise the list of species, to identify fine-scale trends and to take into account mobile species not present throughout the year. Recurrent monitoring of marine mammal populations over long time periods can only be supported by permanently present local stakeholders, such as NPOs and professionals, i.e. whale watchers. In the Guadeloupe Archipelago, local stakeholders play a major role in recording the presence of and monitoring local marine mammal populations (e.g. Gandilhon 2012, Heenehan et al. 2019, Kennedy et al. 2014, Mon école ma baleine 2019, Rinaldi 2016, Rinaldi et al. 2006, Stevick et al. 2016, Stevick et al. 2018). NPOs and whale watchers have a unique knowledge and they already collaborate on scientific studies focused on specific species (Barragán-Barrera et al. 2019, Gandilhon 2012, Stevick et al. 2016). The Kakila project aimed at taking a step further by gathering all local knowledge into a single database.
This was only made possible with the involvement of all data owners in the development of the database. The process was based on a long-term collaboration between the NPO OMMAG and the scientist co-authors of this paper. This allowed us to map the local stakeholders who are experts in the field and who might be interested in the project. They were then approached by the scientists to explain the long-term goals of the initiative. The engagement process focused on ensuring equitable contributions and mitigating any tensions related to the use of the data. Once agreements and data were provided, the project undertook the delicate phase of data curation, harmonisation, standardisation and development of the database architecture. Each collector had his/her own tabulated file for entering observations, with no central data store or access interface. However, all these datasets share common variables that constituted the common basis for the Kakila database construction. Data owners were involved in this technical process and their feedback was requested and taken into account (e.g. naming fields) to foster a sense of ownership and ensure the long-term usage of the database.
The provision of metadata was eased by a development version of MetaShARK. Since this application was still maturing, some parts of the data description had to be handled manually: converting the file encoding from Windows-1252 to UTF-8 and correcting EML Assembly Line templates when needed.
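The encoding conversion mentioned above can be done in a few lines of Python. The sketch below is one possible way to do it, not the authors' actual procedure; the function name, the file names and the sample text are hypothetical, and "cp1252" is the Python codec name for Windows-1252.

```python
import tempfile
from pathlib import Path

def reencode(src, dst, source_encoding="cp1252", target_encoding="utf-8"):
    """Rewrite a text file in another encoding, leaving line endings untouched."""
    # newline="" disables newline translation on both read and write.
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=target_encoding, newline="") as fout:
        fout.write(text)
    return text

# Demonstration on a temporary tab-separated file (illustrative content).
sample_text = "organisme\tespèce\r\nPôle\tbaleine à bosse\r\n"
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "observations_cp1252.tsv"
    dst = Path(tmp) / "observations_utf8.tsv"
    src.write_bytes(sample_text.encode("cp1252"))
    reencode(src, dst)
    converted_bytes = dst.read_bytes()
```

Accented characters such as "ô" and "è" occupy one byte in Windows-1252 and two in UTF-8, which is why a file read with the wrong codec shows garbled names.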
The Kakila database is the first attempt at gathering all available local knowledge on cetacean presence in the Guadeloupe Archipelago. Clearly, the long-term strategy to maintain and enrich the Kakila database must focus on careful monitoring of stakeholders' interests, motivations and ultimate expectations. One of its first scientific valorisations will be to help detect and identify key areas of interaction between cetaceans and marine traffic in the Guadeloupe Archipelago in the framework of the TRAFIC project. In addition, we hope to be able to develop such a database for other small island countries and territories of the Greater Caribbean Area.