Biodiversity Data Journal :
Software Description
Corresponding author: Dagmar Triebel (triebel@snsb.de)
Academic editor: Lorenzo Peruzzi
Received: 01 Jun 2022 | Accepted: 04 Sep 2022 | Published: 14 Oct 2022
© 2022 Petr Novotný, Stefan Seifert, Martin Rohn, Wolfgang Diewald, Milan Štech, Dagmar Triebel
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Novotný P, Seifert S, Rohn M, Diewald W, Štech M, Triebel D (2022) Software infrastructure and data pipelines established for technical interoperability within a cross-border cooperation for the flora of the Bohemian Forest. Biodiversity Data Journal 10: e87254. https://doi.org/10.3897/BDJ.10.e87254
The temporal and geographical resolution, as well as the quantity and taxon concepts, of records on the occurrence of plants near national borders are often ambiguous. This is due to the regional focus and different approaches of the contributing national and regional databases and networks of the neighbouring countries. Careful data transformation between national data providers is essential for understanding distribution patterns and their dynamics for organisms in areas along national borders. Sharing occurrence data through the international data aggregator Global Biodiversity Information Facility (GBIF) is also complicated, because the underlying taxonomic concept and geographic information system of each GBIF dataset might be different. In addition, some regional data providers have a restrictive (non-CC) licensing policy which does not allow data publication via the GBIF network. Therefore, it is necessary to investigate new ways to make data fit for use for a better and more comprehensive understanding of the Flora of the Bohemian Forest.
In this paper, we present a bilateral technical interoperability solution for vascular plant occurrence data for the area between the Czech Republic and Bavaria. We describe the initial state of data providers in both countries and the factual and technical challenges in finding a sustainable concept to establish mutual data sharing. The resulting solution for a functional infrastructure and an agreed data pipeline is described in a step-by-step approach. The new distributed infrastructure allows botanists and other stakeholders from both countries to work within the cross-border context of historical and current plant distributions.
Flora Silvae Gabretae, Bohemian Forest, occurrence data portal, ABCD occurrence data sources, taxon names services, Pladias, web portal, Diversity Workbench repositories.
The Bohemian Forest is a biologically extraordinary region in Central Europe (see Fig.
The research on the flora of the Bohemian Forest (Flora Silvae Gabretae, FSG) has a long tradition on both sides of the border (
As discussed below, it was necessary to establish a consensus (compromise) FSG taxonomic and nomenclatural approach linking the two national databases for such an output. Such output itself is of great importance for further conceptual planning and a crucial source of information for research and policy (
Combining taxon distribution information from both countries, or rather establishing the software infrastructure and data pipelines that make this possible, helped to reveal a number of interesting distribution patterns. Two botanical examples in an otherwise technical article: The occurrence pattern of Teucrium scorodonia is completely different on the two sides of the border (see Fig.
Distribution of Teucrium scorodonia according to the joint Czech and Bavarian data. Green rectangles represent grid quadrants (= 1/4 CEBA) with confirmed occurrence, a thick black line demarcates the study area. The frequency of occurrence on the Czech part does not correspond with the information from Bavaria. From https://www.florasilvaegabretae.eu/en/taxon/info/Teucrium%20scorodonia.
Distribution of Antennaria dioica according to the joint Czech and Bavarian data. Green rectangles represent grid quadrants (= 1/4 CEBA) with occurrence confirmed after year 2000, purple-hatched squares represent grid quadrants with occurrence confirmed only before 2000. A thick black line demarcates the study area. Patterns of species decline in both countries differ. From https://www.florasilvaegabretae.eu/en/taxon/info/Antennaria%20dioica.
The Czech botanical database Pladias aggregates critically revised vascular plant occurrence data from a dozen local fragmented data providers from the Czech Republic (
The occurrences are limited strictly to the Czech Republic area, allowing only a technical buffer (50 m) across the country border. The platform had been developed with this limitation to store only those data which can be validated and curated by local experts (
The Pladias database and system is hosted at the IT Center of the Czech Academy of Sciences, Institute of Botany, Průhonice. The services providing interoperability with Bavarian data are hosted at the University of South Bohemia, České Budějovice.
There are two Bavarian partners involved in the cross-border cooperation, the National Park Bayerischer Wald and the Bavarian Natural History Collections (SNSB). Both partners are strongly involved in the Flora of Bavaria (BFL) initiative, whose data originate from a number of historical and current projects (see history*
In consequence, the data which are generated, mobilised and quality-controlled as a part of the FSG initiative*
Software infrastructure and data pipelines established within a cross-border cooperation for the flora of the Bohemian Forest.
Vascular plant flora of the Bohemian Forest.
The study area was chosen because the Bohemian Forest and its extraordinary flora are regarded as a collective natural heritage of the Czechs and Germans, formed by migration and extinction events of natural as well as human origin. The research on the flora has a long tradition on both sides of the border. A comprehensive survey, however, was challenging, as the methodological approaches and the intensity of research in the individual subregions varied and the cooperation between the experts was limited for linguistic and political reasons. In consequence, the aim of the project Flora of the Bohemian Forest (= Flora Silvae Gabretae) was to establish a viable integration between two well-known disparate database systems. The two source database platforms were developed with slightly different ambitions, some of which are described below. The Czech Pladias is intended exclusively for one group of organisms, i.e. vascular plants, with a checklist codified by the informal authority of administrators. This serves as a protection against "taxon concept inflation" (
The following is a more detailed description of both partners' databases, data collections and context.
Area of interest, data sources and calculated data
The total size of the study area is 2764 km2, of which two thirds are the Czech part, one third is the Bavarian part with a small piece of the area belonging to Austria. An overview is given in Table
Overview of data sources for the Bohemian Forest region used in the cross border cooperation of the Flora Silvae Gabretae (FSG) project (status Dec 2021).
Data source | Area (covered by technical platform, km2) | Area (covered by FSG Project, km2) | Number of FSG Occurrence Records (polygon) | Number of FSG Occurrence Records under CC-BY licence | Number of High Quality Occurrence Records (GIS referenced or exact locality indicated) | Number of Taxa (taxon concept agreed amongst FSG partners) |
Pladias | 78,871 km2 | 1,846 km2 | 13,833 | 3,263 | 576,673 | 1,615 |
Flora von Bayern | 70,550 km2 | 918 km2 | 361,173 | 361,173 | 176,161 | 1,381 |
Austria | unknown | 231 km2 | unknown | unknown | unknown | unknown |
As of December 2021, there are 576,673 point-referenced individual occurrence records and 13,833 grid-referenced records from the study area in Pladias. The study area comprises 126 of the regular 2,552 1/4 CEBA topographic grid raster units in the Czech Republic. These data are aggregated from 11 individual Czech providers, the vast majority under a reserved licence. There are 1,615 taxa in Pladias with taxon concepts accepted for the study area and agreed amongst the FSG partners (see Table
The individual occurrences have diverse origins. They come from literature excerpts, as well as from records based on revisions of herbarium specimens. All are incorporated in datasets of particular providers. A significant part comes from field observations. The most recent field observations are collected by specialists of critical groups. They make up the main proportion of the data under the CC-BY licence. This open licence is new to the Pladias database and its use was started by the recent joint FSG project. We hope that, in the future, significantly more data will be transferred under this licence, thus allowing easier international cooperation and data sharing.
The study area in Bavaria (called Bavarian Forest) comprises 50 of the regular 2,285 topographic grid raster units in Bavaria. These TK25/4 "quadrants" are equivalent to CEBA quadrants (
The data of the first data source "Flora of Bavaria ─ occurrence data online"*
The other main source called "Floristic records from survey studies of the Bayerisches Landesamt für Umwelt"*
As of February 2022, both FSG sources together provided 361,173 records. The occurrence area was calculated by GIS shapes of the Bavarian Forest region. The shapes were deposited in the DWB cloud repository*
The data pipelines and other IT services established to achieve the objectives.
Technical concepts of Pladias
The Pladias software infrastructure is a result of the specific situation in the Czech Republic, where dozens of individual providers of data exist. These providers have to (for legislative reasons) or want to keep internal primary evidence of occurrence data on their own. Although all partners involved are aware of the need to share biodiversity information, there is no support for merging the databases into a single entity. Pladias, therefore, acts also as the primary source of data (directly uploaded by users), but most of the data are secondary, taken from partner institutions. For those records that are downloaded, we provide feedback to the primary data provider on expert validations, but it is at their discretion as to how they handle them.
The basis of the infrastructure is a relational database (= RDBMS; Relational Database Management System) PostgreSQL with PostGIS extension allowing us to process spatial data. A description of the database information model is provided by
Pladias does not offer standardised APIs; rather, exports are performed manually by direct SQL querying of the database or by database views. It also does not have an interface compliant to the GBIF network, mainly because of the already-mentioned large fragmentation of the licensing conditions of the different data providers. Cooperation with the Bavarian partners thus represents the first integration effort beyond the borders of the Czech Republic.
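The view-based export described above can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the table `occurrence`, its columns and the view `fsg_open_export` are hypothetical names chosen for illustration; they are not the Pladias schema.

```python
import csv
import io
import sqlite3

# Hypothetical, heavily reduced occurrence table; the real Pladias schema differs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE occurrence (
        id INTEGER PRIMARY KEY,
        taxon TEXT NOT NULL,
        licence TEXT NOT NULL,   -- e.g. 'CC-BY' or 'reserved'
        project TEXT             -- e.g. 'FSG'
    );
    INSERT INTO occurrence (taxon, licence, project) VALUES
        ('Teucrium scorodonia', 'CC-BY', 'FSG'),
        ('Antennaria dioica', 'reserved', 'FSG'),
        ('Aconitum plicatum', 'CC-BY', NULL);

    -- A view fixes the export rule once, so repeated manual exports stay consistent.
    CREATE VIEW fsg_open_export AS
        SELECT id, taxon, licence
        FROM occurrence
        WHERE project = 'FSG' AND licence = 'CC-BY';
    """
)


def export_view_as_csv(connection, view_name):
    """Return the content of a view as CSV text, header row included."""
    # view_name is interpolated directly, so it must come from trusted code only.
    cursor = connection.execute(f"SELECT * FROM {view_name} ORDER BY id")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


csv_text = export_view_as_csv(conn, "fsg_open_export")
print(csv_text)
```

Keeping the licence and project filter inside the view means that only records cleared for sharing leave the database, whichever client runs the export.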
Technical concepts of the BFL instance of DWB
The modularised Diversity Workbench (= DWB) represents a freely-available open source tool suite for the management of life and environmental sciences data (see description in biotools*
The BFL instance of DWB currently consists of installations of the modules DiversityCollection and DiversityTaxonNames with supportive data collections in DiversityAgents, DiversitySamplingPlots and the project management undertaken in DiversityProjects. The dynamic access to (cloud-based) web services of the DWB user community, as well as to external data resources, is realised. With that wider concept, the Bayernflora initiative follows the Linked Data approach as visualised by
Data repositories, interoperability and data policies
There are major differences in interoperability and data policies between the platforms of Pladias and the data repository of the Flora of Bavaria initiative. The data management strategy and agreed policies, as well as guidelines for data publication in the Bavarian part of the FSG project, are those of the Flora of Bavaria initiative*
One of the primary objectives of Pladias was to provide Czech botanists with a tool for validation of occurrence data and creation of the basis for the forthcoming atlas of vascular plant distribution in the Czech Republic. For this reason, it contains a system of roles, taxon-user expert associations and a number of tools that allow us to classify the validity of records or communicate during the preparation of data for map publication. These mechanisms are based on the needs of the local community and, although by their nature they correspond to international standards, they were not created a priori out of an ambition to meet the requirements of the standards, but the requirements of the community in relation to the nature of the data available.
Challenges of different taxon reference lists, dynamic taxon concepts and persistent identifiers
Although the Czech Republic and Germany are neighbouring countries and share almost the same flora and vegetation in the region of interest, the concept of critical groups differs. This is caused by differing regional views on the dynamics of local plant populations, on major trends in modern systematics and on changing taxon concepts, as well as by different research and nature conservation priorities for particular plant groups.
The Bayernflora DWB management instance uses the independent reference checklists and taxonomies of
Pladias is using the reference monograph of
For comparison of the taxa from both checklists, a taxon converter with two parts was set up and curated by Štech & Diewald *
Example of many-to-many records in taxa mapping between Pladias and Bayernflora (BFL) taxa, resulting in compromise FSG taxon Aconitum plicatum.
Pladias taxa | FSG taxon | BFL taxa with BFL TaxRef ID |
- | Aconitum plicatum [ID_FSG = 14] | Aconitum napellus subsp. lusitanicum Rouy [ID_BFL TaxRef = 20111] |
Aconitum napellus agg. [ID_PLADIAS = 1238] | Aconitum plicatum [ID_FSG = 14] | Aconitum napellus L. s. l. [ID_BFL TaxRef = 52] |
Aconitum plicatum [ID_PLADIAS = 1444] | Aconitum plicatum [ID_FSG = 14] | Aconitum napellus subsp. napellus [ID_BFL TaxRef = 6539] |
- | Aconitum plicatum [ID_FSG = 14] | Aconitum plicatum Köhler ex Rchb. [ID_BFL TaxRef = 14276] |
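One way to hold such a mapping in code is sketched below. The identifiers are taken from the table above; the data structure and the function name `build_converter` are illustrative assumptions and do not describe the actual converter. The sketch further assumes that every source taxon resolves to exactly one FSG taxon, so that the many-to-many relation between Pladias and BFL taxa arises only through the shared FSG taxon.

```python
from collections import defaultdict

# Rows of the mapping table: (Pladias taxon ID or None, FSG taxon ID, BFL TaxRef ID).
# The pairing of Pladias and BFL IDs within a row is layout only; each side is
# linked to the FSG taxon independently.
MAPPING_ROWS = [
    (None, 14, 20111),  # BFL: Aconitum napellus subsp. lusitanicum Rouy
    (1238, 14, 52),     # Pladias: Aconitum napellus agg.; BFL: Aconitum napellus L. s. l.
    (1444, 14, 6539),   # Pladias: Aconitum plicatum; BFL: Aconitum napellus subsp. napellus
    (None, 14, 14276),  # BFL: Aconitum plicatum Köhler ex Rchb.
]


def build_converter(rows):
    """Build lookups from each source checklist to the FSG taxon, plus the reverse index."""
    pladias_to_fsg = {}
    bfl_to_fsg = {}
    fsg_members = defaultdict(lambda: {"pladias": set(), "bfl": set()})
    for pladias_id, fsg_id, bfl_id in rows:
        if pladias_id is not None:
            pladias_to_fsg[pladias_id] = fsg_id
            fsg_members[fsg_id]["pladias"].add(pladias_id)
        if bfl_id is not None:
            bfl_to_fsg[bfl_id] = fsg_id
            fsg_members[fsg_id]["bfl"].add(bfl_id)
    return pladias_to_fsg, bfl_to_fsg, dict(fsg_members)


pladias_to_fsg, bfl_to_fsg, fsg_members = build_converter(MAPPING_ROWS)
print(pladias_to_fsg[1238], bfl_to_fsg[14276], sorted(fsg_members[14]["bfl"]))
```

With this layout, a record from either database is assigned to the compromise FSG taxon with a single dictionary lookup, and the reverse index shows which source taxa were merged.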
Challenges of different floristic status systems, validity/reliability and origin status systems and basis of record systems
Each Pladias record holds one of four reliability statuses (reliable/uncertain/erroneous/not yet revised), one origin status (native/non-native/planted/not set) and the herbarium origin status (true/false) (see
The DWB BFL occurrence data sources, as far as they are published as ABCD2.1 XML zip-archives (BFLportal01, BFLportal04), are by definition categorised as validated, i.e. reliable sensu Pladias. The non-validated DWB BFL data records are not published via ABCD2.1, but stored in the internal RDBMS instance. Each single "in situ" occurrence record has either a floristic status category assigned by the observer and/or a processed floristic status category assigned by a later editor. The procedure is explained in the BFL Wiki page on floristic status*
The translation of the BFL floristic status system with 13 status categories to the Pladias origin status system and to the status categories "native", "introduced" and "cultivated", as defined by the TDWG pre-standard POSS (
Assignment of categories of BFL/BIB floristic "in situ" status for valid present occurrences to the Pladias taxon origin categories and POSS pre-standard categories (for BFL/BIB status definitions see,*
BFL/BIB floristic "in situ" status category | Pladias taxon origin status category | POSS native status category | POSS introduced status category | POSS cultivated status category |
indigenous (I) | native | native | not introduced | not cultivated |
"normal status" (*) | native | assumed to be native | not introduced | not cultivated |
established (E) | non-native | not native | introduced | not cultivated |
permanently established (D) | without equivalent | doubtfully native | assumed to be introduced | not cultivated |
casual (U) | without equivalent | not native | introduced | not cultivated |
tendency towards establishment (T) | non-native | not native | introduced | not cultivated |
re-introduced / naturally casual (W) | without equivalent | native | introduced | not cultivated |
cultivated (K) | planted | not native | none of the above | cultivated outdoors |
synanthropic (S) | without equivalent | not native | none of the above | no information |
deliberately introduced (A) | planted | not native | none of the above | cultivated outdoors |
culture relic (R) | planted | not native | none of the above | cultivated outdoors |
status completely unclear (?) | without equivalent | no information | no information | no information |
dubious if native (Z) | without equivalent | doubtfully native | doubtfully introduced | not cultivated |
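The category assignment in the table above can be expressed as a simple lookup, sketched here in Python. The values are copied from the table; the names `StatusMapping`, `BFL_STATUS_MAP` and `to_pladias_origin` are hypothetical, and the fallback to the Pladias value "not set" for categories "without equivalent" is an assumption made for this sketch, not a rule stated by the project.

```python
from typing import NamedTuple, Optional


class StatusMapping(NamedTuple):
    pladias_origin: Optional[str]  # None stands for "without equivalent"
    poss_native: str
    poss_introduced: str
    poss_cultivated: str


# Keys are the BFL/BIB status codes given in brackets in the table.
BFL_STATUS_MAP = {
    "I": StatusMapping("native", "native", "not introduced", "not cultivated"),
    "*": StatusMapping("native", "assumed to be native", "not introduced", "not cultivated"),
    "E": StatusMapping("non-native", "not native", "introduced", "not cultivated"),
    "D": StatusMapping(None, "doubtfully native", "assumed to be introduced", "not cultivated"),
    "U": StatusMapping(None, "not native", "introduced", "not cultivated"),
    "T": StatusMapping("non-native", "not native", "introduced", "not cultivated"),
    "W": StatusMapping(None, "native", "introduced", "not cultivated"),
    "K": StatusMapping("planted", "not native", "none of the above", "cultivated outdoors"),
    "S": StatusMapping(None, "not native", "none of the above", "no information"),
    "A": StatusMapping("planted", "not native", "none of the above", "cultivated outdoors"),
    "R": StatusMapping("planted", "not native", "none of the above", "cultivated outdoors"),
    "?": StatusMapping(None, "no information", "no information", "no information"),
    "Z": StatusMapping(None, "doubtfully native", "doubtfully introduced", "not cultivated"),
}


def to_pladias_origin(bfl_code: str) -> str:
    """Translate a BFL status code to a Pladias origin status.

    Unknown codes raise KeyError; codes without equivalent fall back to
    'not set' (an assumption of this sketch).
    """
    mapping = BFL_STATUS_MAP[bfl_code]
    return mapping.pladias_origin if mapping.pladias_origin is not None else "not set"


print(to_pladias_origin("K"), to_pladias_origin("D"), BFL_STATUS_MAP["W"].poss_introduced)
```

Letting unknown codes fail loudly, rather than mapping them silently to a default, keeps unexpected status values visible during a batch import.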
Challenges of establishing agreed data pipelines
The aim of the integration effort was to put in place a functioning mechanism for batch sharing of cross-border data and data sources. There were several reasons why we did not strive for a fully continuous process. One is that both the Czech Republic and Bavaria have established independent data mobilisation, integration and publication processes that ensure sustainable and quality-controlled data sources. Another is that the dynamics of taxonomic concepts and classifications, as explained above, differ between the two countries. Assessing the compatibility of currently-accepted taxon concepts is a matter of professional review; it is the most important phase in the data-sharing process and can only be done through discussion amongst experts from both countries. An agreed dynamic data pipeline is also challenging because of the tens of millions of BFL and Pladias records to be handled and the required functional stability of such a network. The target result is, therefore, a functional infrastructure that allows experts to quality-check, align and map data sources, deal with trilingual aspects and transform data towards the FSG target system.
We also have to emphasise the question of the primary storage of converted data. Pladias cannot store data from Bavaria as it has a deeply embedded system constraint on a fixed polygon. The DWB instance of the BFL also cannot store Czech data as these do not meet its thematic focus and licensing requirements. Furthermore, the aggregated data are bound to a compromise FSG checklist, which again prevents their smooth incorporation into one of the participating systems. It was, therefore, necessary to create a separate data-linking technical service, running at the University of South Bohemia, which has connectivity to both partners and handles the integration processes.
Functional infrastructure step by step
A. Data generation and integration
Step 1. Occurrences for the FSG project from the Pladias project and external resources.
Occurrence records for the FSG project are imported into Pladias by individual users in the form of a standardised MS Excel spreadsheet. In addition to the floristic record, the user also indicates the licence of the record and its assignment to the specific project, like the FSG project, within which the data were created/are published. The data are created either as a direct result of floristic research or are excerpted from herbarium sheets, publications or come through separate import processes from data providers.
The first check of occurrence status validity is done by automatic mechanisms (e.g. compliance with the specified municipality polygon). Subsequently, the assigned auditors can indicate the reliability of the record (for reliability status system, see above).
Step 2. Occurrences for the FSG project by DiversityMobile and CSV files from FSG partners and the BFL project.
The FSG partners from Bavaria routinely used the Windows Phone app DiversityMobile and the workflow as described by
Step 3. Occurrence data storage in DWB, provision via ABCD XML and customisation for Pladias.
The BFL occurrence data in DWB are long-term curated according to the DWB schema explained under "Technical concepts of the BFL instance of DWB" and "Data repositories, interoperability and data policies" (Fig.
The initial technical connection between the dynamic DTN BFL backbone and the open access publication of the two BFL data sources as ABCD2.1 XML zip-archives (BFLportal01, BFLportal04) is regarded as a first benefit of database integration and allows for repeatable independent steps for the unification of a cross-border nomenclatural and systematic view of biodiversity.
The files prepared in this way are then imported into the dedicated FSG database. For this purpose, we developed a custom XSLT template suitable for efficient conversion of a reduced set of ABCD fields into SQL "INSERT" queries, leading to one SQL file from each of the packages of BFLportal01coll*
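The conversion idea, reducing each unit to a few fields and emitting one SQL "INSERT" statement per record, can be sketched as follows. The project used an XSLT template; this sketch swaps in Python's `xml.etree.ElementTree` purely for illustration. The sample XML is a made-up, simplified fragment (real ABCD 2.1 files are namespaced and nest these elements much more deeply), and the target table name `fsg_import`, the unit identifiers and the coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-in for an ABCD document.
SAMPLE_XML = """
<Units>
  <Unit>
    <UnitID>EXAMPLE-1</UnitID>
    <FullScientificNameString>Aconitum plicatum Köhler ex Rchb.</FullScientificNameString>
    <LatitudeDecimal>48.95</LatitudeDecimal>
    <LongitudeDecimal>13.40</LongitudeDecimal>
  </Unit>
  <Unit>
    <UnitID>EXAMPLE-2</UnitID>
    <FullScientificNameString>Teucrium scorodonia</FullScientificNameString>
  </Unit>
</Units>
"""

# The reduced set of fields kept for the import.
FIELDS = ("UnitID", "FullScientificNameString", "LatitudeDecimal", "LongitudeDecimal")


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def sql_literal(value):
    """Render a value as an SQL string literal, doubling embedded single quotes."""
    if value is None:
        return "NULL"
    return "'" + value.replace("'", "''") + "'"


def units_to_inserts(xml_text, table="fsg_import"):
    """Return one INSERT statement per Unit element found in the document."""
    root = ET.fromstring(xml_text)
    columns = ", ".join(FIELDS)
    statements = []
    for unit in root.iter():
        if local_name(unit.tag) != "Unit":
            continue
        values = dict.fromkeys(FIELDS)  # every field starts as None
        for element in unit.iter():
            name = local_name(element.tag)
            # Keep the first non-empty occurrence of each wanted field.
            if name in values and values[name] is None and element.text:
                values[name] = element.text.strip()
        literals = ", ".join(sql_literal(values[field]) for field in FIELDS)
        statements.append(f"INSERT INTO {table} ({columns}) VALUES ({literals});")
    return statements


statements = units_to_inserts(SAMPLE_XML)
print("\n".join(statements))
```

All values are written as string literals here and missing fields become NULL; a real import would cast coordinates to numeric types and, where the data are loaded directly rather than via an SQL file, would prefer parameterised queries over string building.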
Step 4. Spatial criteria postprocessing.
The imported data may cover an area larger than necessary. Therefore, they are intersected against two polygons: i) the outer (study area + the surrounding strip of lower altitude landscape, which is a source of potential future increase in the biodiversity of the core area) and ii) the inner (= study area, the Bohemian Forest), see Fig.
Area of Bohemian Forest. CZ = Czech Republic, DE = Germany/Bayern, AT = Austria. Blue filled polygon represents the outer polygon serving as the pool of candidate taxa for the study area. Green filled polygon's records are used for integration. Small grey part belongs to Austria and is not fully covered by current infrastructure.
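The two-polygon filter can be sketched in plain Python with a ray-casting point-in-polygon test. The rectangles below are hypothetical placeholders for the real study-area shapes, the function names and labels are illustrative, and the decision to treat records outside the outer polygon as "outside" (and hence discardable) is an assumption of this sketch.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' for every polygon edge crossed by a ray going right."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical shapes: the inner polygon (study area) lies within the outer one
# (study area plus the surrounding strip).
OUTER = [(0, 0), (10, 0), (10, 10), (0, 10)]
INNER = [(3, 3), (7, 3), (7, 7), (3, 7)]


def classify(x, y):
    """Label a record position as 'inner', 'outer' (strip only) or 'outside'."""
    if point_in_polygon(x, y, INNER):
        return "inner"
    if point_in_polygon(x, y, OUTER):
        return "outer"
    return "outside"


print(classify(5, 5), classify(1, 1), classify(20, 20))
```

In a spatial database such as PostGIS, the same test would typically be a single predicate such as `ST_Intersects` or `ST_Within` against the stored polygons, which also copes with grid-referenced records that are polygons rather than points.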
B. Data transformation
Step 5. Processing of BFL taxon reference list.
Raw imported records are post-processed using the DTN REST Web service (
Amongst others, the following DTN REST Web service end points are used for processing taxon names:
From this point on, we are able to link DWB and Pladias taxa via a two-part taxon converter*
Step 6. Postprocessing of floristic and origin status and other record metadata.
The different approaches for describing nature conservation information, defining basis of record categories and floristic status are apparent after the records are linked to the agreed taxa. Both database systems have well-documented concepts to dynamically assign floristic status, occurrence validity, origin status (native/non-native/planted/unknown) or herbarium origin status and basis of record categories, respectively. We developed several mappings of categories for a more or less wide interoperability of plant occurrence and status categories (see Table
C. Data publishing
Step 7. Data portal with Flora of the Bohemian Forest (FSG) taxon distribution maps.
The joint taxon distribution is presented on the FSG web portal in the form of dynamic maps composed by the OpenLayers library from a Web Map Service (WMS) provided by a local instance of GeoServer (for examples, see Fig. 2 and Fig. 3). Data are generalised to the level of CEBA quadrants (= TK25/4) and labelled with the highest reliability status reached in each quadrant.
Both partner databases are specialised data repositories hosted by large institutions. They have fixed validation processes and data publishing pipelines, just as the FSG pipeline has a fixed update frequency. In order to be able to guarantee the timeliness of the presented maps, regardless of the dynamics of all involved components, the data portal includes an additional function for manually entering the quadrant-taxon reliability status not based on current data.
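The generalisation to quadrants and the manual entry can be sketched as a small aggregation. The four reliability statuses are those named for Pladias above, but their ranking, the precedence of manual entries over computed values, the quadrant labels and the function name are assumptions made for this illustration.

```python
# Assumed ranking, higher is better; the statuses are named in the text, the order is not.
RELIABILITY_RANK = {
    "reliable": 3,
    "uncertain": 2,
    "not yet revised": 1,
    "erroneous": 0,
}


def best_status_per_quadrant(records, manual_overrides=None):
    """Return {(quadrant, taxon): status}, keeping the highest-ranked status per pair.

    records: iterable of (quadrant, taxon, status) tuples.
    manual_overrides: optional {(quadrant, taxon): status}, applied last.
    """
    best = {}
    for quadrant, taxon, status in records:
        key = (quadrant, taxon)
        current = best.get(key)
        if current is None or RELIABILITY_RANK[status] > RELIABILITY_RANK[current]:
            best[key] = status
    # Manually entered statuses take precedence over computed ones (assumption).
    best.update(manual_overrides or {})
    return best


# Hypothetical quadrant labels and records.
records = [
    ("Q1", "Teucrium scorodonia", "uncertain"),
    ("Q1", "Teucrium scorodonia", "reliable"),
    ("Q2", "Teucrium scorodonia", "not yet revised"),
]
computed = best_status_per_quadrant(records)
with_manual = best_status_per_quadrant(
    records, {("Q2", "Teucrium scorodonia"): "reliable"}
)
print(computed)
print(with_manual)
```

Applying the manual entries as a separate, final step keeps them distinguishable from statuses derived from the imported records, so they can be reviewed or withdrawn after the next pipeline update.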
The audience for the flora of the Bohemian Forest portal*
The implementation of the described infrastructure brought the cross-border local communities and the regional experiences together and strongly supports intercultural exchange. At the same time, the new technical services rely on long-term exchange of digital information along established data pipelines in the Czech Republic and Bavaria. In the future, it would be very useful to expand the network and include the Austrian records in the database for complete coverage of the study area.
Funding
Since 1990, the European Union has been supporting territorial cooperation between regions and cities through the programme "Interreg" of the European Regional Development Fund (ERDF). The ERDF focuses its investments on projects on infrastructure, cooperation between public utilities, collaborative actions of companies or activities in the field of environment protection, education, land-use planning and culture. The fifth funding period of the programme, i.e. ETC "European Territorial Cooperation" INTERREG V, covered 2014 to 2020.
The project "Flora of the Bohemian Forest (Flora Silvae Gabretae, FSG)" is an action to establish a cross-border cooperation between the Free State of Bavaria and the Czech Republic (Ziel ETZ, INTERREG A; EU-project number 216). The applicable project period is from 01-01-2019 to 30-06-2022. For Bavaria, the action is managed by the Government of Lower Bavaria. The focus is on nature conservation, protection of the environment and planning of resource efficiency.
The financial support, called “Ziel ETZ”, is approx. 1 million € (in total). There is a co-financing of 15% (from the partner country/region, public support).
Additional funding is provided by the Bavarian Environment Agency (LfU) and the German Research Foundation (DFG) with GFBio and NFDI4Biodiversity, a consortium of the German National Research Data Infrastructure (NFDI).
We thank the SNSB data team outside of the FSG project, especially Wolfgang Reichert, Tanja Weibulat, Dr. Markus Weiss and Dr. Julia Wellsow for support. We are grateful to the Pladias team outside of the FSG project, especially Zdeněk Kaplan, Jan Wild and Milan Chytrý.