Corresponding author: Petr Novotný (
Academic editor: Quentin Groom
Digitising and aggregating local floristic data is a critical step in the study of biodiversity. The integrative web-based platform Pladias, designed to cover a wide range of data on vascular plants, was recently developed in the Czech Republic. The combination of occurrence data with species characteristics opens many opportunities for data analysis and synthesis.
This article describes the relational structure of the Pladias database service (PladiasDB) and the context of the platform architecture. The structure is relatively complex, as our goal was to cover: (i) species occurrence records, including their management, validation and export of revised species distribution maps, (ii) data on species characteristics with quality control tools using defined data types and (iii) separate user interfaces (UI) for professionals and the general public. We discuss the approaches chosen to model individual elements in PladiasDB and summarise the experience gained during the first five years of operation of the Pladias platform.
A new botanical database platform called Pladias has recently been established in the Czech Republic. It aggregates data on spontaneously occurring vascular plants, including species occurrence records from a dozen institutional data providers and hundreds of scientists and hobby botanists (
The Czech Republic has a long history of botanical research (
The creation of biological databases is a relatively common component of scientific projects, but it is often associated with subsequent problems. More than half of core biological databases face financial difficulties (
In designing the Pladias database, we considered the following key elements:
the interoperability of species occurrence records, data on species characteristics and vegetation data; the ability to work dynamically with a hierarchical system of taxa to transfer values (e.g. if a species has only one subspecies in the Czech Republic, but data are collected for only one of these two hierarchical ranks); a unified, centrally managed taxon checklist allowing direct use of data for analysis, without requiring conversions between taxon concepts; optimisation for plant data at the expense of extensibility to other taxa; preservation of the history of changes in database records; consistent use of open-source technologies to ensure long-term financial sustainability.
Some of these design requirements differ from the principles accepted in the Biodiversity Information Standards data model (
The aim of this article is to describe the internal structure of PladiasDB and to identify and discuss specific solutions to problems commonly encountered in biodiversity databases. We focus on the PladiasDB data schema, which is an expression of the general concept of the database and can serve as inspiration for further work.
Database of botanical biodiversity at the national scale (Czech Republic).
The Pladias platform consists of several interconnected services designed as a web application. The public interface is based on the Nette framework (PHP) and is available at
More than 80 tables of PladiasDB are divided into five table schemas:
The full SQL schema and the ER diagram showing the relationships amongst described tables are published online in the Git repository at
Schema
The remaining tables relate to record management and preparation of final versions of distribution maps for professional publications (e.g.
Spatial data, except for the location of species occurrence records, are stored immutably in the
The administrative division of the country is a hierarchical list, similar to other lists in PladiasDB, for example, the lists of taxa and syntaxa (see Schema 5
it features high reading speed and is very good at handling aggregation within subtrees, so that all operations in the hierarchy of core lists can be performed on the fly; it is able to easily query the hierarchy when mining data at the level of handwritten SQL queries; it is also understandable for non-technical users.
The third point is very important. Exports of tables, such as taxon checklists, include the key columns (
Besides the occurrence data, the second essential part of the database is the module containing plant characteristics (traits, environmental associations and other information). These data are stored in the
To fill in missing values, we use the taxonomic tree (see
original = data imported by users inherited = data from hierarchically superior taxa (e.g. species data transferred to a subspecies); this is only used if the conditions defined by the data type, inheritance type and taxon hierarchy constellation are met aggregated = similar to inherited, but originating from hierarchically subordinate taxa composite = a combination of all the above points
Data reproducibility is important for practical scientific use of stored data, but in shared databases, data on plant characteristics are constantly being corrected and added. Several times we used the low-level approach of creating a snapshot of the entire database and storing it as a reference state for a particular analysis, but we have reached the limits of this approach. The snapshot of the database contains complete data, including users' contacts, their activities and other data that would need to be deleted before providing them for peer-reviews. Not only the stored data, but also partial changes in the database structure evolve, complicating automated processes in creating the snapshots described above. For data on plant characteristics, we therefore created the table
The
At the moment, all functions are dedicated to processing the
The default PostgreSQL schema is used to store the core lists of the entire platform – data tables for user accounts, checklists of taxa (
Taxon and syntaxon checklists are mandatory and no alternative concepts are allowed. This standardisation allows for efficient data management, which is particularly necessary because of the relational link to plant characteristics data at the database layer. To solve the issue of occasional changes in taxon names and delimitations, we adopted a synonym-derived solution. We arrange names representing different concepts in the table
Some of the data stored in PladiasDB are original data, but most come from partially-aggregated sources (see
Homepage:
Platform: PostgreSQL 12 + PostGIS 2.5
Programming language: SQL
Operational system: Linux
Type: Git
Location:
Other
The SQL structure of PladiasDB is licensed under
MIT Licence
Copyright (c) 2022 Pladias www.pladias.cz
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to dealing with the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense and/or sell copies of the Software and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Tables that model basic objects are named with the plural form of the entity name. Tables modelling M:N (many-to-many) relationships use the naming convention "table1_table2". The SourceSQL code, the ER diagram and additional information are published in the Git repository.
The institutions forming the Pladias consortium include the Department of Botany and Zoology, Faculty of Science, Masaryk University; Institute of Botany of the Czech Academy of Sciences; Department of Botany, Faculty of Science, University of South Bohemia. The development of the Pladias database was funded by the Centre of Excellence PLADIAS, project no. 14-36079G from the Czech Science Foundation. JB, JW, VK and ZK were also supported by the long-term research development project no. RVO 67985939 from the Czech Academy of Sciences.
Screenshot of a taxon overview on the public portal pladias.cz.
Screenshot of the online application with restricted access for researchers to work with the data at pladias.ibot.cas.cz.
Pladias platform infrastructure. PladiasDB is stored on two servers with streaming replication to optimise server load. Data are edited through the pladias.ibot.cas.cz web application. Low-level read-only access is available to researchers via a direct connection to a secondary database server. Logos are the property of their respective owners. Sources:
Phytogeographical districts of the Czech Republic and a buffer zone used for handling the records with coordinates falling slightly outside the national border. The districts at the national border are shown in colour.
Nested set hierarchy model.
For reading efficiency, a redundant attribute
Database tables stored in the
|
|
|
|
authors | persons who recorded plant occurrence in the field | 4 | 14,283 |
batch | metadata of the occurrence data import batch | 6 | 18,503 |
comments | users' comments on species occurrence records | 11 | 25,647 |
csv_map_details | additional information used for map rendering in the map publishing workflow | 6 | 941 |
excel | Excel files containing original species occurrence records (validated/imported) | 9 | 17,452 |
herbariums | list of excerpted herbaria | 13 | 267 |
institutions_users | administrator of a cooperating institution | 2 | 0 |
pdf_map | PDF files containing distribution maps generated for use in printed publications | 6 | 2,651 |
projects | set of batches/species occurrence records sharing the same source of funding or licensing conditions | 6 | 15 |
projects_users | users allowed to import within a project | 3 | 230 |
record_originality_status | list of states for species occurrence records' originality status | 4 | 4 |
record_validation_status | list of states for species occurrence records' validity status | 4 | 4 |
records | species occurrence records | 40 | 13,635,402 |
records_authors | M:N link table | 3 | 13,333,320 |
records_herbariums | M:N link table | 2 | 506,159 |
records_history | log of any editing of species occurrence records after importing into the application | 9 | 5,051,082 |
records_quadrants | M:N link table; |
2 | 13,543,850 |
records_squares | M:N link table; |
2 | 38,604 |
taxon_mapsettings | settings and progress in the map publishing workflow | 16 | 5,673 |
taxon_mapsettings_publication | list of available states of progress in the map publication | 2 | 5 |
taxon_mapsettings_revision | list of available states of progress in the map revision | 2 | 7 |
taxons_users | M:N link table; |
3 | 1,413 |
users_comments | users being not asssigned as a map administrator can only comment on species occurrence records and propose changes for the revisers | 2 | 4,925 |
Database tables stored in the
|
|
|
|
districts | administrative division of the Czech Republic | 12 | 41,916 |
districts_depth | list of available states for administrative district hierarchy level | 3 | 8 |
phytochorions | phytogeographical districts of the Czech Republic | 7 | 215 |
phytochorions_outside_cz | approximation of phytogeographical division outside the country borders | 4 | 89 |
quadrants_full | grid of mapping quadrants | 10 | 40,000 |
regions | polygons of regions for specific projects | 6 | 2 |
squares_full | grid of basic mapping fields ("squares") | 4 | 10,000 |
Database tables stored in the
|
|
|
|
data_boolean | data on plant characteristics with Boolean data type | 4 | 59,957 |
data_comment | pseudo-values for data on plant characteristics allowing comments on all data types | 4 | 0 |
data_enum | data on plant characteristics with nominal or ordinal data types | 8 | 10,296,633 |
data_enum_syntaxons | data on plant characteristics with specific data type that contains linking to the |
7 | 37,688 |
data_integer | data on plant characteristics with numeric data types | 6 | 196,510 |
data_interval_avg | extension of the previous numeric data type used for storing a broader set of values | 9 | 165,804 |
data_month | data on plant characteristics with month data type | 6 | 9,603 |
data_occurrence_frequency | data on species characteristics generated based on species occurrence records from the |
6 | 27,792 |
data_percentage | data on plant characteristics with percentage data type | 4 | 48,188 |
data_real | data on plant characteristics with decimal number values | 4 | 53,514 |
data_real_multi | data on plant characteristics with decimal number values and multiplicit values per taxon | 5 | 287,866 |
data_taxon_taxon_real | data on plant characteristics with data type storing numeric (real) value for a set of two taxa | 5 | 129,476 |
data_unmeasurable | pseudo-values for data on plant characteristics allowing to mark values that are not measurable in the given context (for example, flower colour for ferns) | 3 | 2,820 |
data_year | data on plant characteristics with year datatype | 7 | 1,996 |
datatypes | list of implemented data types for data on plant characteristics | 13 | 14 |
enumerates | metadata for nominal or ordinal lists of available values | 6 | 104 |
enumerates_values | list of available values for nominal or ordinal data types | 9 | 1,044 |
features | metadata for plant characteristics | 25 | 291 |
inheritances | list of implemented inheritances, i.e. mechanisms for transferring values across a taxonomic tree | 4 | 11 |
sections | hierarchical structure of features | 10 | 38 |
trait_export_snapshots | storage for backups of data on plant characteristics used for reproducibility of analysis, flattened into a 2D structure and Excel file format | 5 | 11 |
trait_visibility_status | list of available states for availability of data on plant characteristics in various export/publishing services of the Pladias platform | 3 | 3 |
traits | metadata of specific series of data on plant characteristics | 14 | 400 |
units | list of available units of measurement | 6 | 16 |
PL/pgSQL functions in the
|
|
descendant_taxon() | provides the entire subtree of the taxon, including itself |
get_parents_if_singleton() | recursive function that returns a continuous series of parent taxa that are monotypic |
get_taxon_cloud() | aggregates descendant_taxon() and get_parents_if_singleton() function results; used for rendering maps of taxa aggregating records with different level of identification accuracy |
mptt_syntaxons_appendchild() | add new syntaxon |
mptt_taxons_appendchild() | add new taxon |
mptt_taxons_delete_leaf() | delete taxon with no subordinate taxa |
mptt_taxons_delete_subtree() | delete taxon and its subordinate taxa |
mptt_taxons_get_depth() | numeric approach to reach |
mptt_taxons_get_error_code() | help for error messaging when using PL/pgSQL functions |
mptt_taxons_move_subtree_before() | change the order of taxa belonging to a specific node. This function is used when changing the order of species listing inside one genus or other taxon; the parent (genus) remains the same, but the tree must be recalculated to change the order of species |
mptt_taxons_move_subtree_real() | move a taxon subtree within a specific node; this function allows rebuilding the taxon tree by moving a taxon and all its subtaxa to a new parent (hierarchically superior taxon) |
mptt_taxons_repair_depth() | recalculation of taxon nodes' |
Database tables stored in the
|
|
|
|
downloads | static data provided in the web application | 11 | 3 |
institutions | institutions providing the plant occurrence data | 4 | 11 |
licenses | list of available licences for species occurrence records | 3 | 6 |
publications | essential recent overview publications on the Czech flora | 10 | 5 |
syntaxon_ranks | list of syntaxon hierarchy levels | 5 | 5 |
syntaxons | core hierarchical list of syntaxa | 33 | 674 |
taxon_ranks | list of taxon hierarchy levels | 10 | 58 |
taxons | core hierarchical list of taxa | 18 | 6,948 |
taxons_synonyms | taxon synonyms and invalidated taxon concept crosswalks mapped to the |
7 | 18,485 |
user_activities | list of logged activities | 2 | 21 |
user_activity_log | logging users' activity storage | 7 | 4,170,445 |
user_settings | users' individual settings for web application | 3 | 7,541 |
users | web application users | 15 | 232 |