Biodiversity Data Journal :
Research Article
Corresponding author: Irene Martorelli (i.martorelli@liacs.leidenuniv.nl)
Academic editor: Renan Barbosa
Received: 26 Jan 2024 | Accepted: 06 Jun 2024 | Published: 18 Jun 2024
© 2024 Irene Martorelli, Aram Pooryousefi, Haike van Thiel, Floris Sicking, Guus Ramackers, Vincent Merckx, Fons Verbeek
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Martorelli I, Pooryousefi A, van Thiel H, Sicking F, Ramackers G, Merckx V, Verbeek F (2024) Multiple graphical views for automatically generating SQL for the MycoDiversity DB; making fungal biodiversity studies accessible. Biodiversity Data Journal 12: e119660. https://doi.org/10.3897/BDJ.12.e119660
Fungi are a highly diverse group of eukaryotic organisms that live under an extremely wide range of environmental conditions. Nowadays, there is a fundamental focus on observing how biodiversity varies on different spatial scales, in addition to understanding the environmental factors that drive fungal biodiversity. Metabarcoding is a high-throughput DNA sequencing technology that has positively contributed to observing fungal communities in their environments. While the DNA sequencing data generated from metabarcoding studies are available in public archives, this valuable data resource is not directly usable for fungal biodiversity investigation. Additionally, due to its fragmented storage and distributed nature, it is not immediately accessible through a single user interface. We developed the MycoDiversity DataBase User Interface (https://mycodiversity.liacs.nl) to provide direct access to, and retrieval of, fungal data that were previously inaccessible in the public domain. The user interface provides multiple graphical views of the data components used to reveal fungal biodiversity. These components include reliable geo-location terms, the reference taxonomic scientific names associated with fungal species and the standard features describing the environment where they occur. Direct observation of the public DNA sequencing data in association with fungi is accessible through SQL search queries created by interactively manipulating topological maps and dynamic hierarchical tree views. The search results are presented in configurable data table views that can be downloaded for further use. With the MycoDiversity DataBase User Interface, we make fungal biodiversity data accessible, assisting researchers and other stakeholders in using metabarcoding studies for assessing fungal biodiversity.
fungal distribution, fungal biodiversity, biogeography, environmental DNA, mycodiversity, geospatial maps, information visualisation, dynamic hierarchical data, accessibility, reusability, database, FAIR data, controlled vocabulary terms
Most of the spatial diversity and distribution of fungi is poorly known (
The MycoDiversity Database User Interface (MDDB-UI) (https://mycodiversity.liacs.nl) we present here exhibits that the data provided by other researchers in the public repositories, namely in the scientific literature (
The important feature of the MDDB-UI Biodiversity Search tool is the multidimensional structure of these data elements, presented to users in the form of dynamic hierarchical tree views. This method guarantees the direct observation of fungal biodiversity data by using various taxonomic levels. Furthermore, it grants the exploration of fungal distribution data along different spatial terrestrial scales. Moreover, the environment element enables the exploration of how species richness varies under different intervals of conditions, such as acidic ranges in soil (
The MDDB-UI provides users with the element values for predefined SQL queries. The query results are presented in topological maps and configurable table views, which can be retrieved for further fungal biodiversity investigations and for supporting evolutionary and ecological research on fungi.
Measuring biodiversity over large spatial areas, varying from regions to countries, or a global scale, usually involves combining collections of species observations obtained from environmental space (
Currently, there is a meaningful focus on sharing data (
Unifying collections of species observations has a long history. The digitisation era, in which we are currently living, permits direct access to large-scale species observations. These recorded observations are based on the effort of integrating previous descriptions of human observations, particularly on specimen samples. The Global Biodiversity Information Facility (GBIF) (
Another crucial aspect to consider is that data repositories, in general, evolve to adjust to the data they host. Occasionally, these repositories need to be redesigned in order to incorporate different and novel data sources. In line with making data open (
Regarding fungal biodiversity accessibility obtained from HTS source, considerable work is provided by the GlobalFungi database (
The main areas of design criteria for the MDDB-UI system are the conceptual data design, the user interactive design and the technical design. These principles define the mapping of the MDDB back-end to the front-end interactive tools as illustrated in Fig.
The main conceptual data design criteria of MDDB-UI are the availability, dimensionality and connectivity of the fungal biodiversity components.
The elements of the Biodiversity Search tool displayed to the user are the Location, Taxonomy and Environment. These represent the key components for the user to interact and initiate the search (Fig.
Illustrations of the available elements used for unlocking DNA barcode data. (a) Location, Taxonomy and Environment are the key elements of the Biodiversity Search tool made available to the user for initiating the search in MDDB; (b) The interaction with one of the elements unlocks and reveals the DNA barcode data element.
The taxonomy element utilises the scientific standard names to obtain the DNA barcode information. The interaction with the location element enables observation of where the DNA barcode is collected and the environment is used to inspect under which environmental conditions the DNA barcode has been detected.
Due to the complex nature of the data and their multiple dimensions, it is essential that the system supports various data views. The Taxonomy, Location, Environment and the uncovered DNA barcode components are containers of structured information (Fig.
The Taxonomy (Fig.
The Location compartment (Fig.
The DNA Barcode (Fig.
The latter component (Fig.
The relationships amongst the main block containers form journey paths in which terms of the various components and of different levels connect (Fig.
For simplicity, only one term for each tree-map level of the main containers is revealed. Path 1 (the dark blue label) exhibits the choice of focusing on a specific taxonomic rank (term genus Lactarius) observed at the sub-continental level (term Northern Europe) in a temperate grassland biome context. The connection to the ITS barcode of the DNA barcode container is made by default, but the annotation for the type of ITS (i.e. barcode ITS2 or ITS1) is provided. For path 3 (pink label), a more specific selection emerges: the DNA barcode sequence variants of ITS2 belonging to Lactarius that have been observed in the samples collected in Sweden are collected, independently of the contextual aspect. Conversely, path 2 (cyan label) reveals a more generic search of interest, which considers a higher level of taxonomy (Basidiomycota phylum) that has been observed on a broader scale, the continental level (term Europe). This data selection can be sequentially refined, as shown by path 4 (orange), where only the DNA sequence variants belonging to the Russulales order level and detected in countries belonging to northern Europe are selected. Path 5 (the green label) suggests a simple yet novel and powerful approach for exploring fungal biodiversity data when the Taxonomy component is not initiated. This case demonstrates how to directly observe a community at a small-scale level of diversity (e.g. the selection of a single specific sample), for which all sequence variants can be provided, regardless of their assigned taxonomic reference name.
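The journey paths described above can be read as combinations of one (level, term) pair per component. The following Python fragment is a minimal sketch of how such a path might be turned into a parameterised SQL filter; it is not the MDDB-UI implementation. The column names for genus, sub-continent, country and sample follow the query in the Appendix, whereas the phylum, order and continent column names, the `Selection` class and the `build_filter` function are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Genus, sub-continent, country and sample columns follow the Appendix query.
# The phylum, order and continent columns are hypothetical placeholders.
TAXONOMY_COLUMNS = {
    "phylum": "RTDB.phylum_name",  # hypothetical
    "order": "RTDB.order_name",    # hypothetical
    "genus": "RTDB.genus_name",
}
LOCATION_COLUMNS = {
    "continent": "S.continent_name",  # hypothetical
    "subcontinent": "S.country_parent",
    "country": "S.country_geoname_pref_en",
    "sample": "S.sample_pk",
}


@dataclass(frozen=True)
class Selection:
    """One selected term at one level of a component tree."""
    level: str
    term: str


def build_filter(taxonomy: Optional[Selection],
                 location: Optional[Selection]) -> Tuple[str, List[str]]:
    """Return a WHERE fragment with '?' placeholders and its parameter list."""
    clauses: List[str] = []
    params: List[str] = []
    if taxonomy is not None:
        clauses.append(f"{TAXONOMY_COLUMNS[taxonomy.level]} = ?")
        params.append(taxonomy.term)
    if location is not None:
        clauses.append(f"{LOCATION_COLUMNS[location.level]} = ?")
        params.append(location.term)
    if not clauses:
        raise ValueError("at least one component must be initiated")
    return " AND ".join(clauses), params


if __name__ == "__main__":
    # Path 3: genus Lactarius observed in Sweden.
    print(build_filter(Selection("genus", "Lactarius"),
                       Selection("country", "Sweden")))
    # Path 5: a single sample, Taxonomy component not initiated.
    print(build_filter(None, Selection("sample", "MDDBSRS000485")))
```

Keeping the terms as bound parameters, rather than pasting them into the SQL string, is one way to ensure that only the selected vocabulary term reaches the database; whether the MDDB back-end does this is not stated in the text.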
The web-based MDDB front-end application for the Biodiversity Search tool is developed using the criteria for user-centred design (
The interface uses the graphical basics and widgets provided by modern browsers and APIs. In graphical user interfaces, the elements such as icons and pictograms are included that have a clear perceived affordance (
The relation of the usability aspects (
Usability attributes used for the User Interaction design of the Biodiversity search tool.
Usability attribute | Biodiversity search tool UI design choices |
---|---|
Learnability: First impression with a system and learning how to use a system as quickly and as easily as possible. | The search tool window in Fig. The output results window offers table and map views and displays icons that are universally recognisable. These choices emphasise the “learn-by-doing” concept, where users can effortlessly grasp the fungal biodiversity navigation. |
Efficiency: How efficiently and smoothly the user performs a task. | Efficient interaction with the data is achieved through exploration of the dynamic tree. Terms of interest are suggested in a free search box (Fig. Experts using standard terms for their predefined queries find this method efficient for locating terms and ensuring consistent results. This aspect aligns with the interoperability principle of FAIR. |
Effectiveness: How a user can complete a task with a high degree of accuracy. | Error reduction is the design criterion applied to aid in effectively controlling the construction of predefined queries. The implementation choice is controlled by the autocomplete function of the free text search box and the use of controlled terms that belong to dictionaries of reference names. This solution ensures precision and accuracy in data retrieval. |
Satisfaction: How a system influences user motivation and effectiveness of use. | For the minimalistic and functional design, we chose blue and white colours because they are aesthetically pleasing. These colours help segment the screen quickly, supporting visual search and focusing attention on the main elements. Likewise, the choice of colours enhances satisfaction by aiding users in identifying different functions. We have also used the “TAB” view to increase visibility and simplify the views. |
In this section, we describe the technical design criteria for shaping the MDDB-UI (https://mycodiversity.liacs.nl). The criteria cover the accessibility of the data, the visual presentation of results, data retrieval and the performance of our system.
The MDDB-UI is a browser front-end that provides free open access to information stored in MDDB. The home main page (Fig.
On the MENU option of the home page (Fig.
The data access to MDDB is granted in an intuitive manner, considering both the nature of the data and the diversity of the users. The usability attributes (Table
Navigation of items of the Location and Taxonomy compartments. a Display of a tree view selection. One subclass (i.e. subcontinent) of the continent 'Africa' is selected. b The Taxonomy autocomplete search. At least three characters are needed to initiate the filter and suggestions containing the characters are displayed.
A value is selected when the user clicks on an item and it is displayed as a checkbox (Fig.
Query construction using the search tool. (a) The Location subclasses of the continent 'Africa' are displayed. The country and the sample site, shown as checked selected terms, are the items used for the query; (b) The Taxonomy autocomplete selection: illustration of focusing on a specific taxa group. The autocomplete selection is useful for directly accessing the lower levels of the taxonomic tree.
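The autocomplete behaviour described in the captions above (at least three characters before filtering, then suggestions that contain those characters) can be sketched as follows. This is an illustrative Python fragment and not the code of the MDDB-UI; the function name `suggest`, the `limit` parameter, the case-insensitive matching and the small vocabulary are assumptions made for the example.

```python
from typing import Iterable, List

MIN_CHARS = 3  # the caption states that at least three characters are needed


def suggest(text: str, vocabulary: Iterable[str], limit: int = 10) -> List[str]:
    """Return controlled terms containing the typed characters.

    Matching is case-insensitive (an assumption), and nothing is suggested
    until MIN_CHARS characters have been typed.
    """
    query = text.strip().lower()
    if len(query) < MIN_CHARS:
        return []
    matches = [term for term in vocabulary if query in term.lower()]
    return sorted(matches)[:limit]


if __name__ == "__main__":
    # A tiny stand-in for the dictionary of reference names.
    terms = ["Fusarium", "Lactarius", "Russulales", "Basidiomycota"]
    print(suggest("fu", terms))   # too short, no suggestions
    print(suggest("fus", terms))  # suggestions containing "fus"
```

Because only terms from the vocabulary can be returned, a selection made through such a box is always a valid controlled term, which is the error-reduction property the design aims for.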
The Environment selection provides environmental terms used to describe habitats and ranges of numerical values. The selected items (Fig.
The search implementations are embedded in iframes and use PHP (version 7.4.0) (
The query results appear in the dynamic output window on Table Tab and the Map Tab (Fig.
The table view is the common method for displaying a dataset output based on a query. Each row of the results shown on the table corresponds to a unique record of a DNA sequence variant hit (i.e. ZOTU) detected from a selection of the filtering options. This hit represents the occurrence and its attributes display the annotation associated with the other components (Fig.
Most attributes in the table appear as hyperlink values, which redirect to URLs, such as the corresponding mappings to the reference taxonomic name or to the provenance sample it originates from (e.g. NCBI BioSample record). These are reliable standards and data sources that follow the FAIR findability data principle (
The map view is an associated function of the table view and initiates when the user clicks the Map Tab button (Fig.
All samples stored in the MDDB include a geographical location specified as a decimal numerical value (GPS) (
This latter information is perceived on the interactive map, where the marker icon embedded over each sample indicates the amount of DNA barcode sequences, represented as ZOTU numbers, observed in the original sample plot. The total numbers of plots and ZOTUs shown on the map reflect the aggregated results of the occurrence hits displayed on the table view (Fig.
The data selected by the Biodiversity Search tool can be downloaded by selecting the Download Tab (Fig.
The Download tab provides two options for obtaining the selected data: either the occurrence metadata or the set of distinct ZOTU sequences detected by the selection. The total occurrence data, indicated by the number of occurrence records displayed in Fig.
The scalability requirement takes into account the nature and the volume of the data. The data comprise large datasets of DNA sequence barcodes and annotations from literature and metabarcoding studies. Extensive queries (
We present a comprehensive scenario that demonstrates a potential use case for interacting with the Biodiversity search tool. This interaction enables access to and acquisition of informative data from the MDDB for a specific fungal biodiversity investigation. The learnability and efficiency principles covered in the Usability subsection of User Interface Design are useful for extending the utility of the interactive tool to further fungal biodiversity inspections.
Bananas are the world’s fourth most important food crop after rice, wheat and maize and are grown in more than 130 countries (
Here, we illustrate how the Biodiversity search tool can be used to collect data from other studies to enhance the investigation of environmental factors that may influence the occurrence and diversity of Fusarium species. The Location and Taxonomy components are initiated for this use case. For the Location, the tree view is preferably used, as this permits visualisation of the available subclasses of Asia, defined as sub-continents. These include two regions: southern Asia and south-eastern Asia. For the Taxonomy selection, the user focuses on a specific group, with the taxa known. Thus, the autocomplete search box is used to directly obtain the term Fusarium (Fig.
The query result (displayed in Appendix) of this use case returns 71 ZOTU records that correspond to a distinct set of 53 Fusarium sequence variants hits detected for the two regions of Asia. The interactive map displayed on the Map Tab option displays the number of sequence variants for each sample (Fig.
Example of occurrence data display using the map view and data table view simultaneously. a. Map view: The markup of plot MDDBSRS000485 (SRA Sample ID SRS651474) displaying the number of ZOTUs observed; b. Table View: The six records belonging to the Fusarium genus associated with the SRA Sample SRS651474.
The purpose of the use case is illustrative: it displays the usability and utility of the tool. The scenario shows that the tool can provide informative, region-specific below-ground information, such as the ZOTU diversity of a targeted fungal group and a measurable, important characteristic of the habitat, namely acidity. This valuable information should be included when predicting Fusarium richness for further sampling and when building upon ecological assessments that relate to the above-ground layer. This helps prevent the spread of the disease to future crop plots.
The Biodiversity Search tool makes an important contribution to scientific research in agriculture. It allows for the observation of fungal communities where Fusarium has been detected, based on reliable existing published data, as described in the design criteria in the Connectivity of the Components subsection of the Conceptual Data Design. Studies analysing Fusarium soils have identified other fungal genera in these soils that play a key role in suppressing Fusarium (
The scope of this work is to illustrate the importance of the MDDB-UI in assisting research on fungal biodiversity. The most important contributions described in this manuscript are the accessibility, dimensionality of the system and the reliability of the data. These aspects have defined the design and implementation of supportive tools that enhance the use of informative data, which is undisclosed in public metabarcoding studies.
The main components of the Biodiversity Search tool provide access to metabarcoding data from published studies. The dimensionality is a unique feature of the MDDB system. Compared to other related systems, as presented in the "Biodiversity Repositories - Related Work" section, it permits the estimation of fungal biodiversity on different levels. The tool allows direct exploration and selection of data from a broader scale (e.g. major phylum group, subcontinental level) to a more targeted taxonomic level (e.g. genus level) and to specific small geographical areas of interest (e.g. one plot sample). The incorporation of samples in the Location compartment allows observation at a local community level. For this case, initiating the Taxonomy is not necessary, which is a powerful approach to access the unique list of unclassified sequence data. It enables direct observation of unique ZOTU diversity in individual site plots. This novel way of accessing data can be useful for building datasets for comparative analysis on community levels and for investigating ecological patterns of specific fungal associations. Access to environmental measurements allows observation of DNA sequence occurrences and ZOTU richness amongst different ranges (e.g. elevation), independently of the geographical names of the location where samples are collected. Our primary goal is to permit users to select measurements, such as spatial ranges, and to observe biodiversity patterns based on these data selections. The values of the aggregation function used in the queries for associating the different levels of the components and for the connection amongst the different components provide various graphical data views. As observed in the "Presentation of Results: Map View" section, the multiple-layered visualisation shown on the map view displays both the occurrence and the richness of the DNA sequence variants associated with Fungi.
Instead of using double-coloured ranges, an extension of this implementation is to use size or height of the plot to represent the measurement quantity.
The multidimensionality of the components of the Biodiversity Tool is reliable as it originates from standard classified sources: currently from the Unite database used for taxonomy, Geonames classification for location and the controlled ontological terms for habitat. The values of these reliable references, incorporated into reference study sources, are provided in the MDDB-UI in the Table View via their persistent identifiers and URLs. Reliability is also achieved through provenance provided by the redirection of the metabarcoding study sources' records. This data provenance, obtained from third parties and metabarcoding studies, is not observed in the tools provided in the "Biodiversity Repositories - Related Work" section. Nevertheless, it is a priority to continue following the principles that lead to data FAIRness (
Additionally, the Literature Search Tool (
Lastly, to increase the meaningfulness of data and of its views, there is a need to include more abiotic factors constituting the contextual aspect of the habitat. These measurements, in their proper standard annotated form and with relevant third-parties incorporation (e.g. bioclimate factors), will extend the use of the relevant factors for investigating the drivers of biodiversity patterns.
The MDDB interface is an open system for incorporating search tools and graphical data views to interpret hidden biodiversity in public metabarcoding studies. The biodiversity search tool presented in this work provides the dimensions of Taxonomy, Location and Environment for accessing public metabarcoding data. This will expand the interpretation of fungal biodiversity assessments across topographical regions and habitats, providing valuable insights. Additionally, it permits the direct observation of fungal biodiversity at different taxonomic and geographical levels and the exploration of fungal distribution on a visual map.
The use case presented in this work provides an example of obtaining meaningful data for biodiversity research to assist social communities in making data-driven sustainability decisions. Our system is based on controlled, standardised terms which, in combination with automatic cleaning tools, support the quality of the biodiversity data.
One area of future research is to enhance the user interface by developing semantic parsers, based on Large Language Models (LLMs), to translate natural language into executable SQL queries. Additionally, we will extend the collection of graphical data views supported by the user interface. Finally, we will study the requirements of biodiversity communities outside the fungal domain to enable a wider application of the MDDB system.
Query for generating the results displayed in the Table and Map view of the MDDB-UI. In this case, the components initiated are the geo-location and the taxonomy (Table
Query output view containing the informative data distributed in the table view and map view. The table displays the aggregation of ZOTUs (column attribute ZOTU) for each sample (column attribute MDDB_Plot), based on the user's selection.
MDDB_Plot | pH | SRA_Sample | Subcontinent | Genus | ZOTU | Unite |
---|---|---|---|---|---|---|
MDDBSRS000484 | 5.78 | SRS651472 | Southeast Asia | Fusarium | 11 | 5 |
MDDBSRS000334 | 5.39 | SRS651487 | Southern Asia | Fusarium | 8 | 6 |
MDDBSRS000328 | 5.46 | SRS651488 | Southern Asia | Fusarium | 6 | 5 |
MDDBSRS000332 | 4.79 | SRS651279 | Southern Asia | Fusarium | 6 | 5 |
MDDBSRS000341 | 6.76 | SRS651332 | Southern Asia | Fusarium | 6 | 4 |
MDDBSRS000485 | 6.58 | SRS651474 | Southeast Asia | Fusarium | 6 | 4 |
MDDBSRS000329 | 5.25 | SRS651277 | Southern Asia | Fusarium | 4 | 4 |
MDDBSRS000333 | 4.74 | SRS651473 | Southern Asia | Fusarium | 3 | 3 |
MDDBSRS000498 | 5.8 | SRS651469 | Southern Asia | Fusarium | 3 | 2 |
MDDBSRS000322 | 5 | SRS651275 | Southern Asia | Fusarium | 2 | 2 |
MDDBSRS000327 | 4.91 | SRS651278 | Southern Asia | Fusarium | 2 | 2 |
MDDBSRS000331 | 5.07 | SRS651276 | Southern Asia | Fusarium | 2 | 2 |
MDDBSRS000338 | 7.08 | SRS651338 | Southern Asia | Fusarium | 2 | 2 |
MDDBSRS000378 | 3.47 | SRS651476 | Southeast Asia | Fusarium | 2 | 2 |
MDDBSRS000379 | 3.69 | SRS651249 | Southeast Asia | Fusarium | 2 | 2 |
MDDBSRS000326 | 5.64 | SRS651485 | Southern Asia | Fusarium | 1 | 1 |
MDDBSRS000330 | 5.15 | SRS651274 | Southern Asia | Fusarium | 1 | 1 |
MDDBSRS000335 | 3.38 | SRS651505 | Southern Asia | Fusarium | 1 | 1 |
MDDBSRS000337 | 6.49 | SRS651331 | Southern Asia | Fusarium | 1 | 1 |
MDDBSRS000374 | 2.6 | SRS651247 | Southeast Asia | Fusarium | 1 | 1 |
MDDBSRS000377 | 3.5 | SRS651475 | Southeast Asia | Fusarium | 1 | 1 |
Count: 21 plots | | | | | 71 | |
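As an illustration of how the environment element can be combined with such output, the rows of the table above can be grouped into pH intervals. The Python fragment below is a sketch only: the values are transcribed from the table, while the unit-wide intervals and the function name `summarise_by_ph` are choices made for this example. The ZOTU column counts records per plot, so totals across plots count occurrence records and not distinct sequence variants.

```python
from collections import defaultdict

# (MDDB_Plot, pH, Subcontinent, ZOTU records) transcribed from the table above.
ROWS = [
    ("MDDBSRS000484", 5.78, "Southeast Asia", 11),
    ("MDDBSRS000334", 5.39, "Southern Asia", 8),
    ("MDDBSRS000328", 5.46, "Southern Asia", 6),
    ("MDDBSRS000332", 4.79, "Southern Asia", 6),
    ("MDDBSRS000341", 6.76, "Southern Asia", 6),
    ("MDDBSRS000485", 6.58, "Southeast Asia", 6),
    ("MDDBSRS000329", 5.25, "Southern Asia", 4),
    ("MDDBSRS000333", 4.74, "Southern Asia", 3),
    ("MDDBSRS000498", 5.8, "Southern Asia", 3),
    ("MDDBSRS000322", 5.0, "Southern Asia", 2),
    ("MDDBSRS000327", 4.91, "Southern Asia", 2),
    ("MDDBSRS000331", 5.07, "Southern Asia", 2),
    ("MDDBSRS000338", 7.08, "Southern Asia", 2),
    ("MDDBSRS000378", 3.47, "Southeast Asia", 2),
    ("MDDBSRS000379", 3.69, "Southeast Asia", 2),
    ("MDDBSRS000326", 5.64, "Southern Asia", 1),
    ("MDDBSRS000330", 5.15, "Southern Asia", 1),
    ("MDDBSRS000335", 3.38, "Southern Asia", 1),
    ("MDDBSRS000337", 6.49, "Southern Asia", 1),
    ("MDDBSRS000374", 2.6, "Southeast Asia", 1),
    ("MDDBSRS000377", 3.5, "Southeast Asia", 1),
]


def summarise_by_ph(rows):
    """Group plots into pH intervals [n, n+1) and total their ZOTU records."""
    summary = defaultdict(lambda: {"plots": 0, "zotus": 0})
    for _plot, ph, _region, zotus in rows:
        lower = int(ph)  # floor, valid for the non-negative pH values here
        summary[lower]["plots"] += 1
        summary[lower]["zotus"] += zotus
    return {lower: summary[lower] for lower in sorted(summary)}


if __name__ == "__main__":
    for lower, stats in summarise_by_ph(ROWS).items():
        print(f"pH [{lower}, {lower + 1}): "
              f"{stats['plots']} plots, {stats['zotus']} ZOTU records")
```

Summing the per-interval values reproduces the totals of the table (21 plots and 71 ZOTU records), which provides a simple consistency check on the transcription.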
-- Count Fusarium ZOTU records and distinct UNITE species hypotheses per sample
-- for the two selected sub-continents.
SELECT DISTINCT
    S.sample_pk AS Plot,
    S.sra_sample,
    S.country_parent AS SubContinent,
    S.country_geoname_pref_en AS Country,
    S.pH,
    S.elevation,
    RTDB.genus_name AS ZOTU_Genus,
    ABS(S.sample_lat_dec) AS Lat,
    ABS(S.lat_corrected_value_i) AS LatCor,
    COUNT(C.refsequence_pk) AS ZOTURep,
    COUNT(DISTINCT RTDB.sh_unite_id) AS UniteRep
FROM Sample AS S, Contain AS C, RefSequence AS RS, AssignTaxa AS AT, RefTaxonomicDB AS RTDB
WHERE RTDB.kingdom_name LIKE 'Fungi%'
  AND RTDB.genus_name LIKE 'Fusarium%'
  -- join conditions linking taxonomy, sequences and samples
  AND RTDB.refsequence_taxonomic_pk = AT.refsequence_taxonomic_pk
  AND AT.refsequence_pk = RS.refsequence_pk
  AND RS.refsequence_pk = C.refsequence_pk
  AND C.sample_pk = S.sample_pk
  AND S.country_parent IN ('Southern Asia', 'Southeast Asia')
GROUP BY S.sample_pk, S.country_geoname_pref_en, RTDB.genus_name, S.sample_lat_dec, S.lat_corrected_value_i
ORDER BY ZOTURep DESC, UniteRep DESC
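To make the structure of this query easier to follow, the sketch below runs a reduced version of it against a small in-memory SQLite database using Python's standard `sqlite3` module. The table and column names are taken from the query above, but the schema is cut down to the columns needed here, all rows are invented toy data, the alias `TA` replaces `AT`, and the explicit `JOIN` syntax and bound parameters are choices of this example. It says nothing about the database engine or schema actually used by MDDB.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create a reduced, hypothetical version of the schema with toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Sample (sample_pk TEXT PRIMARY KEY, country_parent TEXT);
        CREATE TABLE RefSequence (refsequence_pk INTEGER PRIMARY KEY);
        CREATE TABLE Contain (sample_pk TEXT, refsequence_pk INTEGER);
        CREATE TABLE RefTaxonomicDB (
            refsequence_taxonomic_pk INTEGER PRIMARY KEY,
            kingdom_name TEXT, genus_name TEXT, sh_unite_id TEXT);
        CREATE TABLE AssignTaxa (
            refsequence_pk INTEGER, refsequence_taxonomic_pk INTEGER);
    """)
    # Invented identifiers: three plots, four sequence variants, three taxa.
    conn.executemany("INSERT INTO Sample VALUES (?, ?)", [
        ("P1", "Southern Asia"), ("P2", "Southeast Asia"),
        ("P3", "Northern Europe")])
    conn.executemany("INSERT INTO RefSequence VALUES (?)",
                     [(10,), (11,), (12,), (13,)])
    conn.executemany("INSERT INTO RefTaxonomicDB VALUES (?, ?, ?, ?)", [
        (1, "Fungi", "Fusarium", "SH1"), (2, "Fungi", "Fusarium", "SH2"),
        (3, "Fungi", "Lactarius", "SH3")])
    conn.executemany("INSERT INTO AssignTaxa VALUES (?, ?)",
                     [(10, 1), (11, 1), (12, 2), (13, 3)])
    conn.executemany("INSERT INTO Contain VALUES (?, ?)", [
        ("P1", 10), ("P1", 11), ("P1", 12), ("P1", 13),
        ("P2", 10), ("P3", 12), ("P3", 13)])
    return conn


def zotu_counts(conn, genus, subcontinents):
    """Per-sample ZOTU record count and distinct UNITE SH count for a genus."""
    placeholders = ", ".join("?" for _ in subcontinents)
    sql = f"""
        SELECT S.sample_pk, S.country_parent,
               COUNT(C.refsequence_pk) AS ZOTURep,
               COUNT(DISTINCT RTDB.sh_unite_id) AS UniteRep
        FROM Sample AS S
        JOIN Contain AS C ON C.sample_pk = S.sample_pk
        JOIN RefSequence AS RS ON RS.refsequence_pk = C.refsequence_pk
        JOIN AssignTaxa AS TA ON TA.refsequence_pk = RS.refsequence_pk
        JOIN RefTaxonomicDB AS RTDB
          ON RTDB.refsequence_taxonomic_pk = TA.refsequence_taxonomic_pk
        WHERE RTDB.kingdom_name LIKE 'Fungi%'
          AND RTDB.genus_name LIKE ?
          AND S.country_parent IN ({placeholders})
        GROUP BY S.sample_pk, S.country_parent
        ORDER BY ZOTURep DESC, UniteRep DESC
    """
    return conn.execute(sql, [genus + "%", *subcontinents]).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    for row in zotu_counts(db, "Fusarium", ["Southern Asia", "Southeast Asia"]):
        print(row)
```

In this toy data, the Northern Europe plot and the Lactarius record are filtered out by the location and taxonomy conditions, which mirrors how the two initiated components restrict the occurrence records in the use case.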
We would like to thank Jelle Sinnige and Tim van Polen for their great contribution in supporting the spatial data measurement curation for the integration methods applied to the decimal coordinate values. In addition, we would like to acknowledge the BiCIKL project (Grant No 101007492).