Biodiversity Data Journal :
Methods
Corresponding author: Raïssa Meyer (raissa.meyer@awi.de)
Academic editor: Lyubomir Penev
Received: 08 Sep 2023 | Accepted: 02 Oct 2023 | Published: 03 Oct 2023
© 2023 Raïssa Meyer, Ward Appeltans, William Duncan, Mariya Dimitrova, Yi-Ming Gan, Thomas Stjernegaard Jeppesen, Christopher Mungall, Deborah Paul, Pieter Provoost, Tim Robertson, Lynn Schriml, Saara Suominen, Ramona Walls, Maxime Sweetlove, Visotheary Ung, Anton Van de Putte, Elycia Wallis, John Wieczorek, Pier Buttigieg
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Meyer R, Appeltans W, Duncan WD, Dimitrova M, Gan Y-M, Stjernegaard Jeppesen T, Mungall C, Paul DL, Provoost P, Robertson T, Schriml L, Suominen S, Walls R, Sweetlove M, Ung V, Van de Putte A, Wallis E, Wieczorek J, Buttigieg PL (2023) Aligning Standards Communities for Omics Biodiversity Data: Sustainable Darwin Core-MIxS Interoperability. Biodiversity Data Journal 11: e112420. https://doi.org/10.3897/BDJ.11.e112420
The standardization of data, encompassing both primary and contextual information (metadata), plays a pivotal role in facilitating data (re-)use, integration, and knowledge generation. However, the biodiversity and omics communities, converging on omics biodiversity data, have historically developed and adopted their own distinct standards, hindering effective (meta)data integration and collaboration.
In response to this challenge, the Task Group (TG) for Sustainable DwC-MIxS Interoperability was established. Convening experts from the Biodiversity Information Standards (TDWG) and the Genomic Standards Consortium (GSC) alongside external stakeholders, the TG aimed to promote sustainable interoperability between the Minimum Information about any (x) Sequence (MIxS) and Darwin Core (DwC) specifications.
To achieve this goal, the TG utilized the Simple Standard for Sharing Ontology Mappings (SSSOM) to create a comprehensive mapping of DwC keys to MIxS keys. This mapping, combined with the development of the MIxS-DwC extension, enables the incorporation of MIxS core terms into DwC-compliant metadata records, facilitating seamless data exchange between MIxS and DwC user communities.
Through the implementation of this translation layer, data produced in either MIxS- or DwC-compliant formats can now be efficiently brokered, breaking down silos and fostering closer collaboration between the biodiversity and omics communities. To ensure its sustainability and lasting impact, TDWG and GSC have both signed a Memorandum of Understanding (MoU) on creating a continuous model to synchronize their standards. These achievements mark a significant step forward in enhancing data sharing and utilization across domains, thereby unlocking new opportunities for scientific discovery and advancement.
microbiome, eDNA, biodiversity, information standards, omics, metadata, harmonization, FAIR, MIxS, Darwin Core
In recent years, the field of biodiversity research has witnessed rapid growth in data acquisition, further driven by the increasing application of omics technologies (e.g. metagenomics or metatranscriptomics) in biodiversity assessments. However, the sheer volume and heterogeneity of biodiversity data pose significant challenges to effective data integration and reuse, and to FAIR
The Biodiversity Information Standards (TDWG; https://www.tdwg.org/) group and the Genomic Standards Consortium (GSC; https://www.gensc.org/)
The overlap of TDWG and the GSC in multi-omic biodiversity data is an opportunity to begin sustainable convergence of the (meta)data standards these organizations maintain. Most notable among these are the Darwin Core (DwC; https://dwc.tdwg.org/)
These two (meta)data standards have co-existed for a number of years, but adoption of one or the other is still leading to the siloing of information and a resulting lack of sustained interoperability between systems such as those of the International Nucleotide Sequence Database Collaboration (INSDC; https://insdc.org), and of the Ocean Biodiversity Information System (OBIS; https://obis.org) or the Global Biodiversity Information Facility (GBIF; https://www.gbif.org/). Meanwhile, some of these stakeholders are creating bespoke/local interpretations of DwC/MIxS mappings, which may further silo the digital holdings of the omic biodiversity community.
In the Sustainable DwC-MIxS Interoperability Task Group (TG), we brought together experts to build semantically precise and sustained interoperability between TDWG’s DwC standard, and the MIxS checklist from the GSC.
We aim to consolidate previous work on this issue
A key motivation for consolidation is to ensure the "digital health" efforts leveraging the immense interest in using omic technologies to observe life in the oceans under the UN Decade of Ocean Science for Sustainable Development (2021-2030; https://oceandecade.org/). Stakeholders rallying around this global call either use both standards or wish to collaborate across them as part of the Decade's digital strategy (see Section 2.5. Data, information, and digital knowledge management in the Implementation Plan
This TG aimed to produce an approach to sustainably align the MIxS and DwC (meta)data specifications, enabling more efficient and interoperable exchange across their user communities. In the following, we present our report on building sustainable interoperability between DwC and MIxS, including a mapping between DwC and MIxS, a MIxS extension to DwC, as well as a Memorandum of Understanding (MoU) between TDWG and the GSC.
MIxS and DwC both use terms (strings associated with a meaning) to identify elements of data structures. That is, terms (such as “elevation”) are used to identify the intended meaning of, for example, 1) the attributes/columns in tabular data or 2) keys in key-value pairs. Both specifications provide metadata about their terms, clarifying their intended meaning and the expected values that should be associated with them once they are cast in a data structure (i.e., values in table cells, or values in key-value pairs).
Typically, in both MIxS and DwC data exchanges between human agents, (meta)data is arranged in spreadsheets or tabular form. The terms are thus used as attribute names/column headers. When archived in the INSDC (MIxS) and/or GBIF/OBIS (DwC), terms are rendered as keys in key-value pairs. Below, for precision, we default to the usage of “key” (e.g., “temperature”) and its associated “value” (e.g., “18”*
Table
Term | Definition |
---|---|
Darwin Core (DwC) | A specification released by TDWG that includes a glossary of terms intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions (in this document, unless otherwise specified, we refer to DwC Version 2021-03-29) * |
Darwin Core Archive (DwC-A) | A dataset that 1) contains data about species occurrences, checklists, sampling events and/or material sample data and 2) makes use of Darwin Core terms to qualify fields. DwC-A records comprise a set of text (CSV) files with a simple descriptor record (i.e. meta.xml) to inform others how your files are organized. The format is defined in the Darwin Core Text Guidelines. It is the preferred format for publishing data to the GBIF and OBIS networks. |
Darwin Core Extension | A list of defined keys to be used in combination with/in addition to DwC keys to create a more complete metadata record for a given situation. * |
Minimum Information about any (x) Sequence (MIxS) | A collection of checklists released by the GSC to define both the minimal and extended metadata associated with any sequencing record (in this document, unless otherwise specified, we refer to MIxS Version 5). * |
MIxS core | A MIxS checklist providing minimal (and extended) sets of metadata keys directly related to the sequences. |
MIxS environmental packages | A collection of MIxS checklists providing extended sets of metadata keys about different sampling environments, deemed important by the MIxS user community. |
Simple Knowledge Organization System Reference (SKOS) | A common data model for sharing and linking knowledge organization systems via the Web. It provides a lightweight, intuitive language for developing and sharing new knowledge organization systems. |
Simple Standard for Sharing Ontology Mappings (SSSOM) | A catalog of minimal and standard metadata elements for the dissemination of mappings between ontology terms. |
Simple Standard for Sharing Ontology Mappings (SSSOM; https://mapping-commons.github.io/sssom/home/)
We performed a comprehensive mapping from DwC to MIxS, capturing differences in both semantics and syntax between corresponding keys using the format of the SSSOM.
The semantic mapping was based on the minimal and standard set of metadata elements provided by SSSOM, in combination with the relevant SKOS predicates.
As the SSSOM standard set of metadata elements does not yet*
Table 2: Metadata elements added to the DwC-MIxS mapping document to capture the syntactic mapping between keys. Please see an example of how these keys were used in the mapping in the Suppl. material
Element ID | Description | TSV/RDF Example |
---|---|---|
syntax_predicate_id | The ID of the predicate or relation that relates the syntax of the subject and object of this match. | skos:relatedMatch |
syntax_comment | Free text field containing either curator notes or text generated by a tool providing additional information on the syntactic mapping. | The subject expects a verbatim input (so anything really), while the object expects a {float} {unit} entry. |
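As a minimal sketch of how the two syntactic elements in Table 2 could sit next to the standard SSSOM slots, the Python snippet below serialises one illustrative record as tab-separated values. The subject (`dwc:verbatimDepth`) is inferred from the definition quoted in the example cells, while the `mixs:depth` CURIE, the semantic `predicate_id`, and the semantic `comment` are placeholders assumed for illustration only; they are not taken from the TG's mapping files.

```python
import csv
import io

# Standard SSSOM slots first, then the two syntactic elements from Table 2.
COLUMNS = [
    "subject_id",
    "predicate_id",
    "object_id",
    "mapping_justification",
    "comment",
    "syntax_predicate_id",
    "syntax_comment",
]

# One illustrative record. Values marked "assumed" are placeholders,
# not entries copied from the published mapping.
record = {
    "subject_id": "dwc:verbatimDepth",
    "predicate_id": "skos:closeMatch",  # assumed semantic predicate
    "object_id": "mixs:depth",  # assumed placeholder CURIE
    "mapping_justification": "semapv:ManualMappingCuration",
    "comment": "Illustrative semantic note.",  # assumed
    "syntax_predicate_id": "skos:relatedMatch",
    "syntax_comment": (
        "The subject expects a verbatim input (so anything really), "
        "while the object expects a {float} {unit} entry."
    ),
}


def to_tsv(rows):
    """Serialise mapping records as a tab-separated table with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv([record])
print(tsv_text)
```

Keeping the semantic and syntactic predicates in separate columns lets a consumer that only understands standard SSSOM slots ignore the extra columns without losing the semantic mapping.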
To facilitate the mapping process during our working period, we added further metadata elements to capture definitions and value syntax (see Table 3). This working document is also available through our GitHub repository*
Table 3: Metadata elements added to the working document for the SSSOM mapping between DwC and MIxS keys. These elements facilitated the mapping process by keeping all the information needed in one spreadsheet. Please see an example of how these keys were used in the mapping in Suppl. material
Element ID | Description | TSV/RDF Example |
---|---|---|
subject_definition | The definition of the subject of this mapping. | The original description of the depth below the local surface. |
subject_valueSyntax | The value syntax expected for the subject of this mapping. | verbatim |
syntax_predicate_id | The ID of the predicate or relation that relates the syntax of the subject and object of this match. | skos:relatedMatch |
syntax_predicate_label | The label of the predicate/relation of the syntactic mapping. | related match to |
object_definition | The definition of the object of this mapping. | Depth is defined as the vertical distance below local surface, e.g., for sediment or soil samples depth is measured from sediment or soil surface, respectively. Depth can be reported as an interval for subsurface samples. |
object_valueSyntax | The value syntax expected for the object of this mapping. | {float} {unit} |
syntax_comment | Free text field containing either curator notes or text generated by a tool providing additional information on the syntactic mapping. | The subject expects a verbatim input (so anything really), while the object expects a {float} {unit} entry. |
For each mapping, group consensus was reached through a combination of structured discussions in the GitHub issue tracker and online video-chat meetings. Mappings can be found in the TDWG/GBWG GitHub repository, with related discussions captured on the issue tracker.
The SSSOM compliance of the mapping products was validated by Chris Mungall*
The aims of the mapping process were to provide:
Darwin Core Archives are generally built on a combination of a core CSV file and zero or more extension CSV files. The schemas of the core and extensions are defined by XML documents maintained in the GBIF GitHub repository for machine-readable resources (https://github.com/gbif/rs.gbif.org). Core files act as the primary focus of a data set (e.g., Occurrences of organisms in nature), while the extensions add information relevant for specific uses (e.g., the proposed MIxS extension). The MIxS extension contains the list of keys that are orthogonal (have no equivalent mappings) to keys in the Darwin Core standard. Being orthogonal and defined by the GSC, the keys in the extension are identified by IRIs from a namespace distinct from that of Darwin Core (http://rs.tdwg.org/dwc/terms/). The fully qualified MIxS namespace (https://w3id.org/mixs/) became available with the release of MIxS v6.
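The core-plus-extension layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file contents are made up, the extension column is simply named `coreid` here (in a real archive the column roles are declared in `meta.xml`), and `seq_meth` and `target_gene` stand in for MIxS keys carried by an extension file.

```python
import csv
import io
from collections import defaultdict

# Illustrative core file: one row per occurrence, identified by occurrenceID.
CORE_CSV = """occurrenceID,scientificName,eventDate
occ-1,Emiliania huxleyi,2020-06-01
occ-2,Pelagibacter ubique,2020-06-01
"""

# Illustrative extension file: rows point back to the core via "coreid".
EXTENSION_CSV = """coreid,seq_meth,target_gene
occ-1,Illumina MiSeq,18S rRNA
occ-2,Illumina MiSeq,16S rRNA
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_star(core_rows, ext_rows, core_id="occurrenceID", ext_id="coreid"):
    """Attach zero or more extension rows to each core row (star schema)."""
    by_core = defaultdict(list)
    for row in ext_rows:
        payload = {k: v for k, v in row.items() if k != ext_id}
        by_core[row[ext_id]].append(payload)
    return [
        dict(core, extension=by_core.get(core[core_id], []))
        for core in core_rows
    ]


records = join_star(read_rows(CORE_CSV), read_rows(EXTENSION_CSV))
for rec in records:
    print(rec["occurrenceID"], rec["scientificName"], rec["extension"])
```

Because the extension rows are stored in a list per core record, a core row with no sequence metadata simply carries an empty list, which matches the "zero or more extension files" design.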
This was achieved by 1) documenting the relevant MIxS terms in the XML format specified by GBIF*
To test technical interoperability and simulate the ingestion of MIxS-compliant metadata into a Darwin Core-based database environment (e.g., OBIS or GBIF), a marine omics dataset
Similar tests were performed using data representing Pico- to Mesoplankton along the 2000 km Salinity Gradient of the Baltic Sea
We were able to successfully ingest the data into GBIF's user agreement test environment (www.gbif-uat.org). These test cases show it is possible for omics data to be incorporated alongside human observation-based occurrence datasets using data processing by MGnify. This advancement is especially relevant for microbial groups, some of which are only known from environmental DNA (eDNA) sequences. It opens up new opportunities to include the vast biodiversity of micro eukaryotes, Bacteria, and Archaea in repositories that up to now have been dominated by plants and animals.
Additionally, OBIS will be working on a first test case of the DNA-derived data extension utilizing Autonomous Reef Monitoring Structures (ARMS) datasets (https://doi.org/10.3389/fmars.2020.572680), which will link occurrences derived from genetic samples, morphological identifications and photographic evidence to each sampling device. To facilitate the addition of sequencing datasets to the database, OBIS is also developing a bioinformatics pipeline, which will output a dataset formatted to the DwC-A including the MIxS extension.
As of Sep. 1st 2023, 16,612,814 Occurrences distributed across 52 datasets have been published to the GBIF production environment and OBIS holds 23 million records/sequences from 36 datasets utilising the GBIF/OBIS variant of the MIxS DwC extension*
This TG has solicited and incorporated feedback from the GSC steering group and TDWG executive committee prior to the signing of the Memorandum of Understanding. We welcome feedback from users and implementers of the mapping and extension upon the publishing of this paper. Please share your feedback through the GBWG GitHub issue tracker, using the label "DwC-MIxS feedback V2.1.0".
To ensure that our mapping and approach are integrated into the procedures and workflows of both TDWG and the GSC, we drafted and circulated a Memorandum of Understanding (MoU; see Outcomes) to the executive bodies of each organization. This MoU was signed in October 2022. It incorporates processes sustaining and furthering interoperability between these specifications and organizations. In this way, we hope that the work of our TG can lay the foundation for ever closer alignment, ultimately allowing precise machine-to-machine translation of metadata using GSC and TDWG specifications.
GitHub releases of new versions of either DwC or MIxS shall trigger a notification to the maintainers of the mapping created by this TG, who will review the new release and update the mapping if needed. As both standards are released approximately annually, we estimate that long-term maintenance should require approximately 10-30 combined person-hours for mapping review per year, plus review by the TDWG DwC Maintenance Group and the GSC Compliance and Interoperability Group (CIG), each of which can be accomplished as part of one of their regular monthly meetings.
As part of the MoU, both GSC and TDWG have agreed to provide personnel to maintain this mapping in perpetuity and to provide ongoing development to automate the mapping process as far as possible.
The next update of the mapping is expected before the end of the year with the new release of DwC around the MaterialEntity developments*
Note: The TG developed this mapping based on MIxS v5; the identifiers, however, are based on those noted in the working document preceding the MIxS v6 release*
Following our mapping approach (Approach: Mapping), we mapped 32 DwC keys to 12 MIxS keys. Our resulting SSSOM records are accessible through the GBWG DwC-MIxS GitHub repository.*
In the following section, we include the Memorandum of Understanding (MoU) between TDWG and the GSC.
The Biodiversity Information Standards (TDWG) group and the Genomic Standards Consortium (GSC) have emerged as de facto (meta)data standards authorities in the biodiversity domain. The former’s scope spans biodiversity data at large, while the latter focuses on genomic, and then multi-omic, data and metadata such as lab protocols or chemical/physical measurements. Their activities, technologies, and management structures have been largely parallel, with some notable exceptions catalyzed through joint interest groups such as the Genomic Biodiversity Working Group (GBWG).
The overlap of TDWG and the GSC in multi-omic biodiversity data is an opportunity to begin sustainable convergence of the (meta)data standards these organizations maintain. Most notably among these, are the Darwin Core (DwC) and the Minimal Information about any (x) Sequence (MIxS) specifications. This memorandum builds on the output of a GBWG task group to propose a solution for sustained mapping and scalable interoperation of both DwC and MIxS. Its goal is to ensure that TDWG and the GSC create a lasting and continuous model to synchronize their standards, eventually promoting full bi-lateral integration.
Recognizing that both the Biodiversity Information Standards (TDWG) group and the Genomic Standards Consortium (GSC) have established well-adopted and community-driven (meta)data specifications for sequence-based biodiversity data;
Further recognizing that users of one standard specification should not have to invest additional effort in independently translating their (meta)data into another;
It is resolved that:
The GSC and TDWG will maintain and endorse an authoritative and machine-readable mapping*
These authoritative mappings (in SSSOM-compliant tab-separated value files) and other digital references will be maintained in the GBWG GitHub repository within the TDWG organization and with TDWG-issued IRIs for the mapping files;
Further, both organizations will provide bilaterally endorsed reference implementations of how to use their counterpart’s specification in their data structures (e.g., a DwC Archive incorporating fields mapped to MIxS in a DwC extension);
Any necessary modification of identifiers (URNs, URLs, URIs, IRIs, etc) or other component of a standard issued by one organization which impacts the other should be declared and the particulars agreed upon in documented appendices to this MoU;
When one specification is updated, the TDWG DwC Maintenance Group and the GSC Compliance and Interoperability Group (CIG) will hold joint sessions to update and validate any mappings and reference implementations to ensure clarity in the multi-omic biodiversity data community.
The communication channels to communicate updates of either specification will be the MIxS issue tracker*
Additionally recognizing that unilateral innovation and research actions will propose and implement alternative mappings and extensions to sequence-based metadata specifications.
It is further resolved that:
Only those modifications which have been reviewed and endorsed by mechanisms bi-laterally convened by TDWG and the GSC will be considered standardized;
Innovation is still welcome, and both organizations will welcome input and inspiration from application-driven modifications of the base standard.
Signatories:
12. October 2022 Representative of TDWG Executive (Deborah L Paul)
13. October 2022 Representative of the GSC Board (Lynn Schriml)
We created a DwC extension*
Of the 96 keys contained in MIxS core, we included in the extension the 82 keys that were not mapped to DwC.
The TG’s GitHub repository hosts both the list of keys*
Following the terms of our MoU draft, this extension will be bilaterally endorsed by the GSC and TDWG to assure users that they are implementing an officially recognized recommendation. The manner in which this is declared (e.g., as a header in the DwC-A reference implementation) will be decided upon by the relevant bodies in the GSC and TDWG.
While the bilaterally endorsed GSC-TDWG extension provides stability, we recognize that the needs of the biodiversity community are more diverse and require more nimble forms of data exchange. In the creation of these more ad hoc extensions, the risk of creating siloed / bespoke data products (and thus reducing global interoperability) is often countered by the practicality of advancing with fewer overheads and at a more rapid pace than standards bodies can be expected to match. Here, without taking a position on the “better” route, we recognize the reality of this scenario.
To demonstrate how metadata fields relevant to sequence-based biodiversity data can relate to the core outputs of this TG, we include a variation of the MIxS-DwC extension - the DNA-derived data extension - developed by GBIF*
The DNA-derived data extension includes all keys of the MIxS-DwC extension, but brings in additional keys necessary to satisfy the exchange needs of the GBIF/OBIS/Atlas of Living Australia (ALA) networks. The additional keys originate primarily from the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) recommendation and Global Genome Biodiversity Network (GGBN).
Additionally, the DNA-derived data extension takes measures to optimize the formatting and machine-readability of keys from MIxS. This stems from the fact that some MIxS key-value pairs are not atomic, i.e., they include multiple values in the same field (e.g., the MIxS key “pcr_primers” requires the user to enter a single string that represents both the forward and reverse primer sequences, separated by a semicolon). This value-level formatting creates a bespoke data structure which then requires custom software or code to parse, limiting interoperability with external systems. Thus, in the case of pcr_primers, the DNA-derived data extension uses alternative keys, based on the MIxS key, which are associated with atomic values: pcr_primer_forward and pcr_primer_reverse. This allows for more efficient and unambiguous data ingestion into search indices, relational databases, or similar solutions, with minimal processing.
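To illustrate the kind of custom parsing that a non-atomic value requires, the sketch below splits a combined `pcr_primers` string into the two atomic keys named above. It assumes the value uses `FWD:` and `REV:` labels before each sequence; that labelling, the function name, and the example primer sequences are illustrative assumptions rather than part of the extension itself.

```python
def split_pcr_primers(value):
    """Split a combined primer string into atomic forward/reverse values.

    Assumes the form "FWD:<sequence>;REV:<sequence>". Parts with an
    unknown label or no colon are ignored, and missing primers stay None.
    """
    result = {"pcr_primer_forward": None, "pcr_primer_reverse": None}
    labels = {"FWD": "pcr_primer_forward", "REV": "pcr_primer_reverse"}
    for part in value.split(";"):
        label, separator, sequence = part.strip().partition(":")
        key = labels.get(label.strip().upper())
        if separator and key:
            result[key] = sequence.strip()
    return result


atomic = split_pcr_primers("FWD:GTGCCAGCMGCCGCGGTAA;REV:GGACTACHVGGGTWTCTAAT")
print(atomic["pcr_primer_forward"])
print(atomic["pcr_primer_reverse"])
```

With atomic keys, none of this parsing is needed downstream: each value can be indexed or validated directly as a single sequence string.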
We acknowledge that application profiles must strike a balance between complying with community standard specifications and satisfying the needs of the systems using them. To include and represent the evolving needs of the community and applications in existing community standards, we encourage that requests for changes or new keys are sent directly to the GSC*
In the sub-sections below, we offer several recommendations based on the proceedings and outcomes of this TG. We see our TG’s diverse membership and perspectives as a strong model to follow in future work developing or interlinking community standard specifications used by many stakeholders. Through this, operational realities, technical soundness, and policy-level perspectives can be better integrated and built upon.
The Simple Standard for Sharing Ontology Mappings (SSSOM) offers a framework to represent ontology mappings in a precise way, with a structured way to include rich provenance. For the work of this TG, we have implemented an SSSOM mapping between the DwC standard and the MIxS checklist.
SSSOM provides a minimal set of standard elements for the dissemination of mappings between terms. This helps to ensure a reliable interpretation of mappings and enables sharing and data integration between human and machine agents.
As described in the Recommendations for semantic and syntactic alignment, even closely related MIxS and DwC terms may have semantic variance and expect values with different syntax. To manage that variance, we propose extending the list of SSSOM metadata elements to include elements to capture the syntactic mapping (syntax_predicate_id, syntax_comment; see Approach: Mapping) in addition to the existing semantic mapping metadata elements.
During the process of mapping, it is very useful to include additional attributes/columns in the SSSOM matrix to store the information on which the mapping is based.
We thus propose adding such columns during the process (e.g., definitions [subject_definition, object_definition], syntax requirements [subject_valueSyntax, object_valueSyntax]; see Approach: Mapping). Once the process is over, a leaner SSSOM product can be released omitting these supporting attributes.
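One possible way to derive the leaner release from a working document is to filter out the supporting columns, as in the sketch below. The set of dropped columns follows the four examples named above; the function name and the sample row (including its placeholder CURIEs and semantic predicate) are assumptions made for illustration.

```python
# Supporting columns that are useful while mapping but omitted on release.
WORKING_COLUMNS = frozenset(
    {
        "subject_definition",
        "subject_valueSyntax",
        "object_definition",
        "object_valueSyntax",
    }
)


def release_view(rows, drop=WORKING_COLUMNS):
    """Return copies of the mapping rows without the supporting columns."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Illustrative working row; identifiers and predicate are placeholders.
working_rows = [
    {
        "subject_id": "dwc:verbatimDepth",
        "subject_definition": "The original description of the depth below the local surface.",
        "subject_valueSyntax": "verbatim",
        "predicate_id": "skos:closeMatch",  # assumed
        "object_id": "mixs:depth",  # assumed placeholder CURIE
        "object_valueSyntax": "{float} {unit}",
        "syntax_predicate_id": "skos:relatedMatch",
    }
]

lean_rows = release_view(working_rows)
print(sorted(lean_rows[0]))
```

The working rows are left untouched, so the richer spreadsheet can remain the editing source while the filtered view is what gets published.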
For mapping keys from metadata standards to one another, this TG recommends:
Follow the SSSOM guidance*
Until official guidance is offered from the SSSOM team, apply the extension proposed above (see Approach: Mapping) to additionally capture the mapping of syntax requirements:
using the SSSOM predicate_id and corresponding comment to capture the semantics, and the syntax_predicate_id and corresponding syntax_comment to capture the syntactic mapping of terms.
Communicate any needed extensions to the SSSOM team via their issue tracker*
Due, in part, to the different approaches to atomization described above and below, many of the proposed relationships between MIxS and DwC keys required one-to-many or many-to-one mappings. This usually occurred when one specification offered multiple similar alternative keys for a phenomenon (e.g., DwC offers five keys relevant to “depth” measurements, while MIxS only offers one).
Recognizing that many keys in DwC or MIxS have community- and development-specific legacies, we recommend:
A mapping between metadata standards should be all-encompassing, and may thus include many-to-one, many-to-many, or one-to-many mappings.
Implementers, who represent a community of practice, can add notes on what keys they think are the most sensible.
In the long term, the standards agencies should aim to reduce the complexity of keys, moving towards atomization, to support more one-to-one relationships, eventually supporting full convergence.
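A small sketch can make the cardinality of such mappings explicit, so that implementers can see at a glance where one-to-one relationships are missing. The subject/object pairs below are illustrative placeholders and do not reproduce the TG's actual mapping; the cardinality labels are defined by this snippet, not by SSSOM.

```python
from collections import defaultdict

# Illustrative (subject, object) pairs; not the TG's published mapping.
PAIRS = [
    ("dwc:verbatimDepth", "mixs:depth"),
    ("dwc:minimumDepthInMeters", "mixs:depth"),
    ("dwc:maximumDepthInMeters", "mixs:depth"),
    ("dwc:decimalLatitude", "mixs:lat_lon"),
    ("dwc:decimalLongitude", "mixs:lat_lon"),
    ("dwc:eventDate", "mixs:collection_date"),
]


def cardinalities(pairs):
    """Label each pair by how many subjects/objects share its endpoints."""
    objects_of = defaultdict(set)
    subjects_of = defaultdict(set)
    for subject, obj in pairs:
        objects_of[subject].add(obj)
        subjects_of[obj].add(subject)
    labels = {}
    for subject, obj in pairs:
        left = "many" if len(subjects_of[obj]) > 1 else "one"
        right = "many" if len(objects_of[subject]) > 1 else "one"
        labels[(subject, obj)] = f"{left}-to-{right}"
    return labels


for pair, label in cardinalities(PAIRS).items():
    print(pair, label)
```

A report like this could also help track progress towards atomization over successive releases, since the share of one-to-one pairs should grow as keys converge.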
DwC and MIxS specifications both offer guidance on the syntax expected for each value in a given key-value pair, alongside general notes on the expected semantics. In DwC, a value’s expected semantics*
For measured values, MIxS expects the unit to be included as part of the value, while in DwC the unit is not allowed as part of the value - it is either inherent in the term definition or requires a separate term to specify the unit (optional for verbatim fields*
For measured values, MIxS offers a “preferred” unit option, which - as the label implies - is not mandatory, while DwC either clearly defines the expected unit for a value or allows any unit to be used in a unit field related to the value field (except for verbatim fields).
Some MIxS keys, such as lat_lon, expect values that capture two or more measured/derived values. DwC typically separates these measured/derived values across two or more keys (e.g., decimalLatitude and decimalLongitude).
Also, several MIxS fields allow for a numeric value or a range, followed by a measurement unit (size_frac, samp_size, temp, depth, etc.). Darwin Core generally opts for atomic values associated with its keys.
Incompatibilities, such as those above, create (meta)data silos between communities using one or the other specification. Mappings built upon these can (in general) only be semantically and syntactically loose, and implementers must create and maintain converters or automated translators between the two, which severely limits machine-to-machine exchanges and is likely to propagate errors.
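The sketch below shows what such a converter might look like for two of the cases listed above: a MIxS-style depth value with an embedded unit (optionally a range) and a combined lat_lon value. It is a deliberately narrow illustration that assumes a hyphen as the range separator, a whitespace-separated latitude/longitude pair, non-negative depths, and metres as the only accepted unit; the function names are hypothetical.

```python
import re

# A number or hyphen-separated range, followed by a unit, e.g. "10 m" or "10-20 m".
_MEASURE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)\s*"
    r"(?:-\s*(?P<high>\d+(?:\.\d+)?))?\s*"
    r"(?P<unit>\S.*?)\s*$"
)
_METRE_UNITS = {"m", "meter", "meters", "metre", "metres"}


def mixs_depth_to_dwc(value):
    """Convert a depth such as "10 m" or "10-20 m" into atomic DwC keys."""
    match = _MEASURE.match(value)
    if match is None:
        raise ValueError(f"unrecognised depth value: {value!r}")
    if match["unit"].lower() not in _METRE_UNITS:
        raise ValueError(f"unit conversion not handled: {match['unit']!r}")
    low = float(match["low"])
    high = float(match["high"]) if match["high"] else low
    return {"minimumDepthInMeters": low, "maximumDepthInMeters": high}


def mixs_lat_lon_to_dwc(value):
    """Split a "latitude longitude" pair into two atomic DwC keys."""
    latitude, longitude = (float(part) for part in value.split())
    return {"decimalLatitude": latitude, "decimalLongitude": longitude}


print(mixs_depth_to_dwc("10-20 m"))
print(mixs_lat_lon_to_dwc("50.586825 6.408977"))
```

Even this small example has to make several guesses about separators and units, which is the maintenance and error-propagation burden the paragraph above describes.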
The SSSOM community is actively looking at ways to address these kinds of data structure mappings, and whether to address them as in scope for SSSOM, or to address these using the LinkML-transformer framework.
To secure improved semantic and syntactic alignment, this TG recommends the following:
The use of more explicit labels (terms), associated with less ambiguous definitions (many current definitions are more descriptive than definitional).
Additionally, further cross-organization efforts to align the semantics of their fields in successive releases, using their obsolescence/change processes as appropriate.
Examples or descriptions of what is within and outside of the semantic scope/range of each field.
For any non-verbatim fields, clear guidance on what syntax is expected in each field (e.g., how many terms, separated how, with or without which unit?).
Re-use of existing and established terms from more general standards organizations within each specification (e.g., using dcterms:license to capture licensing information within MIxS and DwC).
Alignment to official external standards (e.g., using ISO 8601 to capture the time and date of an event)*
Synchronization between standards bodies ahead of new releases for closer syntactic alignment.
Semantic stability and standard syntax so stable converters can be written.
Atomic key-value structures, such that no complex or bespoke data structure exists in each value. For example, splitting ranges into dedicated start and stop fields.
With advancement towards RDF- or JSON-based representations, allowing lists to be rendered as repeated key-value pairs.
Removing units from values by, for example, requiring a standard unit in the definition of each key.
In addition to MIxS core, MIxS contains numerous “environmental packages” which bundle keys which improve the contextualization of sequences in a given sampling environment. These are especially relevant for associating specific chemical and physical environmental measurements with specimens collected from these environments. Examples include marine, soil, food, and host-associated packages. These packages were created as a means to keep the core set relatively small, while rapidly accounting for the needs of sub-domains. These keys, and specifications of expected values, however, have not been harmonized or otherwise made interoperable with information standards published and used in Earth and environmental sciences.
Thus, this TG created SSSOM mappings and harmonization notes only for MIxS keys that directly pertained to sequences (MIxS core), rather than the specific environment they were obtained from.
Recognizing that the standardization domain/mandate of the GSC does not extend to standards of environmental parameters, this TG recommends that:
Any sustained reference implementation of a MIxS extension of DwC - endorsed by the GSC and TDWG - is limited to those MIxS keys which closely pertain to sequences (MIxS core), rather than the environments they originate from (MIxS environmental packages).
The GSC, as it begins to transition MIxS into RDF, should make efforts to map and eventually replace their environmental keys with equivalent, well-described keys from an information standards body working in the Earth and environment domain. We strongly advise that this is done as a joint activity with TDWG, to prevent decoupling and the need for downstream re-alignment.
Users wishing to use the MIxS environmental package keys in DwC Archives should use the MeasurementOrFact (MOF)*
While we demonstrate how to link MIxS environmental package keys to DwC’s MOF, we draw attention to the fact that the GSC’s mandate does not cover the standardization of Earth and environment metadata. Thus, where possible, users should attempt to use values from established Earth and environmental vocabularies, thesauri, ontologies, etc.
Please see Suppl. material
TDWG and the GSC, in partnership with one or more standards bodies in the Earth and environmental sciences (e.g., the Earth Science Information Partners), convene a task group (or extend and expand this TG with a new mandate) to provide recommendations on how to sustainably and FAIRly incorporate well-adopted and more formally standardized environmental parameters into both MIxS and DwC.
To our knowledge, there is no sustained attempt to secure interoperability between the competing standards (most of which are informal, ad hoc, or de facto, as are MIxS and DwC) in this space. Some organizations and efforts of interest are listed below.
Parameter vocabularies
The British Oceanographic Data Centre (BODC)*
The Open Geospatial Consortium (OGC)*
Climate and Forecasting Variables*
We note that, while this vacuum exists, implementers will create their own internal standards for expediency*
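As a rough illustration of the recommendation above to carry MIxS environmental package keys in the DwC MeasurementOrFact extension, the following Python sketch turns key-value pairs into extension rows. The column names `measurementType`, `measurementValue`, `measurementUnit`, and `measurementRemarks` are Darwin Core terms; the `id` link column, the example keys, values, and units, and both function names are assumptions made for this sketch and do not reproduce the layout given in the supplementary material.

```python
import csv
import io

# Darwin Core MeasurementOrFact terms, plus a hypothetical "id" column
# that links each row back to its core record.
MOF_COLUMNS = [
    "id",
    "measurementType",
    "measurementValue",
    "measurementUnit",
    "measurementRemarks",
]


def mixs_to_mof_rows(core_id, env_values, units):
    """Build one MeasurementOrFact row per environmental key-value pair."""
    rows = []
    for key, value in env_values.items():
        rows.append({
            "id": core_id,
            "measurementType": key,
            "measurementValue": value,
            "measurementUnit": units.get(key, ""),
            "measurementRemarks": "MIxS environmental package key",
        })
    return rows


def write_tsv(rows):
    """Serialize rows as tab-separated text, as used in DwC-A data files."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=MOF_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record: identifiers, keys, values, and units are invented.
rows = mixs_to_mof_rows(
    "occurrence-001",
    {"salinity": "35.1", "temp": "4.2"},
    {"salinity": "psu", "temp": "degree Celsius"},
)
print(write_tsv(rows))
```

Where an Earth and environmental vocabulary offers an equivalent parameter, its term would be the preferable value for `measurementType`, in line with the caution expressed above.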
Information on licensing is critical for data reusability (as declared in the FAIR Principles
Recognizing that the GSC does not currently intend to extend their core checklist to include a key for licensing information*
That said, we also recognize the need for further discussion of licensing and reuse in conjunction with access and benefit-sharing discussions around the Nagoya Protocol and digital sequence information, as well as with the implementation of the CARE principles
In concluding this document, we emphasize the importance of convening a diverse and multi-stakeholder TG. With representatives from established biodiversity data infrastructures, domain experts, data generators, and publishers, we bridged, ab initio, the conceptual and application spaces. We leveraged this to 1) generate, and internally review, a fine-grained mapping in a standard format, 2) implement new extensions to DwC, and 3) develop recommendations on how to expand on and sustain these. We have also identified areas of concern that need further attention and follow-up TGs.
Despite the achievements above, the work of this TG falls short of making an automated conversion possible. For this to be achievable, both community standards require further semantic and syntactic alignment, both between one another and with external data-on-the-web standards and best practices. In general, avoiding bespoke value syntax and complex semantics associated with keys (e.g., by unpacking complex keys into a number of simpler ones) will help this effort.
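To make the idea of unpacking a complex key concrete, here is a minimal Python sketch that splits a combined coordinate value into two simpler keys. It assumes the MIxS `lat_lon` value is written as two space-separated decimal degrees, which may not hold for every MIxS version or submission route; the target names `decimalLatitude` and `decimalLongitude` are DwC terms, and the function name is hypothetical.

```python
def unpack_lat_lon(raw):
    """Split an assumed '<lat> <lon>' string into two simple DwC-style keys."""
    parts = raw.split()
    if len(parts) != 2:
        raise ValueError(f"expected two space-separated numbers, got {raw!r}")
    latitude, longitude = float(parts[0]), float(parts[1])
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return {"decimalLatitude": latitude, "decimalLongitude": longitude}


# Hypothetical input value.
coordinates = unpack_lat_lon("50.586825 6.408977")
print(coordinates)
```

Each resulting key holds a single typed number that can be validated on its own, which is the property that makes automated conversion more tractable.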
As stated in our MoU, the sustainability of this TG’s output must be ensured through aligned processes within the community standards bodies involved. As noted in Suppl. material
In the long term, as sequence-based (meta)data becomes more central to biodiversity observing, we anticipate a full convergence of these standards. Simultaneously, tools to convert records built from these specifications into more machine-readable forms (e.g., RDF triples) would increase their value, scalability, and portability.
We trust that the activities of this TG will inspire similar activities between other metadata standards in this space, to break down silos and open a path to a more collaborative and interoperable future.
Our approach demonstrates considerable reuse potential, with comprehensive documentation of each step for easy adoption and adaptation by others. In Suppl. material
We see our TG’s diverse membership and perspectives as a strong model to follow in future work developing or interlinking community standard specifications used by many stakeholders. Through this, operational realities, technical soundness, and policy-level perspectives can be better integrated and built upon. Further, leveraging pre-existing and standardized resources such as SSSOM and SKOS has streamlined the process and allows the mapping to be easily and broadly parsed and understood.
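As an indication of how easily an SSSOM mapping can be parsed, the following Python sketch reads a small SSSOM-style TSV with the standard library only. The column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`) and the `#`-prefixed metadata header follow the SSSOM TSV convention, but every identifier and row below is invented for illustration and is not taken from the TG's mapping files; the function names are hypothetical.

```python
import csv

# Invented SSSOM-style content; lines starting with "#" carry set-level metadata.
SSSOM_TSV = """\
# mapping_set_id: https://example.org/hypothetical-mapping-set
# license: https://creativecommons.org/licenses/by/4.0/
subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\tmapping_justification
EX_MIXS:key_a\tkey a\tskos:exactMatch\tEX_DWC:termA\tterm A\tsemapv:ManualMappingCuration
EX_MIXS:key_b\tkey b\tskos:closeMatch\tEX_DWC:termB\tterm B\tsemapv:ManualMappingCuration
EX_MIXS:key_c\tkey c\tskos:narrowMatch\tEX_DWC:termC\tterm C\tsemapv:ManualMappingCuration
"""


def read_sssom(text):
    """Return the mapping rows as dictionaries, skipping metadata lines."""
    data_lines = [line for line in text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(data_lines, delimiter="\t"))


def pairs_for_predicate(mappings, predicate):
    """List (subject, object) pairs that use the given SKOS predicate."""
    return [
        (row["subject_id"], row["object_id"])
        for row in mappings
        if row["predicate_id"] == predicate
    ]


mappings = read_sssom(SSSOM_TSV)
exact = pairs_for_predicate(mappings, "skos:exactMatch")
print(len(mappings), exact)
```

Filtering by predicate in this way separates mappings that are candidates for direct conversion (exact matches) from those that need the accompanying harmonization notes.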
We encourage others to consider and further develop our approach. This will happen in the DwC/MIxS world as new versions of each standard and of the mapping are released, but it has also already started with other (meta)data standards, such as GGBN*
We envision this as a significant step toward fostering a collaborative digital ecosystem, where reduced redundancy and increased interoperability become the norm.
The described TG outputs (V2.1.0) are hosted in the "dwc-mixs" folder of the GBWG GitHub repository. The permanent identifier for this repository and its content is https://doi.org/10.5281/zenodo.8393224.
The versions of the standards specifications used can be found:
The TG discussions can be found on the GBWG issue tracker with the label "DwC-MIxS TG". We encourage potential users to contribute to those discussions and/or request improvements as needed. For feedback on the V2.1.0 release specifically, please use the "DwC-MIxS feedback V2.1.0" label.
We would like to thank the Genomic Biodiversity Interest Group (GBWG) for providing a home for this TG and work. Further, we thank the TDWG Secretariat and Executive Committee as well as the GSC board for their support and for providing feedback on the TG outputs. Special thanks also to Harshad Hegde for validating the SSSOM compliance of the mapping products.
This publication is funded by the BiCIKL project, Grant No 101007492.
RM was supported by the European Union’s Horizon 2020 Research and Innovation Programmes under grant agreement N° 862923, project AtlantECO (Atlantic ECOsystem assessment, forecasting and sustainability), and grant agreement N° 862626, project EuroSea (Improving and Integrating European Ocean Observing and Forecasting Systems for Sustainable use of the Oceans).
All TG members contributed to the discussions and wrote the manuscript. The TG and the writing were led by RM with the support of PLB. All TG members developed the mapping, led by RM and PLB, with significant contributions from WDD and JW. TR, TSJ, SS, YMG, PP, and MS developed and tested the extension. PLB, RM, RW, LS, and DP wrote and reviewed the MoU.
Recommendations were discussed and reviewed by all TG members.
The Supplementary Material was discussed and reviewed by all TG members.
Includes sections on (1) Exemplar rows from the SSSOM mapping files, (2) Using MIxS environmental package keys in DwC Archives, (3) Issues noted for future TGs, (4) Relation of interoperable standards to the future of data-driven publishing.
This example assumes that the corresponding unit of the value is defined in the metadata associated with the key. See Recommendations for semantic and syntactic alignment.
In the proceedings of this TG, it was noted that the loose usage of such terms referencing the linguistic artifacts (e.g., “terms”) and the more technical data structures (“key-value pairs”) can produce confusion during tasks that require semantic precision, including this mapping. Thus our clarification here.
Currently the SSSOM community is working to provide best practice for these situations; see https://github.com/tdwg/gbwg/issues/54, https://github.com/mapping-commons/SSSOM/issues/52, https://github.com/mapping-commons/SSSOM/issues/56.
For example, a frequent challenge when mapping different term lists is that one system builds a unit into the meaning of a term, while the other system has a corresponding term whose value is expected to combine a value and a unit.
https://github.com/tdwg/gbwg/tree/main/dwc-mixs. An exemplar row from the mapping files can additionally be found in the Suppl. material
ORCID: 0000-0002-6601-2165
ORCID: 0000-0002-2411-565X
GBIF: See some examples here https://www.gbif.org/dataset/9e29a2fe-d780-48a8-a93f-9ce041f9202f, https://www.gbif.org/dataset/9f0e1ca6-fb08-4c72-9a4a-1e3b7a528c10, https://www.gbif.org/dataset/4cefd38b-8ada-46e0-9ef7-3531f8a204df, https://www.gbif.org/dataset/9d7baaac-57db-4852-9993-7f0e7f15635b
OBIS: See some examples at https://obis.org/datasets, select "DNADerivedData" as data type
https://github.com/tdwg/gbwg/tree/v2.1.0/dwc-mixs/mapping. An exemplar row from the mapping files can additionally be found in the Suppl. material
See the meta.xml file of the Korean Peninsula Flora as an example of how an XML file is used as part of the DwC-A: https://www.gbif.org/dataset/e09e1e1f-2460-4017-a964-e999abd2bf66
In both MIxS and DwC, multiple definitions suffer from ambiguity, circularity, or other semantic aberrations. An effort to improve these would also improve future mapping and (meta)data (re)use efforts.
Verbatim fields are essential to collect specimen data from museums, etc.
The rare occasion on which DwC and MIxS matched exactly, both semantically and syntactically, was due to a shared external standard (ISO 8601).
Please see Suppl. material
For example, GBIF is building basic vocabularies in SKOS, based on the values they see in the aggregation of original sources. The objective here is more to clean data than to build rigorous vocabularies. Such internal efforts would greatly benefit from having a consolidated, appropriately endorsed, and standardised specification of environmental terms to align to.
License information is additionally captured at the dataset level of a DwC-A, in EML; however, this declaration may not carry through automatically to the individual records in the dataset.
Similar to the Chronometric Age vocabulary enhancement https://tdwg.github.io/chrono/terms/