Biodiversity Data Journal
Forum Paper
Corresponding author: Hong Cui (hongcui@email.arizona.edu)
Academic editor: Sarah Faulwetter
Received: 07 Sep 2018 | Accepted: 23 Oct 2018 | Published: 07 Nov 2018
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Cui H, Macklin J, Sachs J, Reznicek A, Starr J, Ford B, Penev L, Chen H (2018) Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production. Biodiversity Data Journal 6: e29616. https://doi.org/10.3897/BDJ.6.e29616
Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of these phenotypic data are published as free-text narratives that are not computable. This means that the complex relationships between the genome, the environment and phenotypes are largely inaccessible to analysis, and important questions related to the evolution of organisms, their diseases or their responses to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution for producing computable phenotypic data, for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future; and, most importantly, 3) it does not solve the problem of inter-curator variation (different curators interpreting/converting the same phenotype differently). Our empirical studies have shown that inter-curator variation is as high as 40%, even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project’s semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project based on this idea was funded by NSF ABI in July 2017. We seek readers’ input on, or critique of, the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
Controlled Vocabulary, Computable Phenotype Data, Data Quality, Phenotype Ontologies
Phenotypes are paramount for describing species, studying function and understanding organismal evolution. Recent advancements in computational technology have enabled large-scale, data-driven research, but its full potential has not been realised due to a lack of data. High-impact research, such as studying trait evolution and its relationship to phylogeny and the environment (e.g.
Textual phenotype descriptions that hold valuable information are continuously being published, yet they are not amenable to computation. Added to the massive amount of phenotype data sitting in older publications, these free-text character descriptions represent a major, under-utilised resource for integrating phenotypic data into modern, large-scale biological research projects that typically involve genomic, climatic and habitat data. These descriptive data are often variable in expression and terminology: different descriptions of the same character may appear to describe two different traits, or two different characters might be interpreted as one. Transforming such varied natural language expressions into computable data requires a process called ontologising, in which the semantics (meaning) of varied expressions are mapped to terms in an ontology and therefore made explicit (
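As a toy illustration of ontologising (not the project's actual software), the following Python sketch maps several free-text expressions onto shared, hypothetical ontology identifiers, so that different wordings of the same concept become directly comparable:

```python
# Hypothetical mapping from free-text phenotype terms to ontology term IDs.
# The IDs are placeholders; real ontologising would resolve terms against a
# curated ontology rather than a hand-made dictionary.
SYNONYM_MAP = {
    "utricle": "CPO:0000101",      # two names, one concept
    "perigynium": "CPO:0000101",
    "stout": "CPO:0000202",
    "strong": "CPO:0000202",
}

def ontologise(phrase: str):
    """Return the ontology ID for a free-text term, or None if unregistered."""
    return SYNONYM_MAP.get(phrase.strip().lower())

assert ontologise("Perigynium") == ontologise("utricle")  # same concept
```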
Currently, making free-text phenotype information computable requires highly trained post-doctoral researchers to manually ontologise the descriptions, aided by software applications. However, the manual curation of legacy descriptions is not a sustainable solution for phenotype data production, because it does not stop the continued publication of free-text phenotype descriptions that need semantic curation before use. If we assume that each of the estimated 750,000 biomedical papers published in English in 2014 (
Manual curation also does not address the fundamental causes of the large (~40%) variation in phenotype data manually curated by different workers (e.g.
As long as phenotype descriptions continue to be produced as free text, computable phenotype data will remain a major bottleneck holding back large-scale biological research. Given the varied usage of phenotype terms/expressions by different authors and the fact that the meanings of a term evolve over time, it is evident that the semantics of phenotypic characters (categorical or continuous) can be most accurately captured at the time of writing, by their authors. Any downstream process risks information loss or even misinformation.
We have been awarded funding to investigate a new paradigm of phenotype data production that is centred on description authors and supported by intuitive software tools, allowing them to compose semantically clear descriptions while contributing their vocabularies/expressions to a shared ontology for their taxon groups. This paradigm brings authors to the forefront of ontology construction, promotes clear expression, exposes all valid meanings of technical terms and encourages open collaboration and consensus building amongst scientists. While the proposed approach represents a major conceptual change in phenotype data authoring, the change can be introduced via software environments with which authors are already familiar, for example, Google Docs and Wikis. We will approach the project from the perspectives of social and software engineering, examining human social and collaborative behaviour (e.g. attribution and motivation) and software usability to identify factors that encourage or discourage users from adopting the approach. Although we will start with a test case using the plant genus Carex L. (“sedges”, family Cyperaceae), the project has the potential to change how biodiversity is described in general and to dramatically ease the production of computable phenotype data at a large scale.
Using the ongoing Carex revisionary work as the evaluation case for this project is an excellent choice because: 1) Carex is one of the largest genera of flowering plants, with close to 2,000 species exhibiting considerable variation; 2) a network of Carex experts already works closely together to prepare the revisions; 3) Carex is treated in both the Flora of North America and the Flora of China, from which we have previously extracted over 1,200 Carex morphological terms that will be used to build the initial Carex Phenotype Ontology (CPO) for scientists to improve; and 4) scientists on this project will use the large number of characters produced through this approach to expand their past research to a scale not possible before (
We also note that the larger academic and scientific research environment supports the premises of the proposed approach. The importance of computable phenotype data is widely recognised and data silos are being actively dissolved. Ontologies and other data publications are valued, and attribution methods are being actively examined to credit intellectual contributions to digital resource curation, such as the efforts by the International Society for Biocuration (http://biocuration.org), the OBO Foundry (http://www.obofoundry.org) and THOR (http://project-thor.eu). Publishers like Pensoft are actively seeking and welcoming new methods to stop the continued publication of legacy descriptions. With years of experience using digital tools and devices, scientists are expert users of digital collaborative environments (e.g. Wikis, Google Docs). The time is right to investigate a long-term solution to phenotype data production.
It is important to differentiate this approach from a standardisation approach, in which authors are limited to using a set of “standardised” terms selected by others. The proposed approach does not limit authors’ choices, but it requires the authors to register the meaning (i.e. semantics) of the terms in their descriptions in an ontology and to relate them to other existing terms, allowing accurate interpretation in the future. For example, a standardisation approach might require Joe to use the term strong when he wishes to say stout. In contrast, our approach might show Joe that stout has two related but different meanings: increased size and strong (not fragile). This would allow Joe to choose the most precise term to use (increased size, strong or stout) and, in turn, allow the reader, human or computer, to obtain the accurate meaning intended by the author. The key idea of the proposed approach is to make all valid meanings of a term clear and visible to a community of users and to encourage the user to filter and choose the terms with the most accurate meaning for their purposes.
When the user adds a term to the ontology, the online open Ontology Editor is invoked, presenting different patterns to relate the terms in semantic ways (e.g. assert utricle in Carex ≡ perigynium in Carex, spike is_a inflorescence, spikelet ≡ secondary spike or small spike, stout ≡ strong and increased size, weak ≡ decreased magnitude or decreased strength). Ontology design patterns (e.g.
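To make the flavour of such assertions concrete, here is a minimal sketch in Python using the rdflib library, with a hypothetical CPO namespace; the ≡ and is_a relations above map naturally onto owl:equivalentClass and rdfs:subClassOf:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical namespace for the Carex Phenotype Ontology (illustration only)
CPO = Namespace("http://example.org/cpo/")

g = Graph()
g.bind("cpo", CPO)

# Declare the classes involved
for cls in (CPO.spike, CPO.inflorescence, CPO.utricle, CPO.perigynium):
    g.add((cls, RDF.type, OWL.Class))

# spike is_a inflorescence  ->  rdfs:subClassOf
g.add((CPO.spike, RDFS.subClassOf, CPO.inflorescence))

# utricle (in Carex) ≡ perigynium (in Carex)  ->  owl:equivalentClass
g.add((CPO.utricle, OWL.equivalentClass, CPO.perigynium))

print(g.serialize(format="turtle"))
```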
The system detects that the user is attempting to add a substructure (apex) to multiple parent structures (leaf and leaflet). This triggers the system to suggest the non-specific structure pattern to the user. When the user confirms, the system will insert four assertions (four links in the graph) into the ontology automatically.
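One plausible reading of that expansion, as a hedged Python sketch (the helper and relation names are our assumptions, not the project's actual code): for each parent structure, the pattern creates a specific subclass of the generic substructure and links it to its parent, giving two assertions per parent and four in total:

```python
def expand_nonspecific_structure(sub: str, parents: list[str]):
    """One plausible expansion of the non-specific structure pattern:
    for each parent P, create a specific class "P sub" that is a kind of
    the generic substructure and a part of P (two assertions per parent)."""
    triples = []
    for parent in parents:
        specific = f"{parent} {sub}"          # e.g. "leaf apex"
        triples.append((specific, "is_a", sub))
        triples.append((specific, "part_of", parent))
    return triples

# "apex" under both "leaf" and "leaflet" yields four assertions:
for triple in expand_nonspecific_structure("apex", ["leaf", "leaflet"]):
    print(triple)
```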
These patterns are expected to greatly improve the predictability of the ontology, reduce variation and lower the barrier to entry for biologists. The software will auto-detect situations, whenever possible, in which a pattern may be useful; once the user confirms, the system will carry out the necessary changes on the user’s behalf.
Small ontology-building tasks, such as resolving conflicts amongst term definitions and relationships, can be broadcast via a simple mobile app for registered authors to resolve at their leisure. Technically challenging cases can be resolved with help from trained ontology engineers, for example, those of the Planteome Project (http://planteome.org) or the OBO Foundry.
The rewards to authors who adopt this new workflow include: (1) narrative descriptions in camera-ready form for publication; (2) a taxon-by-character matrix formulated in ontology terms, ready for publication. These can be published in partner journals (e.g. Pensoft journals) in a customisable, human-readable form (e.g. sentences or matrices) and in a variety of new ontologised formats, such as EQ statements (Entity-Quality, sketched below) in the Phenoscape Knowledgebase (http://kb.phenoscape.org) or RDF graphs (Resource Description Framework, a format used widely on the Semantic Web); and (3) formal attribution and increased citations. On the one hand, research has shown that studies that make their data available receive more citations than similar studies that do not (e.g.
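For readers unfamiliar with the EQ format, here is a minimal sketch of a single statement; the ontology identifiers below are placeholders, since real EQs would draw the entity from an anatomy ontology and the quality from PATO:

```python
# A toy Entity-Quality (EQ) statement: an anatomical entity paired with a
# quality term. All IDs are hypothetical placeholders.
eq_statement = {
    "taxon": "Carex hirta",                                    # example taxon
    "entity": {"label": "perigynium", "id": "CPO:0000101"},    # hypothetical
    "quality": {"label": "pubescent", "id": "PATO:0001234"},   # hypothetical
}

print(f"{eq_statement['taxon']}: {eq_statement['entity']['label']} "
      f"{eq_statement['quality']['label']}")
```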
Results from social and behavioural science research on computer-mediated collaborative work, online community building and consensus making (e.g.
We hypothesise that, with careful design of the user interface that takes into account user-friendliness, efficiency, user motivation and other social and behavioural factors, this approach will increase phenotype data quality, ontology quality and computation efficiency:
Data quality: improve the semantic clarity of new phenotype descriptions to dramatically reduce the scope of the subsequent ontologisation effort,
Ontology quality: quickly improve the coverage of the phenotype ontology for a particular domain (e.g. a taxonomic group) and
Computation efficiency: obtain ontologised matrices and/or EQ statements with higher consistency and hence support a wide range of applications.
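To make the notion of an ontologised matrix concrete, a toy sketch follows, with placeholder identifiers; because each character is anchored to an ontology term rather than a free-text label, matrices from different authors can be merged unambiguously:

```python
# Toy taxon-by-character matrix keyed by ontology term IDs (placeholders).
CHARACTERS = {"CPO:0000150": "perigynium pubescence"}

matrix = {
    "Carex hirta":      {"CPO:0000150": "pubescent"},
    "Carex utriculata": {"CPO:0000150": "glabrous"},
}

for taxon, states in matrix.items():
    for char_id, state in states.items():
        print(taxon, CHARACTERS[char_id], state, sep=" | ")
```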
Assuming this proof-of-concept system is successful, the approach can be applied to other science and engineering domains (e.g. biomedicine, geology, astrophysics). Individual domain ontologies can then be linked, based on shared concepts and terms, building powerful bridges for integration across domains, sciences and beyond.
Readers interested in learning more about our project and eventually evaluating our software prototypes can obtain further information from our GitHub project page (https://biosemantics.github.io/author-driven-production) or contact the authors. In summary, the goal of this project is to investigate the feasibility of transforming phenotype authors’ writing practice to produce computable phenotype data at the time of publication, with increased speed, scale, quality and consistency, while collectively curating phenotype ontologies that reflect a community consensus. Through careful user-experience research, we will also identify ways to lower the barrier to entry and promote user adoption of the new practice. If publishers adopt this new approach, we believe the ultimate goal of producing massive amounts of high-quality phenotype data for the entire scientific community can be achieved. We seek readers’ input on, or critique of, the proposed approaches to help achieve community-based computable phenotype data production in the near future.
ABI innovation: Authors in the Driver's Seat: Fast, Consistent, Computable Phenotype Data and Ontology Production
University of Arizona
Cui, Macklin and Sachs contributed the initial idea of the project. All authors edited and reviewed the manuscript.
The authors have declared that no competing interests exist.