Biodiversity Data Journal :
Software description
The Supertree Toolkit 2: a new and improved software package with a Graphical User Interface for supertree construction
Corresponding author:
Academic editor: Matthew Yoder
Received: 10 Jan 2014 | Accepted: 25 Mar 2014 | Published: 26 Mar 2014
© 2014 Jon Hill, Katie Davis
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Hill J, Davis K (2014) The Supertree Toolkit 2: a new and improved software package with a Graphical User Interface for supertree construction. Biodiversity Data Journal 2: e1053. https://doi.org/10.3897/BDJ.2.e1053
Building large supertrees involves the collection, storage, and processing of thousands of individual phylogenies to create large phylogenies with thousands to tens of thousands of taxa. Such large phylogenies are useful for macroevolutionary studies, comparative biology, and conservation and biodiversity research. No easy-to-use and fully integrated software package currently exists to carry out this task. Here, we present a new Python-based software package that uses a well-defined XML schema to manage both data and metadata. It builds on previous versions by 1) including new processing steps, such as Safe Taxonomic Reduction, 2) using a user-friendly GUI that guides the user to complete at least the minimum information required and includes context-sensitive documentation, and 3) using a revised storage format that integrates both tree- and meta-data into a single file. These data can then be manipulated according to a well-defined, but flexible, processing pipeline using either the GUI or a command-line based tool. Processing steps include standardising names, deleting or replacing taxa, ensuring adequate taxonomic overlap, ensuring data independence, and safe taxonomic reduction. This software has been successfully used to store and process data consisting of over 1000 trees ready for analyses using standard supertree methods. This software makes large supertree creation a much easier task and provides far greater flexibility for further work.
Supertree, phylogeny, data curation, meta-data
Supertrees are large phylogenies created by amalgamating anywhere from tens to thousands of smaller source phylogenies. A number of algorithms exist for this, the most widely used being Matrix Representation with Parsimony (
Some workers have written and made available scripts that carry out one or more of the required processing steps (e.g.
Here, we present the next version of the Supertree Toolkit, which builds on the experience of the first version. We have rewritten all code and designed the software around a user interface that can carry out both data collection and processing. It contains a number of additional features over the original software: 1) new processing steps, such as Safe Taxonomic Reduction, 2) a user-friendly GUI that guides the user to complete at least the minimum information required and includes context-sensitive documentation, and 3) a revised storage format that integrates both tree- and meta-data into a single file. We will first detail the storage mechanism, based on RelaxNG XML, and the user interface features. We then cover the available processing pipeline steps and show some examples of their use.
Supertree Toolkit (STK)
The STK consists of three components: a Python module, a Graphical User Interface (GUI), and a Command Line Interface (CLI). The Python module contains all processing, importing and exporting functions. These functions deal with the Phyml format (see below) and are available in any Python environment by importing the supertree_toolkit module. The GUI and CLI then import this Python module and hook it to the interface by processing user options. In this way the core functionality can be tested using standard unit test infrastructure, and the interfaces are cleanly separated. A test suite of over 375 tests, which benchmarks the expected performance of the software, is included in the source code.
User interface
There are two user interfaces: a GUI for data entry and processing, and a CLI for data processing. The latter is useful for dealing with large datasets. The GUI is based on Diamond (
We have retained all the functionality of the previous version of the STK, which is detailed in
Metadata and file format
XML is an ideal way to store structured metadata. We build on the methods used by Spud (
Each dataset has a name and contains a number of "Sources" (Fig.
Data structure of the STK metadata. Each project consists of several sources, which in turn contain bibliographic information and one or more source trees. The blue boxes show the hierarchy for a single source tree. The data structure has been simplified here and more meta data can be stored for each source tree.
The result of this schema is a single XML data file that contains all the metadata and source data required. This file is termed a Phyml (Phylogenetic Meta Language) file and can be parsed by any standard XML parser.
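Because the file is plain XML, it can be read with a general-purpose parser. The following is a minimal sketch using Python's standard library. The element and attribute names (`dataset`, `source`, `year`, `tree`, `newick`, `characters`) are illustrative assumptions chosen to mirror the hierarchy described above; they are not the actual STK schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment. The element and attribute names
# are assumptions for illustration only, not the real STK schema.
EXAMPLE = """
<dataset name="example_project">
  <source name="Smith_2001">
    <year>2001</year>
    <tree>
      <newick>((A, B), (C, D));</newick>
      <characters>molecular</characters>
    </tree>
  </source>
  <source name="Jones_2005">
    <year>2005</year>
    <tree>
      <newick>((A, C), (B, E));</newick>
      <characters>morphological</characters>
    </tree>
  </source>
</dataset>
"""


def list_trees(xml_text):
    """Return (source name, year, Newick string) for every tree in the file."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for source in root.findall("source"):
        year = int(source.findtext("year"))
        for tree in source.findall("tree"):
            rows.append((source.get("name"), year, tree.findtext("newick")))
    return rows


for name, year, newick in list_trees(EXAMPLE):
    print(name, year, newick)
```

Keeping the tree string and its metadata under the same source element is what allows a single file to carry everything needed for later processing.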
Processing functions
There are a number of processing functions included in the STK. These can be chained together to construct a processing pipeline to collect, curate and process data (Fig.
Example of a processing pipeline that can be created with the STK. Data are collected and then are put through the processing pipeline in order to create a matrix. The resulting matrix (in either Nexus format (.nex) or Hennig format (.tnt)) can then be analysed in any suitable software such as PAUP* (
Data summary
This function creates a text summary of the data. The summary includes a taxa list, years of publication, characters used, and analyses used.
Clean data
Before and during processing, trees may become uninformative (i.e. contain no clades), for example after substitution of taxa or when dealing with polyphyletic taxa. This function checks that the data are suitable for processing and removes uninformative trees (and sources, if they no longer contain any trees). It should be run regularly on the data between processing steps.
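The check described above can be sketched as follows. This is an illustration only, not the STK implementation: trees are represented as nested tuples of taxon names, a tree counts as informative if some node below the root groups at least two but not all of its taxa, and the names `is_informative` and `clean_data` are hypothetical.

```python
def leaves(node):
    """Return the taxon names found at or below a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def is_informative(tree):
    """True if some clade groups at least two, but not all, of the taxa."""
    all_taxa = set(leaves(tree))

    def has_clade(node):
        if isinstance(node, str):
            return False
        taxa = set(leaves(node))
        if 2 <= len(taxa) < len(all_taxa):
            return True
        return any(has_clade(child) for child in node)

    return has_clade(tree)


def clean_data(sources):
    """Drop uninformative trees, then drop sources left with no trees."""
    cleaned = {}
    for name, trees in sources.items():
        kept = [tree for tree in trees if is_informative(tree)]
        if kept:
            cleaned[name] = kept
    return cleaned


# Hypothetical dataset: the star tree and the two-taxon tree carry no clades.
sources = {
    "Smith_2001": [(("A", "B"), "C"), ("A", "B", "C")],
    "Jones_2005": [("A", "B")],
}
print(clean_data(sources))
```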
Permute trees
When creating supertrees at species level, digitised trees need to account for the fact that some species may be polyphyletic. There are no formal mechanisms for dealing with this, so taxa can be encoded with a '%d' sign to designate them as polyphyletic (where d is a consecutive integer for each occurrence of the taxon). The 'permute trees' function generates all possible permutations of these trees so that a consensus tree of some kind can be created.
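One plausible reading of this step is sketched below: for every polyphyletic taxon, each permutation keeps exactly one of its marked placements and prunes the others. The nested-tuple tree representation, the placement of the marker at the end of the name (e.g. `A%1`), and the function names are assumptions made for illustration and are not taken from the STK source.

```python
from itertools import product


def leaves(node):
    """Return the taxon names found at or below a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def prune(node, drop):
    """Remove the leaves in `drop`, collapsing nodes left with one child."""
    if isinstance(node, str):
        return None if node in drop else node
    kept = [c for c in (prune(child, drop) for child in node) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def strip_marks(node):
    """Remove the '%d' marker from every leaf name."""
    if isinstance(node, str):
        return node.split("%")[0]
    return tuple(strip_marks(child) for child in node)


def permute_tree(tree):
    """Return one tree per combination of placements of the marked taxa."""
    groups = {}
    for leaf in leaves(tree):
        if "%" in leaf:
            groups.setdefault(leaf.split("%")[0], []).append(leaf)
    names = sorted(groups)
    marked = {leaf for name in names for leaf in groups[name]}
    permutations = []
    for choice in product(*(groups[name] for name in names)):
        pruned = prune(tree, marked - set(choice))
        permutations.append(strip_marks(pruned))
    return permutations


# Taxon A appears in two places, so two trees are produced.
for permuted in permute_tree((("A%1", "B"), ("A%2", "C"))):
    print(permuted)
```

The number of trees is the product of the number of placements of each marked taxon, so heavily polyphyletic datasets can generate many permutations.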
Substitute taxa
One of the most onerous tasks of supertree creation is ensuring a consistent taxonomy is used throughout. This requires the removal of synonyms, mis-spellings and other naming errors. The 'sub taxa' function allows substitution and deletion of taxa whilst maintaining the tree structure. Substitutions are aware of polyphyletic taxa and will collapse superfluous nodes when deleting taxa. This function is used throughout the processing.
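A minimal sketch of substitution and deletion on a tree is given below. It assumes trees are nested tuples of taxon names and that each old name maps to a list of replacements: an empty list deletes the taxon, one name renames it, and several names are inserted alongside its former siblings. The function `sub_taxa` here is a hypothetical stand-in, not the STK function itself; it does not handle the '%d' markers or remove duplicates created by a substitution.

```python
def _substitute(node, subs):
    """Return the list of nodes that replace `node` inside its parent."""
    if isinstance(node, str):
        return list(subs.get(node, [node]))
    children = []
    for child in node:
        children.extend(_substitute(child, subs))
    if len(children) <= 1:
        # An empty node disappears; a node with one child is superfluous
        # and is collapsed into its parent.
        return children
    return [tuple(children)]


def sub_taxa(tree, subs):
    """Apply substitutions and deletions, collapsing superfluous nodes."""
    result = _substitute(tree, subs)
    return result[0] if result else None


tree = (("A", "B"), ("C", "D"))

# Rename A (e.g. a synonym) and delete D; the (C, D) node collapses to C.
print(sub_taxa(tree, {"A": ["A_new"], "D": []}))

# Replace A with two taxa, which join B in the same clade.
print(sub_taxa((("A", "B"), "C"), {"A": ["A1", "A2"]}))
```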
Data independence
It has previously been noted that all data included in supertree analyses should be independent of each other. Here, we define non-independent data as datasets that contain a subset of the same taxa and use identical characters. This function flags source trees that are subsets of others (and can automatically remove them if required) and flags those that are identical (i.e. same taxa and characters).
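The definition above translates directly into set comparisons. The sketch below assumes each source tree has already been reduced to its set of taxa and its set of characters; the function name and the input layout are hypothetical.

```python
from itertools import combinations


def check_independence(trees):
    """Flag pairs of trees that share characters and overlap completely.

    `trees` maps a tree name to a (taxa, characters) pair of sets.
    Returns (identical, subsets), where each subset entry is
    (smaller tree, larger tree).
    """
    identical, subsets = [], []
    for a, b in combinations(sorted(trees), 2):
        taxa_a, chars_a = trees[a]
        taxa_b, chars_b = trees[b]
        if chars_a != chars_b:
            continue  # different characters: treated as independent
        if taxa_a == taxa_b:
            identical.append((a, b))
        elif taxa_a < taxa_b:
            subsets.append((a, b))
        elif taxa_b < taxa_a:
            subsets.append((b, a))
    return identical, subsets


trees = {
    "t1": ({"A", "B", "C"}, {"cytb"}),
    "t2": ({"A", "B", "C", "D"}, {"cytb"}),
    "t3": ({"A", "B", "C"}, {"cytb"}),
    "t4": ({"A", "B"}, {"morphology"}),
}
identical, subsets = check_independence(trees)
print("identical:", identical)
print("subsets:", subsets)
```

Here `t4` is not flagged even though its taxa are a subset of the others, because it uses different characters.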
Replace genera
This is one of the final steps of the pipeline. After all other processing, some genus-level taxa may be left in the source trees. This function replaces those genera with a polytomy of the species from that genus that are already in the dataset. Note that this assumes a species-level tree is to be created; the step can be omitted if this is not the case.
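A sketch of this replacement is shown below. It assumes trees are nested tuples, species are named `Genus_species`, and a name without an underscore is a genus. These conventions and the function name are illustrative assumptions; the sketch also does not guard against a species that is already present elsewhere in the same tree.

```python
def replace_genera(node, dataset_taxa):
    """Replace genus-level leaves with a polytomy of known species."""
    if isinstance(node, str):
        if "_" in node:
            return node  # already a species-level name
        species = sorted(t for t in dataset_taxa if t.startswith(node + "_"))
        if not species:
            return node  # no species known for this genus; leave unchanged
        if len(species) == 1:
            return species[0]
        return tuple(species)  # unresolved polytomy of all known species
    return tuple(replace_genera(child, dataset_taxa) for child in node)


# Taxa found anywhere in the (hypothetical) dataset.
dataset_taxa = {"Gallus_gallus", "Gallus_varius", "Anas_acuta", "Struthio_camelus"}

tree = (("Gallus", "Anas_acuta"), "Struthio_camelus")
print(replace_genera(tree, dataset_taxa))
```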
Data overlap
In order to create a supertree all source trees should exhibit sufficient taxonomic overlap (
Create subset
One of the novelties of the STK is that it can be used to create subsets of the whole dataset, based on the metadata. For example, all trees that used molecular characters can be extracted and used to create a new dataset. Similarly, publication year, author, or analysis type can all be used to create subsets. These can be used to create independent supertrees, and the effect of including, say, only molecular data can be compared to the supertree generated from the whole dataset.
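Subsetting on metadata amounts to filtering tree records by their fields. The sketch below assumes each tree is held as a dictionary of metadata; the field names and the `create_subset` signature are hypothetical and are not the STK interface.

```python
def create_subset(records, **criteria):
    """Return the records whose metadata satisfy every criterion.

    Each criterion is either a value to match exactly or a function
    that takes the stored value and returns True or False.
    """
    def matches(record):
        for key, wanted in criteria.items():
            if key not in record:
                return False
            value = record[key]
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True

    return [record for record in records if matches(record)]


# Hypothetical records; the field names are illustrative only.
TREES = [
    {"source": "Smith_2001", "year": 2001, "characters": "molecular", "newick": "((A,B),C);"},
    {"source": "Jones_2005", "year": 2005, "characters": "morphological", "newick": "((A,C),B);"},
    {"source": "Lee_2010", "year": 2010, "characters": "molecular", "newick": "((B,C),A);"},
]

molecular = create_subset(TREES, characters="molecular")
recent_molecular = create_subset(TREES, characters="molecular", year=lambda y: y >= 2005)
print([r["source"] for r in molecular])
print([r["source"] for r in recent_molecular])
```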
Create matrix
One of the key functions of the STK is to create a matrix for supertree analysis from the input source trees. This function can generate a matrix in a number of formats and also output a single treefile containing all trees in the dataset.
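For Matrix Representation with Parsimony, the matrix is conventionally built with Baum and Ragan coding: each clade in each source tree becomes one column, scored 1 for taxa inside the clade, 0 for taxa in that tree but outside the clade, and ? for taxa absent from the tree. The sketch below illustrates that coding on nested-tuple trees. It is not the STK implementation, it omits the all-zero outgroup row that is usually added, and it does not write Nexus or Hennig files.

```python
def leaves(node):
    """Return the taxon names found at or below a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def clades(tree):
    """Return the taxon set of every informative clade, in preorder."""
    all_taxa = set(leaves(tree))
    found = []

    def visit(node):
        if isinstance(node, str):
            return
        taxa = set(leaves(node))
        if 2 <= len(taxa) < len(all_taxa):
            found.append(taxa)
        for child in node:
            visit(child)

    visit(tree)
    return found


def mrp_matrix(trees):
    """Build a Baum and Ragan coded matrix as a dict of taxon -> string."""
    taxa = sorted({taxon for tree in trees for taxon in leaves(tree)})
    rows = {taxon: "" for taxon in taxa}
    for tree in trees:
        in_tree = set(leaves(tree))
        for clade in clades(tree):
            for taxon in taxa:
                if taxon in clade:
                    rows[taxon] += "1"
                elif taxon in in_tree:
                    rows[taxon] += "0"
                else:
                    rows[taxon] += "?"
    return rows


trees = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
for taxon, row in mrp_matrix(trees).items():
    print(taxon, row)
```

The first tree contributes two columns, for (A, B) and (C, D), and the second contributes one, for (A, C).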
Safe Taxonomic Reduction (STR)
A new function for this version is Safe Taxonomic Reduction (STR). This is the only new functionality in this version over the previous version (
KED was funded by BBSRC grant (BB/K006754/1) and a Systematics Association SynTax grant ("Building the arthropod supertree interactively: Malacostracan crustaceans as a test case", 2010/11 funding round, awarded to Matthew Wills and Mark Wilkinson).
GNU GPL v3
The STK is available from Launchpad (http://launchpad.net/supertree-toolkit). There are two main bzr branches: a stable release version and the development version (trunk). Contributors are expected to branch trunk, develop their new feature and request a merge back into trunk. We encourage all such contributions. The STK is released under the GPLv3 and is available as source code, via Launchpad's PPA system, or as a Windows or MacOS X binary.
A full user manual, including a tutorial and the data for the tutorial, is available from the Launchpad website.
In future we aim to integrate web-based taxonomy databases to aid taxonomy and nomenclature standardisation. We are also developing a website to release all data that have been collected thus far. The STK will be integrated into that online resource. Finally, we intend to develop a simple tree editing and visualisation GUI such that no external software is required for the whole processing pipeline.
The authors wish to thank Steve Mitchell, Cyrille Delmer and Matthew Wills (all at the University of Bath) for help with testing and bug reporting. We would also like to thank Carl Boettiger, Karen Cranston, Graeme Lloyd and Matthew Yoder for comments that helped improve the manuscript.
JH wrote the Supertree Toolkit software and drafted the manuscript. KED drafted the manuscript and designed the software.
The STK User manual and tutorial dataset.