Corresponding author: Jon Hill (
Academic editor: Matthew Yoder
Building large supertrees involves the collection, storage, and processing of thousands of individual phylogenies to create large phylogenies with thousands to tens of thousands of taxa. Such large phylogenies are useful for macroevolutionary studies, comparative biology and in conservation and biodiversity. No easy to use and fully integrated software package currently exists to carry out this task. Here, we present a new Python-based software package that uses well defined XML schema to manage both data and metadata. It builds on previous versions by 1) including new processing steps, such as Safe Taxonomic Reduction, 2) using a user-friendly GUI that guides the user to complete at least the minimum information required and includes context-sensitive documentation, and 3) a revised storage format that integrates both tree- and meta-data into a single file. These data can then be manipulated according to a well-defined, but flexible, processing pipeline using either the GUI or a command-line based tool. Processing steps include standardising names, deleting or replacing taxa, ensuring adequate taxonomic overlap, ensuring data independence, and safe taxonomic reduction. This software has been successfully used to store and process data consisting of over 1000 trees ready for analyses using standard supertree methods. This software makes large supertree creation a much easier task and provides far greater flexibility for further work.
Supertrees are large phylogenies created by amalgamating anywhere from tens to thousands of smaller source phylogenies. A number of algorithms exist for this, the most widely used being Matrix Representation with Parsimony (
Some workers have written and made available scripts that carry out one or more of the required processing steps (e.g.
Here, we present the next version of the Supertree Toolkit that builds on the experience of the first version. We have rewritten all code and designed the software around a user interface that can carry out both data collection and processing. It contains a number of additional features over the original software which are 1) new processing steps, such as Safe Taxonomic Reduction, 2) user-friendly GUI that guides the user to complete at least the minimum information required and includes context-sensitive documentation, and 3) a revised storage format that integrates both tree- and meta-data into a single file. We will first detail the storage mechanism, based on RelaxNG XML, and the user interface features. We then cover the available processing pipeline steps and show some examples of their use.
Supertree Toolkit (STK)
The STK consists of three components: a Python module, a Graphical User Interface (GUI), and a Command Line Interface (CLI). The python module contains all processing, importing and exporting functions. These functions deal with the Phyml format (see below) and are available in any Python environment by importing the supertree_toolkit module. The GUI and CLI then import this Python module and hook it to the interface by processing user options. In this way the core functionality can be tested by using standard unit test infrastructure and the interfaces are cleanly separated. A test suite of over 375 tests is included in the source code which benchmark the expected performance of the software.
There are two user interfaces: a GUI for data entry and processing, and a CLI for data processing. The latter is useful for dealing with large datasets. The GUI is based on Diamond (
We have maintained all the previous functionality of the previous version of the STK which are detailed in
XML is an ideal way to store structured metadata. We build on the methods used by Spud (
Each dataset has a name and contains a number of "Sources" (Fig.
The result of this schema is a single XML datafile that contains all metadata and source data required. This file is termed a Phylml (Phylogenetic Meta Language), which can be parsed by any standard XML parser.
There are a number of processing functions included in the STK. These can be chained together to construct a
Data summary – produce text summary of data, such as number of taxa, trees and characters. Clean data – check data and remove redundant data, such as non-informative trees. Permute all trees – remove non-monophyly from trees. Substitute taxa – perform substitutions or deletions of taxa. Data independence check – check that all source trees are independent of each other. Data overlap – check the taxonomic overlap (Fig. Replace genera – replace generic-level taxa with polytomies of all species in that genus that are in the dataset. Create matrix – create a MRP matrix (Baum and Ragan coding) of the dataset. Create subset – create a subset of this dataset, e.g. only certain years or data types. STR – perform safe taxonomic reduction on the dataset. This is new functionality over the previous version of the STK (
This function creates a text summary of the data. The summary includes a taxa list, years of publication, characters used, and analyses used.
Before and during processing trees may become uninformative (i.e. contain no clades), for example after substitution of taxa, or when dealing with polyphyletic taxa. This function checks that the data are suitable for processing and removes uninformative trees (and sources if they contain no trees) and should be run regularly on data between processing steps.
When creating supertrees at species level digitised trees need to account for the fact that some species may be polyphyletic. There are no formal mechanisms for dealing with this so taxa can be encoded with a '%d' sign to designate them as polyphyletic (where d is a consecutive integer for each taxon). The 'permute trees' function generates all possible permutations of these trees to enable a consensus tree of some kind to be created.
One of the most onerous tasks of supertree creation is ensuring a consistent taxonomy is used throughout. This requires the removal of synonyms, mis-spellings and other naming errors. The 'sub taxa' function allows substitution and deletion of taxa whilst maintaining the tree structure. Substitutions are aware of polyphyletic taxa and will collapse superfluous nodes when deleting taxa. This function is used throughout the processing.
It has previously been noted that all data included in supertree analyses should be independent of each other. Here, we defined non-independent data as being datasets that contain a subset of the same taxa and use identical characters. This function flags source trees that are subsets of others (and can automatically remove them if required) and flags those that are identical (i.e. same taxa and characters).
This is one of the final steps of the pipeline. After all processing some taxa at genus level may be left in the source trees. This function replaces those genera with a polytomy of species already in the dataset. Note that this assumes a species level tree is to be created and this step can be omitted if this is not the case.
In order to create a supertree all source trees should exhibit sufficient taxonomic overlap (
One of the novelties of the STK is that it can be used to create subsets of the whole dataset, based on the metadata. For example, all trees that used molecular character can be extracted and used to create a new dataset. Similarly publication year, author, or analysis type can all be used to create subsets. These can be used to create independent supertrees and the effect of including, say, only molecular data, can be compared to the supertree generated from the whole dataset.
One of the key functions of the STK is to create a matrix for supertree analysis from the input source trees. This function can generate a matrix in a number of formats and also output a single treefile containing all trees in the dataset.
A new function for this version is Safe Taxonomic Reduction (STR). This is the only new functionality in this version over the previous version (
KED was funded by BBSRC grant (BB/K006754/1) and a Systematics Association SynTax grant ("Building the arthropod supertree interactively: Malacostracan crustaceans as a test case"2010/11 funding round awarded to Matthew Wills and Mark Wilkinson).
Homepage:
Download page:
Bug database:
Platform: Linux, Windows, MacOS X
Programming language: Python
Type: bzr
Browse URI:
Location: lp:supertree-toolkit
Other
GNU GPL v3
The STK is available from Launchpad (
A full user manual, including a tutorial and data for the tutorial are available from the
In future we aim to integrate web-based taxonomy databases to aid taxonomy and nomenclature standardisation. We are also developing a website to release all data that have been collected thus far. The STK will be integrated into that online resource. Finally, we intend to develop a simple tree editing and visualisation GUI such that no external software is required for the whole processing pipeline.
The authors wish to thank Steve Mitchell, Cyrille Delmer and Matthew Wills (all at the University of Bath) for help with testing and bug reporting. We would also like to thank Carl Boettiger, Karen Cranston, Graeme Lloyd and Matthew Yoder for comments that helped improve the manuscript.
JH wrote the Supertree Toolkit software and drafted the manuscript. KED drafted the manuscript and designed the software.
The STK GUI, based on Diamond (
Data structure of the STK metadata. Each project consists of several sources, which in turn contain bibliographic information and one or more source trees. The blue boxes show the hierarchy for a single source tree. The data structure has been simplified here and more meta data can be stored for each source tree.
Example of a processing pipeline that can be created with the STK. Data are collected and then are put through the processing pipeline in order to create a matrix. The resulting matrix (in either Nexus format (.nex) or Hennig format (.tnt)) can then be analysed in any suitable software such as PAUP* (
Result of the taxonomic overlap check which highlights which source trees are not sufficiently well connected to the rest of the dataset.
Manual and tutorial data
Mix: PDF, .nex and .phyml
The STK User manual and tutorial dataset.
File: oo_6193.zip