Corresponding author: Jonathan A. Rees (
Academic editor: Jeremy Miller
Taxonomy and nomenclature data are critical for any project that synthesizes biodiversity data, as most biodiversity data sets use taxonomic names to identify taxa. Open Tree of Life is one such project, synthesizing sets of published phylogenetic trees into comprehensive summary trees. No single published taxonomy met the taxonomic and nomenclatural needs of the project. Here we describe a system for reproducibly combining several source taxonomies into a synthetic taxonomy, and we discuss the challenges of taxonomic and nomenclatural synthesis for downstream biodiversity projects.
NSF AVAToL (Assembling Visualizing, and Analyzing the Tree of Life) DEB-1208809
Automated and community-driven synthesis of the tree of life
Duke University
The authors declare no conflicts of interest.
Any large biodiversity data project requires one or more taxonomies for discovery and data integration purposes, as in "find occurrence records for primates" or "find the taxon record associated with this sequence" (
The Open Tree of Life project consists of a set of tools for:
synthesizing summary phylogenetic trees ('synthetic trees') from a corpus of phylogenetic tree inputs (input trees) matching groupings in synthetic trees with higher taxa (such as supplementing synthetic trees with taxa obtained only from taxonomy.
The outcome is one or more synthetic trees combining phylogenetic and taxonomic knowledge. Fig.
Although Open Tree is primarily a phylogenetic tree aggregation effort, it requires a reference taxonomy that can support each of these functions.
For synthetic tree synthesis (1), we use the taxonomy for converting OTUs (operational taxonomic units, or 'tips') on input trees to a canonical form. Synthetic tree construction requires that an input tree OTU be matched with an OTU from another input tree
For higher taxon associations (2), we compare the groupings in the synthetic tree to those in the taxonomy.
For supplementation (3), only a relatively small number of described taxa are represented in input trees (currently about 200,000 in the phylogenetic corpus out of two million or more known taxa), so the taxonomy provides those that are not. The large complement of taxonomy-only taxa can be 'grafted' onto a synthetic tree in phylogenetically plausible locations based on how they relate taxonomically to taxa that are known from input trees.
This overall program dictates what we should be looking for in a reference taxonomy. In addition to the technical requirements derived from the above, we have two additional requirements coming from a desire to situate Open Tree as ongoing infrastructure for the evolutionary biology community, rather than as a one-off study. Following are all five requirements:
An additional goal is that the process should be reproducible and transparent. Given the source taxonomies, we should be able to regenerate the taxonomy, and taxon records should provide information about the taxonomic sources from which it is derived.
No single available taxonomic source meets all requirements. The NCBI taxonomy has good coverage of OTUs, provides a rich source of phyogenetically informed higher taxa, and is open, but its taxonomic coverage is limited to taxa that have sequence data in GenBank (only about 360,000 NCBI species having standard binomial names at the time of this writing). Traditional all-life taxonomies such as Catalogue of Life, IRMNG (
These requirements cannot be met in an absolute sense; each is a 'best effort' requirement subject to availability of project resources.
Note that the Open Tree Taxonomy is
There are probably about a dozen public all-life taxonomy compilations. The methods of assembly and curation are documented for only a few of these. We assume that most are evolving databases that are extended and maintained by a combination of single-record operations and some amount of
Had an available all-life taxonomy met all of Open Tree's requirements, the project would have used it. Unfortunately, for each one, there is at least one Open Tree requirement that goes beyond what the taxonomy provides.
An important part of any synthesis method is name-string parsing and matching, and this is the focus of the Global Names Architecture (
The Taxonomy Tree Tool (TTT,
The conventional approach to meeting the requirements stated in the introduction would be to create a database, copy the first taxonomy into it, then somehow merge the second taxonomy into that, repeating for further sources if necessary. However, it is not clear how to meet the ongoing update requirement under this approach. As the source taxonomies change, we would like for the combined taxonomy to contain only information derived from the latest versions of the sources, without residual information from previous versions. Many changes to the sources are corrections, and we do not want to retain information that has been corrected or superseded by a later version of a source.
Rather than maintain a database of taxonomic information, we instead developed a process for assembling a taxonomy from two or more taxonomic sources. With a repeatable process, we can generate a new combined taxonomy version from new source taxonomy versions
In the following, any definite claims or measurements refer to the Open Tree reference taxonomy version 3.0.
source taxonomy: imported taxonomic source (NCBI taxonomy, etc.) workspace: data structure for creation of the reference taxonomy name-string: name of one or more taxa, without author information, considered as a sequence of characters, without association with any particular description, or nomenclatural code node: a data record intended to correspond to a taxon. Records name-strings, authorship, parent node, optional rank, optional annotations. If a workspace node, it originates from a single source taxonomy, and records its source (provenance) and its alignments to others parent (node): the nearest enclosing node within a given node's taxonomy tip: a node that is not the parent of any node primary name-string: one particular name-string of a node. Each node has exactly one primary name-string homonym name-string: a name-string that belongs to multiple nodes within the same taxonomy. This is analogous to the nontechnical meaning of 'homonym' and is not to be confused with 'homonym' in the nomenclatural sense, which only applies within a single nomenclatural code. Nomenclatural homonyms and hemihomonyms ( synonym name-string (of a node): a non-primary name-string image (of a node n'): the workspace node corresponding to n'
This section is an overview of the taxonomy assembly method. Several generalities stated here are simplifications; the actual method (described later) is significantly more involved.
We start with a sequence of source taxonomies S1, S2, ..., Sn, ordered by priority. Priority is the means by which conflicts between sources are resolved, and therefore has a profound effect on the outcome of assembly. If a curator judges S to be more accurate or otherwise "better" than S', then S will occur earlier in the priority sequence than S' and its information supersedes that from later sources. Curators (either project personnel or participants in Open Tree workshops and online forums) determine priority based on their taxonomic expertise. Source taxonomies are sometimes split into pieces in order to establish different priorities for different parts. Priority choice by curator is a fragile and subjective aspect of the method, but we could not identify any other information available at scale that could be brought to bear on conflict resolution.
We define an operator for combining taxonomies pairwise, written schematically as U = S + S', and apply it from left to right:
U0 = empty, U1 = U0 + S1, U2 = U1 + S2, U3 = U2 + S3...
The combination S + S' is formed in two steps:
A mapping or A
As a simple example, consider a genus represented in both taxonomies, but containing different species in the two:
S = (b,c,d)a, S' = (c,d,e)a
S and S' each have four nodes. Suppose c, d, and a in S' are aligned to c, d, and a in S. The only unaligned node is e, which is a sibling of c and d and therefore grafted as a child of a. After the merge step, we have:
S + S' = (b,c,d,e)a
One might call this merge heuristic 'my sibling's sibling is my sibling' or 'transitivity of siblinghood'.
This is a very common pattern. Fig.
The other merge method is an
The vast majority of alignment and merge situations are simple, similar to the examples shown in Fig.
We build the taxonomy from ten sources. Some of these sources are from taxonomy projects, while others were manually assembled based on recent publications. As described above, OTT assembly is dependent on the input order of the sources - higher ranked inputs take priority over lower ranked inputs. Table
We excluded SILVA's plant, animal, and fungal branches from OTT because these groups are well covered by other sources and poorly represented in SILVA. For example, SILVA has only 299 taxa in
As a secondary heuristic, records from GBIF that originate from PaleoDB, and do not come from any other taxonomic source, are annotated extinct. This is not completely reliable, as some PaleoDB taxa are extant.
animals, plants, GBIF backbone records that originate from IRMNG (IRMNG is imported separately) GBIF backbone records that originate from IPNI GBIF backbone records whose taxonomic status is 'doubtful' GBIF backbone records for infraspecific taxa (subspecies, variety, form) IRMNG records whose nomenclatural status is 'nudum', 'invalid', or any of about 25 similar designations NCBI Taxonomy records that have no potential for unification with OTUs in phylogenetic studies: those with name-strings containing 'insertion sequences', 'artificial librarries', 'transposons', or any of about 15 similar designations
The IPNI and IRMNG derived GBIF records are suppressed because they include many invalid names. We pick up most of the valid names from other sources, such as the direct IRMNG import, so this is not a great loss. Although GBIF's original taxonomic sources indicate which names are known to be invalid, this information is not provided by the GBIF backbone. Note that the GBIF backbone might import the same name from more than one source, but its provenance information only lists one of the sources. We suppress the record if that one source is IPNI or IRMNG.
Each source taxonomy has its own import procedure, usually a file download from the provider's web site followed by application of a script that converts the source to a common internal form for import (a set of nodes, see terminology section). Given the converted source files, the taxonomy can be read by the OTT assembly procedure.
After each source taxonomy is loaded, the following normalizations are performed:
Diacritics removal - accents, umlauts, and other diacritic marks are removed in order to improve name matching, as well as to follow the nomenclatural codes, which prohibit them. The original name-string is kept. Child taxa of "containers" in the source taxonomy are made to be children of the container's parent. "Containers" are groupings in the source that don't represent taxa, for example nodes named "incertae sedis" or "environmental samples". The members of a container aren't more closely related to one another than they are to the container's siblings; the container is only present as a way to say something about the members. The fact that a node had originally been in a container is recorded as a flag on the child node. When a subgenus X has the same name-string as its containing genus, its name-string is changed to "X subgenus X". This follows a convention used by NCBI Taxonomy and helps distinguish the two taxa later in assembly. Sibling taxa with the same name-string are combined.
The normalized versions of the taxonomies then become the input to subsequent processing phases.
This section and the next give details of the taxonomy combination method introduced above.
OTT is assembled in a temporary work area or
As described above, source taxonomies are processed (aligned and merged) in priority order. For each source taxonomy,
A set of
capitalization and spelling repairs (e.g. change 'sordariomyceta' to 'Sordariomyceta') addition of synonym name-strings to facilitate later matching (e.g. ' selecting a different name-string as primary (e.g. 'Choanomonada' to ' deletions (e.g. removing synonym ' merges to repair redundancies in the source (e.g. rename nodes to avoid confusing homonym name-strings (e.g. there are two Cyanobacterias in SILVA, one a parent of the other; the parent is renamed to its NCBI name ' alignments when names differs (Diatomea is alignments to override automated alignment rules (
In the process of assembling the reference taxonomy, about 300
Given a source node, the alignment procedure begins by finding the nodes in the workspace that it could
(This is the ordinary language use of the word 'candidate', unrelated to 'candidatus' as used in prokaryote taxonomy.)
It follows that if the workspace has multiple nodes with the same name-string (homonym name-strings), all of these nodes will become candidates for every source node that also has that name-string.
The purpose of the alignment phase is to choose a single correct candidate for each source node, or to reject all candidates if none is correct. For over 97% of source nodes, there are no candidates or only one candidate, and selection is fairly simple, but the remaining nodes require special treatment.
Similarly,
Once we have a list of candidates, we apply a set of heuristics in an attempt to find a single candidate, and thereby align a source node n' with a workspace node n. The heuristics are as follows, presented in the order that we apply them in the alignment process:
Each heuristic, when presented with a source node and a candidate (workspace node), answers 'yes', 'no', or 'no information'. 'Yes' means that according to the rule, the two nodes refer to the same taxon, 'no' means they refer to different taxa, and 'no information' means that this rule provides no information as to whether the nodes refer to the same taxon.
The answers are assigned numeric scores of 1 for yes, 0 for no information, and -1 for no. A candidate that a heuristic gives a no is eliminated, one that is unique in getting a yes is selected, and if there are no yeses or no unique yes, more heuristics are consulted.
The heuristics are applied in the order in which they are listed above. The outcome is sensitive to the ordering. The ordering is forced to some extent by internal logic, but overall the ordering was determined by trial and error.
If there is a single candidate that is not rejected by any heuristic, it is aligned to that candidate.
More specifically, the method for applying the heuristics is as follows:
Start with a source node N and its set C of workspace node candidates C1 ... Cn. For each heuristic H as listed above: For each candidate Ci currently in C, use H to obtain the score H(N, Ci) Let Z = the highest score from among the scores H(N, Ci) If Z < 0, we are done - no candidate is suitable Let C' = those members of C that have score Z If Z > 0 and C' contains only one candidate, we are done (match is that candidate) Otherwise, replace C with C' and proceed to the next heuristic If C is singleton after all heuristics are exhausted, its member is taken to be the correct match. Otherwise, the source node does not match unambiguously, and alignment fails.
If the alignment process ends with multiple candidates, there is an unresolvable ambiguity. If the ambiguous source node has no children, it is dropped, which is OK because it probably corresponds to one of the existing candidates and therefore would make no new contribution. If the ambiguous source node has children, it is treated as unaligned and therefore new, increasing the number of workspace nodes having that name-string. This could easily be wrong because it is so unlikely that the source node really represents a distinct taxon. Usually, the subsequent merge phase determines that the grouping is not needed because it inconsistent or can be 'absorbed', and it is dropped. If it is not dropped, then there is a troublesome situation that calls for manual review.
As an example of an unaligned tip, consider GBIF
The scientific purpose of taxonomies is to understand taxa as historical and predictive biological abstractions. We only understand the identity of a taxon, that is, its membership or 'circumscription', through a description. For each taxon record in a taxonomy, there is a described taxon, but finding that description can be very challenging and is not at this time possible at scale (although see
The large scale taxonomic sources we use provide limited help in sorting out the connection from taxon record to description. This represents a vast lost opportunity. Some halfway measures are possible. The example above was found by looking at synonyms of same-named taxa in different sources: if a name X is a synonym of name Y in one source, but X and Y have separate taxon records in another, then we can infer that the sense of name Y differs between the sources.
Resolving name usages to descriptions is not a problem we're in a position to solve. And even if it were solved for the taxonomy, the phylogenetic tree files that are the input to Open Tree rarely provide the information that would help a curator choose the right taxon (record). We have made a deliberate pragmatic choice to ignore this problem, and to assume congruence of taxa in situations when name and hierarchical context agree, admitting that this practice can lead to errors in both taxonomic assembly and in phylogenetic synthesis.
After the alignment phase, we are left with the set of source nodes that could not be aligned to the workspace. The next step is to determine if and how these (potentially new) nodes can be merged into the workspace.
The combined taxonomy (U, above) is constructed by adding copies of unaligned nodes from the source taxonomy S' one at a time to the workspace, which initially contains a copy of S. Nodes of S' therefore correspond to workspace nodes in either of two ways: by mapping to a copy of an S-node (via the S'-S alignment), or by mapping to a copy of an S'-node (when there is no S'-S alignment for the S'-node).
As described above, each copied S'-node is part of either a graft or an insertion. A graft or insertion rooted at r' is attached to the workspace as a child of the nearest common ancestor node of r''s siblings' images. A graft is flagged
The following schematic examples illustrate each of the cases that come up while merging taxonomies. Taxonomy fragments are written in Newick notation (
Case 1: ((a,b)x,(c,d)y)z + ((c,d)y,(e,f)w)z = ((a,b)x,(c,d)y,(e,f)w)z
This is a simple graft. The taxon w does not occur in the workspace, so it and its children are copied. The workspace copy of w is attached as a sibling of its siblings' images: its sibling is y in S', which is aligned to y in the workspace, so the copy becomes a child of y's parent, or z.
Case 2: ((a,b)x,(c,d)y)z + (a,b,c,d)z = ((a,b)x,(c,d)y)z
No nodes are copied from S' to the workspace because every node in S' is aligned to some node in S - there are no nodes that
Case 3:(a,b,c,d)z + ((a,b)x,(c,d)y)z = ((a,b)x,(c,d)y)z
Supposing x and y are unaligned, then x and y from S' insert into the classification of z. The workspace gets copies of these two S'-nodes.
Case 4: ((a,b)x,(c,d)y)z + (a,b,c,d,e)z = ((a,b)x,(c,d)y,?e)z
In this situation, we don't know where to put the unaligned taxon e from S': in x, in y, or in z (sibling to x and y). The solution used here is to add e to z and mark it as
For example, family
Case 5: (a,b,c,d,e)z + ((a,b)x,(c,d)y)z = (a,b,c,d,e)z
We don't want to lose the fact from the higher priority taxonomy S that e is a proper child of z (i.e. not
So that we have a term for this situation, say that x is
Case 6: ((a,b)x,(c,d)y)z + ((a,c)p,(b,d,e)q)z = ((a,b)x,(c,d)y,?e)z
If the source has a hierarchy that is incompatible with the one in the workspace, the conflicting source nodes are ignored, and any unaligned nodes (e) become
For example, when WoRMS is merged, the workspace has, from NCBI,
((
and the classification given by WoRMS is
((
That is, NCBI groups
The test for compatibility is very simple: a source node is incompatible with the workspace if the nodes that its aligned children align with do not all have the same parent.
A new workspace node gets its data from the unaligned source node, including its name-strings and primary name-string designation. The merge phase records, in each workspace node, a reference to the unaligned source taxonomy node from which it was copied, allowing creation of an actionable hyperlink. This provenance tracking has proven to be quite valuable.
For aligned nodes, source name-strings that were not already present are added to the set of workspace node name-strings (also tracking the relevant source node for each). References to aligned source nodes are recorded in workspace nodes at this point, so that at the end of assembly, each workspace node records the initial source node that contributed it as well as all the source nodes that aligned to it.
After all source taxonomies are aligned and merged, we apply general
The final step is to assign unique, stable identifiers to nodes so that external links to OTT nodes will continue to function correctly after the previous OTT version is replaced by the new one.
Identifier assignment is done by aligning the previous version of OTT to the new version. As with the other alignments, there are scripted
The alignment is computed only for the purpose of assigning identifiers; the previous OTT version is not merged into the workspace. An identifier can only persist from one OTT version to the next if it continues to occur in some source taxonomy.
The assembly method described above yields the reference taxonomy that is used by the Open Tree of Life project. The taxonomy itself, the details of how the assembly method unrolls to generate the taxonomy, and the degree to which the taxonomy meets the goals set out for it are all of interest in assessing how, and how well, the method works. We will address each of these three aspects of the method in turn.
The methods and results presented here are for version 3.0 of the Open Tree Taxonomy (which follows five previous releases using the automated assembly method). The taxonomy contains 3,594,550 total taxa; 3,272,177 tips; and 277,365 internal nodes. 2,335,412 of the nodes have a Linnean binomial of the form
The number of species level homonym name-strings (2867) is surprisingly high. While a small number of these are legitimate, e.g.
As OTT is assembled, the alignment procedure processes every source node, either choosing an alignment target for it in the workspace based on the results of the heuristics, or leaving it unaligned. Fig.
We counted the frequency of success for each heuristic, i.e. the number of times that a particular heuristic was the one that accepted the winning candidate from among two or more candidates. Table
After assembly, the next step in the method is to merge the unaligned nodes into the workspace taxonomy. Of the 3,780,949 unaligned nodes, the vast majority (99%) are grafted into the workspace. The remaining nodes (<1%) are either insertions, absorptions or remain unmerged due to ambiguities.
We also examined the fate of nodes from each of the input taxonomies, and Table
The introduction sets out requirements for an Open Tree taxonomy. How well are these requirements met?
We set out to cover the OTUs in the Open Tree corpus of phylogenetic trees. The corpus contains published studies (each study with one or more phylogenetic trees) that are manually uploaded and annotated by Open Tree curators. The user interface contains tools that help curators map the OTUs in a study to taxa in OTT. Of the 3,242 studies in the Open Tree database, 2,871 have at least 50% of OTUs mapped to OTT. (A lower overall mapping rate usually indicates incomplete curation, not an inability to map to OTT.) These 2,871 studies contain 538,728 OTUs, and curators have mapped 514,346 to OTT taxa, or 95.5%.
To assess the reason for the remaining 4.5% of OTUs being unmapped, we investigated a random sample of ten OTUs. In three cases, the label was a genus name in OTT followed by "sp" (e.g. "
Of the 194,100 OTT records that are the targets of OTUs, 188,581 (97.2%) are represented in NCBI Taxonomy. If the Open Tree project had simply adopted NCBI Taxonomy instead of OTT, it would have met its OTU coverage requirement (but not the taxonomic coverage requirement). By comparison, GBIF covers 87.6%, and IRMNG covers 62.8%. The high coverage by NCBI reflects a preference among Open Tree curators for studies that use molecular phylogenetic evidence over those that don't.
Assessing whether OTT is more 'phylogenetically informed' than it otherwise might be is difficult. The phylogenetic quality of the taxonomy is determined by the taxonomic sources and their priority order. We have relied on the project's curators, who have a strong phylogenetic interest, to provide guidance on both. Following are examples of curator decision-making:
For microbes, SILVA is considered more phylogenetically sound than NCBI taxonomy, because the SILVA taxonomy is based on a recent comprehensive phylogenetic analysis. Priority of NCBI Taxonomy over the GBIF backbone is suggested by NCBI's apparent interest in phylogeny, reflected in NCBI Taxonomy's much higher resolution, its inclusion of phylogenetically important non-Linnaean groups such as Eukaryota, and by its avoidance of known paraphyletic groupings such as The Hibbett 2007 upper fungal taxonomy reflects, by construction, results from the most recent phylogenetic studies of
Ideally we would have a measure of 'phylogenetically informed' that we could use to compare OTT to other taxonomies, to test alternative constructions of OTT, and to check the forward progress of OTT. It is not clear what one would use as a standard against which to judge. The Open Tree project's synthetic tree of life is a candidate, but is not without issues (such as its own possible errors, and the fact that OTT is itself use in construction the synthetic tree). Ensuring that comparisons are meaningful, and comparable with one another, would be a technical challenge.
OTT has 2.3M binomials (presumptive valid species names), vs. 1.6M for Catalogue of Life (CoL) or 2.2M for GBIF. Since the GBIF source we used includes the Catalogue of Life, OTT includes all species in CoL. It is not clear whether the differences between OTT and GBIF, and between GBIF and CoL, are due to mostly to inclusion of additional accepted names, versus names that are not currently accepted (synonym, invalid, and so on); this is not a question we investigated.
This level of coverage would seem to meet Open Tree's taxonomic coverage requirement as well as any other available taxonomic source.
We aimed for a procedure that would allow simple re-building from sources, and also easy incorporation of new versions of sources. Re-building OTT version 3.0 from sources requires 17 minutes of real time. Our process currently runs on a machine with 16GB of memory; 8GB is not sufficient.
In the upgrade from 2.10 to 3.0, we added new versions of both NCBI and GBIF. NCBI updates frequently, so changes tend to be manageable and incorporating the new version was simple. In contrast, the version from GBIF represented both a major change in their taxonomy synthesis method. Many taxa disappeared, requiring changes to our ad hoc patches during the normalization stage. In addition, the new version of GBIF used a different taxonomy file format, which requires extensive changes to our import code (most notably, handling taxon name-strings that now included authority information).
We estimate the update from OTT 2.10 to OTT 3.0 required approximately three days of development time related to source taxonomy changes. This was greater than previous updates due to the changes required to handle the major changes in GBIF content and format.
All sources are available on the Internet without access controls. We considered legal restrictions that might apply to OTT’s particular use of the source materials, and are satisfied that they do not. In particular, copyright does not apply because the taxonomy is not creative expression in the sense of copyright law (
In any kind of compilation, sources need to be attributed regardless of legal considerations. For OTT sources this is accomplished in two ways. First, every record (node) has a hyperlink to the source taxonomy record from which it was originally copied, as well as one to each of the source records that aligned to it. Second, the release notes list the sources with version information and references to relevant publications and web sites.
The primary actionable information in the source taxonomies consists of name-strings, and therefore the core of our method is a set of heuristics that can handle the common problems encountered when trying to merge hierarchies of name-strings. These problems include expected taxonomic issues such as synonyms, homonyms, and differences in placement and membership between sources. They also include errors such as duplications, spelling mistakes, and misplaced taxa. The problem cases add up to over 100,000 difficult alignments when the total number of source records measures over 6 million.
Ultimately there is no fully automated and foolproof test to determine whether two nodes can be aligned - whether node A and node B, from different source taxonomies, are about the same taxon. The information to do this is in the literature and in databases on the Internet, but often it is (understandably) missing from the source taxonomies.
It is not feasible to investigate such problems individually, so the taxonomy assembly methods identify and handle thousands of 'special cases' in an automated way. We currently use only name-strings, rudimentary classification information, and (minimally) ranks to guide assembly. We note the large role that our hand-curated "separation taxonomy" played in the alignment phase. This is a set of taxa that are consistent across the various sources, and allow us to make the (seemingly obvious) determination "these two taxa are in completely separate groups, so do not align them".
We have also developed a system for curators to directly add new taxon records to the taxonomy from published phylogenies, which often contain newly described species that are not yet present in any source taxonomy. These taxon records include provenance information, including references explaining the taxon, and the identity of the curator. We expose this provenance information through the web site and the taxonomy API.
We also provide a feedback mechanism on the synthetic tree browser, and find that most of the comments left are about presence, absence, choice, and spelling of labels, rather than the topology of the synthetic tree. These are issues that are addressed by improvements to the taxonomy. Expanding this feature to capture this feedback in a more structured, and therefore machine-readable, format would allow users to directly contribute taxonomic patches to the system.
Given the unique goals of the Open Tree Taxonomy in comparison to most other taxonomy projects, it is difficult to compare OTT to other taxonomies in a meaningful way. The Open Tree Taxonomy is technically most similar to the GBIF taxonomy, in the sense that each is a synthesis of existing, overlapping taxonomies rather than a curated taxonomic database or one based on grafting. The GBIF method is yet unpublished (for basic information on the GBIF backbone see
The development of the assembly software has been driven by the needs of the Open Tree project, not by any concerted effort to create a widely applicable or theoretically principled tool. A system like this is never finished, and this one is in its infancy. There are endless opportunities for bringing additional techniques, methods, data, and code libraries to bear, and we have faced difficult choices in deciding where to put our effort. Following are some of the directions for development that could have the highest impact.
It is likely that alignment and merge could be improved by making better use of species proximity implied by the shape of the classification, and decreasing its reliance on the names of internal nodes. Better use of proximity might permit separation and identification of tips without use of a separation taxonomy, removing the need for the manual work of maintaining the separation taxonomy and the adjustment directives needed to align source taxonomies to it. An example is The alignment method should be extended to make use of authority information, when it is available. If name-strings match, or even if just species epithets match, then matching authority information is good evidence that the same taxon is meant. The form of authority information varies between sources, but could be normalized using the Global Names Parser ( Name-strings could also be analyzed to detect partial matches, e.g. matching on species epithets even when the genus disagrees, and spelling and gender variant recognition. Doing so would eliminate thousands of duplications. Other work on name matching, such as the Global Names Resolver ( The redescription problem described above should be addressed to the extent possible. Our handling of duplicate records in source taxonomies is incomplete and needs to be fixed. If two source nodes can be aligned to the same workspace node, then the duplication will not affect the workspace. But if there is no workspace node to align them to, the duplication persists in the workspace after the source taxonomy has been merged. There is special case logic when taxonomies are imported to fold together An assembly run can lead to a variety of error conditions and test failures. Currently these are difficult to diagnose, mainly for lack of technology for displaying the particular pieces of the sources, workspace, and assembly history that are relevant to the error. Once this information is surfaced it is usually not too difficult to work out a fix in the form of a patch or an improvement to the program logic. A small amount of automation could speed this kind of investigation and save curator time. The community curation should be developed, as mentioned above. Its success would depend on allowing users to test proposed changes and diagnose and repair any problems with them. Curators frequently request new taxonomy sources. The most frequently requested are improved fish, bird, plant, and paleontological sources. Community members have also suggested the Plazi TreatmentBank ( The presence of invalid and unaccepted names remains a significant problem. The information needed to detect them is available, and could be harvested. Basic usability features for application to new projects would include proper packaging of the application, and support for Darwin Core (
Future work on taxonomy aggregation should attempt a more rigorous and pluralistic approach to classification (
All source code is open source (licensed BSD 2-clause) and available on GitHub at
We have presented a method for merging multiple taxonomies into a single synthetic taxonomy. The method is designed to produce a taxonomy optimized for the Open Tree of Life phylogenetic tree synthesis project. Most taxonomy projects are databases of taxonomy information that are continuously updated by curators as new information is published in the taxonomic literature. In contrast, the Open Tree Taxonomy takes several of these curated taxonomies and assembles a synthetic taxonomy
We have also developed a system for curators to directly add new taxa to the taxonomy from published phylogenies. These taxon additions include provenance information, including the source of the taxon and identity of the curator. We expose this provenance information through the website and the taxonomy API. Most of the Open Tree feedback has been about taxonomy, and expanding this feature to other types of taxonomic information allows users to directly contribute expertise and allows projects to easily share that information.
Taxonomic information is certainly best curated at a scale smaller than "all life" by experts in a particular group. Therefore, producing comprehensive taxonomies is always a synthesis of curated taxonomies. We advocate for the type of methods being used by Open Tree and by GBIF, where synthesis of overlapping sources is done in a repeatable fashion from sources, allowing changed information in sources to be quickly included in the comprehensive taxonomy, and also allowing continuous improvement to the synthesis method. Provenance information is retained and presented as part of the synthetic taxonomy. This type of synthesis requires that source taxonomies be available online, either through APIs or by bulk download, in a format that can be easily parsed, and ideally without terms of use that prevent distribution and reuse of the resulting synthetic taxonomies.
We would like to thank Nico Franz, Mark Holder, and Alan Ruttenberg for their comments on early versions of this manuscript; Markus Döring, Tony Rees, and Paul Kirk for answering our many questions about source taxonomies; Cody Hinchliff for writing an early version of the assembly code and establishing the general approach to taxonomy combination; and Yan Wong and other users of the Open Tree system provided many helpful comments on the taxonomy. The reviewers' comments were very thoughtful and thorough and led to many important improvements to the presentation.
JAR designed and implemented the taxonomy assembly system; KAC and JAR wrote the paper.
The authors declare no conflicts of interest.
Role of taxonomy in assembly of the Open Tree of Life synthetic phylogenetic tree. Dotted lines in the synthetic tree are those that come only from taxonomy, while solid lines have phylogenetic support. The taxonomy both links disjoint phylogenies and adds taxa not present in input trees.
Merging taxonomies:
Fate of nodes as they move through the alignment procedure. Green, rounded boxes are endpoints that result in aligned nodes, while red, sqaure boxes are endpoints that result in unaligned nodes.
List of taxonomic sources: The ten sources used in v3.0 of Open Tree Taxonomy. Six are online taxonomic resources (SILVA (
Name | OTT Focus | Taxa | Synonyms | Priority order | Reasons |
separation taxa | life | 31 | 8 | 1 | |
SILVA | life | 78687 | 0 | 2 | P |
Hibbett 2007 |
|
227 | 0 | 3 | P |
Index Fungorum |
|
284973 | 157734 | 4 | P,T |
Schäferhoff 2010 |
|
119 | 0 | 5 | P |
WoRMS | 330412 | 223196 | 6 | P,T | |
NCBI | life | 1488029 | 719526 | 7 | O,P,T |
GBIF | life | 3273321 | 1143026 | 8 | T |
IRMNG | life | 1706655 | 685983 | 9 | T |
Open Tree curation | n/a | n/a | n/a | 10 | O |
Summary of Open Tree Taxonomy 3.
|
|
3594550 | Total taxon records (nodes) |
1842403 | Synonym records |
277365 | Internal (non-tip) nodes |
3272177 | Tips. |
3116485 | Rank of 'species' |
70886 | Below the rank of species (e.g. subspecies, variety) |
67070 | Above the rank of species that subtend no node of rank species |
2335412 | Name-string has the form of a Linnaean binomial Genus epithet |
9089 | Homonym name-strings |
2867 | Homonym name-strings where the nodes have species rank |
6110 | Homonym name-strings where the nodes have genus rank |
38 | Maximum nesting depth of any node in the taxonomy |
53287 | Maximum number of children for any node in the taxonomy |
12.96 | Branching factor (average number of children per internal node) |
45008 | Source taxa that were absorbed into a larger taxon |
317624 | Marked |
252600 | Annotated as being for an extinct taxon |
Frequency of success of alignment heuristics. In cases where there were multiple candidate nodes, this table lists the number of times that a particular heuristic was the one to select a single candidate. Heuristics are listed in the order in which they are applied. Success of an ealier heuristics means that a later heuristic is not used for a given node.
Alignment heuristic | Number nodes |
---|---|
Separation | 22126 |
Disparate ranks | 154 |
Lineage | 25688 |
Overlap | 7438 |
Proximity | 228 |
Same name-string | 84211 |
Fate of source nodes from each of the input taxonomies. Unaligned nodes are either copied into the workspace or absorbed. Aligned nodes are added to the workspace through grafting or insertion.
|
|
|
|
|
|
separation | 30 | 30 | 0 | 0 | 0 |
SILVA | 74400 | 74395 | 5 | 0 | 0 |
Hibbett 2007 | 227 | 226 | 1 | 0 | 0 |
Index Fungorum | 276262 | 276048 | 188 | 25 | 1 |
WoRMS | 327570 | 269029 | 57026 | 1283 | 232 |
Schäferhoff 2010 | 119 | 118 | 1 | 0 | 0 |
NCBI | 1320665 | 1198221 | 119532 | 2441 | 471 |
GBIF | 2451566 | 1640700 | 808757 | 1963 | 146 |
IRMG | 1561123 | 90746 | 1466929 | 3128 | 320 |
curated | 29 | 29 | 0 | 0 | 0 |
total | 6011991 | 3549542 | 2452439 | 8840 | 1170 |