Taxonaut: an application software for comparative display of multiple taxonomies with a use case of GBIF Species API

Abstract Background The Species API of the Global Biodiversity Information Facility (GBIF) provides public access to taxonomic data aggregated from multiple data sources. Each data source follows its own classification which can be inconsistent with classifications from other sources. Even with a reference classification e.g. the GBIF Backbone taxonomy, a comprehensive method to compare classifications in the data aggregation is essential, especially for non-expert users. New information A Java application was developed to compare multiple taxonomies graphically using classification data acquired from GBIF’s ChecklistBank via the GBIF Species API. It uses a table to display taxonomies where each column represents a taxonomy under comparison, with an aligner column to organise taxa by name. Each cell contains the name of a taxon if the classification in that column contains the name. Each column also has a cell showing the hierarchy of the taxonomy by a folder metaphor where taxa are aligned and synchronised in the aligner column. A set of those comparative tables shows taxa categorised by relationship between taxonomies. The result set is also available as tables in an Excel format file.


Introduction
Taxonomy is the discipline that manages classification of organisms, where each classification, or taxon, is examined as scientific hypothesis. A taxon is a concept that may vary with time depending on the developing understanding of nature or geological area of interest, as a subset of the natural world. Taxa, or more precisely names of taxa, are also used as building blocks to describe nature, especially in biology and environmental sciences. Information on organisms is organised and accessed using their names. The utility of an unnamed specimen is severely limited because it lacks a mechanism to access and manage any associated data, as well as the ability to locate that particular specimen at will within a large collection.
The Global Biodiversity Information Facility (GBIF) has been gathering data on organism specimens for over a decade using a consistent data format, Darwin Core (Wieczorek et al. 2012). A Darwin Core record provided in the first released schema (Darwin Core Task Group, Biodiversity Information Standards (TDWG) 2009) contains a scientific name of the specimen, depending on its level of identification which need not be to species level but can be a higher taxon name, with, optionally, a list of higher taxon names representing the classification that the identification followed. Taxonomic hierarchies behind those data are not necessarily coherent or even consistent, because data providers may choose a classification depending on the purpose of their collection. Therefore, it is necessary to coordinate those taxonomies to use specimen data stored in GBIF in a concordant manner. It is not limited to specimen data but can apply to any data indexed by biological name. GBIF's Backbone Taxonomy (GBIF Secretariat 2013) is an effort to provide a synthetic index covering all names in GBIF data. It requires all classifications be cast into a single view, as Catalogue of Life (Roskov et al. 2016) seeks to achieve through expert panels, but the resulting taxonomy is unlikely to satisfy every user. It is challenging to provide a coherent, single taxonomy because of the limited availability of experts prepared to contribute sufficient time to its maintenance, although using an open platform such as GitHub to manage the GBIF Backbone Taxonomy (Page 2013) may ease the limitation, in comparison with the Catalogue of Life approach. Even if such a unified, single taxonomy were available, it would still need to provide the capacity for a user to choose a preferred taxonomy or taxonomies from alternatives: no single taxonomy will satisfy everyone. It implies that we need to store multiple taxonomies and a method to choose, or at least compare, them by uniting those stored taxonomies without 'unitary-ism' (Berendsohn and Geoffroy 2007).
Data structures capable of storing multiple taxonomic views have been proposed (Berendsohn 1995, Pullan et al. 2000, Ytow et al. 2001, Pyle 2004) and applied to taxa of various scopes up to a nomenclatural code scale (Pyle and Michel 2008). Name services such as uBio, ZooBank, Global Names Usage Bank, Plazi and GBIF Checklist Bank are available as data sources of taxonomic views with APIs.
Software systems have been developed to compare two or more taxonomies visually from either the taxonomy or the data-visualisation perspective (or occasionally both) (Graham and Kennedy 2010). Taxonomy-oriented systems show rich information that expert users require, but in general needs significant extra interpretation that may be beyond the skill of non-expert end-users. Displaying rich information can also result in a reduction of the manageable number of taxonomic views, especially if relationships between the views are complicated, where such software systems are most useful. Taxonaut, development of TaxoNote (Morse et al. 2003) to facilitate navigation over taxonomies, is a Java application software which is designed to provide summarised, simultaneous comparison of two or more hierarchies, focusing on the needs of non-expert users. It sticks to conventional display components, specifically a table of folder-style tree structures, rather than, for instance a set of directed acyclic graphs, to reduce novice users' difficulty in interpretation. It uses two representation methods in combination which are categorised as agglomeration and matrix respectively (Graham and Kennedy 2010). Application of the software is, however, limited by the number of available taxonomic views represented as data sets. A stable version of GBIF Species API was released recently with comprehensive Java libraries, which provides a reliable test case of multiple taxonomies comparison. Note that during preparation of this manuscript the GBIF backbone was updated (Döring 2016). Results presented here were based on the previous version of the backbone (GBIF 2013). This contribution describes the extension of Taxonaut to use the GBIF Species API to recover multiple taxonomies and reveal issues experienced in handling such 'real' data. The software is extensible to other data sources because it is designed to accept multiple data sources from different APIs.

Data source
GBIF ChecklistBank stores more than 2500 checklists contributed by data providers. These checklists can be categorised into two groups, a small number of large checklist and many small sets of name usages extracted from separate publications. Data are available in JSON format using Species API, which includes name usages in checklists, the higher and lower name usages (parent and child name usages), synonyms and citations. Java libraries with source code are available to use the API. Most of the name usages have been mapped to the GBIF Backbone Taxonomy, a synthetic hierarchy intended to cover all checklists in the Checklist Bank. The Backbone Taxonomy is available as one of the stored checklists.

Design goals
Taxonomies comprise an organised list of taxonomic names that are drawn from diverse data sources and organised according to an expert in the local domain. Publication of new taxa after the publication of the main hierarchy will often, but not necessarily, specify the hierarchical position that the author intends for the new taxon. Various experts will inevitably propose a variety of such views (Ytow et al. 2001). TaxoNote (Morse et al. 2003) was built to facilitate comparison of disparate taxonomies. Taxonaut was designed to facilitate graphical exploration of the information space that these taxonomies represent.
The graphical user interface (GUI) components of Taxonaut have been modified from TaxoNote to improve usability. TaxoNote read data from local XML files in the InfoVIS2003 Contest tree format (Fekete and Plaisant 2003) where each file contained a single hierarchy. Therefore, hierarchies to be compared are specified by selecting flies to read. Taxonaut, on the other hand, gathers remote data fragments using GBIF Species API in an interactive manner described below. GUI components supporting this interaction were enhanced. See "General work flow" section for detail of this enhancement, three sub-panes named "Name usages", "Detail" and "Name Tree." Third-party software components are listed in "Availability" section.

Ostensive comparison of taxon concepts
A formal model is required to establish the equivalence of taxa. A taxon is a named concept that has been represented as a set of individual organisms (e.g. Gregg 1954, Geoffroy andBerendsohn 2003). Such sets cannot be treated formally as mathematical sets (Ytow et al. 2001) but can be handled as rough sets (Pawlak 1991). The plasticity of taxon concepts, which are intrinsically open sets, means that strict application of set equivalence is not practically useful, being highly sensitive to set completeness. Each taxon can be represented ostensively by some individuals or lower taxa and rough set theory used to establish equivalence (Ytow et al. 2006). Note that a lower taxon is not a member but a subset of the higher taxon.
Rephrasing in terms of taxonomy, a taxon concept can be represented by aggregations of lower taxa that are included and excluded. Name usages in databases, also known as potential taxon (Berendsohn 1995), Nomenclatural Taxon (Pullan et al. 2000, Name Record (Ytow et al. 2001), Assertion (Pyle 2004), TaxonConcept (Taxonomic Names and Concepts interest group, Biodiversity Information Standards (TDWG) 2005), or TaxonNameUsage (Pyle 2016), provide these taxon concepts as building blocks. This approximation enables the evaluation of ostensive compatibility of taxon concepts instead of exhaustive set equivalence.

Alignment of taxa on screen
There are ways to show relatedness of taxa displayed on screen (Graham and Kennedy 2010), such as linking taxa by a line (Craig and Kennedy 2008), highlight taxa by shared colours, or overlapping by animation. Taxonaut uses alignment of taxa for this purpose, using the two dimensions of the screen in a consistent manner, one for multiple hierarchies, the other for taxa in each hierarchy. This is visually similar to the familiar tree structure used to display files in a computer. For operational consistency, equivalent name usages in column are expected to be aligned horizontally. A composite tree structure, a tree not in sense of phylogeny but graph theory, covering all name usages in hierarchies under comparison is used to align name usages in each hierarchy (Fig. 1). Thus names appear on an appropriate line with an indentation representing the depth of the node. Vertical placement of name usage in a hierarchy under comparison is determined by the composite tree while horizontal placement is decided by the hierarchy. Alignment by the composite tree is robust to missing name usages in each hierarchy which appear as gaps in each hierarchy. If name usages of the same name literal in hierarchies have incompatible paths to the tree root, the composite tree has two or more nodes of the name (e.g. taxon S in the aligner tree in Fig. 1). Note that the composite tree will not provide a taxonomically sensible hierarchy if hierarchies under comparison are inconsistent with each other.

Implementation of composite tree
The composite tree is the key component for an aligned display. It is implemented by extending the Java DefaultTreeModel, a data model of JTree from the standard Java library. JTree is a GUI component to display a tree structure on screen, of which nodes can be expanded or collapsed to show or hide child nodes if present. The composite tree data is generated by integrating hierarchies in sequence, where each hierarchy is also represented by a Java TreeModel as a structural copy of the original hierarchy (Fig. 2). The first hierarchy in a group to be compared is copied directly to make the initial composite tree. Subsequent hierarchies are embedded into the composite tree as follows: The tree representing a hierarchy is scanned by breadth-first search with a pre-fetch of one depth. The root node of composite tree is used as the parental composite node at the beginning. Schematic explanation of aligned rendering of name usages in hierarchies on screen. Name usages in three tree diagrams representing hierarchies, h , h and h are aligned with the aid of an aligner tree. Capital letters in hierarchies represent name usages in each hierarchy. The aligner tree is composed from hierarchies as a kind of superset where letters in the aligner tree are place holders to determine vertical offset of nodes. Rendering of a tree diagram in two dimensions based on the folder metaphor requires calculation of node position where horizontal offset depends on depth of the node while vertical offset is relevant to both depth and width. Nodes in hierarchy diagrams can be aligned horizontally by sharing the calculation of vertical offset of nodes. For example, vertical position of node C in hierarchy diagram h is determined by vertical offset of node C of the aligner tree while horizontal offset is determined in the diagram h alone. Inconsistency between hierarchies results in duplicated nodes in the aligner tree; different assignments of name usage S in hierarchies h and h results in two S nodes in the aligner tree. Vertical offset of nodes S in hierarchies h and h are determined by position of corresponding nodes S in the aligner tree. Missing name usages, e.g. name usages B and S in hierarchy h , result in vacancies in the hierarchy diagrams like vacancy for boundary sets of Rough set representation. Note that such vacancies do not affect the horizontal offset in hierarchy diagrams. Schematic diagram of the data structure for the composite tree. In addition to the fundamental TreeModel data the composite tree possesses three mapping tables. First the forward map, which links each name usage node to a composite node. Second the reverse map, which links the composite node to each name usage node in the hierarchies under comparison. It can link a single composite node to multiple name usage nodes. Third the parental name, which links pair of the name literal and its parental name literal of a name usage node to the composite node. It providers a working cache of candidate composite nodes to which the name usage node may map to. If the parental name table contains a composite node that matches a name literal and parental name literal of the name usage node in the hierarchy being processed, the composite node is a good candidate for a mapping. Examples of some composite nodes in Fig. 1 are shown in the insertion. There are two entries for composite node D, one is indexed by literals D and B while the other by D and C. The D-B indexed entry is created for node D in hierarchy h , while D-C indexed entry is created for node D in hierarchy h . Note that the composite tree does not retain the direct parent-child relationship between B and D after insertion of composite node C. The parental name table has this information even after insertion of composite node C to provide a quick look up table to find composite node D as a candidate for hierarchy node with D and B literal pair, otherwise it is necessary to scan the parental path against hierarchies already incorporated. The parental name table has two entries for composite nodes S because there are two composite nodes, distinguished by suffixes. The parental name table also provides a distinction between these two composite nodes. The three tables are implemented by Java HashMaps.
1 2 • 1. Each node in the hierarchy is examined against the forward mapping to find a composite node. • 1.1 If found, the composite node is used as the parental node in the later steps. • 1.2 If not, look up the double indexed mapping by the name literal of the node. Note that it happens only on the root node of the hierarchy because of later pre-fetch. ▪ 1.2.1 If there is no entry of the literal, create a new composite node as a child node of the parental composite node, add map entries between the hierarchy node and the created composite node. The composite node is used as the parental node in the later steps. ▪ 1.2.2 If there is an entry of the literal that shares a parental literal, examine compatibility between parental paths of the hierarchy node against each composite node in the list. If there is only one composite node having matched path, the composite node is used as the parental composite node in the later steps. Else, create a new composite node as a child node of the parental composite node, add to maps. The composite node is used as the parental node in the later steps. ▪ 1.2.3 Else, i.e. there is an entry of the literal but no composite node shares the parental literal, examine compatibility between parental paths of the hierarchy node and the parental composite node of each composite node in the list. If there is only one composite node having matched path, the composite node is used as the parental composite node in the later steps. Else, create a new composite node as a child node of the parental composite node, add to maps. The composite node is used as the composite node in the later steps. • 2. If the hierarchy node has child node, examine them as follows • 2.1 Examine redundancy of name literals in child nodes of the hierarchy node. If there are 'redundant' name usages, i.e. two or more name usages of the same name literal, merge them into a single child node of the parental hierarchy node. Other redundant node should be removed after transfer of all lower nodes. • 2.2 Examine each child of the parental composite node against child hierarchy node, testing whether it shares the name literal. If they have the same name literal, add the hierarchy node and the composite node to maps, and rearrange the sequence of child hierarchy node to match to the sequence of child composite nodes for graphical alignment. • 2.3 Map each of remaining child hierarchy node to a composite node using the procedure described under 1.1 and 1.2. • 3. Go to step 1 to process the next hierarchy node.
The algorithm is designed to be modest and conservative which maps a hierarchy node to a composite node only when the candidate composite node is unique. It might be improved by choosing a best fit composite node from two or more candidates, but it requires further investigation.
Step 2.1, on the other hand, reduces multiplication in an aggressive way to reduce the burden of later matching of candidate nodes. Examples of such redundancy to be omitted will be described later. If it is necessary to compare these 'redundant' name usages, users are advised to set height of the tree to zero to cut higher taxa path.

Availability
The source code and the executable jar file are available from GitHub, https://github.com/ nomencurator/taxonaut. The executable jar file runs in Java 8 envirionment available for Linux, Mac OS X, Solaris and Windows. Instruction is available on its Wiki page. The software is distributed under the Apache Licence. Executable jar file contains following third party software: GBIF API client Java code, Apache axis including jaxrpc, Jackson version 1, Apache POI, Google guava and Find Bugs. These third party software are copyrighted by the original authors. All but Find Bugs are licensed under the Apache Licence. Find Bugs is licensed under the Lesser GNU Public License.

General work flow
The software display is divided into three resizable areas: Query pane; Result list pane; and Analysis pane (Fig. 3). Users are expected to enter a name or names in the query pane to get search results in the result list pane as a list. The result list is used to select records to be analysed in the analysis pane. These three panes are designed to hold subpanes accessed by tabs allowing various operations or data types to be displayed. The current implementation, for example, has three tabbed sub-panes in the analysis pane to display taxonomic comparison, a hierarchy containing the selected name usage or detail of the name usage. These three sub-panes provide facets of analysis, with auxiliary information available in optional pop-up windows to reduce number of tabs. A new facet, e.g. a distribution map of the taxon, could be added to the analysis pane with an additional tab label. Search results other than name usage, such as publication or author, could be hosted by the result list pane as sub-panes with tabs labelled "Publication" or "Author" depending on query result available from data sources. ZooBank is an example of data sources with APIs for Author and Publication, to which connection remains to be developed. The result list pane can be used to specify target of the query, such as "retrieve publications containing this name." The query pane can similarly handle different type of queries through tabbed sub-panes, such as bibliographic query to find information relevant to specific publication, or personal name to find data relevant to a person. These extensions are dependent on the data sources to be used and future implementation.
The current implementation has two query sub-panes to examine either Species API or local search to highlight names in the Hierarchy sub-pane described later. Users can compose a query to external data sources by name or names separated by '|', with rank of interest specified, if necessary. There are options to include basionyms or synonyms, and choice of name matching mode. If the query is a vernacular name, users can specify the language. The result of a query is accumulated into a cache to reduce access to the data source.
The Name usages sub-pane in the result list pane shows name data returned from Species API in tabular format, showing information that might be useful in choosing records of interest. This sub-pane has sets of selection-boxes at the bottom to specify highest rank (height) and lowest rank (depth) of the hierarchies to be retrieved. Height and depth can be specified either as a number (e.g. 4, meaning hierarchical ranks) or name of a rank (e.g. order). Note that if both number and name are specified, the software will represent whichever is the fewer intermediate ranks, because some taxonomies utilise large numbers of intermediate ranks between the widely recognised major ranks (e.g. class, order, family). By default the software will recover the highest and deepest records available from the recovered data. By selecting a line in the list, the software will show details of the record in the "Detail" sub-pane in the Analysis pane, and retrieve the hierarchy of specified height and depth to show in the "Name Tree" sub-pane. Selection of two or more lines enables the button also at bottom of the pane to compare selected name usages. The button triggers retrieval of hierarchies containing each name usage selected and their display in a comparison table of the "Hierarchies" sub-pane in the Analysis pane. Since sub-panes in the Analysis pane are independent of one another, users can examine the detail of a  Because of volume and diversity of available data, it is impractical to show the result of an exhaustive search. Some example tasks are described instead. Comparative display of five datasets containing Minyomerus records with an aligner. Each column shows the aligner or one of the datasets as both hierarchy and name usage assignment to higher name usage. Widths of the columns were adjusted to show full higher taxonomy. Hierarchies are expanded from root node to name usages selected in the name list pane of which nodes in the tree diagrams are highlighted. A row of the Comparison of those records gives five columns and an aligner column after integration mentioned above (Fig. 4). The moderate height of higher name usages paths enables the display of all hierarchies in a display of ordinary size. Numbers of species names under Minyomerus vary with data sources (Fig. 5) Extending the query to include basionyms and synonyms by specifying these options enables the examination of the data sources that incorporate the synonym treatments. The genus level query can be further expanded by adding species names relevant to those synonymous species to the query enables the inclusion of those 'illegitimate' name usages as siblings of other ordinary usages under the higher name usage (Fig. 6). Comparison of Minyomerus species in hierarchical trees. Hierarchical display of member taxa uses a screen more efficiently than that of higher taxa path, because the latter results in sparse screen containing diagonal name path. Although a taxon concept is represented by a pair of intent and extent represented by higher taxonomy and member taxa, respectively, it is better to focus on either of them at once on a screen because of the nature of tree diagrams.
Users can adjust what is shown on the display by scrolling hierarchies and resizing the display area to focus on their interest, e.g. showing membership hierarchies only. An auxiliary window pops up by placing cursor (not shown for clarity) on a name usage which gives summary of higher taxa path.
The lower part of the table shows a distribution of name usages in hierarchies with categorisation based on analysis during composition of the aligner tree (Fig. 7), i.e. name usages that appeared in all hierarchies sharing the same name literal and higher name usages, or different in higher name usage, name usages missing in some hierarchies, inconsistent name usages, synonyms, and all name usages. Each cell contains the higher name usage of the name literal on the line. Inconsistent name usages are extracted from the composite tree alone, because it retains nodes covering all name usages for the alignment and hence inconsistencies, both inter-and intra-hierarchies, are reproduced in the aligner tree. Synonym tab is designed to show name usages having different higher names depending on taxonomic views where higher names are not on the path of higher nodes of the composite tree. Synonymous usages also appear in the 'different taxa' tab because they are different in their higher name literal. Note that the synonym tab is expected to capture potential synonyms between hierarchies rather than what is stated as a synonym in a single hierarchy. This example allows an informed user to reconstruct the taxonomic history of the genus as well as assessing the differences in coverage of the various data sources. In this case, the Facets of assignments to higher name usages are shown in assignment tables with tabs of facet names where the word 'taxa' is used for 'name usages' to save screen space. Five hierarchies obtained by the union query are the same as in Fig. 6, are shown in both tree diagrams and assignment table where a row represents assignments of name usages having the name literal in the aligner column to higher name usages in hierarchies, although tree diagrams are minimised in the figure for clarity. Assignments are categorised into facets which can be switched by selecting tabs at bottom of the nomenclatural history of the genus has been expertly assessed by Jansen and Franz (2015) and the interested reader can follow the changes to the nomenclature. The objective of the software, however, is to provide rapid access to these changes so than an expert can more quickly undertake a new revision of a group less well studied. The non-expert user can more readily judge which hierarchies are more useful for their own purpose.

Higher name usages
Query of Lipotes returns 13 genus usages from distinct data sources, where each of them assigns the genus to one of the Cetacea ( Choosing those 13 name usages gives a table of the hierarchies with an additional composite tree to align name usages in each hierarchy (Fig. 8). Users can adjust the view port and width of each column to display the area of interest. Arrangement of columns follows the order of hierarchy evaluation which is determined by the order of selected name usages in the list. The user can rearrange the columns for easier comparison between hierarchies without affecting the composite tree structure. The alignment algorithm is sensitive to higher classification because it uses top-down name matching. Amongst those 13 hierarchies, Paleobiology Database and NCBI Taxonomy use higher, or, more intermediate taxa in their classification than others. The highest taxon in most of those classifications is either Animalia or Mammalia except "root" of NCBI Taxonomy and Biota of World Register of Marine Species which are higher than Animalia and incompatible each other in their name. Those two highest names results in two independent trees (Fig. 9) which can be partially unified by limiting the highest level of taxon (Fig. 10). There are two incompatible hierarchy paths between Mammalia and Cetacea which results in two Cetacea nodes in the composite tree. Detail of the composite tree can depend on the sequence of incorporation of hierarchies into the tree. Berendsohn (1995) pointed out that "a database system based on potential taxa is open to an inflation of records: any name referred to in a publication may form a new potential taxon". GBIF ChecklistBank is prone to such data inflation.  Alignments with different height of hierarchies containing Lipotes, up to order (upper), class (middle) or phylum (bottom), respectively. The height and depth of hierarchies being compared can be specified in the name usage list pane by either rank or number of levels above and below the name usage. If specified by rank, the software composes hierarchies to be limited within less than the next rank. For example, if class is specified as upper limit of hierarchies, the results may contain name usages less than phylum to cope with unranked intermediate name usages. The aligner tree for hierarchies up to order has four branches containing Lipotes. One branch is an artefact of misinterpreted data, Cetacea UNKNOWN, where UNKNOWN should be processed as authority rather than a part of name literal itself. Other three branches are results of different family names which require independent path between order and genus. The aligner tree for hierarchies up to class contains 9 Lipotes branches. Besides the erroneous UNKNOWN branch, remaining eight branches are joined into three higher branches depending on intermediate taxa between Mammalia and Cetacea, i.e. without intermediates, via Artiodactyla or Cetartiodactyla. Note that hierarchies without intermediates are mapped to branches in aligner tree either with or without intermediates depending on the sequence of integration into the aligner tree, because addition of an inconsistent path to the aligner tree requires the choice of one of those paths to which a new direct hierarchy is mapped but there is no way to choose one of them. The aligner tree up to phylum also contains 9 branches of Lipotes including one erroneous 'UNKNOWN'. Although the number of branches is the same to that for class, branching points get higher.

Data inflation in higher name usages
Search of Inia, a genus of Iniidae (Mammalia: Cetacea), returns 20 name usages where each of NCBI Taxonomy, Catalogue of Life and German Wikipedia Species Pages returns two records and English Wikipedia Species page returns three records. Other data sources contain a single record for each. Both records of NCBI Taxonomy were derived from the original data source. One of them without rank is said to be a synonym of Inia with genus rank. One of the records attributed to Catalogue of Life was created at GBIF from a denormalised classification, assigned to Animalia. Both records attributed to German Wikipedia are records from de-normalised classifications of different height; one up to order Cetacea while the other up to class Mammalia. Each of two Inia records has only one species, either of Inia geoffrensis or Inia araguaiaensis, where these data originate in German Wikipedia. It implies that these two species data in German Wikipedia have different higher taxa. All higher name usages of Inia araguaiaensis are marked as from denormalised classification, while Mammalia and Cetacea records of Inia geoffrensis came from German Wikipedia. Difference in highest name usages in the data source might be the cause of those unshared Cetacea records. One of three Inia records attributed to English Wikipedia Species page is genuine and the other two records are from denormalised classification. Each of these Inia records has one lower name usage, either of Inia araguaiaensis, Inia geoffrensis or Inia geoffrensis boliviensis. Via family Iniidae, the last one assigned to order Artiodactyla, while others are assigned to order Cetacea. These three subgenus records share the kingdom record, the latter two shares a phylum record, but records of class or lower are unshared. Records of higher name usages originated in the data source directly are kingdom Animalia of the three, phylum Chordata and genus Inia of Inia araguaiaensis, and class Mammalia of Inia geoffrensis boliviensis. All other name usages of genus or above have been created as a consequence of data processing by the aggregator.
Search of Iniidae, the family containing Inia, returns three records attributed to German Wikipedia and five records attributed to English Wikipedia, and each a single record for each data source. Search of Cetacea returns five records attributed to German Wikipedia and 7 records attributed to English Wikipedia. Search of Artiodactyla also returns 7 records attributed to English Wikipedia. Search for Mammalia returns 3 records attributed to German Wikipedia and 11 records attributed to English Wikipedia. Search for Chordata returns two records attributed to German Wikipedia and four records attributed to English Wikipedia, two of latter have 21,998 and 47,985 descendent name usages, respectively. This is an example of the type of data inflation mentioned in Berendsohn (1995). While Wikipedia could be the worst case of data inflation, multiple name usages in a single data source is not necessarily data inflation if the data source is expected to retain taxonomic views. Unification of records that look like replicates should remain with application software and end users because its appropriateness depends on nature of data sources and purpose of data usage.

Homonyms and synonyms
A genus of fish was named as Lembus by Günther in 1859. The same literal Lembus was assigned to a genus of protozoa by Cohn in 1866. Kahl renamed the protozoan genus as Cohnilembus in 1933. By assignment of the fish species Lembus maculatus to another genus (currently Gobiomorus), the monospecific genus Lembus disappeared and the name literal Lembus became unavailable. Species of protozoan Lembus are now classified into three genera, Cohnilembus, Kahlilembus and Pesudocohnilembus. Ten other genera are relevant to the protozoan Lembus because some species were moved to Lembus from Anophrys, Cyclidium, Lembadionella, Lembadion, Philasterides, Proboscella, Sparotricha, Trichonema, Uronema and Vibrio. Retrieval of data relevant to those genera was examined, by specifying search options. Taxonaut has the following name search options, three checkboxes to include basionyms, synonyms and vernacular name, and a set of radio buttons to choose one of matching modes, exact, fuzzy, suggest, full text or contain. The 'contain' option is slightly different from other matching mode options. It returns name usages containing name usage of the query literal, or a list of higher taxa in essence.
The default search mode is exact matching. It retrieves 15 records, two from GBIF Backbone Taxonomy, two from Interim Register of Marine and Nonmarine Genera, two from IRMNG Homonym List, one from Integrated Taxonomic  The Name list pane shows synonyms in shaded text for easier distinction from accepted names. Selecting a line of shaded text, the Detail pane shows information including the accepted name. Amongst of those 15 records, ITIS, CoL and one of GBIF Backbone Taxonomy data are claimed as valid names. Three of 4 records from Global Name Usage Bank are claimed as synonym while the other is stated as doubtful. Others give their accepted name, either Cohnilembus or Gobiomorus, except one record from Global Name Usage Bank which claims that Lembus is valid name of synonym Lembus. Those synonym records from Global Name Usage Bank originate in different works, 'Eshcmeyer (2004)', 'Eshcmeyer (1998)' and 'Günther (1859)', which results in the record stating that valid Lembus is synonym of Lembus. Besides this self synonymy, multiple name usages of the same name literal is in the nature of the design because Global Name Usage Bank provides access to accumulated name usages instead of giving a single consensus view unlike CoL or GBIF Backbone Taxonomy. Note that one of two records of CoL saying 'denormed classification' is not an original one but was created at GBIF which places it at just under kingdom, while the other is placed as a member of a family. Query of Lembus with both basionym and synonym option returns 6 more Alvarius records from 4 data sources, 4 Pelmatia records from 3 data sources, 8 Phylypnus records from 5 data sources and two Sobiomorus records from two data sources. It doesn't cover genera relevant to protozoan Lembus species except Cohnilembus which is synonym of the genus itself. This difference in recovery of synonyms between two genera named Lembus reflects quality of data sources.

Beyond homonymy: A single name literal valid under nomenclatural codes
Each nomenclatural code is independent of other nomenclatural codes. A single name literal can be a valid name of a taxon under different nomenclatural codes, for which a term hemihomonym was proposed (Starobogatov 1991, Shipunov 2013. Ficus is, for example, a valid genus name of figs under the botanical code and of a group of sea snails under the zoological code. Nomenclatural data have been independently gathered depending on nomenclatural codes covering taxa of interest, e.g. IPNI or Zoological Record. The code relevant to the names recovered is thus implicit, but can be compared with the endings used for taxa above genus level, as a possible mechanism of data validation. There are, however, too many exceptions to the standard endings for this to be more than a flag for the user's attention. This implicit dependency on classification creates potential confusion in modern classifications and is an easily avoided source of difficulty if database compilers simply made the code under which they are working explicit. Some hierarchies are not explicitly constructed under any nomenclatural code, specifically phylogenies created from molecular data. They do not refer to kingdom in botanical/zoological sense being purely reliant on the specimen sampled. Making the explicit declaration of the relevant code under which the name is used would be of enormous value for subsequent machine processing. Query of Ficus against GBIF Species API returns 110 records consisted of 77 generic, 4 subgeneric, two series, three specific and 24 unranked records. The unranked records contains four (presumably) cultivars, 16 species records having species epithets starting with a capital letter, and 4 records were deduced as generic records. Specifying genus as rank of Ficus query, the software gives 101 records as a union of 76 generic records and 25 unranked records which is the same as the unranked subset of the query without rank limitation. Comparison of hierarchies containing these generic records within ranks lower than order by selecting family in 'Height' option shows six families (Fig. 11) (Fig. 12) of 55 hierarchies which is less than the number of hierarchies up to family because some data set cover both zoological and botanical branches in a single hierarchy. Some hierarchies do contain neither Animalia nor Plantae which is consequence of each source hierarchy does not extend to the 'classical' kingdoms. These clues, endings or kingdoms, are insufficient for autonomous determination whether the shared name literal is a homonym, i.e. falls in a single code. Comparison of Ficus usages as hierarchies up to family, without lower name usages. 65 hierarchies from separate data sources are shown with pop-up window revealing the name of data source where the cursor is placed on. Capparaceae is included as a side effect of synonymisation of a Ficus species to a Capparis species. Phytoseiidae is included as an artefact of treating the description of specimen where the name was obtained. Rosaceae and Meliaceae are included in only one data source, and hence there are two families containing Ficus, each for animal and plants. The aligner tree of Ficus from kingdom to genus, as a horizontal composite of four captured screens instead of vertical composition. Some branches are overlapped intentionally to enable confirmation of its coverage. There are 36 Ficus branches in the aligner tree for 55 hierarchies compared (not shown). Besides Animalia and Plantae, there are four major branches, starting with Metazoa, Chlorobiota, Viridiplantae and Tracheophyta, which require implicit taxonomic knowledge to evaluate homonymy.

Discussion
Capability of the software to explore multiple taxonomic views has been limited by data availability. GBIF Species API provides wider coverage data with documented API, software library, support for developers and long-term stability including financial support, compared with predecessors such as uBio or ZooBank. It does not, however, eliminate the necessity of access to original data providers using their own API to obtain details unavailable from the GBIF Species API, or to examine interpretation of source data. Some data sources of GBIF ChecklistBank contain data derived from other data sources of GBIF ChecklistBank, with or without interpretation by non-primary provider. Name usage data are inevitably interpreted by the data provider (Ytow et al. 2001), even if the provider intends to be neutral. For example, Jansen and Franz (2015)  The number of hierarchies that can practically be compared is limited by the size of screen, the processing speed of the host computer and the download speed of the internet connection. In addition to the length of names, the path of each tree in the comparison table increases the width of column as a function of the depth of the path. Hierarchies containing more taxa at intermediate ranks results in reduction of number of hierarchies that could be practically compared. These limitations can be reduced by limiting the height of hierarchies under comparison, although this reduction of height can affect the results. An extreme way to increase number of hierarchies displayable is to narrow each column to minimum, e.g. width of one character. Each cell of the comparison table shows higher name usage if the literal of the row appears in the hierarchy, or empty if not, as illustrated in Fig. 11. At this extreme end, the row of trees showing hierarchies can be minimised by the slider on the edge of the table. Narrowing column reduces available information about higher taxon but use of the literal in the hierarchy is still visible. Colouring cells containing name usages like the list view (Graham and Kennedy 2008) could improve visibility. Adjusting width of columns could be a bit of a cumbersome task. Table lens technique (Rao and Card 1994) could help with this task, which might also make comparison of hierarchies easier by focusing on columns of interest.
Analysis of hierarchies in the assignment table remains to be extended, especially the utilisation of relationships between name usages if given, e.g. the 'related' method of GBIF Species API. Reduction of columns showing equivalent hierarchies would be helpful in keeping the number of columns in a manageable range. It would be more important to use data from sources such as Plazi where data are provided as they appeared in each publication, rather than as a checklist-style composition. It might be necessary to redesign the user interface to display relationships between hierarchies because a relationship between hierarchies is pairwise. The display of inconsistency in the assignment table depends on the fact that the composite tree contains nodes representing all name usages in hierarchies under comparison and hence it retains inconsistency to be detected. The table under the 'inconsistency' tab shows that each inconsistency in the composite tree is mapped to these name usages in relevant hierarchies, which implies that some of these name usages are inconsistent. It does not mean that all name usages on the row of the table are inconsistent with each other. A relationship between name usages in different hierarchies in general can involve multiple name literals, unlike inconsistency where difference assignment of a name literal is a symptom. The display of relationships between name usages in hierarchies requires more work.
Recovery of synonymous lower name usages in the example of Minyomerus suggests that extending hierarchical queries to search for lower synonyms, instead of adding each lower name usages to initial query, can be rewarding. It will probably result in much slower query execution because the synonym extension requires a recursive query as follows: for each of result name usages recovered by the first query for given name literals, its synonym name usages are requested; and then name literals of result synonym are used as the next seed to obtain name usages repeatedly until the set of name literals and hence name usages are saturated. It also could raise difficulties for some users because unexpected names can appear in hierarchies. Practicality of synonym extension for lower name usages needs more investigation.
The example of Ficus shows that information of the nomenclatural code governing name usages is essential for handling name literals, although it is almost universally implicit. Nomenclatural code, a sort of language of the name in general, might be supported as a data element in future though, for now, it is necessary to manage it within application software. It would also enable the management of ambiregnal names where a single taxon has names valid in multiple nomenclatural codes. It is suggested that NameUsage data structure of GBIF Species API be extended to manage nomenclatural code information, either by adding a new property or expanding NameType enumeration representing type of names. The NameType distinguishes viral names from other scientific names already, and hence there is no reason to prevent adding nomenclatural codes as other name types.
Finally, the accelerating rate of species loss makes the work of taxonomists so much more urgent. Few can devote the time to regular maintenance of taxonomic name-lists, but without such lists much data is inaccessible. Clearly we need to have better software tools to explore and correct the data we do have, so that new data can be integrated into a coherent framework.

Conclusions
The software described here allows experts to assess nomenclatural data from various sources in a way that is more time-efficient than the traditional literature-based search methods. It also facilitates comparison of data sources, the majority of which are compiled without expert taxonomic oversight.
Non-expert users can also use the software to examine and assess the quality of the data available from the large numbers of data aggregation projects that exist. Large-scale data aggregators are prone to data inflation and error propagation, because much of their data gathering is perforce automatic.
It remains a truism that data quality can only be increased by more people looking at and using the data.