Biodiversity Data Journal :
R Package
Corresponding author: Clarke van Steenderen (vsteenderen@gmail.com)
Academic editor: Zachary Foster
Received: 10 Nov 2021 | Accepted: 28 Feb 2022 | Published: 11 Mar 2022
© 2022 Clarke van Steenderen
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
van Steenderen C (2022) BinMat: A molecular genetics tool for processing binary data obtained from fragment analysis in R. Biodiversity Data Journal 10: e77875. https://doi.org/10.3897/BDJ.10.e77875
Processing and visualising trends in the binary data (presence or absence of electropherogram peaks), obtained from fragment analysis methods in molecular biology, can be a time-consuming and often cumbersome process. Scoring and analysing binary data (from methods, such as AFLPs, ISSRs and RFLPs) entail complex workflows that require a high level of computational and bioinformatic skills. The application presented here (BinMat) is a free, open-source and user-friendly R Shiny programme (https://clarkevansteenderen.shinyapps.io/BINMAT/) that automates the analysis pipeline on one platform. It is also available as an R package on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/BinMat/index.html). BinMat consolidates replicate sample pairs of binary data into consensus reads, produces summary statistics and allows the user to visualise their data as ordination plots and clustering trees without having to use multiple programmes and input files or rely on previous programming experience.
AFLP, binary data scoring, GUI, ISSR, R package, R Shiny
Fragment analysis is a method in molecular biology that encompasses the processes by which fragments of DNA are separated by size in order to generate characteristic band profiles. Bands are detected and scored through either the traditional method of viewing them on polyacrylamide gels (
Processing and analysing the binary data, obtained from fragment analysis methods, can quickly become challenging due to the large size of datasets and the time required to organise and format them to suit the needs of different programmes used in analysis pipelines. Common practice is to independently replicate each Polymerase Chain Reaction (PCR) sample in order to consolidate the output into one consensus read per individual (see, for example,
Manually consolidating the replicate pairs of large binary matrices in this way is not only impractical, but also prone to human error. Even after fragments have been scored and processed, the downstream analyses of these data are complex. For example, a number of different programmes are often required for different analyses, each of which requires a different input file format. This demands a certain level of computational and/or bioinformatic skill, can be both difficult and time-consuming and can introduce further errors when changing between file formats.
The R programming language (
Here, I present BinMat, an R package and R Shiny application that automates the analysis of fragment data. Named 'BinMat', from 'Binary Matrix', the application offers researchers a user-friendly, open-source platform that does not require multiple programmes and file input formats (Fig.
The R Shiny application is accessible at https://clarkevansteenderen.shinyapps.io/BINMAT/, where the hosting platform allocates a maximum of 1 GB of memory. The online version may time out due to insufficient memory if a particularly large binary data file is uploaded. In such a case, the programme can be run directly from R on the user's local machine by typing
install.packages("shiny")
shiny::runGitHub("BinMat", "clarkevansteenderen")
into the console.
The programme's code is freely available on GitHub.
File input
BinMat reads in binary data that has already been processed from raw electropherograms using programmes such as GeneMarker (SoftGenetics) and RawGeno (
File input for a dataset containing replicate pairs that needs to be consolidated.
Sample label | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 |
Sample A rep 1 | 0 | 0 | 1 | 1 | 1 |
Sample A rep 2 | 0 | 0 | 1 | 1 | 1 |
Sample B rep 1 | 1 | 1 | 0 | 0 | 0 |
Sample B rep 2 | 0 | 1 | 0 | 0 | 1 |
Table
Sample label | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 |
Sample A rep 1 + Sample A rep 2 | 0 | 0 | 1 | 1 | 1 |
Sample B rep 1 + Sample B rep 2 | ? | 1 | 0 | 0 | ? |
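The consolidation rule illustrated in the two tables above can be sketched in a few lines of base R. This is a minimal illustration and not BinMat's own implementation: the helper name `consolidate_pairs` is hypothetical, and the sketch assumes that the two replicates of each sample sit in consecutive rows, as in the example input.

```r
# Example replicate data, copied from the input table above
reps <- matrix(
  c(0, 0, 1, 1, 1,
    0, 0, 1, 1, 1,
    1, 1, 0, 0, 0,
    0, 1, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Sample A rep 1", "Sample A rep 2", "Sample B rep 1", "Sample B rep 2"),
    paste("Locus", 1:5)
  )
)

# Hypothetical helper: merge consecutive replicate rows into one consensus row.
# Matching calls are kept (1 & 1 = 1; 0 & 0 = 0); mismatches become "?".
consolidate_pairs <- function(m) {
  stopifnot(nrow(m) %% 2 == 0)
  first  <- m[seq(1, nrow(m), by = 2), , drop = FALSE]
  second <- m[seq(2, nrow(m), by = 2), , drop = FALSE]
  out <- ifelse(first == second, as.character(first), "?")
  rownames(out) <- paste(rownames(first), rownames(second), sep = " + ")
  out
}

consensus <- consolidate_pairs(reps)
print(consensus, quote = FALSE)
```

The result is a character matrix, because the "?" symbol cannot be stored alongside numeric 0 and 1 values.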
Output overview
Once the data have been consolidated, the user can view and download information in the 'SUMMARY' tab at the top of the window, which shows the average number of peaks (± standard deviation (sd)), the maximum and minimum number of peaks and the total number of loci. The 'ERROR RATES' tab shows the Euclidean (EE) (± sd) and Jaccard (JE) (± sd) error rates. See
The 'Remove samples with a jaccard error greater than' button removes samples with a Jaccard error (ranging from 0 to 1) greater than or equal to a specified value. This can give the user an idea of how filtering their data can affect overall error rates. The default value is set at zero.
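As a rough illustration of how an error rate can be derived for a single replicate pair, the sketch below assumes the commonly used definitions: the Euclidean error is the number of mismatched loci divided by the total number of loci, and the Jaccard error is the number of mismatched loci divided by the number of loci at which at least one replicate shows a peak. The function name `pair_errors` is hypothetical and the code is not taken from BinMat.

```r
# Hypothetical helper: error rates for one replicate pair of 0/1 vectors.
# Assumed definitions (not confirmed against BinMat's source):
#   Euclidean error = (f01 + f10) / (f00 + f01 + f10 + f11)
#   Jaccard error   = (f01 + f10) / (f01 + f10 + f11)
pair_errors <- function(rep1, rep2) {
  stopifnot(length(rep1) == length(rep2))
  mismatches <- sum(rep1 != rep2)          # f01 + f10
  f11 <- sum(rep1 == 1 & rep2 == 1)        # shared peaks
  jaccard <- if (mismatches + f11 > 0) mismatches / (mismatches + f11) else NA_real_
  c(euclidean = mismatches / length(rep1), jaccard = jaccard)
}

# Replicates of Sample B from the example input table
sample_b <- pair_errors(c(1, 1, 0, 0, 0), c(0, 1, 0, 0, 1))
print(sample_b)
```

Because shared absences (0 and 0) are excluded from its denominator, the Jaccard error of a pair is never smaller than its Euclidean error.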
Clustering methods, such as the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and neighbour-joining, are frequently used in the analyses of fragment data to create dendrograms (e.g.
Hierarchical clustering tree: UPGMA
The 'UPGMA TREE' tab in BinMat allows the user to upload a consolidated binary matrix as a CSV file (in the format shown in Table
\(d_{Ji} = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}}\)
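The formula above can be checked against base R, where `dist(..., method = "binary")` computes the same quantity. The sketch below is illustrative only: the matrix `m` is hypothetical data, `jaccard_distance` is a hypothetical helper, and the tree is built with average-linkage clustering (`hclust(method = "average")`, which is equivalent to UPGMA) without the bootstrapping that BinMat offers.

```r
# Hypothetical consolidated matrix: six samples (rows) by eight loci (columns)
m <- rbind(
  A1 = c(1, 1, 1, 0, 0, 0, 1, 0),
  A2 = c(1, 1, 0, 0, 0, 0, 1, 0),
  A3 = c(1, 1, 1, 1, 0, 0, 1, 0),
  B1 = c(0, 0, 0, 1, 1, 1, 0, 1),
  B2 = c(0, 0, 1, 1, 1, 1, 0, 1),
  B3 = c(0, 0, 0, 1, 1, 0, 0, 1)
)

# Jaccard distance written out term by term, following the formula
jaccard_distance <- function(a, b) {
  f11 <- sum(a == 1 & b == 1)
  f10 <- sum(a == 1 & b == 0)
  f01 <- sum(a == 0 & b == 1)
  (f01 + f10) / (f01 + f10 + f11)
}

d <- dist(m, method = "binary")        # pairwise Jaccard distances
tree <- hclust(d, method = "average")  # average linkage, i.e. UPGMA

manual <- jaccard_distance(m["A1", ], m["A2", ])
print(c(manual = manual, from_dist = as.matrix(d)["A1", "A2"]))
if (interactive()) plot(tree, main = "UPGMA (average linkage)")
```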
Ordination: nMDS Plot
The 'nMDS PLOT' tab allows the user to upload a consolidated binary matrix with grouping information as a CSV file. The input file format is shown in Table
Data input required for the creation of a non-metric multidimensional scaling (nMDS) plot. Grouping information needs to be in the second column. The data here represents binary replicate pairs that have already been consolidated into consensus reads.
Sample label | Group | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 |
Sample A | Africa | 0 | 0 | 1 | 1 | 1 |
Sample B | Asia | ? | 1 | 0 | 0 | ? |
The distance methods available are 'binary' (Jaccard's distance), 'euclidean', 'maximum', 'manhattan', 'canberra' and 'minkowski'. The 'No. of dimensions (k)' option can be set to '2' or '3'; a suitable value can be determined with the 'Scree plot' and 'Shepard plot' buttons in the 'nMDS Validation' tab. The resulting distance matrix can be downloaded as a CSV file and the plot itself as an SVG file. Once the user has uploaded their data, an editable table will appear to allow for the selection of colours and symbols for each group. The user can adjust symbol size and can select whether sample labels should appear on the graph or not. The nMDS plot is created using the isoMDS function in the MASS package (
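The ordination step described here can be approximated directly in R. The following is a minimal sketch under stated assumptions, not BinMat's internal code: the matrix `m` and the `groups` vector are hypothetical, the 'binary' distance is used, and `MASS::isoMDS` is called with k = 2.

```r
# Hypothetical consolidated matrix: six samples (rows) by eight loci (columns)
m <- rbind(
  A1 = c(1, 1, 1, 0, 0, 0, 1, 0),
  A2 = c(1, 1, 0, 0, 0, 0, 1, 0),
  A3 = c(1, 1, 1, 1, 0, 0, 1, 0),
  B1 = c(0, 0, 0, 1, 1, 1, 0, 1),
  B2 = c(0, 0, 1, 1, 1, 1, 0, 1),
  B3 = c(0, 0, 0, 1, 1, 0, 0, 1)
)
groups <- factor(c("Africa", "Africa", "Africa", "Asia", "Asia", "Asia"))

# isoMDS requires strictly positive distances, so duplicate rows must be
# removed beforehand; the rows above are all distinct.
d <- dist(m, method = "binary")
fit <- MASS::isoMDS(d, k = 2, trace = FALSE)

print(round(fit$points, 3))  # sample coordinates on the two nMDS axes
print(fit$stress)            # stress, in percent

if (interactive()) {
  plot(fit$points, col = as.integer(groups), pch = 19,
       xlab = "nMDS 1", ylab = "nMDS 2")
  text(fit$points, labels = rownames(m), pos = 3)
}
```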
Scree plot
The optimal number of dimensions to use for the nMDS plot should minimise the resulting stress value.
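One way to inspect this trade-off outside the application is to refit the ordination for several values of k and compare the stress values. The sketch below uses hypothetical data and `MASS::isoMDS`; it illustrates the idea and is not the code behind BinMat's 'Scree plot' button.

```r
# Hypothetical consolidated matrix: six samples (rows) by eight loci (columns)
m <- rbind(
  A1 = c(1, 1, 1, 0, 0, 0, 1, 0),
  A2 = c(1, 1, 0, 0, 0, 0, 1, 0),
  A3 = c(1, 1, 1, 1, 0, 0, 1, 0),
  B1 = c(0, 0, 0, 1, 1, 1, 0, 1),
  B2 = c(0, 0, 1, 1, 1, 1, 0, 1),
  B3 = c(0, 0, 0, 1, 1, 0, 0, 1)
)
d <- dist(m, method = "binary")

# Fit one ordination per candidate number of dimensions and record its stress
dims <- 1:3
stress <- sapply(dims, function(k) MASS::isoMDS(d, k = k, trace = FALSE)$stress)
print(data.frame(k = dims, stress = stress))

if (interactive()) {
  plot(dims, stress, type = "b", xlab = "Number of dimensions (k)",
       ylab = "Stress (%)")
}
```

In practice, the value of k at which the curve flattens is usually preferred over the value with the absolute minimum stress, since stress tends to keep decreasing as dimensions are added.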
Shepard plot
Shepard plots are graphical representations of how well the ordination fits the original distance data (
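A Shepard diagram can be produced in R with `MASS::Shepard`, which returns the original dissimilarities (`x`), the distances in the fitted configuration (`y`) and the monotone fitted values (`yf`). The example below is a sketch on hypothetical data rather than a reproduction of BinMat's 'Shepard plot' output.

```r
# Hypothetical consolidated matrix: six samples (rows) by eight loci (columns)
m <- rbind(
  A1 = c(1, 1, 1, 0, 0, 0, 1, 0),
  A2 = c(1, 1, 0, 0, 0, 0, 1, 0),
  A3 = c(1, 1, 1, 1, 0, 0, 1, 0),
  B1 = c(0, 0, 0, 1, 1, 1, 0, 1),
  B2 = c(0, 0, 1, 1, 1, 1, 0, 1),
  B3 = c(0, 0, 0, 1, 1, 0, 0, 1)
)
d <- dist(m, method = "binary")
fit <- MASS::isoMDS(d, k = 2, trace = FALSE)

# Compare original dissimilarities with distances in the ordination
sh <- MASS::Shepard(d, fit$points)

if (interactive()) {
  plot(sh$x, sh$y, pch = 19, xlab = "Original dissimilarity",
       ylab = "Ordination distance")
  lines(sh$x, sh$yf, type = "S")  # monotone (isotonic) fit
}
```

Points lying close to the step line indicate that the rank order of the original distances is well preserved in the ordination.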
Filter data
The 'Filter data' tab allows the user to filter their dataset by setting a threshold value for the number of peaks present. The new subsetted data and the removed samples can be downloaded as a CSV file and re-uploaded to create a new nMDS plot and/or hierarchical clustering tree.
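The filtering step amounts to counting the peaks (1s) in each row and dropping the rows at or below the threshold. The sketch below illustrates this in base R; the helper name `filter_by_peaks` and the row 'Sample C' are hypothetical, and the "equal to, or less than" rule is assumed from the description of the package's peak.remove() function.

```r
# Consolidated example data; "?" marks loci where replicates disagreed.
# Sample C is a hypothetical extra row added for illustration.
consensus <- rbind(
  "Sample A" = c("0", "0", "1", "1", "1"),
  "Sample B" = c("?", "1", "0", "0", "?"),
  "Sample C" = c("1", "1", "1", "0", "1")
)

# Hypothetical helper: split samples by their number of peaks.
# Samples with a peak count equal to, or less than, the threshold are removed.
filter_by_peaks <- function(m, threshold) {
  peaks <- rowSums(m == "1")  # "?" and "0" are not counted as peaks
  keep <- peaks > threshold
  list(kept = m[keep, , drop = FALSE],
       removed = m[!keep, , drop = FALSE],
       peaks = peaks)
}

res <- filter_by_peaks(consensus, threshold = 1)
print(res$peaks)
print(rownames(res$removed))
```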
Comparing BinMat's output to PAST and SplitsTree
Two AFLP datasets were downloaded from the Dryad Digital Repository. These comprised data generated by
Comparisons of non-metric multidimensional scaling (nMDS) plots in BinMat (A1 and B1) and PAST (A2 and B2). Both nMDS plots are plotted for k = 2 dimensions. Data were taken from
The SplitsTree output for the data taken from
Comparison of hierarchical clustering trees in A) BinMat and B) PAST using the data taken from
The BinMat R package is available on the Comprehensive R Archive Network (CRAN) and on GitHub and is command-line driven. More information about the package can be obtained by typing
library(help = BinMat)
into the console after it has been installed. This details all the functions available (Table
BinMat R package functions available on CRAN. Typing ?functionName into the console provides more information about each function.
Function | Description |
check.data() | Checks for unwanted characters. |
consolidate() | Consolidates replicate pairs. 1 & 1 = 1; 1 & 0 = ?; 0 & 0 = 0 |
errors() | Calculates Jaccard and Euclidean error rates. |
group.names() | Outputs groups in the uploaded binary matrix. |
nmds() | Creates a non-metric multidimensional scaling (nMDS) plot. |
peak.remove() | Removes samples with peaks equal to, or less than, a specified threshold value. |
peaks.consolidated() | Peak summary for a consolidated binary matrix. |
peaks.original() | Peak summary for replicate data or consolidated data from file. |
scree() | Creates a scree plot of stress values vs. ordination dimensions. |
shepard() | Creates a Shepard plot for goodness-of-fit for ordination data. |
upgma() | Draws a hierarchical clustering tree (UPGMA) with bootstrapping. |
To cite BinMat, use
citation("BinMat")
There are four example binary matrices embedded in the BinMat package called "BinMatInput_reps", "BinMatInput_ordination", "bunias_orientalis" and "nymphaea" that can be accessed by creating objects such as:
data1 = BinMatInput_reps
data2 = BinMatInput_ordination
These binary matrices can be used to test the various functions as a demonstration example, as shown in the worked example in the vignette supplied with the package. The "BinMatInput_reps" and "BinMatInput_ordination" are small hypothetical datasets, illustrating how BinMat consolidates replicate pairs and then creates an nMDS plot coloured by groups (e.g. populations). The "bunias_orientalis" and "nymphaea" datasets are real-world AFLP and ISSR results from
BinMat offers users of fragment analysis methods an efficient and easy-to-use platform to process their binary data matrices, by means of either a graphical user interface or an R package. The programme produces comparable output to other mainstream software, with the benefit of housing all of its functionality on one platform. Suggestions for improvement (for example, via pull requests on GitHub) and feedback from the community are welcomed.
This work was supported by funding from the South African Working for Water (WfW) programme of the Department of Forestry, Fisheries and the Environment: Natural Resource Management Programmes (DFFE: NRMP). Funding was also provided by the South African Research Chairs Initiative of the Department of Science and Technology and the National Research Foundation (NRF) of South Africa. Any opinion, finding, conclusion or recommendation expressed in this material is that of the authors and the NRF does not accept any liability in this regard. Guy Sutton is thanked for his valuable advice and suggestions in the writing of this manuscript. Megan Reid is thanked for providing her Nymphaea ISSR dataset, testing BinMat's functionality and providing ongoing feedback. Prof. Iain D. Paterson and Dr. Shelley Edwards are thanked for their assistance throughout the course of my MSc degree.
Centre for Biological Control, Department of Zoology and Entomology, Rhodes University, South Africa
There are no conflicts of interest.
Consolidated AFLP binary data from Arias et al. (2014), with a grouping column. This is used as input to BinMat for the creation of an nMDS plot.
Consolidated AFLP binary data from Arias et al. (2014), without a grouping column.
Raw AFLP binary data from Arias et al. (2014), before replicates have been consolidated.
A NEXUS file containing AFLP binary data for native and invasive Bunias orientalis species from the Tewes et al. (2018) study. This file is used as input for the SplitsTree programme.
Consolidated AFLP binary data from Tewes et al. (2018), with a grouping column. This is used as input to BinMat for the creation of an nMDS plot.
Tewes et al. (2018) consolidated AFLP binary data without a grouping column.
Raw AFLP binary data from Tewes et al. (2018), before replicates have been consolidated.