BinMat: A molecular genetics tool for processing binary data obtained from fragment analysis in R

Abstract Processing and visualising trends in the binary data (presence or absence of electropherogram peaks), obtained from fragment analysis methods in molecular biology, can be a time-consuming and often cumbersome process. Scoring and analysing binary data (from methods, such as AFLPs, ISSRs and RFLPs) entail complex workflows that require a high level of computational and bioinformatic skills. The application presented here (BinMat) is a free, open-source and user-friendly R Shiny programme (https://clarkevansteenderen.shinyapps.io/BINMAT/) that automates the analysis pipeline on one platform. It is also available as an R package on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/BinMat/index.html). BinMat consolidates replicate sample pairs of binary data into consensus reads, produces summary statistics and allows the user to visualise their data as ordination plots and clustering trees without having to use multiple programmes and input files or rely on previous programming experience.


Introduction
Fragment analysis is a method in molecular biology that encompasses the processes by which fragments of DNA are separated by size in order to generate characteristic band profiles. Bands are detected and scored through either the traditional method of viewing them on polyacrylamide gels (Bassam et al. 1991) or through the use of fluorescent markers (such as FAM or ROX) that tag fragments so that they can be detected by capillary electrophoresis (Dresler-Nurmi et al. 2000, Applied Biosystems 2014. There are a number of techniques associated with fragment analysis, including AFLP (Amplified Fragment Length Polymorphism) (Vos et al. 1995), RAPD (Random Amplified Polymorphic DNA) (Koeleman et al. 1998) and ISSR (Inter-Simple Sequence Repeats) (Wolfe andListon 1998, Abbot 2001). Fragment analysis offers a wide range of applications, such as DNA fingerprinting, SNP (single nucleotide polymorphism) genotyping and microsatellite profiling (Applied Biosystems 2014), which are used across a broad range of disciplines.
Processing and analysing the binary data, obtained from fragment analysis methods, can quickly become challenging due to the large size of datasets and the time required to organise and format them to suit the needs of different programmes used in analysis pipelines. Common practice is to independently replicate each Polymerase Chain Reaction (PCR) sample in order to consolidate the output into one consensus read per individual (see, for example, Taylor et al. 2011 andSutton et al. 2017). The term 'consolidate', as used here, refers to the process of checking the binary value scored at each locus position across every replicate pair and creating one representative consensus output for that sample. For example, if both replicates show the presence of a band at a particular locus, a '1' is recorded as 'present' at that locus. If a band was absent in both replicates, a '0' is recorded. If one replicate shows the presence of a band, but the other shows an absence, a '?' is recorded to denote an ambiguous read.
Manually consolidating the replicate pairs of large binary matrices in this way is not only impractical, but it also lends itself to human error. Even after fragments have been scored and processed, the downstream analyses of these data are complex. For example, a number of different programmes are often required for different analyses, each of which require a different input file format. This requires a certain level of computational and/or bioinformatic skills, can be both difficult and time-consuming and can result in further potential errors when changing between file formats.
The R programming language (RStudio Team 2020) is becoming an increasingly popular means of analysing genetic data (Paradis et al. 2004, Schliep 2011, Archer et al. 2017, as it can read in multiple file formats and perform a number of analyses all on one platform. Packages in R can, however, often be challenging to utilise for newcomers to programming. The development of a GUI (Graphical User Interface) can address this by collating multiple processing tools into one place and make complex computational tasks more accessible to researchers (see, for example, Reyes et al. 2019).
Here, I present BinMat, an R package and R Shiny application that automates the analysis of fragment data. Named 'BinMat', from 'Binary Matrix', the application offers researchers a TM user-friendly, open-source platform that does not require multiple programmes and file input formats (Fig. 1). Moreover, a GUI was developed to make data processing easier and more accessible. BinMat is available on three platforms; namely the shinyapps.io server, GitHub and as an R package on CRAN. The following sections detail the functionality of BinMat, how its output compares to PAST (Hammer et al. 2001) and SplitsTree (Huson 1998) (which are standalone software typically used to analyse genetic data) and how it can be accessed.

R Shiny graphical user interface
The R Shiny application platform allocates a maximum memory of 1 GB and is accessible here. The online version may time-out due to insufficient memory if a particularly large binary data file is uploaded. In such a case, the programme can be run directly from R on the user's local machine by typing Flowchart of the utility of the BinMat programme, starting with input that has been processed in programmes such as GeneMarker and RawGeno, to the rapid visualisation of a hierarchical clustering tree and non-metric dimensional scaling (nMDS) plot.
The programme's code is freely available on GitHub.

File input
BinMat reads in binary data that has already been processed from raw electropherograms using programmes such as GeneMarker (SoftGenetics) and RawGeno (Arrigo et al. 2012). This needs to be uploaded as a CSV (comma-separated values) file in the format shown in Table 1. Column headings are required, but are not limited to the exact labels shown in the example. If the data consist of replicate pairs, these need to be organised so that they appear consecutively, with a unique name for each sample. It is important to check the data to ensure that there are no single samples without their replicate. When the 'Consolidate matrix' button is clicked, each replicate pair in the dataset is consolidated into a consensus output.  Table 2 shows the output if the data in Table 1 were used as input. The resulting consolidated binary matrix can be downloaded as a CSV file using the 'Download Matrix' button once the message 'COMPLETE. READY FOR DOWNLOAD' appears on the screen. The 'Check my data for unwanted values' button checks the data for any values in the dataset other than a '1', '0', or '?' and returns the column and row index for the unwanted character/s. File input for a dataset containing replicate pairs that needs to be consolidated. Table 2.
A consolidated matrix derived from Table 1 using BinMat.

Output overview
Once the data have been consolidated, the user can view and download information in the 'SUMMARY' tab at the top of the window; showing the average number of peaks (± standard deviation (sd)), the maximum and minimum number of peaks and the total number of loci. The 'ERROR RATES' tab shows the Euclidean (EE) (± sd) and Jaccard (JE) (± sd) error rates. See and Holland et al. 2008 for detailed reviews regarding error rates and their calculation.
The 'Remove samples with a jaccard error greater than' button removes samples with a Jaccard error (ranging from 0 to 1) greater than or equal to a specified value. This can give the user an idea of how filtering their data can affect overall error rates. The default value is set at zero.
Clustering methods, such as the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and neighbour-joining, are frequently used in the analyses of fragment data to create dendrograms (e.g. Van Eldere et al. 1999, Ticknor et al. 2001, Liu et al. 2009, Timm et al. 2010. Additionally, ordination methods, such as those offered by non-metric multidimensional scaling (nMDS) plots, are also often used (see, for example, Denaro et al. 2005, Zhang et al. 2008, Vašek et al. 2017).

Hierarchical clustering tree: UPGMA
The 'UPGMA TREE' tab in BinMat allows the user to upload a consolidated binary matrix as a CSV file (in the format shown in Table 2), specify the number of bootstrap replications and download the resulting hierarchical clustering tree as a scalable vector graphics (SVG) file. This function makes use of the pvclust function in the pvclust package (Suzuki and Shimodaira 2006) and uses the UPGMA clustering method. The uploaded binary data are converted into a distance matrix applying the Jaccard transformation (dJ ) (Jaccard 1908) shown below. f represents the total number of times that a band occurred at the same locus in both samples, f represents the shared absence of bands and f and f represents the number of times that a band was present in only one of the two sample replicates. The Jaccard transformation was applied using the dist function, applying the 'binary' method. This transformation was prefered because it does not treat the shared absence of bands as being biologically meaningful.

Ordination: nMDS Plot
The 'nMDS PLOT' tab allows the user to upload a consolidated binary matrix with grouping information as a CSV file. The input file format is shown in Table 3, where grouping information needs to appear in the second column.
The distance methods available are 'binary' (Jaccard's distance), 'euclidean', 'maximum', 'manhattan', 'canberra' and 'minkowski'. The 'No. of dimensions (k)' option can be set at '2' or '3' and can be determined using the 'nMDS Validation' tab using the 'Scree plot' and i 11 00 10 01 'Shepard plot' buttons. The resulting distance matrix can be downloaded as a CSV file and the plot itself as a SVG file. Once the user has uploaded their data, an editable table will appear to allow for the selection of colours and symbols for each group. The user can adjust symbol size and can select whether sample labels should appear on the graph or not. The nMDS plot is created using the isoMDS function in the MASS package (Venables and & Ripley 2002).

Scree plot
The optimal number of dimensions to use for the nMDS plot should minimise the resulting stress value. Clarke 1993 suggests that stress values < 0.05 = excellent, < 0.10 = good, < 0.20 = usable, > 0.20 = not acceptable, while Dugard et al. 2010 suggest that a stress value below 0.15 represents a good fit for the data. BinMat indicates the 0.15 threshold as a dotted red line on the resulting scree plot.

Shepard plot
Shepard plots are graphical representations of how well the ordination fits the original distance data (Leeuw and Mair 2014). BinMat plots the original Jaccard distances (x-axis) against the transformed distances used to create the nMDS ordination plot (y-axis). R values are shown on the plot for the regression line of best fit.

Filter data
The 'Filter data' tab allows the user to filter their dataset by setting a threshold value for the number of peaks present. The new subsetted data and the removed samples can be downloaded as a CSV file and re-uploaded to create a new nMDS plot and/or hierarchical clustering tree.

Comparing BinMat's output to PAST and SplitsTree
Two AFLP datasets were downloaded from the Dryad Digital Repository. These comprised data generated by Arias et al. 2014 and for Heliconius (Lepidoptera: Nymphalidae) and Bunias orientalis L. (Brassicaceae) specimens, respectively. With the 2 Table 3.
Data input required for the creation of a non-metric multidimensional scaling (nMDS) plot. Grouping information needs to be in the second column. The data here represents binary replicate pairs that have already been consolidated into consensus reads.
authors' permission, a subset of each were used to compare output from BinMat to that of PAST v.4.0 (Paleontological Statistics Software Package for Education and Data Analysis) (Hammer et al. 2001) and SplitsTree v.4.14.6 (Huson 1998) (input data are available in Suppl. materials 1, 2, 3, 4, 5, 6, 7). Replicate pairs were consolidated in BinMat and used to create nMDS plots and UPGMA hierarchical clustering trees (1000 bootstrap repetitions). The lowest number of dimensions were used for nMDS plots (k = 2) and their stress and R values recorded. SplitsTree was used to create a NeighborNet tree applying Jaccard's distance transformation. The nMDS plots created by BinMat and PAST showed comparable clustering patterns (Fig. 2A1, A2, B1 and B2).
Comparisons of non-metric multidimensional scaling (nMDS) plots in BinMat (A1 and B1) and PAST (A2 and B2). Both nMDS plots are plotted for k = 2 dimensions. Data were taken from Arias et al. 2014 (A1 and A2) and Tewes et al. 2018 (B1, B2 and B4). Stress and R values are shown above each plot. Diagram B3 shows the original nMDS plot presented by , which depicts the same clustering pattern of the native range samples (T1, T3 and T4). Diagram B4 shows the SplitsTree representation of the same data (NeighborNet, Jaccard distance).
The SplitsTree output for the data taken from Tewes et al. 2018 (Fig. 2B4) corroborated the corresponding nMDS plot from the original paper (Fig. 2B3) and from that created by BinMat (Fig. 2B1). Both hierarchical clustering trees using the UPGMA method showed equivalent topologies and bootstrap support values for clades (Fig. 3). BinMat, PAST and SplitsTree perform equally as well for the visualisation of fragment analysis output, where BinMat offers the advantage of a quicker, automated process on one platform.

BinMat as an R package on CRAN
The BinMat R package is available on the Comprehensive R Archive Network (CRAN) and on GitHub and is command-line driven. More information about the package can be obtained by typing library(help = BinMat) into the console after it has been installed. This details all the functions available (Table 4).
peak.remove() Removes samples with peaks equal to, or less than, a specified threshold value.
peaks.consolidated() Peak summary for a consolidated binary matrix.
peaks.orignal() Peak summary for replicate data or consolidated data from file.
scree() Creates a scree plot of stress values vs. ordination dimensions.
shepard() Creates a shepard plot for goodness-of-fit for ordination data.
There are four example binary matrices embedded in the BinMat package called "BinMatInput_reps", "BinMatInput_ordination", "bunias_orientalis" and "nymphaea" that can be accessed by creating objects such as: data1 = BinmatInput_reps data2 = BinmatInput_ordination These binary matrices can be used to test the various functions as a demonstration example, as shown in the worked example in the vignette supplied with the package. The "BinMatInput_reps" and "BinMatInput_ordination" are small hypothetical datasets, illustrating how BinMat consolidates replicate pairs and then creates an nMDS plot coloured by groups (e.g. populations). The "bunias_orientalis" and "nymphaea" datasets are real-world AFLP and ISSR results from Reid et al. 2021, respectively. These two datasets have already been consolidated and serve as examples for the generation of nMDS plots.

Conclusion
BinMat offers users of fragment analysis methods an efficient and easy-to-use platform to process their binary data matrices, by means of either a graphical user interface or an R package. The programme produces comparable output to other mainstream software, with the benefit of housing all of its functionality on one platform. Suggestions for improvement Table 4.
BinMat R package functions available on CRAN. Typing ?functionName into the console provides more information about each function.
(for example via pull-requests on GitHub) and feedback from the community, are welcomed.