Biodiversity Data Journal :
R Package
|
Corresponding author: Jarrett D. Phillips (jphill01@uoguelph.ca)
Academic editor: Zachary Foster
Received: 18 Oct 2022 | Accepted: 30 Nov 2022 | Published: 26 Jan 2023
© 2023 Jarrett Phillips, Taryn Athey, Paul McNicholas, Robert Hanner
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Phillips JD, Athey TB.T, McNicholas PD, Hanner RH (2023) VLF: An R package for the analysis of very low frequency variants in DNA sequences. Biodiversity Data Journal 11: e96480. https://doi.org/10.3897/BDJ.11.e96480
|
Here, we introduce VLF, an R package to determine the distribution of very low frequency variants (VLFs) in nucleotide and amino acid sequences for the analysis of errors in DNA sequence records. The package allows users to assess VLFs in aligned and trimmed protein-coding sequences by automatically calculating the frequency of nucleotides or amino acids in each sequence position and outputting those that occur under a user-specified frequency (default of p = 0.001). These results can then be used to explore fundamental population genetic and phylogeographic patterns, mechanisms and processes at the microevolutionary level, such as nucleotide and amino acid sequence conservation.
Our package extends earlier work pertaining to an implementation of VLF analysis in Microsoft Excel, which was found to be both computationally slow and error prone. We compare those results to our own herein. Results between the two implementations are found to be highly consistent for a large DNA barcode dataset of bird species. Differences in results are readily explained by both manual human error and inadequate Linnean taxonomy (specifically, species synonymy). Here, VLF is also applied to a subset of avian barcodes to assess the extent of biological artifacts at the species level for Canada goose (Branta canadensis), as well as within a large dataset of DNA barcodes for fishes of forensic and regulatory importance. The novelty of VLF and its benefit over the previous implementation include its high level of automation, speed, scalability and ease-of-use, each desirable characteristics which will be extremely valuable as more sequence data are rapidly accumulated in popular reference databases, such as BOLD and GenBank.
DNA barcoding, frequency matrix, genetic diversity, PCR error, sequencing error, trace file
The ability to distinguish between sequence disparity arising from true biological variation versus that arising as a result of sequencing artifacts, known to occur during the PCR/sequencing process, is of great importance. Numerous studies have noted the detrimental effect of sequencing errors on the accurate estimation of key population genetic parameters for assessment of genetic diversity, such as effective population size (Ne), haplotype diversity (h) and nucleotide diversity (\(\pi\)) (
Concerning PCR errors, whose magnitudes are highly variable (
Screening high-volume DNA sequences for putative errors can reveal incorrect basecalls, chimeras/heteroplasmies, contaminants, insertion-deletion mutations (indels) and other nucleotide substitutions, as well as nuclear-mitochondrial (NUMT) insert/pseudogene amplification (
DNA barcoding uses a small gene fragment from a standardised (orthologous) region of the genome to identify multicellular species (
A key approach employed within many modern sequencing platforms to quantitatively assess putative errors stemming from incorrect nucleotide basecalls is the PHRED quality score (
Here, we present VLF version 1.1 (
The VLF package inputs aligned DNA sequences as a matrix in FASTA format using the function fasta.read(file, seqlength = 648, pos1 = 1, pos2 = 3) and converts it into a sequence matrix. The first column of the matrix contains a specimen identifier, while the second gives the species name, followed by the DNA sequence in subsequent columns. The FASTA input header should be separated by ‘|’ and the 'pos1' and 'pos2' identifiers indicate the header’s position for the unique specimen identifier ('pos1') and the species name ('pos2'). For example, a FASTA header may be ‘>GBGC1668-06|NC 005317|Thunnus alalunga|COI-5P’, where GBGC1668-06 is the unique specimen identifier in the first position after the ‘>’ (pos1 = 1) and the species is Thunnus alalunga in the third position (pos2 = 3). The default sequence length is 648 bp. This function will automatically separate the FASTA file into a matrix containing the unique specimen identifier in the first column, the species name in the second column and the nucleotide sequence in the subsequent columns, one column per nucleotide. If the user wishes, they may also upload their sequences from their own format, provided the final sequence matrix follows these conventions. Sequence alignment can be handled using external software programmes such as MEGA (
The VLF package consists of three main functions: vlfFun(x, p = 0.001, seqlength = 648, own = NULL), aminoAcidFun(x, p = 0.001, seqlength = 216, own = NULL) and concordanceFun(nuc, aa, nuclength = 648, aalength = 216, aminoAcid.Modal). The functions vlfFun() and aminoAcidFun() have the same output: ‘modal’, ‘con100’, ‘conp’, ‘combine’, ‘specimen’, ‘position’, ‘sas’ and ‘VLFmatrix’. The ‘modal’ object contains the sequence of nucleotides or amino acids that occur most often in each position, based on the calculated frequencies. The ‘con100’ value gives the number of sequence positions that are 100% conserved amongst all specimens in the dataset, while the ‘conp’ value gives the number of sequence positions that are (1 - p)% conserved (i.e. if using the default value of p = 0.001, then (1 - p)% = 99.9%). The ‘combine’ value gives the number of amino acid positions that are (1 - p)% conserved for the first and second modal sequence (i.e. the two most common sequence variants in a taxon dataset). ‘Specimen’ is a vector containing the number of VLFs for each specimen in the dataset and 'position' is the number of VLFs for each sequence position in the dataset. The value ‘sas’ gives the number of singleton and shared VLFs in each sequence position of the dataset. Lastly, ‘VLFmatrix’ is a reduced matrix containing only VLFs, with “NA”s in any position that does not contain a VLF. Additionally, if the user specifies their own sequences, then the programme outputs specimen VLF counts (‘ownSpecCount’), position VLF counts (‘ownPosCount’), a VLF matrix containing all “own“ specimens (‘ownVLFMatrix’) and a reduced VLF matrix containing only those specimens which have VLFs in their sequence (‘ownVLFreduced’). This output allows the user to assess their own sequences of interest more easily, without having to filter through large datasets. The third main function of the VLF package is concordanceFun(nuc, aa, nuclength = 648, aalength = 216, aminoAcid Modal), where ‘nuc’ and ‘aa’ are the VLFmatrix outputs of the vlfFun() and aminoAcidFun() functions, respectively, ‘nuclength’ and ‘aalength’ are the sequence lengths for the nucleotide and amino acid sequences, respectively (648 bp and 216 residues by default) and ‘aminAcidModal’ is the modal output of aminoAcidFun(). The main goal of the concordanceFun() function is to calculate how many nucleotide VLFs occur within the codon of an amino acid VLF. The output for this function is a list of concordant nucleotide and amino acid VLFs ('matched'), a calculation of how many concordant VLFs there are for each codon position ('codons'), the number of concordant amino acid VLFs that changed amino acid residue type (‘concordantType’), the number of overall amino acid VLFs that changed amino acid residue type (‘aminoAcidType’), the overall number of nucleotide VLFs and amino acid VLFs that showed concordance (‘concordNuc’ and ‘concordAA’, respectively) and the number of sequences that contained both nucleotide VLFs and amino acid VLFs (‘sequences’).
The VLF package also has several other useful functions, such as one to calculate singleton, shared and total VLF error rates, based on a high degree of conservation at second codon positions (Error.Rate(single, shared, spec, seqlength)). In computing total error rates, both singleton and shared VLFs should be considered. This is because, despite shared VLFs making up a negligible fraction of overall sequences, they comprise a high proportion of sequences with VLFs (
\(\mathrm{Singleton \; ER} = \mathrm{\frac{\mathrm{2nd \; Position \; Singleton \; VLFs}}{\Big(\frac{2nd \; Positions}{Barcode}\hspace{1mm} - \hspace{1mm}2nd \; Position \; Shared \; VLFs\Big)\Big(Number \; of \; Barcodes\Big)}}\)
\(\mathrm{Shared \; ER} = \mathrm{\frac{\mathrm{2nd \; Position \; Shared \; VLFs}}{\Big(\frac{2nd \; Positions}{Barcode}\hspace{1mm} - \hspace{1mm}2nd \; Position \; Singleton \; VLFs\Big)\Big(Number \; of \; Barcodes\Big)}}\)
\(\mathrm{Total \; ER} = \mathrm{\frac{\mathrm{2nd \; Position \; VLFs}}{\Big(\frac{2nd \; Positions}{Barcode}\Big)\Big(Number \; of \; Barcodes\Big)}}\)
A useful feature of the VLF package is the ability to distinguish VLFs that are shared between members of the same species (i.e. occurring in two or more sequences) or that are singletons (i.e. occurring in only a single individual). In the case of singleton sequences, it is important to know how they manifest in large barcode libraries. There are two possibilities: (1) only a single specimen of a species was sampled or (2) multiple individuals within a species lacking true genetic polymorphisms were sampled (
In a study by
The question, therefore, that must be addressed is: does there exist an optimal threshold size for specimen sampling above which no new genetic (i.e. DNA barcode haplotype) variation is likely to be observed for a species? That is, can all (or nearly all) DNA barcode haplotype diversity for a species be uncovered by simply sampling N individuals? If so, how confident can one be in such an estimate?
It is well known that current sample sizes within barcode libraries are likely insufficient for making inferences at the phylogenetic level, for instance, in the calculation of divergence times of sister taxa via neutral coalescent/molecular clock models, but there is evidence that suggests otherwise (e.g.
The VLF package also contains functions to give visual outputs of the distribution of VLFs throughout the sequences. Decile.Plot(VLF, seqlength = 648) creates a decile plot showing the number of VLFs in every tenth of the sequence. The input ‘VLF’ is the ‘position’ output of the vlfFun() and aminoAcidFun() functions, containing the counts of VLFs in each position of the sequence. The user may also enter in the ‘sas’ output of these functions, to create a decile plot of both the single and shared VLFs. Similarly, the VLF package also contains the function Sliding.Window(VLF, seqlength = 648, n = 30) which creates a sliding window plot of VLFs with a default window size of 30 bp. A 30 bp k-mer window was selected by
In the following subsections, focus is placed specifically on ntVLFs (hereafter simply referred to as VLFs) for the sake of brevity. Required DNA sequence data is included in Suppl. material
Aligned avian barcode sequences, identified to at least the family level, were downloaded from the supplementary material of
Sequences were then analysed in R using the three primary functions of the VLF package outlined above in conjunction with others. Results were concordant with those of Stoeckle and Kerr (2012) (Fig.
In comparing results obtained via VLF to those found by
VLFs |
1 |
2 |
3 |
4 |
5 |
6 |
9 |
10 |
13 |
15 |
Specimens |
446 |
63 |
27 |
3 |
2 |
5 |
2 |
2 |
1 |
1 |
VLFs |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
13 |
14 |
15 |
18 |
Positions |
91 |
53 |
25 |
19 |
12 |
11 |
5 |
7 |
5 |
3 |
4 |
2 |
2 |
1 |
1 |
Positional error rates for birds. Per barcode error rates are indicated in parentheses.
Singleton |
8.54 x 10-5 (0.0553) |
Shared |
3.92 x 10-5 (0.0254) |
Total |
1.25 x 10-4 (0.0810) |
Another advantage of the VLF package is automation of the analysis. To perform this analysis using Excel, the user must manually enter macros for each individual dataset. The automation of the analysis makes it a user-friendly tool that can be utilised as a clean-up step during a barcode analysis workflow.
In addition, the effect of reducing full-length avian barcodes evenly at both the 5’ and 3’ ends and the choice of VLF frequency cut-off, on the presence of VLFs is clearly illustrated in Fig.
Taxa with large numbers of collected specimens should be expected to show strong VLF signals relative to the real biological variation present in DNA sequences. Thus, in addition to investigating the prevalence of VLFs at the class level (Aves), the incidence of VLFs at the species level was assessed for Branta canadensis (Canada goose), the species with the largest number of specimens (125) in
Analysis of the Canada goose dataset revealed a total of 27 specimen VLFs (between 1 and 3 VLFs for each specimen) across all 125 examined barcode sequences (18 specimens comprised VLFs). Similarly, 27 positional VLFs were observed across the entire 648 bp barcode segment (five singleton VLFs, 22 shared VLFs, between 1-10 VLFs for each position). Ten alignment positions displayed VLFs: sites 58, 59, 124, 145, 147, 190, 435, 490, 501, 535. Position 145 contained the most VLFs at 10, while all other sites had between 1 and 4 VLFs. Most VLFs were concentrated at the 5’ end of sequences, with 15 VLFs occurring within the third decile alone (Fig.
VLFs |
1 |
2 |
3 |
Specimens |
11 |
5 |
2 |
VLFs |
1 |
2 |
3 |
4 |
10 |
Positions |
5 |
1 |
2 |
1 |
1 |
Positional error rates for Canada goose (Branta canadensis). Per barcode error rates are indicated in parentheses.
Singleton |
7.74 x 10-3 (5.016) |
Shared |
3.56 x 10-3 (2.304) |
Total |
0.0113 (7.320) |
Seafood fraud is a growing economic and ecological problem facing society today. DNA-based identification of specimens to species (e.g. DNA barcoding) is increasingly being used as a means of verifying product integrity. The availability of such technologies is important given that species of higher economic value (e.g. halibut, red snapper) are often substituted with those of lower cost (e.g. catfish, tilapia) (
To assess the utility of VLF to the field of barcoding forensics for regulatory purposes, DNA barcodes from four research studies published between 2008 and 2011 (
BOLD Project Code |
No. of 5’-COI Sequences |
No. of Families/Genera/Species |
EBFSF |
296 |
2/6/10 |
EMRKT |
91 |
23/32/20 |
EWSHK |
1050 |
18/32/76 |
SSNA |
934 |
1/2/8 |
Total |
2371 |
44/72/114 |
Sequence alignment was carried out in MEGA6 using MUSCLE and the ‘Align DNA’ option with default parameters. Ends of the aligned sequences were then trimmed to the standard barcode length for fishes (i.e. 652 bp) and subsequently translated to amino acids using the ‘Vertebrate Mitochondrial’ and the ‘Invertebrate Mitochondrial’ codon tables. Alignments were checked for the absence of stop codons and verification that they were in the correct reading frame. Sequencing artifacts were common within DNA barcodes. For example, a single-base indel (specifically, a nucleotide deletion), identified using the SequenceMatrix (
Findings are presented below (Fig.
VLFs |
1 |
2 |
3 |
4 |
6 |
34 |
Specimens |
42 |
10 |
1 |
3 |
1 |
1 |
Positional error rates for fishes. Per barcode error rates are indicated in parentheses.
Singleton |
7.81 x 10-5 (0.0509) |
Shared |
1.56 x 10-5 (0.0102) |
Total |
9.37 x 10-5 (0.0611) |
Since the publication of
The VLF package is a useful tool for assessing errors in DNA sequences; however, the presence of a single VLF is not always an indication of biological error and so caution must be exercised when investigating these cases. When VLFs occur, it is advisable to assess whether they are singletons or shared between multiple specimens/species. The specific analyses carried out herein suggest that singleton and shared VLFs may occur outside the narrow 3’ and 5’ windows as seen in Fig.
Note that this method is only useful with large datasets of sequences since the default cut-off frequency for VLF designation is p = 0.001. Therefore, a dataset with at least 1,000 sequences is required, but even larger datasets are suggested. Having as much haplotype variation as possible for a given taxon is ideal; however, the datasets should not be so deeply divergent that many specimens are expected to have vastly different sequences from other specimens within the same dataset. As well, the datasets should contain multiple members from each species, to ensure adequate representation of singleton and shared VLFs. It is suggested to use 5-10 individuals per species, if possible, which is typical for most barcoding initiatives conducted to date (
It is important to consider the ways in which VLF assessment may be implemented into the BOLD system, as biological variants should not be tagged as sequence errors. An interesting, but noteworthy connection exists between the occurrence of sequencing errors within barcode records and the Barcode Index Number (BIN) framework (
VLF analysis is a useful tool for evaluating errors in sequence records. The VLF package allows users to quickly and easily assess their own barcode records without the need for manual configuration or the use of Excel. While we tested the VLF R package on a previously-studied avian barcode dataset, as well as investigated the distribution of VLF sequencing errors in DNA barcodes from a variety of seafood species to probe the incidence of product mislabelling, we suggest the programme be used further to assess sequence errors in other large BARCODE and non-BARCODE libraries within GenBank and BOLD, such as Lepidoptera. Inspection of species-specific sliding window plots could indicate highly-variable nucleotide sequence regions subject to high mutation rates (such as that indicated by the sharp peaks in Fig.
In this paper, we present a new R package, VLF, along with a simple R workflow, for quality assessment and curation of large reference sequence libraries through detection of sequence artifacts, such as machine errors, indels and NUMTs/pseudogenes, inconsistencies which have been observed in diverse COI sequence datasets for crayfish, grasshoppers, marine Metazoa and insects for instance (
VLF version 1.1 is available for download through the Comprehensive R Archive Network (CRAN) directly within R using the successive commands:
>install.packages(“VLF”)
>library(VLF).
The reference manual for VLF, which includes built-in functions with explanations for their proper use, can be accessed by typing:
>?VLF.
The birds nucleotide and amino acid dataset used by
>data(birds)
>data(birds_aminoAcids)
Package source code can be accessed by typing the name of the desired function. Alternatively, code is accessible via GitHub at https://github.com/jphill01/VLF.R.
Raw (unaligned) 5’-COI sequences used in the forensic VLF analysis can be directly downloaded using the Project and Dataset Search field within the BOLD Workbench.
We acknowledge that the University of Guelph resides on the ancestral lands of the Attawandaron people and the treaty lands and territory of the Mississaugas of the Credit. We recognise the significance of the Dish with One Spoon Covenant to this land and offer our respect to our Anishinaabe, Haudenosaunee and Metis neighbours as we strive to strengthen our relationships.
We would like to thank Mark Stoeckle for his guidance and input throughout the process of this work. Special thanks to Charles (Charlie) Keown-Stoneman, Rodger Gwiazdowski and members of the Hanner and Newmaster lab groups at the University of Guelph for their valuable edits and comments on earlier drafts of the manuscript. In addition, Sarah (Sally) Adamowicz provided invaluable edits and comments to the work in the final stages of writing.
This work was supported by the University Research Chair in Computational Statistics and an Early Researcher Award from the Government of Ontario to P.D.M.
JDP wrote the manuscript, retrieved the goose and fish sequence data, carried out data analysis and submitted version 1.1 of the VLF package to CRAN. TBTA coded the majority of the VLF package. PDM acted as an advisor in the field of statistics and bioinformatics, as well as assisted in coding the VLF R package to conform to requirements for inclusion within the CRAN repository. RHH acted as an advisor in the field of DNA barcoding. All authors participated in the editing of the manuscript and approved the final version.
None declared.
652 bp FASTA alignment of 2371 published fish 5'-COI DNA barcodes.
R script to reproduce all analyses.