Biodiversity Data Journal :
Short Communication
Corresponding author: Robert G Young (ryoung04@uoguelph.ca)
Academic editor: Scott Chamberlain
Received: 29 Jan 2020 | Accepted: 10 Apr 2020 | Published: 23 Apr 2020
© 2020 Robert Young, Jiaojia Yu, Marie-José Cote, Robert Hanner
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Young RG, Yu J, Cote M-J, Hanner RH (2020) The Molecular Data Organization for Publication (MDOP) R package to aid the upload of data to shared databases. Biodiversity Data Journal 8: e50630. https://doi.org/10.3897/BDJ.8.e50630
Molecular identification methods, such as DNA barcoding, rely on centralized databases populated with morphologically identified individuals and their reference nucleotide sequence records. As molecular identification approaches have expanded into fields such as food fraud detection, environmental surveys, and border surveillance, the need for diverse international data sets has grown. Although central data repositories, like the Barcode of Life Datasystems (BOLD), provide workarounds for formatting data for upload, these workarounds can be taxing on researchers with few resources and limited funding. To address these concerns, we present the Molecular Data Organization for Publication (MDOP) R package to assist researchers in uploading data to public databases. To illustrate the use of these scripts, we use the BOLD system as an example. The main intent of this work is to assist in the movement of data from academic, governmental, and other institutional computer systems to public locations, where these data can better contribute to the global DNA barcoding initiative and other global molecular data efforts.
Molecular database, DNA barcode, molecular sequence data, data organization tools, BOLD
The use of molecular identification techniques on biological samples has been in practice for some time and includes methods like restriction fragment length polymorphisms (
There are numerous large molecular databases including the National Center for Biotechnology Information (NCBI) GenBank, European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, DNA Data Bank of Japan (DDBJ), and Barcode of Life Datasystems (BOLD). To contribute to these repositories, users must follow specific upload guidelines and data formats. Following these guidelines can be challenging for end users, and the challenges are often exacerbated when collaborating with other researchers on large data projects over long periods of time. The inclusion of metadata can make adding records to these databases even more demanding. Metadata are data associated with the primary sequence data, such as the GPS location of collection, the source of the DNA (such as tissue type), image files of the specimen, chromatogram files, and a multitude of other possibilities.
One of the data systems with more complex uploading processes, largely due to the rich metadata and files associated with records, is BOLD (
Owing to its accuracy, universal nature, and more standardized methodology as compared to traditional morphological identifications, DNA barcoding has been successfully applied to a wide range of fields of study including taxonomy (
In addition, the adoption of DNA barcoding for regulatory applications has placed barcoding in the context of state and international laws (
Populating public databases remains the main challenge for a growing community of researchers, regulators, and others using molecular identifications in a DNA barcoding approach. Scientists who intend to share their data publicly need to invest time in organizing their data files to fulfill upload requirements. Although this can sometimes be labour-intensive, the ongoing management of these data in a centralized location provides great value when research teams shift or change, when sharing data across distances, and when publishing final data sets to accompany peer-reviewed literature. Although the strengths of a centralized DNA barcoding initiative are clear, there are still relatively few researchers contributing to public databases; and even when researchers contribute, they have been slow to make these data accessible to all users (
Although the DNA barcoding community has addressed potential roadblocks to the movement of data to shared databases through upload processes, challenges remain. This is especially true for researchers who are new to DNA barcoding or who have few skills or resources to bring data together informatically in a standardized manner. For these researchers, hiring bioinformaticians can be expensive. In addition, some available options to organize data, such as using commands via a terminal window or command prompt (e.g., the DOS C:\ prompt), may not always be possible when working in government or industry, where cybersecurity measures limit access (
There are multiple R packages developed for manipulating genomic data (
To address these gaps, we present the package Molecular Data Organization for Publication (MDOP) (R Ver. 3.5.1;
The following three sections describe scenarios where data or files need to be obtained or manipulated. Each function in this package can be called with one or more arguments. Alternatively, the user can call a function without arguments and will be prompted for the necessary information (see the package-associated readme file for examples: https://github.com/rgyoung6/MDOP/blob/master/README.md).
When preparing uploads to centralized databases, a list of all files with associated information (e.g., image file data, trace file data) is often needed. Such lists can be obtained using operating system commands entered at a command prompt or terminal. However, use of the command prompt is not always possible, particularly in government institutions and industry, where security features can restrict it.
target_file_list()
This function lists files with the extensions JPG, AB1, or FASTA/FAS in a chosen directory and all of its subdirectories. The list of file paths and file names is saved as a text file in the chosen directory. The user can either submit the file path and the file type as arguments when calling the function or run it with no arguments and be prompted for the necessary information. If running without arguments, the user is first prompted to choose a folder, which is searched and in which the output file is saved, and then to enter the file type to list (JPG, AB1, or FAS). The output appears in a text file with the naming convention YYYYMMDD_target_file_list_TYP.txt, where the first eight characters are the date of the run, the second section is the function name, and TYP is the file type chosen (JPG, AB1, or FAS).
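The kind of listing described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the MDOP source: the helper name list_files_by_type() and its arguments are hypothetical, and the sketch assumes that the FAS type covers both .fas and .fasta extensions and that matching ignores case.

```r
# Hypothetical helper (not part of MDOP): list files of one type under a
# directory tree and write their paths to a dated text file.
list_files_by_type <- function(dir, type = c("JPG", "AB1", "FAS")) {
  type <- match.arg(type)
  # Assumption: FAS covers both the .fas and .fasta extensions
  pattern <- switch(type,
    JPG = "\\.jpg$",
    AB1 = "\\.ab1$",
    FAS = "\\.(fas|fasta)$"
  )
  # recursive = TRUE descends into all subdirectories
  paths <- list.files(dir, pattern = pattern, recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  out <- file.path(dir, paste0(format(Sys.Date(), "%Y%m%d"),
                               "_target_file_list_", type, ".txt"))
  writeLines(paths, out)
  invisible(paths)
}

# Small demonstration in a fresh temporary directory
demo_dir <- tempfile("mdop_list_")
dir.create(file.path(demo_dir, "plate1"), recursive = TRUE)
file.create(file.path(demo_dir, "a.jpg"),
            file.path(demo_dir, "plate1", "b.JPG"),
            file.path(demo_dir, "plate1", "c.ab1"))
found <- list_files_by_type(demo_dir, "JPG")
print(basename(found))
```

Because everything is done through list.files() and writeLines(), no terminal or command prompt access is required, which is the constraint the function is meant to address.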
Moving, copying, sorting, and subsetting large numbers of files is often necessary when preparing to upload data to shared databases. The organization of files for upload is made more difficult when processing files from multiple sources, research groups, and over time. These three functions can assist in the organization of diverse sets of files for upload.
recursive_copy()
Often, the submission of numerous files, including image and chromatogram (trace) files, to a centralized data system is necessary. Bringing files into a central folder may be difficult when dealing with large numbers of files stored in nested folder structures. The recursive_copy() function copies all files with a chosen extension (JPG, AB1, or FASTA/FAS) from a directory and all of its subdirectories into a single destination folder, making these files easier to upload. The user can either submit the file path and the file type as arguments when calling the function or run it with no arguments and be prompted for the necessary information. If running without arguments, the user is first prompted to choose the folder where the new folder of copied files will be located, and then to enter the type of file to copy (JPG, AB1, or FAS). The output appears in a folder with the naming convention YYYYMMDD_recursive_copy_TYP, where the first eight characters are the date of the run, the second section is the function name, and TYP is the file type chosen (JPG, AB1, or FAS).
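As a rough illustration of flattening a nested folder structure, the following base R sketch gathers matching files into one folder. The helper collect_files() and its arguments are hypothetical and are not the package's interface. One design decision that the text does not specify is what to do when two subfolders contain files with the same name; this sketch keeps the first and warns about the rest.

```r
# Hypothetical helper (not part of MDOP): copy every file matching `pattern`
# from `src_dir` and its subdirectories into the single folder `dest_dir`.
collect_files <- function(src_dir, dest_dir, pattern) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  paths <- list.files(src_dir, pattern = pattern, recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  # Assumption of this sketch: identical file names from different
  # subfolders would collide in a flat folder, so only the first is kept.
  dup <- duplicated(basename(paths))
  if (any(dup)) {
    warning("Skipping duplicate file names: ",
            paste(basename(paths)[dup], collapse = ", "))
  }
  paths <- paths[!dup]
  ok <- file.copy(paths, file.path(dest_dir, basename(paths)),
                  overwrite = FALSE)
  invisible(paths[ok])
}

# Demonstration: two trace files stored at different depths
src <- tempfile("mdop_src_")
dir.create(file.path(src, "run1", "plateA"), recursive = TRUE)
file.create(file.path(src, "run1", "s1.ab1"),
            file.path(src, "run1", "plateA", "s2.ab1"))
dest <- tempfile("mdop_dest_")
copied <- collect_files(src, dest, "\\.ab1$")
print(list.files(dest))
```

The destination is kept outside the source tree here so that a second run does not pick up the previously copied files.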
max_packs()
Uploads of image files to centralized databases are often limited to a particular size per upload. It can be time consuming and challenging to partition files into folders of target sizes. The max_packs() function creates these partitioned folders quickly and easily. It takes a single folder that contains the target files (JPG, AB1, or FAS) and no subfolders, and distributes the files into folders based on a maximum folder size. The user can either submit the file path, file type, and maximum desired folder size as arguments when calling the function or run it with no arguments and be prompted for the necessary information. If running without arguments, the user is first prompted to choose the folder containing the target files, then to enter the type of file to copy (JPG, AB1, or FAS), and finally to enter an integer value for the maximum allowable size of the folders created with the copied files. The output folders appear in the target folder location with the naming convention YYYYMMDD_max_packs_TYP_#, where the first eight characters are the date of the run, the second section is the function name, TYP is the file type chosen (JPG, AB1, or FAS), and # is an index for the folder number.
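The partitioning idea can be illustrated with a simple sequential fill: files are added to the current folder until the next file would exceed the limit, at which point a new folder is started. This is a sketch under assumptions rather than the MDOP algorithm; the helper pack_by_size() is hypothetical, the limit is expressed in bytes here (the unit used by the package is not stated above), and a single file larger than the limit is placed alone in its own folder.

```r
# Hypothetical helper (not part of MDOP): distribute files into numbered
# folders so that each folder stays at or under `max_bytes` where possible.
pack_by_size <- function(dir, pattern, max_bytes) {
  paths <- list.files(dir, pattern = pattern, full.names = TRUE,
                      ignore.case = TRUE)
  sizes <- file.size(paths)
  bin <- integer(length(paths))
  current <- 1L
  used <- 0
  for (i in seq_along(paths)) {
    # Start a new folder when adding this file would exceed the limit
    if (used > 0 && used + sizes[i] > max_bytes) {
      current <- current + 1L
      used <- 0
    }
    bin[i] <- current
    used <- used + sizes[i]
  }
  stamp <- format(Sys.Date(), "%Y%m%d")
  for (b in unique(bin)) {
    dest <- file.path(dir, paste0(stamp, "_packs_", b))
    dir.create(dest, showWarnings = FALSE)
    file.copy(paths[bin == b], dest)
  }
  invisible(split(basename(paths), bin))
}

# Demonstration: five 400-byte files with a 1000-byte limit per folder
img_dir <- tempfile("mdop_img_")
dir.create(img_dir)
for (f in sprintf("img%d.jpg", 1:5)) {
  writeBin(raw(400), file.path(img_dir, f))
}
packs <- pack_by_size(img_dir, "\\.jpg$", max_bytes = 1000)
print(lengths(packs))
```

A sequential fill does not minimize the number of folders, but it preserves file order, which can make the resulting uploads easier to audit.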
copy_by_list()
After quality filtering, it is likely that some files associated with molecular records will not need to be uploaded to shared databases. For example, a DNA sequence of poor quality might be removed from the data set intended for upload, which then requires the removal of its associated metadata files. Removing all of these records by pointing and clicking is often time consuming. In addition, the screening of poor-quality records is often completed in fasta files and/or through the use of lists. copy_by_list() copies the files named in a target text file from a larger folder and places the copies in a new folder at the identified location. This function does not look at subdirectories in the target directory; to first gather the files of interest into a single folder, see recursive_copy(). When using copy_by_list(), the user can either submit the file path and target file list as arguments when calling the function or run it with no arguments and be prompted for the necessary information. If running without arguments, the user is first prompted to choose the folder where the files of interest are located and then to select the text file containing the list of desired files. The output folder name follows the format YYYYMMDD_copy_by_list, where the first eight characters are the date of the run and the second section is the function name. The text file listing the files to be copied needs to have one file name per line and a single blank line at the end of the list.
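The list-driven copy can be sketched in base R as follows. The helper copy_listed_files() and its arguments are hypothetical and do not reflect the package's signature. Unlike the requirement stated above for copy_by_list(), this sketch simply ignores blank lines, and it reports listed names that are not present rather than stopping.

```r
# Hypothetical helper (not part of MDOP): copy the files named in a text
# file (one name per line) from `src_dir` into `dest_dir`.
copy_listed_files <- function(src_dir, list_file, dest_dir) {
  wanted <- trimws(readLines(list_file, warn = FALSE))
  wanted <- wanted[nzchar(wanted)]  # ignore blank lines
  present <- file.exists(file.path(src_dir, wanted))
  if (any(!present)) {
    message("Not found: ", paste(wanted[!present], collapse = ", "))
  }
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  file.copy(file.path(src_dir, wanted[present]), dest_dir)
  invisible(wanted[present])
}

# Demonstration: keep two of three trace files; one listed name is missing
trace_dir <- tempfile("mdop_traces_")
dir.create(trace_dir)
file.create(file.path(trace_dir, c("s1.ab1", "s2.ab1", "s3.ab1")))
keep_list <- file.path(trace_dir, "keep.txt")
writeLines(c("s1.ab1", "s3.ab1", "missing.ab1", ""), keep_list)
kept_dir <- file.path(trace_dir,
                      paste0(format(Sys.Date(), "%Y%m%d"), "_copy_by_list"))
kept <- copy_listed_files(trace_dir, keep_list, kept_dir)
print(kept)
```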
The manipulation of sequence data can also be a challenge when uploading to databases. This is especially true when dealing with large data sets containing multiple markers from different sources, researchers, or naming conventions. The following five R functions help to manipulate multiple sequence fasta files. Note that degap(), rank_seq(), head_derep(), and seq_derep() require a single-line (not multiline) fasta input file to function properly. If the working file is in multiline format, the user can use multi_to_single_fasta() to convert it to single-line format.
degap()
It is often desirable to upload only unaligned data to public databases. To accomplish this easily we present the degap() tool. This function removes gaps (represented by "-") from all sequences in a selected fasta file. Users will need to select a folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the function or run it with no arguments and be prompted for the necessary information. The output file follows the naming convention YYYYMMDD_degap.fas and is saved in the selected working directory.
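Gap removal on a single-line fasta file amounts to deleting "-" from every line that is not a header. The following is a minimal base R sketch of that idea, not the MDOP code; strip_gaps() is a hypothetical name, and the sketch works on a character vector such as the one returned by readLines().

```r
# Hypothetical helper (not part of MDOP): remove "-" from sequence lines
# while leaving header lines (starting with ">") untouched.
strip_gaps <- function(lines) {
  is_header <- startsWith(lines, ">")
  lines[!is_header] <- gsub("-", "", lines[!is_header], fixed = TRUE)
  lines
}

aligned <- c(">sample-1", "ACG--TTA-C",
             ">sample-2", "--ACGTTAGC")
degapped <- strip_gaps(aligned)
print(degapped)
```

Restricting the substitution to non-header lines matters because sample identifiers often contain hyphens, as in this example. The result could be written to disk with writeLines().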
rank_seq()
Often it is useful to screen out sequences of shorter length from further analyses. rank_seq() takes a selected multiple sequence fasta file, organizes the sequences from shortest to longest, and saves the output in a new fasta file. This eases the visualisation of the fasta file in an alignment program and facilitates the selection of sequences over a given length and the removal of sequences below a target length. Users will need to select a folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the function or, alternatively, run it with no arguments and be prompted for the necessary information. The new sorted file is saved in the selected location with the naming convention YYYYMMDD_rank_seq.fas.
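Sorting a single-line fasta file by sequence length can be sketched by pairing each header with the line that follows it and reordering the pairs. The helper rank_by_length() is hypothetical and is not the package's implementation. The sketch assumes that length is the number of characters on the sequence line, so any gap characters would be counted.

```r
# Hypothetical helper (not part of MDOP): order single-line fasta records
# from shortest to longest sequence.
rank_by_length <- function(lines) {
  # Single-line fasta: headers on odd lines, sequences on even lines
  stopifnot(length(lines) >= 2, length(lines) %% 2 == 0)
  headers <- lines[c(TRUE, FALSE)]
  seqs <- lines[c(FALSE, TRUE)]
  ord <- order(nchar(seqs))
  # rbind() followed by as.vector() interleaves headers and sequences again
  as.vector(rbind(headers[ord], seqs[ord]))
}

fasta <- c(">a", "ACGTACGT",
           ">b", "ACG",
           ">c", "ACGTA")
ranked <- rank_by_length(fasta)
print(ranked)
```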
head_derep()
Removing duplicate records based on the fasta file header may be necessary to ensure no repetition of data. head_derep() addresses this need by reducing a selected fasta file to its unique entries based on the headers. Users will need to select a folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the function or, alternatively, run it with no arguments and be prompted for the necessary information. The output is saved in the selected directory with the naming convention YYYYMMDD_head_derep.fas.
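Header-based dereplication can be sketched with duplicated(). The helper derep_by_header() is hypothetical, and one assumption should be flagged: this sketch keeps the first record seen for each header, whereas the text above does not say which duplicate the package retains.

```r
# Hypothetical helper (not part of MDOP): keep one record per unique header
# in a single-line fasta file (the first occurrence is retained).
derep_by_header <- function(lines) {
  headers <- lines[c(TRUE, FALSE)]
  seqs <- lines[c(FALSE, TRUE)]
  keep <- !duplicated(headers)
  as.vector(rbind(headers[keep], seqs[keep]))
}

records <- c(">spec1|COI", "ACGTAC",
             ">spec2|COI", "ACGTTC",
             ">spec1|COI", "ACGTAC")
unique_records <- derep_by_header(records)
print(unique_records)
```

If two records share a header but differ in sequence, this approach silently drops the later one, so it may be worth comparing the sequences of duplicated headers before discarding anything.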
seq_derep()
Removing duplicate records based on sequence may be necessary to ensure no repetition of data or when looking to determine the haplotype diversity in a multiple sequence file. seq_derep() reduces a selected fasta file to its unique entries based on the sequences. Users will need to select a folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the function or, alternatively, run it with no arguments and be prompted for the necessary information. The output is saved in the selected directory with the naming convention YYYYMMDD_seq_derep.fas.
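For the haplotype use case mentioned above, it can help to count how often each unique sequence occurs in addition to reducing the file. The following base R sketch is an illustration, not the MDOP code. It assumes that sequences differing only in letter case are the same haplotype, so sequences are converted to upper case before comparison, and it keeps the header of the first record carrying each sequence.

```r
# Single-line fasta input: headers on odd lines, sequences on even lines
fasta_in <- c(">s1", "ACGT",
              ">s2", "ACGA",
              ">s3", "ACGT",
              ">s4", "acgt")
seq_headers <- fasta_in[c(TRUE, FALSE)]
# Assumption: comparison is case-insensitive, so sequences are upper-cased
seq_strings <- toupper(fasta_in[c(FALSE, TRUE)])

# Keep the first record carrying each distinct sequence
first_seen <- !duplicated(seq_strings)
unique_fasta <- as.vector(rbind(seq_headers[first_seen],
                                seq_strings[first_seen]))

# Number of records sharing each unique sequence (haplotype frequency)
haplotype_counts <- table(seq_strings)

print(unique_fasta)
print(haplotype_counts)
```

Exact string matching treats sequences of different lengths, or sequences containing ambiguity codes, as distinct haplotypes; trimming to a common region beforehand may be appropriate for some data sets.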
multi_to_single_fasta()
Multiline fasta files, in which each header line is followed by one or more lines of up to 80 characters of nucleotide sequence data, can be problematic when using different programs or tools. multi_to_single_fasta() converts a multiline fasta file to single-line format, in which each header is followed by exactly one line of nucleotide sequence data. Users will need to select a folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the function or, alternatively, run it with no arguments and be prompted for the necessary information. The output is saved in the selected directory with the naming convention YYYYMMDD_multi_to_single_fasta.fas.
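The conversion can be sketched by assigning every line to the record of the most recent header and then concatenating the sequence lines within each record. The helper multi_to_single() is hypothetical and is not the package's implementation. In this sketch, blank lines are dropped and a header with no sequence lines yields an empty sequence line.

```r
# Hypothetical helper (not part of MDOP): join wrapped sequence lines so
# that each header is followed by exactly one sequence line.
multi_to_single <- function(lines) {
  lines <- lines[nzchar(lines)]      # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)        # record index for every line
  headers <- lines[is_header]
  # Group sequence lines by record; the explicit factor levels keep records
  # that have a header but no sequence lines
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_along(headers)))
  seqs <- unname(vapply(groups, paste0, character(1), collapse = ""))
  as.vector(rbind(headers, seqs))
}

wrapped <- c(">s1", "ACGT", "TTGA",
             ">s2", "GGCC", "AA")
single <- multi_to_single(wrapped)
print(single)
```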
This work is intended to assist scientists, technicians, and data managers to organize DNA barcode data and associated metadata for upload to public databases. A fundamental element of DNA barcoding is the presence of a centralized repository with diverse data providing an understanding of the within-species compared to between-species variation present in sequences. Since this information is essential to all barcode projects, a greater effort to populate these databases and make records public is necessary. Although this package and these nine functions are not comprehensive, they do represent a step forward in helping researchers move data to shared databases. This is especially true in organizations with secured networks where access to file manipulation via terminal windows is not possible. As such, we present the Molecular Data Organization for Publication (MDOP) R package to remove obstacles to these uploads.
This work was supported in part through a research collaboration with the Canadian Food Inspection Agency through the Federal Assistance Program. Participation in this study was also supported by the Bioinformatics Masters program at the University of Guelph. The authors would like to thank Jarrett Phillips and Yoamel Milián-García for commenting on earlier drafts. In addition, we would like to thank three reviewers and an editor for helpful comments preparing this manuscript for publication.
RGY conceived and designed the study. RGY and JY wrote the R scripts. RGY, JY, MJC, and RHH evaluated the scripts. All authors discussed the results and contributed to the final manuscript.
The authors of this manuscript are not aware of any conflict of interest related to the preparation and publishing of this manuscript.