genepopedit

Simple and flexible manipulation of genomic data.

The goal of genepopedit is to provide a simple and flexible tool for manipulating large multi-locus genotype datasets in R.

** Note input parameters for genepopedit versions > 1.0.0.5 are now all completely lowercase. **

** The function genepop_toploci() is currently deprecated because the package diveRsity is no longer maintained. **

Use genepopedit to subset a SNP dataset by:

removing specified loci.
removing specified populations.
removing specified individuals.
renaming populations.
grouping populations.
reordering populations.
extracting meta-data:
- population names.
- population counts.
- sample IDs.
- loci names.
- allele calls.
- allele frequencies by population grouping.
- loci linkage and global Weir and Cockerham’s Fst.
- lists of unlinked loci maximizing pairwise global Weir and Cockerham’s Fst.
create datasets for training, assignment, and outlier detection, according to a population stratified random sample.
convert Genepop to STRUCTURE, FSTAT, NEWHYBRIDS, ASSIGNER, BGC, TREEMIX, COLONY, Genetic Stock Identification (gsi_sim), HZAR, and flattened/unflattened format.
simulate individual genotypes using pooled DNA allele frequencies.

Fig 1. genepopedit workflow including data preparation, diagnostics, manipulation and transformation. Files, functions, and function operations are denoted by black, grey and blue text, respectively. Function inputs and outputs are denoted by dashed and solid lines, respectively.

Requirement:
genepopedit functions through the manipulation of multi-locus diploid SNP files structured in the Genepop file format. Specifically, we use the three-digit format (e.g. 110110), where the six digits correspond to the two alleles of a given locus for an individual. Locus names can be listed in the first row separated by commas, or each on its own row, in which case the total number of rows in the Genepop file equals:

nrows = nLOCI + nINDIVIDUALS + nPOPULATIONS + 1

nrows* = nLOCI + nINDIVIDUALS + nPOPULATIONS

* if STACKS version is not specified

For example:

A three locus dataset with two populations and four individuals per population with the STACKS version specified

STACKS Version 1.0
Loci_1
Loci_2
Loci_3
Pop
BON_01 ,  120120 110110 110110
BON_02 ,  100100 110110 110110
BON_03 ,  100100 110110 110110
BON_04 ,  100100 110110 110110
Pop
TAG_01 ,  120120 110110 110110
TAG_02 ,  120120 110110 110110
TAG_03 ,  120120 110110 110110
TAG_04 ,  120120 110110 110110
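
The row count of this example can be checked directly in R. The following is a minimal base-R sketch (it does not call genepopedit) that holds the example as a character vector and counts each component. The object names are illustrative only, and the rule used to detect individuals (rows containing a comma) is an assumption that fits this example.

```r
# The example file above, one element per row
genepop_lines <- c(
  "STACKS Version 1.0",
  "Loci_1", "Loci_2", "Loci_3",
  "Pop",
  "BON_01 ,  120120 110110 110110",
  "BON_02 ,  100100 110110 110110",
  "BON_03 ,  100100 110110 110110",
  "BON_04 ,  100100 110110 110110",
  "Pop",
  "TAG_01 ,  120120 110110 110110",
  "TAG_02 ,  120120 110110 110110",
  "TAG_03 ,  120120 110110 110110",
  "TAG_04 ,  120120 110110 110110"
)

is_pop <- toupper(trimws(genepop_lines)) == "POP"      # "Pop" delimiter rows
is_ind <- grepl(",", genepop_lines, fixed = TRUE)      # individual rows carry the comma separator

n_pop  <- sum(is_pop)
n_ind  <- sum(is_ind)
n_loci <- which(is_pop)[1] - 2                         # rows between the title row and the first "Pop"

# title row + loci + "Pop" rows + individuals
expected_rows <- 1 + n_loci + n_pop + n_ind
expected_rows == length(genepop_lines)
```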

Alternatively, the loci names can be read in the first row as a single character string separated by commas.

Loci_1,Loci_2,Loci_3
Pop
BON_01 ,  120120 110110 110110
BON_02 ,  100100 110110 110110
BON_03 ,  100100 110110 110110
BON_04 ,  100100 110110 110110
Pop
TAG_01 ,  120120 110110 110110
TAG_02 ,  120120 110110 110110
TAG_03 ,  120120 110110 110110
TAG_04 ,  120120 110110 110110

In both formats each row is read in as a single character vector. Sample IDs have the population and sample number separated by a “_”. Between the sample ID and the loci is the conventional Genepop separator " ,  " (space, comma, space, space). Note that if your population label is not separated from the sample number in the sample IDs, refer to the help section for subset_genepop_rename.
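
To make the row layout concrete, here is a small base-R sketch that takes one row from the examples above and separates the sample ID, the population label, and the two three-digit alleles at each locus. It only illustrates the format and is not how genepopedit parses files internally; all object names are hypothetical.

```r
row <- "BON_01 ,  120120 110110 110110"

# Split on the comma, then trim the surrounding spaces of the separator
parts     <- strsplit(row, ",", fixed = TRUE)[[1]]
sample_id <- trimws(parts[1])
genotypes <- strsplit(trimws(parts[2]), "\\s+")[[1]]

# Population label is everything before the first "_"
population <- sub("_.*$", "", sample_id)

# Each six-digit genotype holds two three-digit alleles
allele_1 <- substr(genotypes, 1, 3)
allele_2 <- substr(genotypes, 4, 6)
```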

Note that input and output “path” variables all require the FULL file path. Relative paths will not work with genepopedit functions.
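
One way to build a full path in base R is sketched below. The output file name is hypothetical, and tempdir() stands in for your own project folder. The directory is normalized first because normalizePath() may return a non-existent path unchanged on some platforms.

```r
# Directory that already exists (replace tempdir() with your own folder)
out_dir <- normalizePath(tempdir(), winslash = "/")

# Full path for an output file that does not exist yet (hypothetical name)
out_path <- file.path(out_dir, "genepop_subset.txt")

# Quick check that the path is absolute (Unix-style "/" or Windows drive letter)
is_full_path <- grepl("^(/|[A-Za-z]:)", out_path)
```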

Most molecular based file formats can be converted to and from Genepop using conversion programs such as the R package adegenet or the program PGDspider.

Installation

You can install genepopedit as an R package using the following steps. Before installation it is recommended that you open a new R session without any packages loaded. During the devtools installation process some package dependencies may be downloaded or updated. Installation errors can arise if an older version of a package dependency is already loaded.

Step 1 Install the R package devtools

if (!require("devtools")) install.packages("devtools") # to install

Step 2 Install genepopedit:

# install the diveRsity dependency from GitHub ** this package may no longer be supported by CRAN, depending on your R version **
devtools::install_github("kkeenan02/diveRsity")
# install genepopedit from GitHub
devtools::install_github("rystanley/genepopedit")
library(genepopedit) # load the library

Step 3 Install PGDspider and plink. Note that these programs are required only for genepop_toploci() and PGDspideR().

PGDspider link
plink link

For the time being, this function only works with Plink 1.9. Use of Plink 2.0 will create function issues.

** Note that the appveyor build ‘failure’ is linked to an error in the build check and not to genepopedit itself. This package has been tested on Windows, Linux (Ubuntu & Mint), and macOS operating systems. **

Contributions:

genepopedit was written in collaboration by:

Ryan Stanley https://github.com/rystanley - Corresponding developer and maintainer
Ian Bradbury https://bradburygeneticslab.com/
Nick Jeffery https://github.com/NickJeff13
Brendan Wringe https://github.com/bwringe
Sarah Lehnert https://github.com/SarahLehnert

Support funding was provided in part by the Canadian Healthy Oceans Network (CHONe).

This package has been developed for anyone looking for a more efficient and repeatable method to manipulate large multi-locus datasets. The package is open, and I encourage you to tinker and look for improvements. I will do my best to respond to any inquiries, add functions and/or functionality, and improve the efficiency of the package.

If you don’t understand something, please let me know:
(ryan.stanley at dfo-mpo.gc.ca).
Any ideas on how to improve the functionality are very much appreciated.
If you spot a typo, feel free to edit and send a pull request.

Pull request how-to:

Click “edit this page” on the sidebar.
Make the changes using GitHub’s in-page editor and save.
Submit a pull request and include a brief description of your changes (e.g. “spelling errors” or “indexing error”).

Citation

Full package description and citation now available at Molecular Ecology Resources http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12569/abstract

A Zenodo DOI is also available for the most recent release of genepopedit:

Diagnostics

genepop_detective.R

extract quick meta-data from Genepop file
- Population names
- Population counts
- Sample IDs
- Loci
- Alleles

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
variable	variable to report.

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
popgroup	population grouping using the “Pop” delimiter (Default: NULL) or a dataframe or path to a csv. The grouping dataframe should have two columns, the first corresponding to the population name and the second to an aggregation vector of common groups. Each population can only be assigned to one group.
wide	logical (default: FALSE) defining whether the output should be cast in ‘wide’ format. Note that ‘wide’ format is accepted as the input for alleleotype_genepop().
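
As an illustration of what allele frequencies by population grouping means, the base-R sketch below counts alleles at Loci_1 from the example dataset, first per population and then per group. The popgroup dataframe follows the two-column layout described above, but its column names and the group label "Combined" are hypothetical, and the calculation is a plain frequency count rather than the package's own implementation.

```r
# Genotypes at Loci_1 for the eight example individuals
genotypes <- data.frame(
  pop    = rep(c("BON", "TAG"), each = 4),
  Loci_1 = c("120120", "100100", "100100", "100100",
             "120120", "120120", "120120", "120120"),
  stringsAsFactors = FALSE
)

# Two-column grouping: population name, then its group (hypothetical labels)
popgroup <- data.frame(
  pop   = c("BON", "TAG"),
  group = c("Combined", "Combined"),
  stringsAsFactors = FALSE
)

# Stack both alleles of every individual into one long table
alleles <- data.frame(
  pop    = rep(genotypes$pop, 2),
  allele = c(substr(genotypes$Loci_1, 1, 3), substr(genotypes$Loci_1, 4, 6)),
  stringsAsFactors = FALSE
)
alleles$group <- popgroup$group[match(alleles$pop, popgroup$pop)]

# Row-wise proportions: one row per population (or group), one column per allele
freq_by_pop   <- prop.table(table(alleles$pop, alleles$allele), margin = 1)
freq_by_group <- prop.table(table(alleles$group, alleles$allele), margin = 1)
```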

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
where.plink	A file path to the PLINK installation folder.
where.pgdspider	A file path to the PGDspider installation folder.
maf	Minor allele frequency cutoff (default = 0.05)
path	the filepath & filename of output.
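
The idea behind a minor allele frequency cutoff can be sketched in a few lines of base R. The frequencies below are made-up values for biallelic loci, and treating the cutoff as inclusive (>=) is an assumption of this sketch, not a statement about how the package applies maf.

```r
# Hypothetical frequencies of one allele at four biallelic loci
allele_freq <- c(Loci_1 = 0.62, Loci_2 = 0.98, Loci_3 = 0.03, Loci_4 = 0.50)

# The minor allele is whichever of the two alleles is rarer
maf_values <- pmin(allele_freq, 1 - allele_freq)

# Keep loci whose minor allele frequency reaches the cutoff
maf_cutoff <- 0.05
keep_loci  <- names(maf_values)[maf_values >= maf_cutoff]
```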

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
subs	vector loci names of interest (default: NULL)
keep	logical whether to keep the loci specified by subs (default: TRUE) or to remove them and keep the remaining loci.
spop	populations to be retained (default: NULL).
path	the filepath & filename of output.

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
nameframe	a dataframe or path to a csv detailing the original and any edited population names or sampleIDs.
renumber	logical (default: FALSE) whether the sample numbers are to be replaced.
meta	character defining which meta information is being edited. Options are “Pop” (default) or “Ind” for populations or sampleIDs respectively. This parameter must be specified.
path	the filepath & filename of output.

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
indiv	vector sample IDs of interest.
keep	logical whether to delete sample IDs specified by indiv (default: TRUE) or delete all other IDs.
path	the filepath & filename of output.

Variable name	Input
input	complete file path to the input file to be converted or an object in the workspace. The first column should denote populations and adjacent columns the allele frequency (major or minor) for each locus (named in the column header).
numsim	the number of simulated individuals to be returned per population (default: 100).
path	the filepath & filename of output.
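
The sketch below shows one simple way to simulate diploid genotypes from pooled allele frequencies in base R. It assumes that the two alleles of an individual are drawn independently (random mating), and it uses the made-up allele codes 100 and 120 and made-up frequencies; it is not necessarily the method used by the package.

```r
set.seed(1)

# Input layout described above: populations in column 1, one frequency column per locus
pooled <- data.frame(
  Pop    = c("BON", "TAG"),
  Loci_1 = c(0.25, 1.00),   # frequency of the tracked allele (coded "120" here)
  Loci_2 = c(0.50, 0.90),
  stringsAsFactors = FALSE
)
numsim <- 100               # simulated individuals per population

# Draw 0, 1 or 2 copies of the tracked allele and translate to a six-digit genotype
simulate_locus <- function(p, n) {
  copies <- rbinom(n, size = 2, prob = p)
  c("100100", "100120", "120120")[copies + 1]
}

loci <- names(pooled)[-1]
sim <- do.call(rbind, lapply(seq_len(nrow(pooled)), function(i) {
  geno <- lapply(loci, function(l) simulate_locus(pooled[[l]][i], numsim))
  names(geno) <- loci
  data.frame(SampleID = sprintf("%s_%03d", pooled$Pop[i], seq_len(numsim)),
             geno, stringsAsFactors = FALSE)
}))
```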

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file
nsample	object which defines sampling.
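
A population-stratified random sample can be pictured with the base-R sketch below, which draws the same fraction of individuals from every population using the sample IDs of the example dataset. The fraction and object names are hypothetical, and this only illustrates the concept; see the function help for the forms nsample can take.

```r
set.seed(42)

sample_ids <- c(sprintf("BON_%02d", 1:4), sprintf("TAG_%02d", 1:4))
pops       <- sub("_.*$", "", sample_ids)   # population label before the "_"
frac       <- 0.5                           # hypothetical fraction sampled per population

# Sample within each population separately so every stratum is represented
picked <- unlist(lapply(split(sample_ids, pops), function(ids) {
  ids[sample.int(length(ids), size = ceiling(frac * length(ids)))]
}), use.names = FALSE)

picked_per_pop <- table(sub("_.*$", "", picked))
```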

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file.
popdef	a dataframe or path to a csv containing two columns. Column 1 corresponds to the population names. Column 2 has the grouping classification for each population, defining parental 1 (“P1”), parental 2 (“P2”) and admixed (“Admixed”) populations.
fname	collective name assigned to each of the three output files.
path	the path to directory where the BGC files will be saved.
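
A popdef dataframe of the shape described above might look like the following sketch. The column names and the third population ("MIX") are hypothetical; only the group labels “P1”, “P2” and “Admixed” come from the description.

```r
popdef <- data.frame(
  Pop   = c("BON", "TAG", "MIX"),           # population names ("MIX" is made up)
  Group = c("P1", "P2", "Admixed"),         # classification of each population
  stringsAsFactors = FALSE
)

# Sanity checks before use: two columns, known labels, one row per population
valid_groups <- c("P1", "P2", "Admixed")
popdef_ok <- ncol(popdef) == 2 &&
  all(popdef$Group %in% valid_groups) &&
  !any(duplicated(popdef$Pop))
```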

Variable name	Input
genepop	a path to a Genepop file or a dataframe read into the workspace of a Genepop file
distances	a dataframe or path to a text file with the distances between populations. It should contain two columns: Populations and Distances. There should be the same number of populations as in the Genepop file.
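
As a sketch of the distances input, the dataframe below uses the two column names given above with made-up distance values (the source does not state units), and checks that it lists the same populations as the example Genepop file.

```r
genepop_pops <- c("BON", "TAG")             # populations in the example Genepop file

distances <- data.frame(
  Populations = c("BON", "TAG"),
  Distances   = c(0, 125),                  # hypothetical values, units not specified
  stringsAsFactors = FALSE
)

# Same number of populations, and the same names, as the Genepop file
distances_ok <- nrow(distances) == length(genepop_pops) &&
  setequal(distances$Populations, genepop_pops)
```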
path	the filepath and filename of output.

Variable name	Input
df	data.frame object in workspace. First column is the sampleID (e.g. “BON_01”) the remaining columns are Loci.
path	the filepath & filename of output.
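
The flattened layout described for df (sample ID in the first column, one column per locus) can be built by hand from Genepop rows, as in the base-R sketch below. It only illustrates the expected shape of the dataframe; the column name SampleID is an assumption.

```r
rows <- c("BON_01 ,  120120 110110 110110",
          "TAG_01 ,  120120 110110 110110")
loci <- c("Loci_1", "Loci_2", "Loci_3")

# Separate the sample ID from the genotype string of each row
split_rows <- strsplit(rows, ",", fixed = TRUE)
ids        <- trimws(vapply(split_rows, `[`, character(1), 1))
geno_text  <- trimws(vapply(split_rows, `[`, character(1), 2))

# One row per individual, one column per locus
geno <- do.call(rbind, strsplit(geno_text, "\\s+"))
df   <- data.frame(SampleID = ids, geno, stringsAsFactors = FALSE)
names(df)[-1] <- loci
```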

Variable name	Input
input	complete file path to the input file to be converted.
input_format	format of the input file. This format should match the dropdown menus of pgdSpider in terms of spelling and capitalization (e.g. GENEPOP & FSTAT).
output	complete file path defining where the converted file will be stored.
output_format	format of the converted file. This format should match the dropdown menus of pgdSpider in terms of spelling and capitalization (e.g. GENEPOP & FSTAT).
spid	complete file path to the .spid file created by pgdSpider defining the conversion between input_format and output_format. Note that parameters of this .spid file must match the conversion and input file as specified by pgdSpider.
where.pgdspider	the filepath to the folder where pgdSpider installation files are stored.

Diagnostics

genepop_detective.R

genepop_allelefreq.R

genepop_filter_maf.R

genepop_toploci.R

Manipulation

subset_genepop.R

subset_genepop_rename.R

subset_genepop_aggregate.R

subset_genepop_individuals.R

genepop_ID.R

genepop_reorder.R

Simulation

alleleotype_genepop.R

genepop_sample.R

Transformation

genepop_structure.R

genepop_fstat.R

genepop_newhybrids.R

genepop_assigner.R

genepop_colony.R

genepop_bgc.R

genepop_treemix.R

genepop_GSIsim.R

genepop_hzar.R

genepop_flatten.R

genepop_unflatten.R

Conversion using PGDspider

PGDspideR.R

How to use genepopedit

Preparation

genepop_ID

Diagnostics

genepop_detective

genepop_allelefreq

genepop_filter_maf

genepop_toploci

Manipulation

subset_genepop

subset_genepop_rename

subset_genepop_aggregate

subset_genepop_individual

genepop_reorder

Sampling

genepop_sample

Conversion

genepop_structure

genepop_fstat

genepop_newhybrids

genepop_assigner

genepop_colony

genepop_bgc

genepop_treemix

genepop_GSIsim

genepop_hzar

genepop_flatten

genepop_unflatten

Conversion using PGDspider

PGDspideR

alleleotype_genepop