PolyOrigin Documentation
Documentation for PolyOrigin.jl
's public interface.
See the internals section for internal function docs.
Contents
Index
PolyOrigin
PolyOrigin.PolyAncestry
PolyOrigin.PolyGeno
PolyOrigin.animCondprob
PolyOrigin.calAccuracy!
PolyOrigin.plotCondprob
PolyOrigin.plotMapComp
PolyOrigin.polyOrigin
PolyOrigin.polyOrigin!
PolyOrigin.readPolyAncestry
PolyOrigin.readPolyGeno
PolyOrigin.readTruegeno!
PolyOrigin.savePolyAncestry
Public Interface
PolyOrigin
— ModulePolyOrigin is a package for haplotype reconstruction in connected polyploid F1 populations. See also: polyOrigin
, polyOrigin!
.
PolyOrigin.PolyGeno
— TypePolyGeno
mutable struct that stores genotypic data and pedigree information
PolyGeno(markermap,parentgeno,offspringgeno,parentinfo,offspringinfo,designinfo,delmarker,correction)
constructor from genotypic data and pedigree information.
Fields
markermap::Vector{DataFrame}
: marker map for each chromosome. markermap[c]
gives the dataframe of chromosome c with columns [:marker, :chromosome, :position].
parentgeno::Vector{Matrix}
: genotypic data in parents. parentgeno[c][m,i]
gives the genotype of parent i at marker m of chromosome c.
offspringgeno::Vector{Matrix}
: genotypic data in offspring. offspringgeno[c][m,i]
gives the genotype of offspring i at marker m of chomosome c.
parentinfo::DataFrame
: parent information with columns [:individual, :ploidy].
offspringinfo::DataFrame
: offspring information with columns [:individual,:population,:isoutlier].
designinfo::DataFrame
: design information with columns [:population, :parent1, :parent2, ...]. The dataframe element n_ij
denotes the nubmer of gametes contributed to each offspring in i-th population by j-th parent. n_ij
=2 means that i-th population is produced by self-fertilization of j-th parent. Row sum must be 2.
delmarker::DataFrame
: dataframe for collecting markers that were removed from markermap, in the stages of parental phasing or map refinement. It has columns [:marker, :chromosome, :position].
correction::DataFrame
: dataframe for collecting parental genotype corrections. It has columns [:round,:marker,:chromosome,:parent,:oldgenotype,:newgenotype,:oldnerr,:newnerr].
A genotype in offspringgeno
must be one of following formats 1-3, and a genotype in parentgeno
must be one of following formats 1-4:
dosage
: ranges from 0, 1, ..., ploidy, and missing for missing dosage;readcount
: [c1, c2] where c1 and c2 are the number of reads for alleles 1 and 2, respectively. Missing genotypes are denoted by [0,0].probability
: [p0, p1, ...], where p_i denotes the probability of observed data given dosage i = 0, ..., ploidy, and the probabilities are normalized so that their sum is 1.phasedgeno
: [g1, g2,...], where g_i=1 or 2 for i=1,..., ploidy.
PolyOrigin.PolyAncestry
— TypePolyAncestry
mutable struct the results of haplotype reconstruction.
PolyAncestry(markermap,parentgeno,parentinfo,offspringinfo,designinfo,
statespace,valentprob,genoprob)
Constructor from the results of haplotype reconstruction.
Fields
markermap::Vector{DataFrame}
: marker map for each chromosome. markermap[c]
gives the dataframe of chromosome c with columns [:marker, :chromosome, :position].
parentgeno::Vector{Matrix}
: genotypic data in parents. parentgeno[c][m,i]
gives the genotype of parent i at marker m of chromosome c.
parentinfo::DataFrame
: parent information with columns [:individual, :ploidy].
offspringinfo::DataFrame
: offspring information with columns [:individual,:population].
designinfo::DataFrame
: design information with columns [:population, :parent1, :parent2, ...]. The dataframe element n_ij
denotes the nubmer of gametes contributed to each offspring in i-th population by j-th parent. n_ij
=2 means that i-th population is produced by self-fertilization of j-th parent. Row sum must be 2.
delmarker::DataFrame
: dataframe for collecting markers that were removed from markermap, in the stages of parental phasing or map refinement. It has columns [:marker, :chromosome, :position].
correction::DataFrame
: dataframe for collecting parental genotype corrections. It has columns [:marker,:chromosome,:parent,:oldgenotype,:newgenotype,:oldnerr,:newnerr].
statespace::Dict
: specify valents and origin-genotypes for each population. Each key is a population id, and its value is in turn a dict with keys: "parent", "parentindex", "valent", "valentneighbor", and "groupstate".
valentprob::Union{Nothing,Vector}
, posterior valent probability for each offspring in each chromosome. valentprob[c][o] for offspring o in chromsome c is a matrix with three columns being valentindex, loglike, and posterior probability. Here posterior probability is calcuated from loglike, assuming a discrete uniform prior distribution of valent configurations.
genoprob::Vector
: posterior origin-genotype probability for each offspring in each chromosome. genoprob[c][o] for offspring o in chromsome c is a sparse matrix, with element (m,s) being the probability of origin-genotype s at marker m.
haploprob::Union{Nothing,Vector}
: posterior origin-haplotype probability for each offspring in each chromosome. haploprob[c][o] for offspring o in chromsome c is a sparse matrix, with element (m,s) being the probability of origin-haplotype s at marker m.
PolyOrigin.polyOrigin
— FunctionpolyOrigin(genofile, pedfile, delimchar=',', missingstring="NA",
commentstring="#", keyargs...)
performs parental phasing and ancestral inference from input files and return polyancestry::PolyAncestry. Only ancestral inference is performed in the case of phased parents.
Positional arguments
genofile::AbstractString
: filename for genotypic data file.
pedfile::AbstractString
: filename for pedigree info.
see readPolyGeno
for the requirements of genofile
and pedfile
.
Keyword arguments
delimchar::AbstractChar=','
: text delimiter.
missingstring::AbstractString="NA"
: string code for missing value.
commentstring::AbstractString="#"
: rows that begin with commentstring will be ignored.
isphysmap::Bool=false
: if true, input markermap is physical map, where marker locations are in unit of base pair(bp).
recomrate::Real=1
: recombination rate in unit of 1 cM/Mbp (centiMorgan per million base pair). Valid only if isphysmap=true
.
see keyargs in polyOrigin!(polygeno::PolyGeno, keyargs...)
Output files
outstem.log
: log file saves messages that were printed on console.
outstem_maprefined.csv
: same as the input genofile, except that input marker map is refined.
outstem_parentphased.csv
: same as the input genofile, except that parental genotypes are phased.
outstem_parentphased_corrected.csv
: parented genotypes are further corrected.
outstem_polyancestry.csv
: saves the returned polyancestry. See savePolyAncestry
.
outstem_genoprob.csv
: a concise version of the above file, including genetic map, phased parental genotypes, and posterior origin-genotype probabilities. See savegenoprob
.
outstem_postdoseprob.csv
: same as the input genofile, except that parent genotypes are phased and offspring genotypes are given by the posterior dose probabilities.
Example
julia> polyOrigin(genofile,pedfile)
PolyOrigin.polyOrigin!
— FunctionpolyOrigin!(polygeno::PolyGeno, keyargs...)
performs parental phasing and ancestral inference from polygeno and return polyancestry::PolyAncestry. Only ancestral inference is performed in the case of phased parents. See readPolyGeno
for creating polygeno from inputfiles.
Positional arguments
polygeno::PolyGeno
: a struct storing genotypic data and pedigree info.
Keyword arguments
doseerr::Real=0.01
: genotyping error probability.
seqerr::Real=0.001
: sequencing read error probability for GBS data.
chrpairing_phase::Integer=22
: chromosome pairing in parental phasing, with 22 being only bivalent formations and 44 being bivalent and quadrivalent formations.
chrpairing::Integer=44
: chromosome pairing in offspring decoding, with 22 being only bivalent formations and 44 being bivalent and quadrivalent formations.
chrsubset::Union{Nothing,AbstractRange,AbstractVector}=nothing
: subset of chromosome, with nothing denoting all chromosomes. Delete chromosome indices that are out of range.
snpsubset::Union{Nothing,AbstractRange,AbstractVector}=nothing
: subset of markers to be considered, with nothing denoting all markers. within a chromosome, marker index starts from 1, and marker indices that are larger than the number of markers within the chromosome are deleted.
isparallel::Bool=true
: if true, multicore computing over chromosomes.
isparalleloffspring::Bool=true
: if true, multicore computing over offspring in ancestral inference.
delmarker::Bool=true
: if true, delete markers during parental phasing.
delsiglevel::Real=0.05
: significance level for deleting markers.
maxstuck::Integer=5
: the max number of consecutive iterations that are rejected in a phasing run.
maxiter::Integer=30
: the max number of iterations in a phasing run.
minrun::Integer=3
: if the number of phasing runs having the same parental phases reaches minrun, phasing algorithm will stop before reaching the maxrun.
maxrun::Integer=10
: the max number of phasing runs.
byparent::Union{Nothing,Bool}=nothing
: if true, update parental phases parent by parent; if false, update parental phases one subpopulation by subpopulation. The nothing denotes that it is true if a connected component is a single F1 cross, and false otherwise.
byneighbor::Union{Nothing,Bool}=nothing
: if ture, udpate the combination of bivalent or multivalents in parents by their neighbors; if false, consider all the possible combinations. The nothing denotes that it is true if max ploidy>=6, and false otherwise.
refhapfile::Union{Nothing,AbstractString} = nothing
: reference haplotype file for setting absolute parental phases. It has the same format as the input genofile, except that parental genotypes are phased and offspring genotypes are ignored if exist.
correctthreshold::AbstractFloat=0.15
: a candidate marker is selected for parental error correction if the fraction of offspring genotypic error >= correctthreshold.
refinemap::Bool=false
: if true, refine genetic map.
refineorder::Bool=false
: if true, refine marker mordering.
maxwinsize::Integer=50
: max size of sliding windown in map refinning.
inittemperature::Real=4
: initial temperature of simulated annealing in map refinning.
coolingrate::Real=0.5
: cooling rate of annealing temperature in map refinning.
stripdis::Real=20
: a chromosome end in map refinement is removed if it has a distance gap > stripdis (centiMorgan) and it contains less than 5% markers.
maxdoseerr::Real=0.5
: markers in map refinement are removed it they have error rates > maxdoseerr.
skeletonsize::Integer=50
: the number of markers in the skeleton map that is used to re-scale inter-map distances.
isinfererror::Bool=false
: if true, infer dosage error rate per marker during hapotype reconstruction.
missingstring::AbstractString="NA"
: string code for missing value.
isplot::Bool=false
: if true, plot condprob for all offspring and save in the folder "outstem_plots".
outstem::Union{Nothing,AbstractString}="outstem"
: stem of output filenames. If nothing, no output files.
logfile::Union{Nothing,AbstractString,IO}= (isnothing(outstem) ? nothing : string(outstem,".log"))
: log file or IO for writing log. If nothing, no log file.
workdir::AbstractString = pwd()
: directory for reading and writing files.
verbose::Bool=true
: if true, print messages on console.
Example
julia> polygeno = readPolyGeno(genofile,pedfile)
julia> polyOrigin!(polygeno)
PolyOrigin.readPolyGeno
— FunctionreadPolyGeno(genofile, pedfile, keyargs...)
reads input files and returns polygeno::PolyGeno.
Positional arguments
genofile::AbstractString
: filename for genotypic data. For example, a CSV formatted genofile looks like
marker, chromosome, pos, ind1,ind2, ind3, ...
snp1, 1, 0.14, 0, 2, 1, 4, ...
snp2, 1, 0.16, 4, 0, NA, 2, ...
snp3, 1, 0.21, NA, 3, 0, 1, ...
The first three columns specify the genetic map. Marker IDs in column 1 must be unique, chromosome IDs in column 2 must be consecutive, and positions (in unit of centi-Morgan or base pair) of markers in column 3 must be non-descreasing within a chromosome.
The rest of columns give the genotypes of sampled individuals. The indvidual IDs must be unique. The genotypes of all parents must be represented by one of the following formats 1-4, and the genotypes of all offspring must be represented by one of the following formats 1-3.
dosage
: ranges from 0, 1, ..., ploidy, and NA for missing dosage;readcount
: c1|c2, where c1 and c2 are the number of reads for alleles 1 and 2, respectively. Missing genotypesare given by 0|0probability
: p(0)|p(1)|...|p(ploidy), where p(i) denotes the probability of observed data given dosage i = 0, ..., ploidy, and the probabilities are normalized so that their sum is 1.phasedgeno
: g(1)|g(2)|...|g(ploidy), where g(i)=1 or 2 for i=1,..., ploidy.
All individuals must be in the
pedfile
.
pedfile::AbstractString
: filename for pedigree information. For example, a CSV formatted pedfile looks like
individual, population, motherid, fatherid, ploidy
P1, 0, 0, 0, 4
P3, 0, 0, 0, 4
P3, 0, 0, 0, 4
offspring1, pop1, P1, P2, 4
offspring2, pop1, P1, P2, 4
offspring3, pop2, P1, P3, 4
offspring4, pop2, P1, P3, 4
offspring5, pop3, P2, P3, 4
offspring6, pop4, P3, P3, 4
The pedigree contains three founders (parents), two offspring from the cross beween parents 1 and 2, two offspring from the cross between parents 1 and 3, one offspring from the cross between parents 2 and 3, and one offspring from the selfing of parent 3.
All individual IDs in column 1 must be unique, column 2 denotes ID for the founder population and IDs for each F1 cross or selfing, columns 3 and 4 denotes the parents of each sub-population (motherID and fatherID of founders are set to 0), and column 5 denotes the ploidy level.
All parents and all offspring must be in the
genofile
.
Keyword arguments
missingstring::AbstractString="NA"
: string code for missing value.
delimchar::AbstractChar=','
: text delimiter.
commentstring::AbstractString="#"
: rows that begins with commentstring will be ignored.
isphysmap::Bool=false
: true if input markermap is physical map, where marker locations are in unit of base pair(bp).
recomrate::Real=1
recombination rate in unit of 1 cM/Mbp (centiMorgan per million base pair).
workdir::AbstractString = pwd()
: directory for reading genofile and pedfile
PolyOrigin.readPolyAncestry
— FunctionreadPolyAncestry(genoprobfile, pedfile, missingstring="NA",workdir=pwd())
return a struct of with type PolyAncestry from genoprobfile and pedfile in the directory workdir.
Positional argument
genoprobfile
: file storing conditional genotype probability.
pedfile::AbstractString
: filename for pedigree information.
Keyword arguments
chrpairing::Integer=44
: chromosome pairing that has been used in producing genoprobfile. chrpairing = 22 indicates only bivalent formations and 44 for bivalent and quadrivalent formations.
missingstring::AbstractString="NA"
: string code for missing value.
delimchar::AbstractChar=','
: text delimiter.
commentstring::AbstractString="#"
: rows that begins with commentstring will be ignored.
workdir::AbstractString = pwd()
: directory for reading genoprobfile and pedfile
readPolyAncestry(ancestryfile, missingstring="NA",workdir=pwd())
return a struct of type PolyAncestry from ancestryfile in the directory workdir.
Positional argument
ancestryfile
: file storing polyancestry that is generated by savePolyAncestry
It is one of output files by polyOrigin
.
Keyward arguments
missingstring::AbstractString="NA"
: string code for missing value.
workdir::AbstractString = pwd()
: directory for reading ancestryfile,
PolyOrigin.savePolyAncestry
— FunctionsavePolyAncestry(outfile,polyancestry,missingstring="NA",workdir=pwd())
saves polyancestry into outfile in CSV format.
Positional arguments
outfile::AbstractString
: file for saving polyancestry in the directory workdir. The saved outfile consists of several tables: the first row of each table has two cells: [PolyOrigin-PolyAncestry, nameoftable], and the rest rows denote a dataframe with column names. The following is a list of table names and their descriptions.
designinfo
: design information with columns being [population, parent1, parent2, ...]. Matrix element Dij denotes the nubmer of gametes contributed to each offspring in i-th population by j-th parent. Dij=2 means that i-th population is produced by self-fertilization of j-th parent.parentinfo
: parent information with columns being [individual, ploidy].offspringinfo
: offspring information with columns being [individual, population, ploidy, isoutlier].delmarker
: markers that were removed from markermap, in the stages of parental phasing
or map refinement, with columns being [:marker, :chromosome, :position]
correction
: parental genotype corrections with columns being
[:round, :marker,:chromosome,:parent,:oldgenotype,:newgenotype,:oldnerr,:newnerr].
valentlist
: list of possible bi- or multi-valent formations for each population. The dataframe has columns [population, parentindex, parent, valentindex, valent].valentprob
: posterior valent probability for each offspring in each chromosome. The dataframe has columns [chromosome, individual, population, valentindex, valent, loglike, valentprob].parentgeno
: phased parental genotypes. The dataframe has columns [marker, chromosome, position, parent1, parent2, ...].ancestralgenotype
: list of origin-genotypes for each population. The dataframe has columns [population, parentindex, parent, stateindex, state]genoprob
: marginal posterior probabilities for origin-genotypes. For each offspring at each marker, the posterior probability vector is represented by a sparse vector in the form of I=>V: state index vector I and non-zero probability vector V such that I[k]=V[k] for k=1, ..., K. The elements of vector I or V are delimited by "|".
polyancestry::PolyAncestry
: results of haplotype reconstruction returned from polyOrigin
.
Keyward arguments
missingstring::AbstractString="NA"
: string code for missing value.
workdir::AbstractString = pwd()
: directory for writing outfile.
PolyOrigin.plotMapComp
— FunctionplotMapComp(mapx, mapy;
boundaryline = (1.5,:dot,:black),
plotmarker=(:star, 1, 0.5,:blue,stroke(:blue)),
isannotate= true,
maplabels=["mapx position", "mapy position"])
plot postions of mapx vs those of mapx.
Positional arguments
mapx::Vector{DataFrame}
: marker map for all chromosomes.
mapy::Vector{DataFrame}
: comparing map for all chromosomes.
Keyword arguments
boundaryline=(1.5,:dot,:black)
: vertical lines for chromosome boundaries.
isannotate::Bool=true
: if ture, annotate chromosome ID and kendall correlation.
mmaplabels::Union{Nothing,AbstractString}=nothing
: labels of comparing marker maps.
PolyOrigin.plotCondprob
— FunctionplotCondprob(polyancestry,offspring=nothing,ishaploprob=true,
colorgradient = ColorGradient([:white,:blue,:red]),
boundaryline = (1.5,:dot,:black),
truemarker=(:star, 5, 0.5,:gray,stroke(:gray)),
truegeno=nothing)
plot heatmap for conditional probability.
Positional arguments
polyancestry::PolyAncestry
: polyancestry returned from polyOrigin
.
Keyword arguments
offspring::Union{Nothing,Integer}=nothing
: offsprign index and random offspring by default.
colorgradient::ColorGradient=ColorGradient([:white,:blue,:red])
: color gradient for heatmap
left_margin = :match
: Specifies the extra padding to the left of the subplot
bottom_margin = :match
: Specifies the extra padding to the bottom of the subplot
boundaryline=(1.5,:dot,:black)
: vertical lines for chromosome boundaries.
truemarker=(:star, 5, 0.5,:gray,stroke(:gray))
: scatter markers for true ancestral states.
truegeno::Union{Nothing,NamedTuple}
: provides true parental origins. It is generated by readTruegeno!
.
PolyOrigin.animCondprob
— FunctionanimCondprob(polyancestry,fps=1,outfile=nothing,kewargs...)
animation for plots of conditional probability.
Positional arguments
polyancestry::PolyAncestry
: polyancestry returned from polyOrigin
.
Keyword arguments
fps::Real=1
: number of frames per seconds.
outfile::AbstractString="condprob.gif"
: output file for saving animation.
see plotCondprob
for keyargs.
PolyOrigin.readTruegeno!
— FunctionreadTruegeno!(truefile,polyancestry, keyargs...)
returned truegeno of NamedTuple with truegeno.parentgeno storing phased parental genotypes, and truegeno.offspringgeno storing true parental origins. polyancestry
is modified (1) extra markers in truefile are put into polyancestry.delmarker, (2) absolute phase of polyancestry.parentgeno is set according to truegeno.parentgeno, and (3) polyancestry.genoprob and polyancestry.haploprob are set to be consistent with the absolute phase.
Positional arguments
truefile::AbstractString
: file storing true values, which has the same requirement as the genofile
. The parental genotypes must be given in the phasedgeno
format with alleles being "1" and "2". The offspring origin-genotypes must be given in the phasedgeno
format with allels being parental homologs.
polyancestry::PolyAncestry
: for checking the consistency of marker IDs, chromosome IDs, and individual IDs.
Keyword arguments
delimchar::AbstractChar=','
: text delimiter.
commentstring::AbstractString="#"
: rows that begins with commentstring will be ignored.
workdir::AbstractString = pwd()
: directory for reading truevalue file.
PolyOrigin.calAccuracy!
— FunctioncalAccuracy!(truefile,polyancestry,workdir=pwd(),io=nothing,verbose=true)
calculate phasing and ancestral inference accuracies accoring to the true values in truefile
. The polyancestry
is modified with absolute parental phases and consistent genoprob.
Positional arguments
truefile::AbstractString
: contains true values for parental phases and ancestral genotypes.
polyancestry::PolyAncestry
: a struct returned by polyOrigin
.
Keyward arguments
workdir::AbstractString = pwd()
: directory for writing outfile.
io::Union{Nothing,IOStream}
: stream for writing log.
verbose::Bool=true
: true if print messages on console.
calAccuracy!(truegeno,polyancestry,io=nothing,verbose=true)
calculate phasing and ancestral inference accuracies accoring to the true values in truegeno
. The polyancestry
is modified with absolute parental phases and consistent genoprob.
Positional arguments
truegeno::NamedTuple
: contains true values for parental phases and ancestral genotypes, returned by readTruegeno!
.
polyancestry::PolyAncestry
: a struct returned by polyOrigin
.
Keyward arguments
io::Union{Nothing,IOStream}
: stream for writing log.
verbose::Bool=true
: true if print messages on console.