PolyOrigin Documentation

Documentation for PolyOrigin.jl's public interface.

See the internals section for internal function docs.

Contents

Index

Public Interface

PolyOrigin.PolyGenoType
PolyGeno

mutable struct that stores genotypic data and pedigree information

PolyGeno(markermap,parentgeno,offspringgeno,parentinfo,offspringinfo,designinfo,delmarker,correction)

constructor from genotypic data and pedigree information.

Fields

markermap::Vector{DataFrame}: marker map for each chromosome. markermap[c] gives the dataframe of chromosome c with columns [:marker, :chromosome, :position].

parentgeno::Vector{Matrix}: genotypic data in parents. parentgeno[c][m,i] gives the genotype of parent i at marker m of chromosome c.

offspringgeno::Vector{Matrix}: genotypic data in offspring. offspringgeno[c][m,i] gives the genotype of offspring i at marker m of chomosome c.

parentinfo::DataFrame: parent information with columns [:individual, :ploidy].

offspringinfo::DataFrame: offspring information with columns [:individual,:population,:isoutlier].

designinfo::DataFrame: design information with columns [:population, :parent1, :parent2, ...]. The dataframe element n_ij denotes the nubmer of gametes contributed to each offspring in i-th population by j-th parent. n_ij=2 means that i-th population is produced by self-fertilization of j-th parent. Row sum must be 2.

delmarker::DataFrame: dataframe for collecting markers that were removed from markermap, in the stages of parental phasing or map refinement. It has columns [:marker, :chromosome, :position].

correction::DataFrame: dataframe for collecting parental genotype corrections. It has columns [:round,:marker,:chromosome,:parent,:oldgenotype,:newgenotype,:oldnerr,:newnerr].

Note

A genotype in offspringgeno must be one of following formats 1-3, and a genotype in parentgeno must be one of following formats 1-4:

  1. dosage: ranges from 0, 1, ..., ploidy, and missing for missing dosage;
  2. readcount: [c1, c2] where c1 and c2 are the number of reads for alleles 1 and 2, respectively. Missing genotypes are denoted by [0,0].
  3. probability: [p0, p1, ...], where p_i denotes the probability of observed data given dosage i = 0, ..., ploidy, and the probabilities are normalized so that their sum is 1.
  4. phasedgeno: [g1, g2,...], where g_i=1 or 2 for i=1,..., ploidy.
source
PolyOrigin.PolyAncestryType
PolyAncestry

mutable struct the results of haplotype reconstruction.

PolyAncestry(markermap,parentgeno,parentinfo,offspringinfo,designinfo,
    statespace,valentprob,genoprob)

Constructor from the results of haplotype reconstruction.

Fields

markermap::Vector{DataFrame}: marker map for each chromosome. markermap[c] gives the dataframe of chromosome c with columns [:marker, :chromosome, :position].

parentgeno::Vector{Matrix}: genotypic data in parents. parentgeno[c][m,i] gives the genotype of parent i at marker m of chromosome c.

parentinfo::DataFrame: parent information with columns [:individual, :ploidy].

offspringinfo::DataFrame: offspring information with columns [:individual,:population].

designinfo::DataFrame: design information with columns [:population, :parent1, :parent2, ...]. The dataframe element n_ij denotes the nubmer of gametes contributed to each offspring in i-th population by j-th parent. n_ij=2 means that i-th population is produced by self-fertilization of j-th parent. Row sum must be 2.

delmarker::DataFrame: dataframe for collecting markers that were removed from markermap, in the stages of parental phasing or map refinement. It has columns [:marker, :chromosome, :position].

correction::DataFrame: dataframe for collecting parental genotype corrections. It has columns [:marker,:chromosome,:parent,:oldgenotype,:newgenotype,:oldnerr,:newnerr].

statespace::Dict: specify valents and origin-genotypes for each population. Each key is a population id, and its value is in turn a dict with keys: "parent", "parentindex", "valent", "valentneighbor", and "groupstate".

valentprob::Union{Nothing,Vector}, posterior valent probability for each offspring in each chromosome. valentprob[c][o] for offspring o in chromsome c is a matrix with three columns being valentindex, loglike, and posterior probability. Here posterior probability is calcuated from loglike, assuming a discrete uniform prior distribution of valent configurations.

genoprob::Vector: posterior origin-genotype probability for each offspring in each chromosome. genoprob[c][o] for offspring o in chromsome c is a sparse matrix, with element (m,s) being the probability of origin-genotype s at marker m.

haploprob::Union{Nothing,Vector}: posterior origin-haplotype probability for each offspring in each chromosome. haploprob[c][o] for offspring o in chromsome c is a sparse matrix, with element (m,s) being the probability of origin-haplotype s at marker m.

source
PolyOrigin.polyOriginFunction
polyOrigin(genofile, pedfile, delimchar=',', missingstring="NA",
    commentstring="#", keyargs...)

performs parental phasing and ancestral inference from input files and return polyancestry::PolyAncestry. Only ancestral inference is performed in the case of phased parents.

Positional arguments

genofile::AbstractString: filename for genotypic data file.

pedfile::AbstractString: filename for pedigree info.

see readPolyGeno for the requirements of genofile and pedfile.

Keyword arguments

delimchar::AbstractChar=',': text delimiter.

missingstring::AbstractString="NA": string code for missing value.

commentstring::AbstractString="#": rows that begin with commentstring will be ignored.

isphysmap::Bool=false: if true, input markermap is physical map, where marker locations are in unit of base pair(bp).

recomrate::Real=1: recombination rate in unit of 1 cM/Mbp (centiMorgan per million base pair). Valid only if isphysmap=true.

see keyargs in polyOrigin!(polygeno::PolyGeno, keyargs...)

Output files

outstem.log: log file saves messages that were printed on console.

outstem_maprefined.csv: same as the input genofile, except that input marker map is refined.

outstem_parentphased.csv: same as the input genofile, except that parental genotypes are phased.

outstem_parentphased_corrected.csv: parented genotypes are further corrected.

outstem_polyancestry.csv: saves the returned polyancestry. See savePolyAncestry.

outstem_genoprob.csv: a concise version of the above file, including genetic map, phased parental genotypes, and posterior origin-genotype probabilities. See savegenoprob.

outstem_postdoseprob.csv: same as the input genofile, except that parent genotypes are phased and offspring genotypes are given by the posterior dose probabilities.

Example

julia> polyOrigin(genofile,pedfile)
source
PolyOrigin.polyOrigin!Function
polyOrigin!(polygeno::PolyGeno, keyargs...)

performs parental phasing and ancestral inference from polygeno and return polyancestry::PolyAncestry. Only ancestral inference is performed in the case of phased parents. See readPolyGeno for creating polygeno from inputfiles.

Positional arguments

polygeno::PolyGeno: a struct storing genotypic data and pedigree info.

Keyword arguments

doseerr::Real=0.01: genotyping error probability.

seqerr::Real=0.001: sequencing read error probability for GBS data.

chrpairing_phase::Integer=22: chromosome pairing in parental phasing, with 22 being only bivalent formations and 44 being bivalent and quadrivalent formations.

chrpairing::Integer=44: chromosome pairing in offspring decoding, with 22 being only bivalent formations and 44 being bivalent and quadrivalent formations.

chrsubset::Union{Nothing,AbstractRange,AbstractVector}=nothing: subset of chromosome, with nothing denoting all chromosomes. Delete chromosome indices that are out of range.

snpsubset::Union{Nothing,AbstractRange,AbstractVector}=nothing: subset of markers to be considered, with nothing denoting all markers. within a chromosome, marker index starts from 1, and marker indices that are larger than the number of markers within the chromosome are deleted.

isparallel::Bool=true: if true, multicore computing over chromosomes.

isparalleloffspring::Bool=true: if true, multicore computing over offspring in ancestral inference.

delmarker::Bool=true: if true, delete markers during parental phasing.

delsiglevel::Real=0.05: significance level for deleting markers.

maxstuck::Integer=5: the max number of consecutive iterations that are rejected in a phasing run.

maxiter::Integer=30: the max number of iterations in a phasing run.

minrun::Integer=3: if the number of phasing runs having the same parental phases reaches minrun, phasing algorithm will stop before reaching the maxrun.

maxrun::Integer=10: the max number of phasing runs.

byparent::Union{Nothing,Bool}=nothing: if true, update parental phases parent by parent; if false, update parental phases one subpopulation by subpopulation. The nothing denotes that it is true if a connected component is a single F1 cross, and false otherwise.

byneighbor::Union{Nothing,Bool}=nothing: if ture, udpate the combination of bivalent or multivalents in parents by their neighbors; if false, consider all the possible combinations. The nothing denotes that it is true if max ploidy>=6, and false otherwise.

refhapfile::Union{Nothing,AbstractString} = nothing: reference haplotype file for setting absolute parental phases. It has the same format as the input genofile, except that parental genotypes are phased and offspring genotypes are ignored if exist.

correctthreshold::AbstractFloat=0.15: a candidate marker is selected for parental error correction if the fraction of offspring genotypic error >= correctthreshold.

refinemap::Bool=false: if true, refine genetic map.

refineorder::Bool=false: if true, refine marker mordering.

maxwinsize::Integer=50: max size of sliding windown in map refinning.

inittemperature::Real=4: initial temperature of simulated annealing in map refinning.

coolingrate::Real=0.5: cooling rate of annealing temperature in map refinning.

stripdis::Real=20: a chromosome end in map refinement is removed if it has a distance gap > stripdis (centiMorgan) and it contains less than 5% markers.

maxdoseerr::Real=0.5: markers in map refinement are removed it they have error rates > maxdoseerr.

skeletonsize::Integer=50: the number of markers in the skeleton map that is used to re-scale inter-map distances.

isinfererror::Bool=false: if true, infer dosage error rate per marker during hapotype reconstruction.

missingstring::AbstractString="NA": string code for missing value.

isplot::Bool=false: if true, plot condprob for all offspring and save in the folder "outstem_plots".

outstem::Union{Nothing,AbstractString}="outstem": stem of output filenames. If nothing, no output files.

logfile::Union{Nothing,AbstractString,IO}= (isnothing(outstem) ? nothing : string(outstem,".log")): log file or IO for writing log. If nothing, no log file.

workdir::AbstractString = pwd(): directory for reading and writing files.

verbose::Bool=true: if true, print messages on console.

Example

julia> polygeno = readPolyGeno(genofile,pedfile)
julia> polyOrigin!(polygeno)
source
PolyOrigin.readPolyGenoFunction
readPolyGeno(genofile, pedfile, keyargs...)

reads input files and returns polygeno::PolyGeno.

Positional arguments

genofile::AbstractString: filename for genotypic data. For example, a CSV formatted genofile looks like

    marker, chromosome, pos, ind1,ind2, ind3, ...
    snp1, 1, 0.14, 0, 2, 1, 4, ...
    snp2, 1, 0.16, 4, 0, NA, 2, ...
    snp3, 1, 0.21, NA, 3, 0, 1, ...
Note
  • The first three columns specify the genetic map. Marker IDs in column 1 must be unique, chromosome IDs in column 2 must be consecutive, and positions (in unit of centi-Morgan or base pair) of markers in column 3 must be non-descreasing within a chromosome.

  • The rest of columns give the genotypes of sampled individuals. The indvidual IDs must be unique. The genotypes of all parents must be represented by one of the following formats 1-4, and the genotypes of all offspring must be represented by one of the following formats 1-3.

    1. dosage: ranges from 0, 1, ..., ploidy, and NA for missing dosage;
    2. readcount: c1|c2, where c1 and c2 are the number of reads for alleles 1 and 2, respectively. Missing genotypesare given by 0|0
    3. probability: p(0)|p(1)|...|p(ploidy), where p(i) denotes the probability of observed data given dosage i = 0, ..., ploidy, and the probabilities are normalized so that their sum is 1.
    4. phasedgeno: g(1)|g(2)|...|g(ploidy), where g(i)=1 or 2 for i=1,..., ploidy.
  • All individuals must be in the pedfile.

pedfile::AbstractString: filename for pedigree information. For example, a CSV formatted pedfile looks like

    individual, population, motherid, fatherid, ploidy
    P1, 0, 0, 0, 4
    P3, 0, 0, 0, 4
    P3, 0, 0, 0, 4
    offspring1, pop1, P1, P2, 4
    offspring2, pop1, P1, P2, 4
    offspring3, pop2, P1, P3, 4
    offspring4, pop2, P1, P3, 4
    offspring5, pop3, P2, P3, 4
    offspring6, pop4, P3, P3, 4
Note
  • The pedigree contains three founders (parents), two offspring from the cross beween parents 1 and 2, two offspring from the cross between parents 1 and 3, one offspring from the cross between parents 2 and 3, and one offspring from the selfing of parent 3.

  • All individual IDs in column 1 must be unique, column 2 denotes ID for the founder population and IDs for each F1 cross or selfing, columns 3 and 4 denotes the parents of each sub-population (motherID and fatherID of founders are set to 0), and column 5 denotes the ploidy level.

  • All parents and all offspring must be in the genofile.

Keyword arguments

missingstring::AbstractString="NA": string code for missing value.

delimchar::AbstractChar=',': text delimiter.

commentstring::AbstractString="#": rows that begins with commentstring will be ignored.

isphysmap::Bool=false: true if input markermap is physical map, where marker locations are in unit of base pair(bp).

recomrate::Real=1 recombination rate in unit of 1 cM/Mbp (centiMorgan per million base pair).

workdir::AbstractString = pwd(): directory for reading genofile and pedfile

source
PolyOrigin.readPolyAncestryFunction
readPolyAncestry(genoprobfile, pedfile, missingstring="NA",workdir=pwd())

return a struct of with type PolyAncestry from genoprobfile and pedfile in the directory workdir.

Positional argument

genoprobfile: file storing conditional genotype probability.

pedfile::AbstractString: filename for pedigree information.

Keyword arguments

chrpairing::Integer=44: chromosome pairing that has been used in producing genoprobfile. chrpairing = 22 indicates only bivalent formations and 44 for bivalent and quadrivalent formations.

missingstring::AbstractString="NA": string code for missing value.

delimchar::AbstractChar=',': text delimiter.

commentstring::AbstractString="#": rows that begins with commentstring will be ignored.

workdir::AbstractString = pwd(): directory for reading genoprobfile and pedfile

source
readPolyAncestry(ancestryfile, missingstring="NA",workdir=pwd())

return a struct of type PolyAncestry from ancestryfile in the directory workdir.

Positional argument

ancestryfile: file storing polyancestry that is generated by savePolyAncestry It is one of output files by polyOrigin.

Keyward arguments

missingstring::AbstractString="NA": string code for missing value.

workdir::AbstractString = pwd(): directory for reading ancestryfile,

source
PolyOrigin.savePolyAncestryFunction
savePolyAncestry(outfile,polyancestry,missingstring="NA",workdir=pwd())

saves polyancestry into outfile in CSV format.

Positional arguments

outfile::AbstractString: file for saving polyancestry in the directory workdir. The saved outfile consists of several tables: the first row of each table has two cells: [PolyOrigin-PolyAncestry, nameoftable], and the rest rows denote a dataframe with column names. The following is a list of table names and their descriptions.

  1. designinfo: design information with columns being [population, parent1, parent2, ...]. Matrix element Dij denotes the nubmer of gametes contributed to each offspring in i-th population by j-th parent. Dij=2 means that i-th population is produced by self-fertilization of j-th parent.

  2. parentinfo: parent information with columns being [individual, ploidy].

  3. offspringinfo: offspring information with columns being [individual, population, ploidy, isoutlier].

  4. delmarker: markers that were removed from markermap, in the stages of parental phasing

or map refinement, with columns being [:marker, :chromosome, :position]

  1. correction: parental genotype corrections with columns being

[:round, :marker,:chromosome,:parent,:oldgenotype,:newgenotype,:oldnerr,:newnerr].

  1. valentlist: list of possible bi- or multi-valent formations for each population. The dataframe has columns [population, parentindex, parent, valentindex, valent].

  2. valentprob: posterior valent probability for each offspring in each chromosome. The dataframe has columns [chromosome, individual, population, valentindex, valent, loglike, valentprob].

  3. parentgeno: phased parental genotypes. The dataframe has columns [marker, chromosome, position, parent1, parent2, ...].

  4. ancestralgenotype: list of origin-genotypes for each population. The dataframe has columns [population, parentindex, parent, stateindex, state]

  5. genoprob: marginal posterior probabilities for origin-genotypes. For each offspring at each marker, the posterior probability vector is represented by a sparse vector in the form of I=>V: state index vector I and non-zero probability vector V such that I[k]=V[k] for k=1, ..., K. The elements of vector I or V are delimited by "|".

polyancestry::PolyAncestry: results of haplotype reconstruction returned from polyOrigin.

Keyward arguments

missingstring::AbstractString="NA": string code for missing value.

workdir::AbstractString = pwd(): directory for writing outfile.

source
PolyOrigin.plotMapCompFunction
plotMapComp(mapx, mapy;
    boundaryline = (1.5,:dot,:black),
    plotmarker=(:star, 1, 0.5,:blue,stroke(:blue)),
    isannotate= true,
    maplabels=["mapx position", "mapy position"])

plot postions of mapx vs those of mapx.

Positional arguments

mapx::Vector{DataFrame}: marker map for all chromosomes.

mapy::Vector{DataFrame}: comparing map for all chromosomes.

Keyword arguments

boundaryline=(1.5,:dot,:black): vertical lines for chromosome boundaries.

isannotate::Bool=true: if ture, annotate chromosome ID and kendall correlation.

mmaplabels::Union{Nothing,AbstractString}=nothing: labels of comparing marker maps.

source
PolyOrigin.plotCondprobFunction
plotCondprob(polyancestry,offspring=nothing,ishaploprob=true,
    colorgradient = ColorGradient([:white,:blue,:red]),
    boundaryline = (1.5,:dot,:black),
    truemarker=(:star, 5, 0.5,:gray,stroke(:gray)),
    truegeno=nothing)

plot heatmap for conditional probability.

Positional arguments

polyancestry::PolyAncestry: polyancestry returned from polyOrigin.

Keyword arguments

offspring::Union{Nothing,Integer}=nothing: offsprign index and random offspring by default.

colorgradient::ColorGradient=ColorGradient([:white,:blue,:red]): color gradient for heatmap

left_margin = :match: Specifies the extra padding to the left of the subplot

bottom_margin = :match: Specifies the extra padding to the bottom of the subplot

boundaryline=(1.5,:dot,:black): vertical lines for chromosome boundaries.

truemarker=(:star, 5, 0.5,:gray,stroke(:gray)): scatter markers for true ancestral states.

truegeno::Union{Nothing,NamedTuple}: provides true parental origins. It is generated by readTruegeno!.

source
PolyOrigin.animCondprobFunction
animCondprob(polyancestry,fps=1,outfile=nothing,kewargs...)

animation for plots of conditional probability.

Positional arguments

polyancestry::PolyAncestry: polyancestry returned from polyOrigin.

Keyword arguments

fps::Real=1: number of frames per seconds.

outfile::AbstractString="condprob.gif": output file for saving animation.

see plotCondprob for keyargs.

source
PolyOrigin.readTruegeno!Function
readTruegeno!(truefile,polyancestry, keyargs...)

returned truegeno of NamedTuple with truegeno.parentgeno storing phased parental genotypes, and truegeno.offspringgeno storing true parental origins. polyancestry is modified (1) extra markers in truefile are put into polyancestry.delmarker, (2) absolute phase of polyancestry.parentgeno is set according to truegeno.parentgeno, and (3) polyancestry.genoprob and polyancestry.haploprob are set to be consistent with the absolute phase.

Positional arguments

truefile::AbstractString: file storing true values, which has the same requirement as the genofile. The parental genotypes must be given in the phasedgeno format with alleles being "1" and "2". The offspring origin-genotypes must be given in the phasedgeno format with allels being parental homologs.

polyancestry::PolyAncestry: for checking the consistency of marker IDs, chromosome IDs, and individual IDs.

Keyword arguments

delimchar::AbstractChar=',': text delimiter.

commentstring::AbstractString="#": rows that begins with commentstring will be ignored.

workdir::AbstractString = pwd(): directory for reading truevalue file.

source
PolyOrigin.calAccuracy!Function
calAccuracy!(truefile,polyancestry,workdir=pwd(),io=nothing,verbose=true)

calculate phasing and ancestral inference accuracies accoring to the true values in truefile. The polyancestry is modified with absolute parental phases and consistent genoprob.

Positional arguments

truefile::AbstractString: contains true values for parental phases and ancestral genotypes.

polyancestry::PolyAncestry: a struct returned by polyOrigin.

Keyward arguments

workdir::AbstractString = pwd(): directory for writing outfile.

io::Union{Nothing,IOStream}: stream for writing log.

verbose::Bool=true: true if print messages on console.

source
calAccuracy!(truegeno,polyancestry,io=nothing,verbose=true)

calculate phasing and ancestral inference accuracies accoring to the true values in truegeno. The polyancestry is modified with absolute parental phases and consistent genoprob.

Positional arguments

truegeno::NamedTuple: contains true values for parental phases and ancestral genotypes, returned by readTruegeno!.

polyancestry::PolyAncestry: a struct returned by polyOrigin.

Keyward arguments

io::Union{Nothing,IOStream}: stream for writing log.

verbose::Bool=true: true if print messages on console.

source