vcf - Extrapolating variance components from Weir-Fst on Vcftools -


vcftools --vcf all.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt --out fst.pop1.pop2 

the above script computes fst distances on 1000 genomes population data using weir , cokerham's 1984 formula. formula uses 3 variance components, namely a,b,c (between populations; between individuals within populations; between gametes within individuals within populations).

the output directly provides result of formula not components program calculated arrive @ final result. how can ask vcftools output values a,b,c?

if can data format hierfstat, can variance components varcomp.glob. is:

  1. use vcftools --012 genotypes
  2. convert 0/1/2/-1 hierfstat format (eg., 11/12/22/na)
  3. load data hierfstat , compute (see below)

r example:

library(hierfstat) data = read.table("hierfstat.txt", header=t, sep="\t") levels = data.frame(data$popid) loci = data[,2:ncol(data)] res = varcomp.glob(levels=levels, loci=loci, diploid=t) print(res$loc) print(res$f) 

fst each locus (row) therefore (without hierarchical design), res$loc: res$loc[1]/sum(res$loc). if have more complicated sampling, you'll need interpret variance components differently.

--update per comment--

i in pandas, language do. it's text replacement exercise. .012 file dataframe , convert below. read in row row numpy b/c have tons of snps, read_csv work, too.

import pandas pd import numpy np z12_data = [] i, line in enumerate(open(z12_file)):     line = line.strip()     line = [int(x) x in line.split("\t")]     z12_data.append(np.array(line))     if % 10 == 0:         print z12_data = np.array(z12_data) z12_df = pd.dataframe(z12_data) z12_df = z12_df.drop(0, axis=1) z12_df.columns = pd.series(z12_df.columns)-1  hierf_trans = {0:11, 1:12, 2:22, -1:'na'} def apply_hierf_trans(series):     return [hierf_trans[x] if x in hierf_trans else x x in series] hierf = df.apply(apply_hierf_trans) hierf.to_csv("hierfstat.txt", header=true, index=false, sep="\t") 

then, you'd read file hierfstat.txt r, these loci. you'd need specify levels in sampling design (e.g., population). call varcomp.glob() variance components. have parallel version of here if want use it.

note specifying 0 reference allele, in case. may want, maybe not. calculate minor allele frequency , make 2 minor allele, depends on study goal.


Comments