从VCF(变异调用文件)加载numpy数组。
项目描述
从VCF(变异调用格式)文件加载数据到numpy数组中,并(可选)从那里加载到HDF5文件。
安装
安装需要numpy和cython
$ pip install cython $ pip install numpy $ pip install vcfnp
……或者
$ git clone --recursive git://github.com/alimanfoo/vcfnp.git $ cd vcfnp $ python setup.py build_ext --inplace
使用方法
对于从Python的使用,请参阅IPython笔记本示例,或尝试
>>> from __future__ import print_function, division >>> import numpy as np >>> import matplotlib >>> matplotlib.use('TkAgg') >>> import matplotlib.pyplot as plt >>> import vcfnp >>> vcfnp.__version__ '2.0.0' >>> filename = 'fixture/sample.vcf' >>> # load data from fixed fields (including INFO) ... v = vcfnp.variants(filename, cache=True).view(np.recarray) [vcfnp] 2015-01-23 11:10:46.670723 :: caching is enabled [vcfnp] 2015-01-23 11:10:46.670830 :: cache file available [vcfnp] 2015-01-23 11:10:46.670866 :: loading from cache file fixture/sample.vcf.vcfnp_cache/variants.npy >>> # print some simple variant metrics ... print('found %s variants (%s SNPs)' % (v.size, np.count_nonzero(v.is_snp))) found 9 variants (5 SNPs) >>> print('QUAL mean (std): %s (%s)' % (np.mean(v.QUAL), np.std(v.QUAL))) QUAL mean (std): 25.0667 (22.816) >>> # plot a histogram of variant depth ... fig = plt.figure(1) >>> ax = fig.add_subplot(111) >>> ax.hist(v.DP) (array([ 4., 0., 0., 0., 0., 0., 1., 2., 0., 2.]), array([ 0. , 1.4, 2.8, 4.2, 5.6, 7. , 8.4, 9.8, 11.2, 12.6, 14. ]), <a list of 10 Patch objects>) >>> ax.set_title('DP histogram') <matplotlib.text.Text object at 0x7f28f18f5c50> >>> ax.set_xlabel('DP') <matplotlib.text.Text object at 0x7f28f207c3c8> >>> plt.show() >>> # load data from sample columns ... c = vcfnp.calldata_2d(filename, cache=True).view(np.recarray) >>> # print some simple genotype metrics ... count_phased = np.count_nonzero(c.is_phased) >>> count_variant = np.count_nonzero(np.any(c.genotype > 0, axis=2)) >>> count_missing = np.count_nonzero(~c.is_called) >>> print('calls (phased, variant, missing): %s (%s, %s, %s)' ... % (c.flatten().size, count_phased, count_variant, count_missing)) calls (phased, variant, missing): 27 (14, 12, 2) >>> # plot a histogram of genotype quality ... fig = plt.figure(2) >>> ax = fig.add_subplot(111) >>> ax.hist(c.GQ.flatten()) (array([ 15., 0., 1., 1., 0., 1., 2., 4., 2., 1.]), array([ 0. , 6.1, 12.2, 18.3, 24.4, 30.5, 36.6, 42.7, 48.8, 54.9, 61. ]), <a list of 10 Patch objects>) >>> ax.set_title('GQ histogram') <matplotlib.text.Text object at 0x7f28f1eb1cc0> >>> ax.set_xlabel('GQ') <matplotlib.text.Text object at 0x7f28f18d4fd0> >>> plt.show()
还提供了命令行脚本,以便并行转换VCF文件到按基因组区域拆分的NPY数组。例如,以下命令将创建一个包含第二个100kb染色体变异数组的NPY文件
$ vcf2npy \ --vcf /path/to/my.vcf \ --fasta /path/to/ref.fa \ --output-dir /path/to/npy/output \ --array-type variants \ --chromosome chr20 \ --task-size 100000 \ --task-index 2 \ --progress 1000
对于那些有权访问运行Sun Grid Engine的集群的用户,提供了一个脚本以并行转换作业数组提交作业,例如
$ qsub_vcf2npy \ --vcf /path/to/my.vcf \ --fasta /path/to/ref.fa \ --output-dir /path/to/npy/output \ --array-type variants \ --chromosome chr20 \ --task-size 100000 \ --progress 1000 \ -l h_vmem=1G \ -N test_vcfnp \ -j y \ -o /path/to/sge/logs \ -q shortrun.q
应该很容易将此脚本修改为在其他并行计算平台上运行,请参阅脚本文件夹中的源代码。
还提供了一个脚本,用于将来自多个NPY文件的数据加载到单个HDF5文件中。例如,在将VCF文件转换为100kb变异和calldata_2d NPY拆分后,运行类似以下命令
$ vcfnpy2hdf5 \ --vcf /path/to/my.vcf \ --input-dir /path/to/npy/output \ --output /path/to/my.h5
如果想要按染色体分组数据,可以为每个染色体单独做如下操作
$ vcfnpy2hdf5 \ --vcf /path/to/my.vcf \ --input-dir /path/to/npy/output \ --input-filename-template {array_type}.chr20*.npy \ --output /path/to/my.h5 \ --group chr20
还有一个脚本可以在本地计算机上并行处理VCF文件并将其加载到HDF5文件中,例如
$ vcf2hdf5_parallel \ --vcf /path/to/my.vcf \ --fasta /path/to/refseq.fa
最后,还有一个脚本可以将VCF文件的固定字段转换为CSV,例如
$ vcf2csv \ --vcf /path/to/my.vcf \ --dialect excel-tab \ --flatten-filter
致谢
基于Erik Garrison的vcflib。
项目详情
关闭
vcfnp-2.3.0.tar.gz 的哈希值
算法 | 哈希摘要 | |
---|---|---|
SHA256 | 23db095e82ff7108729760b07de91ef3b2d58a543f9ed068eae0bce88b663def |
|
MD5 | af8ea21f3b9cdc9ac5e664af72609122 |
|
BLAKE2b-256 | 907d89a5124b086f3b5af6838a12f8d0ca92d772a486bef70bacc264df7594a0 |