从VCF(变异调用文件)加载numpy数组。
项目描述
从VCF(变异调用格式)文件加载数据到numpy数组中,并(可选)从那里加载到HDF5文件。
安装
安装需要numpy和cython
$ pip install cython $ pip install numpy $ pip install vcfnp
……或者
$ git clone --recursive git://github.com/alimanfoo/vcfnp.git $ cd vcfnp $ python setup.py build_ext --inplace
使用方法
对于从Python的使用,请参阅IPython笔记本示例,或尝试
>>> from __future__ import print_function, division
>>> import numpy as np
>>> import matplotlib
>>> matplotlib.use('TkAgg')
>>> import matplotlib.pyplot as plt
>>> import vcfnp
>>> vcfnp.__version__
'2.0.0'
>>> filename = 'fixture/sample.vcf'
>>> # load data from fixed fields (including INFO)
... v = vcfnp.variants(filename, cache=True).view(np.recarray)
[vcfnp] 2015-01-23 11:10:46.670723 :: caching is enabled
[vcfnp] 2015-01-23 11:10:46.670830 :: cache file available
[vcfnp] 2015-01-23 11:10:46.670866 :: loading from cache file fixture/sample.vcf.vcfnp_cache/variants.npy
>>> # print some simple variant metrics
... print('found %s variants (%s SNPs)' % (v.size, np.count_nonzero(v.is_snp)))
found 9 variants (5 SNPs)
>>> print('QUAL mean (std): %s (%s)' % (np.mean(v.QUAL), np.std(v.QUAL)))
QUAL mean (std): 25.0667 (22.816)
>>> # plot a histogram of variant depth
... fig = plt.figure(1)
>>> ax = fig.add_subplot(111)
>>> ax.hist(v.DP)
(array([ 4., 0., 0., 0., 0., 0., 1., 2., 0., 2.]), array([ 0. , 1.4, 2.8, 4.2, 5.6, 7. , 8.4, 9.8, 11.2,
12.6, 14. ]), <a list of 10 Patch objects>)
>>> ax.set_title('DP histogram')
<matplotlib.text.Text object at 0x7f28f18f5c50>
>>> ax.set_xlabel('DP')
<matplotlib.text.Text object at 0x7f28f207c3c8>
>>> plt.show()
>>> # load data from sample columns
... c = vcfnp.calldata_2d(filename, cache=True).view(np.recarray)
>>> # print some simple genotype metrics
... count_phased = np.count_nonzero(c.is_phased)
>>> count_variant = np.count_nonzero(np.any(c.genotype > 0, axis=2))
>>> count_missing = np.count_nonzero(~c.is_called)
>>> print('calls (phased, variant, missing): %s (%s, %s, %s)'
... % (c.flatten().size, count_phased, count_variant, count_missing))
calls (phased, variant, missing): 27 (14, 12, 2)
>>> # plot a histogram of genotype quality
... fig = plt.figure(2)
>>> ax = fig.add_subplot(111)
>>> ax.hist(c.GQ.flatten())
(array([ 15., 0., 1., 1., 0., 1., 2., 4., 2., 1.]), array([ 0. , 6.1, 12.2, 18.3, 24.4, 30.5, 36.6, 42.7, 48.8,
54.9, 61. ]), <a list of 10 Patch objects>)
>>> ax.set_title('GQ histogram')
<matplotlib.text.Text object at 0x7f28f1eb1cc0>
>>> ax.set_xlabel('GQ')
<matplotlib.text.Text object at 0x7f28f18d4fd0>
>>> plt.show()
还提供了命令行脚本,以便并行转换VCF文件到按基因组区域拆分的NPY数组。例如,以下命令将创建一个包含第二个100kb染色体变异数组的NPY文件
$ vcf2npy \
--vcf /path/to/my.vcf \
--fasta /path/to/ref.fa \
--output-dir /path/to/npy/output \
--array-type variants \
--chromosome chr20 \
--task-size 100000 \
--task-index 2 \
--progress 1000
对于那些有权访问运行Sun Grid Engine的集群的用户,提供了一个脚本以并行转换作业数组提交作业,例如
$ qsub_vcf2npy \
--vcf /path/to/my.vcf \
--fasta /path/to/ref.fa \
--output-dir /path/to/npy/output \
--array-type variants \
--chromosome chr20 \
--task-size 100000 \
--progress 1000 \
-l h_vmem=1G \
-N test_vcfnp \
-j y \
-o /path/to/sge/logs \
-q shortrun.q
应该很容易将此脚本修改为在其他并行计算平台上运行,请参阅脚本文件夹中的源代码。
还提供了一个脚本,用于将来自多个NPY文件的数据加载到单个HDF5文件中。例如,在将VCF文件转换为100kb变异和calldata_2d NPY拆分后,运行类似以下命令
$ vcfnpy2hdf5 \
--vcf /path/to/my.vcf \
--input-dir /path/to/npy/output \
--output /path/to/my.h5
如果想要按染色体分组数据,可以为每个染色体单独做如下操作
$ vcfnpy2hdf5 \
--vcf /path/to/my.vcf \
--input-dir /path/to/npy/output \
--input-filename-template {array_type}.chr20*.npy \
--output /path/to/my.h5 \
--group chr20
还有一个脚本可以在本地计算机上并行处理VCF文件并将其加载到HDF5文件中,例如
$ vcf2hdf5_parallel \
--vcf /path/to/my.vcf \
--fasta /path/to/refseq.fa
最后,还有一个脚本可以将VCF文件的固定字段转换为CSV,例如
$ vcf2csv \
--vcf /path/to/my.vcf \
--dialect excel-tab \
--flatten-filter
致谢
基于Erik Garrison的vcflib。
项目详情
关闭
vcfnp-2.3.0.tar.gz 的哈希值
| 算法 | 哈希摘要 | |
|---|---|---|
| SHA256 | 23db095e82ff7108729760b07de91ef3b2d58a543f9ed068eae0bce88b663def |
|
| MD5 | af8ea21f3b9cdc9ac5e664af72609122 |
|
| BLAKE2b-256 | 907d89a5124b086f3b5af6838a12f8d0ca92d772a486bef70bacc264df7594a0 |