Chunk and scatter the regions in a bed or sequence dict file
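chunked-scatter and scatter-regions
chunked-scatter is a tool that takes a bed file, fasta index, sequence dictionary or vcf file as input and divides the contigs/chromosomes into overlapping chunks of a given size. These chunks are then written to new bed files, one chromosome per file. To avoid creating thousands of files, small chromosomes are combined into the same file.
scatter-regions works in a similar way, but its defaults and flags are tuned for creating genome scatters for GATK tools.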
Installation
- Install using pip:
pip install chunked-scatter
- Install using conda:
conda install chunked-scatter
Usage
chunked-scatter
usage: chunked-scatter [-h] [-p PREFIX] [-S] [-P] [-c SIZE]
[-m MINIMUM_BP_PER_FILE] [-o OVERLAP]
INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Each contig/region will be split into multiple overlapping
regions, which will be written to a new bed file. Each contig will be placed
in a new file, unless the length of the contigs/regions doesn't exceed a given
number.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. Default 'scatter-'.
-S, --split-contigs If set, contigs are allowed to be split up over
multiple files.
-P, --print-paths If set prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows.
-c SIZE, --chunk-size SIZE
The size of the chunks. The first chunk in a region or
contig will be exactly length SIZE, subsequent chunks
will be SIZE + OVERLAP and the final chunk may be
anywhere from 0.5 to 1.5 times SIZE plus overlap. If a
region (or contig) is smaller than SIZE the original
regions will be returned. Defaults to 1e6
-m MINIMUM_BP_PER_FILE, --minimum-bp-per-file MINIMUM_BP_PER_FILE
The minimum number of bases represented within a
single output bed file. If an input contig or region
is smaller than this MINIMUM_BP_PER_FILE, then the
next contigs/regions will be placed in the same file
until this minimum is met. Defaults to 45e6.
-o OVERLAP, --overlap OVERLAP
The number of bases which each chunk should overlap
with the preceding one. Defaults to 150.
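The chunking rule described under --chunk-size and --overlap can be sketched in Python. This is a minimal sketch, not the tool's actual source: the function name chunk_region and the exact loop condition (stop once fewer than 1.5 × SIZE bases remain) are assumptions chosen to match the documented behaviour, where the final chunk spans between 0.5 and 1.5 times SIZE plus the overlap.

```python
def chunk_region(contig, start, end, size=1_000_000, overlap=150):
    """Split one region into overlapping chunks (hypothetical helper).

    Assumption: a new chunk is emitted only while more than 1.5 * size
    bases remain, so the last chunk absorbs the remainder.
    """
    chunks = []
    position = start
    while position + 1.5 * size < end:
        # Every chunk except the first starts `overlap` bases early.
        chunk_start = max(start, position - overlap)
        chunks.append((contig, chunk_start, position + size))
        position += size
    # Final chunk: runs to the end of the region.
    chunks.append((contig, max(start, position - overlap), end))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_region("chr1", 2000, 16000, size=5000, overlap=150):
        print(*chunk, sep="\t")
```

Under these assumptions, the region chr1 2000 16000 with a chunk size of 5000 and an overlap of 150 is split into chr1 2000 7000, chr1 6850 12000 and chr1 11850 16000, while a region shorter than the chunk size is returned unchanged.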
scatter-regions
usage: scatter-regions [-h] [-p PREFIX] [-S] [-P] [-s SCATTER_SIZE] INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up approximately to
the given scatter size.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. Default 'scatter-'.
-S, --split-contigs If set, contigs are allowed to be split up over
multiple files.
-P, --print-paths If set prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows.
-s SCATTER_SIZE, --scatter-size SCATTER_SIZE
The maximum size for the regions over which to
scatter. If contigs are not split, and a contig is
bigger than the maximum size, the contig will be
placed in its own file. Default: 1000000000.
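The --scatter-size behaviour for unsplit contigs can be illustrated with a greedy grouping. This is only a sketch under assumptions: the function name scatter_contigs is hypothetical, the real tool may group contigs differently, and the case where -S allows contigs to be split is not covered.

```python
def scatter_contigs(contigs, scatter_size=1_000_000_000):
    """Greedily group (name, length) contigs into files (hypothetical helper).

    Assumption: a file is closed as soon as adding the next contig would
    push its total above scatter_size. A contig larger than scatter_size
    therefore ends up alone in its own file.
    """
    files = []
    current = []
    current_bp = 0
    for name, length in contigs:
        if current and current_bp + length > scatter_size:
            files.append(current)
            current, current_bp = [], 0
        current.append((name, 0, length))
        current_bp += length
    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    example = [("a", 400), ("b", 500), ("c", 2000), ("d", 100), ("e", 200)]
    for number, regions in enumerate(scatter_contigs(example, scatter_size=1000)):
        print(f"scatter-{number}.bed", regions)
```

With a scatter size of 1000, the hypothetical contigs a and b share a file, the oversized contig c gets a file of its own, and d and e share the last file.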
Examples
bed file
Given a bed file located at /data/regions.bed:
chr1 100 1000
chr1 2000 16000
chr2 5000 10000
The following command:
chunked-scatter -p /data/scatter_ -m 1000 -c 5000 /data/regions.bed
will produce the following two output files:
/data/scatter_0.bed:
chr1 100 1000
chr1 2000 7000
chr1 6850 12000
chr1 11850 16000
/data/scatter_1.bed:
chr2 5000 10000
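The way regions are distributed over output files in this example can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the function name group_into_files is hypothetical, contigs are kept whole (no -S), and the base count is taken from the input regions before chunking.

```python
from itertools import groupby


def group_into_files(regions, minimum_bp_per_file=45_000_000):
    """Assign (contig, start, end) regions to files (hypothetical helper).

    Assumption: all regions of a contig go into the current file, and the
    file is closed once it holds at least minimum_bp_per_file bases.
    """
    files = []
    current = []
    current_bp = 0
    for _, contig_regions in groupby(regions, key=lambda region: region[0]):
        for region in contig_regions:
            current.append(region)
            current_bp += region[2] - region[1]
        if current_bp >= minimum_bp_per_file:
            files.append(current)
            current, current_bp = [], 0
    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    bed = [("chr1", 100, 1000), ("chr1", 2000, 16000), ("chr2", 5000, 10000)]
    for number, regions in enumerate(group_into_files(bed, minimum_bp_per_file=1000)):
        print(f"/data/scatter_{number}.bed", regions)
```

With -m 1000, the chr1 regions (14900 bases in total) fill the first file and chr2 starts a second one, which agrees with the two output files listed in the example.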
dict file
Given a dict file located at /data/ref.dict:
@SQ SN:chr1 LN:3000000
@SQ SN:chr2 LN:500000
The following command:
chunked-scatter -p /data/scatter_ /data/ref.dict
will produce the following output file at /data/scatter_0.bed:
chr1 0 1000000
chr1 999850 2000000
chr1 1999850 3000000
chr2 0 500000
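To see how a sequence dictionary maps onto regions, the @SQ lines can be read as tab-separated TAG:VALUE fields, with SN giving the contig name and LN its length. The helper below is a minimal sketch: the name parse_sequence_dict is hypothetical and this is not necessarily how chunked-scatter reads its input.

```python
def parse_sequence_dict(lines):
    """Turn @SQ header lines into (contig, 0, length) regions (hypothetical helper)."""
    regions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue  # skip @HD and any other header records
        # Split on the first colon only, since values may contain colons.
        tags = dict(field.split(":", 1) for field in fields[1:])
        regions.append((tags["SN"], 0, int(tags["LN"])))
    return regions


if __name__ == "__main__":
    header = ["@SQ\tSN:chr1\tLN:3000000\n", "@SQ\tSN:chr2\tLN:500000\n"]
    print(parse_sequence_dict(header))
```

Each contig becomes one region starting at 0. Both contigs together hold 3.5 million bases, well below the default minimum of 45e6 per file, which is consistent with all four chunks ending up in the single output file shown in the example.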