Scatter regions in a bed or sequence dict file in chunks

chunked-scatter and scatter-regions

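The chunked-scatter tool takes a bed file, fasta index, sequence dictionary or vcf file as input and divides the contigs/chromosomes into overlapping chunks of a given size. These chunks are then written to new bed files, one chromosome per file. To avoid creating thousands of files, small chromosomes are grouped together.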

The scatter-regions tool works in a similar way, but its defaults and flags are tuned for creating genome scatters for GATK tools.

Installation

Usage

chunked-scatter

usage: chunked-scatter [-h] [-p PREFIX] [-S] [-P] [-c SIZE]
                       [-m MINIMUM_BP_PER_FILE] [-o OVERLAP]
                       INPUT

Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Each contig/region will be split into multiple overlapping
regions, which will be written to a new bed file. Each contig will be placed
in a new file, unless the length of the contigs/regions doesn't exceed a given
number.

positional arguments:
  INPUT                 The input file. The format is detected by the
                        extension. Supported extensions are: '.bed', '.dict',
                        '.fai', '.vcf', '.vcf.gz', '.bcf'.

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        The prefix of the output files. Output will be named
                        like: <PREFIX><N>.bed, in which N is an incrementing
                        number. Default 'scatter-'.
  -S, --split-contigs   If set, contigs are allowed to be split up over
                        multiple files.
  -P, --print-paths     If set, prints paths of the output files to STDOUT.
                        This makes the program usable in scripts and
                        workflows.
  -c SIZE, --chunk-size SIZE
                        The size of the chunks. The first chunk in a region or
                        contig will be exactly length SIZE, subsequent chunks
                        will be SIZE + OVERLAP and the final chunk may be
                        anywhere from 0.5 to 1.5 times SIZE plus overlap. If a
                        region (or contig) is smaller than SIZE the original
                        region will be returned. Defaults to 1e6.
  -m MINIMUM_BP_PER_FILE, --minimum-bp-per-file MINIMUM_BP_PER_FILE
                        The minimum number of bases represented within a
                        single output bed file. If an input contig or region
                        is smaller than this MINIMUM_BP_PER_FILE, then the
                        next contigs/regions will be placed in the same file
                        until this minimum is met. Defaults to 45e6.
  -o OVERLAP, --overlap OVERLAP
                        The number of bases which each chunk should overlap
                        with the preceding one. Defaults to 150.
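
The chunking rule described under --chunk-size and --overlap can be sketched in a few lines of Python. This is a minimal illustration written from the help text above, not the tool's actual implementation; the function name chunk_region and its signature are hypothetical.

```python
def chunk_region(start, end, size=1_000_000, overlap=150):
    """Split the half-open interval [start, end) into overlapping chunks.

    Hypothetical helper that mirrors the documented rule; it is not
    part of the chunked-scatter package.
    """
    def chunk_start(position):
        # The first chunk starts at the region start; later chunks reach
        # back by `overlap` bases into the preceding chunk.
        return start if position == start else max(start, position - overlap)

    chunks = []
    position = start
    # Cut full-size chunks while more than 1.5 * size remains, so that
    # whatever is left for the final chunk is at most 1.5 * size.
    while 2 * (end - position) > 3 * size:
        chunks.append((chunk_start(position), position + size))
        position += size
    # The final chunk takes the remainder of the region.
    chunks.append((chunk_start(position), end))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_region(2000, 16000, size=5000):
        print(*chunk, sep="\t")
```

Under these assumptions, the region chr1 2000 16000 with SIZE 5000 and the default overlap of 150 yields (2000, 7000), (6850, 12000) and (11850, 16000), and a region shorter than SIZE is returned unchanged.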

scatter-regions

usage: scatter-regions [-h] [-p PREFIX] [-S] [-P] [-s SCATTER_SIZE] INPUT

Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up approximately to
the given scatter size.

positional arguments:
  INPUT                 The input file. The format is detected by the
                        extension. Supported extensions are: '.bed', '.dict',
                        '.fai', '.vcf', '.vcf.gz', '.bcf'.

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        The prefix of the output files. Output will be named
                        like: <PREFIX><N>.bed, in which N is an incrementing
                        number. Default 'scatter-'.
  -S, --split-contigs   If set, contigs are allowed to be split up over
                        multiple files.
  -P, --print-paths     If set, prints paths of the output files to STDOUT.
                        This makes the program usable in scripts and
                        workflows.
  -s SCATTER_SIZE, --scatter-size SCATTER_SIZE
                        The maximum size for the regions over which to
                        scatter. If contigs are not split, and a contig is
                        bigger than the maximum size, the contig will be
                        placed in its own file. Default: 1000000000.
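
Because -P prints the output paths to STDOUT, the tool can be driven from a script. The sketch below calls scatter-regions through Python's subprocess module using only the flags listed above. The helper names are hypothetical, and it assumes the paths are printed one per line, which the help text does not state explicitly.

```python
import subprocess


def build_command(input_path, prefix="scatter-", scatter_size=1_000_000_000):
    """Build the scatter-regions command line from the documented flags."""
    return ["scatter-regions", "-P", "-p", prefix,
            "-s", str(scatter_size), input_path]


def parse_paths(stdout):
    """Split captured STDOUT into paths (assumption: one path per line)."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def scatter(input_path, prefix="scatter-", scatter_size=1_000_000_000):
    """Run scatter-regions and return the list of bed files it reports."""
    result = subprocess.run(
        build_command(input_path, prefix, scatter_size),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect STDOUT instead of printing it
        text=True,            # decode the output as str
    )
    return parse_paths(result.stdout)
```

A call such as scatter("/data/ref.dict", prefix="/data/scatter_") would then return the generated bed paths for use in later workflow steps, provided scatter-regions is installed and on the PATH.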

Examples

Bed file

Given a bed file located at /data/regions.bed:

chr1	100	1000
chr1	2000	16000
chr2	5000	10000

the following command

chunked-scatter -p /data/scatter_ -m 1000 -c 5000 /data/regions.bed

will produce the following two output files:

  • /data/scatter_0.bed:
    chr1	100	1000
    chr1	2000	7000
    chr1	6850	12000
    chr1	11850	16000
    
  • /data/scatter_1.bed:
    chr2	5000	10000
    

Dict file

Given a dict file located at /data/ref.dict:

@SQ	SN:chr1	LN:3000000
@SQ	SN:chr2	LN:500000

the following command

chunked-scatter -p /data/scatter_ /data/ref.dict

will produce the following output file at /data/scatter_0.bed:

chr1	0	1000000
chr1	999850	2000000
chr1	1999850	3000000
chr2	0	500000
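
Both contigs land in one file because chr1 (3,000,000 bp) is below the default MINIMUM_BP_PER_FILE of 45e6, so chr2 is added to the same file. The grouping rule can be sketched as follows. This is an illustration based on the help text, not the tool's code: the function name group_regions is hypothetical, bases are counted on the input regions before chunking (an assumption), and the -S option is ignored.

```python
def group_regions(regions, minimum_bp_per_file=45_000_000):
    """Group (contig, start, end) regions into per-file lists.

    A new file is started only when the contig changes and the current
    file already holds at least `minimum_bp_per_file` bases.
    """
    files = []
    current = []
    current_bp = 0
    for contig, start, end in regions:
        contig_changed = bool(current) and current[-1][0] != contig
        if contig_changed and current_bp >= minimum_bp_per_file:
            files.append(current)
            current, current_bp = [], 0
        current.append((contig, start, end))
        current_bp += end - start
    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    dict_regions = [("chr1", 0, 3_000_000), ("chr2", 0, 500_000)]
    print(len(group_regions(dict_regions)))
```

With -m 1000, as in the bed example, the same rule places the two chr1 regions in one file and chr2 in a second one.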
