Precise host read removal

Project description

Hostile

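Hostile accurately removes host sequences from short and long read (meta)genomes, consuming paired or unpaired fastq[.gz] input. Batteries are included: a human reference genome is downloaded the first time Hostile is run. Hostile is precise by default, removing an order of magnitude fewer microbial reads than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For the best possible retention of microbial reads, use an existing index masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using --rename) for privacy and smaller FASTQs. The heavy lifting is done by fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads, while Minimap2 is the default aligner for long reads. In benchmarks using 8 alignment threads, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s. By default, Hostile requires 4GB of RAM for decontaminating short reads and 13GB for long reads (Minimap2). Further information and benchmarks can be found in the paper and blog post. Please open an issue to report problems, or otherwise reach out for help and advice.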
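To make the effect of --rename concrete, here is a minimal sketch of replacing FASTQ read headers with incrementing integers. It is an illustration only, not Hostile's implementation: the function name `rename_fastq_records` is hypothetical, and numbering from 1 and the handling of the `+` separator line are assumptions.

```python
from typing import Iterable, Iterator


def rename_fastq_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with each header replaced by an incrementing integer.

    Assumes well-formed four-line records (header, sequence, '+', qualities).
    Numbering from 1 is an assumption made for this sketch.
    """
    read_number = 0
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # header line of each four-line record
            if not line.startswith("@"):
                raise ValueError(f"expected FASTQ header at line {i + 1}")
            read_number += 1
            yield f"@{read_number}"
        elif i % 4 == 2:  # separator line; drop any repeated read name
            yield "+"
        else:  # sequence and quality lines pass through unchanged
            yield line


records = [
    "@instrument:run:flowcell:1:1101:1000:2000 1:N:0:ATCACG",
    "ACGT",
    "+",
    "IIII",
    "@instrument:run:flowcell:1:1101:1000:2001 1:N:0:ATCACG",
    "TTGA",
    "+",
    "IIII",
]
renamed = list(rename_fastq_records(records))
print("\n".join(renamed))
```

Dropping the instrument, run and flowcell fields is what gives both the privacy benefit and the smaller files mentioned above.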

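Reference genomes (indexes)

The default index human-t2t-hla comprises T2T-CHM13v2.0 and IPD-IMGT/HLA v3.51, and is downloaded automatically when running Hostile unless another index is specified. Slightly higher retention of microbial sequences may be possible using the masked indexes listed below. The index human-t2t-hla-argos985 is masked against 985 reference grade bacterial genomes including common human pathogens, while human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 is further masked comprehensively against all known virus and phage genomes. The latter should be used when retention of viral sequences is a priority. To use a standard index, pass its name as the value of the --index argument, which takes care of downloading and caching the relevant index. Automatic download can be disabled using the --offline flag, and --index also accepts a path to a custom reference genome or Bowtie2 index. Object storage is provided by the ModMedMicro research unit at the University of Oxford.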

Name	Composition	Date	Masked positions
`human-t2t-hla` (default)	T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51	2023-07	0 (0%)
`human-t2t-hla-argos985`	`human-t2t-hla` masked with 985 FDA-ARGOS bacterial genomes	2023-07	317,973 (0.010%)
`human-t2t-hla.rs-viral-202401_ml-phage-202401`	`human-t2t-hla` masked with 18,719 RefSeq viral and 26,928 Millard Lab phage genomes	2024-01	1,172,993 (0.037%)
`human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401`	`human-t2t-hla` masked with 985 FDA-ARGOS bacterial, 18,719 RefSeq viral and 26,928 Millard Lab phage genomes	2024-01	1,473,260 (0.046%)
`human-t2t-hla-argos985-mycob140`	`human-t2t-hla` masked with 985 FDA-ARGOS bacterial and 140 mycobacterial genomes	2023-07	319,752 (0.010%)

The performance of human-t2t-hla and human-t2t-hla-argos985-mycob140 was evaluated in the paper.
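As a rough guide to choosing between the standard indexes, the sketch below maps two retention priorities onto index names copied from the table. The helper `choose_index` and its boolean parameters are hypothetical conveniences for illustration and are not part of Hostile.

```python
# Standard index names copied from the table above.
STANDARD_INDEXES = {
    # (mask bacteria, mask viruses and phages): index name
    (False, False): "human-t2t-hla",
    (True, False): "human-t2t-hla-argos985",
    (False, True): "human-t2t-hla.rs-viral-202401_ml-phage-202401",
    (True, True): "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401",
}


def choose_index(retain_bacteria: bool = False, retain_viruses: bool = False) -> str:
    """Return the standard index name matching the given retention priorities."""
    return STANDARD_INDEXES[(retain_bacteria, retain_viruses)]


# The returned name is what would be passed to --index.
print(choose_index())
print(choose_index(retain_bacteria=True, retain_viruses=True))
```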

Install

Installation with conda/mamba or Docker is recommended because of the non-Python dependencies (Bowtie2, Minimap2, Samtools and Bedtools). Hostile is tested on Ubuntu Linux 22.04, MacOS 12, and under WSL (Windows Subsystem for Linux).

Conda/mamba

conda create -y -n hostile -c conda-forge -c bioconda hostile
conda activate hostile

Docker

wget https://raw.githubusercontent.com/bede/hostile/main/Dockerfile
docker build . --platform linux/amd64

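A Biocontainer image is also available, but beware that it often lags behind the latest released version.

Index installation (optional)

Hostile automatically downloads and caches the default index human-t2t-hla when run for the first time, so there is no need to download an index in advance. Nevertheless:

To download and cache the default index (human-t2t-hla), run hostile fetch
To list available indexes, run hostile fetch --list
To download and cache another standard index, run e.g. hostile fetch --name human-t2t-hla-argos985
To use a custom genome (made with hostile mask, for example), run hostile clean with --index path/to/genome.fa (minimap2) or --index path/to/index (no file extension; Bowtie2)
To change where indexes are stored, set the environment variable HOSTILE_CACHE_DIR to a directory of your choice. Run hostile fetch --list to verify.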
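When indexes are prefetched from a script (for example before running offline), the commands above can be wrapped as follows. This is a sketch under stated assumptions: `build_fetch_command` and `fetch_index` are hypothetical helper names, only the `hostile fetch`, `--list` and `--name` forms and the HOSTILE_CACHE_DIR variable come from this README, and creating the cache directory beforehand is a precaution rather than a documented requirement.

```python
import os
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_fetch_command(name: Optional[str] = None, list_only: bool = False) -> list[str]:
    """Build the argument list for `hostile fetch`."""
    command = ["hostile", "fetch"]
    if list_only:
        command.append("--list")
    elif name is not None:
        command += ["--name", name]
    return command


def fetch_index(cache_dir: Path, name: Optional[str] = None) -> Optional[int]:
    """Run `hostile fetch` with HOSTILE_CACHE_DIR set; return its exit code.

    Returns None without running anything if `hostile` is not on PATH.
    """
    if shutil.which("hostile") is None:
        return None
    cache_dir.mkdir(parents=True, exist_ok=True)  # precaution, see lead-in
    env = dict(os.environ, HOSTILE_CACHE_DIR=str(cache_dir))
    return subprocess.run(build_fetch_command(name), env=env, check=False).returncode


# Only print the commands here; call fetch_index(...) to actually download.
print(build_fetch_command())
print(build_fetch_command(name="human-t2t-hla-argos985"))
print(build_fetch_command(list_only=True))
```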

Command line usage

$ hostile clean -h
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX]
                     [--invert] [--rename] [--reorder] [--out-dir OUT_DIR] [--threads THREADS]
                     [--aligner-args ALIGNER_ARGS] [--force] [--offline] [--debug]

Remove reads aligning to an index from fastq[.gz] input files

options:
  -h, --help            show this help message and exit
  --fastq1 FASTQ1       path to forward fastq[.gz] file
  --fastq2 FASTQ2       optional path to reverse fastq[.gz] file
                        (default: None)
  --aligner {bowtie2,minimap2,auto}
                        alignment algorithm. Default is Bowtie2 (paired reads) & Minimap2 (unpaired reads)
                        (default: auto)
  --index INDEX         name of standard index or path to custom genome/index
                        (default: human-t2t-hla)
  --invert              keep only reads aligning to the target genome (and their mates if applicable)
                        (default: False)
  --rename              replace read names with incrementing integers
                        (default: False)
  --reorder             ensure deterministic output order
                        (default: False)
  --out-dir OUT_DIR     path to output directory
                        (default: /Users/bede/Research/Git/hostile)
  --threads THREADS     number of alignment threads. A sensible default is chosen automatically
                        (default: 5)
  --aligner-args ALIGNER_ARGS
                        additional arguments for alignment
                        (default: )
  --force               overwrite existing output files
                        (default: False)
  --offline             disable automatic index download
                        (default: False)
  --debug               show debug messages
                        (default: False)

Short reads, default index

$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz
INFO: Hostile version 1.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "1.0.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "human_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/human_1_1.fastq.gz",
        "fastq1_out_name": "human_1_1.clean_1.fastq.gz",
        "fastq1_out_path": "/Users/bede/human_1_1.clean_1.fastq.gz",
        "reads_in": 2,
        "reads_out": 0,
        "reads_removed": 2,
        "reads_removed_proportion": 1.0,
        "fastq2_in_name": "human_1_2.fastq.gz",
        "fastq2_in_path": "/Users/bede/human_1_2.fastq.gz",
        "fastq2_out_name": "human_1_2.clean_2.fastq.gz",
        "fastq2_out_path": "/Users/bede/human_1_2.clean_2.fastq.gz"
    }
]

Short reads, masked index, save log

$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz --index human-t2t-hla-argos985 > log.json
INFO: Hostile version 1.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla-argos985
INFO: Cleaning…
INFO: Cleaning complete
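Because the JSON log goes to stdout while the INFO messages do not, redirecting to log.json leaves a file that is easy to post-process. The sketch below summarises such a log using only the standard library. The embedded sample is abridged from the output shown earlier, and `summarise` is a hypothetical helper name.

```python
import json

# Abridged copy of the JSON log shown earlier; in practice read it with
# json.loads(Path("log.json").read_text()).
log_text = """
[
    {
        "version": "1.0.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "fastq1_in_name": "human_1_1.fastq.gz",
        "reads_in": 2,
        "reads_out": 0,
        "reads_removed": 2,
        "reads_removed_proportion": 1.0
    }
]
"""


def summarise(entries: list[dict]) -> list[str]:
    """Return one human-readable line per log entry, checking the counts agree."""
    lines = []
    for entry in entries:
        if entry["reads_in"] - entry["reads_out"] != entry["reads_removed"]:
            raise ValueError(f"inconsistent counts for {entry['fastq1_in_name']}")
        lines.append(
            f"{entry['fastq1_in_name']}: removed {entry['reads_removed']} of "
            f"{entry['reads_in']} reads ({entry['reads_removed_proportion']:.1%})"
        )
    return lines


summary = summarise(json.loads(log_text))
print("\n".join(summary))
```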

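Short unpaired reads, save log

By default, a single fastq is assumed to contain long reads. When decontaminating unpaired short reads, override this by specifying --aligner bowtie2.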

$ hostile clean --aligner bowtie2 --fastq1 tests/data/human_1_1.fastq.gz > log.json
INFO: Hostile version 1.0.0. Mode: short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
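The aligner selection described in this section and in the --aligner help text can be summarised as a small decision function. This mirrors the documented behaviour only; it is not taken from Hostile's source, and `resolve_aligner` is a hypothetical name.

```python
from typing import Optional

VALID_ALIGNERS = {"auto", "bowtie2", "minimap2"}


def resolve_aligner(aligner: str = "auto", fastq2: Optional[str] = None) -> str:
    """Pick an aligner following the documented defaults.

    auto + paired input   -> bowtie2 (short reads)
    auto + unpaired input -> minimap2 (assumed long reads)
    An explicit choice always wins.
    """
    if aligner not in VALID_ALIGNERS:
        raise ValueError(f"unknown aligner: {aligner}")
    if aligner != "auto":
        return aligner
    return "bowtie2" if fastq2 is not None else "minimap2"


print(resolve_aligner(fastq2="human_1_2.fastq.gz"))  # paired short reads
print(resolve_aligner())                             # single fastq, assumed long reads
print(resolve_aligner(aligner="bowtie2"))            # unpaired short reads, explicit override
```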

Long reads

$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz
INFO: Hostile version 1.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "1.0.0",
        "aligner": "minimap2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "tuberculosis_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
        "fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
        "fastq1_out_path": "/Users/bede/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz",
        "reads_in": 1,
        "reads_out": 1,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]

Python usage

from pathlib import Path
from hostile.lib import clean_fastqs, clean_paired_fastqs

# Long reads, defaults
clean_fastqs(
    fastqs=[Path("reads.fastq.gz")],
)

# Paired short reads, various options, capture log
log = clean_paired_fastqs(
    fastqs=[(Path("reads_1.fastq.gz"), Path("reads_2.fastq.gz"))],
    index="human-t2t-hla-argos985",
    out_dir=Path("decontaminated-reads"),
    rename=True,
    force=True,
    threads=4
)

print(log)
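When many samples need cleaning, the forward and reverse files first have to be paired up. The helper below is a sketch that assumes files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, as in the examples above; `pair_fastqs` is a hypothetical name and is not part of hostile.lib.

```python
from pathlib import Path


def pair_fastqs(paths: list[Path]) -> list[tuple[Path, Path]]:
    """Pair <sample>_1.fastq.gz with <sample>_2.fastq.gz in the same directory."""
    suffix_1, suffix_2 = "_1.fastq.gz", "_2.fastq.gz"
    path_set = set(paths)
    pairs = []
    for path in sorted(path_set):
        if path.name.endswith(suffix_1):
            mate = path.with_name(path.name[: -len(suffix_1)] + suffix_2)
            if mate not in path_set:
                raise FileNotFoundError(f"no mate found for {path}")
            pairs.append((path, mate))
    return pairs


paths = [
    Path("b_2.fastq.gz"),
    Path("a_1.fastq.gz"),
    Path("b_1.fastq.gz"),
    Path("a_2.fastq.gz"),
]
pairs = pair_fastqs(paths)
print(pairs)
```

The result has the same list-of-tuples shape as the `fastqs` argument in the paired example above.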

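Masking reference genomes

The mask subcommand makes it easy to create custom-masked reference genomes and so achieve maximum retention of specific target organisms: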

hostile mask human.fasta lots-of-bacterial-genomes.fasta --threads 8

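You may wish to use one of the existing reference genomes as a starting point. Masking uses Minimap2 to align 150mers of the supplied target genomes against the reference genome, and bedtools to mask all aligned regions. Both a masked genome (for Minimap2) and a masked Bowtie2 index are created.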
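The idea behind masking can be shown with a toy model. In the sketch below, exact substring matching stands in for Minimap2 alignment and in-memory replacement with N stands in for bedtools, so it is a simplified illustration rather than what `hostile mask` actually runs. The `step` parameter is a hypothetical knob; this README does not say how far apart the 150mers are taken.

```python
from typing import Iterator


def kmers(sequence: str, k: int = 150, step: int = 1) -> Iterator[str]:
    """Yield k-mers of `sequence` taken every `step` bases."""
    for start in range(0, len(sequence) - k + 1, step):
        yield sequence[start : start + k]


def mask_reference(reference: str, target: str, k: int = 150, step: int = 1) -> str:
    """Replace every region of `reference` matched by a target k-mer with N.

    Exact matching is a stand-in for alignment in this toy model.
    """
    masked = list(reference)
    for kmer in set(kmers(target, k, step)):
        start = reference.find(kmer)
        while start != -1:
            masked[start : start + k] = "N" * k
            start = reference.find(kmer, start + 1)
    return "".join(masked)


# Tiny example with k=8 so the sequences stay readable.
reference = "AAAACCCCGGGGTTTTACGTACGT"
target = "TTCCCCGGGGAA"
masked = mask_reference(reference, target, k=8)
print(masked)
```

Reads from the target organism that resemble the masked stretch can no longer align to the reference, so they are retained rather than removed as host.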

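Limitations

Hostile prioritises retaining microbial sequences above discarding host sequences. If you need to remove every last human sequence, other approaches may serve you better.
Performance is not always improved by using all available CPU cores. A sensible default is therefore chosen automatically at runtime.
Minimap2 has an overhead of 30-90 seconds for indexing the human genome before decontamination starts. Surprisingly, loading a prebuilt index is not significantly faster. I hope to mitigate this in a future release.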
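To see when the Minimap2 indexing overhead matters, the back-of-envelope estimate below combines it with the long read throughput quoted in the introduction (22Mbp/s with 8 alignment threads). It assumes the benchmark figure excludes the indexing overhead, which this README does not state, and real throughput depends on hardware and data.

```python
def estimate_long_read_seconds(
    total_bases: float,
    throughput_bp_per_s: float = 22e6,  # benchmark figure quoted above
    index_overhead_s: float = 60.0,     # midpoint of the 30-90 s range
) -> float:
    """Rough wall time: fixed indexing overhead plus alignment time."""
    return index_overhead_s + total_bases / throughput_bp_per_s


for label, bases in [("100 Mbp", 100e6), ("1 Gbp", 1e9), ("10 Gbp", 10e9)]:
    low = estimate_long_read_seconds(bases, index_overhead_s=30.0)
    high = estimate_long_read_seconds(bases, index_overhead_s=90.0)
    print(f"{label}: roughly {low:.0f}-{high:.0f} s")
```

For small inputs the fixed overhead dominates, while for multi-gigabase runs it becomes a minor fraction of the total.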

Citation

Bede Constantinides, Martin Hunt, Derrick W Crook, Hostile: accurate decontamination of microbial host sequences, Bioinformatics, 2023; btad728, https://doi.org/10.1093/bioinformatics/btad728

@article{10.1093/bioinformatics/btad728,
    author = {Constantinides, Bede and Hunt, Martin and Crook, Derrick W},
    title = {Hostile: accurate decontamination of microbial host sequences},
    journal = {Bioinformatics},
    volume = {39},
    number = {12},
    pages = {btad728},
    year = {2023},
    month = {12},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad728},
    url = {https://doi.org/10.1093/bioinformatics/btad728},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/12/btad728/54850422/btad728.pdf},
}

Development install

git clone https://github.com/bede/hostile.git
cd hostile
conda env create -y -f environment.yml
conda activate hostile
pip install --editable '.[dev]'
pytest

Project details

Release history

Version	Released
1.1.0 (this version)	Apr 10, 2024
1.0.0	Jan 22, 2024
0.4.0	Nov 24, 2023
0.3.0	Nov 22, 2023
0.2.0	Nov 10, 2023
0.1.0	Jul 23, 2023
0.0.3	Jul 12, 2023
0.0.2	Jul 6, 2023
0.0.1	Jun 20, 2023

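Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

hostile-1.1.0.tar.gz (1.7 MB)

Uploaded Apr 10, 2024, source

Built distribution

hostile-1.1.0-py3-none-any.whl (17.0 kB)

Uploaded Apr 10, 2024, Python 3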

Hashes for hostile-1.1.0.tar.gz

Algorithm	Hash digest
SHA256	`eee390f97ac9f669f10792a3fb487d92b9cec518c0b072338b3654a162965e2e`
MD5	`ca54d5f7a070a6023cd68f1ee665637f`
BLAKE2b-256	`a6d3c9ae7689fc9db16ec0a572a580f9c7908b954bdf7d85b59ddf42219a820a`

Hashes for hostile-1.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`4a68d65387da1a1915e452541073b1c90d04468b8d737c5eff245222427a8c98`
MD5	`5f3672447d56358a8b887037cbf39ae6`
BLAKE2b-256	`b977432f826c0aa0b129388424df041c4b0bfad4186dc4717e2aa9f6a4b2a246`
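A manually downloaded file can be checked against the SHA256 digests listed here with Python's standard library. This is a minimal sketch: `sha256_of` and `verify_download` are hypothetical helper names, and the demonstration hashes a small temporary file rather than a real download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare a file's SHA256 digest with the published one."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Demonstration on a temporary file with known content.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hostile")
    demo_ok = verify_download(demo, hashlib.sha256(b"hostile").hexdigest())
    demo_bad = verify_download(demo, "0" * 64)
print(demo_ok, demo_bad)
```

For the source distribution, the call would be `verify_download(Path("hostile-1.1.0.tar.gz"), "eee390f97ac9f669f10792a3fb487d92b9cec518c0b072338b3654a162965e2e")`.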
