Download and sketch GenBank assembly datasets

sourmash_plugin_directsketch

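tl;dr - download and sketch data directly

About

Commands

  • gbsketch - download and sketch NCBI Assembly Datasets by accession
  • urlsketch - download and sketch directly from a URL

This plugin is an attempt to improve sourmash database generation by downloading files, checking the md5sum if one is provided or accessible, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It is quite fast, but still at alpha level. Here be dragons.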

Installation

Linux

Option 1 (recommended): create a conda environment and install into it

conda create -n directsketch sourmash_plugin_directsketch # create and install
conda activate directsketch # activate

Option 2: install without creating an environment

conda install sourmash_plugin_directsketch

Other platforms

On other platforms, you can create a conda environment with the requirements as follows

curl -JLO https://raw.githubusercontent.com/sourmash-bio/sourmash_plugin_directsketch/main/environment.yml
conda env create -f environment.yml

Then activate the environment and install sourmash_plugin_directsketch with pip

conda activate directsketch
pip install sourmash_plugin_directsketch
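
After either install route, a quick way to confirm that the package is visible in the active environment is to ask Python for its installed version. The sketch below is only an illustration: the helper name installed_version is hypothetical, and it checks package metadata only, not whether the gbsketch and urlsketch commands run.

```python
from importlib import metadata


def installed_version(dist_name="sourmash_plugin_directsketch"):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration; it reads package metadata only.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("sourmash_plugin_directsketch is not installed in this environment")
    else:
        print(f"sourmash_plugin_directsketch {version} is installed")
```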

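Usage considerations

If you are building large databases (over 20k files), we highly recommend using batched zipfiles (v0.4+) to make restarts easier. If you hit an unexpected failure while writing a single zipfile (the default), gbsketch/urlsketch will have to re-download and re-sketch all files. If you instead set a batch size with --batch-size, e.g. 10000, then gbsketch/urlsketch can load any batched zipfiles that finished writing and avoid regenerating those signatures. For gbsketch, the batch size is the number of accessions included in each zip, and all signatures associated with one accession are grouped in a single zip. For urlsketch, the batch size is the total number of signatures included in each zip. Note that batching uses the --output file to build the batch filenames, so if you provide output.zip, your batches will be output.1.zip, output.2.zip, etc.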
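
To make the batch naming concrete, here is a minimal sketch of how batch filenames could be derived from the --output value and how a set of accessions would be split for gbsketch. Only the output.1.zip, output.2.zip pattern comes from the description above; the function names and the handling of the final partial batch are assumptions for illustration, not the plugin's own code.

```python
from pathlib import Path


def batch_zip_name(output, batch_index):
    """Insert the batch index before the extension: output.zip -> output.1.zip."""
    out = Path(output)
    return str(out.with_name(f"{out.stem}.{batch_index}{out.suffix}"))


def plan_gbsketch_batches(n_accessions, batch_size, output):
    """Return [(zip_name, accessions_in_batch), ...] for a gbsketch-style run.

    Assumption: the last batch simply holds whatever accessions remain.
    """
    batches = []
    index = 1
    remaining = n_accessions
    while remaining > 0:
        count = min(batch_size, remaining)
        batches.append((batch_zip_name(output, index), count))
        remaining -= count
        index += 1
    return batches


if __name__ == "__main__":
    for name, count in plan_gbsketch_batches(25000, 10000, "output.zip"):
        print(name, count)
```

Under these assumptions, a restart after a failure during the third batch would only need to redo the accessions that belong to that batch.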

Running the commands

gbsketch

download and sketch NCBI Assembly Datasets by accession

Create an input file

First, create a file, e.g. acc.csv, with GenBank identifiers and sketch names.

accession,name,ftp_path
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-45,
GCA_000175555.1,GCA_000175555.1 ACUK01000506.1 Saccharolobus solfataricus 98/2,

Three columns must be present: accession, name, and ftp_path. The ftp_path column can be empty (as above), but no additional columns may be present.
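
If you generate the input file from a script, the standard csv module is enough. The sketch below writes the three required columns with an empty ftp_path and checks a header before a run. The function names are hypothetical, and the header check assumes the column order shown above, which is stricter than the requirement as stated.

```python
import csv
import io

REQUIRED_COLUMNS = ["accession", "name", "ftp_path"]


def write_gbsketch_csv(records, handle):
    """Write (accession, name) pairs as a gbsketch input CSV with empty ftp_path."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(REQUIRED_COLUMNS)
    for accession, name in records:
        writer.writerow([accession, name, ""])


def has_expected_header(handle):
    """Return True if the first row is exactly the three required columns.

    Assumption: the columns appear in the order shown in the example file.
    """
    header = next(csv.reader(handle), [])
    return header == REQUIRED_COLUMNS


if __name__ == "__main__":
    records = [
        ("GCA_000175555.1",
         "GCA_000175555.1 ACUK01000506.1 Saccharolobus solfataricus 98/2"),
    ]
    buffer = io.StringIO()
    write_gbsketch_csv(records, buffer)
    print(buffer.getvalue(), end="")
```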

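What is ftp_path?

If you do not provide an ftp_path, gbsketch will use the accession to find the ftp_path for you.

If you choose to provide it, ftp_path must be the ftp_path column from NCBI's assembly summary files.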
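
If you already have an assembly summary file on disk, the ftp_path values can be looked up by accession. The sketch below assumes a tab-separated file whose header is a comment line (starting with #) that names the assembly_accession and ftp_path columns; that layout and the function name are assumptions for illustration, so check them against the file you actually downloaded.

```python
def ftp_paths_from_summary(lines):
    """Map assembly accession -> ftp_path from assembly-summary lines.

    Assumptions: fields are tab-separated, and the header is a '#' comment
    line that contains both 'assembly_accession' and 'ftp_path'.
    """
    header = None
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields or fields == [""]:
            continue
        if fields[0].startswith("#"):
            # Strip the comment marker and see whether this line names the columns.
            columns = [field.strip() for field in fields]
            columns[0] = columns[0].lstrip("#").strip()
            if "assembly_accession" in columns and "ftp_path" in columns:
                header = columns
            continue
        if header is None:
            continue  # data before any recognizable header is ignored
        record = dict(zip(header, fields))
        paths[record["assembly_accession"]] = record["ftp_path"]
    return paths
```

Typical use would be ftp_paths_from_summary(open(summary_file)), where summary_file is the local path of your downloaded summary, followed by writing the matching value into the ftp_path column of acc.csv.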

Run

To test gbsketch, you can download a csv file and run

curl -JLO https://raw.githubusercontent.com/sourmash-bio/sourmash_plugin_directsketch/main/tests/test-data/acc.csv
sourmash scripts gbsketch acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv --checksum-fail test.checksum-failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1

To check that the zip was created properly, you can run

sourmash sig summarize test-gbsketch.zip

and you should see the following output

** loading from 'test-gbsketch.zip'
path filetype: ZipFileLinearIndex
location: /path/to/your/test-gbsketch.zip
is database? yes
has manifest? yes
num signatures: 5
** examining manifest...
total hashes: 10815
summary of sketches:
   2 sketches with dna, k=21, scaled=1000, abund      2884 total hashes
   2 sketches with dna, k=31, scaled=1000, abund      2823 total hashes
   1 sketches with protein, k=10, scaled=100, abund   5108 total hashes
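
The three summary lines correspond to the two -p parameter strings in the command: the dna string lists two k-mer sizes, so it yields two sketch types, and the protein string yields one. The sketch below shows that expansion. It is a simplified illustration that handles only the items used in this example (a molecule type, k=, scaled=, abund); it is not the plugin's parser, and the function name is hypothetical.

```python
def expand_param_string(param_string):
    """Expand one -p string into one settings dict per k-mer size.

    Simplified assumption: only a molecule type, k=N, scaled=N and abund
    are recognized; any other bare item is treated as the molecule type.
    """
    moltype = "dna"
    ksizes = []
    scaled = 1000
    abund = False
    for item in param_string.split(","):
        item = item.strip()
        if item.startswith("k="):
            ksizes.append(int(item[len("k="):]))
        elif item.startswith("scaled="):
            scaled = int(item[len("scaled="):])
        elif item == "abund":
            abund = True
        elif item:
            moltype = item
    return [
        {"moltype": moltype, "ksize": k, "scaled": scaled, "abund": abund}
        for k in ksizes
    ]


if __name__ == "__main__":
    params = ["dna,k=21,k=31,scaled=1000,abund", "protein,k=10,scaled=100,abund"]
    for param in params:
        for settings in expand_param_string(param):
            print(settings)
```

In the summary above, the per-type counts (2 + 2 + 1) add up to the 5 signatures reported, and the per-type hash totals (2884 + 2823 + 5108) add up to the 10815 total hashes.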

Full usage

usage:  gbsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [--batch-size BATCH_SIZE] [-k] [--download-only] --failed FAILED --checksum-fail CHECKSUM_FAIL [-p PARAM_STRING] [-c CORES]
                 [-r RETRY_TIMES] [-g | -m]
                 input_csv

download and sketch GenBank assembly datasets

positional arguments:
  input_csv             a txt file or csv file containing accessions in the first column

options:
  -h, --help            show this help message and exit
  -q, --quiet           suppress non-error output
  -d, --debug           provide debugging output
  -o OUTPUT, --output OUTPUT
                        output zip file for the signatures
  -f FASTAS, --fastas FASTAS
                        Write fastas here
  --batch-size BATCH_SIZE
                        Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows gbsketch to recover after unexpected failures, rather than needing to
                        restart sketching from scratch. Default: write all sigs to single zipfile.
  -k, --keep-fasta      write FASTA files in addition to sketching. Default: do not write FASTA files
  --download-only       just download genomes; do not sketch
  --failed FAILED       csv of failed accessions and download links (should be mostly protein).
  --checksum-fail CHECKSUM_FAIL
                        csv of accessions where the md5sum check failed or the md5sum file was improperly formatted or could not be downloaded
  -p PARAM_STRING, --param-string PARAM_STRING
                        parameter string for sketching (default: k=31,scaled=1000)
  -c CORES, --cores CORES
                        number of cores to use (default is all available)
  -r RETRY_TIMES, --retry-times RETRY_TIMES
                        number of times to retry failed downloads
  -g, --genomes-only    just download and sketch genome (DNA) files
  -m, --proteomes-only  just download and sketch proteome (protein) files

urlsketch

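download and sketch directly from a URL

Create an input file

First, create a file, e.g. acc-url.csv, with identifiers, sketch names, and the other required information.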

accession,name,moltype,md5sum,download_filename,url
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,dna,47b9fb20c51f0552b87db5d44d5d4566,GCA_000961135.2_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_genomic.fna.gz
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,protein,fb7920fb8f3cf5d6ab9b6b754a5976a4,GCA_000961135.2_protein.urlsketch.faa.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_protein.faa.gz
GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,dna,a1a8f1c6dc56999c73fe298871c963d1,GCA_000175535.1_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/175/535/GCA_000175535.1_ASM17553v1/GCA_000175535.1_ASM17553v1_genomic.fna.gz

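Six columns must be present:

  • accession - an accession or unique identifier. Ideally contains no spaces.
  • name - the full name of the sketch.
  • moltype - is the file 'dna' or 'protein'?
  • md5sum - the expected md5sum (optional; if provided, it is checked after download).
  • download_filename - the filename for the FASTA download. Required if you use --keep-fasta, and also useful for signatures (it is saved in the sig data).
  • url - a direct link to the file.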
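
When the input file is built by a script, the md5sum column can be filled from a copy of the file you trust, and the rows can be written with the standard csv module. The sketch below is an illustration under assumptions: the function names are hypothetical, and the moltype check simply mirrors the two values listed above.

```python
import csv
import hashlib
import io

URLSKETCH_COLUMNS = ["accession", "name", "moltype", "md5sum", "download_filename", "url"]


def md5sum_of_stream(stream, chunk_size=1 << 20):
    """Return the hex md5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def write_urlsketch_csv(rows, handle):
    """Write dict rows using the six required columns, in the order shown above."""
    writer = csv.DictWriter(handle, fieldnames=URLSKETCH_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["moltype"] not in ("dna", "protein"):
            raise ValueError(f"unexpected moltype: {row['moltype']!r}")
        writer.writerow(row)


if __name__ == "__main__":
    # In practice the stream would be open(path, "rb") on a local copy.
    print(md5sum_of_stream(io.BytesIO(b"example bytes")))
```

Because md5sum is optional, a row can also leave that field empty; the check after download is then skipped, as described in the column list.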

Run

To run the test accession file at tests/test-data/acc-url.csv, run

sourmash scripts urlsketch tests/test-data/acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1

Full usage

usage:  urlsketch [-h] [-q] [-d] [-o OUTPUT] [--batch-size BATCH_SIZE] [-f FASTAS] [-k] [--download-only] --failed FAILED [--checksum-fail CHECKSUM_FAIL] [-p PARAM_STRING] [-c CORES]
                  [-r RETRY_TIMES]
                  input_csv

download and sketch GenBank assembly datasets

positional arguments:
  input_csv             a txt file or csv file containing accessions in the first column

options:
  -h, --help            show this help message and exit
  -q, --quiet           suppress non-error output
  -d, --debug           provide debugging output
  -o OUTPUT, --output OUTPUT
                        output zip file for the signatures
  --batch-size BATCH_SIZE
                        Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows urlsketch to recover after unexpected failures, rather than needing to
                        restart sketching from scratch. Default: write all sigs to single zipfile.
  -f FASTAS, --fastas FASTAS
                        Write fastas here
  -k, --keep-fasta, --keep-fastq
                        write FASTA/Q files in addition to sketching. Default: do not write FASTA files
  --download-only       just download genomes; do not sketch
  --failed FAILED       csv of failed accessions and download links.
  --checksum-fail CHECKSUM_FAIL
                        csv of accessions where the md5sum check failed. If not provided, md5sum failures will be written to the download failures file (no additional md5sum information).
  -p PARAM_STRING, --param-string PARAM_STRING
                        parameter string for sketching (default: k=31,scaled=1000)
  -c CORES, --cores CORES
                        number of cores to use (default is all available)
  -r RETRY_TIMES, --retry-times RETRY_TIMES
                        number of times to retry failed downloads

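Code of Conduct

This project is under the sourmash Code of Conduct.

Support

We suggest filing issues in the directsketch issue tracker or the main sourmash issue tracker.

Authors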

  • N. Tessa Pierce-Ward

Dev docs

sourmash_plugin_directsketch is developed at https://github.com/sourmash-bio/sourmash_plugin_directsketch.

Testing

Run

pytest tests

Generating a release

Bump the version number in Cargo.toml and push.

Make a new release on GitHub.

Then pull, and

make sdist

followed by make upload_sdist.

You may need to run pip install twine if it is not available.
