Foldcomp compresses protein structures efficiently by encoding their torsion angles. Foldcomp is a C++ library with Python bindings.

Project description

Foldcomp

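Foldcomp compresses protein structures efficiently by encoding their torsion angles. It compresses the backbone atoms to 8 bytes and the side chain to an additional 4-5 bytes per residue, so an average-sized protein of 350 residues requires about 6 kB.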

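Foldcomp's compressed format stores protein structures in only 13 bytes per residue, which reduces the required storage space by an order of magnitude compared to saving 3D coordinates directly. We achieve this reduction by encoding the torsion angles of the backbone as well as the side-chain angles in a compact binary file format (FCZ).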
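
The per-residue figures above can be turned into a quick size estimate. The following is a minimal sketch, not part of Foldcomp: the function name `estimate_payload_bytes` is hypothetical, and it counts only the per-residue records (8 bytes of backbone plus 4-5 bytes of side chain). A real FCZ file is presumably somewhat larger, since, per the `--break` option, absolute atom coordinates are also saved at regular intervals.

```python
from typing import Tuple

# Per-residue sizes quoted in the text above.
BACKBONE_BYTES = 8
SIDE_CHAIN_BYTES = (4, 5)  # lower and upper bound per residue


def estimate_payload_bytes(n_residues: int) -> Tuple[int, int]:
    """Return (low, high) estimates of the per-residue payload in bytes.

    Assumption: only the per-residue records are counted; any file header
    and the periodically saved absolute coordinates are ignored.
    """
    low = n_residues * (BACKBONE_BYTES + SIDE_CHAIN_BYTES[0])
    high = n_residues * (BACKBONE_BYTES + SIDE_CHAIN_BYTES[1])
    return low, high


low, high = estimate_payload_bytes(350)
print(f"350 residues: {low}-{high} bytes of per-residue payload")
```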

Foldcomp currently only supports compression of single-chain PDB files.


Left panel: Foldcomp data format, saving each amino acid residue in 13 bytes. Top right panel: Foldcomp decompression is as fast as gzip. Bottom right panel: Foldcomp compression ratio is higher than pulchra and gzip.

Publications

Hyunbin Kim, Milot Mirdita, Martin Steinegger, Foldcomp: a library and format for compressing and indexing large protein structure sets, Bioinformatics, 2023, btad153.

Usage

Installing Foldcomp

# Install Foldcomp Python package
pip install foldcomp

# Download static binaries for Linux
wget https://mmseqs.com/foldcomp/foldcomp-linux-x86_64.tar.gz

# Download static binaries for Linux (ARM64)
wget https://mmseqs.com/foldcomp/foldcomp-linux-arm64.tar.gz

# Download binary for macOS
wget https://mmseqs.com/foldcomp/foldcomp-macos-universal.tar.gz

# Download binary for Windows (x64)
wget https://mmseqs.com/foldcomp/foldcomp-windows-x64.zip

Executable

# Compression
foldcomp compress <pdb|cif> [<fcz>]
foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]

# Decompression
foldcomp decompress <fcz|tar> [<pdb>]
foldcomp decompress [-t number] <dir|tar(.gz)|db> [<dir|tar>]

# Decompressing a subset of Foldcomp database
foldcomp decompress [-t number] --id-list <idlist.txt> <db> [<dir|tar>]

# Extraction of sequence or pLDDT
foldcomp extract [--plddt|--amino-acid] <fcz> [<fasta>]
foldcomp extract [--plddt|--amino-acid] [-t number] <dir|tar(.gz)|db> [<fasta_out>]

# Check
foldcomp check <fcz>
foldcomp check [-t number] <dir|tar(.gz)|db>

# RMSD
foldcomp rmsd <pdb|cif> <pdb|cif>

# Options
 -h, --help           print this help message
 -v, --version        print version
 -t, --threads        threads for (de)compression of folders/tar files [default=1]
 -r, --recursive      recursively look for files in directory [default=0]
 -f, --file           input is a list of files [default=0]
 -a, --alt            use alternative atom order [default=false]
 -b, --break          interval size to save absolute atom coordinates [default=25]
 -z, --tar            save as tar file [default=false]
 -d, --db             save as database [default=false]
 -y, --overwrite      overwrite existing files [default=false]
 -l, --id-list        a file of id list to be processed (only for database input)
 --skip-discontinuous skip PDB with discontinuous residues (only batch compression)
 --check              check FCZ before and skip entries with error (only for batch decompression)
 --plddt              extract pLDDT score (only for extraction mode)
 --fasta              extract amino acid sequence (only for extraction mode)
 --no-merge           do not merge output files (only for extraction mode)
 --time               measure time for compression/decompression
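
When batch jobs are driven from a script, the command lines above can be assembled programmatically. The following is a minimal sketch that uses only the `compress`/`decompress` modes and the `-t` and `--id-list` options listed above. The helper names `build_command` and `run`, and the paths in the example calls, are hypothetical. It assumes the `foldcomp` binary is on `PATH` and is aimed at the directory, tar, and database forms of the commands.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(mode: str, source: str, target: Optional[str] = None,
                  threads: int = 1, id_list: Optional[str] = None) -> List[str]:
    """Assemble a foldcomp command line from the documented options."""
    if mode not in {"compress", "decompress"}:
        raise ValueError(f"unsupported mode: {mode}")
    cmd = ["foldcomp", mode, "-t", str(threads)]
    if id_list is not None:
        # --id-list is documented for database input in decompression only.
        if mode != "decompress":
            raise ValueError("--id-list applies to decompressing a database")
        cmd += ["--id-list", id_list]
    cmd.append(source)
    if target is not None:
        cmd.append(target)
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command, failing early if the executable is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} executable not found on PATH")
    subprocess.run(cmd, check=True)


# Only build and print the commands here; call run(...) to execute them.
print(build_command("compress", "pdb_dir", "fcz_dir", threads=4))
print(build_command("decompress", "example_db", "out_dir", id_list="idlist.txt"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.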

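Downloading Databases

We provide prebuilt databases for multiple large sets of predicted protein structures, together with a Python helper to download the database files.

You can download the AlphaFoldDB Swiss-Prot database with the following command: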

python -c "import foldcomp; foldcomp.setup('afdb_swissprot_v4');"

We currently provide the following databases:

  • ESMAtlas full (v0 + v2023_02): foldcomp.setup('esmatlas')

  • ESMAtlas v2023_02: foldcomp.setup('esmatlas_v2023_02')

  • ESMAtlas high-quality: foldcomp.setup('highquality_clust30')

    Note: We skipped all structures with discontinuous residues or other issues. Lists of the affected predictions are available: full (~21M), high-quality (~100k), v2023_02 (~10k)

  • AlphaFoldDB Uniprot: foldcomp.setup('afdb_uniprot_v4')

  • AlphaFoldDB Swiss-Prot: foldcomp.setup('afdb_swissprot_v4')

  • AlphaFoldDB Model Organisms: foldcomp.setup('h_sapiens')

    • a_thaliana, c_albicans, c_elegans, d_discoideum, d_melanogaster, d_rerio, e_coli, g_max, h_sapiens, m_jannaschii, m_musculus, o_sativa, r_norvegicus, s_cerevisiae, s_pombe, z_mays

  • AlphaFoldDB Cluster Representatives: foldcomp.setup('afdb_rep_v4')

  • AlphaFoldDB Cluster Representatives (Dark Clusters): foldcomp.setup('afdb_rep_dark_v4')

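If you need other prebuilt datasets, please get in touch with us through our GitHub issues.

If you have problems downloading the databases, you can navigate directly to our download server and download the required files. For example: afdb_uniprot_v4, afdb_uniprot_v4.index, afdb_uniprot_v4.dbtype, afdb_uniprot_v4.lookup, and optionally afdb_uniprot_v4.source.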
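
After a manual download it can help to confirm that all required files are present before opening the database. The following is a minimal sketch using only the file names listed above. The helper `missing_database_files` is hypothetical and not part of Foldcomp, and the `.source` file is treated as optional, as stated in the text.

```python
import tempfile
from pathlib import Path
from typing import List

# File suffixes named in the text above; ".source" is optional and not checked.
REQUIRED_SUFFIXES = ["", ".index", ".dbtype", ".lookup"]


def missing_database_files(name: str, directory: str = ".") -> List[str]:
    """Return the required database files that are absent from `directory`."""
    base = Path(directory)
    return [name + suffix for suffix in REQUIRED_SUFFIXES
            if not (base / (name + suffix)).is_file()]


# Demonstration on a temporary directory holding an incomplete download.
with tempfile.TemporaryDirectory() as tmp:
    for suffix in ["", ".index"]:
        (Path(tmp) / ("afdb_uniprot_v4" + suffix)).touch()
    print(missing_database_files("afdb_uniprot_v4", tmp))
```

An empty list means the four required files are all in place.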

Python API

You can find more in-depth examples of using Foldcomp's Python interface in the example notebook: [Open in Colab](https://colab.research.google.com/github/steineggerlab/foldcomp/blob/master/foldcomp-py-examples.ipynb)

import foldcomp

# 01. Handling a FCZ file
# Read the compressed file as bytes
with open("test/compressed.fcz", "rb") as fcz:
    fcz_binary = fcz.read()

# Decompress (name: file name, pdb: PDB content as a string)
(name, pdb) = foldcomp.decompress(fcz_binary)

# Save to a PDB file
with open(name, "w") as pdb_file:
    pdb_file.write(pdb)

# Get data as dictionary
data_dict = foldcomp.get_data(fcz_binary)  # foldcomp.get_data(pdb) also works
# Keys: phi, psi, omega, torsion_angles, residues, bond_angles, coordinates
data_dict["phi"]  # phi angles (C-N-CA-C)
data_dict["psi"]  # psi angles (N-CA-C-N)
data_dict["omega"]  # omega angles (CA-C-N-CA)
data_dict["torsion_angles"]  # torsion angles of the backbone as list (phi + psi + omega)
data_dict["bond_angles"]  # bond angles of the backbone as list
data_dict["residues"]  # amino acid residues as string
data_dict["coordinates"]  # coordinates of the backbone as list

# 02. Iterate over a database of FCZ files
# Open a Foldcomp database, restricted to the given IDs
ids = ["d1asha_", "d1it2a_"]
with foldcomp.open("test/example_db", ids=ids) as db:
    # Iterate through the database
    for (name, pdb) in db:
        # Save each entry as a separate PDB file
        with open(name + ".pdb", "w") as pdb_file:
            pdb_file.write(pdb)
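
The database loop above writes files into the current working directory. The following is a minimal sketch of a reusable variant that writes into a chosen output directory. The helper `save_entries` is hypothetical and not part of Foldcomp; it accepts any iterable of `(name, pdb)` pairs, which is the shape the database handle yields in the example above, so it is demonstrated here with stand-in data rather than a real database.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def save_entries(entries: Iterable[Tuple[str, str]], out_dir: str) -> List[Path]:
    """Write each (name, pdb) pair to <out_dir>/<name>.pdb and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, pdb in entries:
        # Keep only the final path component so a name cannot escape out_dir.
        path = out / (Path(name).name + ".pdb")
        path.write_text(pdb)
        written.append(path)
    return written


# Stand-in data; with Foldcomp installed, the handle could be passed instead:
#   with foldcomp.open("test/example_db", ids=ids) as db:
#       save_entries(db, "decompressed")
with tempfile.TemporaryDirectory() as tmp:
    paths = save_entries([("d1asha_", "REMARK stand-in content\n")], tmp)
    print([p.name for p in paths])
```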

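Subsetting Databases

If you are working with millions of entries, we recommend using the createsubdb command of mmseqs2 to subset the database. The following command can be used to subset the AlphaFold Uniprot DB to a given list of IDs.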

# mmseqs createsubdb --subdb-mode 0 --id-mode 1 id_list.txt input_foldcomp_db output_foldcomp_db
mmseqs createsubdb --subdb-mode 0 --id-mode 1 id_list.txt afdb_uniprot_v4 afdb_subset

Please note that the IDs in afdb_uniprot_v4 are in the format AF-A0A5S3Y9Q7-F1-model_v4.
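
If you start from plain UniProt accessions, the ID list for createsubdb has to use the format shown above. The following is a minimal sketch that assumes every entry follows the pattern of the example ID, with fragment `F1` and model version 4 as defaults. The helpers `afdb_entry_ids` and `write_id_list` are hypothetical and not part of Foldcomp or mmseqs2.

```python
from typing import Iterable, List


def afdb_entry_ids(accessions: Iterable[str], fragment: int = 1,
                   version: int = 4) -> List[str]:
    """Format UniProt accessions as AF-<accession>-F<fragment>-model_v<version>."""
    return [f"AF-{acc}-F{fragment}-model_v{version}" for acc in accessions]


def write_id_list(ids: Iterable[str], path: str) -> None:
    """Write one ID per line, e.g. as the id_list.txt passed to createsubdb."""
    with open(path, "w") as handle:
        for entry_id in ids:
            handle.write(entry_id + "\n")


ids = afdb_entry_ids(["A0A5S3Y9Q7"])
print(ids[0])  # AF-A0A5S3Y9Q7-F1-model_v4
# write_id_list(ids, "id_list.txt") would then produce the input file.
```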

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

foldcomp-0.0.7.tar.gz (21.0 kB), source

Built distributions

foldcomp-0.0.7-cp311-cp311-win_amd64.whl (154.3 kB), CPython 3.11, Windows x86-64
foldcomp-0.0.7-cp311-cp311-musllinux_1_1_x86_64.whl (798.0 kB), CPython 3.11, musllinux: musl 1.1+ x86-64
foldcomp-0.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.6 kB), CPython 3.11, manylinux: glibc 2.17+ x86-64
foldcomp-0.0.7-cp311-cp311-macosx_10_9_x86_64.whl (248.0 kB), CPython 3.11, macOS 10.9+ x86-64
foldcomp-0.0.7-cp311-cp311-macosx_10_9_universal2.whl (473.3 kB), CPython 3.11, macOS 10.9+ universal2 (ARM64, x86-64)
foldcomp-0.0.7-cp310-cp310-win_amd64.whl (154.3 kB), CPython 3.10, Windows x86-64
foldcomp-0.0.7-cp310-cp310-musllinux_1_1_x86_64.whl (798.0 kB), CPython 3.10, musllinux: musl 1.1+ x86-64
foldcomp-0.0.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.6 kB), CPython 3.10, manylinux: glibc 2.17+ x86-64
foldcomp-0.0.7-cp310-cp310-macosx_10_9_x86_64.whl (248.0 kB), CPython 3.10, macOS 10.9+ x86-64
foldcomp-0.0.7-cp310-cp310-macosx_10_9_universal2.whl (473.3 kB), CPython 3.10, macOS 10.9+ universal2 (ARM64, x86-64)
foldcomp-0.0.7-cp39-cp39-win_amd64.whl (154.3 kB), CPython 3.9, Windows x86-64
foldcomp-0.0.7-cp39-cp39-musllinux_1_1_x86_64.whl (798.0 kB), CPython 3.9, musllinux: musl 1.1+ x86-64
foldcomp-0.0.7-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.6 kB), CPython 3.9, manylinux: glibc 2.17+ x86-64
foldcomp-0.0.7-cp39-cp39-macosx_10_9_x86_64.whl (248.0 kB), CPython 3.9, macOS 10.9+ x86-64
foldcomp-0.0.7-cp39-cp39-macosx_10_9_universal2.whl (473.3 kB), CPython 3.9, macOS 10.9+ universal2 (ARM64, x86-64)
foldcomp-0.0.7-cp38-cp38-win_amd64.whl (154.3 kB), CPython 3.8, Windows x86-64
foldcomp-0.0.7-cp38-cp38-musllinux_1_1_x86_64.whl (798.0 kB), CPython 3.8, musllinux: musl 1.1+ x86-64
foldcomp-0.0.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.6 kB), CPython 3.8, manylinux: glibc 2.17+ x86-64
foldcomp-0.0.7-cp38-cp38-macosx_10_9_x86_64.whl (248.0 kB), CPython 3.8, macOS 10.9+ x86-64
foldcomp-0.0.7-cp38-cp38-macosx_10_9_universal2.whl (473.3 kB), CPython 3.8, macOS 10.9+ universal2 (ARM64, x86-64)
foldcomp-0.0.7-cp37-cp37m-win_amd64.whl (154.2 kB), CPython 3.7m, Windows x86-64
foldcomp-0.0.7-cp37-cp37m-musllinux_1_1_x86_64.whl (797.7 kB), CPython 3.7m, musllinux: musl 1.1+ x86-64
foldcomp-0.0.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.5 kB), CPython 3.7m, manylinux: glibc 2.17+ x86-64
foldcomp-0.0.7-cp37-cp37m-macosx_10_9_x86_64.whl (248.2 kB), CPython 3.7m, macOS 10.9+ x86-64
