Making protein folding accessible to all. Predict protein structures both in Google Colab and on your machine.

Project description

ColabFold - v1.5.5

For details of what changed in v1.5, see the changelog.

Making protein folding accessible to all via Google Colab!

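Notebooks (columns: monomers, complexes, mmseqs2, jackhmmer, templates)
AlphaFold2_mmseqs2
AlphaFold2_batch
AlphaFold2 (from Deepmind)
relax_amber (relax input structure)
ESMFold: Maybe
BETA (in development) notebooks
RoseTTAFold2: WIP
OmegaFold: Maybe
AlphaFold2_advanced_v2 (new experimental notebook)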

See the wiki page "old retired notebooks" for notebooks that are no longer supported.

FAQ

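  • Where can I chat with other ColabFold users?
  • Can I use the models for Molecular Replacement?
    • Yes, but be CAREFUL: the bfactor column is populated with pLDDT confidence values (higher = better). Phenix.phaser expects a "real" bfactor (lower = better). See the post from Claudia Millán.
  • What is the maximum length?
    • The limit depends on the free GPU provided by Google Colab (fingers crossed).
    • For GPU: Tesla T4 or Tesla P100 with ~16G, the max length is ~2000
    • For GPU: Tesla K80 with ~12G, the max length is ~1000
    • To check which GPU you got, open a new code cell and type !nvidia-smi
  • Is it okay to use the MMseqs2 MSA server (cf.run_mmseqs2) on a local computer?
    • You can access the server from a local computer if your queries are serial and come from a single IP. Please do not use multiple computers to query the server.
  • Where can I download the databases used by ColabFold?
  • I want to render my own images of the predicted structures. How do I color by pLDDT?
    • In pymol for AlphaFold structures: spectrum b, red_yellow_green_cyan_blue, minimum=50, maximum=90
    • If you want to use the AlphaFold colours (credit: Konstantin Korotkov):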
      set_color n0, [0.051, 0.341, 0.827]
      set_color n1, [0.416, 0.796, 0.945]
      set_color n2, [0.996, 0.851, 0.212]
      set_color n3, [0.992, 0.490, 0.302]
      color n0, b < 100; color n1, b < 90
      color n2, b < 70;  color n3, b < 50
      
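    • In pymol for RoseTTAFold structures: spectrum b, red_yellow_green_cyan_blue, minimum=0.5, maximum=0.9
  • What is the difference between the AlphaFold2_advanced and AlphaFold2_mmseqs2 (_batch) notebooks for complex prediction?
    • We currently have two different ways to predict protein complexes: (1) using the AlphaFold2 model with a residue index jump, and (2) using the AlphaFold2-multimer model. AlphaFold2_advanced supports (1) and AlphaFold2_mmseqs2 (_batch) supports (2).
  • What is the difference between localcolabfold and the pip installable colabfold_batch?
    • LocalColabFold is an installer script designed to make ColabFold functionality available on users' local machines. It supports a wide range of operating systems, such as Windows 10 or later (using Windows Subsystem for Linux 2), macOS, and Linux.
  • Is there a way to amber-relax structures without having to rerun alphafold/colabfold from scratch?
  • Where can I find the old notebooks that were previously developed but are now retired?
  • Where can I find the history of the MSA server databases used by ColabFold?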
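
Because the B-factor column of a predicted structure holds pLDDT values, it can be handy to summarise them before rendering or before attempting molecular replacement. The following is a minimal Python sketch, not part of ColabFold: the function names are hypothetical, the thresholds and RGB values are copied from the PyMOL commands in the FAQ, and the parser assumes standard fixed-column PDB ATOM records (B-factor in columns 61-66).

```python
from typing import Dict, List

# RGB values copied from the set_color commands in the FAQ.
ALPHAFOLD_COLOURS = {
    "n0": (0.051, 0.341, 0.827),  # pLDDT >= 90
    "n1": (0.416, 0.796, 0.945),  # 70 <= pLDDT < 90
    "n2": (0.996, 0.851, 0.212),  # 50 <= pLDDT < 70
    "n3": (0.992, 0.490, 0.302),  # pLDDT < 50
}


def plddt_band(plddt: float) -> str:
    """Return the colour name that the sequential PyMOL `color` commands assign.

    Note: the PyMOL selection `b < 100` leaves an atom with b == 100 exactly
    uncoloured; this sketch places it in n0 instead.
    """
    if plddt < 50:
        return "n3"
    if plddt < 70:
        return "n2"
    if plddt < 90:
        return "n1"
    return "n0"


def read_ca_plddt(pdb_text: str) -> List[float]:
    """Collect one pLDDT value per residue from the C-alpha ATOM records."""
    values = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, columns 61-66 the B-factor.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


def band_counts(pdb_text: str) -> Dict[str, int]:
    """Count residues per confidence band."""
    counts = {name: 0 for name in ALPHAFOLD_COLOURS}
    for value in read_ca_plddt(pdb_text):
        counts[plddt_band(value)] += 1
    return counts
```

Calling band_counts on the text of a predicted PDB file gives a quick view of how much of the model falls below pLDDT 50, which is the part most likely to need trimming or special care downstream.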

Running locally

For instructions on how to install ColabFold locally, refer to localcolabfold, or see our wiki on how to run ColabFold within Docker.

Generating MSAs for small-scale local structure/complex predictions using the MSA server

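When you pass a FASTA or CSV file containing your sequences to colabfold_batch, it automatically queries the public MSA server to generate MSAs. You might want to split this into two steps for better GPU resource utilization: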

# Query the MSA server and predict the structure on local GPU in one go:
colabfold_batch input_sequences.fasta out_dir

# Split querying MSA server and GPU predictions into two steps
colabfold_batch input_sequences.fasta out_dir --msa-only
colabfold_batch input_sequences.fasta out_dir
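
When this is scripted, the two invocations can be generated from one place. The following is a minimal Python sketch under stated assumptions: build_commands and run are hypothetical helper names, and the only command-line arguments used are the ones shown above (the positional input and output arguments and --msa-only).

```python
import shlex
import subprocess
from typing import List


def build_commands(fasta: str, out_dir: str, split: bool = True) -> List[List[str]]:
    """Return the colabfold_batch invocations for a one-step or two-step run."""
    predict = ["colabfold_batch", fasta, out_dir]
    if not split:
        return [predict]
    # Step 1 only queries the MSA server; step 2 runs the prediction.
    msa_only = ["colabfold_batch", fasta, out_dir, "--msa-only"]
    return [msa_only, predict]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

With dry_run left at its default, run(build_commands("input_sequences.fasta", "out_dir")) only prints the two commands, so the MSA step can be sent to a machine without a GPU and the second step to a GPU node.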

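Generating MSAs for large-scale structure/complex predictions

First create a directory for the databases on a disk with sufficient storage (940GB!). Depending on your location, this can take a couple of hours:

Note: MMseqs2 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1 (May 28, 2023) is used to create the databases and to perform the sequence search in the ColabFold MSA server. Please use this version if you want to obtain the same MSAs as the server.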

MMSEQS_NO_INDEX=1 ./setup_databases.sh /path/to/db_folder
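
Since the download is large, it can be worth checking free space before starting. The following is a minimal Python sketch, not part of ColabFold: has_enough_space and setup_command are hypothetical helper names, the 940 figure is taken from the text, and treating "GB" as 10^9 bytes is an assumption.

```python
import os
import shutil
from typing import Dict, List, Tuple

REQUIRED_GB = 940  # storage requirement quoted in the text


def has_enough_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Check whether the filesystem holding `path` has enough free space.

    Assumption: 1 GB is counted as 10**9 bytes.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


def setup_command(db_folder: str, build_index: bool = False) -> Tuple[List[str], Dict[str, str]]:
    """Return the setup_databases.sh command and the environment to run it in."""
    env = dict(os.environ)
    if build_index:
        # Leave the variable unset so that the index is precomputed.
        env.pop("MMSEQS_NO_INDEX", None)
    else:
        env["MMSEQS_NO_INDEX"] = "1"
    return ["./setup_databases.sh", db_folder], env
```

The returned pair can be passed to subprocess.run(cmd, env=env, check=True) once has_enough_space reports True for the target disk.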

If MMseqs2 is not installed in your PATH, pass --mmseqs <path to mmseqs> to colabfold_search:

# This needs a lot of CPU
colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas
# This needs a GPU
colabfold_batch msas predictions

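This creates an intermediate folder msas that contains all input multiple sequence alignments formatted as a3m files, and a predictions folder with all predicted pdb, json and png files.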
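
After a large batch it is useful to see which inputs have no prediction yet. The following is a minimal Python sketch, not part of ColabFold: the helper names are hypothetical, and it assumes that each prediction file name starts with the a3m file's stem followed by an underscore, which may not hold for every ColabFold version.

```python
from pathlib import Path
from typing import Dict, List


def list_outputs(msa_dir: str, pred_dir: str) -> Dict[str, List[str]]:
    """List a3m files in the MSA folder and pdb/json/png files in the predictions folder."""
    outputs = {"a3m": sorted(p.name for p in Path(msa_dir).glob("*.a3m"))}
    for ext in ("pdb", "json", "png"):
        outputs[ext] = sorted(p.name for p in Path(pred_dir).glob(f"*.{ext}"))
    return outputs


def missing_predictions(msa_dir: str, pred_dir: str) -> List[str]:
    """Return the job names that have an MSA but no predicted pdb file.

    Assumption: prediction files are named '<job name>_...pdb'.
    """
    pdb_names = [p.name for p in Path(pred_dir).glob("*.pdb")]
    missing = []
    for a3m in sorted(Path(msa_dir).glob("*.a3m")):
        job = a3m.stem
        if not any(name.startswith(job + "_") for name in pdb_names):
            missing.append(job)
    return missing
```

Calling missing_predictions("msas", "predictions") after the GPU step returns the jobs that still need to be run or investigated.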

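The procedure above disables MMseqs2 preindexing of the various ColabFold databases by setting the environment variable MMSEQS_NO_INDEX=1 before calling the database setup script. For most use cases of colabfold_search, precomputing the index is not required and might hurt search speed. The precomputed index is necessary for the fast response times of the ColabFold server, where the whole database is permanently kept in memory. In any case, batch searches require a machine with about 128GB RAM or, if the databases are to be permanently kept in RAM, with over 1TB RAM.

In some cases a precomputed index can still be useful. For the following cases, call the setup_databases.sh script without the MMSEQS_NO_INDEX environment variable:

(0) As mentioned above, if you want to set up a server.

(1) If the precomputed index is stored on a very fast storage system (e.g., NVMe SSDs), reading the index from disk might be faster than computing it on the fly. In this case, the search should be performed on the same machine that called setup_databases.sh, since the precomputed index is created to fit within the given main memory size. Additionally, pass the --db-load-mode 0 option to make sure the database is read once from the storage system before use.

(2) Fast single-query searches require the full index (the .idx files) to be kept in memory. This can be done, for example, with vmtouch. Thus, this type of search requires a machine with at least 768GB to 1TB RAM for the ColabfoldDB. If the index is present in memory, use the --db-load-mode 2 parameter in colabfold_search to avoid index loading overhead.

If no index was created (MMSEQS_NO_INDEX=1 was set), then --db-load-mode does nothing and can be ignored.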
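
The choice of --db-load-mode described above can be written down as a small decision function. The following is a minimal Python sketch, not part of ColabFold: db_load_mode and search_command are hypothetical helper names, and the only colabfold_search arguments used are the ones that appear in the text (--mmseqs, --db-load-mode and the three positional arguments).

```python
from typing import List, Optional


def db_load_mode(index_built: bool, index_in_memory: bool = False) -> Optional[int]:
    """Pick the --db-load-mode value for the cases described in the text.

    No index: the flag has no effect, so None is returned.
    Index read from fast storage: 0.
    Index already resident in memory (e.g. via vmtouch): 2.
    """
    if not index_built:
        return None
    return 2 if index_in_memory else 0


def search_command(
    fasta: str,
    db_folder: str,
    msa_dir: str,
    mmseqs: Optional[str] = None,
    index_built: bool = False,
    index_in_memory: bool = False,
) -> List[str]:
    """Assemble a colabfold_search invocation as an argument list."""
    cmd = ["colabfold_search"]
    if mmseqs:
        cmd += ["--mmseqs", mmseqs]
    mode = db_load_mode(index_built, index_in_memory)
    if mode is not None:
        cmd += ["--db-load-mode", str(mode)]
    cmd += [fasta, db_folder, msa_dir]
    return cmd
```

Keeping the decision in one function makes it harder to pass --db-load-mode 2 on a machine where the index was never loaded into memory.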

Tutorials & Presentations

  • ColabFold tutorial presented at the Boston Protein Design and Modeling Club. [video] [slides]

Projects based on ColabFold or helpers

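Acknowledgments

  • We would like to thank the RoseTTAFold and AlphaFold teams for doing an excellent job open-sourcing the software.
  • Also credit to David Koes for his awesome py3Dmol plugin, without which these notebooks would be quite boring!
  • A colab by Sergey Ovchinnikov (@sokrypton), Milot Mirdita (@milot_mirdita) and Martin Steinegger (@thesteinegger).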

How do I reference this work?

  • Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold: Making protein folding accessible to all.
    Nature Methods (2022) doi: 10.1038/s41592-022-01488-1
  • If you are using AlphaFold, please also cite:
    Jumper et al. "Highly accurate protein structure prediction with AlphaFold."
    Nature (2021) doi: 10.1038/s41586-021-03819-2
  • If you are using AlphaFold-multimer, please also cite:
    Evans et al. "Protein complex prediction with AlphaFold-Multimer."
    biorxiv (2021) doi: 10.1101/2021.10.04.463034v1
  • If you are using RoseTTAFold, please also cite:
    Baek et al. "Accurate prediction of protein structures and interactions using a three-track neural network."
    Science (2021) doi: 10.1126/science.abj8754


Old Updates

  31Jul2023: The ColabFold MSA server is back to normal
             It was using older DB (UniRef30 2202/PDB70 220313) from 27th ~8:30 AM CEST to 31st ~11:10 AM CEST.
  27Jul2023: ColabFold MSA server issue:
             We are using the backup server with old databases
             (UniRef30 2202/PDB70 220313) starting from ~8:30 AM CEST until we resolve the issue.
             Resolved on 31Jul2023 ~11:10 CEST.
  12Jun2023: New databases! UniRef30 updated to 2302 and PDB to 230517.
             We now use PDB100 instead of PDB70 (see notes in the [main](https://colabfold.com) notebook).
  12Jun2023: We introduced a new default pairing strategy:
             Previously, for multimer predictions with more than 2 chains,
             we only pair if all sequences taxonomically match ("complete" pairing).
             The new default "greedy" strategy pairs any taxonomically matching subsets.
  30Apr2023: Amber is working again in our ColabFold Notebook
  29Apr2023: Amber is not working in our Notebook due to Colab update
  18Feb2023: v1.5.2 - fixing: fixing memory leak for large proteins
                    - fixing: --use_dropout (random seed was not changing between recycles)
  06Feb2023: v1.5.1 - fixing: --save-all/--save-recycles
  04Feb2023: v1.5.0 - ColabFold updated to use AlphaFold v2.3.1!
  03Jan2023: The MSA server's faulty hardware from 12/26 was replaced.
             There were intermittent failures on 12/26 and 1/3. Currently,
             there are no known issues. Let us know if you experience any.
  10Oct2022: Bugfix: random_seed was not being used for alphafold-multimer.
             Same structure was returned regardless of defined seed. This
             has been fixed!
  13Jul2022: We have set up a new ColabFold MSA server provided by Korean
             Bioinformation Center. It provides accelerated MSA generation,
             we updated the UniRef30 to 2022_02 and PDB/PDB70 to 220313.
  11Mar2022: We now use AlphaFold-multimer-v2 weights by default for complex modeling.
             We also offer the old complex modes "AlphaFold-ptm" or "AlphaFold-multimer-v1"
  04Mar2022: ColabFold now uses a much more powerful server for MSAs and searches through the ColabFoldDB instead of BFD/MGnify.
             Please let us know if you observe any issues.
  26Jan2022: AlphaFold2_mmseqs2, AlphaFold2_batch and colabfold_batch's multimer complex predictions are
             now reranked by default by iptmscore*0.8+ptmscore*0.2 instead of ptmscore
  16Aug2021: WARNING - MMseqs2 API is undergoing upgrade, you may see error messages.
  17Aug2021: If you see any errors, please report them.
  17Aug2021: We are still debugging the MSA generation procedure...
  20Aug2021: WARNING - MMseqs2 API is undergoing upgrade, you may see error messages.
             To avoid Google Colab from crashing, for large MSA we did -diff 1000 to get
             1K most diverse sequences. This caused some large MSA to degrade in quality,
             as sequences close to query were being merged to single representative.
             We are working on updating the server (today) to fix this, by making sure
             that both diverse and sequences close to query are included in the final MSA.
             We'll post update here when update is complete.
  21Aug2021  The MSA issues should now be resolved! Please report any errors you see.
             In short, to reduce MSA size we filter (qsc > 0.8, id > 0.95) and take 3K
             most diverse sequences at different qid (sequence identity to query) intervals
             and merge them. More specifically 3K sequences at qid at (0→0.2),(0.2→0.4),
             (0.4→0.6),(0.6→0.8) and (0.8→1). If you submitted your sequence between
             16Aug2021 and 20Aug2021, we recommend submitting again for best results!
  21Aug2021  The use_templates option in AlphaFold2_mmseqs2 is not properly working. We are
             working on fixing this. If you are not using templates, this does not affect
             the results. Other notebooks that do not use_templates are unaffected.
  21Aug2021  The templates issue is resolved!
  11Nov2021  [AlphaFold2_mmseqs2] now uses Alphafold-multimer for complex (homo/hetero-oligomer) modeling.
             Use [AlphaFold2_advanced] notebook for the old complex prediction logic.
  11Nov2021  ColabFold can be installed locally using pip!
  14Nov2021  Template-based predictions work again in the Alphafold2_mmseqs2 notebook.
  14Nov2021  WARNING "Single-sequence" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken
             starting 11Nov2021. The MMseqs2 MSA was being used regardless of selection.
  14Nov2021  "Single-sequence" mode is now fixed.
  20Nov2021  WARNING "AMBER" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken
             starting 11Nov2021. Unrelaxed proteins were returned instead.
  20Nov2021  "AMBER" is fixed thanks to Kevin Pan
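
The reranking rule from the 26Jan2022 entry (iptmscore*0.8+ptmscore*0.2) can be illustrated with a few lines of Python. This is a minimal sketch, not ColabFold's own code: multimer_score and rank_models are hypothetical names, and the scores in the usage note are made-up illustration values.

```python
from typing import Dict, List, Tuple


def multimer_score(iptm: float, ptm: float) -> float:
    """Combined ranking score quoted in the changelog: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm


def rank_models(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Order model names from best to worst by the combined score.

    `scores` maps a model name to an (ipTM, pTM) pair.
    """
    return sorted(scores, key=lambda name: multimer_score(*scores[name]), reverse=True)
```

For example, with the illustrative input {"model_a": (0.5, 0.9), "model_b": (0.8, 0.6)}, ranking by pTM alone would put model_a first, while the combined score favours model_b because its interface score is higher.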

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

colabfold-1.5.5.tar.gz (66.1 kB)

Built Distribution

colabfold-1.5.5-py3-none-any.whl (65.0 kB, Python 3)
