A pipeline for protein embedding generation and visualization

Bio Embeddings

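Resources to learn about bio_embeddings:

Quickly predict protein structure and function from sequence via embeddings: embed.protein.properties.
Read the current documentation: docs.bioembeddings.com.
Chat with us: chat.bioembeddings.com.
We presented the bio_embeddings pipeline as a talk at ISMB 2020 & LMRL 2020. You can find the talk on YouTube (https://www.youtube.com/watch?v=NucUA0QiOe0&feature=youtu.be), the poster on F1000 (https://f1000research.com/posters/9-876), and our Current Protocol Manuscript.
Check out the examples of pipeline configurations and the notebooks.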

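Project aims

Facilitate the use of language-model-based biological sequence representations for transfer learning by providing a single, consistent interface with close to zero friction.
Reproducible workflows.
Depth of representation (different models from different labs, trained on different datasets for different purposes).
Extensive examples, handling of complexity on behalf of users (e.g. CUDA OOM abstraction), and well-documented warnings and error messages.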

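The project includes

General purpose Python embedders based on open models trained on biological sequence representations (SeqVec, ProtTrans, UniRep, ...)
A pipeline which:
- embeds sequences into matrix representations (per amino acid) or vector representations (per sequence) that can be used to train learning models or for analytical purposes
- projects per-sequence embeddings into lower-dimensional representations using UMAP or t-SNE (for lightweight data handling and visualization)
- visualizes low-dimensional sets of per-sequence embeddings in 2D and 3D interactive plots (with and without annotations)
- extracts annotations from per-sequence and per-amino-acid embeddings using supervised approaches (when available) and unsupervised approaches (e.g. by network analysis)
A web server that wraps the pipeline into a distributed API for scalable and consistent workflows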
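
To make the difference between the two representations concrete, here is a minimal sketch of how a per-amino-acid matrix could be collapsed into a single per-sequence vector. It assumes plain averaging over residues, which is one common choice; the helper name `mean_pool` and the toy numbers are hypothetical and are not taken from the bio_embeddings code base.

```python
from statistics import fmean
from typing import List, Sequence


def mean_pool(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average an L x D per-residue matrix over its L rows into one D-dim vector.

    Hypothetical helper for illustration; it assumes mean pooling as the
    reduction from per-amino-acid to per-sequence embeddings.
    """
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    dim = len(per_residue[0])
    if any(len(row) != dim for row in per_residue):
        raise ValueError("all residue embeddings must have the same dimension")
    # zip(*rows) walks the matrix column by column, i.e. one embedding
    # dimension at a time across all residues.
    return [fmean(column) for column in zip(*per_residue)]


# Toy per-residue embedding: 3 residues, 4 dimensions.
toy_matrix = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
per_sequence = mean_pool(toy_matrix)
print(per_sequence)  # [2.0, 1.0, 2.0, 2.0]
```

The per-sequence vector has a fixed length regardless of protein length, which is what makes it convenient as input for projection and visualization.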

Installation

You can install bio_embeddings via pip or use it via Docker.

Pip

Install the pipeline like so:

pip install bio-embeddings[all]

To install the unstable version, install the pipeline like so:

pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"

Docker

We provide a Docker image at ghcr.io/bioembeddings/bio_embeddings. A simple usage example:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

See the docker example in the examples directory for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest, which is built from the latest commit.

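Installation notes

bio_embeddings was developed for Unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend Windows Subsystem for Linux.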

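What model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint...). There is no "one-fits-all" model, and we encourage you to try at least two different models when starting a new exploratory project.

The models prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_bert_bfd, followed by seqvec, which has been established for longer and uses a different principle (LSTM vs. Transformer).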

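Usage and examples

We highly recommend checking out the examples directory for pipeline examples, and the notebooks directory for post-processing of pipeline runs and general purpose use of the embedders.

After installing the package, you can:

Use the pipeline like so: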
```
bio_embeddings config.yml
```
A blueprint of the configuration file and an example setup can be found in the examples directory of this repository.
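
As a rough illustration of what such a configuration might look like, the sketch below writes a minimal config.yml from Python. The layout (a `global` section followed by one named stage) and the keys `sequences_file`, `prefix`, `type`, `protocol` and `reduce` are assumptions modeled on the project's published examples; the file names and the stage name are hypothetical placeholders. Verify everything against the examples directory and docs.bioembeddings.com for the version you installed.

```python
from pathlib import Path
from textwrap import dedent

# Assumed layout: a `global` section plus one named stage per pipeline step.
# Key names are assumptions based on the project's examples, not a
# guaranteed schema.
CONFIG_TEXT = dedent(
    """\
    global:
      sequences_file: sequences.fasta   # hypothetical input FASTA path
      prefix: my_first_run              # hypothetical output directory name
    seqvec_embeddings:                  # hypothetical stage name
      type: embed
      protocol: seqvec
      reduce: true                      # assumed: also store per-sequence vectors
    """
)

# Write the file into the current working directory (overwrites an
# existing config.yml).
config_path = Path("config.yml")
config_path.write_text(CONFIG_TEXT, encoding="utf-8")
print(config_path.read_text(encoding="utf-8"))
```

The resulting file is what would be passed on the command line as `bio_embeddings config.yml`.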

Use the general purpose embedder objects via Python, e.g.:

from bio_embeddings.embed import SeqVecEmbedder

embedder = SeqVecEmbedder()

embedding = embedder.embed("SEQVENCE")

More examples can be found in the notebooks directory of this repository.

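Cite

If you use bio_embeddings for your research, we would appreciate it if you could cite the following paper:

Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., Yang, K. K., Min, S., Yoon, S., Morton, J. T., & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113

The corresponding bibtex: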

@article{https://doi.org/10.1002/cpz1.113,
author = {Dallago, Christian and Schütze, Konstantin and Heinzinger, Michael and Olenyi, Tobias and Littmann, Maria and Lu, Amy X. and Yang, Kevin K. and Min, Seonwoo and Yoon, Sungroh and Morton, James T. and Rost, Burkhard},
title = {Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets},
journal = {Current Protocols},
volume = {1},
number = {5},
pages = {e113},
keywords = {deep learning embeddings, machine learning, protein annotation pipeline, protein representations, protein visualization},
doi = {https://doi.org/10.1002/cpz1.113},
url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpz1.113},
eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpz1.113},
year = {2021}
}

Additionally, we invite you to cite the work from others that was collected in `bio_embeddings` (see section _"Tools by category"_ below). We are working on an enhanced user guide which will include proper references to all citable work collected in `bio_embeddings`.

Contributors

Christian Dallago (lead)
Konstantin Schütze
Tobias Olenyi
Michael Heinzinger

Non-exhaustive list of available tools (see the following section for more details)

fastText
GloVe
Word2Vec
SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
- SeqVecSec and SeqVecLoc for secondary structure and subcellular localization prediction
ProtTrans (ProtBert, ProtAlbert, ProtT5) (https://doi.org/10.1101/2020.07.12.199554)
- ProtBertSec and ProtBertLoc for secondary structure and subcellular localization prediction
UniRep (https://www.nature.com/articles/s41592-019-0598-1)
ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
PLUS (https://github.com/mswzeus/PLUS/)
CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
PB-Tucker (https://www.biorxiv.org/content/10.1101/2021.01.21.427551v1)
GoPredSim (https://www.nature.com/articles/s41598-020-80786-0)
DeepBlast (https://www.biorxiv.org/content/10.1101/2020.11.03.365932v1)

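Datasets

prottrans_t5_xl_u50 residue and sequence embeddings of the human proteome at full precision + secondary structure predictions + subcellular localization predictions: [DOI badge]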

Tools by category

Pipeline

align
- DeepBlast (https://www.biorxiv.org/content/10.1101/2020.11.03.365932v1)
embed
- ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
- ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans T5 trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans T5 trained on BFD and fine-tuned on UniRef50 (in-house)
- UniRep (https://www.nature.com/articles/s41592-019-0598-1)
- ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
- PLUS (https://github.com/mswzeus/PLUS/)
- CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
project
- t-SNE
- UMAP
- PB-Tucker (https://www.biorxiv.org/content/10.1101/2021.01.21.427551v1)
visualize
- 2D/3D sequence embedding space
extract
- supervised
  - SeqVec: DSSP3, DSSP8, disorder, subcellular location and membrane boundness, as described in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8
  - ProtBertSec and ProtBertLoc, as reported in https://doi.org/10.1101/2020.07.12.199554
- unsupervised
  - via pairwise distances between sequence-level (reduced) embeddings: Euclidean as in goPredSim, with more options available, e.g. cosine distance
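
The unsupervised route can be pictured as a nearest-neighbor lookup: a query protein inherits the annotation of the reference protein whose per-sequence embedding lies closest to it. The sketch below illustrates that idea with Euclidean and cosine distance. It is a simplified illustration under stated assumptions, not goPredSim's or the pipeline's actual implementation; the function names, reference identifiers, vectors and labels are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Metric = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two embedding vectors."""
    return math.dist(a, b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity; 0 means identical direction."""
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm


def transfer_annotation(
    query: Sequence[float],
    references: Dict[str, Tuple[List[float], str]],
    metric: Metric = euclidean,
) -> Tuple[str, str, float]:
    """Return (reference id, annotation, distance) of the closest reference."""
    best_id = min(references, key=lambda rid: metric(query, references[rid][0]))
    vector, label = references[best_id]
    return best_id, label, metric(query, vector)


# Toy lookup set: per-sequence embeddings paired with made-up annotations.
references = {
    "ref_A": ([0.0, 1.0, 0.0], "membrane"),
    "ref_B": ([1.0, 0.0, 0.0], "nucleus"),
    "ref_C": ([0.0, 0.0, 1.0], "cytoplasm"),
}
query = [0.9, 0.1, 0.0]

hit_euclidean = transfer_annotation(query, references)
hit_cosine = transfer_annotation(query, references, metric=cosine_distance)
print(hit_euclidean[:2])  # ('ref_B', 'nucleus')
print(hit_cosine[:2])     # ('ref_B', 'nucleus')
```

Returning the distance alongside the annotation lets a caller treat distant hits as less reliable, for example by discarding transfers above a chosen threshold.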

General purpose embedders

ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
ProtTrans T5 trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
ProtTrans T5 trained on BFD and fine-tuned on UniRef50 (https://doi.org/10.1101/2020.07.12.199554)
fastText
GloVe
Word2Vec
UniRep (https://www.nature.com/articles/s41592-019-0598-1)
ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
PLUS (https://github.com/mswzeus/PLUS/)
CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
