oakx-spacy
ALPHA
Usage
Non-developers
Create a virtual environment with the tool of your choice (conda, poetry, venv, etc.), then install oakx-spacy with pip:
pip install oakx-spacy
Next, download and install the models you need (spaCy and/or SciSpacy). The available models are listed below.
spaCy models
English pipelines optimized for CPU. To install any of the following models, run python -m spacy download en_core_web_xxx
- en_core_web_sm (components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer)
- en_core_web_md (components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer)
- en_core_web_lg (components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer)
- en_core_web_trf (components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer)
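Because spaCy models are installed as ordinary Python packages named after the model, one way to see which of the models listed above still need downloading is to look them up with the standard library. The sketch below is illustrative only; `is_installed` and `missing_models` are hypothetical helper names and are not part of oakx-spacy.

```python
from importlib.util import find_spec

# Model names copied from the list above.
SPACY_MODELS = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return find_spec(package_name) is not None


def missing_models(models=SPACY_MODELS):
    """Return the model names from `models` that are not importable yet."""
    return [name for name in models if not is_installed(name)]


# Print the download command for every model that is not installed.
for name in missing_models():
    print(f"python -m spacy download {name}")
```

The same check should work for the SciSpacy model names, since they are installed as packages as well.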
SciSpacy models
To install any of the following models, use the corresponding line in pyproject.toml. For example, if you need the model trained on the CRAFT corpus, run:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz
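Available models
- en_ner_craft_md: a spaCy NER model trained on the CRAFT corpus.
- en_ner_jnlpba_md: a spaCy NER model trained on the JNLPBA corpus.
- en_ner_bc5cdr_md: a spaCy NER model trained on the BC5CDR corpus.
- en_ner_bionlp13cg_md: a spaCy NER model trained on the BIONLP13CG corpus.
- en_core_sci_scibert: a full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.
- en_core_sci_sm: a full spaCy pipeline for biomedical data.
- en_core_sci_md: a full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.
- en_core_sci_lg: a full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.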
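The install URL shown for en_ner_craft_md embeds the model name and the release version. Assuming the other models follow the same pattern (an assumption worth checking against the lines in pyproject.toml), the URL can be assembled as in this sketch. `scispacy_model_url` is a hypothetical helper, not something the package provides.

```python
# Base location taken from the en_ner_craft_md example above.
BASE_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"


def scispacy_model_url(model: str, version: str = "0.5.1") -> str:
    """Build a release URL, assuming the pattern of the CRAFT example holds."""
    return f"{BASE_URL}/v{version}/{model}-{version}.tar.gz"


# Reproduces the pip command given for the CRAFT model.
print("pip install " + scispacy_model_url("en_ner_craft_md"))
```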
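SciSpacy linkers
These come pre-installed with the scispacy package. The available linkers are:
- umls: links to the Unified Medical Language System, levels 0, 1, 2 and 9. This contains ~3M concepts.
- mesh: links to the Medical Subject Headings. It contains a set of higher-quality entities that are used for indexing in PubMed. MeSH contains ~30k entities. Note: the MeSH KB is derived directly from MeSH itself and therefore uses different unique identifiers than the other KBs.
- rxnorm: links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is composed of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
- go: links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
- hpo: links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.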
Developers
Clone the repository
git clone https://github.com/hrshdhgd/oakx-spacy.git
Install poetry
pip install poetry
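SciSpacy models
In pyproject.toml, uncomment the 2 lines that correspond to the model you need. For example, if you need the model trained on the CRAFT corpus, uncomment the following: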
[tool.poetry.dependencies.en_ner_craft_md]
url = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz"
Install dependencies
poetry install
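spaCy models
The instructions are the same as for non-developers; just make sure to prefix each command with poetry run.
The default model is en_ner_craft_md and the default linker is umls.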
How it works
Using an ontology
The input argument can be expressed as spacy:sqlite:obo:name-of-ontology, e.g. spacy:sqlite:obo:bero.
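To make the shape of the input argument concrete, the sketch below splits a selector at its first colon into the plugin prefix (spacy) and the remainder. This is only an illustration of how the string reads; `split_selector` is a hypothetical helper, and nothing here is taken from the plugin's or OAK's actual parsing code.

```python
from typing import Tuple


def split_selector(selector: str) -> Tuple[str, str]:
    """Split 'prefix:rest' at the first colon; 'rest' may itself contain colons."""
    prefix, _, rest = selector.partition(":")
    return prefix, rest


# Selector forms that appear in this document.
for selector in ("spacy:sqlite:obo:bero", "spacy:mesh", "spacy:"):
    print(selector, "->", split_selector(selector))
```

Read this way, spacy:sqlite:obo:bero leaves sqlite:obo:bero as the remainder, while the shorter spacy:linker-name form used with SciSpacy leaves just the linker name.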
- A .txt file [runoak -i spacy:sqlite:obo:bero annotate --text-file tests/input/text.txt]
- The words to annotate [runoak -i spacy:sqlite:obo:bero annotate Myeloid derived suppressor cells \(MDSC\) are immature myeloid cells with immunosuppressive activity.] should yield
info: 'JsonObj(alias_map=JsonObj(**{''rdfs:label'': [''Myeloid-Derived Suppressor
  Cell'']}))'
subject_end: 30
subject_label: Myeloid-Derived Suppressor Cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 0
subject_text_id: NCIT:C129908
---
info: 'JsonObj(alias_map=JsonObj(**{''rdfs:label'': [''Immature Myeloid Cell'']}))'
subject_end: 64
subject_label: Immature Myeloid Cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 43
subject_text_id: NCIT:C113503
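Judging from the sample output above, subject_start and subject_end appear to be half-open character offsets into subject_source (the normalized text), not into the original input sentence. That reading is inferred from this one example and is not documented behavior. A small sketch that recovers the matched span from each record, with the records typed in by hand from the output:

```python
# subject_source from the sample output, with the wrapped line joined by a space.
SOURCE = (
    "myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with "
    "immunosuppressive activity ."
)

# Offsets and identifiers copied from the two records above.
ANNOTATIONS = [
    {"subject_start": 0, "subject_end": 30, "subject_text_id": "NCIT:C129908"},
    {"subject_start": 43, "subject_end": 64, "subject_text_id": "NCIT:C113503"},
]


def matched_text(source: str, annotation: dict) -> str:
    """Slice the source using the annotation's start/end offsets (end exclusive)."""
    return source[annotation["subject_start"]:annotation["subject_end"]]


for annotation in ANNOTATIONS:
    print(annotation["subject_text_id"], "->", matched_text(SOURCE, annotation))
```

For these two records the slices are "myeloid derive suppressor cell" and "immature myeloid cell".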
Using SciSpacy
The input argument can be expressed as spacy:linker-name, e.g. spacy:mesh. This plugin accepts two possible inputs:
- A .txt file [runoak -i spacy: annotate --text-file text.txt]
- The words to annotate [runoak -i spacy: annotate Myeloid derived suppressor cells \(MDSC\) are immature myeloid cells with immunosuppressive activity.] should yield (abridged)
confidence: 0.9999999403953552
info: JsonObj(aliases=['t cell suppressor', 'suppressor cell', 'T suppressor cell',
  'suppressor cells', 'Suppressor cell', 'suppressor T lymphocyte', 'cells suppressor
  t', 'Suppressor cells', 'Suppressor cell (cell)'], canonical_name='Suppressor T
  Lymphocyte', concept_id='C0038856', definition='subpopulation of CD8+ T-lymphocytes
  which suppress antibody production or inhibit cellular immune responses.', types=['T025'])
subject_end: 30
subject_label: suppressor cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 15
subject_text_id: C0038856
---
...
---
confidence: 0.8391554355621338
info: JsonObj(aliases=['Myeloid Cell Leukemia Sequence 1', 'Myeloid Cell Leukemia
  Sequence 1 Protein', 'Induced Myeloid Leukemia Cell Differentiation Protein Mcl-1',
  'Myeloid Cell Factor-1', 'Myeloid Cell Factor 1', 'Induced Myeloid Leukemia Cell
  Differentiation Protein Mcl 1', 'Factor-1, Myeloid Cell', 'Cell Factor-1, Myeloid'],
  canonical_name='Myeloid Cell Leukemia Sequence 1 Protein', concept_id='C1510444',
  definition='A member of the myeloid leukemia factor (MLF) protein family with multiple
  alternatively spliced transcript variants encoding different protein isoforms. In
  hematopoietic cells, it is located mainly in the nucleus, and in non-hematopoietic
  cells, primarily in the cytoplasm with a punctate nuclear localization. MLF1 plays
  a role in cell cycle differentiation.', types=['T116', 'T123'])
subject_end: 64
subject_label: myeloid cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 52
subject_text_id: C1510444
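Each SciSpacy record above carries a confidence score, and the lower-scoring record links the span "myeloid cell" to a protein concept. One possible way to post-process such output is to keep only the records at or above a cutoff. The sketch below uses the two records shown, typed in by hand; the 0.9 cutoff and the `filter_by_confidence` helper are illustrative assumptions, not settings or functions of the plugin.

```python
# Fields copied from the two records shown in the abridged output.
RESULTS = [
    {"subject_text_id": "C0038856", "subject_label": "suppressor cell",
     "confidence": 0.9999999403953552},
    {"subject_text_id": "C1510444", "subject_label": "myeloid cell",
     "confidence": 0.8391554355621338},
]


def filter_by_confidence(results, threshold):
    """Keep the records whose confidence is at least `threshold`."""
    return [record for record in results if record["confidence"] >= threshold]


# 0.9 is an arbitrary example cutoff; choose one that suits your data.
for record in filter_by_confidence(RESULTS, 0.9):
    print(record["subject_text_id"], record["subject_label"], record["confidence"])
```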
Acknowledgements
This cookiecutter project was developed from the oakx-plugin-cookiecutter template and will be kept up to date using cruft.