
spaCy wrapper for Hugging Face Transformers pipelines

Project description

spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification

This package provides spaCy components to use pretrained Hugging Face Transformers pipelines for inference only.


Features

🚀 Installation

Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy.

pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines

For GPU installation, follow the spaCy quickstart with GPU, e.g.:

pip install -U spacy[cuda-autodetect]

If you run into trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements.

📖 Documentation

This module provides spaCy wrappers for the inference-only transformers TokenClassificationPipeline and TextClassificationPipeline pipelines.

The models are downloaded on initialization from the Hugging Face Hub (if they are not already in your local cache), or they can alternatively be loaded from a local path.
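
For example, to load from a local path instead of the Hub, pass the directory as the model setting; a minimal sketch with a placeholder path:

import spacy

nlp = spacy.blank("en")
# "/path/to/local/model" is a placeholder for a directory saved with
# transformers' save_pretrained(); a Hub model name works the same way.
nlp.add_pipe("hf_token_pipe", config={"model": "/path/to/local/model"})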

Note that the transformer model data is not saved with the pipeline when calling nlp.to_disk, so if you are loading pipelines in an environment with limited internet access, make sure the model is available in your transformers cache directory and enable offline mode if needed.
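
One way to enable offline mode (an assumption on our part; see the transformers documentation for details) is via environment variables that transformers and huggingface_hub read, set before those libraries are imported:

import os

# Use only locally cached files; never hit the network.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})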

Token classification

Config settings for hf_token_pipe:

[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER"     # Model name or path
revision = "main"                 # Model revision
aggregation_strategy = "average"  # "simple", "first", "average", "max"
stride = 16                       # If stride >= 0, process long texts in
                                  # overlapping windows of the model max
                                  # length. The value is the length of the
                                  # window overlap in transformer tokenizer
                                  # tokens, NOT the length of the stride.
kwargs = {}                       # Any additional arguments for
                                  # TokenClassificationPipeline
alignment_mode = "strict"         # "strict", "contract", "expand"
annotate = "ents"                 # "ents", "pos", "spans", "tag"
annotate_spans_key = null         # Doc.spans key for annotate = "spans"
scorer = null                     # Optional scorer
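
The same settings can also be passed programmatically; a minimal sketch mirroring the defaults shown above:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "aggregation_strategy": "average",
        "stride": 16,
        "annotate": "ents",
    },
)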

TokenClassificationPipeline settings

  • model: The model name or path.
  • revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
  • aggregation_strategy: How subword predictions are aggregated into word-level tags: "simple", "first", "average" or "max".
  • stride: If stride >= 0, long texts are processed in overlapping windows of the model max length; the value is the length of the window overlap in transformer tokenizer tokens, not the length of the stride.
  • kwargs: Any additional arguments for TokenClassificationPipeline.

spaCy settings

  • alignment_mode determines how transformer predictions are aligned to spaCy token boundaries, as described for Doc.char_span (see the sketch after this list).
  • annotate and annotate_spans_key configure how the annotation is saved to the spaCy doc. You can save the output as token.tag_, token.pos_ (only for UPOS tags), doc.ents or doc.spans.
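
To get a feel for the three alignment modes, here is a minimal sketch using Doc.char_span directly; the component applies the same alignment when mapping character-level predictions onto tokens:

import spacy

nlp = spacy.blank("en")
doc = nlp("My name is Sarah")
# Characters 11-14 cover "Sar", only part of the token "Sarah".
print(doc.char_span(11, 14, alignment_mode="strict"))    # None: no exact token match
print(doc.char_span(11, 14, alignment_mode="contract"))  # None: no token fully inside
print(doc.char_span(11, 14, alignment_mode="expand"))    # Sarah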

Examples

  1. Save the named entity annotation as Doc.ents:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)
  2. Save the named entity annotation as Doc.spans[spans_key] and the scores as Doc.spans[spans_key].attrs["scores"]:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]
  3. Save fine-grained tags as Token.tag:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']
  4. Save coarse-grained tags as Token.pos:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']

Text classification

Config settings for hf_text_pipe:

[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english"  # Model name or path
revision = "main"                 # Model revision
kwargs = {}                       # Any additional arguments for
                                  # TextClassificationPipeline
scorer = null                     # Optional scorer

The input texts are truncated according to the transformers model max length.
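
If you need to know that limit, you can inspect the model's tokenizer directly; a minimal sketch (the value comes from the tokenizer config, typically 512 for this model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
# Maximum number of transformer tokens the model accepts per input.
print(tokenizer.model_max_length)  # e.g. 512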

TextClassificationPipeline settings

  • model: The model name or path.
  • revision: The model revision. For production use, a specific git commit is recommended instead of the default main (see the sketch after this list).
  • kwargs: Any additional arguments for TextClassificationPipeline.
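
A minimal sketch of pinning a revision; the hash below is a placeholder, not a real commit:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={
        "model": "distilbert-base-uncased-finetuned-sst-2-english",
        # Placeholder: substitute a real commit hash from the model's Hub page.
        "revision": "0123456789abcdef0123456789abcdef01234567",
    },
)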

Example

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}

Batching and GPU

Both token and text classification support batching with nlp.pipe:

for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)

If the component runs into an error while processing a batch (e.g. on an empty text), nlp.pipe will back off to processing each text individually. If it hits an error on an individual text, a warning is shown and the doc is returned without additional annotation.
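
A defensive pattern (a suggestion on our part, not part of the library's API) is to detect docs that came back unannotated, e.g. by checking doc.cats for a text classification pipeline:

# Assumes nlp contains an "hf_text_pipe" component as configured above,
# and that texts and do_something are defined as in the batching example.
for doc in nlp.pipe(texts, batch_size=256):
    if not doc.cats:
        # This text hit an error and was returned without annotation.
        print(f"unannotated doc: {doc.text!r}")
        continue
    do_something(doc)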

Switching to GPU:

import spacy
spacy.require_gpu()

for doc in nlp.pipe(texts):
    do_something(doc)

Bug reports and issues

Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.
