spaCy wrapper for Hugging Face Transformers pipelines
Project description
spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification
This package provides spaCy components for inference-only Hugging Face Transformers pipelines.
Features
- Apply pretrained transformer models such as dslim/bert-base-NER and distilbert-base-uncased-finetuned-sst-2-english.
🚀 Installation
Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy.
pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines
For GPU installation, follow the spaCy quickstart with GPU, e.g.:
pip install -U spacy[cuda-autodetect]
If you're having trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements.
📖 Documentation
This module provides spaCy wrappers for the inference-only transformers TokenClassificationPipeline and TextClassificationPipeline pipelines.
The models are downloaded on initialization from the Hugging Face Hub if they're not already in your local cache, or alternatively they can be loaded from a local path.
Note that the transformer model data is not saved with the pipeline when you call nlp.to_disk, so if you load pipelines in an environment with limited internet access, make sure the model is available in your transformers cache directory and enable offline mode if needed.
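For example, a minimal sketch of loading such a pipeline offline, assuming the model is already present in the local cache (TRANSFORMERS_OFFLINE is the standard transformers environment variable for offline mode, not something specific to this package):

import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # set before transformers is first imported

import spacy

nlp = spacy.blank("en")
# resolves dslim/bert-base-NER from the local transformers cache only;
# this raises an error if the model was never downloaded
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})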
Token classification
Config settings for hf_token_pipe:
[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER" # Model name or path
revision = "main" # Model revision
aggregation_strategy = "average" # "simple", "first", "average", "max"
stride = 16 # If stride >= 0, process long texts in
# overlapping windows of the model max
# length. The value is the length of the
# window overlap in transformer tokenizer
# tokens, NOT the length of the stride.
kwargs = {} # Any additional arguments for
# TokenClassificationPipeline
alignment_mode = "strict" # "strict", "contract", "expand"
annotate = "ents" # "ents", "pos", "spans", "tag"
annotate_spans_key = null # Doc.spans key for annotate = "spans"
scorer = null # Optional scorer
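The same settings can also be passed to nlp.add_pipe as a config dict; a minimal sketch mirroring the defaults above:

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "revision": "main",
        "aggregation_strategy": "average",
        "stride": 16,  # overlap between windows in transformer tokens
        "alignment_mode": "strict",
        "annotate": "ents",
    },
)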
Settings for TokenClassificationPipeline:
- model: The model name or path.
- revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
- stride: For stride >= 0, the text is processed in overlapping windows of the model max length, where the stride setting specifies the number of transformer tokenizer tokens that overlap between windows (NOT the length of the stride). If stride is None, the text may be truncated. stride is only supported for fast tokenizers.
- aggregation_strategy: The aggregation strategy determines the word-level tags for cases where the subwords within one word do not receive the same predicted tag. See: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy
- kwargs: Any additional arguments passed to TokenClassificationPipeline.
spaCy settings:
- alignment_mode determines how transformer predictions are aligned to spaCy token boundaries, as described for Doc.char_span (see the short illustration after this list).
- annotate and annotate_spans_key configure how the annotation is saved to the spaCy doc. You can save the output as token.tag_, token.pos_ (only for UPOS tags), doc.ents, or doc.spans.
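For background, a small illustration of the three alignment modes using Doc.char_span directly (plain spaCy, with hypothetical character offsets that cut through a token):

import spacy

nlp = spacy.blank("en")
doc = nlp("autonomous cars")
# the character span [0, 5) ends inside the token "autonomous"
print(doc.char_span(0, 5, alignment_mode="strict"))    # None: offsets must match token boundaries
print(doc.char_span(0, 5, alignment_mode="contract"))  # None: no token lies fully inside the span
print(doc.char_span(0, 5, alignment_mode="expand"))    # autonomous: expanded to the cut token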
Examples
- Save named entity annotation as Doc.ents:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)
- Save named entity annotation as Doc.spans[spans_key] and scores as Doc.spans[spans_key].attrs["scores"]:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]
- Save fine-grained tags as Token.tag:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']
- Save coarse-grained tags as Token.pos:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
Text classification
Config settings for hf_text_pipe:
[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english" # Model name or path
revision = "main" # Model revision
kwargs = {} # Any additional arguments for
# TextClassificationPipeline
scorer = null # Optional scorer
The input texts are truncated according to the transformers model max length.
Settings for TextClassificationPipeline:
- model: The model name or path (a sketch of loading from a local path follows this list).
- revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
- kwargs: Any additional arguments passed to TextClassificationPipeline.
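Since model accepts a path as well as a name, a minimal sketch of loading from a local directory (the path below is hypothetical; it should contain a model saved with transformers' save_pretrained):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "/path/to/local/model"},  # hypothetical local directory
)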
Example:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
Batching and GPU
Both token and text classification support batching with nlp.pipe:
for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)
If the component runs into an error while processing a batch (e.g. on an empty text), nlp.pipe will back off to processing each text individually. If an error is encountered on an individual text, a warning is shown and the doc is returned without additional annotation.
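A short sketch of this fallback behavior, reusing the hf_text_pipe example above (the empty string is assumed to trigger the per-text fallback, per the note about empty texts):

texts = ["This is great!", "", "Not my favorite."]
for doc in nlp.pipe(texts, batch_size=2):
    # the empty text comes back without doc.cats populated, after a warning
    print(repr(doc.text), doc.cats)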
Switch to GPU:
import spacy
spacy.require_gpu()
for doc in nlp.pipe(texts):
    do_something(doc)
Bug reports and issues
Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.