
spacy-experimental: Cutting-edge experimental spaCy components and features

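This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.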


Installation

Install with pip:

python -m pip install -U pip setuptools wheel
python -m pip install spacy-experimental

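Using spacy-experimental

Components and features may be modified or removed in any release, so always pin the exact version as a package requirement if you are experimenting with a particular component, e.g.: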

spacy-experimental==0.147.0

Then you can add the experimental components to your config or import them from spacy_experimental:

[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"

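Components

Trainable character-based tokenizers

Two trainable tokenizers represent tokenization as a sequence tagging problem over individual characters and use the existing spaCy tagger and NER architectures to perform the tagging.

In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline tokenizer to split each doc into individual characters, and the trainable tokenizer is a pipeline component that retokenizes the doc. The pretokenizer needs to be configured manually, either in the config or with spacy.blank():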

import spacy

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)

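While retokenizing, the two tokenizers currently reset any existing tag or entity annotation, respectively.

Character-based tagger tokenizer

In the tagger version, experimental_char_tagger_tokenizer, the tokenization problem is represented internally with character-level tags: T for the start of a token, I for inside a token, and O for outside a token. This representation comes from Elephant: Sequence Labeling for Word and Sentence Segmentation (Evang et al., 2013).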

This is a sentence.
TIIIOTIOTOTIIIIIIIT

With the option annotate_sents, S replaces T for the first token in each sentence, and the component predicts both token and sentence boundaries.

This is a sentence.
SIIIOTIOTOTIIIIIIIT
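The tag strings above can be reproduced with a few lines of plain Python. This is only an illustrative sketch of the T/I/O/S encoding, not the component's actual code; the helper name char_tags and its sent_starts argument are hypothetical.

```python
def char_tags(text, tokens, sent_starts=None):
    """Return one tag per character: T or S (token start), I (inside), O (outside)."""
    tags = ["O"] * len(text)
    pos = 0
    for i, token in enumerate(tokens):
        # Locate the token in the raw text, searching from the end of the previous one.
        start = text.index(token, pos)
        # Hypothetical flag list: mark sentence-initial tokens with S instead of T.
        tags[start] = "S" if sent_starts and sent_starts[i] else "T"
        for j in range(start + 1, start + len(token)):
            tags[j] = "I"
        pos = start + len(token)
    return "".join(tags)


text = "This is a sentence."
tokens = ["This", "is", "a", "sentence", "."]
print(char_tags(text, tokens))
# TIIIOTIOTOTIIIIIIIT
print(char_tags(text, tokens, sent_starts=[True, False, False, False, False]))
# SIIIOTIOTOTIIIIIIIT
```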

A config excerpt for experimental_char_tagger_tokenizer:

[nlp]
pipeline = ["experimental_char_tagger_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_tagger_tokenizer]
factory = "experimental_char_tagger_tokenizer"
annotate_sents = true
scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}

[components.experimental_char_tagger_tokenizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.experimental_char_tagger_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false

[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2

Character-based NER tokenizer

In the NER version, each character in a token is part of an entity:

T	B-TOKEN
h	I-TOKEN
i	I-TOKEN
s	I-TOKEN
 	O
i	B-TOKEN
s	I-TOKEN
 	O
a	B-TOKEN
 	O
s	B-TOKEN
e	I-TOKEN
n	I-TOKEN
t	I-TOKEN
e	I-TOKEN
n	I-TOKEN
c	I-TOKEN
e	I-TOKEN
.	B-TOKEN
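To make the mapping concrete, the following sketch decodes such character-level labels back into tokens. It is an assumption-laden illustration in plain Python, not the component's implementation; decode_tokens is a hypothetical helper.

```python
def decode_tokens(text, labels):
    """Rebuild tokens from per-character B-TOKEN / I-TOKEN / O labels."""
    tokens = []
    current = ""
    for ch, label in zip(text, labels):
        if label == "B-TOKEN":
            # A new token starts here; flush the previous one if it is still open.
            if current:
                tokens.append(current)
            current = ch
        elif label == "I-TOKEN" and current:
            current += ch
        else:
            # "O", or an I-TOKEN with no open token, is treated as outside any token.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


text = "This is a sentence."
labels = (
    ["B-TOKEN"] + ["I-TOKEN"] * 3 + ["O"]
    + ["B-TOKEN", "I-TOKEN", "O", "B-TOKEN", "O", "B-TOKEN"]
    + ["I-TOKEN"] * 7
    + ["B-TOKEN"]
)
print(decode_tokens(text, labels))
# ['This', 'is', 'a', 'sentence', '.']
```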

A config excerpt for experimental_char_ner_tokenizer:

[nlp]
pipeline = ["experimental_char_ner_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}

[components]

[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}

[components.experimental_char_ner_tokenizer.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.experimental_char_ner_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.experimental_char_ner_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false

[components.experimental_char_ner_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2

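The NER version does not currently support sentence boundaries, but it would be easy to extend using a B-SENT entity type.

Biaffine parser

A biaffine dependency parser, similar to that proposed in Deep Biaffine Attention for Neural Dependency Parsing (Dozat & Manning, 2016). The parser consists of two parts: an arc predicter and an arc labeler. For example: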

[components.experimental_arc_predicter]
factory = "experimental_arc_predicter"

[components.experimental_arc_labeler]
factory = "experimental_arc_labeler"
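For readers unfamiliar with biaffine scoring, the sketch below shows the arc-scoring formula from Dozat & Manning (2016) in plain Python: each candidate head j for a dependent i gets the score h_j^T U d_i + h_j^T u. It illustrates the paper's idea only; how spacy-experimental implements its arc predicter is not shown here, and all names and values in the sketch are hypothetical.

```python
def biaffine_arc_scores(heads, deps, U, u):
    """Return scores[i][j]: the score of token j being the head of token i.

    heads[j] and deps[i] are the head-role and dependent-role vectors of the
    tokens, U is a square weight matrix and u a bias vector for the head role.
    """
    scores = []
    for d in deps:
        row = []
        for h in heads:
            # Bilinear term h^T U d.
            bilinear = sum(
                h[a] * U[a][b] * d[b] for a in range(len(h)) for b in range(len(d))
            )
            # Head-only bias term h^T u (how likely a token is to be a head at all).
            bias = sum(h[a] * u[a] for a in range(len(h)))
            row.append(bilinear + bias)
        scores.append(row)
    return scores


heads = [[1.0, 0.0], [0.0, 1.0]]
deps = [[1.0, 2.0], [3.0, 4.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
u = [0.5, 0.0]
scores = biaffine_arc_scores(heads, deps, U, u)
print(scores)
# [[1.5, 2.0], [3.5, 4.0]]
# The predicted head of each token is the argmax over its row.
print([max(range(len(row)), key=row.__getitem__) for row in scores])
# [1, 1]
```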

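The arc predicter requires that a previous component (such as senter) sets sentence boundaries during training. Therefore, such a component must be added to annotating_components: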

[training]
annotating_components = ["senter"]

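The biaffine parser sample project provides an example biaffine parser pipeline.

Span Finder

The SpanFinder is a new experimental component that identifies span boundaries by tagging potential start and end tokens. It is a machine-learning approach for suggesting candidate spans with higher precision.

SpanFinder uses the following parameters:

  • threshold: Probability threshold for predicted spans.
  • predicted_key: Name of the SpanGroup the predicted spans are saved to.
  • training_key: Name of the SpanGroup the training spans are read from.
  • max_length: Maximum length of the predicted spans. No limit when set to 0. Defaults to 0.
  • min_length: Minimum length of the predicted spans. No limit when set to 0. Defaults to 0.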
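The interplay of threshold, min_length and max_length can be sketched in plain Python. The sketch assumes that every token scoring at or above the threshold as a start is paired with every later token scoring at or above the threshold as an end; that pairing rule, the >= comparison and the helper name candidate_spans are assumptions for illustration, not a description of the component's code.

```python
def candidate_spans(start_probs, end_probs, threshold, min_length=0, max_length=0):
    """Return (start, end) token offsets, end exclusive, for candidate spans."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for start in starts:
        for end in ends:
            length = end - start + 1
            if length < 1:
                continue  # the end token lies before the start token
            if min_length and length < min_length:
                continue  # 0 means no lower limit
            if max_length and length > max_length:
                continue  # 0 means no upper limit
            spans.append((start, end + 1))
    return spans


start_probs = [0.9, 0.1, 0.6, 0.2]
end_probs = [0.2, 0.8, 0.1, 0.7]
print(candidate_spans(start_probs, end_probs, threshold=0.5))
# [(0, 2), (0, 4), (2, 4)]
print(candidate_spans(start_probs, end_probs, threshold=0.5, max_length=2))
# [(0, 2), (2, 4)]
```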

Here is a config excerpt for the SpanFinder together with a SpanCategorizer:

[nlp]
lang = "en"
pipeline = ["tok2vec","span_finder","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.span_finder]
factory = "experimental_span_finder"
threshold = 0.35
predicted_key = "span_candidates"
training_key = ${vars.spans_key}
min_length = 0
max_length = 0

[components.span_finder.scorer]
@scorers = "spacy-experimental.span_finder_scorer.v1"
predicted_key = ${components.span_finder.predicted_key}
training_key = ${vars.spans_key}

[components.span_finder.model]
@architectures = "spacy-experimental.SpanFinder.v1"

[components.span_finder.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO=2

[components.span_finder.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.spancat]
factory = "spancat"
max_positive = null
spans_key = ${vars.spans_key}
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.spancat.suggester]
@misc = "spacy-experimental.span_finder_suggester.v1"
predicted_key = ${components.span_finder.predicted_key}

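This package includes a spaCy project that shows how to train and use the SpanFinder together with the SpanCategorizer.

Coreference components

The CoreferenceResolver and SpanResolver are designed to be used together to build a coreference pipeline, which allows you to identify which spans in a document refer to the same thing. Each component also includes an architecture and a scorer. For more details, see their pages in the main spaCy docs.

For an example of how to build a pipeline with the components, see the example coref project.

Architectures

None currently.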

Other

Tokenizers

  • spacy-experimental.char_pretokenizer.v1: Tokenize a text into individual characters.

Scorers

  • spacy-experimental.tokenizer_scorer.v1: Score tokenization.
  • spacy-experimental.tokenizer_senter_scorer.v1: Score tokenization and sentence segmentation.
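As a rough idea of what scoring tokenization involves, the sketch below computes token-level precision, recall and F-score by comparing character offsets of gold and predicted tokens. This is the generic metric, written in plain Python with hypothetical helper names; the exact scores and keys reported by the registered scorers are not described here.

```python
def token_offsets(text, tokens):
    """Map a token list to a set of (start, end) character offsets in text."""
    offsets = set()
    pos = 0
    for token in tokens:
        start = text.index(token, pos)
        offsets.add((start, start + len(token)))
        pos = start + len(token)
    return offsets


def tokenization_prf(text, gold_tokens, pred_tokens):
    """Return (precision, recall, F1) over exactly matching token offsets."""
    gold = token_offsets(text, gold_tokens)
    pred = token_offsets(text, pred_tokens)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


text = "This is a sentence."
gold = ["This", "is", "a", "sentence", "."]
pred = ["This", "is", "a", "sentence."]  # the final period was not split off
p, r, f = tokenization_prf(text, gold, pred)
print(p, r, round(f, 3))
# 0.75 0.6 0.667
```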

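Misc

Suggester functions for spancat:

Subtree suggesters: Use dependency annotation to suggest tokens together with their syntactic descendants.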

  • spacy-experimental.subtree_suggester.v1
  • spacy-experimental.ngram_subtree_suggester.v1

Chunk suggesters: Suggest noun chunks using the noun chunk iterator, which requires POS and dependency annotation.

  • spacy-experimental.chunk_suggester.v1
  • spacy-experimental.ngram_chunk_suggester.v1

Sentence suggesters: Use sentence boundaries to suggest sentence spans.

  • spacy-experimental.sentence_suggester.v1
  • spacy-experimental.ngram_sentence_suggester.v1

The package also contains a merge_suggesters function, which can be used to combine the suggestions from multiple suggesters.
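To illustrate what "a token with its syntactic descendants" means for the subtree suggesters listed above, here is a plain-Python sketch that derives one span per token from a list of head indices. It assumes a valid, projective dependency tree in which the root points to itself; subtree_spans is a hypothetical helper, and the real suggesters may filter or order their spans differently.

```python
def subtree_spans(heads):
    """Return sorted (start, end) spans, end exclusive, covering each token's subtree.

    heads[i] is the index of token i's head; the root is its own head.
    """
    n = len(heads)
    left = list(range(n))
    right = list(range(n))
    for i in range(n):
        # Walk from token i up to the root, widening every ancestor's span.
        j = i
        while heads[j] != j:
            j = heads[j]
            left[j] = min(left[j], i)
            right[j] = max(right[j], i)
    return sorted({(left[i], right[i] + 1) for i in range(n)})


# "She ate the green apple": "ate" is the root, "apple" heads "the" and "green".
heads = [1, 1, 4, 4, 1]
print(subtree_spans(heads))
# [(0, 1), (0, 5), (2, 3), (2, 5), (3, 4)]
```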

Here are two config excerpts for using the subtree suggester without and with the ngram functionality:

[components.spancat.suggester]
@misc = "spacy-experimental.subtree_suggester.v1"

[components.spancat.suggester]
@misc = "spacy-experimental.ngram_subtree_suggester.v1"
sizes = [1, 2, 3]
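The sizes setting lists the n-gram lengths to suggest. The sketch below enumerates the n-gram spans that a setting like sizes = [1, 2, 3] covers for a short doc; it shows the n-gram part only, in plain Python with a hypothetical helper name, and how the ngram variants combine these spans with the subtree suggestions is not shown.

```python
def ngram_spans(n_tokens, sizes):
    """Return (start, end) token spans, end exclusive, for every n-gram length in sizes."""
    spans = []
    for size in sizes:
        # A size larger than the doc yields an empty range and adds nothing.
        for start in range(0, n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


print(ngram_spans(4, [1, 2, 3]))
# [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3), (2, 4), (0, 3), (1, 4)]
```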

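Note that all the suggester functions are registered in @misc.

Bug reports and issues

Please report bugs in the spaCy issue tracker, or open a new thread on the discussion board for other issues.

Older documentation

See the READMEs in earlier tagged versions for details about components in earlier releases.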
