跳转到主要内容

你是否曾经遇到过需要Spacy TextCategorizer但没时间从头开始训练的情况?Classy Classification是你的最佳选择!

项目描述

Classy Classification

你是否曾经遇到过需要Spacy TextCategorizer但没时间从头开始训练的情况?Classy Classification是你的最佳选择!对于使用sentence-transformersspaCy模型的少样本分类,提供一个包含标签和示例的字典,或者只为Huggingface zero-shot classifiers的零样本分类提供一个标签列表。

Current Release Version pypi Version PyPi downloads Code style: black

安装

pip install classy-classification

SetFit支持

我收到了很多关于SetFit支持的请求,但我决定为这个功能创建一个独立的包。请随意查看。❤️

快速入门

SpaCy嵌入

import spacy
# or import standalone
# from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]

句子级分类

import spacy

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "include_sent": True
    }
)

print(nlp("I am looking for kitchen appliances. And I love doing so.").sents[0]._.cats)

# Output:
#
# [[{"furniture" : 0.21}, {"kitchen": 0.79}]

定义随机种子和详细程度

nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "verbose": True,
        "config": {"seed": 42}
    }
)

多标签分类

有时需要多个标签才能完全描述文本的内容。在这种情况下,我们希望使用多标签实现,这里标签分数之和不受限于1。只需将相同的训练数据传递给多个键。

import spacy

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa.",
               "We have a new dinner table.",
               "There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table.",
                "There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "multi_label": True,
    }
)

print(nlp("I am looking for furniture and kitchen equipment.")._.cats)

# Output:
#
# [{"furniture": 0.92}, {"kitchen": 0.91}]

异常检测

有时进行异常检测或二元分类是有价值的。这可以通过使用二元训练数据集来实现,但我还实现了对使用单个标签进行异常检测的OneClassSVM的支持。[注意:此方法不返回概率,但数据格式化为标签-分数值对,以确保一致性。

方法1

import spacy

data_binary = {
    "inlier": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "outlier": ["Text about kitchen equipment",
                "This text is about politics",
                "Comments about AI and stuff."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_binary,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'inlier': 0.2926672385488411, 'outlier': 0.707332761451159}]

方法2

import spacy

data_singular = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa.",
               "We have a new dinner table."]
}
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_singular,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'furniture': 0, 'not_furniture': 1}]

句子转换器嵌入

import spacy

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]

Huggingface零样本分类器

import spacy

data = ["furniture", "kitchen"]

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "typeform/distilbert-base-uncased-mnli",
        "cat_type": "zero",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]

致谢

灵感来源

Huggingface提供了一些针对少量/零样本分类的不错模型,但这些模型并不是针对多语言方法定制的。Rasa NLU有一种不错的处理方法,但它在Rasa/chatbots代码库中嵌入得太深,难以在外部使用。此外,将sentence-transformersHuggingface零样本集成到默认的词嵌入中似乎是合理的。最后,我决定与Spacy集成,因为如果你想要快速而简单的东西,训练定制的Spacy TextCategorizer似乎很麻烦。

或者给我买杯咖啡

"Buy Me A Coffee"

独立使用,不依赖spaCy

from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

classifier = ClassyClassifier(data=data)
classifier("I am looking for kitchen appliances.")
classifier.pipe(["I am looking for kitchen appliances."])

# overwrite training data
classifier.set_training_data(data=data)
classifier("I am looking for kitchen appliances.")

# overwrite [embedding model](https://sbert.net.cn/docs/pretrained_models.html)
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")
classifier("I am looking for kitchen appliances.")

# overwrite SVC config
classifier.set_classification_model(
    config={
        "C": [1, 2, 5, 10, 20, 100],
        "kernel": ["linear"],
        "max_cross_validation_folds": 5
    }
)
classifier("I am looking for kitchen appliances.")

保存和加载模型

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}
classifier = classyClassifier(data=data)

with open("./classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

f = open("./classifier.pkl", "rb")
classifier = pickle.load(f)
classifier("I am looking for kitchen appliances.")

项目详情


下载文件

为您的平台下载文件。如果您不确定选择哪个,请了解更多关于安装包的信息。

源代码分发

classy-classification-1.0.0.tar.gz (14.6 kB 查看哈希)

上传时间 源代码

构建分发

classy_classification-1.0.0-py3-none-any.whl (15.5 kB 查看哈希)

上传时间 Python 3

由以下支持