Text data synthesis and pseudo-labelling using LLMs

🦠 Mutate

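A library for synthesizing text datasets using large language models (LLMs). Mutate reads through the examples in a dataset and generates similar examples using automatically created few-shot prompts.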
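To make the idea of an automatically created few-shot prompt concrete, here is a minimal sketch of how such a prompt might be assembled from a task description and a handful of labelled examples. The function name `build_few_shot_prompt` and the prompt layout are assumptions made for illustration; they are not taken from Mutate's source and the library's real template may differ.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_desc: str,
    examples: List[Tuple[str, str]],
    text_alias: str = "Comment",
    label_alias: str = "sentiment",
    target_label: str = "pos",
) -> str:
    """Assemble a few-shot prompt (hypothetical layout, not Mutate's own template)."""
    blocks = [task_desc]
    # One block per existing example: the label first, then the text.
    for text, label in examples:
        blocks.append(f"{label_alias}: {label}\n{text_alias}: {text}")
    # End with an open slot so a causal LLM continues with a new example
    # of the requested class.
    blocks.append(f"{label_alias}: {target_label}\n{text_alias}:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    "Each item in the following contains movie reviews and corresponding sentiments.",
    [("A warm, funny film.", "pos"), ("Two hours I will never get back.", "neg")],
)
print(prompt)
```

The aliases play the same role as `text_column_alias` and `label_column_alias` in the usage examples: they give the model readable field names in place of raw column names.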

1. Installation

pip install mutate-nlp

pip install git+https://github.com/infinitylogesh/mutate

2. Usage

2.1 Synthesize text data from local CSV files

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-125M",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"


# returns a python generator
text_synth_gen = pipe("csv",
                    data_files=["local/path/sentiment_classfication.csv"],
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    text_column_alias="Comment",
                    label_column_alias="sentiment",
                    shot_count=5,
                    class_names=["pos","neg"])

# Loop through the generator to synthesize examples by class
for synthesized_examples in text_synth_gen:
    print(synthesized_examples)
Output:
{
    "text": ["The story was very dull and was a waste of my time. This was not a film I would ever watch. The acting was bad. I was bored. There were no surprises. They showed one dinosaur,",
    "I did not like this film. It was a slow and boring film, it didn't seem to have any plot, there was nothing to it. The only good part was the ending, I just felt that the film should have ended more abruptly."],
    "label":["neg","neg"]
}

{
    "text":["The Bell witch is one of the most interesting, yet disturbing films of recent years. It’s an odd and unique look at a very real, but very dark issue. With its mixture of horror, fantasy and fantasy adventure, this film is as much a horror film as a fantasy film. And it‘s worth your time. While the movie has its flaws, it is worth watching and if you are a fan of a good fantasy or horror story, you will not be disappointed."],
    "label":["pos"]
}

# and so on .....
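Each item yielded by the generator is a dictionary holding parallel `text` and `label` lists, as in the output above. The sketch below shows one way to flatten those batches into rows and save them as CSV. The helpers `iter_rows` and `write_csv` are hypothetical names introduced here, and `fake_batches` is a stand-in for the real generator so the snippet runs without a model.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple

Batch = Dict[str, List[str]]


def iter_rows(batches: Iterable[Batch]) -> Iterator[Tuple[str, str]]:
    """Flatten batches of parallel lists into (text, label) pairs."""
    for batch in batches:
        yield from zip(batch["text"], batch["label"])


def write_csv(batches: Iterable[Batch], handle: TextIO) -> int:
    """Write every synthesized example as one CSV row; return the row count."""
    writer = csv.writer(handle)
    writer.writerow(["text", "label"])
    count = 0
    for text, label in iter_rows(batches):
        writer.writerow([text, label])
        count += 1
    return count


# Stand-in for `text_synth_gen`, shaped like the output shown above.
fake_batches = [
    {"text": ["Dull and slow.", "No plot at all."], "label": ["neg", "neg"]},
    {"text": ["Worth your time."], "label": ["pos"]},
]

buffer = io.StringIO()
rows_written = write_csv(fake_batches, buffer)
print(rows_written)
```

In real use, pass the generator returned by `pipe(...)` in place of `fake_batches` and an open file handle (opened with `newline=""`) in place of `buffer`.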

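2.2 Synthesize text data from 🤗 datasets

Under the hood, Mutate uses the wonderful 🤗 datasets library for dataset processing, so it supports 🤗 datasets out of the box.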

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains customer service queries expressing the mentioned intent"

synthesizerGen = pipe("banking77",
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    # set an alias if the `text_column` name is not meaningful
                    text_column_alias="Queries",
                    # set an alias if the `label_column` name is not meaningful
                    label_column_alias="Intent",
                    shot_count=5,
                    dataset_args=["en"])


for exp in synthesizerGen:
    print(exp)
Output:
{"text":["How can i know if my account has been activated? (This is the one that I am confused about)",
         "Thanks! My card activated"],
"label":["activate_my_card",
         "activate_my_card"]
}

{
"text": ["How do i activate this new one? Is it possible?",
         "what is the activation process for this card?"],
"label":["activate_my_card",
         "activate_my_card"]
}

# and so on .....

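2.3 I am feeling lucky: loop through the dataset indefinitely to generate examples without end

Caution: looping through the dataset indefinitely makes duplicate examples more likely.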

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"


# returns a python generator
text_synth_gen = pipe("csv",
                    data_files=["local/path/sentiment_classfication.csv"],
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    text_column_alias="Comment",
                    label_column_alias="sentiment",
                    class_names=["pos","neg"],
                    # Flag to generate indefinite examples
                    infinite_loop=True)

# Infinite loop: stop it manually (e.g. with `break`) once you have enough examples
for exp in text_synth_gen:
    print(exp)
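Since an infinite generator never stops on its own and is prone to repeating itself, it can help to cap the number of batches consumed and drop duplicates along the way. The following is a minimal sketch using only the standard library. `unique_examples` and `endless_batches` are hypothetical names, and `endless_batches` merely imitates the shape of the generator's output.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Batch = Dict[str, List[str]]


def unique_examples(batches: Iterable[Batch]) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs, skipping texts already seen for that label."""
    seen = set()
    for batch in batches:
        for text, label in zip(batch["text"], batch["label"]):
            # Normalize case and whitespace so near-identical copies collapse.
            key = (" ".join(text.lower().split()), label)
            if key not in seen:
                seen.add(key)
                yield text, label


def endless_batches() -> Iterator[Batch]:
    """Stand-in for a generator created with `infinite_loop=True`."""
    while True:
        yield {"text": ["Great film!", "great   film!"], "label": ["pos", "pos"]}
        yield {"text": ["Boring."], "label": ["neg"]}


# Cap the *input* batches, not the unique output: if the source stops
# producing new examples, capping the output would never terminate.
collected = list(unique_examples(islice(endless_batches(), 10)))
print(collected)
```

Exact-match deduplication after light normalization is a deliberately simple choice; near-duplicate detection would need a similarity measure, which is outside the scope of this sketch.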

3. Support

3.1 Currently supports

  • Text classification dataset synthesis: few-shot synthesis of text classification data using causal (GPT-like) LLMs
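Few-shot synthesis by class implies choosing a small set of existing examples for each label before prompting the model. The sketch below shows one plausible way to do that selection. It is an assumption for illustration only: `sample_shots` is a hypothetical helper, and Mutate's own sampling strategy is not described in this README.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_shots(
    rows: Iterable[Tuple[str, str]], shot_count: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Pick up to `shot_count` example texts per class label."""
    by_class: Dict[str, List[str]] = defaultdict(list)
    for text, label in rows:
        by_class[label].append(text)
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return {
        label: rng.sample(texts, min(shot_count, len(texts)))
        for label, texts in by_class.items()
    }


rows = [
    ("Loved it.", "pos"),
    ("A delight.", "pos"),
    ("Superb cast.", "pos"),
    ("Terrible.", "neg"),
]
shots = sample_shots(rows, shot_count=2)
print({label: len(texts) for label, texts in shots.items()})
```

Classes with fewer examples than `shot_count` simply contribute everything they have, which keeps the helper usable on small or imbalanced datasets.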

3.2 Roadmap

  • Synthesis for other types of text datasets: NER, sentence pairs, etc.
  • Fine-tuning support for higher-quality generation
  • Pseudo-labelling

4. Credit

5. References

The idea of generating examples from large language models is inspired by the following works:
