Text data synthesis and pseudo labelling using LLMs
Project description
🦠 Mutate
A library to synthesize text datasets using Large Language Models (LLMs). Mutate reads examples from the dataset and synthesizes similar examples using auto-generated few-shot prompts.
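The exact prompt construction is internal to the library, but the general idea can be sketched as follows: take the task description, render a few labelled examples using readable column aliases, and leave the last item open for the LLM to complete. The template, aliases, and example texts below are purely illustrative and are not Mutate's actual internal format:

# Rough illustration only: the template, aliases and example texts are
# hypothetical and do not reflect Mutate's actual internal prompt format.
task_desc = ("Each item in the following contains movie reviews and "
             "corresponding sentiments. Possible sentiments are neg and pos")

shots = [
    ("The plot dragged and the ending made no sense.", "neg"),
    ("A warm, funny film with terrific performances.", "pos"),
]

prompt = task_desc + "\n\n"
for text, label in shots:
    prompt += f"Comment: {text}\nsentiment: {label}\n\n"
prompt += "Comment:"  # the LLM continues from here with a new, similar example

print(prompt)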
1. Install
pip install mutate-nlp
or
pip install git+https://github.com/infinitylogesh/mutate
2. Usage
2.1 Synthesize text data from a local CSV file
from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-125M",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"

# returns a Python generator
text_synth_gen = pipe("csv",
                      data_files=["local/path/sentiment_classfication.csv"],
                      task_desc=task_desc,
                      text_column="text",
                      label_column="label",
                      text_column_alias="Comment",
                      label_column_alias="sentiment",
                      shot_count=5,
                      class_names=["pos","neg"])

# Loop through the generator to synthesize examples by class
for synthesized_examples in text_synth_gen:
    print(synthesized_examples)
Output:
{
"text": ["The story was very dull and was a waste of my time. This was not a film I would ever watch. The acting was bad. I was bored. There were no surprises. They showed one dinosaur,",
"I did not like this film. It was a slow and boring film, it didn't seem to have any plot, there was nothing to it. The only good part was the ending, I just felt that the film should have ended more abruptly."]
"label":["neg","neg"]
}
{
"text":["The Bell witch is one of the most interesting, yet disturbing films of recent years. It’s an odd and unique look at a very real, but very dark issue. With its mixture of horror, fantasy and fantasy adventure, this film is as much a horror film as a fantasy film. And it‘s worth your time. While the movie has its flaws, it is worth watching and if you are a fan of a good fantasy or horror story, you will not be disappointed."],
"label":["pos"]
}
# and so on .....
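The pipeline yields plain Python dicts with "text" and "label" lists, so collecting the synthesized data is ordinary generator consumption. Below is a minimal sketch (not part of Mutate's API) that accumulates batches into a pandas DataFrame and writes them to disk, assuming text_synth_gen is a freshly created generator from the call above:

import pandas as pd

rows = {"text": [], "label": []}
for batch in text_synth_gen:              # generator from the snippet above
    rows["text"].extend(batch["text"])
    rows["label"].extend(batch["label"])
    if len(rows["text"]) >= 200:          # cap at ~200 synthetic examples
        break

pd.DataFrame(rows).to_csv("synthetic_sentiment.csv", index=False)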
2.2 Synthesize text data from 🤗 Datasets
Under the hood, Mutate uses the awesome 🤗 Datasets library for dataset processing, so it supports 🤗 Datasets out of the box.
from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains customer service queries expressing the mentioned intent"

synthesizerGen = pipe("banking77",
                      task_desc=task_desc,
                      text_column="text",
                      label_column="label",
                      text_column_alias="Queries",   # if the `text_column` doesn't have a meaningful value
                      label_column_alias="Intent",   # if the `label_column` doesn't have a meaningful value
                      shot_count=5,
                      dataset_args=["en"])

for exp in synthesizerGen:
    print(exp)
Output:
{"text":["How can i know if my account has been activated? (This is the one that I am confused about)",
"Thanks! My card activated"],
"label":["activate_my_card",
"activate_my_card"]
}
{
"text": ["How do i activate this new one? Is it possible?",
"what is the activation process for this card?"],
"label":["activate_my_card",
"activate_my_card"]
}
# and so on .....
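Because loading is delegated to 🤗 Datasets, you can also inspect the source dataset directly, for example to see which banking77 intents exist before synthesizing. The snippet below is plain 🤗 Datasets usage, independent of Mutate:

from datasets import load_dataset

ds = load_dataset("banking77", split="train")
label_names = ds.features["label"].names     # intent names backing the integer labels
print(len(label_names))                      # 77 intent classes
print("activate_my_card" in label_names)     # True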
2.3 I'm feeling lucky: loop over the dataset indefinitely to generate examples endlessly
Note: Looping over the dataset indefinitely increases the chance of generating duplicate examples; a simple de-duplication sketch follows the code below.
from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"

# returns a Python generator
text_synth_gen = pipe("csv",
                      data_files=["local/path/sentiment_classfication.csv"],
                      task_desc=task_desc,
                      text_column="text",
                      label_column="label",
                      text_column_alias="Comment",
                      label_column_alias="sentiment",
                      class_names=["pos","neg"],
                      # Flag to generate examples indefinitely
                      infinite_loop=True)

# Infinite loop
for exp in text_synth_gen:
    print(exp)
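Since the infinite mode can repeat itself, a simple client-side filter is often enough to keep only unique generations. Here is a minimal sketch (not part of Mutate's API) that de-duplicates on normalized text and stops after a fixed number of unique examples:

seen = set()
unique_rows = {"text": [], "label": []}

for batch in text_synth_gen:                      # infinite generator from above
    for text, label in zip(batch["text"], batch["label"]):
        key = text.strip().lower()
        if key in seen:
            continue                              # skip exact duplicates
        seen.add(key)
        unique_rows["text"].append(text)
        unique_rows["label"].append(label)
    if len(unique_rows["text"]) >= 1000:          # stop once enough unique examples are collected
        break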
3. Support
3.1 Currently supported
- Text classification dataset synthesis: few-shot synthesis of text classification data using causal LLMs (GPT-like models)
3.2 Roadmap
- Synthesis of other kinds of text datasets - NER, sentence pairs, etc.
- Fine-tuning support for higher-quality generation
- Pseudo labelling
4. Acknowledgements
- EleutherAI for democratizing large language models.
- This library uses 🤗 Datasets and 🤗 Transformers for dataset and model handling.
5. References
The idea of generating examples from a large language model is inspired by the following works:
- "A Few More Examples May Be Worth Billions of Parameters" by Yuval Kirstain, Patrick Lewis, Sebastian Riedel, Omer Levy
- "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation" by Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, Woomyeong Park
- "Data Augmentation using Pre-trained Transformer Models" by Varun Kumar, Ashutosh Choudhary, Eunah Cho