Text data synthesis and pseudo labelling using LLMs
Project description
🦠 Mutate
A library to synthesize text datasets using Large Language Models (LLMs). Mutate reads examples from the dataset and synthesizes similar examples using auto-generated few-shot prompts.
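The exact prompt construction is internal to the library, but the general idea can be sketched as follows: take the task description, render a few labelled examples using readable column aliases, and leave the last item open for the LLM to complete. The template, aliases, and example texts below are purely illustrative and are not Mutate's actual internal format:

# Rough illustration only: the template, aliases and example texts are
# hypothetical and do not reflect Mutate's actual internal prompt format.
task_desc = ("Each item in the following contains movie reviews and "
             "corresponding sentiments. Possible sentiments are neg and pos")

shots = [
    ("The plot dragged and the ending made no sense.", "neg"),
    ("A warm, funny film with terrific performances.", "pos"),
]

prompt = task_desc + "\n\n"
for text, label in shots:
    prompt += f"Comment: {text}\nsentiment: {label}\n\n"
prompt += "Comment:"  # the LLM continues from here with a new, similar example

print(prompt)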
1. Install
pip install mutate-nlp
or
pip install git+https://github.com/infinitylogesh/mutate
2. Usage
2.1 Synthesize text data from a local CSV file
from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-125M",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"

# returns a Python generator
text_synth_gen = pipe("csv",
                      data_files=["local/path/sentiment_classfication.csv"],
                      task_desc=task_desc,
                      text_column="text",
                      label_column="label",
                      text_column_alias="Comment",
                      label_column_alias="sentiment",
                      shot_count=5,
                      class_names=["pos","neg"])

# Loop through the generator to synthesize examples by class
for synthesized_examples in text_synth_gen:
    print(synthesized_examples)
Output:
{
"text": ["The story was very dull and was a waste of my time. This was not a film I would ever watch. The acting was bad. I was bored. There were no surprises. They showed one dinosaur,",
"I did not like this film. It was a slow and boring film, it didn't seem to have any plot, there was nothing to it. The only good part was the ending, I just felt that the film should have ended more abruptly."]
"label":["neg","neg"]
}
{
"text":["The Bell witch is one of the most interesting, yet disturbing films of recent years. It’s an odd and unique look at a very real, but very dark issue. With its mixture of horror, fantasy and fantasy adventure, this film is as much a horror film as a fantasy film. And it‘s worth your time. While the movie has its flaws, it is worth watching and if you are a fan of a good fantasy or horror story, you will not be disappointed."],
"label":["pos"]
}
# and so on .....
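The pipeline yields plain Python dicts with "text" and "label" lists, so collecting the synthesized data is ordinary generator consumption. Below is a minimal sketch (not part of Mutate's API) that accumulates batches into a pandas DataFrame and writes them to disk, assuming text_synth_gen is a freshly created generator from the call above:

import pandas as pd

rows = {"text": [], "label": []}
for batch in text_synth_gen:              # generator from the snippet above
    rows["text"].extend(batch["text"])
    rows["label"].extend(batch["label"])
    if len(rows["text"]) >= 200:          # cap at ~200 synthetic examples
        break

pd.DataFrame(rows).to_csv("synthetic_sentiment.csv", index=False)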
2.2 Synthesize text data from 🤗 Datasets
Under the hood, Mutate uses the awesome 🤗 Datasets library for dataset processing, so it supports 🤗 Datasets out of the box.
from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains customer service queries expressing the mentioned intent"

synthesizerGen = pipe("banking77",
                      task_desc=task_desc,
                      text_column="text",
                      label_column="label",
                      text_column_alias="Queries",   # if the `text_column` doesn't have a meaningful value
                      label_column_alias="Intent",   # if the `label_column` doesn't have a meaningful value
                      shot_count=5,
                      dataset_args=["en"])

for exp in synthesizerGen:
    print(exp)
Output:
{"text":["How can i know if my account has been activated? (This is the one that I am confused about)",
"Thanks! My card activated"],
"label":["activate_my_card",
"activate_my_card"]
}
{
"text": ["How do i activate this new one? Is it possible?",
"what is the activation process for this card?"],
"label":["activate_my_card",
"activate_my_card"]
}
# and so on .....
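Because loading is delegated to 🤗 Datasets, you can also inspect the source dataset directly, for example to see which banking77 intents exist before synthesizing. The snippet below is plain 🤗 Datasets usage, independent of Mutate:

from datasets import load_dataset

ds = load_dataset("banking77", split="train")
label_names = ds.features["label"].names     # intent names backing the integer labels
print(len(label_names))                      # 77 intent classes
print("activate_my_card" in label_names)     # True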
2.3 I'm feeling lucky: loop over the dataset indefinitely to generate examples endlessly
Note: Looping over the dataset indefinitely increases the chance of generating duplicate examples; a simple de-duplication sketch follows the code below.
from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"

# returns a Python generator
text_synth_gen = pipe("csv",
                      data_files=["local/path/sentiment_classfication.csv"],
                      task_desc=task_desc,
                      text_column="text",
                      label_column="label",
                      text_column_alias="Comment",
                      label_column_alias="sentiment",
                      class_names=["pos","neg"],
                      # Flag to generate examples indefinitely
                      infinite_loop=True)

# Infinite loop
for exp in text_synth_gen:
    print(exp)
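Since the infinite mode can repeat itself, a simple client-side filter is often enough to keep only unique generations. Here is a minimal sketch (not part of Mutate's API) that de-duplicates on normalized text and stops after a fixed number of unique examples:

seen = set()
unique_rows = {"text": [], "label": []}

for batch in text_synth_gen:                      # infinite generator from above
    for text, label in zip(batch["text"], batch["label"]):
        key = text.strip().lower()
        if key in seen:
            continue                              # skip exact duplicates
        seen.add(key)
        unique_rows["text"].append(text)
        unique_rows["label"].append(label)
    if len(unique_rows["text"]) >= 1000:          # stop once enough unique examples are collected
        break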
3. Support
3.1 Currently supported
- Text classification dataset synthesis: few-shot synthesis of text classification data using causal LLMs (GPT-like models)
3.2 Roadmap
- Synthesis of other kinds of text datasets - NER, sentence pairs, etc.
- Fine-tuning support for higher-quality generation
- Pseudo labelling
4. Acknowledgements
- EleutherAI for democratizing large language models.
- This library uses 🤗 Datasets and 🤗 Transformers for dataset and model handling.
5. References
The idea of generating examples from a large language model is inspired by the following works:
- "A Few More Examples May Be Worth Billions of Parameters" by Yuval Kirstain, Patrick Lewis, Sebastian Riedel, Omer Levy
- "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation" by Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, Woomyeong Park
- "Data Augmentation using Pre-trained Transformer Models" by Varun Kumar, Ashutosh Choudhary, Eunah Cho