Dataset Viber is your chill repo for data collection, annotation and vibe checks.
Dataset Viber
Avoid the hype, check the vibe!
I've created Dataset Viber, a set of tools that makes it easier to work with data for your AI models. Dataset Viber is here to make your data journey smoother and more fun. It is not meant for team collaboration or production use, and it doesn't try to be complex or formal - it is just a collection of **cool tools** to help you, as an AI engineer or hobbyist, collect feedback and run vibe checks. Want to see it in action? Just plug it in and start vibing with your data. It's that simple!
- CollectorInterface: lazily collect data from model interactions without human annotation.
- AnnotatorInterFace: annotate your way through your data with models in the loop.
- Synthesizer: synthesize data with distilabel in the loop.
- BulkInterface: explore your data distribution and annotate in bulk.
Need any tweaks or want to know more about a specific tool? Just open an issue or reach out to me with a suggestion!
[!NOTE]
- Data is logged to a local CSV file or directly to the Hugging Face Hub.
- All tools also run in .ipynb notebooks.
- Use models in the loop through fn_model.
- Feed inputs with a custom data streamer or a pre-built Synthesizer class through the fn_next_input argument.
- It supports tasks across multiple modalities, such as text, chat and image.
- Import from and export to the Hugging Face Hub or CSV files.
Installation

You can install the package via pip:

pip install dataset-viber

Or install the synthesizer extra. Note that the synthesizer extra depends on distilabel[hf-inference-endpoints], but you can also use any of the other LLMs that distilabel offers, e.g. distilabel[ollama]:

pip install dataset-viber[synthesizer]

Or install the bulk extra for the BulkInterface:

pip install dataset-viber[bulk]
How are we vibing?
CollectorInterface
Built on top of gr.Interface and gr.ChatInterface to lazily and automatically collect data from model interactions.
https://github.com/user-attachments/assets/4ddac8a1-62ab-4b3b-9254-f924f5898075
CollectorInterface
import gradio as gr
from dataset_viber import CollectorInterface
def calculator(num1, operation, num2):
if operation == "add":
return num1 + num2
elif operation == "subtract":
return num1 - num2
elif operation == "multiply":
return num1 * num2
elif operation == "divide":
return num1 / num2
inputs = ["number", gr.Radio(["add", "subtract", "multiply", "divide"]), "number"]
outputs = "number"
interface = CollectorInterface(
fn=calculator,
inputs=inputs,
outputs=outputs,
csv_logger=False, # True if you want to log to a CSV
dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
CollectorInterface.from_interface
interface = gr.Interface(
fn=calculator,
inputs=inputs,
outputs=outputs
)
interface = CollectorInterface.from_interface(
interface=interface,
csv_logger=False, # True if you want to log to a CSV
dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
CollectorInterface.from_pipeline
from transformers import pipeline
from dataset_viber import CollectorInterface
pipeline = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
interface = CollectorInterface.from_pipeline(
pipeline=pipeline,
csv_logger=False, # True if you want to log to a CSV
dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
AnnotatorInterFace
Built on top of the CollectorInterface to collect and annotate data and log it to the Hugging Face Hub.
Text
https://github.com/user-attachments/assets/d1abda66-9972-4c60-89d2-7626f5654f15
text-classification / multi-label-text-classification
from dataset_viber import AnnotatorInterFace
texts = [
"Anthony Bourdain was an amazing chef!",
"Anthony Bourdain was a terrible tv persona!"
]
labels = ["positive", "negative"]
interface = AnnotatorInterFace.for_text_classification(
texts=texts,
labels=labels,
multi_label=False, # True if you have multi-label data
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
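To actually put a model in the loop, fn_model can be any callable that returns a str, as the comment above notes. Below is a minimal sketch, assuming a small sentiment pipeline whose output is reduced to a label string; the model choice and the wrapper function are illustrative, not part of the library.

from transformers import pipeline
from dataset_viber import AnnotatorInterFace

# example sentiment model; it returns [{"label": "POSITIVE"/"NEGATIVE", "score": ...}]
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def sentiment_model(text: str) -> str:
    # reduce the pipeline output to a single label string
    return classifier(text)[0]["label"].lower()

interface = AnnotatorInterFace.for_text_classification(
    texts=texts,
    labels=labels,
    fn_model=sentiment_model,  # model in the loop, as described above
)
interface.launch()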
token-classification
from dataset_viber import AnnotatorInterFace
texts = ["Anthony Bourdain was an amazing chef in New York."]
labels = ["NAME", "LOC"]
interface = AnnotatorInterFace.for_token_classification(
texts=texts,
labels=labels,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
extractive-question-answering
from dataset_viber import AnnotatorInterFace
questions = ["Where was Anthony Bourdain located?"]
contexts = ["Anthony Bourdain was an amazing chef in New York."]
interface = AnnotatorInterFace.for_question_answering(
questions=questions,
contexts=contexts,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
text-generation / translation / completion
from dataset_viber import AnnotatorInterFace
prompts = ["Tell me something about Anthony Bourdain."]
completions = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
interface = AnnotatorInterFace.for_text_generation(
prompts=prompts, # source
completions=completions, # optional to show initial completion / target
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
text-generation-preference
from dataset_viber import AnnotatorInterFace
prompts = ["Tell me something about Anthony Bourdain."]
completions_a = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
completions_b = ["Anthony Michael Bourdain was an cool guy that knew how to cook."]
interface = AnnotatorInterFace.for_text_generation_preference(
prompts=prompts,
completions_a=completions_a,
completions_b=completions_b,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
Chat and multi-modal chat
https://github.com/user-attachments/assets/fe7f0139-95a3-40e8-bc03-e37667d4f7a9
[!TIP] I recommend uploading the files to cloud storage and using remote URLs to avoid any issues. This can be done with Hugging Face Datasets, as shown in the utils. Additionally, the Gradio Chatbot shows how to use the chatbot interface with multi-modal content.
chat-classification
from dataset_viber import AnnotatorInterFace
prompts = [
[
{
"role": "user",
"content": "Tell me something about Anthony Bourdain."
},
{
"role": "assistant",
"content": "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."
}
]
]
interface = AnnotatorInterFace.for_chat_classification(
prompts=prompts,
labels=["toxic", "non-toxic"],
multi_label=False, # True if you have multi-label data
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
chat-generation
from dataset_viber import AnnotatorInterFace
prompts = [
[
{
"role": "user",
"content": "Tell me something about Anthony Bourdain."
}
]
]
completions = [
"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
interface = AnnotatorInterFace.for_chat_generation(
prompts=prompts,
completions=completions,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
chat-generation-preference
from dataset_viber import AnnotatorInterFace
prompts = [
[
{
"role": "user",
"content": "Tell me something about Anthony Bourdain."
}
]
]
completions_a = [
"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
completions_b = [
"Anthony Michael Bourdain was an cool guy that knew how to cook."
]
interface = AnnotatorInterFace.for_chat_generation_preference(
prompts=prompts,
completions_a=completions_a,
completions_b=completions_b,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
Image and multi-modal
https://github.com/user-attachments/assets/57d89edf-ae40-4942-a20a-bf8443100b66
[!TIP] I recommend uploading the files to cloud storage and using remote URLs to avoid any issues. This can be done with Hugging Face Datasets.
image-classification / multi-label-image-classification
from dataset_viber import AnnotatorInterFace
images = [
"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
labels = ["anthony-bourdain", "not-anthony-bourdain"]
interface = AnnotatorInterFace.for_image_classification(
images=images,
labels=labels,
multi_label=False, # True if you have multi-label data
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-generation
from dataset_viber import AnnotatorInterFace
prompts = [
"Anthony Bourdain laughing",
"David Chang wearing a suit"
]
images = [
"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]
interface = AnnotatorInterFace.for_image_generation(
prompts=prompts,
completions=images,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-description
from dataset_viber import AnnotatorInterFace
images = [
"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
descriptions = ["Anthony Bourdain laughing", "David Chang wearing a suit"]
interface = AnnotatorInterFace.for_image_description(
images=images,
descriptions=descriptions, # optional to show initial descriptions
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-question-answering / visual-question-answering
from dataset_viber import AnnotatorInterFace
images = [
"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
questions = ["Who is this?", "What is he wearing?"]
answers = ["Anthony Bourdain", "a suit"]
interface = AnnotatorInterFace.for_image_question_answering(
images=images,
questions=questions, # optional to show initial questions
answers=answers, # optional to show initial answers
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-generation-preference
from dataset_viber import AnnotatorInterFace
prompts = [
"Anthony Bourdain laughing",
"David Chang wearing a suit"
]
images_a = [
"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]
images_b = [
"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
interface = AnnotatorInterFace.for_image_generation_preference(
prompts=prompts,
completions_a=images_a,
completions_b=images_b,
fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`
fn_next_input=None, # a function that feeds gradio components actively with the next input
csv_logger=False, # True if you want to log to a CSV
dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
Synthesizer
Built on top of distilabel to synthesize data with models in the loop.
[!TIP] You can also call the synthesizer directly to generate data. Use synthesizer() -> Tuple or Synthesizer.batch_synthesize(n) -> List[Tuple] to get inputs for the various tasks.
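For example, a minimal sketch of calling a synthesizer directly, based only on the signatures above; the prompt_context value is just the one used in the example below.

from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_text_classification(prompt_context="IMDB movie reviews")

next_input = synthesizer()               # Tuple with the next synthetic input
batch = synthesizer.batch_synthesize(5)  # List of 5 such Tuples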
text-classification
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_text_classification(prompt_context="IMDB movie reviews")
interface = AnnotatorInterFace.for_text_classification(
fn_next_input=synthesizer,
labels=["positive", "negative"]
)
interface.launch()
text-generation
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_text_generation(
prompt_context="A phone company customer support expert"
)
interface = AnnotatorInterFace.for_text_generation(
fn_next_input=synthesizer
)
interface.launch()
chat-classification
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_chat_classification(
prompt_context="A phone company customer support expert"
)
interface = AnnotatorInterFace.for_chat_classification(
fn_next_input=synthesizer,
labels=["positive", "negative"]
)
interface.launch()
chat-generation
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_chat_generation(
prompt_context="A phone company customer support expert"
)
interface = AnnotatorInterFace.for_chat_generation(
fn_next_input=synthesizer
)
interface.launch()
chat-generation-preference
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_chat_generation_preference(prompt_context="A phone company customer support expert")
interface = AnnotatorInterFace.for_chat_generation_preference(
fn_next_input=synthesizer
)
interface.launch()
image-classification
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_image_classification(prompt_context="A phone company customer support expert")
interface = AnnotatorInterFace.for_image_classification(
fn_next_input=synthesizer,
labels=["positive", "negative"]
)
interface.launch()
image-generation
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_image_generation(prompt_context="A phone company customer support expert")
interface = AnnotatorInterFace.for_image_generation(
fn_next_input=synthesizer
)
interface.launch()
image-description
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_image_description(prompt_context="A phone company customer support expert")
interface = AnnotatorInterFace.for_image_description(
fn_next_input=synthesizer
)
interface.launch()
image-question-answering
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_image_question_answering(prompt_context="A phone company customer support expert")
interface = AnnotatorInterFace.for_image_question_answering(
fn_next_input=synthesizer
)
interface.launch()
image-generation-preference
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer
synthesizer = Synthesizer.for_image_generation_preference(prompt_context="A phone company customer support expert")
interface = AnnotatorInterFace.for_image_generation_preference(
fn_next_input=synthesizer
)
interface.launch()
BulkInterface
Built on top of Dash, plotly-express, umap-learn and fast-sentence-transformers to embed your data, explore its distribution and annotate it in bulk.
https://github.com/user-attachments/assets/5e96c06d-e37f-45a0-9633-1a8e714d71ed
text-visualization
from dataset_viber import BulkInterface
from datasets import load_dataset
ds = load_dataset("SetFit/ag_news", split="train[:2000]")
interface: BulkInterface = BulkInterface.for_text_visualization(
ds.to_pandas()[["text", "label_text"]],
content_column='text',
label_column='label_text',
)
interface.launch()
text-classification
from dataset_viber import BulkInterface
from datasets import load_dataset
ds = load_dataset("SetFit/ag_news", split="train[:2000]")
df = ds.to_pandas()[["text", "label_text"]]
interface = BulkInterface.for_text_classification(
dataframe=df,
content_column='text',
label_column='label_text',
labels=df['label_text'].unique().tolist()
)
interface.launch()
chat-visualization
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset
ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]
interface = BulkInterface.for_chat_visualization(
dataframe=df,
chat_column='chosen',
)
interface.launch()
chat-classification
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset
ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]
interface = BulkInterface.for_chat_classification(
dataframe=df,
chat_column='chosen',
labels=["math", "science", "history", "question seeking"],
)
interface.launch()
Utils
Shuffle inputs in the same order
When working with multiple input lists, you might want to shuffle them in the same order so that items at the same index stay paired.
import random

def shuffle_lists(*lists):
if not lists:
return []
# Get the length of the first list
length = len(lists[0])
# Check if all lists have the same length
if not all(len(lst) == length for lst in lists):
raise ValueError("All input lists must have the same length")
# Create a list of indices and shuffle it
indices = list(range(length))
random.shuffle(indices)
# Reorder each list based on the shuffled indices
return [
[lst[i] for i in indices]
for lst in lists
]
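For example, a small hypothetical sketch; every list comes back reordered with the same permutation:

texts = ["a", "b", "c"]
labels = ["x", "y", "z"]
texts, labels = shuffle_lists(texts, labels)
# e.g. ["c", "a", "b"] and ["z", "x", "y"]: texts[i] still matches labels[i]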
Randomly swap completions
When working with multiple completion lists, you might want to swap completions across the lists at each index, i.e. each completion at index x is randomly exchanged with a completion at the same index from another list. This is useful for preference learning.
import random

def swap_completions(*lists):
# Assuming all lists are of the same length
length = len(lists[0])
# Check if all lists have the same length
if not all(len(lst) == length for lst in lists):
raise ValueError("All input lists must have the same length")
# Convert the input lists (which are tuples) to a list of lists
lists = [list(lst) for lst in lists]
# Iterate over each index
for i in range(length):
# Get the elements at index i from all lists
elements = [lst[i] for lst in lists]
# Randomly shuffle the elements
random.shuffle(elements)
# Assign the shuffled elements back to the lists
for j, lst in enumerate(lists):
lst[i] = elements[j]
return lists
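For example, a small hypothetical sketch with two completion lists like the ones in the preference examples above; after the call, which list holds which completion at a given index is random:

completions_a = ["A1", "A2", "A3"]
completions_b = ["B1", "B2", "B3"]
completions_a, completions_b = swap_completions(completions_a, completions_b)
# per index, the original A/B assignment is shuffled, so there is no fixed A/B ordering during annotation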
Load remote image URLs from the Hugging Face Hub
When working with images, you might want to load remote URLs from the Hugging Face Hub instead of handling local files.
from datasets import Image, load_dataset

dataset = load_dataset(
    "my_hf_org/my_image_dataset",
    split="train",  # load a concrete split so records can be indexed directly
).cast_column("my_image_column", Image(decode=False))
dataset[0]["my_image_column"]
# {'bytes': None, 'path': 'path_to_image.jpg'}
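A hedged follow-up sketch: assuming my_image_column stores remote paths/URLs (as the tips above suggest), you can collect them and pass them straight to an annotator; the repo id, column name and labels are placeholders.

from dataset_viber import AnnotatorInterFace

# gather the undecoded paths/URLs exposed by Image(decode=False)
image_urls = [record["my_image_column"]["path"] for record in dataset]

interface = AnnotatorInterFace.for_image_classification(
    images=image_urls,
    labels=["label-a", "label-b"],  # placeholder labels
)
interface.launch()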
Contributing and development setup

First, install PDM.

Then, install the environment; this automatically creates the .venv virtual environment and installs the development dependencies.

pdm install

Lastly, install the pre-commit hooks so formatting runs on every commit.

pre-commit install

Follow this guide to make your first contribution.
References
Logo
Inspiration
- https://hugging-face.cn/spaces/davidberenstein1957/llm-human-feedback-collector-chat-interface-dpo
- https://hugging-face.cn/spaces/davidberenstein1957/llm-human-feedback-collector-chat-interface-kto
- https://medium.com/@oxenai/collecting-data-from-human-feedback-for-generative-ai-ec9e20bf01b9
- https://hamel.dev/notes/llm/finetuning/04_data_cleaning.html