GritLM
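Generative Representational Instruction Tuning
This repository provides all materials for the paper Generative Representational Instruction Tuning. We continue to develop the repository and welcome any contributions. If you want to use the code exactly as in the paper, please use the 1.0.0 release (commit hash = 3ac39052ef878371a658a060e69f9c0124bfd59b).
Inference
Basic
pip install gritlm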
from gritlm import GritLM
# Loads the model for both capabilities; If you only need embedding pass `mode="embedding"` to save memory (no lm head)
model = GritLM("GritLM/GritLM-7B", torch_dtype="auto")
# To load the 8x7B you will likely need multiple GPUs.
# All the kwargs are passed to HF from_pretrained so you can just do the below to load on multiple GPUs:
# model = GritLM("GritLM/GritLM-8x7B", torch_dtype="auto", device_map="auto")
# You can also load other models e.g.
# model = GritLM("Muennighoff/SGPT-125M-weightedmean-nli-bitfit", pooling_method="weighted_mean", attn=None)
# model = GritLM("hkunlp/instructor-base", pooling_method="mean", attn=None)
### Embedding/Representation ###
instruction = "Given a scientific paper title, retrieve the paper's abstract"
queries = ['Bitcoin: A Peer-to-Peer Electronic Cash System', 'Generative Representational Instruction Tuning']
documents = [
"A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution. Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending. We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work. The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power. As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers. The network itself requires minimal structure. Messages are broadcast on a best effort basis, and nodes can leave and rejoin the network at will, accepting the longest proof-of-work chain as proof of what happened while they were gone.",
"All text-based language problems can be reduced to either generation or embedding. Current models only perform well at one or the other. We introduce generative representational instruction tuning (GRIT) whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) and outperforms all models up to its size on a range of generative tasks. By scaling up further, GritLM 8X7B outperforms all open generative language models that we tried while still being among the best embedding models. Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss. Among other benefits, the unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by > 60% for long documents, by no longer requiring separate retrieval and generation models. Models, code, etc. are freely available at https://github.com/ContextualAI/gritlm."
]
def gritlm_instruction(instruction):
    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"
# No need to add instruction for retrieval documents
d_rep = model.encode(documents, instruction=gritlm_instruction(""))
q_rep = model.encode(queries, instruction=gritlm_instruction(instruction))
from scipy.spatial.distance import cosine
cosine_sim_q0_d0 = 1 - cosine(q_rep[0], d_rep[0])
cosine_sim_q0_d1 = 1 - cosine(q_rep[0], d_rep[1])
cosine_sim_q1_d0 = 1 - cosine(q_rep[1], d_rep[0])
cosine_sim_q1_d1 = 1 - cosine(q_rep[1], d_rep[1])
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0][:15], documents[0][:15], cosine_sim_q0_d0))
# Cosine similarity between "Bitcoin: A Peer" and "A purely peer-t" is: 0.608
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0][:15], documents[1][:15], cosine_sim_q0_d1))
# Cosine similarity between "Bitcoin: A Peer" and "All text-based " is: 0.101
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[1][:15], documents[0][:15], cosine_sim_q1_d0))
# Cosine similarity between "Generative Repr" and "A purely peer-t" is: 0.120
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[1][:15], documents[1][:15], cosine_sim_q1_d1))
# Cosine similarity between "Generative Repr" and "All text-based " is: 0.533
### Generation ###
# We did not finetune GritLM models with system prompts, as you can just include system-like instructions together with your user instruction
messages = [
{"role": "user", "content": "Please write me a poem about my recent hike of Mt. Fuji at midnight in the style of Shakespeare."},
]
encoded = model.tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
encoded = encoded.to(model.device)
gen = model.generate(encoded, max_new_tokens=256, do_sample=False)
decoded = model.tokenizer.batch_decode(gen)
print(decoded[0])
"""
<s> <|user|>
Please write me a poem about my recent hike of Mt. Fuji at midnight in the style of Shakespeare.
<|assistant|>
Oh, Mt. Fuji, mountain grand,
A sight to see, a climb to command,
At midnight, in the dark of night,
I climbed your slopes, with all my might.
The stars above, they shone so bright,
A beacon in the darkness, guiding light,
The wind did blow, with a gentle sigh,
As I climbed higher, with a steady eye.
The path was steep, the climb was tough,
But I pressed on, with a steadfast rough,
For the summit, I longed to see,
The view from the top, a sight to be.
At last, I reached the peak, and stood,
With awe and wonder, I gazed aloud,
The world below, a sight to see,
A view that's worth the climb, you'll agree.
Mt. Fuji, mountain grand,
A sight to see, a climb to command,
At midnight, in the dark of night,
I climbed your slopes, with all my might.</s>
"""
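The embedding half of the example above comes down to two steps: wrap the optional instruction in the `<|user|>` / `<|embed|>` template, then rank documents by cosine similarity. The sketch below reproduces that logic with the standard library only, so it runs without downloading the model. The toy vectors and the `rank_documents` helper are made up for illustration and are not part of `gritlm`.

```python
import math

def gritlm_instruction(instruction):
    # Same template as in the example above; the instruction is optional
    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(q_vec, d_vecs):
    # Hypothetical helper: document indices sorted by descending similarity
    sims = [cosine_similarity(q_vec, d) for d in d_vecs]
    return sorted(range(len(d_vecs)), key=lambda j: sims[j], reverse=True)

# Toy 3-d vectors standing in for the output of model.encode(...)
toy_q = [1.0, 0.0, 1.0]
toy_docs = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0]]
print(rank_documents(toy_q, toy_docs))
print(repr(gritlm_instruction("")))
```

With real embeddings, `toy_q` and `toy_docs` would be replaced by rows of `q_rep` and `d_rep`.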
Caching
pip install gritlm
import numpy as np
import torch
from gritlm import GritLM
# Loads the model for both capabilities; If you only need embedding pass `mode="embedding"` to save memory (no lm head)
model = GritLM("GritLM/GritLM-7B", torch_dtype="auto")
# To load the 8x7B you will likely need multiple GPUs.
# All the kwargs are passed to HF from_pretrained so you can just do the below to load on multiple GPUs:
# model = GritLM("GritLM/GritLM-8x7B", torch_dtype="auto", device_map="auto")
# You can also load other models e.g.
# model = GritLM("Muennighoff/SGPT-125M-weightedmean-nli-bitfit", pooling_method="weighted_mean", attn=None)
# model = GritLM("hkunlp/instructor-base", pooling_method="mean", attn=None)
queries = ['Please explain to me how Bitcoin works.', 'What is "Generative Representational Instruction Tuning"?']
documents = [
"A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution. Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending. We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work. The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power. As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers. The network itself requires minimal structure. Messages are broadcast on a best effort basis, and nodes can leave and rejoin the network at will, accepting the longest proof-of-work chain as proof of what happened while they were gone.",
"All text-based language problems can be reduced to either generation or embedding. Current models only perform well at one or the other. We introduce generative representational instruction tuning (GRIT) whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) and outperforms all models up to its size on a range of generative tasks. By scaling up further, GritLM 8X7B outperforms all open generative language models that we tried while still being among the best embedding models. Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss. Among other benefits, the unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by > 60% for long documents, by no longer requiring separate retrieval and generation models. Models, code, etc. are freely available at https://github.com/ContextualAI/gritlm."
]
CACHE_FORMAT_DOC = "\n<|user|>\n{query}\n\nAnswer the prior query while optionally using the context prior to it\n<|assistant|>\n"
CACHE_FORMAT_QUERY = "\n<|user|>\n{doc}\n\nOptionally using the prior context answer the query prior to it\n<|assistant|>\n"
CACHE_FORMAT_QUERY_DOC = "\n<|user|>\nOptionally using the prior context answer the query prior to it\n<|assistant|>\n"
CACHE_FORMAT_DOC_QUERY = "\n<|user|>\nAnswer the prior query while optionally using the context prior to it\n<|assistant|>\n"
def gritlm_instruction(instruction):
    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"
### GRIT DOC CACHING ###
# cache: Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`
d_rep, d_cache = model.encode(documents, instruction=gritlm_instruction(""), get_cache=True)
q_rep = model.encode(queries, instruction=gritlm_instruction(""))
from scipy.spatial.distance import cosine
sims = {q: [1 - cosine(q_rep[i], d_rep[j]) for j in range(len(d_rep))] for i, q in enumerate(queries)}
for q, q_sims in sims.items():
    sim_idx = np.argmax(q_sims)
    cache = tuple([
        (d_cache[i][0][sim_idx:sim_idx+1], d_cache[i][1][sim_idx:sim_idx+1]) for i, c in enumerate(d_cache)
    ])
    # BOS is already in the cache
    inputs = model.tokenizer(CACHE_FORMAT_DOC.format(query=q), return_tensors="pt", add_special_tokens=False).to(model.device)
    inputs["use_cache"] = True
    # Attend to the cache too
    inputs["attention_mask"] = torch.cat((
        torch.ones((cache[0][0].shape[0], cache[0][0].shape[2]), dtype=torch.long, device=inputs["attention_mask"].device),
        inputs["attention_mask"],
    ), dim=1)
    generation = model.generate(**inputs, max_new_tokens=256, past_key_values=cache, do_sample=False)
    decoded = model.tokenizer.batch_decode(generation)
    print(decoded[0])
"""
<|user|>
What is "Generative Representational Instruction Tuning"?
Answer the prior query while optionally using the context prior to it
<|assistant|>
Generative Representational Instruction Tuning (GRIT) is a method for training language models that can perform both generative and embedding tasks. It involves training a large language model to handle both types of tasks by distinguishing between them through instructions. GRIT is designed to improve the performance of language models on both generative and embedding tasks, and it can be used to unify both types of tasks at no performance loss.</s>
"""
### GRIT QUERY CACHING ###
# cache: Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`
d_rep = model.encode(documents, instruction=gritlm_instruction(""))
q_rep, q_cache = model.encode(queries, instruction=gritlm_instruction(""), get_cache=True)
from scipy.spatial.distance import cosine
# For each document, score every query so that the argmax below indexes into q_cache
sims = {d: [1 - cosine(q_rep[j], d_rep[i]) for j in range(len(q_rep))] for i, d in enumerate(documents)}
for d, d_sims in sims.items():
    sim_idx = np.argmax(d_sims)
    cache = tuple([
        (q_cache[i][0][sim_idx:sim_idx+1], q_cache[i][1][sim_idx:sim_idx+1]) for i, c in enumerate(q_cache)
    ])
    # BOS is already in the cache
    inputs = model.tokenizer(CACHE_FORMAT_QUERY.format(doc=d), return_tensors="pt", add_special_tokens=False).to(model.device)
    inputs["use_cache"] = True
    # Attend to the cache too
    inputs["attention_mask"] = torch.cat((
        torch.ones((cache[0][0].shape[0], cache[0][0].shape[2]), dtype=torch.long, device=inputs["attention_mask"].device),
        inputs["attention_mask"],
    ), dim=1)
    generation = model.generate(**inputs, max_new_tokens=256, past_key_values=cache, do_sample=False)
    decoded = model.tokenizer.batch_decode(generation)
    print(decoded[0])
"""
<|user|>
All text-based language problems can be reduced to either generation or embedding. Current models only perform well at one or the other. We introduce generative representational instruction tuning (GRIT) whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) and outperforms all models up to its size on a range of generative tasks. By scaling up further, GritLM 8X7B outperforms all open generative language models that we tried while still being among the best embedding models. Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss. Among other benefits, the unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by > 60% for long documents, by no longer requiring separate retrieval and generation models. Models, code, etc. are freely available at https://github.com/ContextualAI/gritlm.
Optionally using the prior context answer the query prior to it
<|assistant|>
GRIT stands for generative representational instruction tuning. It is a method for training large language models to handle both generative and embedding tasks by distinguishing between them through instructions. GritLM is a large language model trained using GRIT that sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) and outperforms all models up to its size on a range of generative tasks. GritLM 8X7B is a larger version of GritLM that outperforms all open generative language models that were tried while still being among the best embedding models. GRIT matches training on only generative or embedding data, thus unifying both at no performance loss. This unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by > 60% for long documents, by no longer requiring separate retrieval and generation models. Models, code, etc. are freely available at <https://github.com/ContextualAI/gritlm>.</s>
"""
### GRIT QUERY-DOC CACHING ###
# cache: Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`
d_rep, d_cache = model.encode(documents, instruction=gritlm_instruction(""), get_cache=True, add_special_tokens=False)
q_rep, q_cache = model.encode(queries, instruction=gritlm_instruction(""), get_cache=True)
from scipy.spatial.distance import cosine
sims = {q: [1 - cosine(q_rep[i], d_rep[j]) for j in range(len(d_rep))] for i, q in enumerate(queries)}
for i, (q, q_sims) in enumerate(sims.items()):
    sim_idx = np.argmax(q_sims)
    cache_query = tuple([
        (q_cache[j][0][i:i+1], q_cache[j][1][i:i+1]) for j, c in enumerate(q_cache)
    ])
    cache_doc = tuple([
        (d_cache[j][0][sim_idx:sim_idx+1], d_cache[j][1][sim_idx:sim_idx+1]) for j, c in enumerate(d_cache)
    ])
    # For DOC-QUERY simply swap the order of the cache, change the format to CACHE_FORMAT_DOC_QUERY & set add_special_tokens=True in the `model.encode(..` above
    cache = [(
        torch.cat((layer[0], cache_doc[i][0]), dim=2),
        torch.cat((layer[1], cache_doc[i][1]), dim=2),
    ) for i, layer in enumerate(cache_query)]
    # BOS is already in the cache
    inputs = model.tokenizer(CACHE_FORMAT_QUERY_DOC, return_tensors="pt", add_special_tokens=False).to(model.device)
    inputs["use_cache"] = True
    # Attend to the cache too
    inputs["attention_mask"] = torch.cat((
        torch.ones((cache[0][0].shape[0], cache[0][0].shape[2]), dtype=torch.long, device=inputs["attention_mask"].device),
        inputs["attention_mask"],
    ), dim=1)
    generation = model.generate(**inputs, max_new_tokens=256, past_key_values=cache, do_sample=False)
    decoded = model.tokenizer.batch_decode(generation)
    print(decoded[0])
"""
<|user|>
Optionally using the prior context answer the query prior to it
<|assistant|>
Sure, here's an example of how the prior context could be used to answer a query:
Query: "What is GRIT?"
Prior context: "We introduce generative representation instruction tuning (GRIT) whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions."
Answer: GRIT is a method for training language models to handle both generative and embedding tasks by distinguishing between them through instructions.</s>
"""
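In the caching examples above, a precomputed key/value cache is placed in front of a new prompt, so the attention mask has to cover the cached positions as well as the prompt tokens. For query-doc caching, the two caches are first joined along the sequence dimension (`dim=2`). The sketch below tracks only the shapes involved, using plain Python tuples and lists instead of tensors. The helper names and the sizes are made up for illustration.

```python
from typing import List, Tuple

# (batch_size, num_heads, sequence_length, embed_size_per_head), as in the comments above
Shape = Tuple[int, int, int, int]

def concat_cache_shapes(first: Shape, second: Shape) -> Shape:
    # Mirrors torch.cat(..., dim=2): only the sequence length grows
    assert first[:2] == second[:2] and first[3] == second[3], "caches may differ only in sequence length"
    return (first[0], first[1], first[2] + second[2], first[3])

def extend_attention_mask(cache_shape: Shape, prompt_mask: List[List[int]]) -> List[List[int]]:
    # One run of ones per batch row for the cached positions, followed by the prompt's own mask
    batch_size, _, cache_len, _ = cache_shape
    assert len(prompt_mask) == batch_size
    return [[1] * cache_len + row for row in prompt_mask]

query_cache = (1, 32, 12, 128)   # made-up sizes for illustration
doc_cache = (1, 32, 200, 128)
joint = concat_cache_shapes(query_cache, doc_cache)
mask = extend_attention_mask(joint, [[1, 1, 1, 1, 1]])
print(joint, len(mask[0]))
```

The mask length equals the cached sequence length plus the number of prompt tokens, which is what the `torch.cat((torch.ones(...), inputs["attention_mask"]), dim=1)` call in the examples produces.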
Models
The weights and logs of all models from the paper are freely available:
- Weights: https://hugging-face.cn/GritLM
- Logs: https://wandb.ai/muennighoff/gritlm/overview?workspace=user-muennighoff
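Names do not always match between HF & WandB, but you can usually tell them apart via the `--output_dir` in the command. Note that at some point we renamed all models from `sgpt2` to `gritlm`, so some names/logs/commands contain the old name.
Shortcuts:
- sq = sequence length; sq2048 is 2048 tokens
- ep = epochs; ep1 is 1 epoch
- st = steps; st100 is 100 steps
- m7/m8x7/l7/g6 = base model is Mistral 7B/Mistral 8x7B/Llama 2 7B/GPT-J 6B
- emb/gen/gritlm = embedding, generative, unified
- bf16c = embeddings are cast back to bf16 after pooling and similarity computation (simulating how cached embeddings would operate)
- bb/cc/bbcc... = order of bidirectional vs. causal attention
- gendups = trained without `--use_unique_indices`. If it is not used and training is unified, the data will be duplicated, which hurts performance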
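As a reading aid for the shortcuts above, the following sketch decodes them from an underscore-separated run name. The `decode_run_name` helper and the sample name are hypothetical. Real run names contain further parts that this sketch simply ignores.

```python
import re

BASE_MODELS = {"m7": "Mistral 7B", "m8x7": "Mistral 8x7B", "l7": "Llama 2 7B", "g6": "GPT-J 6B"}
MODES = {"emb": "embedding", "gen": "generative", "gritlm": "unified"}
NUMERIC = {"sq": "sequence_length", "ep": "epochs", "st": "steps"}

def decode_run_name(name):
    """Hypothetical helper: map the shortcuts listed above to readable fields."""
    info = {}
    for part in name.split("_"):
        if part in BASE_MODELS:
            info["base_model"] = BASE_MODELS[part]
        elif part in MODES:
            info["mode"] = MODES[part]
        elif part == "bf16c":
            info["bf16_cast"] = True
        elif part == "gendups":
            info["use_unique_indices"] = False
        else:
            match = re.fullmatch(r"(sq|ep|st)(\d+)", part)
            if match:
                info[NUMERIC[match.group(1)]] = int(match.group(2))
            elif re.fullmatch(r"[bc]+", part):
                # Attention pattern such as "bbcc"; kept as a raw string
                info["attn"] = part
            # Anything else is not one of the listed shortcuts and is skipped
    return info

# Made-up run name for illustration
print(decode_run_name("gritlm_m7_sq2048_ep1_bbcc"))
```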
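The most important ones are:
Model | Description | Emb performance (MTEB) | Gen performance |
---|---|---|---|
GritLM-7B | 7B parameter model that uses bidirectional attention for embedding and causal attention for generation. It is finetuned from Mistral-7B. | 66.8 | 55.5 |
GritLM-8x7B | 8x7B parameter model that uses bidirectional attention for embedding and causal attention for generation. It is finetuned from Mistral-8x7B. | 65.7 | 65.7 |
Generative-only variant | 7B parameter model that is the generative-only equivalent of GritLM-7B. | 41.2 | 55.2 |
Embedding-only variant | 7B parameter model that is the embedding-only equivalent of GritLM-7B. | 66.8 | 7.6 |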
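For `GritLM-7B` and `GritLM-8x7B`, the folder contains a custom modeling file (`modeling_gritlm*.py`) that adds bidirectional attention via the keyword argument `is_causal`, so it is automatically available if you load them with `from_pretrained` in transformers. We did not add this for any other models uploaded to the organization, so for those you need to either add it yourself or simply replace the `modeling_mistral.py` and `modeling_mixtral.py` files in your transformers installation with `scripts/modeling_mistral_gritlm.py` and `scripts/modeling_mixtral_gritlm.py`. Note that for models that do not use bidirectional attention, or when you do not intend to use bidirectional attention (e.g. for generation), you do not need to do anything.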
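To make the difference between the two attention modes concrete, here is a conceptual sketch of the masks involved. It only illustrates what an `is_causal`-style switch changes and is not the code in `modeling_gritlm*.py`. The `attention_mask` function is a made-up name.

```python
def attention_mask(seq_len, is_causal):
    """Return a seq_len x seq_len matrix where 1 means "position i may attend to position j"."""
    return [
        [1 if (not is_causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

causal = attention_mask(4, is_causal=True)          # lower-triangular: no looking ahead
bidirectional = attention_mask(4, is_causal=False)  # all ones: every token sees every token

for row in causal:
    print(row)
for row in bidirectional:
    print(row)
```

Under causal attention each token sees only itself and earlier tokens, which is what generation needs. Under bidirectional attention every token sees the full input, which is what the embedding mode of the models above uses.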
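Training
Data
The repo uses the following format. See `training/toy_data.jsonl` for an example.
Format:
- Embedding data:
{"query": str, "pos": List[str], "neg": List[str]}
- Embedding data with instructions that are excluded from the embedding & loss:
{"query": List[str, str], "pos": List[List[str, str]], "neg": List[List[str, str]]}
- The first element of the inner list is the instruction and the second is the text to embed.
- Generative data:
{"text": str}
- Generative data with instructions that are excluded from the loss:
{"text": List[str]}
- The 1st/3rd/5th etc. elements are the instructions, the 2nd/4th/6th etc. are the responses. If you only want single-turn data, put just two elements; for multi-turn, put more.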
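The four record shapes above can be sketched as JSON Lines as follows. The sample texts are made up, and `is_valid_record` is a hypothetical helper, not something shipped in the repository. It also assumes that instruction/response turns always come in pairs, so a `text` list has even length.

```python
import json

def _is_text(item):
    # A plain string, or an [instruction, text] pair
    return isinstance(item, str) or (
        isinstance(item, list) and len(item) == 2 and all(isinstance(s, str) for s in item)
    )

def is_valid_record(record):
    """Hypothetical validator for the four record shapes listed above."""
    if "text" in record:
        text = record["text"]
        if isinstance(text, str):
            return True
        # Assumption: turns come in instruction/response pairs, so the length is even
        return (isinstance(text, list) and len(text) >= 2 and len(text) % 2 == 0
                and all(isinstance(s, str) for s in text))
    if all(key in record for key in ("query", "pos", "neg")):
        return (_is_text(record["query"])
                and isinstance(record["pos"], list) and all(_is_text(p) for p in record["pos"])
                and isinstance(record["neg"], list) and all(_is_text(n) for n in record["neg"]))
    return False

# Made-up samples, one per format
records = [
    {"query": "What is GRIT?", "pos": ["GRIT unifies embedding and generation."], "neg": ["Bitcoin is electronic cash."]},
    {"query": ["Retrieve the abstract", "GRIT"], "pos": [["", "GRIT unifies both tasks."]], "neg": [["", "Bitcoin is electronic cash."]]},
    {"text": "A plain generative sample."},
    {"text": ["Write a haiku.", "Mountains in the mist"]},
]

# One JSON object per line, then read it back and validate
jsonl = "\n".join(json.dumps(r) for r in records)
parsed = [json.loads(line) for line in jsonl.splitlines()]
print([is_valid_record(r) for r in parsed])
```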
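We release the following datasets:
- Embedding
- Generative
They are explained in more detail in the paper and its appendix. For example, to train a GRIT model on MEDI2 & Tulu2, simply download them via `git clone https...`, place them in the same directory, and then follow the instructions below to run. Unfortunately, we cannot release the E5S data used for our final models.
Run
Setup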
# First install PyTorch (https://pytorch.ac.cn/get-started/locally/; we used torch==2.2.0 with NVIDIA-SMI 535.104.05, Driver Version: 535.104.05, CUDA Version: 12.2), then do the below
git clone https://github.com/ContextualAI/gritlm
cd gritlm
pip install -e .
# If you want to use GradCache, you need to use the one in this repository
cd gritlm/training/GradCache
pip install -e .
cd ../..
Below are a few examples to get started:
Embedding model
torchrun --nproc_per_node 1 \
-m training.run \
--output_dir test_path \
--model_name_or_path openaccess-ai-collective/tiny-mistral \
--train_data training/toy_data/toy_data_embedding.jsonl \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--dataloader_drop_last True \
--normalized True \
--temperature 0.02 \
--query_max_len 32 \
--passage_max_len 128 \
--train_group_size 2 \
--mode embedding \
--attn cccc
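The flags `--normalized True`, `--temperature 0.02` and `--train_group_size 2` in the command above suggest a temperature-scaled contrastive objective over cosine similarities, with each query scored against one positive and one negative passage. The sketch below shows that kind of loss in plain Python under exactly this assumption. It is not taken from the training code, which should be consulted for the actual objective, and the function names and vectors are made up.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(query, candidates, temperature=0.02, positive_index=0):
    """Cross-entropy over temperature-scaled cosine similarities (InfoNCE-style)."""
    q = normalize(query)
    logits = [sum(a * b for a, b in zip(q, normalize(c))) / temperature for c in candidates]
    # Numerically stable log-sum-exp
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]

# Made-up 2-d vectors: the positive points roughly along the query, the negative is orthogonal
toy_query = [1.0, 0.0]
toy_positive = [0.9, 0.1]
toy_negative = [0.0, 1.0]

loss_good = contrastive_loss(toy_query, [toy_positive, toy_negative])
loss_bad = contrastive_loss(toy_query, [toy_negative, toy_positive])  # labels swapped
print(loss_good, loss_bad)
```

A small temperature sharpens the softmax, so a correctly ranked positive gives a loss near zero while a mislabeled pair is penalized heavily.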
Generative model
torchrun --nproc_per_node 1 \
-m training.run \
--output_dir test_path \
--model_name_or_path openaccess-ai-collective/tiny-mistral \
--train_data training/toy_data/toy_data_generative.jsonl \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--dataloader_drop_last True \
--passage_max_len 128 \
--mode generative \
--attn cccc
Unified model (GRIT)
torchrun --nproc_per_node 1 \
-m training.run \
--output_dir test_path \
--model_name_or_path openaccess-ai-collective/tiny-mistral \
--train_data training/toy_data \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--dataloader_drop_last True \
--normalized True \
--temperature 0.02 \
--query_max_len 32 \
--passage_max_len 128 \
--train_group_size 2 \
--mode unified \
--attn cccc
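An explanation of all arguments can be found in `training/arguments.py` or the HF TrainingArguments documentation, except for `nproc_per_node`, which is the number of GPUs per node. For our actual training runs, we use accelerate to easily use multiple nodes and GPUs, as well as slightly different settings (e.g. `--attn bbcc`). All scripts are in `scripts/training`; for example, the script used for GritLM-8x7B is `scripts/training/train_gritlm_8x7b.sh`. For models from the ablations, you can check their folder on the huggingface hub, which contains a `training_args.bin` file with the arguments. You can also check all arguments on WandB: https://wandb.ai/muennighoff/gritlm. After training, you may first have to run `python scripts/reformat_statedict.py path_to_statedict` to remove the `model.` prefix from the checkpoint, and then you can shard the checkpoint via `python scripts/shard.py path_to_model_folder` to make it easier to use.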
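Removing a key prefix from a checkpoint amounts to renaming dictionary keys. The sketch below shows the idea on a plain dict. It is not the implementation of `scripts/reformat_statedict.py`, and the key names and placeholder values are made up (a real state dict maps names to tensors).

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy whose keys have one leading `prefix` removed; other keys stay unchanged."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Made-up keys; integers stand in for tensors
toy_state_dict = {
    "model.model.embed_tokens.weight": 0,
    "model.lm_head.weight": 1,
    "extra.bias": 2,
}
stripped = strip_prefix(toy_state_dict)
print(sorted(stripped))
```

Only one leading occurrence of the prefix is removed, so a key such as `model.model.embed_tokens.weight` becomes `model.embed_tokens.weight` rather than `embed_tokens.weight`.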
Alignment
For the experiments on aligning GritLM with KTO, we use https://github.com/huggingface/trl with the scripts at https://github.com/Muennighoff/kto.
Evaluation
Embedding
cd gritlm
python evaluation/eval_mteb.py \
--model_name_or_path GritLM/GritLM-7B \
--task_types Classification,Clustering,PairClassification,Reranking,Retrieval,STS,Summarization \
--batch_size 32
For a faster implementation, check `scripts/eval_mteb.sh`, which submits jobs across multiple GPUs for each dataset.
Generative
## Setup
# Setup eval for MMLU/GSM8K/BBH/TyDi QA/Alpaca
git clone https://github.com/Muennighoff/open-instruct.git
cd open-instruct
pip install -r requirements.txt
bash ./scripts/prepare_eval_data.sh
cd ..
# Setup eval for HumanEvalPack
git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .
cd ..
MODEL_PATH=GritLM/gritlm-7b
# Run all evals except for Alpaca; You may have to change some paths etc.
bash scripts/generative_eval.sh {path to model}
# Run Alpaca 1.0
export OPENAI_API_KEY=YOUR_API_KEY
python -m eval.alpaca_farm.run_eval \
--use_vllm \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--save_dir ./ \
--use_chat_format \
--chat_formatting_function eval.templates.create_prompt_with_gritlm_chat_format
# Alpaca 2.0 (not used in the paper)
python -m eval.alpaca_farm.run_eval \
--use_vllm \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--save_dir $MODEL_PATH \
--use_chat_format \
--chat_formatting_function eval.templates.create_prompt_with_gritlm_chat_format \
--alpaca2
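Known issues
- If you train on many nodes + a large model + fsdp, you may run into timeouts when saving the checkpoint with `FULL_STATE_DICT`. For example, training Mixtral on 32 nodes with 8 GPUs each fails with the error below. Usually the main node will still finish saving, unless all nodes are under the same job manager, which then kills it. Unfortunately, increasing the timeout limit does not seem to be possible? (See https://discuss.pytorch.org/t/how-to-set-nccl-timeout-to-infinity/146006 ; https://github.com/huggingface/accelerate/issues/2236#issuecomment-1864809701) So the current workaround is to use fewer nodes or to make sure the saving process is not killed. If you have a better solution, please let us know.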
[dojo-a3-ghpc-9:1]: what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=683, OpType=_ALLGATHER_BASE, NumelIn=32768512, NumelOut=262148096, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
- Adding packing may be possible, at least for generative training; care is needed with NextTokenLoss
- QLoRa / LoRa integration is not well tested
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [20, 2048]] is at version 21; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
- If you run into the error below during multi-node training, try https://github.com/huggingface/transformers/issues/26971#issuecomment-1868137087
load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/env/lib/conda/gritlm/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
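- DeepSpeed does not work with `--mode unified` and `--gradient_accumulation_steps` greater than 1 (i.e. GradCache) (FSDP is roughly equivalent, so this is not a high priority)
- `fsdp_use_orig_params: true` in the accelerate config is critical for performance; otherwise training may not converge at all (see the comparison in the WandB runs)
- If you run into the error below when saving, upgrade accelerate & transformers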
01/06/2024 08:28:40 - INFO - accelerate.utils.fsdp_utils - Model saved to /data/niklas/gritlm/gritlm_mist_sq2048_medibgetuluv2_tuluformat_8nodes_oldtracc/tmp-checkpoint-500/pytorch_model.bin
01/06/2024 08:30:24 - INFO - accelerate.utils.fsdp_utils - Saving Optimizer state to /data/niklas/gritlm/gritlm_mist_sq2048_medibgetuluv2_tuluformat_8nodes_oldtracc/tmp-checkpoint-500/optimizer.bin
Traceback (most recent call last):
File "/env/lib/conda/gritlmold/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/env/lib/conda/gritlmold/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/niklas/gritlm/training/run.py", line 421, in <module>
main()
File "/home/niklas/gritlm/training/run.py", line 411, in main
trainer.train()
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/niklas/gritlm/training/gradcache_trainer.py", line 962, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/transformers/trainer.py", line 2354, in _save_checkpoint
self._save_optimizer_and_scheduler(staging_output_dir)
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/transformers/trainer.py", line 2445, in _save_optimizer_and_scheduler
save_fsdp_optimizer(
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/accelerate/utils/fsdp_utils.py", line 146, in save_fsdp_optimizer
torch.save(optim_state, output_optimizer_file)
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/torch/serialization.py", line 618, in save
with _open_zipfile_writer(f) as opened_zipfile:
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/torch/serialization.py", line 492, in _open_zipfile_writer
return container(name_or_buffer)
File "/env/lib/conda/gritlmold/lib/python3.9/site-packages/torch/serialization.py", line 463, in __init__
super().__init__(torch._C.PyTorchFileWriter(self.name))
RuntimeError: Parent directory /data/niklas/gritlm/gritlm_mist_sq2048_medibgetuluv2_tuluformat_8nodes_oldtracc/tmp-checkpoint-500 does not exist.
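- If the loss is slightly different when changing the number of gradient accumulation steps, this is expected, because torch uses weighted mean averaging in its CrossEntropyLoss by default. Since the language modeling objective sometimes predicts the same token multiple times in one batch, splitting the batch results in a different loss. Meanwhile, for the embedding loss each class ID is only predicted once, so the weighted mean is equivalent to the mean for embedding (see https://github.com/pytorch/pytorch/issues/72047; https://github.com/pytorch/pytorch/issues/40560; https://github.com/pytorch/pytorch/issues/107680).
- Another reason the loss differs when changing the number of processes is that the data order may be different. While all seeds are set, `accelerate.prepare` of the dataloader in the trainer sets up the dataloader such that it iterates one sample ahead. Therefore, on the first iteration each process fetches two batches instead of one. Somehow this causes one sample from the first batch to land in a later batch when going from 0 to 8 GPUs. I could not figure out exactly why, but investigations are welcome.
- Training with fp32 generally converges much faster than with bf16. Changing the allreduce and buffer dtypes to fp32 does not change this (https://github.com/NVIDIA/Megatron-LM/issues/502; https://github.com/pytorch/pytorch/issues/106395). However, in the ablations in the paper, training fully in fp32 did not perform better.
- torch.compile fails in unified mode (see also https://github.com/pytorch/pytorch/issues/111317)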
from user code:
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 757, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 89, in forward
return self.weight * hidden_states.to(input_dtype)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
example_value = wrap_to_fake_tensor_and_record(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1587, in wrap_to_fake_tensor_and_record
fake_e = wrap_fake_exception(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
return fn()
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1588, in <lambda>
lambda: tx.fake_mode.from_tensor(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1721, in from_tensor
return self.fake_tensor_converter(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 371, in __call__
return self.from_real_tensor(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 324, in from_real_tensor
out = self.meta_converter(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_subclasses/meta_utils.py", line 591, in __call__
r = self.meta_tensor(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_subclasses/meta_utils.py", line 307, in meta_tensor
base = self.meta_tensor(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_subclasses/meta_utils.py", line 478, in meta_tensor
r.grad = self.meta_tensor(
torch._dynamo.exc.InternalTorchDynamoError: attempting to assign a gradient of size '[27264000]' to a tensor of size '[218112000]'. Please ensure that the gradient and the tensor are the same size
- DeepSpeed + FlashAttention2 + Optim & Params offloaded to CPU + DeepSpeed ZeRo3 init fails:
s. (Triggered internally at /opt/conda/conda-bld/pytorch_1702400412039/work/torch/csrc/tensor/python_tensor.cpp:83.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
Invalidate trace cache @ step 1: expected module 1, but got module 2
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: Work
- If you implement full splitting + GC, you may run into the error below
File "/home/niklas/gritlm/training/gradcache_trainer.py", line 630, in _inner_training_loop
self.accelerator.backward(loss)
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/accelerate/accelerator.py", line 1964, in backward
loss.backward(**kwargs)
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1075, in unpack_hook
frame.check_recomputed_tensors_match(gid)
File "/env/lib/conda/gritlmnew/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 812, in check_recomputed_tensors_match
raise CheckpointError(
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
Number of tensors saved during forward: 47
Number of tensors saved during recomputation: 45
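Regarding the known issue above about the loss changing with the number of gradient accumulation steps, one way such a difference can arise is simple arithmetic: averaging per-micro-batch means is not the same as taking one mean over all tokens when the micro-batches contribute different numbers of tokens. The numbers below are made up, and this is an illustration of that one mechanism under this assumption, not a trace of the trainer.

```python
def mean(values):
    return sum(values) / len(values)

# Made-up per-token losses; the two micro-batches hold different numbers of target tokens
micro_batch_a = [2.0, 2.0, 2.0, 2.0]
micro_batch_b = [4.0, 4.0]

# One step over the full batch: a single mean over all 6 tokens
full_batch_loss = mean(micro_batch_a + micro_batch_b)

# Two accumulation steps: each micro-batch is averaged first, then the two means are averaged
accumulated_loss = mean([mean(micro_batch_a), mean(micro_batch_b)])

print(full_batch_loss, accumulated_loss)
```

The two results coincide only when every micro-batch carries the same number of loss-contributing tokens.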
Visuals
- Figure 1: `visuals/performance.pdf`; `visuals/grit_plots.ipynb` / colab, then the logos are added via `visuals/performance.drawio`, which can be opened with https://app.diagrams.net/, and the annotations are added via `visuals/performance.key`, which can be opened with Keynote.
- Figure 2: `visuals/octopus.pdf`; https://docs.google.com/drawings/d/1ZAzaX4h2JfJR1ahan0R5nk3Xm17SMquGjhshnBNJOzY/edit?usp=sharing
- Figure 3: `visuals/format.pdf`; https://docs.google.com/drawings/d/1vaSNvDWy6xBBuC70rI22qdOmymksxqoTYiplGPH22ys/edit?usp=sharing
- Figure 4: `visuals/rag.pdf`; https://docs.google.com/drawings/d/1rv916zpYvBbaS6QxpFP4_6fc4gABcPWc2qZC3NUpz8s/edit?usp=sharing
- Figure 5/6/7/8: `visuals/latency.pdf` / `visuals/loss7.pdf` / `visuals/loss8x7.pdf` / `visuals/embmem.pdf`; `visuals/grit_plots.ipynb` / colab
- Other figures and tables are made manually, but there are some helper scripts, e.g. `scripts/mteb_to_tex.py`
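Acknowledgements
The code is inspired by:
- https://github.com/Muennighoff/sgpt
- https://github.com/FlagOpen/FlagEmbedding
- https://github.com/embeddings-benchmark/mteb
See the paper for further acknowledgements.
Citation
If you find this useful, please consider citing 😊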
@misc{muennighoff2024generative,
title={Generative Representational Instruction Tuning},
author={Niklas Muennighoff and Hongjin Su and Liang Wang and Nan Yang and Furu Wei and Tao Yu and Amanpreet Singh and Douwe Kiela},
year={2024},
eprint={2402.09906},
archivePrefix={arXiv},
primaryClass={cs.CL}
}