I'm trying to implement a complex training pipeline where models can be re-finetuned in an RL style. However, I can't make it work using transformers + peft. The issue is that transformers refuses to load the correct model. Here is a minimal example:
import pathlib
import torch
from peft import LoraConfig, TaskType, get_peft_model, PeftConfig, PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer, ModernBertForSequenceClassification
def init_model(path_to_dir: pathlib.Path) -> None:
    base_model = AutoModelForSequenceClassification.from_pretrained(
        pretrained_model_name_or_path="answerdotai/ModernBERT-large",
        num_labels=1,
        torch_dtype=torch.float32,
        problem_type="regression",
        device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    tokenizer.add_tokens(["[USER]", "[/USER]", "[EOT]"])
    tokenizer.chat_template = (
        "{% for i in range(0, messages|length, 2) %}"
        "{% if i + 1 < messages|length %}"
        "[USER]{{ messages[i].content }}[/USER] {{ messages[i+1].content }}[EOT]\n"
        "{% endif %}"
        "{% endfor %}"
    )
    base_model.resize_token_embeddings(len(tokenizer))
    peft_config = LoraConfig(
        r=4,
        lora_alpha=32,
        task_type=TaskType.SEQ_CLS,
        target_modules="all-linear"
    )
    model = get_peft_model(base_model, peft_config)
    model.save_pretrained(path_to_dir)
    model.base_model.save_pretrained(path_to_dir)
    tokenizer.save_pretrained(path_to_dir)

def reload_model(path_to_dir: pathlib.Path) -> None:
    tokenizer = AutoTokenizer.from_pretrained(path_to_dir)
    base_model = ModernBertForSequenceClassification.from_pretrained(
        str(path_to_dir),
        num_labels=1,
        torch_dtype=torch.float32,
        device_map="cuda"
    )
    config = PeftConfig.from_pretrained(str(path_to_dir))
    base_model.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(
        base_model,
        str(path_to_dir),
        is_trainable=True,
        config=config,
        device_map="cuda"
    )

if __name__ == "__main__":
    init_model(pathlib.Path("/tmp/test"))
    reload_model(pathlib.Path("/tmp/test"))
In the above example, I expect a model to be initialized (randomly, that's fine), stored to disk, and then reloaded. In the real world, the model would make predictions, a score would be computed, and then, on the next step, the model would be reloaded, finetuned, and stored again for the following training step.
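For context, the loop I have in mind looks roughly like the sketch below. Apart from init_model it is all placeholders: in my real code reload_model returns the model and tokenizer, and model_predict, compute_score, finetune and save_checkpoint stand in for my actual prediction, scoring and training code.

# Rough sketch of the intended RL-style loop; all helpers except init_model are placeholders.
checkpoint_dir = pathlib.Path("/tmp/test")
init_model(checkpoint_dir)                               # step 0: random head, saved to disk
for step in range(n_steps):                              # n_steps: placeholder
    model, tokenizer = reload_model(checkpoint_dir)      # in my real code this returns both
    predictions = model_predict(model, tokenizer, batch) # placeholder inference
    rewards = compute_score(predictions)                 # placeholder reward computation
    finetune(model, rewards)                             # placeholder LoRA training step
    save_checkpoint(model, tokenizer, checkpoint_dir)    # overwrite checkpoint for the next step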
Now, when I run this script, I face two issues I can't work around.
First, transformers seems to ignore that the model was previously initialized and does not load classifier.weight and classifier.bias:
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-large and are newly initialized: ['classifier.bias', 'classifier.weight']
Secondly, it does not recognize that I have resized the base model's token embeddings (i.e., base_model.resize_token_embeddings(len(tokenizer))) and it throws an error:
Error(s) in loading state_dict for ModernBertForSequenceClassification:
size mismatch for model.embeddings.tok_embeddings.weight: copying a param with shape torch.Size([50371, 1024]) from checkpoint, the shape in current model is torch.Size([50368, 1024]).
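To make the numbers concrete, here is a quick check (a diagnostic sketch; I believe [PAD] already exists in the stock ModernBERT tokenizer, so only the three added tokens change its length):

# Diagnostic sketch: where 50371 and 50368 come from.
from transformers import AutoConfig, AutoTokenizer
tok = AutoTokenizer.from_pretrained("/tmp/test")                  # tokenizer saved by init_model
cfg = AutoConfig.from_pretrained("answerdotai/ModernBERT-large")
print(len(tok))        # 50371 here, matching the checkpoint shape in the error
print(cfg.vocab_size)  # 50368, matching the "current model" shape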
These are the files it created:
$ ls -lhrt /tmp/test/
total 208M
-rw-r--r-- 1 gatti data 5,0K juil. 21 16:40 README.md
-rw-r--r-- 1 gatti data 204M juil. 21 16:40 adapter_model.safetensors
-rw-r--r-- 1 gatti data 828 juil. 21 16:40 adapter_config.json
-rw-r--r-- 1 gatti data 170 juil. 21 16:40 chat_template.jinja
-rw-r--r-- 1 gatti data 21K juil. 21 16:40 tokenizer_config.json
-rw-r--r-- 1 gatti data 694 juil. 21 16:40 special_tokens_map.json
-rw-r--r-- 1 gatti data 3,5M juil. 21 16:40 tokenizer.json
It does not seem to be storing the classifier weights, which is weird at best, since I explicitly called model.base_model.save_pretrained(path_to_dir).
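For reference, this is how the tensor names inside the adapter file can be inspected, to see whether the classifier ended up there at all (a small sketch using the safetensors package):

# Diagnostic sketch: list classifier-related tensors stored in the adapter file.
from safetensors import safe_open
with safe_open("/tmp/test/adapter_model.safetensors", framework="pt") as f:
    keys = list(f.keys())
    print(len(keys), "tensors in total")
    print([k for k in keys if "classifier" in k])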
Besides, if I inspect the adapter config:
$ cat /tmp/test/adapter_config.json
{
// ...
"base_model_name_or_path": "answerdotai/ModernBERT-large",
//...
}
It is storing answerdotai/ModernBERT-large as part of the config, which looks incorrect to me, since it should point to my customized classifier model. I don't understand what's going on.
Thanks for any enlightenment.