Proper way of saving/loading models for complex workflows

I’m trying to implement a complex training pipeline where models can be re-finetuned in an RL style. However, I can’t make it work using transformers + peft. The issue is that transformers refuses to load the correct model. Here is a minimal example:

import pathlib

import torch
from peft import LoraConfig, TaskType, get_peft_model, PeftConfig, PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer, ModernBertForSequenceClassification


def init_model(path_to_dir: pathlib.Path) -> None:
    base_model = AutoModelForSequenceClassification.from_pretrained(
        pretrained_model_name_or_path="answerdotai/ModernBERT-large",
        num_labels=1,
        torch_dtype=torch.float32,
        problem_type="regression",
        device_map="cuda"
    )

    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    tokenizer.add_tokens(["[USER]", "[/USER]", "[EOT]"])
    tokenizer.chat_template = (
        "{% for i in range(0, messages|length, 2) %}"
        "{% if i + 1 < messages|length %}"
        "[USER]{{ messages[i].content }}[/USER] {{ messages[i+1].content }}[EOT]\n"
        "{% endif %}"
        "{% endfor %}"
    )
    base_model.resize_token_embeddings(len(tokenizer))

    peft_config = LoraConfig(
        r=4,
        lora_alpha=32,
        task_type=TaskType.SEQ_CLS,
        target_modules="all-linear"
    )
    model = get_peft_model(base_model, peft_config)

    model.save_pretrained(path_to_dir)
    model.base_model.save_pretrained(path_to_dir)
    tokenizer.save_pretrained(path_to_dir)


def reload_model(path_to_dir: pathlib.Path) -> None:
    tokenizer = AutoTokenizer.from_pretrained(path_to_dir)
    base_model = ModernBertForSequenceClassification.from_pretrained(
        str(path_to_dir),
        num_labels=1,
        torch_dtype=torch.float32,
        device_map="cuda"
    )
    config = PeftConfig.from_pretrained(str(path_to_dir))
    base_model.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(
        base_model,
        str(path_to_dir),
        is_trainable=True,
        config=config,
        device_map="cuda"
    )


if __name__ == "__main__":
    init_model(pathlib.Path("/tmp/test"))
    reload_model(pathlib.Path("/tmp/test"))

In the above example, I expect a model to be initialized (randomly, that’s fine), stored to disk and then reloaded. In the real world, the model would make predictions, a score would be computed, and then on the next step the model would be reloaded, finetuned and stored again for the following training step.

Now, when I run this script, I’m facing two issues I can’t work around.

First, transformers seems to ignore that the model was previously initialized and does not load classifier.weight and classifier.bias:

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-large and are newly initialized: ['classifier.bias', 'classifier.weight']

Secondly, it does not recognize that I resized the base model’s token embeddings (i.e., base_model.resize_token_embeddings(len(tokenizer))) and it throws an error:

Error(s) in loading state_dict for ModernBertForSequenceClassification:
size mismatch for model.embeddings.tok_embeddings.weight: copying a param with shape torch.Size([50371, 1024]) from checkpoint, the shape in current model is torch.Size([50368, 1024]).

These are the files it created:

$ ls -lhrt /tmp/test/
total 208M
-rw-r--r-- 1 gatti data 5,0K juil. 21 16:40 README.md
-rw-r--r-- 1 gatti data 204M juil. 21 16:40 adapter_model.safetensors
-rw-r--r-- 1 gatti data  828 juil. 21 16:40 adapter_config.json
-rw-r--r-- 1 gatti data  170 juil. 21 16:40 chat_template.jinja
-rw-r--r-- 1 gatti data  21K juil. 21 16:40 tokenizer_config.json
-rw-r--r-- 1 gatti data  694 juil. 21 16:40 special_tokens_map.json
-rw-r--r-- 1 gatti data 3,5M juil. 21 16:40 tokenizer.json

It does not seem to be storing the classifier, which is weird at best, since I explicitly called model.base_model.save_pretrained(path_to_dir).
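
For reference, here is a small diagnostic snippet to list what PEFT actually wrote into the adapter file (with TaskType.SEQ_CLS the classification head is typically kept inside adapter_model.safetensors as a modules_to_save copy, so it may live there rather than in a separate base checkpoint; treat that as an assumption on my side):

from safetensors import safe_open

# Print every tensor key stored in the adapter file, e.g. the LoRA matrices
# and any modules_to_save copies such as the classifier head.
with safe_open("/tmp/test/adapter_model.safetensors", framework="pt") as f:
    for key in f.keys():
        print(key)

If classifier keys show up in that list, the head is being saved together with the adapter rather than as part of a base checkpoint.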

Besides, if I investigate the adapter_config:

$ cat /tmp/test/adapter_config.json
{
  // ...
  "base_model_name_or_path": "answerdotai/ModernBERT-large",
  //...
}

It stores answerdotai/ModernBERT-large as the base model path, which looks incorrect to me, since the base should be my customized classifier model. I don’t understand what’s going on.

Thanks for any enlightenment.


I think the base model path in the PEFT adapter configuration may be pointing to the model on the hub. How about something like this?

    peft_config = LoraConfig(
        r=4,
        lora_alpha=32,
        task_type=TaskType.SEQ_CLS,
        target_modules="all-linear"
    )
    model = get_peft_model(base_model, peft_config)

    model.save_pretrained(path_to_dir)

    # Overwrite adapter_config to point to local base model
    peft_cfg = PeftConfig.from_pretrained(path_to_dir)
    peft_cfg.base_model_name_or_path = str(path_to_dir)
    peft_cfg.save_pretrained(path_to_dir)

    model.base_model.save_pretrained(path_to_dir)
    tokenizer.save_pretrained(path_to_dir)
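
One more thing that may matter, though I haven’t checked it against this exact setup: pointing base_model_name_or_path at the local directory only helps if that directory also contains a full base checkpoint (config.json plus the model weights). A sketch of one way to get that is to save the base model before LoRA is injected, so the saved state dict stays free of adapter modules:

    # Save a clean base checkpoint (resized embeddings + classifier head)
    # before wrapping with LoRA, then create and save the adapter on top.
    base_model.resize_token_embeddings(len(tokenizer))
    base_model.save_pretrained(path_to_dir)  # writes config.json + model weights
    tokenizer.save_pretrained(path_to_dir)

    model = get_peft_model(base_model, peft_config)
    model.save_pretrained(path_to_dir)  # writes adapter_config.json + adapter_model.safetensors

With the base checkpoint saved first, reload_model should find the resized vocabulary and the classifier head in path_to_dir instead of falling back to the hub checkpoint.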


I totally get what you’re trying to build.
We ran into the exact same issue when designing a reasoning pipeline across finetuning loops.

Here’s the core trap:

- save_pretrained() stores parameters, not semantic transitions.
- adapter_config.json always keeps the base repo, not the snapshot state.
- resize_token_embeddings() doesn’t persist unless you save after resizing, and even then only if you reload via the full tokenizer flow, which doesn’t happen automatically with PEFT.

We built a system called WFGY just for this reason.
It snapshots semantic state, including tokenizer deltas, merged adapters, chat templates, and thought context.

Link: GitHub - onestardao/WFGY: Semantic Reasoning Engine for LLMs · WFGY 推理引擎 / 萬法歸一

Endorsed by the creator of tesseract.js (36k★), it lets you treat LLMs like versionable reasoning engines rather than just layers with weights.