PEFT with SFTTrainer unexpected 'resume_from_checkpoint'

transformers version is the latest, 4.57.1

Getting an error when trying to resume my last fine tuning:

trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,
    processing_class=tokenizer,
    resume_from_checkpoint=True,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_eval_dataset["evaluation"],
    compute_metrics=compute_metrics,
    formatting_func=formatting_func,
    callbacks=[TensorBoardCallback(log_dir)]
)

trainer.train()

TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'resume_from_checkpoint'

This is confusing and directly contradicts the documentation. Can anybody suggest how to proceed?


The specifications for library function arguments change frequently, so related errors occasionally occur…


You passed resume_from_checkpoint to the constructor. Pass it to .train(). The SFTTrainer __init__ does not take that argument. Transformers implements resume on Trainer.train(resume_from_checkpoint=...), and TRL’s SFTTrainer follows that pattern. (Hugging Face)


What to change

# ✅ correct: put resume in .train(), not in SFTTrainer(...)

# refs:
# - TRL SFTTrainer docs: https://huggingface.co/docs/trl/en/sft_trainer
# - Transformers Trainer .train(): https://huggingface.co/docs/transformers/en/trainer

trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,
    processing_class=tokenizer,   # TRL uses 'processing_class' (formerly 'tokenizer')
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_eval_dataset["evaluation"],
    compute_metrics=compute_metrics,
    formatting_func=formatting_func,
    callbacks=[TensorBoardCallback(log_dir)],
)

# latest checkpoint in training_args.output_dir
trainer.train(resume_from_checkpoint=True)

# or a specific checkpoint path
# trainer.train(resume_from_checkpoint="path/to/output_dir/checkpoint-12345")

Why this fixes it

  • In recent TRL releases, SFTTrainer.__init__ does not expose resume_from_checkpoint, so passing it there raises the “unexpected keyword argument” error. Pass it to train() instead. (Hugging Face)
  • In Transformers 4.57.x, resume is a parameter to Trainer.train. The docs explicitly show trainer.train(resume_from_checkpoint=True | "path"). (Hugging Face)
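
If you want to verify this locally, a quick sanity check (a sketch; it only assumes trl and transformers are importable) is to inspect the signatures yourself:

# Sanity check: the constructor has no resume_from_checkpoint, but Trainer.train() does.
import inspect

from transformers import Trainer
from trl import SFTTrainer

print("resume_from_checkpoint" in inspect.signature(SFTTrainer.__init__).parameters)  # False
print("resume_from_checkpoint" in inspect.signature(Trainer.train).parameters)        # True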

Background you need

How resume works
Transformers checkpoints live under output_dir/checkpoint-<global_step>. A valid checkpoint holds weights plus training state so the optimizer, LR scheduler, RNG, and global step continue correctly. You resume by calling trainer.train(resume_from_checkpoint=...). (Hugging Face)
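
For example, you can peek into the newest checkpoint to confirm the training state is actually there; trainer_state.json records the global step the run will continue from (a minimal sketch, the "out" directory name is illustrative):

# Inspect the newest checkpoint under your output_dir ("out" is illustrative).
import json
import os

from transformers.trainer_utils import get_last_checkpoint

ckpt = get_last_checkpoint("out")  # e.g. "out/checkpoint-1500", or None if nothing was saved
if ckpt is not None:
    print(sorted(os.listdir(ckpt)))         # expect optimizer.pt, scheduler.pt, trainer_state.json, ...
    with open(os.path.join(ckpt, "trainer_state.json")) as f:
        print(json.load(f)["global_step"])  # the step training will resume from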

SFTTrainer is a thin wrapper
TRL trainers delegate training to the HF Trainer. That is why the resume switch belongs to .train(). (GitHub)
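
You can confirm the inheritance directly (a tiny sketch, assuming both libraries are installed):

# SFTTrainer subclasses transformers.Trainer, so .train() and its
# resume_from_checkpoint parameter come from the base class.
from transformers import Trainer
from trl import SFTTrainer

print(issubclass(SFTTrainer, Trainer))  # True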

Parameter rename in TRL
If you recently upgraded TRL, pass the tokenizer via processing_class=... instead of tokenizer=.... Old name now errors in newer TRL. (Hugging Face)
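
If your script has to run against both older and newer TRL versions, one hedged workaround is to pick the keyword from the constructor signature at runtime (model, training_args, tokenized_dataset, and tokenizer below are the objects from your existing script):

# Version-tolerant construction (a sketch): pass the tokenizer under whichever
# keyword this TRL version's constructor actually accepts.
import inspect

from trl import SFTTrainer

tok_kwarg = (
    "processing_class"
    if "processing_class" in inspect.signature(SFTTrainer.__init__).parameters
    else "tokenizer"
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    **{tok_kwarg: tokenizer},
)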


Quick, robust pattern that avoids edge cases

# refs:
# - get_last_checkpoint helper: https://raw.githubusercontent.com/huggingface/transformers/v4.43.2/examples/pytorch/translation/run_translation.py
# - Trainer docs on checkpoints: https://huggingface.co/docs/transformers/en/trainer

from transformers.trainer_utils import get_last_checkpoint
from trl import SFTTrainer, SFTConfig

sft_args = SFTConfig(
    output_dir="out",
    save_strategy="steps",
    save_steps=500,
    logging_steps=50,
    eval_strategy="steps",   # or "epoch"
)

trainer = SFTTrainer(
    model=model,
    args=sft_args,
    processing_class=tokenizer,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_eval_dataset["evaluation"],
)

last = get_last_checkpoint(sft_args.output_dir)
trainer.train(resume_from_checkpoint=last or False)  # True also works; passing the path is explicit

  • get_last_checkpoint(...) detects the newest checkpoint-* folder and avoids path typos. This pattern mirrors the HF example scripts. (GitHub)
  • Ensure you actually save checkpoints during the first run (set save_strategy and, for step-based saving, save_steps; save_total_limit only caps how many checkpoints are kept). Otherwise there is nothing to resume. (Hugging Face)

PEFT/LoRA specifics you might hit

  • Adapter-only vs full trainer checkpoint
    If you only saved adapter_model.bin via PEFT and do not have optimizer.pt, scheduler.pt, trainer_state.json, then you can reload weights but you cannot restore optimizer/scheduler step state. You can still continue training, but LR schedules and step counters restart. A full resume needs the Trainer checkpoint directory contents. (GitHub)

  • What should be inside a full checkpoint
    Typical files include pytorch_model.bin (or adapter weights), optimizer.pt, scheduler.pt, trainer_state.json, training_args.bin, and RNG states. If those are missing, resuming stateful training will be partial; a quick check is sketched after this list. (GitHub)

  • Mixing resume with changed hyperparameters
    When you resume, training arguments saved in the checkpoint can override your new TrainingArguments. Community reports note this limitation; change as little as possible when resuming. (Reddit)
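
As flagged above, here is a minimal check (a sketch; the directory name is illustrative) to tell a full trainer checkpoint apart from an adapter-only save before you try to resume:

# Distinguish a full Trainer checkpoint from an adapter-only PEFT save.
import os

ckpt_dir = "out/checkpoint-1500"  # illustrative path

stateful_files = ["optimizer.pt", "scheduler.pt", "trainer_state.json"]
missing = [f for f in stateful_files if not os.path.exists(os.path.join(ckpt_dir, f))]

if missing:
    # Weights may still load, but optimizer/scheduler/step state cannot be restored.
    print(f"Not a full trainer checkpoint, missing: {missing}")
else:
    print("Full trainer checkpoint: safe to pass to trainer.train(resume_from_checkpoint=...)")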


Minimal checklist to make resume reliable

  1. Save checkpoints: set output_dir, save_strategy="steps" or "epoch", and save_steps if needed. (Hugging Face)
  2. Call trainer.train(resume_from_checkpoint=True) or pass the explicit checkpoint-#### path. (Hugging Face)
  3. Keep model topology and world-size the same across resume. Mismatches can break loading or yield odd loss spikes. See related GH issues about resume quirks. (GitHub)
  4. For TRL versions that renamed arguments, use processing_class=tokenizer. (Hugging Face)

Why your error text looked contradictory

  • Third-party blog snippets sometimes pass resume to the constructor for other trainers. SFTTrainer does not accept it. The official docs put resume_from_checkpoint on .train(). (Hugging Face)

Common pitfalls and quick fixes

  • “Unexpected keyword argument ‘tokenizer’” → replace with processing_class=.... (Hugging Face)
  • Resuming but loss jumps → verify you resumed from a full trainer checkpoint and did not alter batch size or schedule mid-run. There are open reports of resumed jobs showing transient loss jumps. (GitHub)
  • No checkpoints found → confirm out/checkpoint-* exists or use get_last_checkpoint(...). (GitHub)

Short, runnable template

# docs:
# - SFTTrainer: https://huggingface.co/docs/trl/en/sft_trainer
# - Trainer.resume: https://huggingface.co/docs/transformers/en/trainer

from transformers.trainer_utils import get_last_checkpoint
from trl import SFTTrainer, SFTConfig

cfg = SFTConfig(
    output_dir="out",
    save_strategy="steps",
    save_steps=250,
    eval_strategy="no",
)

trainer = SFTTrainer(
    model=model,
    args=cfg,
    processing_class=tokenizer,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_eval_dataset["evaluation"],
)

ckpt = get_last_checkpoint(cfg.output_dir)
trainer.train(resume_from_checkpoint=ckpt or False)

Curated references and why they help

Official docs

  • TRL SFTTrainer API. Shows constructor signature and examples. Use to confirm accepted kwargs. (Hugging Face)
  • Transformers Trainer docs. Shows trainer.train(resume_from_checkpoint=...), and explains checkpoint behavior. Use to verify resume semantics. (Hugging Face)

GitHub issues and examples

  • TRL repo overview. Confirms trainers are wrappers over HF Trainer, which explains where resume is handled. (GitHub)
  • Transformers example script using get_last_checkpoint(...). Copy the pattern for robust resume detection. (GitHub)
  • PEFT issue showing full checkpoint contents. Use to check if you saved optimizer/scheduler state. (GitHub)

Community Q&A for context

  • StackOverflow threads showing the exact “unexpected keyword argument on constructor” pattern and the correct fix of using .train(...). Helpful when scanning for similar symptoms. (Stack Overflow)

I’m facepalming rn, thank you
