Error using SFTTrainer: Make sure that your dataset has enough samples to at least yield one packed sequence

Hi, I created a small dataset to mimic the format (OpenAI-style messages) recommended by the SFTTrainer doc page:


from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch

dataset = load_dataset("celsowm/auryn", split="train", download_mode="force_redownload")

# Specify the model
model_name = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    torch_dtype=torch.float16
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir='outputs/summarize',
    num_train_epochs=3,
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # Increase gradient accumulation to compensate for small batch size
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit",
    #warmup_steps=500,
    weight_decay=0.01,
    #save_steps=500,  # Save the model every 500 steps
    save_total_limit=3,  # Keep only the last 3 models
)

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True,
    max_seq_length=2048
)

But I got this error:

Make sure that your dataset has enough samples to at least yield one packed sequence

Any hints?

I got the same error.

Did you get this solved? I am also getting the same problem.

Any update on this? Are 2000 samples not enough?

autotrain.trainers.common:wrapper:116 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.

I had the same problem. In my case, passing the ‘train’ subset of the dataset to the SFTTrainer’s train_dataset solved the issue.
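For reference, here is a minimal sketch of that fix, assuming the dataset is loaded without split= (so load_dataset returns a DatasetDict rather than a single split) and reusing the model, peft_config and training_args from the original post:

from datasets import load_dataset

dataset = load_dataset("celsowm/auryn")  # DatasetDict with a "train" key, not a single Dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset["train"],  # pass the "train" split, not the whole DatasetDict
    packing=True,
    max_seq_length=2048
)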

I had the same problem on my private dataset. In my case, dropping NaN values solved the issue.

import pandas as pd
from datasets import Dataset

def create_conversation(sample):
  # system_message is assumed to be defined earlier: a prompt template with a {schema} placeholder
  return {
    "messages": [
      {"role": "system", "content": system_message.format(schema=sample["context"])},
      {"role": "user", "content": sample["question"]},
      {"role": "assistant", "content": sample["answer"]}
    ]
  }

train_df = pd.read_csv('private.csv', lineterminator='\n')
train_df.dropna(inplace=True)

dataset_hf = Dataset.from_pandas(train_df)
train_dataset = dataset_hf.map(create_conversation, batched=False)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    peft_config=peft_config,
    args=training_args,
    train_dataset=train_dataset,
    packing=True,
    max_seq_length=2048
)
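If the error still appears after these fixes, it can also be worth sanity-checking the total token count: with packing=True, SFTTrainer concatenates the tokenized samples into blocks of max_seq_length, so the dataset needs to contain at least roughly max_seq_length tokens in total to yield a single packed sequence (rows that end up empty, e.g. from NaN values, contribute nothing). A rough check, assuming the tokenizer and train_dataset defined above; joining the message contents with spaces is only an approximation of the text the trainer actually formats:

total_tokens = 0
for sample in train_dataset:
    # approximate the formatted text by concatenating the message contents
    text = " ".join(m["content"] for m in sample["messages"])
    total_tokens += len(tokenizer(text)["input_ids"])

print(f"{total_tokens} tokens in total; packing with max_seq_length=2048 "
      f"needs at least ~2048 tokens to yield one packed sequence")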