I am trying to use a non-packed dataset with SFTTrainer by setting packing=False, but I get the following error:

Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (id in this case) have excessive nesting (inputs type list where type int is expected).
My code is:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

llm_tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    truncation_side="right",
    padding_side="right",
    add_eos_token=True,
    add_bos_token=True,
)
llm_tokenizer.pad_token = llm_tokenizer.eos_token
huggingface_dataset_name = "neil-code/dialogsum-test"
train_dataset = load_dataset(huggingface_dataset_name, split='train')
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")
model.gradient_checkpointing_enable()
model.train()
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_safetensors=False,
    num_train_epochs=2,
    save_steps=10,
    eval_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=1,
    learning_rate=2.5e-5,
    lr_scheduler_type="linear",
    weight_decay=0.1,
    # adam_beta1=0.9,
    # adam_beta2=0.98,
    optim="adafactor",
    bf16=True,
    warmup_ratio=0.1,
    logging_steps=1,
    logging_strategy="steps",
    logging_dir="./logs",
    load_best_model_at_end=True,
    output_dir="./",
    remove_unused_columns=False,
    label_names=["labels"],
    # run_name="exp4_no_peft",
    disable_tqdm=False,
)
trainer = SFTTrainer(
    model,
    packing=False,
    args=training_args,
    tokenizer=llm_tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="dialogue",
    max_seq_length=2048,
)
trainer.train()
Can someone please help me out? The dataset is a simple text-based dataset.
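In case it helps, this is roughly how I inspected the dataset (only the standard datasets API; I am assuming the split has an id column alongside the dialogue text field, since id is the feature the error message points at):

from datasets import load_dataset

# Load the same split used in the script above
train_dataset = load_dataset("neil-code/dialogsum-test", split="train")

# Show the column names and feature types; besides the 'dialogue' text field
# the split appears to contain an 'id' column, which the error complains about
print(train_dataset.features)

# Look at one raw example to confirm the text fields are plain strings
print(train_dataset[0])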