Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True'

I am trying to use a non-packed dataset with SFTTrainer by setting 'packing=False', but I get the error: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (id in this case) have excessive nesting (inputs type list where type int is expected).

My code is:


from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

llm_tokenizer = AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        truncation_side="right",
        padding_side="right",
        add_eos_token=True,
        add_bos_token=True,
)

llm_tokenizer.pad_token = llm_tokenizer.eos_token

huggingface_dataset_name = "neil-code/dialogsum-test"    
train_dataset = load_dataset(huggingface_dataset_name, split='train')

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")

model.gradient_checkpointing_enable()  

model.train()

training_args = TrainingArguments(
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        save_safetensors = False,
        num_train_epochs=2,
        save_steps=10,
        eval_steps=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=1,
        learning_rate = 2.5e-5,
        lr_scheduler_type = "linear",
        weight_decay=0.1, 
        # adam_beta1 = 0.9,
        # adam_beta2 = 0.98,
        optim="adafactor",
        bf16=True,
        warmup_ratio=0.1,
        logging_steps=1,
        logging_strategy='steps',
        logging_dir='./logs',
        load_best_model_at_end=True,
        output_dir = "./",
        remove_unused_columns=False,
        label_names=["labels"],
        # run_name='exp4_no_peft',
        disable_tqdm=False
)

trainer = SFTTrainer(
        model,
        packing=False,
        args=training_args,
        tokenizer=llm_tokenizer,
        train_dataset=train_dataset,
        dataset_text_field='dialogue',
        max_seq_length=2048,
)
trainer.train()

Can someone please help me out? The dataset is a simple text-based dataset.
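
For context, here is roughly what I understand the error message to be asking for, i.e. tokenizing the text myself with padding and truncation turned on. This is only a sketch that reuses llm_tokenizer and train_dataset from the code above, and I have not confirmed that SFTTrainer accepts a pre-tokenized dataset this way:

# Sketch only: pre-tokenize the dialogue field with padding/truncation
# enabled, and drop every other column (including "id") so the collator
# only ever sees equal-length lists of token ids.
def tokenize_fn(batch):
    return llm_tokenizer(
        batch["dialogue"],
        padding="max_length",
        truncation=True,
        max_length=2048,
    )

tokenized_train = train_dataset.map(
    tokenize_fn,
    batched=True,
    remove_columns=train_dataset.column_names,
)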


I'm currently experiencing the same issue with Seq2SeqTrainer. From what I've read, the issue is not new; links to two previous reports are at the bottom.

If I understand correctly, "inputs type list where type int is expected" means that the trainer is assuming a classification task, not a sequence-to-sequence task.

Previous reports of the issue:

Here's a post that looks relevant. It's possible that the relevant sections of several "utils" modules should change:

return torch.tensor(value)

to

return torch.tensor(value, dtype=torch.float16)

or to something similar in the function:

        def as_tensor(value, dtype=None):
            if isinstance(value, list) and isinstance(value[0], np.ndarray):
                return torch.tensor(np.array(value))
            return torch.tensor(value)

In the post, the author edited line 141 of feature_extraction_utils.py. I found identical code at line 724 of tokenization_utils_base.py. Identical code would cause identical errors.
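
For what it's worth, here is a minimal sketch (my own, not from the linked post) of one way the bare torch.tensor(value) call fails, namely when the batch holds unpadded sequences of different lengths, which is exactly what padding and truncation are supposed to prevent:

import torch

# Two token-id lists of unequal length cannot be stacked into one tensor.
ragged = [[1, 2, 3], [4, 5]]
try:
    torch.tensor(ragged)
except ValueError as err:
    print(f"ValueError: {err}")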

According to the author, this solution worked for him. For me, all I got was another error.