Perhaps your features (`output` in this case) have excessive nesting (inputs type `list` where type `int` is expected)

I am also getting a similar issue here.

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 
'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features
(`output` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
  0% 0/20 [00:05<?, ?it/s]

Here are my fine-tuning step details.

The model is loaded using Unsloth, not Hugging Face Transformers directly:

from unsloth import FastLanguageModel, is_bfloat16_supported

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B", # or choose "unsloth/Llama-3.2-1B"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Dataset preparation

from datasets import load_dataset

def prepare_dataset(tokenizer_data: dict) -> dict:
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

    def formatting_prompts_func(examples):
        return {"text": [alpaca_prompt.format(inst, inp, out) + tokenizer_data['tokenizer'].eos_token
                         for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])]}

    # Load the dataset
    dataset = load_dataset("yahma/alpaca-cleaned")

    # Apply formatting
    dataset = dataset.map(formatting_prompts_func, batched=True)

    # Split the dataset into train, validation, and test sets
    train_valid_test_split = dataset['train'].train_test_split(test_size=0.1, seed=42)
    train_valid_dataset = train_valid_test_split['train']
    test_dataset = train_valid_test_split['test']

    train_valid_split = train_valid_dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = train_valid_split['train']
    val_dataset = train_valid_split['test']

    return {
        'train_dataset': train_dataset,
        'val_dataset': val_dataset,
        'test_dataset': test_dataset
    }

Here I am passing train_dataset and eval_dataset to get eval_loss metrics.

    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 20,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="mlflow",
        evaluation_strategy=IntervalStrategy.STEPS,
        eval_steps=20,
        save_total_limit=5,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        remove_unused_columns=False
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=False,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        args=training_args
    )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with ‘padding=True’ ‘truncation=True’ to have batched tensors with the same length
How about this answer? Is it helpful for your case?


@Alanturner2 Thanks for your quick response. @John6666 Thanks for taking a look at my post.

I found the issue: I was processing the data incorrectly. Below is the fix.

    # Load the dataset
    dataset = load_dataset("yahma/alpaca-cleaned",  split = 'train')

    # Split the dataset into train, validation, and test sets
    train_valid_test_split = dataset.train_test_split(test_size=0.1, seed=42)

But I am running into this loop of issues… Here I am doing instruction tuning.

When I provide eval_dataset=val_dataset in SFTTrainer, I get:

ValueError: No columns in the dataset match the model's forward method signature. The following 
columns have been ignored: [instruction, output, text, input]. Please check the dataset and model. 
You may need to set `remove_unused_columns=False` in `TrainingArguments`.

Based on the error message I tried setting remove_unused_columns=True in training_args, and this also required me to enable packing=True, but that raises another new issue:

ValueError: You should supply an encoding or a list of encodings to this method that includes 
input_ids, but you provided ['output', 'input', 'instruction', 'text']

So finally I am stuck at the error above after enabling remove_unused_columns=True and packing=True while passing eval_dataset=val_dataset to SFTTrainer, which I need for traditional metrics like accuracy, precision, recall, and F1-score.

Can you help me with this? Dataset info: yahma/alpaca-cleaned · Datasets at Hugging Face


I think there is also a way to implement and specify the DataCollator that is suitable for each data set, but in some simple cases, it seems that you can deal with it by renaming the column names of the dataset.
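For example, something along these lines (just a sketch; the renamed column and the collator choice are only illustrative, not something your setup requires):

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

ds = load_dataset("yahma/alpaca-cleaned", split="train")
# Example of renaming a column so it matches what the trainer expects
# (column names here are illustrative):
# ds = ds.rename_column("output", "response")

# Or pass an explicit collator for plain causal-LM batches:
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# trainer = SFTTrainer(..., data_collator=collator, ...)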


This time, this case may be closer. There is a possibility that you have forgotten that you need to tokenize the data before passing it on.
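Roughly something like this, reusing the tokenizer and dataset you already have (max_length is just an example value):

def tokenize_fn(examples):
    # Turn the formatted "text" column into input_ids / attention_mask
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)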


Yeah, you're right. Once I give it a try I will update this thread. Thanks.


Follow-up question, with an example. Here I am using SFTTrainer.

I have a dataset like this after formatting it:

train_dataset, val_dataset, test_dataset

(Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 41925
 }),
 Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 4659
 }),
 Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 5176
 }))

Before using this dataset with SFTTrainer, do I need to drop the other columns 'output', 'input', 'instruction'?

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        dataset_text_field="text",
        # ignored_columns=ignored_columns,
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=True,
        # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
        args=training_args
    )

because I am running into the following error when I do this:

result = trainer.evaluate(dataset_test_final)

result = trainer.evaluate(test_dataset)

Also, as per the Hugging Face docs, we do not need to explicitly encode the columns; SFTTrainer will handle it. Please help here, thanks.

ValueError: You should supply an encoding or a list of encodings to this method that includes
input_ids, but you provided ['output', 'input', 'instruction', 'text']

Could it be that remove_unused_columns=False is specified?
If something is not working properly, I think it is safer to define and specify DataCollator yourself. It takes a bit of effort, but…

I hope it's all good from my end; is the problem with SFTTrainer? In the examples given above, they are all using Trainer, not SFTTrainer. Here is my code:

from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from transformers import TrainingArguments, pipeline as hf_pipeline, EarlyStoppingCallback, IntervalStrategy

training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 5,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
        # evaluation_strategy=IntervalStrategy.EPOCH,
        # save_strategy=IntervalStrategy.EPOCH,        # Save checkpoint at the end of each epoch
        # eval_steps=20,
        # save_steps=20,
        # save_total_limit=5,
        # load_best_model_at_end=True,
        # metric_for_best_model="eval_loss",
        # greater_is_better=False,
        remove_unused_columns=False
    )

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args
)

# unsloth_train fixes gradient_accumulation_steps
from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy gradient accumulation
trainer_stats = unsloth_train(trainer)

When I do unsloth_train(trainer), why is it only showing the training loss and not other metrics?
Output:

Step	Training Loss
1	1.345000
2	1.478800
3	1.385700
4	1.392100
5	1.334100

I get the following error when I use:

# result = trainer.evaluate(dataset_test_final)
result = trainer.evaluate(test_dataset)

Is this an issue with SFTTrainer?

Also, it looks like this same issue has been opened and closed several times without being resolved.


This smells like an unresolved issue…
OK, let’s work around it. If we process the dataset in advance, SFTTrainer won’t complain.

By "process the dataset in advance" you mean explicitly tokenize and drop the unused columns ['output', 'input', 'instruction', 'text']?

from datasets import load_dataset, DatasetDict

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    return {"text": [alpaca_prompt.format(inst, inp, out) + EOS_TOKEN
                      for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])]}

# Load the dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["instruction", "input", "output"])

# Split the dataset into train, validation, and test sets
train_valid_test_split = dataset.train_test_split(test_size=0.1, seed=42)
train_valid_dataset = train_valid_test_split['train']
test_dataset = train_valid_test_split['test']

train_valid_split = train_valid_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_valid_split['train']
val_dataset = train_valid_split['test']


def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(
        examples["text"],
        truncation=True,
        max_length=tokenizer.model_max_length,
        padding="max_length",
        # return_tensors="pt"
    )
    return tokenized_inputs

tokenized_train_no_text = train_dataset.map(tokenize_and_align_labels, batched=False, remove_columns=["text"])
tokenized_val_no_text = val_dataset.map(tokenize_and_align_labels, batched=False, remove_columns=["text"])
tokenized_test_no_text = test_dataset.map(tokenize_and_align_labels, batched=False, remove_columns=["text"])

Now it has only the input_ids and attention_mask columns after tokenization. Is this right?


Maybe yes. Actually, it would probably be cleaner to use DataCollator, but if we suspect a bug in the library, it’s better to do it this way.

drop unused columns [‘output’, ‘input’, ‘instruction’, ‘text’]

just this.

Okay, let me do it plainly first, and then if required I will use a DataCollator.


:grimacing:
I took only 10 texts. I also tried taking just 1 text.

When I do result = trainer.evaluate(tokenized_test_no_text), I get the error below:

OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 58.12 MiB is free. Process 2353 has 14.68 GiB memory in use. Of the allocated memory 14.50 GiB is allocated by PyTorch, and 30.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation)


Perhaps it’s because it’s SFTTrainer.:sweat_smile: Apparently VRAM consumption is high.
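As the error message itself suggests, you could try setting the allocator option before loading the model, and keeping the evaluation batch small (a sketch; the TrainingArguments values are only examples):

import os
# Must be set before the first CUDA allocation (i.e. before loading the model)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# In TrainingArguments, a small eval batch also helps, for example:
# per_device_eval_batch_size=1, eval_accumulation_steps=1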

I have solved this problem with:

model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

Yeah, I am using almost the same settings (better ones, of course). I thought I needed to reduce the max length and try; I reduced it to 100 and am still getting the out-of-memory error. I also turned off compute metrics for now.

from unsloth import FastLanguageModel
import torch
max_seq_length = 100 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B", # or choose "unsloth/Llama-3.2-1B"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

I tried in multiple environments (Colab, Kaggle, and Lightning as well). Is there something wrong in the approach?

I think we shouldn't pass tokenized inputs directly. This is giving me the out-of-memory issue.

Here is my updated SFTTrainer code. I tried my best to reduce the data and everything else, except that I am still providing tokenized inputs. Training also went well, but running the evaluate() function throws the memory error.

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_train_no_text,
    # eval_dataset=tokenized_val_no_text,
    # dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args
)

If you want to pass tokenized data, you probably need to write your own evaluator and DataCollator. If you just want to pass the dataset with the unnecessary columns removed, it might work with the default ones…
By the way, it looks like there is a way to use QLoRA to reduce VRAM.

How to customize the DataCollator? Well, just select the data you want to pass.
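A rough sketch of what I mean (untested; the class name is illustrative, and the -100 masking is just the usual causal-LM labeling convention):

class KeepModelInputsCollator:
    # Minimal custom collator: keep only the keys the model understands,
    # pad them to a common length, and reuse input_ids as labels.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        batch = self.tokenizer.pad(
            [{"input_ids": f["input_ids"], "attention_mask": f["attention_mask"]} for f in features],
            return_tensors="pt",
        )
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding positions in the loss
        batch["labels"] = labels
        return batch

# trainer = SFTTrainer(..., data_collator=KeepModelInputsCollator(tokenizer), ...)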

Finally I was able to test it: compute_metrics=compute_metrics is not working with SFTTrainer.

The metrics I can see using the evaluate() function are just the defaults.

The final code is here. We do not need to pass a tokenized dataset; SFTTrainer can handle it. The only thing is to make sure to drop the unused columns ["instruction", "input", "output"]; the "text" field is needed.

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args,
    # compute_metrics=compute_metrics
)

If using the regular Trainer, we need to pass tokenized input, but when I pass tokenized data I run into the out-of-memory issue. I may need to test further with other fine-tuning methods like QLoRA, as well as with a DataCollator, to see whether the out-of-memory issue resolves.
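For reference, the Unsloth-style QLoRA setup I plan to try looks like this (values copied from Unsloth's standard examples, so treat them as a starting point only):

from unsloth import FastLanguageModel

# LoRA adapters on top of the 4-bit base model (QLoRA-style fine-tuning)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # reduces VRAM use for long contexts
    random_state=3407,
)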
