Perhaps your features (`output` in this case) have excessive nesting (inputs type `list` where type `int` is expected)

I am also getting a similar issue here.

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 
'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features
(`output` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
  0% 0/20 [00:05<?, ?it/s]

Here are my fine-tuning step details.

The model is loaded using Unsloth, not Hugging Face Transformers directly:

from unsloth import FastLanguageModel, is_bfloat16_supported

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B", # or choose "unsloth/Llama-3.2-1B"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Dataset preparation

from datasets import load_dataset

def prepare_dataset(tokenizer_data: dict) -> dict:
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

    def formatting_prompts_func(examples):
        return {"text": [alpaca_prompt.format(inst, inp, out) + tokenizer_data['tokenizer'].eos_token
                         for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])]}

    # Load the dataset
    dataset = load_dataset("yahma/alpaca-cleaned")

    # Apply formatting
    dataset = dataset.map(formatting_prompts_func, batched=True)

    # Split the dataset into train, validation, and test sets
    train_valid_test_split = dataset['train'].train_test_split(test_size=0.1, seed=42)
    train_valid_dataset = train_valid_test_split['train']
    test_dataset = train_valid_test_split['test']

    train_valid_split = train_valid_dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = train_valid_split['train']
    val_dataset = train_valid_split['test']

    return {
        'train_dataset': train_dataset,
        'val_dataset': val_dataset,
        'test_dataset': test_dataset
    }

Here I am passing train_dataset and eval_dataset to get eval_loss metrics.

    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 20,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="mlflow",
        evaluation_strategy=IntervalStrategy.STEPS,
        eval_steps=20,
        save_total_limit=5,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        remove_unused_columns=False
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=False,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        args=training_args
    )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with ‘padding=True’ ‘truncation=True’ to have batched tensors with the same length
How about this answer? Is it helpful for your case?


@Alanturner2 Thanks for your quick response. @John6666 Thanks for taking a look at my post.

I found the issue: I was processing the data incorrectly. Below is the fix.

    # Load the dataset
    dataset = load_dataset("yahma/alpaca-cleaned",  split = 'train')

    # Split the dataset into train, validation, and test sets
    train_valid_test_split = dataset.train_test_split(test_size=0.1, seed=42)

But I am running into this loop of issues… Here I am doing instruction tuning.

When I provide eval_dataset=val_dataset in SFTTrainer, I get:

ValueError: No columns in the dataset match the model's forward method signature. The following 
columns have been ignored: [instruction, output, text, input]. Please check the dataset and model. 
You may need to set `remove_unused_columns=False` in `TrainingArguments`.

Based on the error message I tried setting remove_unused_columns=True in training_args, and this also required me to enable packing=True, but that raises another new issue:

ValueError: You should supply an encoding or a list of encodings to this method that includes 
input_ids, but you provided ['output', 'input', 'instruction', 'text']

So finally I am stuck at the error above after enabling remove_unused_columns=True and packing=True while passing eval_dataset=val_dataset to SFTTrainer, which I need for traditional metrics like accuracy, precision, recall, and F1-score.

Can you help me with this? Dataset info: yahma/alpaca-cleaned · Datasets at Hugging Face


I think there is also a way to implement and specify the DataCollator that is suitable for each data set, but in some simple cases, it seems that you can deal with it by renaming the column names of the dataset.
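For example, something along these lines (just a sketch; the renamed column and the collator choice are only illustrative, not something your setup requires):

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

ds = load_dataset("yahma/alpaca-cleaned", split="train")
# Example of renaming a column so it matches what the trainer expects
# (column names here are illustrative):
# ds = ds.rename_column("output", "response")

# Or pass an explicit collator for plain causal-LM batches:
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# trainer = SFTTrainer(..., data_collator=collator, ...)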


This time, this case may be closer. There is a possibility that you have forgotten that you need to tokenize the data before passing it on.
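Roughly something like this, reusing the tokenizer and dataset you already have (max_length is just an example value):

def tokenize_fn(examples):
    # Turn the formatted "text" column into input_ids / attention_mask
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)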


Yeah, you're right. Once I give it a try I will update this thread. Thanks.


Follow-up question, with an example. Here I am using SFTTrainer.

I have a dataset like this after formatting it:

train_dataset, val_dataset, test_dataset

(Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 41925
 }),
 Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 4659
 }),
 Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 5176
 }))

Before using this dataset with SFTTrainer, do I need to drop the other columns 'output', 'input', 'instruction'?

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        dataset_text_field="text",
        # ignored_columns=ignored_columns,
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=True,
        # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
        args=training_args
    )

because I am running into the following error when I do this:

result = trainer.evaluate(dataset_test_final)

result = trainer.evaluate(test_dataset)

Also, as per the Hugging Face docs, we do not need to explicitly encode the columns; SFTTrainer will handle it. Please help here, thanks.

ValueError: You should supply an encoding or a list of encodings to this method that includes
input_ids, but you provided ['output', 'input', 'instruction', 'text']

Could it be that remove_unused_columns=False is specified?
If something is not working properly, I think it is safer to define and specify DataCollator yourself. It takes a bit of effort, but…

I hope it's all good from my end; is the problem with SFTTrainer? In the examples given above, they are all using Trainer, not SFTTrainer. Here is my code:

from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from transformers import TrainingArguments, pipeline as hf_pipeline, EarlyStoppingCallback, IntervalStrategy

training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 5,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
        # evaluation_strategy=IntervalStrategy.EPOCH,
        # save_strategy=IntervalStrategy.EPOCH,        # Save checkpoint at the end of each epoch
        # eval_steps=20,
        # save_steps=20,
        # save_total_limit=5,
        # load_best_model_at_end=True,
        # metric_for_best_model="eval_loss",
        # greater_is_better=False,
        remove_unused_columns=False
    )

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args
)

# unsloth_train fixes gradient_accumulation_steps
from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy gradient accumulation
trainer_stats = unsloth_train(trainer)

When I do unsloth_train(trainer), why is it only showing the training loss and not other metrics?
Output:

Step	Training Loss
1	1.345000
2	1.478800
3	1.385700
4	1.392100
5	1.334100

I get the following error when I use:

# result = trainer.evaluate(dataset_test_final)
result = trainer.evaluate(test_dataset)

Is this an issue with SFTTrainer?

Also, it looks like this same issue has been opened and closed several times without being resolved.


This smells like an unresolved issue…
OK, let’s work around it. If we process the dataset in advance, SFTTrainer won’t complain.

By "process the dataset in advance" you mean explicitly tokenize and drop the unused columns ['output', 'input', 'instruction', 'text']?

from datasets import load_dataset, DatasetDict

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    return {"text": [alpaca_prompt.format(inst, inp, out) + EOS_TOKEN
                      for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])]}

# Load the dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["instruction", "input", "output"])

# Split the dataset into train, validation, and test sets
train_valid_test_split = dataset.train_test_split(test_size=0.1, seed=42)
train_valid_dataset = train_valid_test_split['train']
test_dataset = train_valid_test_split['test']

train_valid_split = train_valid_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_valid_split['train']
val_dataset = train_valid_split['test']


def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(
        examples["text"],
        truncation=True,
        max_length=tokenizer.model_max_length,
        padding="max_length",
        # return_tensors="pt"
    )
    return tokenized_inputs

tokenized_train_no_text = train_dataset.map(tokenize_and_align_labels, batched=False, remove_columns=["text"])
tokenized_val_no_text = val_dataset.map(tokenize_and_align_labels, batched=False, remove_columns=["text"])
tokenized_test_no_text = test_dataset.map(tokenize_and_align_labels, batched=False, remove_columns=["text"])

Now it has only the input_ids and attention_mask columns after tokenization. Is this right?


Maybe yes. Actually, it would probably be cleaner to use DataCollator, but if we suspect a bug in the library, it’s better to do it this way.

drop unused columns [‘output’, ‘input’, ‘instruction’, ‘text’]

just this.

Okay, let me do it plainly first, and then if required I will use a DataCollator.


:grimacing:
I took only 10 texts. I also tried taking just 1 text.

When I do result = trainer.evaluate(tokenized_test_no_text), I get the error below:

OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 58.12 MiB is free. Process 2353 has 14.68 GiB memory in use. Of the allocated memory 14.50 GiB is allocated by PyTorch, and 30.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation)


Perhaps it’s because it’s SFTTrainer.:sweat_smile: Apparently VRAM consumption is high.
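As the error message itself suggests, you could try setting the allocator option before loading the model, and keeping the evaluation batch small (a sketch; the TrainingArguments values are only examples):

import os
# Must be set before the first CUDA allocation (i.e. before loading the model)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# In TrainingArguments, a small eval batch also helps, for example:
# per_device_eval_batch_size=1, eval_accumulation_steps=1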

I have solved this problem with:

model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

Yeah, I am using almost the same settings (better ones, of course). I thought I needed to reduce the max length and try; I reduced it to 100 and am still getting the out-of-memory error. I also turned off compute metrics for now.

from unsloth import FastLanguageModel
import torch
max_seq_length = 100 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B", # or choose "unsloth/Llama-3.2-1B"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

I tried in multiple environments (Colab, Kaggle, and Lightning as well). Is there something wrong in the approach?

I think we shouldn't pass tokenized inputs directly. This is giving me the out-of-memory issue.

Here is my updated SFTTrainer code. I tried my best to reduce the data and everything else, except that I am still providing tokenized inputs. Training also went well, but running the evaluate() function throws the memory error.

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_train_no_text,
    # eval_dataset=tokenized_val_no_text,
    # dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args
)

If you want to pass tokenized data, you probably need to write your own evaluator and DataCollator. If you just want to pass the dataset with the unnecessary columns removed, it might work with the default ones…
By the way, it looks like there is a way to use QLoRA to reduce VRAM.

How to customize the DataCollator? Well, just select the data you want to pass.
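A rough sketch of what I mean (untested; the class name is illustrative, and the -100 masking is just the usual causal-LM labeling convention):

class KeepModelInputsCollator:
    # Minimal custom collator: keep only the keys the model understands,
    # pad them to a common length, and reuse input_ids as labels.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        batch = self.tokenizer.pad(
            [{"input_ids": f["input_ids"], "attention_mask": f["attention_mask"]} for f in features],
            return_tensors="pt",
        )
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding positions in the loss
        batch["labels"] = labels
        return batch

# trainer = SFTTrainer(..., data_collator=KeepModelInputsCollator(tokenizer), ...)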

Finally I was able to test it: compute_metrics=compute_metrics is not working with SFTTrainer.

The metrics I can see using the evaluate() function are just the defaults.

The final code is here. We do not need to pass a tokenized dataset; SFTTrainer can handle it. The only thing is to make sure to drop the unused columns ["instruction", "input", "output"]; the "text" field is needed.

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args,
    # compute_metrics=compute_metrics
)

If using the regular Trainer, we need to pass tokenized input, but when I pass tokenized data I run into the out-of-memory issue. I may need to test further with other fine-tuning methods like QLoRA, as well as with a DataCollator, to see whether the out-of-memory issue resolves.
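For reference, the Unsloth-style QLoRA setup I plan to try looks like this (values copied from Unsloth's standard examples, so treat them as a starting point only):

from unsloth import FastLanguageModel

# LoRA adapters on top of the 4-bit base model (QLoRA-style fine-tuning)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # reduces VRAM use for long contexts
    random_state=3407,
)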
