How to figure out corresponding arguments in PeftModel?

I am trying to fine-tune the Llama-2 model from Hugging Face using PEFT:

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=API)  # API holds the Hugging Face access token
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=3,
    quantization_config=bnb_config,
    device_map="auto",
    token=API,
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=2,
    lora_alpha=2,
    target_modules=[
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj"],
    lora_dropout=0.15,
    bias="none"
)

model = get_peft_model(model, config)
data = load_dataset("FinanceInc/auditor_sentiment")

tokenizer.pad_token = tokenizer.eos_token
data = data.map(lambda samples:tokenizer(samples["sentence"], return_tensors='pt', padding=True), batched=True)

data = data.rename_column('label', 'labels')
data

The data:

DatasetDict({
    train: Dataset({
        features: ['sentence', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 3877
    })
    test: Dataset({
        features: ['sentence', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 969
    })
})

Now, the model does accept the arguments 'input_ids' and 'attention_mask', since the code below produces a perfectly valid output:

input_ids = torch.tensor(data['train'][0]['input_ids'])
input_ids = torch.unsqueeze(input_ids, 0)
attention_mask = torch.tensor(data['train'][0]['attention_mask'])
attention_mask = torch.unsqueeze(attention_mask, 0)

output = model(input_ids=input_ids, attention_mask=attention_mask)
output

Output

SequenceClassifierOutputWithPast(loss={'logits': tensor([[-1.7373, 0.5537, 0.7510]], grad_fn=)}, logits=tensor([[-1.7373, 0.5537, 0.7510]], grad_fn=), past_key_values=None, hidden_states=None, attentions=None)
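
For reference, a labels tensor can be passed in the same way to get a scalar loss, which is what the Trainer ultimately optimizes (a quick sketch, not part of the original run):

labels = torch.tensor([data['train'][0]['labels']])  # shape (1,), class index in {0, 1, 2}
output = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
print(output.loss, output.logits)  # loss is the cross-entropy over num_labels=3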

So I try to use the Trainer class:

trainer = Trainer(
    model=model,
    train_dataset=data['train'],
    eval_dataset=data['test'],
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

But I keep getting errors:

Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The model is quantized. To train this model you need to add additional modules inside the model such as adapters using `peft` library and freeze the model weights. Please check the examples in https://github.com/huggingface/peft for more details.
max_steps is given, it will override any value given in num_train_epochs
The following columns in the training set don't have a corresponding argument in `PeftModel.forward` and have been ignored: sentence, labels, attention_mask, input_ids. If sentence, labels, attention_mask, input_ids are not expected by `PeftModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 0
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 10
  Number of trainable parameters = 5,009,408
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-59-d57a6efd24d9> in <cell line: 19>()
     17 )
     18 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 19 trainer.train()

11 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1537                 hf_hub_utils.enable_progress_bars()
   1538         else:
-> 1539             return inner_training_loop(
   1540                 args=args,
   1541                 resume_from_checkpoint=resume_from_checkpoint,

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1797 
   1798             step = -1
-> 1799             for step, inputs in enumerate(epoch_iterator):
   1800                 total_batched_samples += 1
   1801                 if rng_to_sync:

/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py in __iter__(self)
    382         # We iterate one batch ahead to check when we are at the end
    383         try:
--> 384             current_batch = next(dataloader_iter)
    385         except StopIteration:
    386             yield

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    631                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
    632                 self._reset()  # type: ignore[call-arg]
--> 633             data = self._next_data()
    634             self._num_yielded += 1
    635             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    675     def _next_data(self):
    676         index = self._next_index()  # may raise StopIteration
--> 677         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    678         if self._pin_memory:
    679             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     47         if self.auto_collation:
     48             if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 49                 data = self.dataset.__getitems__(possibly_batched_index)
     50             else:
     51                 data = [self.dataset[idx] for idx in possibly_batched_index]

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitems__(self, keys)
   2805     def __getitems__(self, keys: List) -> List:
   2806         """Can be used to get a batch using a list of integers indices."""
-> 2807         batch = self.__getitem__(keys)
   2808         n_examples = len(batch[next(iter(batch))])
   2809         return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitem__(self, key)
   2801     def __getitem__(self, key):  # noqa: F811
   2802         """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2803         return self._getitem(key)
   2804 
   2805     def __getitems__(self, keys: List) -> List:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in _getitem(self, key, **kwargs)
   2785         format_kwargs = format_kwargs if format_kwargs is not None else {}
   2786         formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
-> 2787         pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
   2788         formatted_output = format_table(
   2789             pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in query_table(table, key, indices)
    581     else:
    582         size = indices.num_rows if indices is not None else table.num_rows
--> 583         _check_valid_index_key(key, size)
    584     # Query the main table
    585     if indices is None:

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
    534     elif isinstance(key, Iterable):
    535         if len(key) > 0:
--> 536             _check_valid_index_key(int(max(key)), size=size)
    537             _check_valid_index_key(int(min(key)), size=size)
    538     else:

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
    524     if isinstance(key, int):
    525         if (key < 0 and key + size < 0) or (key >= size):
--> 526             raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
    527         return
    528     elif isinstance(key, slice):

IndexError: Invalid key: 2012 is out of bounds for size 0

How do I fix this? How do I find the corresponding arguments in the model’s defined forward method?

It looks like there’s an issue when creating the DataLoaders based on your dataset.

Any reason you're using DataCollatorWithPadding? The example script for causal language modeling uses the default data collator.

No specific reason for using DataCollatorWithPadding. I get the same error when using default_data_collator.

Can you verify that batches can be created properly based on your datasets?

Like so (assuming your dataset does not yet contain PyTorch tensors):

from torch.utils.data import DataLoader

train_dataset = data['train'].set_format("torch")

train_dataloader = DataLoader(train_dataset, batch_size=2)

for batch in train_dataloader:
   print(batch)

set_format('torch') returns None (it modifies the dataset in place), so that assignment leaves train_dataset as None. The following works, however, and the batches are created properly:

from torch.utils.data import DataLoader

train_dataset = data['train']

train_dataloader = DataLoader(train_dataset, batch_size=2)

for batch in train_dataloader:
   print(batch)
   break

I can't get the Trainer API to work, however.

Does it also work when specifying the collate function?

from transformers import DataCollatorWithPadding

train_dataloader = DataLoader(train_dataset, batch_size=2, collate_fn=DataCollatorWithPadding(tokenizer=tokenizer))

Seems to work fine with that as well.

from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=2, collate_fn=DataCollatorWithPadding(tokenizer=tokenizer))
for batch in train_dataloader:
   print(batch)
   break

Output:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'labels': tensor([2, 2]), 'input_ids': tensor([[    1, 10790,   423,   525, 29879, 13598, 21665, 12500,   287,   304,
           382,  4574, 29871, 29946, 29955,  7284,   515,   382,  4574, 29871,
         29953, 29889, 29953,  7284,   869,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2],
        [    1,   450, 17327,   471,  8794,   411, 21184, 27342, 15202, 18020,
         19806,  1919,   278, 10261, 29899,  6707, 11684,  8819,   653,   310,
         21184, 27342,   438, 17472,  1919,   263, 21189,   728,   970,  5001,
           607,  2693, 29879,  1919, 12012,  1973,   322,  2791,  1691, 23904,
         11415,  9316,   322,   652, 21780,  1243,  6757,   869,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

The error message and stack trace you provided indicate multiple issues and warnings in the script. Let’s break them down for clarity:

  1. Safetensors Warning
    The warning about safetensors suggests that although safetensors is installed, it is not being used (--save_safetensors=False). Safetensors is recommended for security and performance reasons. To resolve this, consider setting save_safetensors to True if applicable to your model.

  2. Device Setup Confirmation
    PyTorch: setting up devices is a standard informational message indicating that PyTorch is configuring its computation devices (like GPUs).

  3. Future Change in Training Arguments
    The script warns about a future change in the default value of the --report_to argument. You may need to explicitly set report_to='all' in future versions.

  4. Model Quantization and PEFT Library
    The message suggests that the model is quantized and recommends using the peft library for adding additional modules (like adapters) and freezing model weights. Check the PEFT examples for guidance.

  5. Training Parameters
    max_steps is given, it will override any value given in num_train_epochs. This is informational, indicating that the max_steps parameter takes precedence over num_train_epochs.

  6. Ignored Columns in Training Set
    The Trainer dropped every column of the training set (sentence, labels, attention_mask, input_ids) because none of them has a corresponding named argument in PeftModel.forward. Despite the "you can safely ignore this message" wording, this is the root cause here: with all columns removed, nothing is left to train on. The sketch right after this breakdown shows how to check which arguments the forward method actually exposes.

  7. Training Execution Summary
    This part summarizes the training configuration: the number of examples, epochs, batch size, and so on. Notably, it says Num examples = 0, which is the direct consequence of all columns being dropped in the previous step rather than of the dataset itself being empty (it has 3877 training rows).

  8. IndexError
    The failure itself. The DataLoader requests row 2012, but after the column pruning the Trainer's view of the dataset has size 0, hence "Invalid key: 2012 is out of bounds for size 0".
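
To answer the original question directly, here is a minimal sketch (assuming the model variable from the first post, i.e. after get_peft_model) that lists the arguments exposed by the wrapped model's forward method; this mirrors what the Trainer inspects when deciding which dataset columns to keep:

import inspect

# Named parameters of the PEFT wrapper's forward -- what the Trainer matches columns against.
print(list(inspect.signature(model.forward).parameters))

# Named parameters of the underlying Llama classification model's forward.
print(list(inspect.signature(model.get_base_model().forward).parameters))

If the first call only returns generic names such as args and kwargs, none of the dataset columns (input_ids, attention_mask, labels) can be matched, and the Trainer drops them all.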

To resolve these issues:

Dataset Loading: Ensure your dataset is correctly loaded and not empty; Num examples should be greater than 0.
Column Mismatch: Align the columns in your dataset with the inputs expected by PeftModel.forward, or keep the Trainer from dropping them (see the sketch of possible fixes after this list).
Safetensors and Quantization: If applicable, enable safetensors and follow the PEFT library guidelines for dealing with quantized models.
Training Arguments: Prepare for the upcoming change in report_to by setting it explicitly if necessary.
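
For the column mismatch specifically, here is a sketch of two things worth trying (suggestions based on the error above, not something verified in this thread); the same TrainingArguments block also shows where save_safetensors and report_to can be set explicitly:

# Option 1: declare the task type so that get_peft_model returns a
# sequence-classification wrapper whose forward names input_ids,
# attention_mask and labels, letting the Trainer keep those columns.
from peft import LoraConfig, TaskType, get_peft_model

config = LoraConfig(
    r=2,
    lora_alpha=2,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.15,
    bias="none",
    task_type=TaskType.SEQ_CLS,  # the addition compared to the original config
)
model = get_peft_model(model, config)

# Option 2: keep the existing setup, but tell the Trainer not to prune columns
# it cannot match against the forward signature.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    optim="paged_adamw_8bit",
    remove_unused_columns=False,  # do not drop input_ids / attention_mask / labels
    save_safetensors=True,        # addresses the safetensors warning
    report_to="none",             # silences the report_to deprecation notice
)

Either option can then be passed to the Trainer exactly as in the original snippet.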

After addressing these issues, rerun the script and check if the errors are resolved. If the IndexError persists, verify the dataset’s integrity and loading process.