Batch tensor creation error when finetuning gpt2

Python: 3.7.6
Transformers: 4.17.0
Datasets: 2.0.0
Tokenizers: 0.11.6
Pytorch: 1.7.0
OS: Pop!_OS 21.10

I have the following code for fine-tuning GPT-2:

import pandas as pd
import datasets
from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling, GPT2LMHeadModel, TrainingArguments, Trainer
import numpy as np


ppl_metric = datasets.load_metric('perplexity')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return ppl_metric.compute(predictions=predictions, references=labels)


sample_set = pd.read_csv('./data.csv', encoding='ISO-8859-1')
sample_ds = datasets.Dataset.from_pandas(sample_set['cleaned_spacy_stopped'].to_frame())
sample_ds = sample_ds.train_test_split(test_size=0.1)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
def tokenize_data(examples):
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']], padding=True)

tokenized_ds = sample_ds.map(tokenize_data,
                             batched=True,
                             num_proc=4,
                             remove_columns=sample_ds['train'].column_names)


block_size = 256
def group_texts(examples):
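    # concatenate all token lists, then split them into block_size chunks;
    # note the final chunk can end up shorter than block_size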
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_ds.map(group_texts,
                              batched=True,
                              num_proc=4)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


model = GPT2LMHeadModel.from_pretrained('gpt2')
training_args = TrainingArguments(
    output_dir='./models',
    evaluation_strategy='epoch',
    report_to='wandb'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

trainer.train()

and I get the following error:

/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 384
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 144
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
2022-03-24 17:42:58.679514: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-03-24 17:42:58.679545: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
wandb: Tracking run with wandb version 0.12.11
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  3%|████▎                                                                                                                      | 5/144 [00:54<26:12, 11.31s/it]Traceback (most recent call last):
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 708, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 256 at dim 1 (got 65)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_finetuning.py", line 98, in <module>
    mt.time_func(trainer.train, print_str='train.train()')
  File "/home/aclifton/gpt2_dm/method_timer.py", line 9, in wrapper_timer
    value, str_to_print = func(*args, **kwargs)
  File "/home/aclifton/gpt2_dm/method_timer.py", line 26, in time_func
    output = f(*args, **kwargs)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1374, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 41, in __call__
    return self.torch_call(features)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 729, in torch_call
    batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2862, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 213, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 725, in convert_to_tensors
    "Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

wandb: Waiting for W&B process to finish... (failed 1).
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/gpt2_dm/wandb/offline-run-20220324_174257-k8vydnze
wandb: Find logs at: ./wandb/offline-run-20220324_174257-k8vydnze/logs

I tried adding truncation=True to the tokenizer call and got the same error. I was also originally following the documentation here on dynamic padding with DataCollatorForLanguageModeling and hit the same error.
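
For reference, here is roughly what the tokenizer call looked like with truncation enabled (the explicit max_length=1024 is just for illustration, matching GPT-2’s context window):

def tokenize_data(examples):
    # pad and truncate every example up front
    return tokenizer(
        [" ".join(x) for x in examples['cleaned_spacy_stopped']],
        padding=True,
        truncation=True,
        max_length=1024,  # GPT-2's maximum context length
    )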

Any thoughts about what I might be doing wrong? Thanks in advance! I’d be interested in using dynamic padding if possible.
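
To clarify what I mean by dynamic padding, this is just a sketch of what I’m after (not something I have working): skip padding at tokenization time and let the collator pad each batch to its longest example.

def tokenize_data(examples):
    # no padding here; the collator pads each batch at training time
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']])

# DataCollatorForLanguageModeling calls tokenizer.pad on each batch,
# so padding happens per batch rather than over the whole dataset
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)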

Any thoughts?

I’m experiencing the same thing.
Please let me know if you’ve solved it.