Batch tensor creation error when finetuning gpt2

Python: 3.7.6
Transformers: 4.17.0
Datasets: 2.0.0
Tokenizers: 0.11.6
PyTorch: 1.7.0
OS: Pop!_OS 21.10

I have the following code for finetuning gpt2:

import pandas as pd
import datasets
from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling, GPT2LMHeadModel, TrainingArguments, Trainer
import numpy as np

ppl_metric = datasets.load_metric('perplexity')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return ppl_metric.compute(predictions=predictions, references=labels)

sample_set = pd.read_csv('./data.csv', encoding='ISO-8859-1')
sample_ds = datasets.Dataset.from_pandas(sample_set['cleaned_spacy_stopped'].to_frame())
sample_ds = sample_ds.train_test_split(test_size=0.1)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
def tokenize_data(examples):
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']], padding=True)

tokenized_ds =,

block_size = 256
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset =,

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

model = GPT2LMHeadModel.from_pretrained('gpt2')
training_args = TrainingArguments(

trainer = Trainer(


and I get the following error:

/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/ FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
***** Running training *****
  Num examples = 384
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 144
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
2022-03-24 17:42:58.679514: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2022-03-24 17:42:58.679545: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
wandb: Tracking run with wandb version 0.12.11
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  3%|████▎                                                                                                                      | 5/144 [00:54<26:12, 11.31s/it]
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/", line 708, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 256 at dim 1 (got 65)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 98, in <module>
    mt.time_func(trainer.train, print_str='train.train()')
  File "/home/aclifton/gpt2_dm/", line 9, in wrapper_timer
    value, str_to_print = func(*args, **kwargs)
  File "/home/aclifton/gpt2_dm/", line 26, in time_func
    output = f(*args, **kwargs)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/", line 1374, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/", line 435, in __next__
    data = self._next_data()
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/", line 47, in fetch
    return self.collate_fn(data)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/", line 41, in __call__
    return self.torch_call(features)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/", line 729, in torch_call
    batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/", line 2862, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/", line 213, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/", line 725, in convert_to_tensors
    "Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/gpt2_dm/wandb/offline-run-20220324_174257-k8vydnze
wandb: Find logs at: ./wandb/offline-run-20220324_174257-k8vydnze/logs

I tried adding truncation=True and got the same error. I was originally following the documentation for dynamic padding with DataCollatorForLanguageModeling, and that gave the same error too.
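To illustrate where a length like 65 could come from, here is a minimal pure-Python sketch of the slicing that group_texts performs, run on made-up token ids. The `chunk` helper and the total length of 833 tokens are hypothetical, chosen only for illustration. Since the total length is not rounded down to a multiple of block_size before slicing, the last block comes out shorter than the others. Whether this is what actually trips the collator in my run is an assumption.

```python
# Pure-Python sketch of the chunking step in group_texts (no transformers needed).
block_size = 256


def chunk(tokens, block_size):
    """Split a flat token list into consecutive blocks, keeping any remainder."""
    return [tokens[i : i + block_size] for i in range(0, len(tokens), block_size)]


# Hypothetical concatenated token ids: three full blocks plus 65 leftover tokens.
tokens = list(range(3 * block_size + 65))

# Slicing as written in group_texts: the remainder becomes a short final block.
blocks = chunk(tokens, block_size)
block_lengths = [len(b) for b in blocks]
print(block_lengths)  # [256, 256, 256, 65]

# Rounding the total down to a multiple of block_size drops the remainder,
# so every block ends up exactly block_size long.
usable = (len(tokens) // block_size) * block_size
even_blocks = chunk(tokens[:usable], block_size)
even_lengths = [len(b) for b in even_blocks]
print(even_lengths)  # [256, 256, 256]
```

Because labels is a copy of input_ids, a short final block would leave both columns with unequal lengths inside a batch.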

Any thoughts about what I might be doing wrong? I’d like to use dynamic padding if possible. Thanks in advance!
