Python: 3.7.6
Transformers: 4.17.0
Datasets: 2.0.0
Tokenizers: 0.11.6
PyTorch: 1.7.0
OS: Pop!_OS 21.10
I have the following code for fine-tuning GPT-2:
import pandas as pd
import datasets
from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling, GPT2LMHeadModel, TrainingArguments, Trainer
import numpy as np
ppl_metric = datasets.load_metric('perplexity')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return ppl_metric.compute(predictions=predictions, references=labels)
sample_set = pd.read_csv('./data.csv', encoding='ISO-8859-1')
sample_ds = datasets.Dataset.from_pandas(sample_set['cleaned_spacy_stopped'].to_frame())
sample_ds = sample_ds.train_test_split(test_size=0.1)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
def tokenize_data(examples):
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']], padding=True)
tokenized_ds = sample_ds.map(tokenize_data,
                             batched=True,
                             num_proc=4,
                             remove_columns=sample_ds['train'].column_names)
block_size = 256
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_dataset = tokenized_ds.map(group_texts,
                              batched=True,
                              num_proc=4)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
model = GPT2LMHeadModel.from_pretrained('gpt2')
training_args = TrainingArguments(
    output_dir='./models',
    evaluation_strategy='epoch',
    report_to='wandb'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)
trainer.train()
and I get the following error:
/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 384
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 144
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
2022-03-24 17:42:58.679514: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-03-24 17:42:58.679545: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
wandb: Tracking run with wandb version 0.12.11
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
3%|████▎ | 5/144 [00:54<26:12, 11.31s/it]Traceback (most recent call last):
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 708, in convert_to_tensors
tensor = as_tensor(value)
ValueError: expected sequence of length 256 at dim 1 (got 65)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_finetuning.py", line 98, in <module>
mt.time_func(trainer.train, print_str='train.train()')
File "/home/aclifton/gpt2_dm/method_timer.py", line 9, in wrapper_timer
value, str_to_print = func(*args, **kwargs)
File "/home/aclifton/gpt2_dm/method_timer.py", line 26, in time_func
output = f(*args, **kwargs)
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1374, in train
for step, inputs in enumerate(epoch_iterator):
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 41, in __call__
return self.torch_call(features)
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 729, in torch_call
batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2862, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 213, in __init__
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 725, in convert_to_tensors
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/gpt2_dm/wandb/offline-run-20220324_174257-k8vydnze
wandb: Find logs at: ./wandb/offline-run-20220324_174257-k8vydnze/logs
I tried adding truncation=True to the tokenizer call and got the same error. I was also originally following the documentation here for dynamic padding with DataCollatorForLanguageModeling and hit the same error.
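For reference, this is roughly what the tokenization function looked like when I tried truncation (same column and join as above, nothing else changed):

def tokenize_data(examples):
    # same as above, but with truncation enabled; max_length is left at the
    # model default (1024 for gpt2)
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']],
                     padding=True,
                     truncation=True)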
Any thoughts about what I might be doing wrong? I'd be interested in using dynamic padding if possible; the sketch below shows roughly the setup I have in mind. Thanks in advance!
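This is only a sketch based on my reading of the docs, not something I have working: skip padding at tokenization time and let DataCollatorForLanguageModeling pad each batch on the fly (I'm not sure how, or whether, this should be combined with the group_texts step).

# sketch: no padding=True here, so examples keep their natural lengths
def tokenize_data(examples):
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']])

tokenized_ds = sample_ds.map(tokenize_data,
                             batched=True,
                             num_proc=4,
                             remove_columns=sample_ds['train'].column_names)

# mlm=False keeps the causal LM objective; the collator pads each batch to the
# longest example in that batch and sets labels at padded positions to -100
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)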