Hello, I’m fine-tuning the pretrained gpt2-medium model on some additional text; however, trainer.train() fails with the following error:
ValueError: Predictions and/or references don't match the expected format.
Expected format: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)},
Input predictions: [[ 262 2050 11 ... 341 286 13105]
[ 12 20179 357 ... 30523 4890 18349]
[ 286 2098 13 ... 12 1983 64]
...
[ 290 262 402 ... 290 7564 4890]
[ 290 407 287 ... 464 9161 262]
[ 2785 286 262 ... 464 464 464]],
Input references: [[ 818 1944 3645 ... 16761 2650 290]
[ 1029 21546 6376 ... 287 30523 4890]
[18349 547 3688 ... 49 12 1983]
...
[ 1080 2884 262 ... 1080 284 1907]
[31155 475 635 ... 220 2514 13446]
[ 262 1988 286 ... 50256 50256 50256]]
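For context, my compute_metrics follows the standard accuracy-metric examples and looks roughly like this (a simplified sketch; the exact flattening/handling in my version may differ):

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # take the most likely token id at each position
    predictions = np.argmax(logits, axis=-1)
    # note: predictions and labels are still 2D (batch, seq_len) arrays at this point
    return accuracy.compute(predictions=predictions, references=labels)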
Here’s how I tokenized my dataset. I also registered a placeholder for unknown tokens to make sure every token maps to an integer ID, though I’m not sure whether that is what’s causing this.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium", force_download=True, resume_download=False)
# register a placeholder for unknown tokens
special_tokens_dict = {"unk_token": "<UNK>"}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
Additionally, I chunked the data to a max_seq_length of 50, and defined the pad token and padding side:
# GPT-2 has no pad token by default, so reuse the EOS token and pad on the left
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
print(tokenizer.pad_token, tokenizer.padding_side)
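For reference, the chunking itself looks roughly like this (a simplified sketch; my real code uses dataset.map, and the dataset variable and "text" column name here stand in for my actual ones):

def tokenize_and_chunk(examples):
    # split each document into fixed-length, padded blocks of 50 tokens
    tokens = tokenizer(examples["text"], truncation=True, max_length=50, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror the inputs
    return tokens

tokenized_dataset = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])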
When I load the model, I resize its embeddings to make sure it picks up the tokenizer changes:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
# grow the embedding matrix to cover the newly added <UNK> token
model.resize_token_embeddings(len(tokenizer))
assert tokenizer.unk_token == "<UNK>"
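Finally, the Trainer is wired up roughly like this (again a simplified sketch; the output path and TrainingArguments values here are placeholders, my actual ones differ):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-medium-finetuned",  # hypothetical output path
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()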
If anyone can point me in the right direction, I’d really appreciate it! I’m new to Hugging Face, so thanks so much for any help.