ValueError: Predictions and/or references don't match the expected format

Hello, I’m using the pretrained gpt2-medium model to fine-tune on some additional text; however, I’m seeing the following error from trainer.train():

ValueError: Predictions and/or references don't match the expected format.
Expected format: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)},
Input predictions: [[  262  2050    11 ...   341   286 13105]
 [   12 20179   357 ... 30523  4890 18349]
 [  286  2098    13 ...    12  1983    64]
 ...
 [  290   262   402 ...   290  7564  4890]
 [  290   407   287 ...   464  9161   262]
 [ 2785   286   262 ...   464   464   464]],
Input references: [[  818  1944  3645 ... 16761  2650   290]
 [ 1029 21546  6376 ...   287 30523  4890]
 [18349   547  3688 ...    49    12  1983]
 ...
 [ 1080  2884   262 ...  1080   284  1907]
 [31155   475   635 ...   220  2514 13446]
 [  262  1988   286 ... 50256 50256 50256]]

Here’s how I tokenized my dataset. I also added an unknown token to make sure every token maps to an integer id, though I’m not sure whether that is part of the problem.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium", force_download=True, resume_download=False)
# register an explicit unknown token so every token maps to an integer id
special_tokens_dict = {"unk_token": "<UNK>"}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

Additionally, I chunked the data into blocks of max_seq_length = 50, and set the pad token and padding side:

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
print(tokenizer.pad_token, tokenizer.padding_side)
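
The chunking itself looks roughly like this (a simplified sketch, not my exact preprocessing; it assumes the raw dataset has a "text" column and that both functions are applied with dataset.map(..., batched=True)):

max_seq_length = 50

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # concatenate all token ids, then split them into fixed-length blocks
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // max_seq_length) * max_seq_length
    input_ids = [
        concatenated[i : i + max_seq_length]
        for i in range(0, total_length, max_seq_length)
    ]
    # for causal LM fine-tuning, the labels are the input ids themselves
    return {"input_ids": input_ids, "labels": [block[:] for block in input_ids]}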

When I load the model, to make sure it picks up the tokenizer changes, I run:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
# resize the embedding matrix so the added <UNK> token gets an embedding row
model.resize_token_embeddings(len(tokenizer))
assert tokenizer.unk_token == "<UNK>"
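
For completeness, the Trainer wiring looks roughly like this (a sketch; train_ds, eval_ds, and compute_metrics are placeholders for my actual dataset splits and metric function):

from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(output_dir="gpt2-medium-finetuned")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # the failing format check happens here
)
trainer.train()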

If anyone can point me in the right direction I’d greatly appreciate it! I’m new to Hugging Face, thanks so much for any help.

Update: I think the issue is that the metric expects a flat list of ints [], whereas my inputs were lists of lists [[]] (one inner list per sequence). I flattened the predictions and references into single lists of ints and it seems to work on a small test dataset. A rough sketch of the fix is below.
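
Roughly, the fixed metric function looks like this (a minimal sketch, assuming an accuracy-style metric loaded via the evaluate library; the actual metric and label handling in your setup may differ):

import numpy as np
import evaluate

metric = evaluate.load("accuracy")  # expects flat 1-D sequences of ints

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # predictions and labels arrive as (batch_size, seq_len) arrays, i.e.
    # lists of lists; the metric wants flat lists, so flatten both
    return metric.compute(
        predictions=predictions.flatten(),
        references=labels.flatten(),
    )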

I’m running into the same error. Can you post a snippet of your code?

I have the same problem.
Can you share your solution?