Hello, I’m fine-tuning the pretrained gpt2-medium model on some additional text; however, trainer.train() fails with the following error:
ValueError: Predictions and/or references don't match the expected format.
Expected format: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)},
Input predictions: [[ 262 2050 11 ... 341 286 13105]
[ 12 20179 357 ... 30523 4890 18349]
[ 286 2098 13 ... 12 1983 64]
...
[ 290 262 402 ... 290 7564 4890]
[ 290 407 287 ... 464 9161 262]
[ 2785 286 262 ... 464 464 464]],
Input references: [[ 818 1944 3645 ... 16761 2650 290]
[ 1029 21546 6376 ... 287 30523 4890]
[18349 547 3688 ... 49 12 1983]
...
[ 1080 2884 262 ... 1080 284 1907]
[31155 475 635 ... 220 2514 13446]
[ 262 1988 286 ... 50256 50256 50256]]
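For context, my compute_metrics follows the standard accuracy-metric examples and looks roughly like this (a simplified sketch; the exact flattening/handling in my version may differ):

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # take the most likely token id at each position
    predictions = np.argmax(logits, axis=-1)
    # note: predictions and labels are still 2D (batch, seq_len) arrays at this point
    return accuracy.compute(predictions=predictions, references=labels)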
Here’s how I tokenized my dataset. I also registered a placeholder for unknown tokens to make sure every token maps to an integer ID, though I’m not sure whether that is what’s causing this.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium", force_download=True, resume_download=False)
# register a placeholder for unknown tokens
special_tokens_dict = {"unk_token": "<UNK>"}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
Additionally, I chunked the data to a max_seq_length of 50, and defined the pad token and padding side:
# GPT-2 has no pad token by default, so reuse the EOS token and pad on the left
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
print(tokenizer.pad_token, tokenizer.padding_side)
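For reference, the chunking itself looks roughly like this (a simplified sketch; my real code uses dataset.map, and the dataset variable and "text" column name here stand in for my actual ones):

def tokenize_and_chunk(examples):
    # split each document into fixed-length, padded blocks of 50 tokens
    tokens = tokenizer(examples["text"], truncation=True, max_length=50, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror the inputs
    return tokens

tokenized_dataset = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])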
When I load the model, I resize its embeddings to make sure it picks up the tokenizer changes:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
# grow the embedding matrix to cover the newly added <UNK> token
model.resize_token_embeddings(len(tokenizer))
assert tokenizer.unk_token == "<UNK>"
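Finally, the Trainer is wired up roughly like this (again a simplified sketch; the output path and TrainingArguments values here are placeholders, my actual ones differ):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-medium-finetuned",  # hypothetical output path
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()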
If anyone can point me in the right direction, I’d really appreciate it! I’m new to Hugging Face, so thanks so much for any help.