Hi, I’m doing the Translation section of the chapter, this part.
After running this line:

```python
trainer.evaluate(max_length=max_length)
```

I get a warning:

> That’s 100 lines that end in a tokenized period (‘.’)
> It looks like you forgot to detokenize your test data, which may hurt your score.
> If you insist your data is detokenized, or don’t care, you can suppress this message with the `force` parameter.
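Here is how I’d inspect a few decoded predictions to see what the warning is complaining about. This is just a rough sketch reusing the tutorial’s names (`trainer`, `tokenizer`, `tokenized_datasets`, `max_length`); the small validation slice is only to keep it fast:

```python
import numpy as np

# Generate predictions for a handful of validation examples
# and print them raw, to see whether they end in a tokenized " ."
sample = tokenized_datasets["validation"].select(range(8))
output = trainer.predict(sample, max_length=max_length)

pred_ids = output.predictions
# Replace any -100 padding so the tokenizer can decode it
pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
decoded = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
for line in decoded:
    print(repr(line))  # a space before the final "." would mean tokenized output
```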
I am doing basically everything as in the tutorial; I only changed `fp16` to `False` and lowered `per_device_eval_batch_size` to 32 (apparently not enough memory on my MacBook). This is before training, and I get a BLEU score of 17 while the tutorial gets 39. So I don’t know; I don’t see anything I may have skipped, and the tutorial code snippet explicitly passes `tokenized_datasets` as an argument to the `Trainer`, so I’m a bit confused. I wasn’t able to figure it out myself.
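Concretely, these are the only two fields I changed in the tutorial’s `Seq2SeqTrainingArguments`; everything else is copied verbatim, so I’m omitting it here:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",
    # ...all other arguments exactly as in the tutorial...
    per_device_eval_batch_size=32,  # tutorial uses 64; lowered for my MacBook's memory
    fp16=False,                     # tutorial uses True
    predict_with_generate=True,
)
```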
Earlier, while loading the model, I got “UserWarning: Recommended: pip install sacremoses.” and I did install it, but afterwards I didn’t re-run the model-loading cell in my Jupyter notebook (the line `model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)`).
Could this be the culprit? Maybe sacremoses has to be present as early as the model-loading stage, otherwise some default HF tokenizer is used? But so far everything has worked fine, meaning the tokenized examples from the tutorial were the same as the ones generated by my code. Maybe the tokenization is basically the same as the default fallback, but sacremoses also provides detokenization, which the default does not?
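One thing I can at least verify is whether the current kernel actually sees sacremoses, and whether a freshly loaded tokenizer round-trips a sentence without a space before the period. A minimal sketch, with `model_checkpoint` being the Helsinki-NLP checkpoint from the tutorial:

```python
import importlib.util

# Is sacremoses importable in *this* kernel (not just installed on disk)?
print("sacremoses found:", importlib.util.find_spec("sacremoses") is not None)

from transformers import AutoTokenizer

# Re-load the tokenizer after installing sacremoses, so the optional
# dependency is actually picked up this time
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
sample = "This is a test sentence."
ids = tokenizer(sample)["input_ids"]
# If decoding round-trips cleanly, there should be no space before the "."
print(repr(tokenizer.decode(ids, skip_special_tokens=True)))
```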
I will try restarting the environment and rerunning the code. But in case that doesn’t work, maybe you will know the answer and be able to help.
Edit: OK, I reran everything, but the eval looks considerably slower than before, so it will be a good few hours before it finishes and I know the answer. Maybe the slowdown comes from sacremoses actually being used now, which would mean that earlier it wasn’t, as I hypothesized.
Edit 2: I got the same message about detokenization and dots, and the same BLEU score. I do not have any other ideas.
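For completeness, this is the metric function I’m running. As far as I can tell it’s identical to the tutorial’s, with `metric` being the sacrebleu metric loaded earlier in the notebook:

```python
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels, since we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Simple post-processing: strip whitespace; sacrebleu expects a list of references
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
```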