Special Characters TrOCR

Is there a way to ignore the U+FFFD � replacement character when decoding generated ids?

You can pass bad_words_ids to the generate method to make sure certain tokens are never generated; see Utilities for Generation.
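A minimal sketch of that suggestion, assuming a Hugging Face-style tokenizer and model. `bad_words_ids` is an existing `generate` argument that takes a list of token-id lists. The helper names `build_bad_words_ids` and `generate_without` are hypothetical and only there for illustration.

```python
from typing import Any, List


def build_bad_words_ids(tokenizer: Any, words: List[str]) -> List[List[int]]:
    """Map each banned string to the token ids the tokenizer assigns to it."""
    bad_words_ids = []
    for word in words:
        # No special tokens: only the ids of the string itself should be banned.
        ids = tokenizer(word, add_special_tokens=False).input_ids
        if ids:  # skip strings that tokenize to nothing
            bad_words_ids.append(list(ids))
    return bad_words_ids


def generate_without(model: Any, tokenizer: Any, pixel_values: Any, words: List[str]) -> Any:
    """Run generation while banning the token sequences for `words` (hypothetical wrapper)."""
    return model.generate(
        pixel_values,
        bad_words_ids=build_bad_words_ids(tokenizer, words),
    )
```

Note that this bans exact token sequences only, so it may not help when an unwanted character comes from a combination of tokens rather than from one dedicated token.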

Thank you for the reply. It seems that it does not work for �.

Okay, I think I found the problem.
� is shown when the generated tokens form a character that does not actually exist.

For example, here I got 48820, 48820 (repeated twice instead of once), and that is why
I get the �.
tensor([[    0,     0, 49173, 48247, 46311, 23133,  1437, 48827, 48820, 48820,
         23171, 49188, 27969, 47783,  5782, 48718, 48718, 27969, 27969, 48718,
         47780, 46311,  3602,  3602, 27819, 49045, 49045, 49045, 48897, 48897,
          9357,  9357,  9470,  9470,  9470,     2]], device='cuda:0')
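The effect can be reproduced without the model. The sketch below assumes the tokenizer is byte-level (GPT-2/RoBERTa-style BPE), so that tokens map to raw bytes which are decoded as UTF-8 with invalid sequences replaced. Under that assumption, a repeated continuation byte yields U+FFFD. The bytes used here are illustrative and are not the actual bytes behind token 48820; `decode_bytes` and `strip_replacement_chars` are hypothetical helpers.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, rendered as �


def decode_bytes(raw: bytes) -> str:
    """Decode UTF-8 leniently, as a byte-level tokenizer is assumed to do."""
    return raw.decode("utf-8", errors="replace")


def strip_replacement_chars(text: str) -> str:
    """Drop every U+FFFD from an already decoded string."""
    return text.replace(REPLACEMENT_CHAR, "")


valid = "é".encode("utf-8")   # two bytes: 0xC3 0xA9
broken = valid + valid[-1:]   # last byte repeated: 0xC3 0xA9 0xA9

print(decode_bytes(valid))                             # é
print(decode_bytes(broken))                            # é followed by U+FFFD
print(strip_replacement_chars(decode_bytes(broken)))   # é
```

If the goal is only to hide the character, the same `strip_replacement_chars` call could be applied to the strings returned by the processor's decoding step, though that hides the symptom rather than fixing the repeated token.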

Is there anything I can do to prevent token repetition? Or is there something I am doing wrong?
:hugs:
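Two things that might be worth trying for the repetition, sketched under assumptions. `repetition_penalty` and `no_repeat_ngram_size` are existing `generate` arguments, but the values below are untested guesses for this model. `collapse_consecutive_duplicates` is a hypothetical post-processing helper; it also removes legitimate repeats, so treat it as a rough workaround rather than a fix.

```python
from itertools import groupby
from typing import List

# Assumed usage: model.generate(pixel_values, **GENERATION_KWARGS)
GENERATION_KWARGS = {
    "repetition_penalty": 1.2,   # down-weights tokens that were already generated
    "no_repeat_ngram_size": 3,   # forbids any 3-token sequence from occurring twice
}


def collapse_consecutive_duplicates(token_ids: List[int]) -> List[int]:
    """Keep one id from every run of identical consecutive ids."""
    return [token_id for token_id, _run in groupby(token_ids)]


# First ids from the tensor above, copied by hand.
ids = [0, 0, 49173, 48247, 46311, 23133, 1437, 48827, 48820, 48820, 23171]
print(collapse_consecutive_duplicates(ids))
```

The collapsed ids would then be passed to the tokenizer's decode step in place of the raw output. Heavy repetition like this can also point to a problem upstream (for example in fine-tuning or preprocessing), which generation settings alone may not resolve.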