Special Characters TrOCR

Wolf390ru2 · May 13, 2023, 8:38am

Is there a way to ignore the U+FFFD � replacement character when decoding generated ids?

nielsr · May 13, 2023, 9:08am

You can pass bad_words_ids to the generate method to make sure certain tokens are not generated; see Utilities for Generation.
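
For reference, a minimal sketch of what that call could look like. `bad_words_ids` expects a list of token-id lists, one inner list per banned sequence. The helper names `build_bad_words_ids` and `generate_without` are hypothetical, and `model` is assumed to be an already loaded TrOCR model with `pixel_values` coming from its processor.

```python
from typing import Iterable, List


def build_bad_words_ids(token_ids: Iterable[int]) -> List[List[int]]:
    """Wrap each banned token id in its own list, the shape bad_words_ids expects."""
    return [[int(token_id)] for token_id in token_ids]


def generate_without(model, pixel_values, banned_token_ids: Iterable[int]):
    """Call generate() while forbidding the given single-token ids.

    `model` is assumed to be a loaded TrOCR (VisionEncoderDecoderModel) instance;
    this helper is illustrative and not part of the transformers API.
    """
    return model.generate(
        pixel_values,
        bad_words_ids=build_bad_words_ids(banned_token_ids),
    )
```

Note that this only blocks specific token ids. It does not help if the unwanted character is not a token of its own.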

Wolf390ru2 · May 13, 2023, 12:40pm

Thank you for the reply. It does not seem to work for �, though.

Wolf390ru2 · May 14, 2023, 5:59am

Okay, I think I found the problem.
� is shown when the generated tokens form a character that does not actually exist.

For example, here I got 48820, 48820 (repeated twice instead of once), and that is why I get the �.
tensor([[ 0, 0, 49173, 48247, 46311, 23133, 1437, 48827, 48820, 48820,
23171, 49188, 27969, 47783, 5782, 48718, 48718, 27969, 27969, 48718,
47780, 46311, 3602, 3602, 27819, 49045, 49045, 49045, 48897, 48897,
9357, 9357, 9470, 9470, 9470, 2]], device='cuda:0')
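
A small illustration of why this can happen, assuming the tokenizer is byte-level (so a single token may hold only part of a multi-byte UTF-8 character). The sketch uses plain Python bytes rather than the real tokenizer, and the helper names `decode_bytes` and `strip_replacement` are made up for this example. A repeated or dangling lead byte cannot be decoded, so the decoder substitutes U+FFFD. The last helper shows one way to simply drop � from the decoded string afterwards.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the replacement character


def decode_bytes(raw: bytes) -> str:
    """Decode UTF-8 leniently: invalid byte sequences become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def strip_replacement(text: str) -> str:
    """Remove every U+FFFD from an already decoded string."""
    return text.replace(REPLACEMENT, "")


e_acute = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

complete = decode_bytes(e_acute)               # both bytes present, valid
truncated = decode_bytes(e_acute[:1])          # lead byte only, invalid
doubled = decode_bytes(e_acute[:1] + e_acute)  # lead byte repeated, then the valid pair

print(repr(complete), repr(truncated), repr(doubled))
print(repr(strip_replacement(doubled)))
```

Stripping the character after decoding hides the symptom only; the repeated token is still in the generated ids.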

Is there anything I can do to prevent token repetition? Or is there something I am doing wrong?
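
One thing that might be worth trying, as a sketch rather than a confirmed fix: `generate` accepts `repetition_penalty` and `no_repeat_ngram_size`. The wrapper name `generate_with_repetition_controls` is hypothetical, the default values are untuned guesses, and `model` is again assumed to be a loaded TrOCR model.

```python
def generate_with_repetition_controls(
    model,
    pixel_values,
    repetition_penalty: float = 1.2,  # illustrative value, not tuned
    no_repeat_ngram_size: int = 3,    # illustrative value, not tuned
    **generate_kwargs,
):
    """Forward repetition-related settings to model.generate().

    repetition_penalty down-weights tokens that were already generated;
    no_repeat_ngram_size forbids any n-gram of that size from appearing twice.
    """
    return model.generate(
        pixel_values,
        repetition_penalty=repetition_penalty,
        no_repeat_ngram_size=no_repeat_ngram_size,
        **generate_kwargs,
    )
```

Neither setting guarantees that a single token is never emitted twice in a row, and both can also suppress legitimate repeats (for example doubled letters), so the output should be checked on real samples.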
