Hello everyone,
I'm using T5 for summarization tasks in English; however, strange symbols are always generated at the end of the outputs.
This is the code:
summary_ids = model.generate(tokenized_text,
                             num_beams=3,
                             no_repeat_ngram_size=2,
                             min_length=300,
                             max_length=600)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n\nSummarized text:\n", output)
At the end of the output, there are symbols like:
á là és ê e †ô óà unà uneestànèàêm–asôsè1á2al
Is this normal? Is there a way to prevent it?