Pegasus dropping Non-ASCII Chars

Hello all,

I’m currently using a Pegasus model fine-tuned on a custom dataset to summarise conversations in English. Some participants in the conversations have non-ASCII characters in their name, e.g. Gonçalo. I’ve noticed that for both the original pegasus-large model and my fine-tuned model these characters are dropped, and the output references Gonalo, not Gonçalo.
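
For reference, here’s a minimal way to check whether the character is already lost at the tokenizer stage, using the public google/pegasus-large checkpoint (the same check applies to a fine-tuned checkpoint path):

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
text = "Gonçalo joined the call."
ids = tokenizer(text).input_ids
# if "ç" is dropped during tokenisation, it will also be missing from the decoded text
print(tokenizer.decode(ids, skip_special_tokens=True))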

This wasn’t the case with the BART model I was previously using.

I was looking for help on how I could address this problem as I look to fine-tune my Pegasus model even further.

Thanks,

Karim

Hi @kmfoda, which Pegasus model are you using? In your custom dataset you say your input had special characters, but what about your output? And when decoding with your tokenizer, are you passing skip_special_tokens=True?

Hi @anwarika, thanks for your response. I’m using pegasus-large fine-tuned on my own dataset, with the summarisation pipeline defaults - I’ll need to investigate what the default is for that argument.

So it seems that skip_special_tokens is set to True in the pipeline source code, but there doesn’t seem to be a way to pre-set it to False. It seems to be hardcoded in there, if I’m not mistaken?
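
In the meantime, one way to sidestep the pipeline is to call the tokenizer and model directly, so the decode arguments are under your control. A rough sketch, assuming the public google/pegasus-large checkpoint (substitute the path to a fine-tuned model):

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

inputs = tokenizer("Gonçalo presented the quarterly numbers.", return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs)
# decode without skipping special tokens, unlike the pipeline's hardcoded default
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=False))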

Also, I’m not sure this would explain why, if you feed Gonçalo into the hosted inference API example for BART-Large, you get Gonçalo in the output, whereas for Pegasus-Large you get Gonalo. They both use the summarisation pipeline.
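
The difference can also be checked locally at the tokenizer level, without the hosted API; a quick comparison, assuming the public facebook/bart-large and google/pegasus-large checkpoints:

from transformers import AutoTokenizer

for name in ["facebook/bart-large", "google/pegasus-large"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("Gonçalo").input_ids
    # shows whether each tokenizer round-trips the "ç"
    print(name, "->", tok.decode(ids, skip_special_tokens=True))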

The tokenizers are different: BART uses a byte-level BPE tokenizer and Pegasus uses SentencePiece. When you fine-tuned your Pegasus model, did your target data have those special characters? You may want to add certain non-ASCII characters to your tokenizer and then fine-tune it on your data, for example:

# replace the placeholder with the characters/tokens you need, e.g. 'ç'
special_tokens_dict = {'additional_special_tokens': ['whatever your tokens translate to']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
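
After adding the tokens and resizing, it may be worth sanity-checking the round trip and saving both objects so your fine-tuning run picks up the extended vocabulary. A rough follow-up sketch, assuming "ç" was among the tokens added and "./pegasus-extended" is a hypothetical output directory:

# check whether "ç" now appears in the decoded text
ids = tokenizer("Gonçalo").input_ids
print(tokenizer.decode(ids, skip_special_tokens=False))

# persist both so the fine-tuning script loads the extended vocabulary
tokenizer.save_pretrained("./pegasus-extended")
model.save_pretrained("./pegasus-extended")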

I see, thanks so much. The target data did have those special characters, yes. The code snippet you provided is very helpful, I’ll try it out. Just wondering though, why is it added as a special token and not a normal one?

I’d recommend reading through this doc for your tokenizer type. Sorry, I’m not familiar with the SentencePiece tokenizer: Summary of the tokenizers

Here is a repo that discusses it in detail GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

One part of that doc explains why your characters don’t pass through: SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581).
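
You can see that behaviour by printing the pieces the Pegasus tokenizer produces; a quick check, assuming the public google/pegasus-large checkpoint:

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
# the pieces carry the "▁" whitespace marker, and show how "ç" is handled
print(tokenizer.tokenize("Gonçalo joined the call"))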

Reading further, I think this doc may help you.
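
On the earlier question about special vs. normal tokens: the tokenizer also has an add_tokens method for ordinary (non-special) tokens, which might suit this case better, since decoding with skip_special_tokens=True (the pipeline default) strips additional special tokens but keeps regular added ones. A minimal sketch, again assuming "ç" and the public google/pegasus-large checkpoint:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

# add "ç" as a regular token rather than a special one
num_added = tokenizer.add_tokens(["ç"])
model.resize_token_embeddings(len(tokenizer))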
