Pegasus dropping Non-ASCII Chars

Hello all,

I’m currently using a Pegasus model fine-tuned on a custom dataset to summarise conversations in English. Some participants in the conversations have non-ASCII characters in their name, e.g. Gonçalo. I’ve noticed that for both the original pegasus-large model and my fine-tuned model these characters are dropped, and the output references Gonalo, not Gonçalo.
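
For reference, here’s a minimal way to check whether the character is already lost at the tokenizer stage, using the public google/pegasus-large checkpoint (the same check applies to a fine-tuned checkpoint path):

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
text = "Gonçalo joined the call."
ids = tokenizer(text).input_ids
# if "ç" is dropped during tokenisation, it will also be missing from the decoded text
print(tokenizer.decode(ids, skip_special_tokens=True))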

This wasn’t the case with the BART model I was previously using.

I was looking for help on how I could address this problem as I look to fine-tune my Pegasus model even further.

Thanks,

Karim

Hi @kmfoda, which Pegasus model are you using? In your custom dataset you say your input had special characters, but what about your output? And when decoding with your tokenizer, are you passing skip_special_tokens=True?

Hi @anwarika, thanks for your response. I’m using pegasus-large fine-tuned on my own dataset, with the summarisation pipeline defaults - I’ll need to investigate what the default is for that argument.

So it seems that skip_special_tokens is set to True in the pipeline source code, but there doesn’t seem to be a way to pre-set it to False. It seems to be hardcoded in there, if I’m not mistaken?
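
In the meantime, one way to sidestep the pipeline is to call the tokenizer and model directly, so the decode arguments are under your control. A rough sketch, assuming the public google/pegasus-large checkpoint (substitute the path to a fine-tuned model):

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

inputs = tokenizer("Gonçalo presented the quarterly numbers.", return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs)
# decode without skipping special tokens, unlike the pipeline's hardcoded default
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=False))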

Also, I’m not sure this would explain why, if you feed Gonçalo into the hosted inference API example for BART-Large, you get Gonçalo in the output, whereas for Pegasus-Large you get Gonalo. They both use the summarisation pipeline.
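
The difference can also be checked locally at the tokenizer level, without the hosted API; a quick comparison, assuming the public facebook/bart-large and google/pegasus-large checkpoints:

from transformers import AutoTokenizer

for name in ["facebook/bart-large", "google/pegasus-large"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("Gonçalo").input_ids
    # shows whether each tokenizer round-trips the "ç"
    print(name, "->", tok.decode(ids, skip_special_tokens=True))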

The tokenizers are different: BART uses a byte-level BPE tokenizer and Pegasus uses SentencePiece. When you fine-tuned your Pegasus model, did your target data have those special characters? You may want to add certain non-ASCII characters to your tokenizer and then fine-tune it on your data, for example:

# replace the placeholder with the characters/tokens you need, e.g. 'ç'
special_tokens_dict = {'additional_special_tokens': ['whatever your tokens translate to']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
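
After adding the tokens and resizing, it may be worth sanity-checking the round trip and saving both objects so your fine-tuning run picks up the extended vocabulary. A rough follow-up sketch, assuming "ç" was among the tokens added and "./pegasus-extended" is a hypothetical output directory:

# check whether "ç" now appears in the decoded text
ids = tokenizer("Gonçalo").input_ids
print(tokenizer.decode(ids, skip_special_tokens=False))

# persist both so the fine-tuning script loads the extended vocabulary
tokenizer.save_pretrained("./pegasus-extended")
model.save_pretrained("./pegasus-extended")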

I see, thanks so much. The target data did have those special characters, yes. The code snippet you provided is very helpful, I’ll try it out. Just wondering though, why is it added as a special token and not a normal one?

I’d recommend reading through this doc for your tokenizer type. Sorry, I’m not familiar with the SentencePiece tokenizer: Summary of the tokenizers

Here is a repo that discusses it in detail GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

One part of that doc explains why your characters don’t pass through: SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581).
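
You can see that behaviour by printing the pieces the Pegasus tokenizer produces; a quick check, assuming the public google/pegasus-large checkpoint:

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
# the pieces carry the "▁" whitespace marker, and show how "ç" is handled
print(tokenizer.tokenize("Gonçalo joined the call"))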

Reading further, I think this doc may help you.
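
On the earlier question about special vs. normal tokens: the tokenizer also has an add_tokens method for ordinary (non-special) tokens, which might suit this case better, since decoding with skip_special_tokens=True (the pipeline default) strips additional special tokens but keeps regular added ones. A minimal sketch, again assuming "ç" and the public google/pegasus-large checkpoint:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

# add "ç" as a regular token rather than a special one
num_added = tokenizer.add_tokens(["ç"])
model.resize_token_embeddings(len(tokenizer))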
