Show Submodels of PegasusTokenizer

rmeier · April 28, 2022, 12:33pm

Hi,

To understand how the PegasusTokenizer works, I would like to print the output of the tokenizer after each submodule (normalization, pre-tokenization, and so on).

I used the google/pegasus-cnn_dailymail model with the PegasusTokenizer.
Is there a way to show, on one data example, how the tokenizer transforms the input text into the output integer sequence?

I would like to see the transformation step by step, for each stage the tokenizer performs.

It would be great if somebody could point me in the right direction, because it is very urgent.

Thanks
Ralf

rmeier · April 28, 2022, 1:01pm

I have an idea, but I don’t think this is the right solution:

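import sentencepiece as spm

# `tokenizer` is the loaded PegasusTokenizer; `sample_text` is the input string.
s = spm.SentencePieceProcessor(model_file=tokenizer.vocab_file)
# enable_sampling=True draws a random segmentation, so the output can change between runs.
str(s.encode(sample_text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))[:1000]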
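Another direction I am considering, as a rough sketch rather than a confirmed solution: the fast tokenizer exposes a `backend_tokenizer` (a `tokenizers.Tokenizer`) whose normalizer, pre-tokenizer and model can be called one at a time. The helper names `trace_pipeline` and `load_pegasus_backend` are my own, and I am assuming that the fast version of the `google/pegasus-cnn_dailymail` tokenizer can be loaded in my environment (the conversion may need `sentencepiece` installed).

```python
def trace_pipeline(backend, text):
    """Collect the intermediate result of each stage of a `tokenizers.Tokenizer`."""
    stages = {}

    # 1. Normalization: string in, cleaned-up string out.
    normalizer = backend.normalizer
    normalized = normalizer.normalize_str(text) if normalizer is not None else text
    stages["normalized"] = normalized

    # 2. Pre-tokenization: splits the string into (piece, (start, end)) pairs.
    pre = backend.pre_tokenizer
    if pre is not None:
        words = pre.pre_tokenize_str(normalized)
    else:
        words = [(normalized, (0, len(normalized)))]
    stages["pre_tokenized"] = [word for word, _ in words]

    # 3. Model: each pre-token is segmented into subword pieces with ids.
    tokens = [tok for word, _ in words for tok in backend.model.tokenize(word)]
    stages["pieces"] = [tok.value for tok in tokens]
    stages["ids"] = [tok.id for tok in tokens]

    # 4. Full pipeline, including post-processing (special tokens such as EOS).
    encoding = backend.encode(text)
    stages["final_tokens"] = encoding.tokens
    stages["final_ids"] = encoding.ids
    return stages


def load_pegasus_backend(name="google/pegasus-cnn_dailymail"):
    """Hypothetical helper: load the fast tokenizer and return its backend."""
    # Imported lazily because loading downloads the checkpoint files.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name, use_fast=True).backend_tokenizer
```

Calling `trace_pipeline(load_pegasus_backend(), "Hello,   world!")` and printing each entry of the returned dict should show the text after every stage. Comparing `ids` with `final_ids` shows what the post-processor adds. I have not checked whether the fast and slow tokenizers always produce identical pieces.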
