Show Submodels of PegasusTokenizer


To understand how the PegasusTokenizer works, I would like to print the output of the tokenizer after each submodule (normalization, pre-tokenization, and so on).

I used the google/pegasus-cnn_dailymail model with the PegasusTokenizer.
Is there a way to show, on a single data example, how the tokenizer transforms the input text into the output integer sequence?

I would like to see the transformation step by step, for each stage the tokenizer performs.

It would be great if somebody could point me in the right direction, as this is quite urgent.


I have some ideas, but I don't think this is the right approach:

```python
import sentencepiece as spm
from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
sample_text = "The quick brown fox jumps over the lazy dog."  # any example sentence

# Load the underlying SentencePiece model from the tokenizer's vocab file
s = spm.SentencePieceProcessor(model_file=tokenizer.vocab_file)

# enable_sampling draws a random segmentation from the n-best list,
# so the output is not deterministic
print(str(s.encode(sample_text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))[:1000])
```