Pegasus Questions

Q: Max model input size varies between checkpoints. What is the maximum number of input tokens that each model can process?
A:

max_model_length = {
    "xsum": 512,
    "cnn_dailymail": 1024,
    "newsroom": 512,
    "wikihow": 512,
    "multi_news": 1024,
    "reddit_tifu": 512,
    "big_patent": 1024,
    "arxiv": 1024,
    "pubmed": 1024,
    "gigaword": 128,
    "aeslc": 512,
    "billsum": 1024,
    "large": 1024,
}

that constant is defined here:

So does that mean max_position_embeddings is reduced for the fine-tuned models (gigaword, wikihow)? I.e., if max_position_embeddings for the pre-trained model is 1024, then all fine-tuned models should have the same, right?

The positional embeddings are static, so the max model input length is more a reflection of how the model was finetuned than a fundamental limitation.

I got the params from: https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py

For some of the saved state_dicts on S3, I simply removed embed_positions.weight, so that passing max_position_embeddings=ANY_INT to from_pretrained allocates more positions at init. Let me know if you try that and it fails, and I can delete the embed_positions.weight as needed.
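
If you want to double-check what a given checkpoint shipped with, something like this should work (checkpoint names as they appear on the model hub):

from transformers import PegasusConfig, PegasusTokenizer

for ckpt in ["google/pegasus-xsum", "google/pegasus-cnn_dailymail", "google/pegasus-gigaword"]:
    config = PegasusConfig.from_pretrained(ckpt)
    tokenizer = PegasusTokenizer.from_pretrained(ckpt)
    # max_position_embeddings comes from the config, model_max_length from the tokenizer
    print(ckpt, config.max_position_embeddings, tokenizer.model_max_length)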

It was suggested I redirect my questions from this thread: Questions about Pegasus for Summarization

Two questions: (1) does anyone have good examples of the google/pegasus models being used for summarization using the AutoModel and AutoTokenizer classes?

(2) Using the google/pegasus models, how do you set the minimum and/or maximum length for the generated summary?

Thanks!

Thanks for redirecting

The docs have a usage example, which I’ll put here:

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to tens of thousands of customers."

For all seq2seq/ForConditionalGeneration models, you can control generated summary lengths with the min_length and max_length keyword arguments to the generate function.

If you don’t pass them, the (good) defaults from config will be used. The defaults from config were set by the authors to get a good validation rouge on the finetuning dataset.
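
For example, picking arbitrary bounds of 20 and 60 tokens just to illustrate (reusing the xsum checkpoint from the snippet above):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'google/pegasus-xsum'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

src_text = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions."]
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest')
# min_length / max_length bound the number of generated tokens in the summary
summary_ids = model.generate(**batch, min_length=20, max_length=60)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))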


In the original paper Section 6:

We used sinusoidal positional encoding

and Section 6.2:

This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUS_LARGE beyond the input lengths observed in training up to L_input = 1024 tokens.

So you can use longer inputs than in training, but you have to check whether the model still generalizes with the longer input.


So if we want to use longer inputs, we’ll need to resize the embeddings (as these are static), and these newly added embeddings will be randomly initialised and need to be trained.

Is this correct?

Ideally you won’t need to manually resize, and can just pass max_position_embeddings as a kwarg to from_pretrained
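
i.e. something like this (a sketch; as noted earlier, it relies on embed_positions.weight not being saved in the checkpoint’s state dict, so tell me if a particular checkpoint complains):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-large"
# request more positions than the default 1024; the sinusoidal embeddings are
# (re)built at init, so nothing needs to be manually resized or retrained
model = PegasusForConditionalGeneration.from_pretrained(model_name, max_position_embeddings=2048)
tokenizer = PegasusTokenizer.from_pretrained(model_name, model_max_length=2048)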


Something is wrong? :thinking: UPDATE: Normal behavior…as the pubmed dataset (biomedical text) is used for finetuning

Text : The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.

google/pegasus-pubmed summary:
the tower of so paulo ( tower of so paulo , brazil ) is the tallest building in the brazilian amazon . its height is determined by measuring the distance from the top of the building to the base of the building . the building is located in so paulo , brazil , on the so paulo river . its height is determined by measuring the distance from the top of the building to the base of the building .

I can’t tell from that. It might be so different from a pubmed article that the model gets confused. Try finding the pubmed dataset and pasting in an example.
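
Something like this should pull a real pubmed-style example, assuming the scientific_papers dataset’s pubmed config in the datasets library matches what the checkpoint was finetuned on:

from datasets import load_dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# "scientific_papers" has a "pubmed" config with "article" / "abstract" fields
example = load_dataset("scientific_papers", "pubmed", split="validation[:1]")[0]

model_name = "google/pegasus-pubmed"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

batch = tokenizer.prepare_seq2seq_batch([example["article"]], truncation=True, padding="longest")
summary = tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)
print(summary[0])
print(example["abstract"])  # reference abstract for comparison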


Yes! Tested with the authors’ script/finetuned checkpoints and got similar results…

Thanks !


@sshleifer I was trying to convert a Pegasus TF checkpoint to PyTorch using this script in transformers.

Can you please help me with the queries below:

1. While replacing the keys I noticed the replacement below, which does not match the BART keys, so I changed it accordingly:
Pegasus:
decoder/LayerNorm/beta -> decoder.layer_norm.bias
decoder/LayerNorm/gamma -> decoder.layer_norm.weight
BART:
decoder.layernorm_embedding.weight
decoder.layernorm_embedding.bias

2. In the code below, all Pegasus dict keys should be present in the BART keys, right? But since Pegasus has more layers than BART, it will never find keys like “decoder.layers.12.final_layer_norm.bias” in the BART keys.

def convert_pegasus_to_bart(tf_weights: dict, cfg_updates: dict) -> PegasusForConditionalGeneration:
    cfg_kwargs = DEFAULTS.copy()
    cfg_kwargs.update(cfg_updates)
    cfg = PegasusConfig(**cfg_updates)
    bart = PegasusForConditionalGeneration(cfg)
    sd = bart.model.state_dict()
    mapping = {}
    for k, v in tf_weights.items():
        new_k = rename_state_dict_key(k)
        if new_k not in sd:
            raise ValueError(f"could not find new key {new_k} in state dict. (converted from {k})")
        if "dense" in k or "proj" in new_k:
            v = v.T
1. rename_state_dict_key should handle that.
2. cfg.encoder_layers will be 16 there, so decoder.layers.12 will definitely exist.

If the code breaks, pls send a command and traceback.

(update: works for me now) The fix was changing cfg = PegasusConfig(**cfg_updates) to cfg = PegasusConfig(**cfg_kwargs) after cfg_kwargs.update(cfg_updates) :sweat_smile:
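
i.e. the construction that works now is:

cfg_kwargs = DEFAULTS.copy()
cfg_kwargs.update(cfg_updates)
cfg = PegasusConfig(**cfg_kwargs)  # use the merged defaults + updates, not cfg_updates alone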

Also, I got an error with the model name in the tokenizer, which I later changed from “sshleifer/pegasus” to “google/pegasus-aeslc”.
command: python convert_pegasus_tf_to_pytorch.py gs://pegasus_ckpt/arxiv/model.ckpt-32000 save_dir

Thanks !

great catch, PR’d fix https://github.com/huggingface/transformers/pull/7094


You may know this, but we already have google/pegasus-arxiv available, fwiw.

Yes, actually I wanted to add Pegasus for QA and paraphrasing :slight_smile:


Are there any scripts available to distill the model, and how long does it take? I need a small model trained on multi-news, but found nothing.

Pegasus distillation is now supported in the seq2seq examples.
https://github.com/huggingface/transformers/tree/master/examples/seq2seq

You’ll need to pass the pegasus model path instead of bart, and a multi-news data directory formatted as specified in the readme; a rough sketch of that layout is below.
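
For the data directory, here is how you might build it from the multi_news dataset in the datasets library (assuming its document/summary fields and the train/val/test .source/.target layout from the readme; double-check both against the versions you have):

from pathlib import Path
from datasets import load_dataset

out = Path("multi_news_data")
out.mkdir(exist_ok=True)
# the seq2seq examples expect {train,val,test}.source and .target files, one example per line
for hf_split, name in [("train", "train"), ("validation", "val"), ("test", "test")]:
    ds = load_dataset("multi_news", split=hf_split)
    with open(out / f"{name}.source", "w") as src, open(out / f"{name}.target", "w") as tgt:
        for ex in ds:
            src.write(ex["document"].replace("\n", " ") + "\n")
            tgt.write(ex["summary"].replace("\n", " ") + "\n")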


Hi Sshleifer,

@sshleifer Sorry in advance if this is a silly question. When running inference with pegasus-arxiv, especially on long documents like the arxiv dataset, can we only infer on the maximum input length (i.e. 1024), or on the whole document at once? Or do we have to send in the input in chunks according to the max length of 1024?