Pegasus Questions

sshleifer · August 24, 2020, 2:56pm

Q: The maximum input size varies between checkpoints. What is the maximum number of input tokens that each model can process?
A:

max_model_length = {
    "xsum": 512,
    "cnn_dailymail": 1024,
    "newsroom": 512,
    "wikihow": 512,
    "multi_news": 1024,
    "reddit_tifu": 512,
    "big_patent": 1024,
    "arxiv": 1024,
    "pubmed": 1024,
    "gigaword": 128,
    "aeslc": 512,
    "billsum": 1024,
    "large": 1024,
}

That constant is defined here:

https://github.com/sshleifer/transformers_fork/blob/f69cac3347641beaba5037b9a6fca1a46f423639/src/transformers/configuration_pegasus.py#L67-L67
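
For convenience, the table can be wrapped in a small lookup helper. This is only a sketch: the function name max_input_tokens is hypothetical (it is not part of transformers), and it assumes checkpoint names follow the google/pegasus-<dataset> pattern.

```python
# Values copied from the max_model_length table in this post.
MAX_MODEL_LENGTH = {
    "xsum": 512,
    "cnn_dailymail": 1024,
    "newsroom": 512,
    "wikihow": 512,
    "multi_news": 1024,
    "reddit_tifu": 512,
    "big_patent": 1024,
    "arxiv": 1024,
    "pubmed": 1024,
    "gigaword": 128,
    "aeslc": 512,
    "billsum": 1024,
    "large": 1024,
}


def max_input_tokens(checkpoint: str) -> int:
    """Hypothetical helper: max input length for e.g. 'google/pegasus-xsum'."""
    # Assumption: the dataset name is whatever follows 'pegasus-' in the last path part.
    name = checkpoint.split("/")[-1]
    prefix = "pegasus-"
    if name.startswith(prefix):
        name = name[len(prefix):]
    # Raises KeyError for checkpoints that are not in the table.
    return MAX_MODEL_LENGTH[name]
```

For example, max_input_tokens("google/pegasus-gigaword") looks up the "gigaword" entry, so inputs to that checkpoint would be truncated at 128 tokens.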

valhalla · August 24, 2020, 2:59pm

So does that mean max_position_embeddings is reduced for some fine-tuned models (gigaword, wikihow)? I.e., if max_position_embeddings for the pre-trained model is 1024, then all fine-tuned models should have the same value, right?

sshleifer · August 24, 2020, 3:42pm

The positional embeddings are static, so the max model_input_length is more a reflection of how the model was finetuned than a fundamental limitation.

I got the params from: https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py

For some of the saved state_dicts on S3, I simply removed embed_positions.weight so that you can pass max_position_embeddings=ANY_INT to from_pretrained and allocate more positions at init. Let me know if you try that and it fails, and I can delete embed_positions.weight as needed.

Buckeyes2019 · August 24, 2020, 3:47pm

It was suggested that I redirect my questions here from this thread: Questions about Pegasus for Summarization

Two questions: (1) Does anyone have good examples of the google/pegasus models being used for summarization with the AutoModel and AutoTokenizer classes?

(2) Using the google/pegasus models, how do you set the minimum and/or maximum length for the generated summary?

Thanks!

sshleifer · August 24, 2020, 7:50pm

Thanks for redirecting

The docs have a usage example, which I’ll put here:

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
# Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch
batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors='pt').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to tens of thousands of customers."

For all seq2seq/ForConditionalGeneration models, you can control generated summary lengths with the min_length and max_length keyword arguments to the generate function.

If you don’t pass them, the (good) defaults from the config will be used. The authors set those defaults to get a good validation ROUGE score on the fine-tuning dataset.
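
A minimal sketch of overriding those lengths is below. The wrapper name summarize is hypothetical; it assumes model and tokenizer are the objects loaded in the example above, and it only forwards min_length or max_length when you actually set them, so that unset values fall back to the config defaults.

```python
def summarize(model, tokenizer, texts, min_length=None, max_length=None):
    """Hypothetical wrapper: generate summaries with optional length limits."""
    # Tokenize, truncate to the tokenizer's maximum length, pad to the longest item.
    batch = tokenizer(texts, truncation=True, padding="longest", return_tensors="pt")
    batch = batch.to(model.device)

    # Forward only the limits that were set; the rest come from the model config.
    length_kwargs = {}
    if min_length is not None:
        length_kwargs["min_length"] = min_length
    if max_length is not None:
        length_kwargs["max_length"] = max_length

    generated = model.generate(**batch, **length_kwargs)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Usage, with the objects from the example above (values are illustrative):
# summaries = summarize(model, tokenizer, src_text, min_length=10, max_length=40)
```

Note that both limits are counted in generated tokens, not words or characters.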

LiuYangyang · August 25, 2020, 1:46am

In the original paper Section 6:

We used sinusoidal positional encoding

and Section 6.2:

This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUS_LARGE beyond the input lengths observed in training up to L_input = 1024 tokens.

So you can use inputs longer than those seen in training, but you have to check whether the model still generalizes to the longer inputs.
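
To illustrate why sinusoidal encodings can be extended, here is a small sketch of the formula from Vaswani et al. (2017) in plain Python. The function name and the dimension of 16 are illustrative assumptions, and the actual transformers implementation may arrange the sine and cosine columns differently.

```python
import math


def sinusoidal_positions(n_positions: int, dim: int) -> list:
    """Build a position table: sine on even columns, cosine on odd columns."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(dim):
            # Each sin/cos column pair shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


short_table = sinusoidal_positions(512, 16)
long_table = sinusoidal_positions(1024, 16)

# Extending the table only appends rows; the first 512 rows are unchanged.
assert long_table[:512] == short_table
```

Because every row is a fixed function of the position, a longer table contains no learned parameters. Whether the fine-tuned weights cope well with the extra positions is a separate, empirical question.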

valhalla · August 25, 2020, 8:58am

So if we want to use longer input we’ll need to resize the embeddings (as these are static) and these newly added embeddings will be randomly initialised and need to be trained.

Is this correct ?

sshleifer · August 25, 2020, 3:18pm

Ideally you won’t need to manually resize, and can just pass max_position_embeddings as a kwarg to from_pretrained

tuner007 · August 27, 2020, 8:54pm

Is something wrong? UPDATE: This is normal behavior, since the pubmed dataset (biomedical text) was used for fine-tuning.

Text : The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.

google/pegasus-pubmed summary:
the tower of so paulo ( tower of so paulo , brazil ) is the tallest building in the brazilian amazon . its height is determined by measuring the distance from the top of the building to the base of the building . the building is located in so paulo , brazil , on the so paulo river . its height is determined by measuring the distance from the top of the building to the base of the building .

sshleifer · August 28, 2020, 12:18am

I can’t tell from that. The text might be so different from a pubmed article that the model gets confused. Try finding the pubmed dataset and pasting in an example.

tuner007 · August 28, 2020, 6:46pm

Yes! I tested with the authors’ script and fine-tuned checkpoints and got similar results…

Thanks !

tuner007 · September 12, 2020, 2:11pm

@sshleifer I was trying to convert a Pegasus TF checkpoint to PyTorch using this script in transformers.

Can you please help me with the queries below:

1. While replacing the keys, I noticed the replacement below does not match the BART keys, so I changed it accordingly:
Pegasus:
decoder/LayerNorm/beta -> decoder.layer_norm.bias
decoder/LayerNorm/gamma -> decoder.layer_norm.weight
BART:
decoder.layernorm_embedding.weight
decoder.layernorm_embedding.bias

2. In the code below, all Pegasus dict keys should be present in the BART keys, right? But since Pegasus has more layers than BART, it will never find keys like “decoder.layers.12.final_layer_norm.bias” in the BART keys.

def convert_pegasus_to_bart(tf_weights: dict, cfg_updates: dict) -> PegasusForConditionalGeneration:
    cfg_kwargs = DEFAULTS.copy()
    cfg_kwargs.update(cfg_updates)
    cfg = PegasusConfig(**cfg_updates)
    bart = PegasusForConditionalGeneration(cfg)
    sd = bart.model.state_dict()
    mapping = {}
    for k, v in tf_weights.items():
        new_k = rename_state_dict_key(k)
        if new_k not in sd:
            raise ValueError(f"could not find new key {new_k} in state dict. (converted from {k})")
        if "dense" in k or "proj" in new_k:
            v = v.T

sshleifer · September 12, 2020, 5:36pm

rename_state_dict_key should handle that
cfg.encoder_layers will be 16 there, so decoder.layers.12 will definitely exist.

If the code breaks, pls send a command and traceback.
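
As a rough illustration of the renaming step, ordered string replacements are enough to turn the TF names quoted above into PyTorch-style keys. The pattern list and the names rename_key and missing_keys below are hypothetical; the real script's rename_state_dict_key uses its own, longer list.

```python
# Hypothetical replacement patterns, applied in order.
PATTERNS = [
    ("/", "."),
    (".LayerNorm.gamma", ".layer_norm.weight"),
    (".LayerNorm.beta", ".layer_norm.bias"),
]


def rename_key(tf_key: str) -> str:
    """Apply every pattern in order to a TF variable name."""
    new_key = tf_key
    for old, new in PATTERNS:
        new_key = new_key.replace(old, new)
    return new_key


def missing_keys(tf_keys, torch_keys):
    """Return the renamed keys that the target state dict does not contain."""
    target = set(torch_keys)
    return [rename_key(k) for k in tf_keys if rename_key(k) not in target]
```

With these patterns, rename_key("decoder/LayerNorm/beta") gives "decoder.layer_norm.bias". An empty list from missing_keys means every converted key has a slot in the target model.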

tuner007 · September 12, 2020, 6:01pm

(Update: works for me now.) The fix was to build the config from the merged kwargs:
cfg_kwargs.update(cfg_updates)
cfg = PegasusConfig(**cfg_updates) -> cfg = PegasusConfig(**cfg_kwargs)

I also got an error with the model name in the tokenizer, which I later changed from “sshleifer/pegasus” to “google/pegasus-aeslc”.
Command: python convert_pegasus_tf_to_pytorch.py gs://pegasus_ckpt/arxiv/model.ckpt-32000 save_dir

Thanks !
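
The difference between the two calls can be shown with plain dictionaries. The DEFAULTS values and the helper name build_cfg_kwargs here are illustrative assumptions, not the script's actual contents.

```python
# Illustrative defaults; the real DEFAULTS dict in the script may differ.
DEFAULTS = {
    "encoder_layers": 16,
    "decoder_layers": 16,
    "max_position_embeddings": 1024,
}


def build_cfg_kwargs(cfg_updates: dict) -> dict:
    """Merge per-checkpoint updates over a copy of the defaults."""
    cfg_kwargs = DEFAULTS.copy()
    cfg_kwargs.update(cfg_updates)
    return cfg_kwargs


updates = {"max_position_embeddings": 512}
merged = build_cfg_kwargs(updates)

# Passing `updates` alone drops every default that was not overridden.
assert "encoder_layers" not in updates
# Passing the merged dict keeps the defaults and applies the override.
assert merged["encoder_layers"] == 16
assert merged["max_position_embeddings"] == 512
```

Copying DEFAULTS before updating also keeps the module-level dict unchanged between conversions.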

sshleifer · September 12, 2020, 8:41pm

great catch, PR’d fix https://github.com/huggingface/transformers/pull/7094

sshleifer · September 12, 2020, 8:42pm

You may know this, but we already have google/pegasus-arxiv available, fwiw.

tuner007 · September 12, 2020, 8:55pm

Yes, actually I wanted to add Pegasus for QA and paraphrasing.

LiuYangyang · September 15, 2020, 2:44pm

Is there a script available to distill the model, and how long does it take? I need a small model trained on multi-news, but found nothing.

valhalla · September 15, 2020, 3:03pm

Pegasus distillation is now supported in the seq2seq examples.
https://github.com/huggingface/transformers/tree/master/examples/seq2seq

You’ll need to pass the Pegasus model path instead of the BART one, and a multi-news directory laid out in the format specified in the README.

Jeeves · November 3, 2020, 7:18am

Hi @sshleifer,

Sorry in advance if this is a silly question. When running inference with pegasus-arxiv, especially on long documents like those in the arXiv dataset, can we only infer on up to the maximum input length (i.e., 1024 tokens), or on the whole document at once? Or do we have to send the input in chunks of at most 1024 tokens?
