So does that mean max_position_embeddings is reduced for fine-tuned models (gigaword, wikihow)? I.e., if max_position_embeddings for the pre-trained model is 1024, then all fine-tuned models should have the same value, right?
For some of the saved state_dicts on S3, I simply removed embed_positions.weight, which allows passing max_position_embeddings=ANY_INT to from_pretrained to allocate more positions at init. Let me know if you try that and it fails, and I can delete the embed_positions.weight as needed.
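A minimal sketch of what that looks like (2048 is an arbitrary example value, and this assumes the checkpoint you load no longer carries embed_positions.weight):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'google/pegasus-large'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Override max_position_embeddings at load time; the position embeddings are
# sinusoidal (not trained), so no new trainable weights are introduced.
model = PegasusForConditionalGeneration.from_pretrained(model_name, max_position_embeddings=2048)
print(model.config.max_position_embeddings)  # 2048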
Two questions: (1) does anyone have good examples of the google/pegasus models being used for summarization with the AutoModel and AutoTokenizer classes?
(2) Using the google/pegasus models, how do you set the minimum and/or maximum length for the generated summary?
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

# Tokenize, generate, and decode the summary
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to tens of thousands of customers."
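On question (1): the Auto classes should work the same way, since the checkpoint config resolves them to the Pegasus model and tokenizer. A minimal sketch, reusing src_text and torch_device from above:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Reusing src_text and torch_device from the snippet above.
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-xsum')
model = AutoModelForSeq2SeqLM.from_pretrained('google/pegasus-xsum').to(torch_device)
batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors='pt').to(torch_device)
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))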
For all seq2seq/ForConditionalGeneration models, you can control generated summary lengths with the min_length and max_length keyword arguments to the generate function.
If you don't pass them, the (good) defaults from the config will be used; the authors set those defaults to get good validation ROUGE on the fine-tuning dataset.
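For example (the bounds below are arbitrary, just to show the kwargs; batch, model, and tokenizer are reused from the snippet above):

summary_ids = model.generate(
    **batch,
    min_length=20,   # generate at least 20 tokens
    max_length=60,   # and at most 60 tokens
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))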
This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUS_LARGE beyond the input lengths observed in training up to L_input = 1024 tokens.
So you can use longer inputs than were seen in training, but you have to check whether the model still generalizes with the longer input.
So if we want to use a longer input we'll need to resize the position embeddings (as they have a fixed size), and the newly added embeddings will be randomly initialised and will need to be trained.
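For a checkpoint with learned position embeddings, that resize could look roughly like the sketch below; the module path you would assign the result to (e.g. model.model.encoder.embed_positions) depends on the architecture and is an assumption here. With the sinusoidal embeddings in the Pegasus port, passing a larger max_position_embeddings at load time (as shown earlier) is enough.

import torch
import torch.nn as nn

def extend_position_embeddings(old_emb: nn.Embedding, new_num_positions: int) -> nn.Embedding:
    # Keep the trained rows and randomly initialise the extra ones;
    # the new rows still need to be fine-tuned before they are useful.
    new_emb = nn.Embedding(new_num_positions, old_emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    return new_emb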
Is something wrong? UPDATE: normal behavior, since the PubMed dataset (biomedical text) was used for fine-tuning.
Text : The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
google/pegasus-pubmed summary:
the tower of so paulo ( tower of so paulo , brazil ) is the tallest building in the brazilian amazon . its height is determined by measuring the distance from the top of the building to the base of the building . the building is located in so paulo , brazil , on the so paulo river . its height is determined by measuring the distance from the top of the building to the base of the building .
I can't tell from that. It might be so different from a PubMed article that the model gets confused. Try finding the PubMed dataset and pasting in an example.
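For example, assuming the datasets library is available, the pubmed config of the scientific_papers dataset is one place to grab an article from (a rough sketch):

from datasets import load_dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-pubmed')
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-pubmed')

# One validation article from the PubMed portion of scientific_papers;
# truncation keeps it within the model's maximum input length.
article = load_dataset('scientific_papers', 'pubmed', split='validation[:1]')[0]['article']
batch = tokenizer(article, truncation=True, return_tensors='pt')
print(tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)[0])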
@sshleifer I was trying to convert a Pegasus TF checkpoint to PyTorch using this script in transformers.
Can you please help me with the queries below:
1. While replacing the keys I noticed the replacement below, which does not match the BART keys, so I changed it accordingly (the substitution is sketched after the code below):
Pegasus:
decoder/LayerNorm/beta -> decoder.layer_norm.bias
decoder/LayerNorm/gamma -> decoder.layer_norm.weight
BART:
decoder.layernorm_embedding.weight
decoder.layernorm_embedding.bias
2. In the code below, all Pegasus dict_keys should be present in the BART keys, right? But since Pegasus has more layers than BART, it will never find keys like "decoder.layers.12.final_layer_norm.bias" in the BART keys.
def convert_pegasus_to_bart(tf_weights: dict, cfg_updates: dict) -> PegasusForConditionalGeneration:
    cfg_kwargs = DEFAULTS.copy()
    cfg_kwargs.update(cfg_updates)
    cfg = PegasusConfig(**cfg_updates)
    bart = PegasusForConditionalGeneration(cfg)
    sd = bart.model.state_dict()
    mapping = {}
    for k, v in tf_weights.items():
        new_k = rename_state_dict_key(k)
        if new_k not in sd:
            raise ValueError(f"could not find new key {new_k} in state dict. (converted from {k})")
        if "dense" in k or "proj" in new_k:
            v = v.T
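For reference, the LayerNorm replacement from (1) boils down to a substitution like this hypothetical sketch (the real convert_pegasus_tf_to_pytorch.py keeps its own pattern table):

# Hypothetical sketch of the key substitution, not the script's actual table.
PATTERNS = [
    ('/LayerNorm/gamma', '.layer_norm.weight'),
    ('/LayerNorm/beta', '.layer_norm.bias'),
    ('/', '.'),  # generic TF scope separator to attribute path (assumption)
]

def rename_key(tf_key: str) -> str:
    for old, new in PATTERNS:
        tf_key = tf_key.replace(old, new)
    return tf_key

print(rename_key('decoder/LayerNorm/gamma'))  # decoder.layer_norm.weight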
(Update: works for me now.) The fix is to build the config from cfg_kwargs (which includes the defaults) rather than cfg_updates:
cfg_kwargs.update(cfg_updates)
cfg = PegasusConfig(**cfg_updates) -> cfg = PegasusConfig(**cfg_kwargs)
Also, I got an error with the model name in the tokenizer, which I later changed from “sshleifer/pegasus” to “google/pegasus-aeslc”. Command:
python convert_pegasus_tf_to_pytorch.py gs://pegasus_ckpt/arxiv/model.ckpt-32000 save_dir
@sshleifer Sorry in advance if this is a silly question. When running inference with pegasus-arxiv, especially on long documents like those in the arXiv dataset, can we only feed in up to the maximum input length (i.e. 1024 tokens), or can we pass the whole document at once? Or do we have to split the input into chunks of at most 1024 tokens?