I tried it, but it didn’t work. I ran one of the failing texts and got this error: “Token indices sequence length is longer than the specified maximum sequence length for this model (753 > 512). Running this sequence through the model will result in indexing errors”. Does this mean truncation doesn’t work?
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
src_text = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]
model_name = "google/pegasus-xsum"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert (
    tgt_text[0]
    == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
)
So, if you want to truncate your text, pass truncation=True to tokenizer(), not to from_pretrained().
Try this code. It shows that the input text is correctly truncated.
from transformers import PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
text_over_512 = "i" * 756
text_under_112 = "i" * 112
print(len(text_over_512))
print(len(text_under_112))
#
encoding_1 = tokenizer(text_over_512, truncation=True, padding="max_length")
encoding_2 = tokenizer(text_under_112, truncation=True, padding="max_length")
print(f"text_over_512's input_ids length: {len(encoding_1['input_ids'])}")
print(f"text_under_112's input_ids length: {len(encoding_2['input_ids'])}")
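Conceptually, what truncation=True does is clip the tokenized sequence down to the model's maximum length (512 for google/pegasus-xsum) while keeping the end-of-sequence token at the end. Here is a minimal pure-Python sketch of that idea — this is an illustration, not the actual transformers implementation, and the EOS id of 1 is just Pegasus's convention:

```python
# Sketch of truncation: clip a token-id sequence to model_max_length,
# reserving the last slot for the end-of-sequence (EOS) token.
# (Illustrative only -- not the real transformers code.)

def truncate_ids(token_ids, model_max_length=512, eos_id=1):
    """Return token_ids unchanged if short enough, else clip to model_max_length."""
    if len(token_ids) <= model_max_length:
        return token_ids
    # keep the first model_max_length - 1 ids, then append EOS
    return token_ids[:model_max_length - 1] + [eos_id]

long_ids = list(range(753))   # pretend tokenization produced 753 ids, as in the error
short_ids = list(range(100))

print(len(truncate_ids(long_ids)))   # 512
print(len(truncate_ids(short_ids)))  # 100
```

So the warning in the question is printed when the tokenizer is called without truncation; with truncation=True the sequence never exceeds 512 ids.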
from transformers import pipeline, PegasusTokenizer, PegasusForConditionalGeneration
# Summarization pipeline, passing in a specific model and tokenizer
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
summarize_pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
str_input = "A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on information highways. The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network."
# make a long input (well over 512 tokens)
str_input_over512 = str_input * 4
summarize_pipe(str_input_over512, truncation=True)
Output:
[{'summary_text': 'A Highway Network is an architecture designed to ease gradient-based training of very deep networks.'}]