Environment info
Python version: 3.7.10
PyTorch version (GPU?): '1.7.1+cu110' (True)
Transformers version: '4.5.0.dev0'
Details
I am trying to use the cross-attention from the T5 model for paraphrasing. The idea is to map the input sentence to the generated output sequence based on the attention weights. But the first results I got are very strange.
I generated an example with the following code:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

pretrained_model = "ramsrigouthamg/t5_paraphraser"
model = T5ForConditionalGeneration.from_pretrained(pretrained_model,
                                                   output_attentions=True,
                                                   output_scores=True)
tokenizer = T5Tokenizer.from_pretrained(pretrained_model)

translated_sentence = "I like drinking Fanta and Cola."
text = "paraphrase: " + translated_sentence + " </s>"
encoding = tokenizer.encode_plus(text,
                                 pad_to_max_length=True,
                                 return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]
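To make the attention indices below easier to interpret, a quick check (using the tokenizer defined above) is to list each input token with its position:

# List the input token at each position, so the indices returned by argsort later
# can be mapped back to tokens.
input_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
for pos, tok in enumerate(input_tokens):
    print(pos, tok)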
Then, I took a look at the cross-attention for each generated token by selecting the last decoder layer and the first head.
beam_outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=100,
    top_p=0.95,
    num_return_sequences=1,
    output_attentions=True,
    output_scores=True,
    return_dict_in_generate=True
)
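# My understanding of the returned structure (please correct me if this is wrong):
# beam_outputs.cross_attentions should be a tuple with one entry per generation step,
# and each entry a tuple with one tensor per decoder layer, of shape
# (batch_size, num_heads, generated_length, input_sequence_length).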
sentence_id = 0
print("Input phrase: ", tokenizer.decode(encoding.input_ids[0],
                                          skip_special_tokens=False,
                                          clean_up_tokenization_spaces=False))
print("Predicted phrase: ", tokenizer.decode(beam_outputs.sequences[sentence_id],
                                             skip_special_tokens=True,
                                             clean_up_tokenization_spaces=True))
for out in range(len(beam_outputs.sequences[sentence_id]) - 1):
    print(
        "\nPredicted word: ",
        tokenizer.decode(beam_outputs.sequences[sentence_id][out],
                         skip_special_tokens=True,
                         clean_up_tokenization_spaces=True))
    att = torch.stack(beam_outputs.cross_attentions[out])
    # Cross-attention of the last decoder layer
    att = att[-1]
    # First batch element and first head
    att = att[0, 0, :, :]
    att = torch.squeeze(att)
    # argsort returns indices in ascending order of attention weight
    idx = torch.argsort(att)
    idx = idx.cpu().numpy()
    print("Input words ordered by attention: ")
    for i in range(min(5, len(idx))):
        token_smallest_attention = tokenizer.decode(encoding.input_ids[0][idx[i]],
                                                    skip_special_tokens=True,
                                                    clean_up_tokenization_spaces=True)
        token_largest_attention = tokenizer.decode(encoding.input_ids[0][idx[-(1 + i)]],
                                                   skip_special_tokens=True,
                                                   clean_up_tokenization_spaces=True)
        print(f"{i+1}: Largest attention: {token_largest_attention} | smallest attention: {token_smallest_attention}")
The attention scores are sorted, and each generated token is associated with the input tokens that receive the highest attention (top 5 values) and the lowest attention (bottom 5 values). This gives the following output:
Input phrase: paraphrase: I like drinking Fanta and Cola.</s>
Predicted phrase: I like to drink Fanta and Cola.
Predicted word: <pad>
Input words ordered by attention:
1: Largest attention: I | smallest attention:Col
2: Largest attention: like | smallest attention:a
3: Largest attention: : | smallest attention:t
4: Largest attention: para | smallest attention:a
5: Largest attention: . | smallest attention:Fan
Predicted word: I
Input words ordered by attention:
1: Largest attention: phrase | smallest attention:t
2: Largest attention: </s> | smallest attention:a
3: Largest attention: para | smallest attention:a
4: Largest attention: : | smallest attention:Col
5: Largest attention: like | smallest attention:and
Predicted word: like
Input words ordered by attention:
1: Largest attention: Fan | smallest attention:I
2: Largest attention: Col | smallest attention:.
3: Largest attention: phrase | smallest attention:like
4: Largest attention: a | smallest attention:para
5: Largest attention: </s> | smallest attention:a
Expected results
I was expecting an almost one-to-one mapping, since the paraphrase is very close to the input, but that is not the case. The model does produce good paraphrases. Do you think I made an error in interpreting the cross-attention object?
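For reference, the kind of alignment I had in mind would look roughly like the sketch below, which averages the cross-attention over all decoder layers and heads instead of picking a single layer and head (it reuses the objects defined above and is only meant to illustrate what I expected, not a claim that this is the right way to read the attentions):

input_tokens = tokenizer.convert_ids_to_tokens(encoding.input_ids[0].tolist())
for step, step_attn in enumerate(beam_outputs.cross_attentions):
    # step_attn: one tensor per decoder layer,
    # shape (batch, num_heads, generated_len, input_len); generated_len is 1 per step with caching
    stacked = torch.stack(step_attn)             # (num_layers, batch, num_heads, generated_len, input_len)
    avg = stacked.mean(dim=0).mean(dim=1)[0, 0]  # average over layers and heads, batch 0, position 0 -> (input_len,)
    best = int(torch.argmax(avg))
    print(f"step {step}: most attended input token = {input_tokens[best]}")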
Thank you for your help!
Hopefully, it is something simple that I am missing.