Hello there,
thank you all in advance for helping me
I have built a Longformer Encoder-Decoder on top of an MBart architecture by simply following the instructions in longformer/convert_bart_to_longformerencoderdecoder.py from allenai/longformer (which I cannot link here because I am a new member).
I am using the MBart model from Hugging Face: ARTeLab/mbart-summarization-fanpage.
In doing so, I first updated all the import statements to pull from the 'transformers' library; second, since I am working on Google Colab to use a GPU, I moved all the necessary classes into a .ipynb file.
An old thread of mine described the difficulty of making this work with the code provided by allenai. After manually updating the class LongformerEncoderDecoderForConditionalGeneration(MBartForConditionalGeneration) to add self.model.encoder.embed_positions = MBartLearnedPositionalEmbedding(4096, 1024), as seen in "Converting MBart to Longformer" on GitHub, creating the model now seems to work, and model.encoder.embed_positions.weight now has the expected size.
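For reference, a one-line check to confirm the override took effect (a sketch, to be run after creating the model below; MBartLearnedPositionalEmbedding reserves two extra positions internally, so MBartLearnedPositionalEmbedding(4096, 1024) allocates a 4098 x 1024 table):

print(model.model.encoder.embed_positions.weight.shape)  # expected: torch.Size([4098, 1024])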
Snippets of code and more detailed explanations are given below.
Environment info
- transformers version: 4.17.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.12
- PyTorch version (GPU?): 1.10.0+cu111 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes, Backend (GPU) Google Compute Engine
- Using distributed or parallel set-up in script?: no
Who can help
Information
Model I am using: LongformerEncoderDecoderForConditionalGeneration (built on MBART).
The problem arises when running inference by calling the generate function:
- model.generate(inputs['input_ids'], num_beams=4, max_length=50, early_stopping=True)
The task I am trying to work on is:
- Summarization in Italian
To reproduce
Steps to reproduce the behavior:
- Run the code
!pip install transformers==4.17.0 SentencePiece
from typing import List, Optional, Tuple, Dict
from torch import nn, Tensor
# from longformer.longformer import LongformerSelfAttention
from transformers import LongformerSelfAttention
from transformers import MBartConfig, MBartForConditionalGeneration
from transformers.models.mbart.modeling_mbart import MBartLearnedPositionalEmbedding
class LongformerEncoderDecoderForConditionalGeneration(MBartForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        if config.attention_mode == 'n2':
            pass  # do nothing, keep the regular MBart self-attention
        else:
            self.model.encoder.embed_positions = MBartLearnedPositionalEmbedding(4096, 1024)
            for i, layer in enumerate(self.model.encoder.layers):
                layer.self_attn = LongformerSelfAttentionForMBart(config, layer_id=i)
class LongformerEncoderDecoderConfig(MBartConfig):
    def __init__(self, attention_window: List[int] = None, attention_dilation: List[int] = None,
                 autoregressive: bool = False, attention_mode: str = 'sliding_chunks',
                 gradient_checkpointing: bool = False, **kwargs):
        """
        Args:
            attention_window: list of attention window sizes of length = number of layers.
                window size = number of attention locations on each side.
                For an effective window size of 512, use `attention_window=[256]*num_layers`,
                which is 256 on each side.
            attention_dilation: list of attention dilations of length = number of layers.
                attention dilation of `1` means no dilation.
            autoregressive: do autoregressive attention or have attention on both sides
            attention_mode: 'n2' for regular n^2 self-attention, 'tvm' for the TVM implementation
                of Longformer self-attention, 'sliding_chunks' for another implementation of
                Longformer self-attention
        """
        super().__init__(**kwargs)
        self.attention_window = attention_window
        self.attention_dilation = attention_dilation
        self.autoregressive = autoregressive
        self.attention_mode = attention_mode
        self.gradient_checkpointing = gradient_checkpointing
        assert self.attention_mode in ['tvm', 'sliding_chunks', 'n2']
class LongformerSelfAttentionForMBart(nn.Module):
    def __init__(self, config, layer_id):
        super().__init__()
        self.embed_dim = config.d_model
        self.longformer_self_attn = LongformerSelfAttention(config, layer_id=layer_id)
        self.output = nn.Linear(self.embed_dim, self.embed_dim)

    def forward(
        self,
        hidden_states=None,
        attention_mask=None,
        layer_head_mask=None,
        output_attentions=False,
    ) -> Tuple[Tensor, Optional[Tensor]]:
        # NEW
        outputs = self.longformer_self_attn(
            hidden_states=hidden_states,      # I'm guessing I just need to pass
            attention_mask=attention_mask,    # I'm guessing I just need to pass
            layer_head_mask=layer_head_mask,  # I'm guessing I just need to pass
            is_index_masked=None,
            is_index_global_attn=None,
            is_global_attn=None,
            output_attentions=output_attentions,
        )
        attn_output = self.output(outputs[0].transpose(0, 1))
        return (attn_output,) + outputs[1:] if len(outputs) == 2 else (attn_output, None)
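# Note (my guess, not verified): the transpose(0, 1) above follows the
# (seq_len, batch, hidden) layout of the original allenai implementation,
# whereas the LongformerSelfAttention bundled with transformers returns
# batch-first tensors, so the two layout conventions may not line up here.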
import argparse
import logging
import os
import copy
from transformers import MBartTokenizer
from transformers import MBartForConditionalGeneration, AutoTokenizer
# from transformers.modeling_bart import shift_tokens_right
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
def create_long_model(
    save_model_to,
    base_model,
    tokenizer_name_or_path,
    attention_window,
    max_pos
):
    model = MBartForConditionalGeneration.from_pretrained(base_model)
    tokenizer = MBartTokenizer.from_pretrained(tokenizer_name_or_path, model_max_length=max_pos)
    config = LongformerEncoderDecoderConfig.from_pretrained(base_model)
    model.config = config

    # in MBart the attention dropout is called attention_dropout, but LongformerSelfAttention
    # expects attention_probs_dropout_prob, so set it here
    config.attention_probs_dropout_prob = config.attention_dropout
    config.architectures = ['LongformerEncoderDecoderForConditionalGeneration', ]

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.model.encoder.embed_positions.weight.shape
    assert current_max_pos == config.max_position_embeddings + 2

    config.max_encoder_position_embeddings = max_pos
    config.max_decoder_position_embeddings = config.max_position_embeddings
    del config.max_position_embeddings
    max_pos += 2  # NOTE: MBart has positions 0, 1 reserved, so embedding size is max position + 2
    assert max_pos >= current_max_pos

    # allocate a larger position embedding matrix for the encoder
    new_encoder_pos_embed = model.model.encoder.embed_positions.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_encoder_pos_embed[k:(k + step)] = model.model.encoder.embed_positions.weight[2:]
        k += step
    model.model.encoder.embed_positions.weight.data = new_encoder_pos_embed

    # replace the MBart self-attention object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    config.attention_dilation = [1] * config.num_hidden_layers
    for i, layer in enumerate(model.model.encoder.layers):
        longformer_self_attn_for_mbart = LongformerSelfAttentionForMBart(config, layer_id=i)
        longformer_self_attn_for_mbart.longformer_self_attn.query = layer.self_attn.q_proj
        longformer_self_attn_for_mbart.longformer_self_attn.key = layer.self_attn.k_proj
        longformer_self_attn_for_mbart.longformer_self_attn.value = layer.self_attn.v_proj
        longformer_self_attn_for_mbart.longformer_self_attn.query_global = copy.deepcopy(layer.self_attn.q_proj)
        longformer_self_attn_for_mbart.longformer_self_attn.key_global = copy.deepcopy(layer.self_attn.k_proj)
        longformer_self_attn_for_mbart.longformer_self_attn.value_global = copy.deepcopy(layer.self_attn.v_proj)
        longformer_self_attn_for_mbart.output = layer.self_attn.out_proj
        layer.self_attn = longformer_self_attn_for_mbart

    # save model
    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer
save_model_to = "model"
base_model = "ARTeLab/mbart-summarization-fanpage"
attention_window = 512
max_pos=4096
if not os.path.exists(save_model_to):
os.mkdir(save_model_to)
create_long_model(
save_model_to=save_model_to,
base_model=base_model,
tokenizer_name_or_path=base_model,
attention_window=attention_window,
max_pos=max_pos
)
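As a quick sanity check (a sketch; base_model and save_model_to are the variables set above, and the +2 offset comes from MBart's two reserved positions), the enlarged encoder table should be a tiling of the original 1026 learned rows:

import torch

base = MBartForConditionalGeneration.from_pretrained(base_model)
long_model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(save_model_to)
old_pos = base.model.encoder.embed_positions.weight
new_pos = long_model.model.encoder.embed_positions.weight
print(new_pos.shape)  # expected: torch.Size([4098, 1024])
# the first tiled block should equal the original learned embeddings
print(torch.equal(new_pos[2:old_pos.shape[0]], old_pos[2:]))  # expected: True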
- Load the tokenizer and model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/content/model')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained('/content/model')

# What are these doing?! I uncommented them because I think they will fix
# the problem with the arguments of the forward function.
model.model.encoder.config.gradient_checkpointing = True
model.model.decoder.config.gradient_checkpointing = True
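To double-check that the loaded checkpoint really goes through the converted classes, a small structural check (a sketch; it only uses objects defined above, and the expected values follow from the conversion script):

print(type(model.model.encoder.layers[0].self_attn).__name__)  # expected: LongformerSelfAttentionForMBart
print(model.config.attention_mode)       # expected: sliding_chunks
print(model.config.attention_window[0])  # expected: 512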
- When I try to run inference, pretending ARTICLE_TO_SUMMARIZE is in Italian (the actual language does not matter here):
# the same paragraph repeated four times so the input comfortably exceeds the usual length limit
ARTICLE_TO_SUMMARIZE = '''Transformers (Vaswani et al., 2017) have achieved state-of-the-art
results in a wide range of natural language tasks including generative language modeling
(Dai et al., 2019; Radford et al., 2019) and discriminative ... language understanding (Devlin et al., 2019).
This success is partly due to the self-attention component which enables the network to capture contextual
information from the entire sequence. While powerful, the memory and computational requirements of
self-attention grow quadratically with sequence length, making it infeasible (or very expensive) to
process long sequences. To address this limitation, we present Longformer, a modified Transformer
architecture with a self-attention operation that scales linearly with the sequence length, making it
versatile for processing long documents (Fig 1). This is an advantage for natural language tasks such as
long document classification, question answering (QA), and coreference resolution, where existing approaches
partition or shorten the long context into smaller sequences that fall within the typical 512 token limit
of BERT-style pretrained models. Such partitioning could potentially result in loss of important
cross-partition information, and to mitigate this problem, existing methods often rely on complex
architectures to address such interactions. On the other hand, our proposed Longformer is able to build
contextual representations of the entire context using multiple layers of attention, reducing the need for
task-specific architectures.''' * 4
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=4096, return_tensors='pt', padding="max_length", truncation=True)
# Generate Summary
print(inputs['input_ids'])
print('length input ids:', inputs['input_ids'].shape)
print('w = ', model.model.config.attention_window)
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=50, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
I get the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-16-526dcf88f4e5> in <module>()
----> 1 model.generate(input_ids, do_sample=False, temperature=0.7, num_beams=int(3), length_penalty=float(2), max_length=int(150), min_length=int(50), no_repeat_ngram_size=int(3))

11 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/longformer/modeling_longformer.py in _sliding_chunks_query_key_matmul(self, query, key, window_overlap)
    806             overlap of size window_overlap
    807         """
--> 808         batch_size, seq_len, num_heads, head_dim = query.size()
    809         assert (
    810             seq_len % (window_overlap * 2) == 0

ValueError: too many values to unpack (expected 4)
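If it helps with the diagnosis: my current guess (unverified) is a mask-shape mismatch. MBart's encoder expands attention_mask to 4-D with _expand_mask before handing it to each layer, while the LongformerSelfAttention shipped with transformers expects a 2-D (batch_size, seq_len) mask; its internal [:, :, None, None] indexing then turns the 4-D mask into a 6-D tensor, so unpacking its size into four values fails. A minimal illustration of the unpack error itself (dummy sizes, independent of the model):

import torch

mask = torch.zeros(1, 1, 8, 8)   # a 4-D expanded mask: (batch, 1, seq_len, seq_len)
six_d = mask[:, :, None, None]   # what the indexing inside LongformerSelfAttention produces
try:
    batch_size, seq_len, num_heads, head_dim = six_d.size()
except ValueError as e:
    print(e)  # too many values to unpack (expected 4)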
Expected behavior
A summary is generated from the input text.