Can't run inference with a Longformer model built on top of MBART

Hello there,

Thank you all in advance for helping me :smiling_face:

I have built a Longformer Encoder Decoder on top of an MBart architecture by simply following the instructions provided in longformer/convert_bart_to_longformerencoderdecoder.py from allenai/longformer (which I cannot link here because I am a new member).

I am using the MBart model from huggingface → ARTeLab/mbart-summarization-fanpage

In doing so, I first updated the imports to the current 'transformers' library and, since I am working on Google Colab to use a GPU, moved all the necessary classes into a .ipynb file.
An old thread of mine described the difficulty of making this work with the code provided by allenai. After manually updating the class LongformerEncoderDecoderForConditionalGeneration(MBartForConditionalGeneration) to add self.model.encoder.embed_positions = MBartLearnedPositionalEmbedding(4096, 1024), as suggested in "Converting MBart to Longformer · GitHub", creating the model now seems to work, and model.encoder.embed_positions.weight now has the expected size.
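
As a quick sanity check (my own, not part of the allenai script; '/content/model' is the folder produced by the conversion code in the repro below, and the 4098 comes from max_pos=4096 plus the offset of 2 that MBartLearnedPositionalEmbedding adds internally):

checked = LongformerEncoderDecoderForConditionalGeneration.from_pretrained('/content/model')
print(checked.model.encoder.embed_positions.weight.shape)  # expecting torch.Size([4098, 1024])
print(checked.model.decoder.embed_positions.weight.shape)  # decoder positions keep their original size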

Code snippets and more detailed explanations are given below.

Environment info

  • transformers version: 4.17.0
  • Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (True)
  • Tensorflow version (GPU?): 2.8.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes, Backend (GPU) Google Compute Engine
  • Using distributed or parallel set-up in script?: no

Who can help

@ydshieh

Information

Model I am using (Longformer Encoder Decoder For Conditional Generation, MBART):

The problem arises when I run inference by calling the generate function:

  • model.generate(inputs['input_ids'], num_beams=4, max_length=50, early_stopping=True)

The task I am trying to work on is:

  • Summarization in Italian

To reproduce

Steps to reproduce the behavior:

  1. Run the following conversion code:
!pip install transformers==4.17.0 SentencePiece

from typing import List, Optional, Tuple, Dict
from torch import nn, Tensor
# from longformer.longformer import LongformerSelfAttention
from transformers import LongformerSelfAttention
from transformers import MBartConfig, MBartForConditionalGeneration
from transformers.models.mbart.modeling_mbart import MBartLearnedPositionalEmbedding

class LongformerEncoderDecoderForConditionalGeneration(MBartForConditionalGeneration):
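    """Same as MBartForConditionalGeneration, except that (unless config.attention_mode == 'n2')
    the encoder position embeddings are re-created with 4096 positions and every encoder
    self-attention layer is wrapped in Longformer sliding-window attention via
    LongformerSelfAttentionForMBart (defined below)."""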

    def __init__(self, config):
        super().__init__(config)
        if config.attention_mode == 'n2':
            pass  # do nothing, keep the default MBartAttention
        else:

            self.model.encoder.embed_positions = MBartLearnedPositionalEmbedding(4096, 1024)
            for i, layer in enumerate(self.model.encoder.layers):
                layer.self_attn = LongformerSelfAttentionForMBart(config, layer_id=i)

class LongformerEncoderDecoderConfig(MBartConfig):
    def __init__(self, attention_window: List[int] = None, attention_dilation: List[int] = None,
                 autoregressive: bool = False, attention_mode: str = 'sliding_chunks',
                 gradient_checkpointing: bool = False, **kwargs):
        """
        Args:
            attention_window: list of attention window sizes of length = number of layers.
                window size = number of attention locations on each side.
                For an effective window size of 512, use `attention_window=[256]*num_layers`
                which is 256 on each side.
            attention_dilation: list of attention dilations of length = number of layers.
                attention dilation of `1` means no dilation.
            autoregressive: do autoregressive attention or attend to both sides
            attention_mode: 'n2' for regular n^2 self-attention, 'tvm' for the TVM implementation of
                Longformer self-attention, 'sliding_chunks' for another implementation of Longformer self-attention
        """
        super().__init__(**kwargs)
        self.attention_window = attention_window
        self.attention_dilation = attention_dilation
        self.autoregressive = autoregressive
        self.attention_mode = attention_mode
        self.gradient_checkpointing = gradient_checkpointing
        assert self.attention_mode in ['tvm', 'sliding_chunks', 'n2']
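
# Just to illustrate how I use this config (this mirrors what create_long_model() does further
# down, with the same checkpoint from the Hub):
#   cfg = LongformerEncoderDecoderConfig.from_pretrained("ARTeLab/mbart-summarization-fanpage")
#   cfg.attention_mode    # defaults to 'sliding_chunks'
#   cfg.attention_window  # None at this point; filled in per layer inside create_long_model()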

class LongformerSelfAttentionForMBart(nn.Module):

    def __init__(self, config, layer_id):
        super().__init__()
        self.embed_dim = config.d_model
        self.longformer_self_attn = LongformerSelfAttention(config, layer_id=layer_id)
        self.output = nn.Linear(self.embed_dim, self.embed_dim)

    def forward(
        self,
        hidden_states=None,
        attention_mask=None,
        layer_head_mask=None,
        output_attentions=False,
    ) -> Tuple[Tensor, Optional[Tensor]]:

        # NEW

        outputs = self.longformer_self_attn(
            hidden_states=hidden_states,  # I'm guessing I just need to pass
            attention_mask=attention_mask,   # I'm guessing I just need to pass
            layer_head_mask=layer_head_mask,  # I'm guessing I just need to pass
            is_index_masked=None,
            is_index_global_attn=None,
            is_global_attn=None,
            output_attentions=output_attentions,
        )

        attn_output = self.output(outputs[0].transpose(0, 1))
        return (attn_output,) + outputs[1:] if len(outputs) == 2 else (attn_output, None)
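
# Side note: as far as I can tell from modeling_longformer.py, HF's LongformerSelfAttention takes
# and returns hidden_states as (batch, seq_len, embed_dim) and expects attention_mask to be a 2D
# (batch, seq_len) tensor (0 = local attention, large negative = padding, large positive = global
# attention). I am not sure the mask that MBartEncoder passes down to each layer has that shape,
# which may be where my generate() call goes wrong.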


import argparse
import logging
import os
import copy

from transformers import MBartTokenizer
from transformers import MBartForConditionalGeneration, AutoTokenizer
# from transformers.modeling_bart import shift_tokens_right

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def create_long_model(
    save_model_to,
    base_model,
    tokenizer_name_or_path,
    attention_window,
    max_pos
):
    model = MBartForConditionalGeneration.from_pretrained(base_model)
    tokenizer = MBartTokenizer.from_pretrained(tokenizer_name_or_path, model_max_length=max_pos)
    config = LongformerEncoderDecoderConfig.from_pretrained(base_model)
    model.config = config

    # in (M)BART the attention dropout is called `attention_dropout`, but LongformerSelfAttention
    # expects `attention_probs_dropout_prob`, so copy it over here
    config.attention_probs_dropout_prob = config.attention_dropout
    config.architectures = ['LongformerEncoderDecoderForConditionalGeneration', ]

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.model.encoder.embed_positions.weight.shape
    assert current_max_pos == config.max_position_embeddings + 2

    config.max_encoder_position_embeddings = max_pos
    config.max_decoder_position_embeddings = config.max_position_embeddings
    del config.max_position_embeddings
    max_pos += 2  # NOTE: BART has positions 0,1 reserved, so embedding size is max position + 2
    assert max_pos >= current_max_pos

    # allocate a larger position embedding matrix for the encoder
    new_encoder_pos_embed = model.model.encoder.embed_positions.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_encoder_pos_embed[k:(k + step)] = model.model.encoder.embed_positions.weight[2:]
        k += step
    model.model.encoder.embed_positions.weight.data = new_encoder_pos_embed
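    # (with this checkpoint current_max_pos should be 1026 and max_pos becomes 4098, so the 1024
    #  learned positions get tiled four times; rows 0 and 1 of new_encoder_pos_embed are left
    #  uninitialized, just like in the original allenai conversion script)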

    # replace each encoder `MBartAttention` module with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    config.attention_dilation = [1] * config.num_hidden_layers

    for i, layer in enumerate(model.model.encoder.layers):
        longformer_self_attn_for_bart = LongformerSelfAttentionForMBart(config, layer_id=i)

        longformer_self_attn_for_bart.longformer_self_attn.query = layer.self_attn.q_proj
        longformer_self_attn_for_bart.longformer_self_attn.key = layer.self_attn.k_proj
        longformer_self_attn_for_bart.longformer_self_attn.value = layer.self_attn.v_proj

        longformer_self_attn_for_bart.longformer_self_attn.query_global = copy.deepcopy(layer.self_attn.q_proj)
        longformer_self_attn_for_bart.longformer_self_attn.key_global = copy.deepcopy(layer.self_attn.k_proj)
        longformer_self_attn_for_bart.longformer_self_attn.value_global = copy.deepcopy(layer.self_attn.v_proj)
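        # the local query/key/value projections above share weights with the original MBart
        # attention, while the global projections get their own (deep-copied) weights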

        longformer_self_attn_for_bart.output = layer.self_attn.out_proj

        layer.self_attn = longformer_self_attn_for_bart

    # save model
    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

save_model_to = "model"
base_model = "ARTeLab/mbart-summarization-fanpage"
attention_window = 512
max_pos=4096
if not os.path.exists(save_model_to):
      os.mkdir(save_model_to)

create_long_model(
    save_model_to=save_model_to,
    base_model=base_model,
    tokenizer_name_or_path=base_model,
    attention_window=attention_window,
    max_pos=max_pos
)

  2. Load the tokenizer and model
from transformers import AutoTokenizer
from transformers import MBartForConditionalGeneration
# from transformers.modeling_bart import shift_tokens_right

tokenizer = AutoTokenizer.from_pretrained('/content/model')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained('/content/model')

# I uncommented these two lines because I think they may fix the problem with the arguments
# of the forward function
model.model.encoder.config.gradient_checkpointing = True
model.model.decoder.config.gradient_checkpointing = True
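
As a quick check that the Longformer-specific settings survive the save/load round trip (again my own check, not part of the conversion script):

print(model.config.attention_mode)                     # expecting 'sliding_chunks'
print(type(model.model.encoder.layers[0].self_attn))   # expecting LongformerSelfAttentionForMBart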
  3. When I try to run inference, pretending "ARTICLE_TO_SUMMARIZE" is in Italian (it does not matter):
ARTICLE_TO_SUMMARIZE =  '''Transformers (Vaswani et al., 2017) have achieved state-of-the-art
results in a wide range of natural language tasks including generative language modeling
(Dai et al., 2019; Radford et al., 2019) and discriminative ... language understanding (Devlin et al., 2019).
This success is partly due to the self-attention component which enables the network to capture contextual
information from the entire sequence. While powerful, the memory and computational requirements of
self-attention grow quadratically with sequence length, making it infeasible (or very expensive) to
process long sequences. To address this limitation, we present Longformer, a modified Transformer
architecture with a self-attention operation that scales linearly with the sequence length, making it
versatile for processing long documents (Fig 1). This is an advantage for natural language tasks such as
long document classification, question answering (QA), and coreference resolution, where existing approaches
partition or shorten the long context into smaller sequences that fall within the typical 512 token limit
of BERT-style pretrained models. Such partitioning could potentially result in loss of important
cross-partition information, and to mitigate this problem, existing methods often rely on complex
architectures to address such interactions. On the other hand, our proposed Longformer is able to build
contextual representations of the entire context using multiple layers of attention, reducing the need for
task-specific architectures.Transformers (Vaswani et al., 2017) have achieved state-of-the-art
results in a wide range of natural language tasks including generative language modeling
(Dai et al., 2019; Radford et al., 2019) and discriminative ... language understanding (Devlin et al., 2019).
This success is partly due to the self-attention component which enables the network to capture contextual
information from the entire sequence. While powerful, the memory and computational requirements of
self-attention grow quadratically with sequence length, making it infeasible (or very expensive) to
process long sequences. To address this limitation, we present Longformer, a modified Transformer
architecture with a self-attention operation that scales linearly with the sequence length, making it
versatile for processing long documents (Fig 1). This is an advantage for natural language tasks such as
long document classification, question answering (QA), and coreference resolution, where existing approaches
partition or shorten the long context into smaller sequences that fall within the typical 512 token limit
of BERT-style pretrained models. Such partitioning could potentially result in loss of important
cross-partition information, and to mitigate this problem, existing methods often rely on complex
architectures to address such interactions. On the other hand, our proposed Longformer is able to build
contextual representations of the entire context using multiple layers of attention, reducing the need for
task-specific architectures.Transformers (Vaswani et al., 2017) have achieved state-of-the-art
results in a wide range of natural language tasks including generative language modeling
(Dai et al., 2019; Radford et al., 2019) and discriminative ... language understanding (Devlin et al., 2019).
This success is partly due to the self-attention component which enables the network to capture contextual
information from the entire sequence. While powerful, the memory and computational requirements of
self-attention grow quadratically with sequence length, making it infeasible (or very expensive) to
process long sequences. To address this limitation, we present Longformer, a modified Transformer
architecture with a self-attention operation that scales linearly with the sequence length, making it
versatile for processing long documents (Fig 1). This is an advantage for natural language tasks such as
long document classification, question answering (QA), and coreference resolution, where existing approaches
partition or shorten the long context into smaller sequences that fall within the typical 512 token limit
of BERT-style pretrained models. Such partitioning could potentially result in loss of important
cross-partition information, and to mitigate this problem, existing methods often rely on complex
architectures to address such interactions. On the other hand, our proposed Longformer is able to build
contextual representations of the entire context using multiple layers of attention, reducing the need for
task-specific architectures.Transformers (Vaswani et al., 2017) have achieved state-of-the-art
results in a wide range of natural language tasks including generative language modeling
(Dai et al., 2019; Radford et al., 2019) and discriminative ... language understanding (Devlin et al., 2019).
This success is partly due to the self-attention component which enables the network to capture contextual
information from the entire sequence. While powerful, the memory and computational requirements of
self-attention grow quadratically with sequence length, making it infeasible (or very expensive) to
process long sequences. To address this limitation, we present Longformer, a modified Transformer
architecture with a self-attention operation that scales linearly with the sequence length, making it
versatile for processing long documents (Fig 1). This is an advantage for natural language tasks such as
long document classification, question answering (QA), and coreference resolution, where existing approaches
partition or shorten the long context into smaller sequences that fall within the typical 512 token limit
of BERT-style pretrained models. Such partitioning could potentially result in loss of important
cross-partition information, and to mitigate this problem, existing methods often rely on complex
architectures to address such interactions. On the other hand, our proposed Longformer is able to build
contextual representations of the entire context using multiple layers of attention, reducing the need for
task-specific architectures.'''

inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=4096, return_tensors='pt', padding="max_length", truncation=True)

# Generate Summary
print(inputs['input_ids'])
print('length of input ids:', inputs['input_ids'].shape[1])
print('w = ', model.model.config.attention_window)
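
# note: 4096 is a multiple of the attention window, so the sequence-length assert inside
# _sliding_chunks_query_key_matmul is not what fails; the ValueError below comes from the
# shape unpacking on the line just before that assert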

summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=50, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])

I get the following error

ValueError                                Traceback (most recent call last)
<ipython-input-16-526dcf88f4e5> in <module>()
----> 1 model.generate(input_ids, do_sample = False, temperature=0.7, num_beams=int(3), length_penalty=float(2), max_length = int(150), min_length=int(50), no_repeat_ngram_size=int(3))

11 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/longformer/modeling_longformer.py in _sliding_chunks_query_key_matmul(self, query, key, window_overlap)
    806         overlap of size window_overlap
    807         """
--> 808         batch_size, seq_len, num_heads, head_dim = query.size()
    809         assert (
    810             seq_len % (window_overlap * 2) == 0

ValueError: too many values to unpack (expected 4)

Expected behavior

  • A summary generated from the input text :slight_smile: