Fine-tuning BERT with sequences longer than 512 tokens

arteagac · December 9, 2021, 5:08am

The BERT models I have found in the Model’s Hub handle a maximum input length of 512. Using sequences longer than 512 seems to require training the models from scratch, which is time consuming and computationally expensive. However, the only limitation to input sequences longer than 512 in a pretrained BERT model is the length of the position embeddings. Therefore, Would it be okay if I expand the matrix of pretrained position embeddings in order to handle longer sequences and avoid the training from scratch? This can be achieved by re-using the first 512 embeddings to expand the matrix of position embeddings.

I know BERT’s attention mechanism is quadratic and inputting longer sequences requires more computations; however, this seems manageable for sequences below 2k tokens by reducing the batch size. In fact, I tried this expansion of position embeddings using max_position_embeddings=1024 and managed to fine-tune a bert-medium model using less than 12GB of GPU memory with a batch_size=8 at a rate of 4 minutes per epoch (4k records in the training set). The improvement in accuracy compared to using a max length of 512 seems negligible (~1%); however, having the opportunity to use longer sequences might be useful in applications where truncation is too inconvenient and using larger models such as Longformer and BigBird is difficult in terms of computational resources. I suspect this “hacky” expansion of position embeddings works because the expanded matrix of embeddings still carries the positional information BERT needs. In summary, my quick test suggests the simple expansion of the position embeddings works; however, I am asking to see if anybody has an opinion about any potential issues that may arise with this approach.

Below is the code I used to expand the position embeddings to length 1024 by simply stacking the first 512 positions twice. Note that the position_ids and token_type_ids also need to be expanded.

max_length = 1024
model_checkpoint = "prajjwal1/bert-medium"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

tokenizer.model_max_length = max_length
model.config.max_position_embeddings = max_length
model.base_model.embeddings.position_ids = torch.arange(max_length).expand((1, -1))
model.base_model.embeddings.token_type_ids = torch.zeros(max_length).expand((1, -1))
orig_pos_emb = model.base_model.embeddings.position_embeddings.weight
model.base_model.embeddings.position_embeddings.weight = torch.nn.Parameter(torch.cat((orig_pos_emb, orig_pos_emb)))

moma1820 · February 27, 2022, 2:04pm

why not use a longformer?

arteagac · February 28, 2022, 7:57pm

In my experience, LongFormer and BigBird require a lot of GPU memory. I tried using these on a 14GB GPU, but I was limited to batch_size=1, which took for ever to train and yielded rather poor results.

moma1820 · February 28, 2022, 8:13pm

Interesting, was just thinking that if a bert model is passed in longer sequences of tokens. But the bert model was initially pretrained with a limited (512) tokens. Wouldn’t the weights be “confused” by the longer sequences. hence perform badly?

arteagac · March 1, 2022, 6:10am

Perhaps, but I assume the “confusion” will not be significant, given that the only element in BERT’s architecture that depends on the sequence length is the position embeddings. My empirical test on an IMDB text classification task suggests that using longer sequences does not degrade the performance, but it does not significantly improve it either (see the original post where I mention that performance only increased by 1% when using longer sequences). However, for other datasets and tasks, the performance may vary when using longer sequences.

moma1820 · April 3, 2022, 10:44am

do you know if the max seq length is 512 character token or word token?

arteagac · April 3, 2022, 8:33pm

HI @moma1820, I saw you posted a similar question in another thread, so I replied there. See the link below:

moma1820 · April 4, 2022, 6:17am

thanks man!

Topic		Replies	Views
Is there a pre-trained BERT model with the sequence length 2048? 🤗Transformers	2	2094	November 5, 2020
Token Classification Models on (Very) Long Text Models	8	11202	March 9, 2023
Modeling long sequences Models	0	461	June 9, 2022
Flan-T5 - Finetuning to a Longer Sequence Length (512 -> 2048 tokens): Will it work? Beginners	3	4217	January 9, 2024
How to load a BERT model with 1024 dimensions Beginners	0	2883	June 9, 2021

Fine-tuning BERT with sequences longer than 512 tokens

Related topics