Hi,
I am curious about the new BigBird model, and I'm trying to pretrain one for my language+domain. However, running my usual pretraining script (see below) gives me the following message, and because the model falls back to full attention instead of block sparse attention, my GPU (Tesla V100) obviously runs out of memory in no time.
```
Attention type 'block_sparse' is not possible if sequence_length: 130 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
```
I understand what the problem is, but not how to solve it. How can I dynamically pad each minibatch up to the number of tokens that formula requires for block sparse attention (704 here), rather than padding everything statically? I suppose I could pad all my samples to the maximum sequence length (4096), but that seems exceedingly wasteful as well. Any pointers on how to proceed would be immensely appreciated.
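What I have in mind is something like the wrapper below (an untested sketch of my own, not anything from the library; `PadToBlockSparseMinimum` is just a name I made up). If I'm reading the docs right it can lean on the existing `pad_to_multiple_of` argument of `DataCollatorForLanguageModeling` for the block multiple, and then top each batch up past the threshold from the warning, i.e. (5 + 2 * num_random_blocks) * block_size = 704 tokens, rounded up to the next block boundary (768). Does this look like a sensible approach, or is there a built-in way I'm missing?

```
import torch
from transformers import DataCollatorForLanguageModeling


class PadToBlockSparseMinimum:
    """MLM collator wrapper that pads every batch up to the block sparse
    minimum, so BigBird never falls back to 'original_full'. (My own sketch.)"""

    def __init__(self, tokenizer, block_size=64, num_random_blocks=3,
                 mlm_probability=0.15):
        # pad_to_multiple_of rounds each batch up to a multiple of the block size
        self.inner = DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=True,
            mlm_probability=mlm_probability,
            pad_to_multiple_of=block_size,
        )
        self.tokenizer = tokenizer
        # Threshold from the warning: (5 + 2 * num_random_blocks) * block_size = 704,
        # so pad to the next block multiple above it (768 for the default config).
        self.min_len = (5 + 2 * num_random_blocks) * block_size + block_size

    def __call__(self, examples):
        batch = self.inner(examples)
        seq_len = batch["input_ids"].shape[1]
        if seq_len < self.min_len:
            extra = self.min_len - seq_len
            # pad input_ids with the pad token, labels with -100 (ignored by the
            # loss), and everything else (attention_mask, token_type_ids) with 0
            fill = {"input_ids": self.tokenizer.pad_token_id, "labels": -100}
            for key in batch:
                batch[key] = torch.nn.functional.pad(
                    batch[key], (0, extra), value=fill.get(key, 0)
                )
        return batch
```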
Thanks a lot!
My current pretraining code:
```
# imports for the snippet below (FLAGS just holds my command line arguments)
from datasets import load_from_disk
from transformers import (
    BigBirdConfig,
    BigBirdForMaskedLM,
    BigBirdTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# tokenizer trained for my language/domain
tokenizer = BigBirdTokenizer(FLAGS.tokenizer)
print('tokenizer:', tokenizer)

# small BigBird config with block sparse attention
config = BigBirdConfig(
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=6,
    max_position_embeddings=4096,
    attention_type="block_sparse",
)
model = BigBirdForMaskedLM(config=config)
print('model', model)
print(model.num_parameters(), 'parameters')

# pre-tokenized dataset saved earlier with datasets' save_to_disk
dataset = load_from_disk(FLAGS.data)
print('data loaded')

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=FLAGS.output,
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=FLAGS.batchsize,
    save_steps=10_000,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

print('start training!')
trainer.train()
print('done training!')
trainer.save_model(training_args.output_dir)
```
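If the wrapper sketched above is indeed the way to go, I assume the only change to this script would be the collator line, something like:

```
data_collator = PadToBlockSparseMinimum(
    tokenizer,
    block_size=config.block_size,
    num_random_blocks=config.num_random_blocks,
)
```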