How can I enforce reproducibility for Longformer?

Hi all,

I’m struggling with ensuring reproducible results with the Longformer.

Here is the result of transformers-cli env:

  • transformers version: 4.9.1
  • Platform: Linux-5.8.0-63-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.8.1+cu102 (True)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

I am running a script that finetunes Longformer for sequence classification twice, each time for 4 epochs.

When using the model "allenai/longformer-base-4096", I do not get the same training loss across the two runs.
However, if I use "roberta-base" as the model, the training loss is identical in both runs.
I did not find anything else I could add to the script to ensure reproducible results. Could you tell me if I am missing something?
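
One thing I was unsure about: whether the manual seeding in the script below is equivalent to transformers' own set_seed helper. A minimal sketch of what I believe that helper covers (assuming the 4.9 API), in case it matters:

import transformers

# set_seed seeds Python's random module, NumPy and PyTorch (including all CUDA devices) in one call
transformers.set_seed(42)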

I plotted the training loss over epochs for two consecutive runs with "roberta-base" and with "allenai/longformer-base-4096". You can see that the two "allenai/longformer-base-4096" runs show different training losses, whereas the two "roberta-base" runs have identical training loss.
See the plot in this wandb report:
Wandb Report

Below is code to reproduce the results. You can comment/uncomment the respective model_name to choose either "allenai/longformer-base-4096" or "roberta-base".

import torch
import random
import wandb
import datetime
import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoConfig, TrainingArguments, Trainer, AutoModelForSequenceClassification
import transformers

transformers.logging.set_verbosity_error()

seed = 42
# python RNG
random.seed(seed)

# pytorch RNGs
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# numpy RNG
np.random.seed(seed)

#model_name = "roberta-base"
model_name = "allenai/longformer-base-4096"

raw_datasets = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

def get_model():
    # get_model is passed as the model_init argument of the Trainer so that the randomly
    # initialized classification head is re-seeded before each run; otherwise its weights
    # differ between runs.
    # see https://discuss.huggingface.co/t/fixing-the-random-seed-in-the-trainer-does-not-produce-the-same-results-across-runs/3442
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        config=AutoConfig.from_pretrained(model_name, num_labels=2),
    )
    return model

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

lr = 1e-5
num_epochs = 4
batch_size = 2
model_path = "models/" + model_name.replace("/", "_")

for i in range(2):
    run = wandb.init(
        reinit=True,
        name="transformers_" + model_name + "_" + datetime.datetime.now().strftime("%Y%m%d_%H%M%S"),
        notes="reproducibility training with imdb dataset",
        save_code=True,
        config={
            "model": model_name,
            "learning_rate": lr,
            "num_epochs": num_epochs,
            "warmup_ratio": 0.1,
            "batch_size": batch_size,
            "random_seed": seed,
        },
    )

    training_args = TrainingArguments(
        seed=seed,
        do_train=True,
        do_eval=True,
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        num_train_epochs=num_epochs,
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        output_dir="./test_output",
    )

    trainer = Trainer(
        model_init=get_model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    run.finish()
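
For completeness, here is a sketch of the additional CUDA-level determinism settings described in the PyTorch reproducibility notes (an assumption that they apply to this setup; I have not verified that they change the Longformer behaviour above):

import os
import torch

# Per the PyTorch reproducibility notes (assumed relevant, not verified for this script):
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS on CUDA >= 10.2
torch.backends.cudnn.benchmark = False             # disable non-deterministic cuDNN autotuning
torch.use_deterministic_algorithms(True)           # raise an error if a non-deterministic op is used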

Hi @DavidPfl, were you able to figure this out?

Hello, have you made any progress on this issue? I am facing the same problem with Longformer.

Facing the same issue with allenai/led-large-16384 via the run_summarization.py script. Was anyone able to get reproducible results with these models?