How can I enforce reproducibility for Longformer?

Hi all,

I’m struggling to get reproducible results with Longformer.

Here is the output of transformers-cli env:

  • transformers version: 4.9.1
  • Platform: Linux-5.8.0-63-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.8.1+cu102 (True)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

I am running a script which fine-tunes Longformer for sequence classification twice, each time for 4 epochs.

When using the model "allenai/longformer-base-4096", I do not get the same training loss in the two iterations.
However, if I use "roberta-base" as a model, the training loss is identical in both iterations.
I did not find anything else I could add to the script to ensure reproducible results. Could you tell me if I am missing something?
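
For reference, this is what I understand the usual seeding / determinism setup to be (just a sketch, assuming PyTorch ≥ 1.8 and a recent transformers; my script below only uses part of these switches):

import os
import random
import numpy as np
import torch
import transformers

def make_deterministic(seed: int = 42):
    # seed every RNG the training loop can touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    transformers.set_seed(seed)  # also called internally by the Trainer when a seed is set

    # ask cuDNN for deterministic kernels and disable autotuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # raise an error whenever an op without a deterministic implementation is used (PyTorch >= 1.8)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS on CUDA >= 10.2
    torch.use_deterministic_algorithms(True)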

I plotted the training loss over epochs for two consecutive runs with "roberta-base" and with "allenai/longformer-base-4096". You can see that the "allenai/longformer-base-4096" runs show different training losses in the two runs, whereas the "roberta-base" runs have identical training loss.
See the plot in this wandb report:
Wandb Report

Below is code to reproduce the results. You can comment/uncomment the respective model_name to choose either "allenai/longformer-base-4096" or "roberta-base".

import torch
import random
import wandb
import datetime
import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoConfig, TrainingArguments, Trainer, AutoModelForSequenceClassification
import transformers


seed = 42

# python RNG
random.seed(seed)

# pytorch RNGs
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

# numpy RNG
np.random.seed(seed)

#model_name = "roberta-base"
model_name = "allenai/longformer-base-4096"

raw_datasets = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
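# note: padding="max_length" pads every example to the tokenizer's model_max_length
# (4096 tokens for "allenai/longformer-base-4096", 512 for "roberta-base")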

tokenized_datasets =, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
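# the fixed shuffle seed ensures that both runs train and evaluate on exactly the same 1000 examples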

def get_model():
    # get_model is passed as the model_init argument to the Trainer. This should ensure reproducibility:
    # otherwise the weights of the classification head would be randomly initialized without a fixed seed.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        config = AutoConfig.from_pretrained(model_name, num_labels = 2),
    )
    return model

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

lr = 1e-5
num_epochs = 4
batch_size = 2
model_path = "models/" + model_name.replace("/", "_")

for i in range(2):
    run = wandb.init(
        name = "transformers_" + model_name + "_" +"%Y%m%d_%H%M%S"),
        notes = "reproducibility training with imdb dataset",
        save_code = True,
        config = {
            "num_epochs": num_epochs,
            "lr": lr,
            "batch_size": batch_size,
            "model_name": model_name,
        },
    )

    training_args = TrainingArguments(
        seed = seed,
        learning_rate = lr,
        num_train_epochs = num_epochs,
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = batch_size,
        output_dir = "./test_output",
    )

    trainer = Trainer(
        model_init = get_model,
        args = training_args,
        train_dataset = small_train_dataset,
        eval_dataset = small_eval_dataset,
        compute_metrics = compute_metrics,
    )
    trainer.train()
    run.finish()
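
As an additional sanity check (not part of the script above, just a sketch), I would compare the initial weights of two models built via get_model with the same seed, to confirm that the randomly initialized classification head really starts out identical:

def models_start_identical(seed: int = 42) -> bool:
    # build the model twice from the same seed and compare every parameter bit-for-bit
    transformers.set_seed(seed)
    model_a = get_model()
    transformers.set_seed(seed)
    model_b = get_model()
    for (name_a, p_a), (name_b, p_b) in zip(model_a.named_parameters(), model_b.named_parameters()):
        if name_a != name_b or not torch.equal(p_a, p_b):
            print("mismatch in parameter:", name_a)
            return False
    return True

print(models_start_identical())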

Hi @DavidPfl, were you able to figure this out?