KeyError: 'loss' even after appending labels while Fine Tuning Transformer XL

I am trying to do a causal language modeling task by fine-tuning the Transformer XL (transfo-xl-wt103) model on my custom data. In my data, the maximum number of words in a line is approximately 50,000 and the average is about 1,000. I have read that Transformer XL can take inputs of unlimited length, so I think it is the best option for my task. Moreover, I don't want to concatenate the lines and split them into blocks of length 128, as that would defeat the actual purpose. I can add padding/truncation to make all sentences the same length. I want to know whether my task is achievable, and also what memory and GPU I would require for it.

Currently I am doing the same task with dummy data but am getting an error. Please check the code and help me with this error.
Here is my complete code:

from datasets import load_dataset
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments

data = """
A laptop, laptop computer, or notebook computer is a small, portable personal computer (PC).
Laptops are folded shut for transportation, and thus are suitable for mobile use.
Its name comes from lap, as it was deemed practical to be placed on a person's lap when being used. 
Today, laptops are used in a variety of settings, such as at work, in education, web browsing, and general home computer use.
Design elements, form factor and construction can also vary significantly between models depending on intended use. 
Examples of specialized models of laptops include rugged notebooks for use in construction or military applications.
"""

dataList = data.strip().split('.')
dataset = []
for line in dataList[0:-2]:
  dataset.append(line.strip())

dataFrame = pd.DataFrame(dataset, columns = ['data'])
valDataFrame = pd.DataFrame(dataset[0:2], columns = ['data'])
dataset = Dataset.from_pandas(dataFrame)
valDataset = Dataset.from_pandas(valDataFrame)

modelCheckpoint = 'transfo-xl-wt103'
tokenizer = AutoTokenizer.from_pretrained(modelCheckpoint)
tokenizer.pad_token = tokenizer.eos_token

def tokenizeFunction(examples):
  return tokenizer(examples["data"], add_special_tokens = True, padding = "max_length", max_length = 10, truncation = True)

dataset =, batched = True, remove_columns = ["data"])
valDataset =, batched = True, remove_columns = ["data"])
print(dataset[0], end = " ")

def appendLabels(examples):
  examples["labels"] = examples["input_ids"].copy()
  return examples 

dataset =, batched = True)
valDataset =, batched = True)
print(dataset[0], end = " ")

model = AutoModelForCausalLM.from_pretrained(modelCheckpoint)

trainingArgs = TrainingArguments(
    output_dir = "transfo-xl-finetuned",
    evaluation_strategy = "epoch",
    learning_rate = 2e-5,
    weight_decay = 0.01,
    label_names = ["labels"]
)

trainer = Trainer(
    model = model,
    args = trainingArgs,
    train_dataset = dataset,
    eval_dataset = valDataset
)

trainer.train()


KeyError                                  Traceback (most recent call last)
<ipython-input-37-0d4f1e0acd39> in <module>()
     15     eval_dataset = valDataset
     16 )
---> 17 trainer.train()

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/ in __getitem__(self, k)
   1614         if isinstance(k, str):
   1615             inner_dict = {k: v for (k, v) in self.items()}
-> 1616             return inner_dict[k]
   1617         else:
   1618             return self.to_tuple()[k]

KeyError: 'loss'

Version & Details:

- `transformers` version: 4.5.1
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

Yes, Transformer XL is the only model in the library that is incompatible with `Trainer`, because it returns `losses` (per-token, not averaged) instead of a single `loss`.
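If you still want to use `Trainer`, one possible workaround (just a sketch, assuming the transfo-xl-wt103 output in v4.5 exposes the per-token `losses` field described above, and that your `transformers` version lets you override `Trainer.compute_loss`) is to subclass it and reduce `losses` to a scalar yourself:

```python
import torch
from transformers import Trainer


class TransfoXLTrainer(Trainer):
    """Sketch of a Trainer subclass that averages Transformer XL's
    per-token `losses` into the scalar `loss` that Trainer expects."""

    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        # transfo-xl-wt103 (transformers 4.5) returns `losses` with one
        # value per target token instead of a single averaged `loss`,
        # which is what causes the KeyError in the default Trainer.
        loss = outputs.losses.mean()
        return (loss, outputs) if return_outputs else loss
```

You would then construct `TransfoXLTrainer` with the same arguments you already pass to `Trainer`. The class name is my own; whether the exact `compute_loss` hook behaves identically in your version is something to verify against the docs.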

Okay. @sgugger, can you please suggest any source/article/link where I can find how to fine-tune Transformer XL without the Trainer API?
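Not a definitive recipe, but a minimal PyTorch training loop along these lines avoids `Trainer` entirely (the `finetune` helper name and hyperparameters are my own invention; the only Transformer-XL-specific part is reducing `outputs.losses` to a scalar before calling `backward()`):

```python
import torch
from torch.utils.data import DataLoader


def finetune(model, dataset, epochs=1, batch_size=2, lr=2e-5, device="cpu"):
    """Minimal fine-tuning loop for a model whose forward pass returns
    per-token `losses` (as transfo-xl-wt103 does in transformers 4.5).
    `dataset` must yield dicts with `input_ids` and `labels` tensors."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            outputs = model(input_ids=batch["input_ids"].to(device),
                            labels=batch["labels"].to(device))
            # Reduce the per-token losses to a scalar before backprop.
            loss = outputs.losses.mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

With a real model this would be roughly `finetune(AutoModelForCausalLM.from_pretrained("transfo-xl-wt103"), dataset)` after setting the dataset format to PyTorch tensors. Note that for genuinely long documents you would also want to carry Transformer XL's `mems` between consecutive segments, which this sketch omits.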