Questions when doing Transformer-XL Finetune with Trainer

Hi everyone,

Nice to see you here. :blush:

I’m new to the Transformer-XL model. :pleading_face: I’m following the “Fine-tuning with custom datasets” tutorial to fine-tune Transformer-XL with Trainer on a sequence classification task.

First, I used exactly the same way as the instruction above except for:

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLForSequenceClassification.from_pretrained("transfo-xl-wt103")

By doing this, I got `RuntimeError: stack expects each tensor to be equal size, but got [25] at entry 0 and [24] at entry 1.` I think the reason for the error is that I should pad the sequences in the same batch to the same length. Let me know if I’m wrong. Probably I need a data_collator to solve this problem. Is there a built-in data_collator in huggingface for this? If not, is there an example of how to override the data_collator?
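The padding I have in mind would look something like this (just a sketch in plain Python; `pad_id=0` is an assumption, real code would use `tokenizer.pad_token_id` and return torch tensors instead of lists):

```python
def pad_collate(batch, pad_id=0):
    """Pad every example in a batch to the length of the longest one.

    batch: list of dicts with "input_ids" (list of token ids) and "labels" (int).
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)          # right-pad with pad_id
        attention_mask.append([1] * len(ids) + [0] * pad)  # mask out the padding
        labels.append(ex["labels"])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```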

Second, I changed the code to:

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLForSequenceClassification.from_pretrained("transfo-xl-wt103")

train_texts = [train_text[:120] for train_text in train_texts]
val_texts = [val_text[:120] for val_text in val_texts]
test_texts = [test_text[:120] for test_text in test_texts]

tokenizer.pad_token = tokenizer.eos_token

train_encodings = tokenizer(train_texts, padding=True, truncation=True, max_length=120)
val_encodings = tokenizer(val_texts, padding=True, truncation=True, max_length=120)
test_encodings = tokenizer(test_texts, padding=True, truncation=True, max_length=120)
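To check my assumption that the sequences now all have the same size, a quick sanity check like this could be run on each set of encodings (a sketch; it only assumes the encodings expose an `"input_ids"` list of token-id lists):

```python
def all_same_length(encodings):
    """Return True if every encoded sequence has the same padded length."""
    lengths = {len(ids) for ids in encodings["input_ids"]}
    return len(lengths) <= 1
```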

multilabel_trainer = Trainer(
    model=model,                  # the instantiated :hugs: Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset,     # evaluation dataset
)
By doing this, I think I made the sequences in the same batch the same size. However, I got the error `AssertionError: Cannot handle batch sizes > 1 if no padding token is defined.` I checked my tokenizer:

tokenizer.pad_token returns '' and tokenizer.pad_token_id returns 0.
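One thing I plan to try (an assumption based on similar decoder-only classification heads, not verified for Transformer-XL specifically): that assertion may be raised from the *model config* missing a pad token id, even when the tokenizer has one, so the id would need to be copied over. A sketch with hypothetical stand-in objects in place of the real tokenizer and model:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the real tokenizer and model (not loaded here).
tokenizer = SimpleNamespace(pad_token="<eos>", pad_token_id=0)
model = SimpleNamespace(config=SimpleNamespace(pad_token_id=None))

# If the assertion checks model.config rather than the tokenizer,
# propagating the pad token id to the config should satisfy it.
model.config.pad_token_id = tokenizer.pad_token_id
```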

Sometimes it also gives me a CUDA out-of-memory error, even though I restarted the GPU and checked the GPU memory with nvidia-smi before running the code.

Last, I changed the batch size to 1; it trained for 11 steps and then ran out of CUDA memory. My GPU is a P100 with 16 GB of memory, so I don’t think it should fill up that quickly. (I used the same GPU to fine-tune BERT successfully.)

I have no idea where I went wrong. :sneezing_face: Any suggestions or help would be appreciated. :pleading_face:

For your convenience, I uploaded the notebook here.


Note that Transformer-XL is the only model in the library that does not work with Trainer, as the loss it returns is not reduced (it’s an array, not a scalar). You might get away with it by implementing your own subclass of Trainer and overriding the compute_loss function to convert that array to a scalar.
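A minimal sketch of that idea, with hypothetical stand-ins for the real Trainer and model (an actual subclass would inherit from `transformers.Trainer` and call `.mean()` on the returned loss tensor):

```python
class DummyTransfoXLModel:
    """Stand-in for the real model: returns an unreduced loss array."""
    def __call__(self, **inputs):
        return {"losses": [0.5, 1.5, 1.0]}  # one loss value per sequence

class ReducingTrainer:
    """Mimics a Trainer subclass whose compute_loss reduces the loss array."""
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        losses = outputs["losses"]
        loss = sum(losses) / len(losses)  # the plain-Python equivalent of losses.mean()
        return (loss, outputs) if return_outputs else loss
```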

Thanks for letting me know! That’s really helpful. Otherwise I would have kept trying to figure out why Trainer wasn’t working with Transformer-XL. :sweat_smile:
I will try to rewrite the compute_loss function for it.