Note that TransformerXL is the only model of the library that does not work with Trainer as the loss it returns is not reduced (it’s an array and not a scalar). You might get away with it by implementing your own subclass of Trainer and override the compute_loss function to convert that array to a scalar.