Fine-tuning XLNet for permutation language modeling: what is the required format of the train data?


I want to fine-tune the XLNet model with the permutation LM objective (i.e., essentially continue pretraining) on my own dataset. I'm using this code for that.

I didn't manage to find what format my own data needs to be in. One sentence per line doesn't quite make sense, since XLNet was pretrained on a two-segment input format. I'm wondering whether the tokenizer takes care of this.

In my current setup, I use one document per line (so multiple sentences per line, and the documents are quite lengthy). The text is not tokenized and doesn't contain special tokens. A CUDA OOM error was raised, and I suspect it's because the documents are too long.
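As a workaround I'm considering splitting each document into fixed-length blocks of token ids before feeding them to the model, along these lines (the function name and the `max_seq_length` value here are my own illustration, not from the fine-tuning script, which may pack sequences differently, e.g. across document boundaries):

```python
def chunk_token_ids(token_ids, max_seq_length=512):
    """Split one document's token ids into fixed-size blocks.

    A generic sketch only: the actual data collator for permutation LM
    may handle segmenting and special tokens itself.
    """
    return [token_ids[i:i + max_seq_length]
            for i in range(0, len(token_ids), max_seq_length)]

# Example: a "document" of 1200 token ids becomes three blocks.
doc = list(range(1200))
blocks = chunk_token_ids(doc, max_seq_length=512)
print([len(b) for b in blocks])  # → [512, 512, 176]
```

But I'm not sure whether this kind of manual chunking is needed at all, or whether the tokenizer/collator already truncates or groups the text.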