Fine-tuning XLNet for permutation language modeling: what is the required format of the train data?

izaskr · July 21, 2021, 10:30am

Hi,

I want to fine-tune the XLNet model with the permutation LM objective (so essentially continue pretraining) on my own dataset. I’m using this code for that.

I couldn’t find what format my own data needs to be in. One sentence per line doesn’t seem right, since XLNet follows the two-segment data format. I’m wondering whether the tokenizer takes care of this.

In my current setup, the data is in one-document-per-line format (so multiple sentences per line, and the lines are quite long). The text is not tokenized and contains no special tokens. Training raised a CUDA OOM error, which I suspect is due to the documents being too long.
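If overly long documents are indeed the cause of the OOM, one common workaround is to tokenize every document, concatenate the token ids, and cut them into fixed-size blocks before batching. The snippet below is only a minimal sketch of that idea, not the behavior of the script mentioned above: `group_into_blocks` and `block_size` are hypothetical names, and the input is assumed to be lists of token ids that a tokenizer has already produced. The block size is kept even because, as far as I know, the Transformers permutation LM data collator expects even sequence lengths.

```python
from itertools import chain
from typing import Dict, List


def group_into_blocks(
    tokenized_docs: List[List[int]], block_size: int = 512
) -> Dict[str, List[List[int]]]:
    """Concatenate token ids from all documents and split them into
    fixed-size blocks. `group_into_blocks` is a hypothetical helper."""
    # Assumption: an even length is required by the permutation LM collator.
    if block_size <= 0 or block_size % 2 != 0:
        raise ValueError("block_size must be a positive even number")

    # Flatten all documents into one long stream of token ids.
    all_ids = list(chain.from_iterable(tokenized_docs))

    # Drop the trailing remainder so every block has exactly block_size ids.
    usable = (len(all_ids) // block_size) * block_size
    blocks = [all_ids[i : i + block_size] for i in range(0, usable, block_size)]
    return {"input_ids": blocks}


if __name__ == "__main__":
    # Toy stand-in for tokenizer output: two "documents" of 10 and 7 ids.
    docs = [list(range(10)), list(range(10, 17))]
    out = group_into_blocks(docs, block_size=4)
    print(len(out["input_ids"]), "blocks of length", len(out["input_ids"][0]))
```

With blocks of a bounded length, memory use per batch no longer depends on how long an individual document is. Reducing the batch size or the block size further are the other obvious knobs to try.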
