Customize the FlauBERT tokenizer to split on line breaks


I want to train a FlauBERT model on French music lyrics, and I want to adapt the tokenizer to my use case. For example, I’ve noticed that the tokenizer currently ignores line breaks.

How can I make it tokenize them? I’ve also seen that the FlauBERT tokenizer is a “slow” tokenizer, so it cannot be retrained with the .train_new_from_iterator() method.

Should I preprocess the data myself? Can I train another tokenizer? I’m pretty stuck.
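
To make the preprocessing option concrete, here is a minimal sketch of what I have in mind. It only uses plain Python and replaces each line break with a marker before the text reaches the tokenizer. The marker name `<nl>` and the helper `mark_line_breaks` are my own hypothetical choices, not anything FlauBERT defines.

```python
import re

# Hypothetical marker standing in for a line break; the name is my own choice.
NEWLINE_MARKER = "<nl>"


def mark_line_breaks(lyrics: str, marker: str = NEWLINE_MARKER) -> str:
    """Replace each line break with a space-padded marker."""
    # Normalize Windows and old Mac line endings to "\n" first.
    normalized = lyrics.replace("\r\n", "\n").replace("\r", "\n")
    # Pad the marker with spaces so it stays separate from neighboring words.
    marked = normalized.replace("\n", f" {marker} ")
    # Collapse the runs of spaces or tabs that the padding may have created.
    return re.sub(r"[ \t]+", " ", marked).strip()


print(mark_line_breaks("Au clair de la lune\nMon ami Pierrot"))
# prints: Au clair de la lune <nl> Mon ami Pierrot
```

I assume the marker would then have to be registered with the tokenizer, for instance with `tokenizer.add_tokens(["<nl>"])` followed by `model.resize_token_embeddings(len(tokenizer))`, but I have not verified that this behaves as expected with the FlauBERT tokenizer.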

Thanks!