Hello,
I want to train a FlauBERT model on French music lyrics, and I want to adapt the tokenizer to my use case. For example, I’ve seen that the tokenizer currently ignores line breaks.
How can I make it tokenize them? I’ve also seen that the FlauBERT tokenizer is a “slow” tokenizer, so it cannot be trained with the .train_from_iterator() method.
Should I preprocess the data myself? Can I train another tokenizer? I’m pretty stuck.
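To make the preprocessing option concrete, here is a rough sketch of what I have in mind: replace every line break with an explicit marker before the text reaches the tokenizer. The `<nl>` marker and the `mark_line_breaks` name are my own placeholders, not anything FlauBERT defines, and I assume the marker would still have to be registered with the tokenizer as an extra token for it to survive tokenization.

```python
# Hypothetical marker for a line break; it is NOT part of FlauBERT's vocabulary.
NEWLINE_TOKEN = "<nl>"


def mark_line_breaks(lyrics: str, marker: str = NEWLINE_TOKEN) -> str:
    """Replace each line break with an explicit marker token."""
    # Normalize Windows and old Mac line endings to "\n".
    text = lyrics.replace("\r\n", "\n").replace("\r", "\n")
    # Turn every line break into the marker, padded with spaces.
    text = text.replace("\n", f" {marker} ")
    # Collapse all remaining whitespace runs into single spaces.
    return " ".join(text.split())


sample = "Au clair de la lune\nMon ami Pierrot"
print(mark_line_breaks(sample))
```

Blank lines between verses come out as two consecutive markers, so the verse structure is not lost, but other whitespace inside a line is collapsed.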
Thanks!