Customize FlauBERT tokenizer to split line breaks

rapminerz · March 4, 2023, 10:45am

Hello,

I want to train a FlauBERT model on French music lyrics, and I want to adapt the tokenizer to my use case. For example, I’ve seen that the tokenizer currently ignores line breaks.

How can I make it tokenize them? I’ve also seen that the FlauBERT tokenizer is a “slow” tokenizer, so it cannot be trained with the .train_from_iterator() method.

Should I preprocess the data myself? Can I train another tokenizer? I’m pretty stuck.

Thanks!
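
For the "preprocess the data myself" option, here is a minimal sketch of one possible approach, not a confirmed FlauBERT recipe: replace every line break with an explicit marker before tokenizing, then register that marker as an added token. The marker string `<nl>`, the helper names `mark_line_breaks` and `load_tokenizer_with_marker`, and the checkpoint name `flaubert/flaubert_base_cased` are assumptions chosen for illustration. `add_tokens` is the standard Transformers method for extending a vocabulary; the second helper also needs the `transformers` package (and its FlauBERT tokenizer dependencies) installed.

```python
import re

# Hypothetical marker for a line break; it is NOT part of FlauBERT's vocabulary.
NEWLINE_TOKEN = "<nl>"


def mark_line_breaks(text: str, marker: str = NEWLINE_TOKEN) -> str:
    """Replace each line break with an explicit marker token."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Swap every line break (plus surrounding spaces/tabs) for the marker.
    marked = re.sub(r"[ \t]*\n[ \t]*", f" {marker} ", text)
    # Collapse the double spaces left behind by consecutive line breaks.
    return re.sub(r" {2,}", " ", marked).strip()


def load_tokenizer_with_marker(
    model_name: str = "flaubert/flaubert_base_cased",  # assumed checkpoint
    marker: str = NEWLINE_TOKEN,
):
    """Load the FlauBERT tokenizer and register the marker as an added token."""
    # Imported lazily so mark_line_breaks works without transformers installed.
    from transformers import FlaubertTokenizer

    tokenizer = FlaubertTokenizer.from_pretrained(model_name)
    # add_tokens keeps the marker from being split into subword pieces.
    tokenizer.add_tokens([marker])
    return tokenizer
```

Usage would be along the lines of `tokenizer.tokenize(mark_line_breaks(lyrics))`. If the vocabulary is extended this way, the model's embedding matrix also has to grow to match, typically with `model.resize_token_embeddings(len(tokenizer))`, and the new token's embedding starts untrained, so it only becomes meaningful after further training on the lyrics.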
