Which strategy is better for text pre-processing when training a transformer model?

I have a text dataset and I need to train an MLM on it. Which strategy is better for pre-processing the corpus?
1. Concatenate all the texts, tokenize the result, and then chunk it into 512-token blocks to feed the MLM.
2. Extract each sentence from the dataset, tokenize it separately, and then pad or truncate the token vectors to a fixed length.