I am tokenizing the English Wikipedia and BookCorpus datasets, concatenated into a single dataset for training GPT-2. Tokenizing each dataset on its own (i.e., not concatenated) is fast, but after concatenation the tokenization becomes extremely slow toward the end of the process. I am using the fast tokenizer option.