Hello. I have a function

    block_size = 8192

    def tokenize_and_chunk(examples):
        tokenized_inputs = tokenizer(examples['content'], return_tensors='pt', add_special_tokens=False)
        input_ids = tokenized_inputs['input_ids'][0]
        total_length = len(input_ids)
        if total_length >= bloc…

Map with tokenize function stuck in the beginning

John6666 December 27, 2024, 1:59pm 5

As you say, I wouldn't recommend it either. The slowdown and the implementation difficulty are unavoidable overheads, so this is a last resort. However, your dataset is probably too large even for an environment with a lot of RAM…
It may be possible to train on it without streaming by simply slicing the dataset in advance and feeding it in a small piece at a time.
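
To make the slicing idea concrete, here is a minimal sketch of one way to cut a dataset into contiguous pieces ahead of time. The helper names `shard_ranges` and `iter_slices` are hypothetical and not part of any library, and the number of slices is something you would have to tune to your RAM. It uses only the standard library, so it works on any sliceable sequence.

```python
from typing import Iterator, List, Sequence, Tuple


def shard_ranges(num_rows: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split row indices [0, num_rows) into contiguous, near-equal (start, end) ranges.

    Hypothetical helper for illustration; `end` is exclusive.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(num_rows, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def iter_slices(rows: Sequence, num_shards: int) -> Iterator[Sequence]:
    """Yield one contiguous slice of `rows` at a time (hypothetical helper)."""
    for start, end in shard_ranges(len(rows), num_shards):
        yield rows[start:end]


if __name__ == "__main__":
    # Toy stand-in for a dataset: ten rows cut into three slices.
    for piece in iter_slices(list(range(10)), 3):
        print(piece)
```

With a Hugging Face `Dataset`, each `(start, end)` pair could presumably be passed to `dataset.select(range(start, end))`, so that tokenization and training run on one slice at a time. Whether that avoids the stall in your case is something you would need to test.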
