and a dataset with 25000 samples, where every sample is a string of ~50000 characters. When I try to tokenize this dataset using tokenized_dataset = concatenated_data.map(tokenize_and_chunk, batched=True, num_proc=200, remove_columns=concatenated_data.column_names), it gets stuck right at the beginning. I tried changing num_proc to 1 so the tokenizer's own parallelism would be used instead, but it didn't help. When I tokenize just one sample with this function, like tokenize_and_chunk(concatenated_data[0]), it finishes successfully in a moment. How can I fix this problem with dataset.map for tokenization?
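For reference, here is a minimal sketch of the setup described above. tokenize_and_chunk and concatenated_data are the names from the question; the function body, the checkpoint, and the "text" column name are assumptions, since the real function isn't shown.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed checkpoint; any fast tokenizer

def tokenize_and_chunk(examples):
    # With batched=True, examples["text"] is a list of strings (column name assumed);
    # the real function presumably also splits long samples into fixed-size chunks.
    return tokenizer(examples["text"], truncation=True, max_length=1024)

tokenized_dataset = concatenated_data.map(
    tokenize_and_chunk,
    batched=True,
    num_proc=200,  # the value from the question; spawning this many workers is itself slow
    remove_columns=concatenated_data.column_names,
)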
With an IterableDataset like ConstantLengthDataset, the model will take longer to train than with a pre-tokenized dataset. Moreover, multi-GPU training is more complex with an IterableDataset: it needs something like a DistributedSampler, a distributed-aware DataLoader, and so on.
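As a rough sketch of that extra plumbing, one way to shard a streaming dataset across GPUs is datasets' split_dataset_by_node; this assumes torch.distributed is already initialized and that iterable_dataset is your streaming dataset (a placeholder name here).

import torch.distributed as dist
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

rank = dist.get_rank()
world_size = dist.get_world_size()

# Give each rank a disjoint slice of the stream so GPUs don't see duplicate data.
shard = split_dataset_by_node(iterable_dataset, rank=rank, world_size=world_size)

loader = DataLoader(shard, batch_size=8)  # batch_size is an arbitrary example value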
As you say, I don’t want to recommend it either. The slowdown and difficulty of implementation are unavoidable overheads, and this is a last resort. However, your dataset is probably too large even in an environment with a lot of RAM…
There may be a simpler way to train it: slice the dataset itself into pieces in advance and feed it in small amounts, without streaming.
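One possible shape of that idea is to process the dataset shard by shard, so only a small piece is being tokenized at a time. The names tokenize_and_chunk and concatenated_data come from the thread; num_shards and num_proc here are arbitrary choices, not recommendations.

from datasets import concatenate_datasets

num_shards = 50
tokenized_shards = []
for i in range(num_shards):
    shard = concatenated_data.shard(num_shards=num_shards, index=i)
    tokenized_shards.append(
        shard.map(
            tokenize_and_chunk,
            batched=True,
            num_proc=4,  # a modest worker count per shard
            remove_columns=shard.column_names,
        )
    )

tokenized_dataset = concatenate_datasets(tokenized_shards)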
What I mean is: why does the tokenization function take so much longer to run than the other map functions? I understand that tokenization is not a trivial transformation, but this tokenizer is fast, and as far as I know it does multiprocessing by itself, so I can map with num_proc=1 and it should still parallelize the tokenization internally. What I can't understand is why mapping with batch_size=100, for example, just gets stuck at the beginning and never makes progress. Because of that behaviour, the other mappings take about 1-2 minutes to run, while tokenization can take 15-20 minutes with batched=False just to avoid getting stuck at the beginning.
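For concreteness, here is a sketch of the two configurations being compared, plus the TOKENIZERS_PARALLELISM switch for the fast tokenizer's internal threads; whether disabling it changes anything in this case is an assumption, not a confirmed fix.

import os

# Fast (Rust) tokenizers have their own thread pool; in multiprocess map() setups
# it is often disabled explicitly to avoid clashes with forked workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Variant 1: the batched call that appears to hang at the start.
tokenized_batched = concatenated_data.map(
    tokenize_and_chunk,
    batched=True,
    batch_size=100,
    remove_columns=concatenated_data.column_names,
)

# Variant 2: the sample-by-sample call that runs but takes 15-20 minutes.
tokenized_single = concatenated_data.map(
    tokenize_and_chunk,
    batched=False,
    remove_columns=concatenated_data.column_names,
)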