Hello. I have this function:
block_size = 8192

def tokenize_and_chunk(examples):
    # Tokenize the raw text without adding special tokens
    tokenized_inputs = tokenizer(examples['content'], return_tensors='pt', add_special_tokens=False)
    input_ids = tokenized_inputs['input_ids'][0]
    total_length = len(input_ids)
    # Drop the tail so the length is an exact multiple of block_size
    if total_length >= block_size:
        input_ids = input_ids[:(total_length // block_size) * block_size]
    # Split into fixed-size blocks of block_size token ids
    input_chunks = [input_ids[i:i + block_size] for i in range(0, len(input_ids), block_size)]
    return {'input_ids': input_chunks}
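For context, this is roughly how I expect the function to behave on a single sample; the gpt2 tokenizer and the synthetic text below are only placeholders for my real setup:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder for my real tokenizer

    sample = {'content': 'some long text ' * 4000}  # ~60,000 characters, in the ballpark of my real samples
    out = tokenize_and_chunk(sample)
    print(len(out['input_ids']))      # number of full block_size chunks
    print(len(out['input_ids'][0]))   # each chunk has exactly block_size (8192) token ids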
I also have a dataset with 25,000 samples, where each sample is a string of roughly 50,000 characters. When I try to tokenize the whole dataset with

    tokenized_dataset = concatenated_data.map(
        tokenize_and_chunk,
        batched=True,
        num_proc=200,
        remove_columns=concatenated_data.column_names,
    )

it gets stuck right at the beginning. I also tried num_proc=1 so that the tokenizer's own parallelism would be used instead, but that didn't help. Yet when I tokenize a single sample directly, e.g. tokenize_and_chunk(concatenated_data[0]), it finishes in a moment. How can I fix this problem with dataset.map for tokenization?
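If anyone wants to reproduce this without my data, something along these lines should mimic the setup (the synthetic strings are only a stand-in for my real samples, and building the dataset materializes over a gigabyte of text):

    from datasets import Dataset

    # Stand-in for my real data: 25,000 strings of ~50,000 characters each
    concatenated_data = Dataset.from_dict(
        {'content': ['lorem ipsum dolor sit amet ' * 1850] * 25000}
    )

and then the same map call as above on this dataset.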