Hello. I have this function:
block_size = 8192

def tokenize_and_chunk(examples):
    # Tokenize the raw text without adding special tokens
    tokenized_inputs = tokenizer(examples['content'], return_tensors='pt', add_special_tokens=False)
    input_ids = tokenized_inputs['input_ids'][0]
    total_length = len(input_ids)
    # Drop the tail so the length is an exact multiple of block_size
    if total_length >= block_size:
        input_ids = input_ids[:(total_length // block_size) * block_size]
    # Split into fixed-size blocks of block_size token ids
    input_chunks = [input_ids[i:i + block_size] for i in range(0, len(input_ids), block_size)]
    return {'input_ids': input_chunks}
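For context, this is roughly how I expect the function to behave on a single sample; the gpt2 tokenizer and the synthetic text below are only placeholders for my real setup:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder for my real tokenizer

    sample = {'content': 'some long text ' * 4000}  # ~60,000 characters, in the ballpark of my real samples
    out = tokenize_and_chunk(sample)
    print(len(out['input_ids']))      # number of full block_size chunks
    print(len(out['input_ids'][0]))   # each chunk has exactly block_size (8192) token ids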
I also have a dataset with 25,000 samples, where each sample is a string of roughly 50,000 characters. When I try to tokenize the whole dataset with

    tokenized_dataset = concatenated_data.map(
        tokenize_and_chunk,
        batched=True,
        num_proc=200,
        remove_columns=concatenated_data.column_names,
    )

it gets stuck right at the beginning. I also tried num_proc=1 so that the tokenizer's own parallelism would be used instead, but that didn't help. Yet when I tokenize a single sample directly, e.g. tokenize_and_chunk(concatenated_data[0]), it finishes in a moment. How can I fix this problem with dataset.map for tokenization?
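If anyone wants to reproduce this without my data, something along these lines should mimic the setup (the synthetic strings are only a stand-in for my real samples, and building the dataset materializes over a gigabyte of text):

    from datasets import Dataset

    # Stand-in for my real data: 25,000 strings of ~50,000 characters each
    concatenated_data = Dataset.from_dict(
        {'content': ['lorem ipsum dolor sit amet ' * 1850] * 25000}
    )

and then the same map call as above on this dataset.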