and a dataset with 25000 samples, where every sample is a string of ~50000 characters. When I try to tokenize this dataset using tokenized_dataset = concatenated_data.map(tokenize_and_chunk, batched=True, num_proc=200, remove_columns=concatenated_data.column_names), it gets stuck right at the beginning. I tried changing num_proc to 1 so the tokenizer's own parallelism would be used instead, but it didn't help. When I tokenize just one sample with this function, like tokenize_and_chunk(concatenated_data[0]), it finishes successfully in a moment. How can I fix this problem with dataset.map for tokenization?
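For reference, here is a minimal sketch of the setup described above. tokenize_and_chunk and concatenated_data are the names from the question; the function body, the checkpoint, and the "text" column name are assumptions, since the real function isn't shown.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed checkpoint; any fast tokenizer

def tokenize_and_chunk(examples):
    # With batched=True, examples["text"] is a list of strings (column name assumed);
    # the real function presumably also splits long samples into fixed-size chunks.
    return tokenizer(examples["text"], truncation=True, max_length=1024)

tokenized_dataset = concatenated_data.map(
    tokenize_and_chunk,
    batched=True,
    num_proc=200,  # the value from the question; spawning this many workers is itself slow
    remove_columns=concatenated_data.column_names,
)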
With an IterableDataset like ConstantLengthDataset, the model will take longer to train than with a pre-tokenized dataset. Moreover, multi-GPU training is more complex with an IterableDataset: it needs something like a DistributedSampler, a distributed-aware DataLoader, and so on.
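As a rough sketch of that extra plumbing, one way to shard a streaming dataset across GPUs is datasets' split_dataset_by_node; this assumes torch.distributed is already initialized and that iterable_dataset is your streaming dataset (a placeholder name here).

import torch.distributed as dist
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

rank = dist.get_rank()
world_size = dist.get_world_size()

# Give each rank a disjoint slice of the stream so GPUs don't see duplicate data.
shard = split_dataset_by_node(iterable_dataset, rank=rank, world_size=world_size)

loader = DataLoader(shard, batch_size=8)  # batch_size is an arbitrary example value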
As you say, I don’t want to recommend it either. The slowdown and difficulty of implementation are unavoidable overheads, and this is a last resort. However, your dataset is probably too large even in an environment with a lot of RAM…
There may be a simpler way to train it: slice the dataset itself into pieces in advance and feed it in small amounts, without streaming.
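One possible shape of that idea is to process the dataset shard by shard, so only a small piece is being tokenized at a time. The names tokenize_and_chunk and concatenated_data come from the thread; num_shards and num_proc here are arbitrary choices, not recommendations.

from datasets import concatenate_datasets

num_shards = 50
tokenized_shards = []
for i in range(num_shards):
    shard = concatenated_data.shard(num_shards=num_shards, index=i)
    tokenized_shards.append(
        shard.map(
            tokenize_and_chunk,
            batched=True,
            num_proc=4,  # a modest worker count per shard
            remove_columns=shard.column_names,
        )
    )

tokenized_dataset = concatenate_datasets(tokenized_shards)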
What I mean is: why does the tokenization function take so much longer to run than the other map functions? I understand that tokenization is not a trivial transformation, but this tokenizer is fast, and as far as I know it does multiprocessing by itself, so I can map with num_proc=1 and it should still parallelize the tokenization internally. What I can't understand is why mapping with batch_size=100, for example, just gets stuck at the beginning and never makes progress. Because of that behaviour, the other mappings take about 1-2 minutes to run, while tokenization can take 15-20 minutes with batched=False just to avoid getting stuck at the beginning.
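For concreteness, here is a sketch of the two configurations being compared, plus the TOKENIZERS_PARALLELISM switch for the fast tokenizer's internal threads; whether disabling it changes anything in this case is an assumption, not a confirmed fix.

import os

# Fast (Rust) tokenizers have their own thread pool; in multiprocess map() setups
# it is often disabled explicitly to avoid clashes with forked workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Variant 1: the batched call that appears to hang at the start.
tokenized_batched = concatenated_data.map(
    tokenize_and_chunk,
    batched=True,
    batch_size=100,
    remove_columns=concatenated_data.column_names,
)

# Variant 2: the sample-by-sample call that runs but takes 15-20 minutes.
tokenized_single = concatenated_data.map(
    tokenize_and_chunk,
    batched=False,
    remove_columns=concatenated_data.column_names,
)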