An optimal way to partition the dataset

I want to use the RedPajama-v2 dataset for LLM training. The English part of the dataset comprises 30.7T tokens, but I need only 200B tokens. I want to train 4 expert models asynchronously, and I have 4 pretrained routers that assign a sequence to an expert based on the lowest loss on that sequence. My question is: how should I organise the assignment process, and how should I store the shards of the individual datasets (each chunk of the dataset could be unevenly distributed between experts) so that it is convenient for training on GPUs and TPUs? The context length for the experts is 1024 tokens.


If using the shuffle function in the datasets library is acceptable, I think that would be the simplest method, and it seems it would also let you recreate a subsample of that particular dataset…
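A minimal sketch of that idea with streaming: the loader arguments (name/partition/snapshots/languages) and the `raw_content` field name are recalled from the dataset card and should be verified there; the seed, buffer size, and snapshot identifier are placeholders.

```python
from datasets import load_dataset

# Stream the English subset instead of downloading all 30.7T tokens.
# Argument names/values below are assumptions -- check the RedPajama-v2 card.
stream = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="default",
    partition="head_middle",
    snapshots=["2023-14"],
    languages=["en"],
    split="train",
    streaming=True,
)

# Buffered shuffle over the stream; reuse the same seed to recreate the subsample.
subsample = stream.shuffle(seed=42, buffer_size=10_000)

for doc in subsample:
    # Tokenize doc["raw_content"], pack into 1024-token sequences,
    # and stop once ~200B tokens have been collected.
    ...
```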

I would compute the loss per expert and assign each sequence to the lowest-loss expert. I would store every shard at a fixed size, e.g. 100k x 1024 tokens, so the shards stay consistent. Use Parquet so you get the most out of the datasets/transformers libraries. You can also keep some JSON stats per expert so you can gauge where you stand data-wise.
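One way this could look, assuming each of the 4 routers is a causal LM exposing a standard `.loss`; the 100k x 1024 shard size and the `shards/` output directory are illustrative, not prescribed.

```python
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import torch

SEQ_LEN = 1024
SHARD_ROWS = 100_000  # 100k sequences of 1024 tokens per shard


@torch.no_grad()
def sequence_loss(model, input_ids):
    # Mean next-token loss of one 1024-token sequence under one router.
    out = model(input_ids=input_ids.unsqueeze(0), labels=input_ids.unsqueeze(0))
    return out.loss.item()


def route(routers, input_ids):
    # Index of the router (expert) with the lowest loss on the sequence.
    return int(np.argmin([sequence_loss(m, input_ids) for m in routers]))


class ShardWriter:
    """Buffers sequences for one expert and flushes fixed-size Parquet shards."""

    def __init__(self, expert_id, out_dir="shards"):
        os.makedirs(out_dir, exist_ok=True)
        self.expert_id, self.out_dir = expert_id, out_dir
        self.buffer, self.shard_idx = [], 0

    def add(self, input_ids):
        self.buffer.append(input_ids.tolist())
        if len(self.buffer) >= SHARD_ROWS:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        table = pa.table({"input_ids": self.buffer})
        path = f"{self.out_dir}/expert{self.expert_id}-{self.shard_idx:05d}.parquet"
        pq.write_table(table, path)
        self.buffer, self.shard_idx = [], self.shard_idx + 1


# Usage sketch (routers and the packed-sequence iterator are assumed to exist):
# writers = [ShardWriter(i) for i in range(4)]
# for seq in packed_1024_token_sequences:   # torch.LongTensor of shape (1024,)
#     writers[route(routers, seq)].add(seq)
# for w in writers:
#     w.flush()
```

Fixed-size Parquet shards per expert keep the files uniform, so each expert's training job can just glob its own `expert{i}-*.parquet` files regardless of how unevenly the router split the data.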

If there is an imbalance you can do several things. You can truncate each expert's dataset to a maximum token limit for consistency. Alternatively, you can sample from each expert's set and dynamically set the number of epochs each expert trains on: if there is a heavy imbalance, train the smaller set for more epochs so the gradient updates roughly equalize (see the sketch below).
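The epoch arithmetic for that second option, with hypothetical per-expert token counts just to show the calculation:

```python
# Hypothetical per-expert token counts after routing (sequences x 1024 tokens).
tokens_per_expert = {0: 70e9, 1: 55e9, 2: 45e9, 3: 30e9}

# Give each expert enough epochs to see roughly as many tokens as the largest
# set, so the number of gradient updates roughly equalizes across experts.
target = max(tokens_per_expert.values())
epochs = {e: round(target / n, 2) for e, n in tokens_per_expert.items()}
print(epochs)  # {0: 1.0, 1: 1.27, 2: 1.56, 3: 2.33}
```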

Hope this helps 🙂
