Multi-Task dataset with Custom Sampler and Sharding

The current Hugging Face Trainer supports a single train_dataset (torch.utils.data.dataset.Dataset). While this makes sense for most training setups, there are still cases where it is convenient to have a list of train_datasets. The trainer could then select samples from each train_dataset either randomly or by following a specific sampling strategy. Example code for a custom sampling strategy is attached below.

The sampling rate for each train_dataset (torch.utils.data.dataset.Dataset) among the multiple training datasets can be controlled by a penalty variable (\alpha). Sample code for a custom sampling strategy based on a multinomial distribution (as described in the XLM paper) is below:

import logging
import math
from collections import OrderedDict

def multinomial_prob(dataset_len, alpha=.5):
    # Empirical share of each dataset: p_k = n_k / sum_j n_j.
    tot_number_of_sent_in_all_lang = sum(dataset_len.values())
    prob = OrderedDict()
    for k, v in dataset_len.items():
        prob[k] = v / tot_number_of_sent_in_all_lang

    # Smoothed share: q_k = p_k**alpha / sum_j p_j**alpha.
    q_den = sum(v ** alpha for v in prob.values())
    q = OrderedDict()
    for k, v in prob.items():
        q[k] = (v ** alpha) / q_den
    # Sanity check: the smoothed probabilities must sum to 1.
    assert math.fabs(1 - sum(q.values())) < 1e-5
    return q

def iterator_selection_prob(alpha, train_datasets, logger=None):
    # Fall back to a module-level logger so that logger=None does not crash.
    if logger is None:
        logger = logging.getLogger(__name__)
    dataset_len = OrderedDict()
    for k, v in train_datasets.items():
        dataset_len[k] = len(v)
    for k, v in dataset_len.items():
        logger.info("Total Number of samples in {} : {}".format(k, v))
    prob = multinomial_prob(dataset_len, alpha=alpha)
    logger.info("Language iterator selection probability.")
    # Parallel lists of dataset keys and their selection probabilities.
    ret_prob_index, ret_prob_list = list(prob.keys()), list(prob.values())
    for k, v in zip(ret_prob_index, ret_prob_list):
        logger.info("{} : {}".format(k, v))
    return dataset_len, ret_prob_index, ret_prob_list
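
To make the effect of alpha concrete, here is a small self-contained sketch using only the standard library. The dataset names and sizes are made up for illustration, and `smoothed_probs` is a compact, hypothetical restatement of `multinomial_prob` above, not a library function.

```python
import random
from collections import Counter

def smoothed_probs(dataset_len, alpha=0.5):
    """q_k = p_k**alpha / sum_j p_j**alpha, with p_k = n_k / sum_j n_j."""
    total = sum(dataset_len.values())
    powered = {k: (n / total) ** alpha for k, n in dataset_len.items()}
    norm = sum(powered.values())
    return {k: v / norm for k, v in powered.items()}

# Made-up sizes: one high-resource and one low-resource dataset.
sizes = {"en": 1_000_000, "ur": 10_000}

q_raw = smoothed_probs(sizes, alpha=1.0)   # alpha = 1 keeps the raw proportions
q_half = smoothed_probs(sizes, alpha=0.5)  # alpha < 1 up-weights the small dataset

# Decide which dataset each of 10,000 training steps draws its batch from.
rng = random.Random(0)
keys, weights = zip(*q_half.items())
picks = Counter(rng.choices(keys, weights=weights, k=10_000))
```

With these sizes, alpha = 1 leaves the small dataset at about 1% of the draws, while alpha = 0.5 raises its share to 1/11 (about 9%), because the square roots of the two sizes are in a 10:1 ratio.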

So I have three general questions:

  1. How can multiple datasets (or sub-datasets) be integrated into the same dataset class?
  2. How can I apply custom control over the sampling strategy across the different sub-datasets (let's say I want to inject the sampling strategy above)?
  3. For a large tokenized dataset that cannot fit into memory, how should sharding be handled with the Hugging Face Trainer?
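
For questions 1 and 2, one possible direction (a minimal sketch, not something the Trainer does today) is to put the sub-datasets behind a single map-style dataset and give every sample of sub-dataset k the weight q_k / n_k, so that drawing indices with replacement reproduces the per-dataset probabilities q_k. `CombinedDataset` and `per_sample_weights` are hypothetical names, the toy lists stand in for real, non-empty datasets, and only the standard library is used. In PyTorch, the same weight list could presumably be passed to `torch.utils.data.WeightedRandomSampler`, and a Trainer subclass could override `get_train_dataloader` to use it, but I have not verified that end to end.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

class CombinedDataset:
    """Map-style view over several sub-datasets (hypothetical helper)."""

    def __init__(self, datasets):
        self.keys = list(datasets)
        self.datasets = [datasets[k] for k in self.keys]
        # Cumulative sizes, e.g. [900, 1000] for sub-datasets of size 900 and 100.
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        # Locate the sub-dataset that owns this (non-negative) global index.
        d = bisect.bisect_right(self.offsets, idx)
        start = self.offsets[d - 1] if d > 0 else 0
        return self.keys[d], self.datasets[d][idx - start]

def per_sample_weights(datasets, alpha=0.5):
    """Weight q_k / n_k for each sample of sub-dataset k, in concatenation order."""
    sizes = {k: len(d) for k, d in datasets.items()}
    total = sum(sizes.values())
    powered = {k: (n / total) ** alpha for k, n in sizes.items()}
    norm = sum(powered.values())
    weights = []
    for k, n in sizes.items():
        weights.extend([powered[k] / norm / n] * n)
    return weights

# Toy stand-ins for real datasets.
datasets = {
    "big": [f"big-{i}" for i in range(900)],
    "small": [f"small-{i}" for i in range(100)],
}
combined = CombinedDataset(datasets)
weights = per_sample_weights(datasets, alpha=0.5)

# Draw global indices with replacement according to the per-sample weights.
rng = random.Random(0)
indices = rng.choices(range(len(combined)), weights=weights, k=20_000)
counts = Counter(combined[i][0] for i in indices)
```

With sizes 900 and 100 and alpha = 0.5, the target probabilities are 0.75 and 0.25, so the small dataset should supply roughly a quarter of the draws instead of a tenth.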

Note:

  1. I am not necessarily looking for sample code. A discussion or a pointer to the relevant part of the HF source is also highly appreciated. That said, sample code is always best.
  2. I would also like to know whether you have seen other repositories that implement these features, with or without the HF library.
  3. Discussion on any related topic is highly appreciated.

@sgugger Do you have any idea on these topics?