Multi-Task dataset with Custom Sampler and Sharding

sbmaruf · April 18, 2021, 10:42am

The current Hugging Face Trainer supports a single train_dataset (torch.utils.data.dataset.Dataset). While that makes sense for most training setups, there are still cases where it is convenient to have a list of train_datasets. The trainer could then pick samples from each train_dataset either at random or according to a specific sampling strategy. Example code for a custom sampling strategy is attached below.

When there are multiple train datasets (each a torch.utils.data.dataset.Dataset), the sampling strategy across them can be varied with a penalty variable (\alpha). Sample code for a custom sampling strategy based on a multinomial distribution (as described in the XLM paper) is below:

import logging
import math
from collections import OrderedDict


def multinomial_prob(dataset_len, alpha=.5):
    # p_k: fraction of all samples that belong to dataset k.
    tot_number_of_sent_in_all_lang = sum(dataset_len.values())
    prob = OrderedDict()
    for k, v in dataset_len.items():
        prob[k] = v / tot_number_of_sent_in_all_lang

    # q_k = p_k**alpha / sum_j p_j**alpha; alpha < 1 up-weights small datasets.
    q_den = sum(v ** alpha for v in prob.values())
    q = OrderedDict()
    for k, v in prob.items():
        q[k] = (v ** alpha) / q_den
    # Sanity check: q must sum to 1.
    assert math.fabs(1 - sum(q.values())) < 1e-5
    return q


def iterator_selection_prob(alpha, train_datasets, logger=None):
    # Fall back to a module-level logger when none is passed.
    if logger is None:
        logger = logging.getLogger(__name__)
    dataset_len = OrderedDict()
    for k, v in train_datasets.items():
        dataset_len[k] = len(v)
    for k, v in dataset_len.items():
        logger.info("Total Number of samples in {} : {}".format(k, v))
    prob = multinomial_prob(dataset_len, alpha=alpha)
    logger.info("Language iterator selection probability.")
    # Split the mapping into parallel lists of dataset keys and probabilities.
    ret_prob_index, ret_prob_list = list(prob.keys()), list(prob.values())
    for k, v in zip(ret_prob_index, ret_prob_list):
        logger.info("{} : {}".format(k, v))
    return dataset_len, ret_prob_index, ret_prob_list
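
To make the effect of alpha concrete, here is a small self-contained sketch of the same re-weighting, q_k = p_k^alpha / sum_j p_j^alpha, applied to two dataset sizes. The helper name `smoothed_probs`, the keys `en` and `sw`, and the counts are made up for illustration only.

```python
from collections import OrderedDict


def smoothed_probs(dataset_len, alpha=0.5):
    """Return q_k = p_k**alpha / sum_j p_j**alpha, where p_k = n_k / sum_j n_j."""
    total = sum(dataset_len.values())
    # Unnormalized weights p_k**alpha for every dataset.
    weights = OrderedDict((k, (n / total) ** alpha) for k, n in dataset_len.items())
    norm = sum(weights.values())
    return OrderedDict((k, w / norm) for k, w in weights.items())


# Hypothetical sizes: one large and one small dataset.
sizes = OrderedDict([("en", 900), ("sw", 100)])

for alpha in (1.0, 0.5, 0.0):
    q = smoothed_probs(sizes, alpha=alpha)
    print(alpha, {k: round(v, 3) for k, v in q.items()})
```

With alpha = 1 the selection probabilities follow the raw proportions (0.9 / 0.1), alpha = 0.5 gives 0.75 / 0.25, and alpha = 0 is uniform (0.5 / 0.5), so smaller alpha shifts probability mass toward the smaller dataset.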

So I have three general questions:

How can multiple datasets (or sub-datasets) be integrated into the same dataset class?
How can custom control over the sampling strategy be applied across the different sub-datasets (say I want to inject the sampling strategy above)?
For a large tokenized dataset that cannot fit into memory, how should sharding be handled with the Hugging Face Trainer?

Note:

I am not necessarily looking for sample code. A discussion or a pointer to the HF source library is also highly appreciated, though sample code is always best.
I would also like to know whether you have seen any other repository that implements these features, with or without the HF library.
Discussion on any related topic is highly appreciated.

sbmaruf · April 20, 2021, 4:23pm

@sgugger Do you have any idea on these topics?

amitness · July 29, 2023, 3:15pm

@sbmaruf Were you able to find a solution to this?

sbmaruf · July 30, 2023, 12:00am

I wrote my own torch Iterable dataset.
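
The poster's actual code is not shown in this thread. As a rough idea of what such an iterable could look like, here is a minimal sketch under stated assumptions: the class name `MultiDatasetSampler` and all of its parameters are hypothetical, and the class is written in plain Python to stay dependency-free. To use it with a PyTorch DataLoader it would additionally subclass `torch.utils.data.IterableDataset`, which only requires `__iter__`.

```python
import random


class MultiDatasetSampler:
    """Hypothetical iterable that mixes several indexable datasets.

    Each step picks a dataset according to `probs`, then yields the next
    example from it; a dataset is reshuffled once it has been exhausted.
    """

    def __init__(self, datasets, probs, num_samples, seed=0):
        if any(len(d) == 0 for d in datasets.values()):
            raise ValueError("every dataset must contain at least one example")
        self.datasets = datasets          # dict: name -> indexable dataset
        self.names = list(datasets)
        self.weights = [probs[name] for name in self.names]
        self.num_samples = num_samples    # examples yielded per pass
        self.seed = seed

    def _cycle(self, name, rng):
        # Endless stream over one dataset, reshuffled before every full pass.
        indices = list(range(len(self.datasets[name])))
        while True:
            rng.shuffle(indices)
            for i in indices:
                yield self.datasets[name][i]

    def __iter__(self):
        # A fresh RNG per pass makes every pass reproducible for a given seed.
        rng = random.Random(self.seed)
        streams = {name: self._cycle(name, rng) for name in self.names}
        for _ in range(self.num_samples):
            name = rng.choices(self.names, weights=self.weights, k=1)[0]
            yield name, next(streams[name])


# Toy usage with made-up data and selection probabilities.
datasets = {"en": list(range(900)), "sw": list(range(1000, 1100))}
probs = {"en": 0.75, "sw": 0.25}
for name, example in MultiDatasetSampler(datasets, probs, num_samples=8):
    print(name, example)
```

This sketch keeps every dataset indexable in memory, so it does not address sharding. With several DataLoader workers, each worker would also need its own seed or its own shard of the data (for example based on `torch.utils.data.get_worker_info()`), otherwise all workers would yield the same stream.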

amitness · August 1, 2023, 8:56am

Did you still use Trainer, or did you write your own training loop in PyTorch?

Also, is your code publicly available somewhere? I am training a multilingual model and have to integrate the sampling to handle the imbalance between high- and low-resource languages.
