The current Hugging Face `Trainer` supports a single `train_dataset` (`torch.utils.data.dataset.Dataset`). While that covers most training setups, there are still cases where it is convenient to pass a list of `train_dataset`s and have the trainer select samples from each of them, either randomly or following a specific sampling strategy. An example of custom sampling-strategy code is attached below.

The sampling strategy across the multiple `train_dataset`s (`torch.utils.data.dataset.Dataset`) can be varied by a penalty variable (\alpha): a sub-dataset holding a fraction p_i of all samples is drawn with probability q_i = p_i^\alpha / \sum_j p_j^\alpha. Sample code for this custom multinomial-distribution-based sampling strategy (as described in the XLM paper) is below:

```
import math
from collections import OrderedDict

def multinomial_prob(dataset_len, alpha=0.5):
    # dataset_len: mapping of language/name -> number of samples.
    # Returns q_i = p_i^alpha / sum_j p_j^alpha (multinomial sampling
    # distribution from the XLM paper).
    tot_number_of_sent_in_all_lang = 0
    prob = OrderedDict()
    for k, v in dataset_len.items():
        tot_number_of_sent_in_all_lang += v
    for k, v in dataset_len.items():
        prob[k] = v / tot_number_of_sent_in_all_lang
    q_den = 0.0
    for k, v in prob.items():
        q_den += v ** alpha
    q = OrderedDict()
    sum_ = 0.0
    for k, v in prob.items():
        q[k] = (v ** alpha) / q_den
        sum_ += q[k]
    assert math.fabs(1 - sum_) < 1e-5
    return q
```

```
def iterator_selection_prob(alpha, train_datasets, logger=None):
    # train_datasets: mapping of language/name -> torch Dataset.
    dataset_len = OrderedDict()
    for k, v in train_datasets.items():
        dataset_len[k] = len(v)
        if logger is not None:
            logger.info("Total number of samples in {} : {}".format(k, dataset_len[k]))
    prob = multinomial_prob(dataset_len, alpha=alpha)
    if logger is not None:
        logger.info("Language iterator selection probability.")
    ret_prob_index, ret_prob_list = [], []
    for k, v in prob.items():
        ret_prob_index.append(k)
        ret_prob_list.append(v)
        if logger is not None:
            logger.info("{} : {}".format(k, v))
    return dataset_len, ret_prob_index, ret_prob_list
```
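As a quick sanity check (the dataset sizes below are made up for illustration), the same formula can be computed inline to see how \alpha < 1 up-samples a low-resource sub-dataset:

```python
# Hypothetical sizes for a high-resource ("en") and a low-resource ("sw") sub-dataset.
sizes = {"en": 10_000, "sw": 100}
alpha = 0.5

total = sum(sizes.values())
p = {k: v / total for k, v in sizes.items()}         # raw shares p_i
q_den = sum(v ** alpha for v in p.values())
q = {k: (v ** alpha) / q_den for k, v in p.items()}  # penalized shares q_i

# "sw" holds only ~1% of the samples, but its selection probability
# rises to 1/11 (~9%) after the alpha = 0.5 penalty.
print(q)
```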

So I have three questions in general:

- How to integrate multiple datasets (or sub-datasets) in the same dataset class?
- How to apply custom control over the sampling strategy (let's say I want to inject the above sampling strategy) across the different sub-datasets?
- Also, in the case of a large tokenized dataset that cannot fit into memory, how to handle sharding with the Hugging Face Trainer?
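To make the second question concrete, below is a minimal, framework-free sketch of the mixing loop I have in mind (the function and its arguments are hypothetical, not an HF API): each step picks a sub-dataset according to the multinomial probabilities and cycles through that sub-dataset's examples independently. In practice this logic would presumably live inside a `torch.utils.data.IterableDataset`, or in a custom sampler handed to the trainer's dataloader.

```python
import random

def mixed_samples(datasets, probs, num_samples, seed=0):
    # datasets: name -> indexable sequence of examples (a stand-in for a
    # dict of torch Datasets); probs: name -> selection probability.
    rng = random.Random(seed)
    names = list(datasets)
    weights = [probs[n] for n in names]
    cursor = {n: 0 for n in names}
    for _ in range(num_samples):
        name = rng.choices(names, weights=weights, k=1)[0]
        data = datasets[name]
        item = data[cursor[name] % len(data)]  # cycle short sub-datasets
        cursor[name] += 1
        yield name, item

# Draw 1000 samples with a 90/10 split between two toy sub-datasets.
draws = list(mixed_samples({"en": ["e1", "e2"], "sw": ["s1"]},
                           {"en": 0.9, "sw": 0.1}, 1000))
```

A per-step choice like this yields mixed-language batches; choosing once per batch instead would keep each batch monolingual, which some multilingual setups prefer.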

**Note**:

- I am not looking for sample code; a discussion or a pointer into the HF source library is also highly appreciated. However, sample code is always best.
- I would also like to know if you have seen other repositories that implement these features, with or without the HF library.
- Discussion on any of these topics is highly appreciated.