Implementation of Two Distinct Datasets with HuggingFace Trainer Module

Hi, I was wondering if anyone could provide any insight into the most effective/efficient way of implementing finetuning with the HuggingFace Trainer module while incorporating two distinct datasets. Basically, I am looking for a way to have the finetuning process go over two distinct sets of text such that each epoch is guaranteed to traverse the entirety of both. Moreover, the implementation should allow a distinct batch size for each set (e.g. 32 and 128).

If it is any help, the finetuning process I am trying to implement is called “Unsupervised Data Augmentation”, which is described here: [1904.12848] Unsupervised Data Augmentation for Consistency Training.


You could try this I think…

Override the Trainer.get_train_dataloader() method to return a custom iterator that uses two different dataloaders, one for each dataset. Iterate over each set per epoch, and you can handle the batch size difference yourself. This gives you a lot of flexibility while still building on the library code. Then you would just extend the Trainer class with a multi-dataset trainer class.
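
Roughly, that custom iterator could be as small as this (the class name is made up, and this is a plain single-process sketch that skips the accelerator/distributed handling the stock get_train_dataloader() does):

    class TwoLoaderIterable:
        """Yields every batch of loader_a, then every batch of loader_b, once per epoch."""

        def __init__(self, loader_a, loader_b):
            self.loader_a = loader_a
            self.loader_b = loader_b

        def __iter__(self):
            yield from self.loader_a  # all of dataset A first...
            yield from self.loader_b  # ...then all of dataset B

        def __len__(self):
            # Trainer calls len() on the train dataloader to work out how many steps make an epoch
            return len(self.loader_a) + len(self.loader_b)

Your overridden get_train_dataloader() would then build the two DataLoader objects, each with its own batch size, and return TwoLoaderIterable(loader_a, loader_b).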



Would I have to override the _inner_training_loop() method in order to iterate over each set per epoch? Or is there another way to do this?


Only if you want to customize the update per gradient step, use different optimizers, or conditionally skip batches.

You’d do something like this after you implement the extended class

trainer = MultiDatasetTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    dataset_a=dataset1,
    dataset_b=dataset2,
    bs_a=32,
    bs_b=128,
    train_dataset=None,  # not used: the overridden get_train_dataloader() builds its own loaders
)
trainer.train()


@Mdrnfox Hmm, I'm still confused as to how the two dataloaders would be taken into account. For instance, the compute_loss function takes a single inputs parameter. I assume that this must be overridden to take two input parameters?

Moreover, that would mean that the function that calls the compute_loss function (training_step()) would have to take two inputs as well, no?


By design, compute_loss() is called with a single inputs dict per batch. inputs is whatever your train dataloader yields, one batch at a time. So only one batch is processed per call to training_step(), which then calls compute_loss().

from torch.utils.data import DataLoader
from transformers import Trainer


class MultiDatasetTrainer(Trainer):
    def __init__(self, dataset_a=None, dataset_b=None, bs_a=32, bs_b=128, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.bs_a = bs_a
        self.bs_b = bs_b

    def get_train_dataloader(self):
        loader_a = DataLoader(
            self.dataset_a,
            batch_size=self.bs_a,
            shuffle=True,
            collate_fn=self.data_collator,
        )
        loader_b = DataLoader(
            self.dataset_b,
            batch_size=self.bs_b,
            shuffle=True,
            collate_fn=self.data_collator,
        )
        # You'll need to make this or something like it (see the sketch below)
        return CombinedLoader(loader_a, loader_b, mode="sequential")

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extras (e.g. num_items_in_batch) that newer Trainer versions pass in.
        # The combined loader tags each batch with the dataset it came from.
        source = inputs.pop("source")

        outputs = model(**inputs)
        loss = outputs.loss

        # Optional: weight dataset B less (0.5 is just an example value)
        if source == "B":
            loss = loss * 0.5

        return (loss, outputs) if return_outputs else loss
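
Here is a minimal hand-rolled sketch of that CombinedLoader (single-process, no distributed-sampler handling; the class name and mode argument just mirror the call above). In "sequential" mode it exhausts loader_a and then loader_b once per epoch, and it tags every batch with a "source" key so compute_loss() can tell the two apart. PyTorch Lightning also ships a CombinedLoader utility, but a few lines of your own keep the batch format under your control:

    class CombinedLoader:
        """Sequentially exhausts two dataloaders each epoch and labels their batches."""

        def __init__(self, loader_a, loader_b, mode="sequential"):
            if mode != "sequential":
                raise NotImplementedError("Only the sequential mode is sketched here.")
            self.loader_a = loader_a
            self.loader_b = loader_b

        def __iter__(self):
            for batch in self.loader_a:
                batch["source"] = "A"  # popped back out by compute_loss() before the forward pass
                yield batch
            for batch in self.loader_b:
                batch["source"] = "B"
                yield batch

        def __len__(self):
            # Trainer uses len() to work out steps per epoch, scheduling, logging intervals, etc.
            return len(self.loader_a) + len(self.loader_b)

One design note: "sequential" here means a full pass over dataset A followed by a full pass over dataset B within each epoch. If you need a batch from each set in the same optimization step (as consistency-training setups like UDA typically do), you would interleave or zip the two loaders instead of chaining them.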