Multilingual batches

lingvenvist · December 12, 2024, 4:03pm

Hi everyone,

I’m trying to create multilingual batches in a controlled way. Specifically, I want each batch (e.g., with batch_size=32) to contain items from 4 different languages, with 8 examples sampled randomly per language.

I’ve already created a custom list of batches meeting this requirement, but I’m struggling with how to pass these batches to the Trainer. Currently, my Trainer setup looks like this:

    trainer = SFTTrainer(
        model=model,
        train_dataset=datasets['train'],
        eval_dataset=datasets['eval'],  # Add evaluation dataset
        peft_config=peft_config,
        tokenizer=tokenizer,
        max_seq_length=512,
        args=training_arguments,
        formatting_func=formatting_prompts_func,
    )

Does anyone know how to properly integrate my custom batches with the Trainer? Any guidance would be greatly appreciated!

Thank you in advance!

lhoestq · December 12, 2024, 6:35pm

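Hi! You can use datasets.interleave_datasets() to build a dataset that alternates between languages: pass it your 4 datasets (one per language) and feed the result to the Trainer.

You might want to pre-shuffle each language dataset and disable the Trainer’s shuffling. With no sampling probabilities, interleave_datasets() takes one example from each dataset in turn, so any 32 consecutive examples starting at a multiple of 4 contain exactly 8 samples per language, as long as the order is not shuffled again afterwards.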

lingvenvist · December 12, 2024, 8:06pm

Thank you very much!

system · December 13, 2024, 8:06am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
