Multilingual batches

Hi everyone,

I’m trying to create multilingual batches in a controlled way. Specifically, I want each batch (e.g., with batch_size=32) to contain items from 4 different languages, with 8 examples sampled randomly per language.

I’ve already created a custom list of batches meeting this requirement, but I’m struggling with how to pass these batches to the Trainer. Currently, my Trainer setup looks like this:

    trainer = SFTTrainer(
        model=model,
        train_dataset=datasets['train'],
        eval_dataset=datasets['eval'],  # Add evaluation dataset
        peft_config=peft_config,
        tokenizer=tokenizer,
        max_seq_length=512,
        args=training_arguments,
        formatting_func=formatting_prompts_func
    )

Does anyone know how to properly integrate my custom batches with the Trainer? Any guidance would be greatly appreciated!

Thank you in advance!

Hi! You can use datasets.interleave_datasets() to build a dataset that alternates between languages: pass it your 4 datasets (one per language) as input and feed the result to the Trainer.

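You might want to pre-shuffle each language dataset and disable the Trainer’s shuffling to make sure you end up with batches of 8 samples per language. Because the interleaved dataset cycles through the languages one example at a time, any 32 consecutive rows contain 8 examples from each of the 4 languages, but only as long as the row order is preserved.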

Thank you very much!
