I’m trying to fine-tune a Helsinki-NLP model on my own dataset, which is a two-column CSV file. I turned it into a dictionary that looks like this:
```
{'id': [0, 1, 2, …], 'translation': {'en': 'some text', 'ar': 'نص'}}
```
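For context, this is roughly how I built that dictionary from the CSV and wrapped it in a `datasets.Dataset` (simplified sketch; the file name `data.csv` and the column names `en`/`ar` stand in for my actual ones):

```python
import pandas as pd
from datasets import Dataset

# Simplified sketch; "data.csv" and the column names "en"/"ar"
# are placeholders for my actual two-column file.
df = pd.read_csv("data.csv")

data = {
    "id": list(range(len(df))),
    # one {'en': ..., 'ar': ...} dict per CSV row
    "translation": [{"en": e, "ar": a} for e, a in zip(df["en"], df["ar"])],
}
dataset = Dataset.from_dict(data)
```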
I used the function mentioned in the tutorial:
```python
source_lang = "en"
target_lang = "ar"

def preprocess_function(examples):
    inputs = [example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
```
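For reference, this is roughly how I loaded the tokenizer and applied the function, following the tutorial (the checkpoint name below is just an example of a Helsinki-NLP model; mine may differ):

```python
from transformers import AutoTokenizer

# Example checkpoint from the Helsinki-NLP OPUS-MT family (assumed)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ar")

# Applied the preprocessing in batches, as in the tutorial
tokenized_dataset = dataset.map(preprocess_function, batched=True)
```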
Then I tokenized it and followed the tutorial all the way to the training step, but I got this error:
```
Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers
```
It’s a problem with the dataset format, right? What’s the right format, and how can I process it?