I’m trying to fine-tune a Helsinki-NLP model on my own dataset, which is a two-column CSV file. I turned it into a dictionary that looks like this:
```
{'id': [0, 1, 2, …], 'translation': {'en': 'some text', 'ar': 'نص'}}
```
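For context, this is roughly how I built that dictionary from the CSV and wrapped it in a `datasets.Dataset` (simplified sketch; the file name `data.csv` and the column names `en`/`ar` stand in for my actual ones):

```python
import pandas as pd
from datasets import Dataset

# Simplified sketch; "data.csv" and the column names "en"/"ar"
# are placeholders for my actual two-column file.
df = pd.read_csv("data.csv")

data = {
    "id": list(range(len(df))),
    # one {'en': ..., 'ar': ...} dict per CSV row
    "translation": [{"en": e, "ar": a} for e, a in zip(df["en"], df["ar"])],
}
dataset = Dataset.from_dict(data)
```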
I used the function mentioned in the tutorial:
```python
source_lang = "en"
target_lang = "ar"

def preprocess_function(examples):
    inputs = [example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
```
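For reference, this is roughly how I loaded the tokenizer and applied the function, following the tutorial (the checkpoint name below is just an example of a Helsinki-NLP model; mine may differ):

```python
from transformers import AutoTokenizer

# Example checkpoint from the Helsinki-NLP OPUS-MT family (assumed)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ar")

# Applied the preprocessing in batches, as in the tutorial
tokenized_dataset = dataset.map(preprocess_function, batched=True)
```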
Then I tokenized it and followed the tutorial all the way to the training step, but I got this error:
```
Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers
```
It’s a problem with the dataset format, right? What’s the right format, and how can I process it?