I have a dictionary containing the Bible’s verses, where the sentences in the two languages are aligned by list index:
Data = { “it” : […, …, …], “fr” : […, …, …] }
How do I convert it to a dataset compatible with Hugging Face’s transformers?
Also, my objective is to learn French while reading the Bible. Should I fine-tune an already trained fr-it translation model, or should I train a model from scratch? Given my objective, which will give better results?
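For concreteness, here is a minimal sketch of the structure I have (the two verses are made up for illustration; the real lists are much longer):

```python
# Minimal sketch of the data layout: one list per language,
# translations aligned by index (hypothetical example verses).
data = {
    "it": ["Nel principio Dio creò i cieli e la terra.",
           "E Dio disse: Sia la luce! E la luce fu."],
    "fr": ["Au commencement, Dieu créa les cieux et la terre.",
           "Dieu dit: Que la lumière soit! Et la lumière fut."],
}

# Both lists must be the same length for the alignment to hold.
assert len(data["it"]) == len(data["fr"])

# Sentence pairs recovered by index.
pairs = list(zip(data["fr"], data["it"]))
```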
Your example function should be modified. If your tokenizer comes from Hugging Face transformers, tokenizer(text, text_pair) builds a single sequence like [CLS] <text> [SEP] <text_pair> [SEP]. That is why your dataset ends up with a single (merged) sequence.
My suggestion is below (it looks like you are running an NMT task):
def example_function(examples):
    tokenized_inputs = tokenizer(
        text=examples[text_column],
        text_pair=examples[text_pair_column] if text_pair_column else None,
        is_split_into_words=is_split_into_words,
    )
    if label_column_name in examples:
        with tokenizer.as_target_tokenizer():
            tokenized_labels = tokenizer(
                text=examples[label_column_name],
                padding=padding,
                truncation=truncation,
                is_split_into_words=is_split_into_words,
            )
        if ignore_pad_tokens_for_loss:
            tokenized_labels['input_ids'] = [
                [l if l != tokenizer.pad_token_id else -100 for l in label]
                for label in tokenized_labels['input_ids']
            ]
        tokenized_inputs['labels'] = tokenized_labels['input_ids']
    return tokenized_inputs
is_split_into_words is a boolean option: True if your text column contains lists of tokens, False if it contains plain strings.
ignore_pad_tokens_for_loss is a boolean option; set it as you prefer (True replaces pad token ids in the labels with -100 so the loss ignores them).
padding and truncation are also options you can set as you prefer.
text_column, text_pair_column and label_column_name depend on your dataset. In your dataset: text_column=“it” (or “fr”), text_pair_column=None, label_column_name=“fr” (or “it”).
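To make the ignore_pad_tokens_for_loss step concrete, here is the same pad-masking logic standalone, with a made-up pad token id (in practice you would use tokenizer.pad_token_id):

```python
PAD_ID = 59513  # hypothetical pad token id; real code uses tokenizer.pad_token_id

# Padded label ids for a batch of two sequences.
batch_labels = [
    [182, 7, 9, PAD_ID, PAD_ID],
    [21, 4, PAD_ID, PAD_ID, PAD_ID],
]

# Replace pad ids with -100, the index PyTorch's cross-entropy loss ignores,
# so padding positions do not contribute to the training loss.
masked = [[l if l != PAD_ID else -100 for l in label] for label in batch_labels]
# masked == [[182, 7, 9, -100, -100], [21, 4, -100, -100, -100]]
```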
Thanks for the suggestion. I think I didn’t get how to set the variables, because with the following code I’m getting all ones in the attention mask.
def example_function(examples):
    tokenized_inputs = tokenizer(
        text=examples["fr"],
        text_pair=examples[None] if None else None,
        is_split_into_words=False,
    )
    if "it" in examples:
        with tokenizer.as_target_tokenizer():
            tokenized_labels = tokenizer(
                text=examples["it"],
                padding="longest",
                truncation=True,
                is_split_into_words=False,
            )
        if True:
            tokenized_labels['input_ids'] = [
                [l if l != tokenizer.pad_token_id else -100 for l in label]
                for label in tokenized_labels['input_ids']
            ]
        tokenized_inputs['labels'] = tokenized_labels['input_ids']
    return tokenized_inputs
import json

from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "Helsinki-NLP/opus-tatoeba-fr-it"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

with open('data.json', 'r') as fp:
    data = json.load(fp)

dataset = Dataset.from_dict(data)
example_function(dataset)
I’m doing something wrong; the text_pair_column=None part is not clear to me, maybe I set it wrong. I want to translate French to Italian. Thanks for the help.