Dictionary of two lists to a dataset, and fine-tuning advice for fr-it translation

I have a dictionary containing the Bible's verses, where the sentences in the two lists are aligned by index:

Data = { "it" : […, …, …], "fr" : […, …, …] }

How do I convert it to a dataset compatible with Hugging Face's transformers?

Also, my objective is to learn French while reading the Bible. Should I fine-tune an already trained fr-it translation model, or should I build a model from scratch? Given my objective, which will give better results?

It is very easy with datasets.Dataset.from_dict:

from datasets import Dataset

# `data` is your dictionary of two parallel lists: {"it": [...], "fr": [...]}
dataset = Dataset.from_dict(data)
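For example, a minimal sketch with two toy verse pairs (placeholder sentences, not your actual data):

from datasets import Dataset

data = {
    "it": ["In principio Dio creò il cielo e la terra.", "E Dio disse: Sia la luce!"],
    "fr": ["Au commencement, Dieu créa les cieux et la terre.", "Dieu dit: Que la lumière soit!"],
}
dataset = Dataset.from_dict(data)
print(dataset)     # Dataset({features: ['it', 'fr'], num_rows: 2})
print(dataset[0])  # {'it': 'In principio...', 'fr': 'Au commencement...'}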

Thank you for the help. With the following code, though, I think I have an issue:

import json

from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

with open('data.json', 'r') as fp:
    data = json.load(fp)

dataset = Dataset.from_dict(data)
print(dataset)

checkpoint = "Helsinki-NLP/opus-tatoeba-fr-it"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['fr'], example['it'], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_datasets

output:

Dataset({
    features: ['it', 'fr'],
    num_rows: 31091
})
100% 32/32 [00:10<00:00, 3.17ba/s]
Dataset({
    features: ['it', 'fr', 'input_ids', 'attention_mask'],
    num_rows: 31091
})

As you can see, I need one input_ids for the French sentences and a separate input_ids for the Italian ones, unless I misunderstood how this works.

Your example function should be modified. If your tokenizer comes from transformers (Hugging Face), tokenizer(text, text_pair) builds a single sequence, e.g. [CLS] <text> [SEP] <text_pair> [SEP] for BERT-style tokenizers. That is why your dataset ends up with a single (merged) sequence rather than separate encodings per language.
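You can see the merging directly (a quick sketch; the Marian tokenizer used by your checkpoint marks sequence boundaries with </s> rather than [CLS]/[SEP], but the effect is the same):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-tatoeba-fr-it")
enc = tokenizer("Que la lumière soit!", "Sia la luce!")
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# -> one merged token sequence containing BOTH sentences, i.e. a single input_ids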


My suggestion is the following (it seems you are running an NMT task):

def example_function(examples):
    # Template: text_column, text_pair_column, label_column_name and the
    # option flags used below are placeholders, explained after the code.
    tokenized_inputs = tokenizer(
        text=examples[text_column],
        text_pair=examples[text_pair_column] if text_pair_column else None,
        is_split_into_words=is_split_into_words
    )
    if label_column_name in examples:
        # Tokenize the target sentences in target-tokenizer mode
        with tokenizer.as_target_tokenizer():
            tokenized_labels = tokenizer(
                text=examples[label_column_name],
                padding=padding,
                truncation=truncation,
                is_split_into_words=is_split_into_words
            )
        if ignore_pad_tokens_for_loss:
            # Replace pad token ids in the labels with -100 so the loss
            # function ignores the padded positions
            tokenized_labels['input_ids'] = [
                [l if l != tokenizer.pad_token_id else -100 for l in label]
                for label in tokenized_labels['input_ids']
            ]
        tokenized_inputs['labels'] = tokenized_labels['input_ids']

    return tokenized_inputs
  • is_split_into_words is a boolean option: True if your text column contains lists of tokens, False if it contains plain strings.
  • ignore_pad_tokens_for_loss is a boolean option: set it as you prefer.
  • padding and truncation are also options to set as you prefer.
  • text_column, text_pair_column, and label_column_name depend on your dataset. In your dataset: text_column="it" (or "fr"), text_pair_column=None, label_column_name="fr" (or "it"); see the sketch below for a concrete wiring.
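For instance, a minimal sketch of concrete values for a fr -> it setup (the variable names come from the template above; the values are assumptions based on your dataset):

text_column = "fr"                 # source language column
text_pair_column = None            # no text pair for plain translation
label_column_name = "it"           # target language column
is_split_into_words = False        # the columns hold plain strings, not token lists
padding = "longest"
truncation = True
ignore_pad_tokens_for_loss = True

tokenized_datasets = dataset.map(example_function, batched=True)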

Thanks for the suggestion. I think I didn't get how to change the variables, because with the following code I'm getting all ones in the attention mask.

import json

from datasets import Dataset
from transformers import AutoTokenizer

checkpoint = "Helsinki-NLP/opus-tatoeba-fr-it"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

with open('data.json', 'r') as fp:
    data = json.load(fp)

dataset = Dataset.from_dict(data)

def example_function(examples):
    tokenized_inputs = tokenizer(
        text=examples["fr"],
        text_pair=examples[None] if None else None,
        is_split_into_words=False
    )
    if "it" in examples:
        with tokenizer.as_target_tokenizer():
            tokenized_labels = tokenizer(
                text=examples["it"],
                padding="longest",
                truncation=True,
                is_split_into_words=False
            )
        if True:
            tokenized_labels['input_ids'] = [
                [l if l != tokenizer.pad_token_id else -100 for l in label]
                for label in tokenized_labels['input_ids']
            ]
        tokenized_inputs['labels'] = tokenized_labels['input_ids']

    return tokenized_inputs

example_function(dataset)

I'm doing something wrong; the text_pair_column=None part is not clear to me, maybe I set it wrong. I want to translate French to Italian. Thanks for the help.

Please check the notation: text_pair=examples[None] if None else None should simply be text_pair=None, since you have no text pair column.

Thanks for the tips. :grinning:
I modified the function this way to get what I wanted; the if "it" in … part of your code didn't seem to work for some reason.

def example_function(examples):
    # Tokenize the French source sentences (no text pair)
    tokenized_inputs = tokenizer(
        text=examples["fr"],
        text_pair=None,
        is_split_into_words=False
    )
    # Tokenize the Italian target sentences to use as labels
    tokenized_labels = tokenizer(
        text=examples["it"],
        padding="longest",
        truncation=True,
        is_split_into_words=False
    )
    tokenized_inputs['labels'] = tokenized_labels['input_ids']
    tokenized_inputs['labels_attention_mask'] = tokenized_labels['attention_mask']
    return tokenized_inputs
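For completeness, a possible way to apply it (a sketch, assuming the dataset and tokenizer from the earlier posts; with batched=True, examples is a dict of lists and padding="longest" pads each batch of labels to its longest member):

tokenized_dataset = dataset.map(example_function, batched=True)
print(tokenized_dataset)
# Dataset({
#     features: ['it', 'fr', 'input_ids', 'attention_mask',
#                'labels', 'labels_attention_mask'],
#     num_rows: 31091
# })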