I am working with the WMT14 de-en dataset and have been trying to tokenize it using batched mapping, but I seem to be doing something wrong or do not understand how the mapping function works with batching.
The following is my code:
from datasets import load_dataset
from transformers import GPT2TokenizerFast
dataset_de_en = load_dataset("wmt14", "de-en")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
def tokenize(trans_sample):
    src_tokenized = tokenizer(trans_sample['en'])
    trg_tokenized = tokenizer(trans_sample['de'])
    return {'en': src_tokenized,
            'de': trg_tokenized}
tokenized = dataset_de_en['train']['translation'].map(tokenize, batched=True, batch_size=512)
tokenized
This fails with the AttributeError below. Essentially, trans_sample ends up being a list, and I would need a for loop to iterate through it and tokenize each element, which to my understanding means I am not using batched map properly here. Could someone please point me in the right direction on how to do this properly?
AttributeError: 'list' object has no attribute 'map'
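From the traceback I suspect that map has to be called on the Dataset object itself rather than on dataset_de_en['train']['translation'] (which is just a plain Python list), and that with batched=True the function receives a dict whose 'translation' entry is a list of {'en': ..., 'de': ...} pairs. The sketch below is what I think is intended, but I am not sure whether it is the idiomatic way to do it (the 'en_input_ids' / 'de_input_ids' column names are just my own choice):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset_de_en = load_dataset("wmt14", "de-en")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(batch):
    # With batched=True, batch['translation'] is a list of dicts
    # like {'en': ..., 'de': ...}, so pull out parallel lists of strings.
    src_texts = [pair['en'] for pair in batch['translation']]
    trg_texts = [pair['de'] for pair in batch['translation']]
    src_tokenized = tokenizer(src_texts)
    trg_tokenized = tokenizer(trg_texts)
    # Return new columns of the same length as the batch.
    return {'en_input_ids': src_tokenized['input_ids'],
            'de_input_ids': trg_tokenized['input_ids']}

# map is called on the Dataset, not on the 'translation' column
# (indexing the column gives a plain list, hence the AttributeError).
tokenized = dataset_de_en['train'].map(tokenize, batched=True, batch_size=512)

Is this the right pattern, or is there a cleaner way to tokenize both sides of a translation dataset with batched map?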