I am working with the WMT14 de-en dataset and have been trying to tokenize it using batched mapping, but I seem to be doing something wrong or do not understand how the mapping function works with batching.
The following is my code:
from datasets import load_dataset
from transformers import GPT2TokenizerFast
dataset_de_en = load_dataset("wmt14", "de-en")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
def tokenize(trans_sample):
    src_tokenized = tokenizer(trans_sample['en'])
    trg_tokenized = tokenizer(trans_sample['de'])
    return {'en': src_tokenized,
            'de': trg_tokenized}
tokenized = dataset_de_en['train']['translation'].map(tokenize, batched=True, batch_size=512)
tokenized
This fails with the AttributeError below. Essentially, trans_sample ends up being a list, and I would need a for loop to iterate through it and tokenize each element, which to my understanding means I am not using batched map properly here. Could someone please point me in the right direction on how to do this properly?
AttributeError: 'list' object has no attribute 'map'
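From the traceback I suspect that map has to be called on the Dataset object itself rather than on dataset_de_en['train']['translation'] (which is just a plain Python list), and that with batched=True the function receives a dict whose 'translation' entry is a list of {'en': ..., 'de': ...} pairs. The sketch below is what I think is intended, but I am not sure whether it is the idiomatic way to do it (the 'en_input_ids' / 'de_input_ids' column names are just my own choice):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset_de_en = load_dataset("wmt14", "de-en")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(batch):
    # With batched=True, batch['translation'] is a list of dicts
    # like {'en': ..., 'de': ...}, so pull out parallel lists of strings.
    src_texts = [pair['en'] for pair in batch['translation']]
    trg_texts = [pair['de'] for pair in batch['translation']]
    src_tokenized = tokenizer(src_texts)
    trg_tokenized = tokenizer(trg_texts)
    # Return new columns of the same length as the batch.
    return {'en_input_ids': src_tokenized['input_ids'],
            'de_input_ids': trg_tokenized['input_ids']}

# map is called on the Dataset, not on the 'translation' column
# (indexing the column gives a plain list, hence the AttributeError).
tokenized = dataset_de_en['train'].map(tokenize, batched=True, batch_size=512)

Is this the right pattern, or is there a cleaner way to tokenize both sides of a translation dataset with batched map?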