How to tokenize using map

This is a problem I have run into (full info is on this thread). I had a quick question -

I have constructed a Dataset object from NumPy arrays. However, when I use this:

def tok(example):
  encodings = tokenizer(example['src'], truncation=True, padding="max_length", max_length=2000)
  return encodings

train_encoded_dataset = train_dataset.map(tok, batched=True)
val_encoded_dataset = val_dataset.map(tok, batched=True)

and then explore my train_encoded_dataset, I see the following when trying to view the source sequence:

>>> train_encoded_dataset
Dataset({
    features: ['attention_mask', 'input_ids', 'src', 'tgt'],
    num_rows: 4572
})
>>> train_encoded_dataset['src'][0]

The output of this last command is a completely raw (basically untokenized) string (like 'lorem ipsum…'), which is expected since I didn't call tokenizer.tokenize.

So does anyone have any idea how to get the text tokenized as well? I tried a few obvious approaches, but none of them worked.

hey @Neel-Gupta, could you share a minimal example of the Dataset object you’re working with?

Just so I understand, you're saying that the tokenizer is not tokenizing the strings in the src field, right?

Thanks a lot for the quick reply! Yeah, the tokenizer is not tokenizing those strings, maybe because I didn’t call the tokenize method?

Anyway, plugging in these variables should do the trick for reproduction:

import numpy as np

train_text = np.array(['a foxy', 'b ball', 'c cats r bad'])
train_label = np.array([1, 2, 3])
val_text = np.array(['a foxy', 'r c cats'])
val_label = np.array([1, 2])
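
For reference, here is one way these arrays could be turned into the Dataset objects used above (a minimal sketch; the original construction code isn't shown in the thread, and the column names 'src' and 'tgt' are assumed from the earlier output):

from datasets import Dataset

# Build the datasets from the arrays above; 'src' holds the text and
# 'tgt' the labels, matching the features listed for train_encoded_dataset.
train_dataset = Dataset.from_dict({"src": train_text, "tgt": train_label})
val_dataset = Dataset.from_dict({"src": val_text, "tgt": val_label})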

If you can't reproduce the issue with the pre-trained Longformer (base) tokenizer, then I can provide you with my model - but I doubt you'll need it, since I have used that same one :hugs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".....", use_fast=True, truncation=True, padding=True, max_length=2000)

The result of the tokenization is not stored in the 'src' field, but in the 'input_ids' and 'attention_mask' columns. The 'src' field is just the same as before.
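
For example, you can inspect those columns directly (a quick sketch; the index and slice here are arbitrary):

>>> train_encoded_dataset['input_ids'][0][:10]       # token IDs produced by the tokenizer
>>> train_encoded_dataset['attention_mask'][0][:10]  # 1 for real tokens, 0 for padding
>>> tokenizer.decode(train_encoded_dataset['input_ids'][0][:10])  # map a few IDs back to text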


So my tokenizer definitely does tokenize the input, right? That means the problem is in the fine-tuning code, if I am correct. Would you have a clue as to where the problem could be?
Previously, before using datasets, I was able to fine-tune the model with the same code :disappointed:
I'm getting IndexError: index out of range in self