How to tokenize using map

This is a problem I have run into (full info is on this thread). I had a quick question -

I have constructed a Dataset object from NumPy arrays. However, when I use this:

def tok(example):
  encodings = tokenizer(example['src'], truncation=True, padding="max_length", max_length=2000)
  return encodings

train_encoded_dataset = train_dataset.map(tok, batched=True)
val_encoded_dataset = val_dataset.map(tok, batched=True)

and then explore my train_encoded_dataset, I see the following when trying to view the source sequence:

>>> train_encoded_dataset
Dataset({
    features: ['attention_mask', 'input_ids', 'src', 'tgt'],
    num_rows: 4572
})
>>> train_encoded_dataset['src'][0]

The output of this last command is a completely raw (basically untokenized) string (like 'lorem ipsum…'), which is expected since I didn't call tokenizer.tokenize.

So does anyone have any idea how to get the text tokenized as well? I tried a few obvious approaches, but none of them worked.

hey @Neel-Gupta, could you share a minimal example of the Dataset object you’re working with?

Just so I understand, you're saying that the tokenizer is not tokenizing the strings in the src field, right?

Thanks a lot for the quick reply! Yeah, the tokenizer is not tokenizing those strings, maybe because I didn’t call the tokenize method?

Anyway, plugging in these variables should do the trick for reproduction:

import numpy as np

train_text = np.array(['a foxy', 'b ball', 'c cats r bad'])
train_label = np.array([1, 2, 3])
val_text = np.array(['a foxy', 'r c cats'])
val_label = np.array([1, 2])
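
For reference, here is one way these arrays could be turned into the Dataset objects used above (a minimal sketch; the original construction code isn't shown in the thread, and the column names 'src' and 'tgt' are assumed from the earlier output):

from datasets import Dataset

# Build the datasets from the arrays above; 'src' holds the text and
# 'tgt' the labels, matching the features listed for train_encoded_dataset.
train_dataset = Dataset.from_dict({"src": train_text, "tgt": train_label})
val_dataset = Dataset.from_dict({"src": val_text, "tgt": val_label})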

If you can't reproduce the issue with the pre-trained Longformer (base) tokenizer, then I can provide you with my model - but I doubt you'll need it, since I have used that same one :hugs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".....", use_fast=True, truncation=True, padding=True, max_length=2000)

The result of the tokenization is not stored in the 'src' field, but in the 'input_ids' and 'attention_mask' columns. The 'src' field is just the same as before.
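
For example, you can inspect those columns directly (a quick sketch; the index and slice here are arbitrary):

>>> train_encoded_dataset['input_ids'][0][:10]       # token IDs produced by the tokenizer
>>> train_encoded_dataset['attention_mask'][0][:10]  # 1 for real tokens, 0 for padding
>>> tokenizer.decode(train_encoded_dataset['input_ids'][0][:10])  # map a few IDs back to text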


So my tokenizer definitely does tokenize the input, right? That means the problem is in the fine-tuning code, if I am correct. Would you have a clue as to where the problem could be?
Previously, before using datasets, I was able to fine-tune the model with the same code :disappointed:
I'm getting IndexError: index out of range in self