Hello!

Does batch mapping (i.e. dataset.map(batched=True)) preserve individual data samples? How do I access each individual sample after batch mapping?

I have a 50K dataset, and after batch mapping it is reduced to 50 examples (which is expected, because batch_size=1000). When I look at the first sample, it has hit the maximum token length. But if I set batched=False, the first sample does not reach the maximum token length.

Does this mean that batch mapping concatenates the examples? If I want my model to read every example (without losing information to concatenation), should I set batched=False?
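For context, here is a minimal sketch of what the mapped function receives in each mode; the toy dataset and its "text" column are made up purely for illustration:

from datasets import Dataset

dset = Dataset.from_dict({"text": ["a", "b", "c"]})  # toy data; column name is illustrative

def inspect(examples):
    # batched=False passes one example at a time: examples["text"] is a str.
    # batched=True passes a whole batch: examples["text"] is a list of strings.
    print(type(examples["text"]))
    return examples

dset.map(inspect, batched=False)  # prints <class 'str'> three times
dset.map(inspect, batched=True)   # prints <class 'list'> once (batch_size defaults to 1000)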
Ok, my understanding was wrong. If I preprocess the dataset before the .map(batched=True) call, all samples are preserved. In other words:
def tokenizer_func(examples):
    return tokenizer(examples['final'],
                     truncation=True, padding=True, max_length=128, return_tensors="pt")

tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
>>> Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 50000
})
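This also answers my question about accessing individual samples: batched mapping only changes how examples are handed to the mapped function, and the resulting dataset is still indexable row by row. A quick sketch, assuming the tokenized_train from above:

sample = tokenized_train[0]      # dict with 'input_ids' and 'attention_mask' for the first sample
print(len(sample["input_ids"]))  # token count of that sample, at most 128 here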
However, if I preprocess the dataset during the .map() call, I end up with 50 examples.
def tokenizer_func(examples):
    # examples follow format of resp json files
    return tokenizer(generate_prompt(examples['raw']),
                     truncation=True, padding=True, max_length=128, return_tensors="pt")

tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
>>> Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 50
})
Can someone explain the difference in behavior? generate_prompt is simply a function that returns a string.
When batched=True, examples['final'] and examples['raw'] will each be a list of batch_size elements. Calling generate_prompt on the whole list therefore returns a single string, so each batch of 1,000 examples collapses into one tokenized row (50,000 / 1,000 = 50). You may have to use
def tokenizer_func(examples):
    return tokenizer([generate_prompt(raw_text) for raw_text in examples['raw']],
                     truncation=True, padding=True, max_length=128, return_tensors="pt")
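With the list comprehension, each raw example gets its own prompt, so the row count is preserved. A quick check (a sketch, reusing the same train_dset as above):

tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
print(tokenized_train.num_rows)  # 50000 again: one row per original example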