Hello!
Does batch mapping (i.e. dataset.map(batched=True)) preserve individual data samples? And how do I access each individual sample after batch mapping?
I have a 50K-example dataset, and after batch mapping it is reduced to 50 examples (which I expected, since batch_size=1000). When I inspect the first sample, it hits the maximum token length.
But if I set batched=False, the first sample does not reach the maximum token length.
Does this mean that batch mapping concatenates the examples? If I want my model to see every example (without losing information to concatenation), should I set batched=False?
OK, my understanding was wrong. If I preprocess the dataset before calling .map(batched=True) (so the prompts are already stored in a column, here 'final'), all samples are preserved. In other words:
def tokenizer_func(examples):
    return tokenizer(examples['final'],
                     truncation=True, padding=True, max_length=128, return_tensors="pt")

tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
>>> Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 50000
})
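To answer my original question about accessing individual samples: indexing the mapped dataset still returns one example at a time. A quick check (just a sketch, assuming the tokenized_train from above; .map() stores the tensors back as plain lists):

sample = tokenized_train[0]                    # dict with 'input_ids' and 'attention_mask' for one example
print(len(sample['input_ids']))                # <= 128 tokens, because of truncation/padding above
print(tokenizer.decode(sample['input_ids']))   # recover the (possibly truncated and padded) text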
However, if I preprocess the dataset during the .map() call, I end up with 50 examples.
def tokenizer_func(examples):
    # examples follow the format of the respective json files
    return tokenizer(generate_prompt(examples['raw']),
                     truncation=True, padding=True, max_length=128, return_tensors="pt")

tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
>>> Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 50
})
Can someone explain the difference in behavior? generate_prompt is simply a function that returns a string.
When batched=True, both examples['final'] and examples['raw'] are lists of (up to) batch_size elements. The tokenizer handles a list of strings natively, which is why the first version keeps all 50K rows, but generate_prompt receives the whole list at once and returns a single string, so each batch of 1000 examples collapses into one row. You may have to use something like:
def tokenizer_func(examples):
    return tokenizer([generate_prompt(raw_text) for raw_text in examples['raw']],
                     truncation=True, padding=True, max_length=128, return_tensors="pt")
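With that change, each element of the batch gets its own prompt before tokenization, so the row count should be preserved. A quick sanity check (assuming the same train_dset, tokenizer, and generate_prompt as above):

tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
print(tokenized_train)
# Expected:
# Dataset({
#     features: ['input_ids', 'attention_mask'],
#     num_rows: 50000
# })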