Clarification on Batch mapping

Hello!

Does batch mapping (i.e., dataset.map(batched=True)) preserve individual data samples? How do I access each individual sample after batch mapping?

I have a 50K-example dataset, and after batched mapping it is reduced to 50 examples (which I assumed was expected, since batch_size=1000). When I look at the first sample, it has reached the maximum token length.

But if I set batched=False, the first sample does not reach the maximum token length.

Does this mean that batch mapping concatenates the examples? If I want my model to read every example (without losing info because of concatenation), should I set batched=False?
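
For reference, here is a toy sketch of what the mapped function receives with batched=True, using a dummy dataset (not my actual data):

from datasets import Dataset

# Toy dataset standing in for my real 50K-example one
dset = Dataset.from_dict({"final": ["hello", "world", "foo"]})

def inspect(examples):
    # With batched=True, `examples` is a dict of lists (one list per column),
    # not a single sample
    print(type(examples["final"]), len(examples["final"]))
    return examples

dset.map(inspect, batched=True, batch_size=2)
# <class 'list'> 2
# <class 'list'> 1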

Ok, my understanding was wrong. If I preprocess the dataset before calling .map(batched=True), it preserves all samples. In other words:

def tokenizer_func(examples):
    return tokenizer(examples['final'],
                     truncation=True, padding=True, max_length=128, return_tensors="pt")


tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
>>> Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 50000
})

However, if I preprocess the dataset during the .map() call, I end up with 50 examples.

def tokenizer_func(examples):
    return tokenizer(generate_prompt(examples['raw']),
                     truncation=True, padding=True, max_length=128, return_tensors="pt")

# examples follow format of resp json files
tokenized_train = train_dset.map(tokenizer_func,
                                 remove_columns=train_dset.column_names, batched=True)
>>> Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 50
})

Can someone explain the difference in behavior? generate_prompt is simply a function that returns a string.

When batched=True, both examples['final'] and examples['raw'] will be a list of batch_size elements, so generate_prompt receives the whole list at once rather than a single string. You may have to use:

def tokenizer_func(examples):
    return tokenizer([generate_prompt(raw_text) for raw_text in examples['raw']],
                     truncation=True, padding=True, max_length=128, return_tensors="pt")
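
That also explains why the second version collapses to 50 rows: generate_prompt(examples['raw']) gets the entire batch list at once and returns a single string, so the tokenizer emits one example per batch, and with batch_size=1000 that is 50000 / 1000 = 50 rows. A toy illustration of the failure mode, using a hypothetical stand-in for generate_prompt:

# Hypothetical stand-in for generate_prompt, just for illustration
def generate_prompt(raw):
    return f"Prompt: {raw}"

batch = {"raw": ["a", "b", "c"]}

# Whole list in -> one string out -> one tokenized row for the whole batch
print(generate_prompt(batch["raw"]))
# Prompt: ['a', 'b', 'c']

# One element at a time -> one string per sample -> row count preserved
print([generate_prompt(r) for r in batch["raw"]])
# ['Prompt: a', 'Prompt: b', 'Prompt: c']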