Not sure why padding isn't working for me

There doesn’t seem to be any padding occurring here:

    train_dataset = dataset.shard(10, 1)
    train_dataset.set_format(columns=['text'])
    train_dataset.cleanup_cache_files()

    encoded_dataset = train_dataset.map(lambda examples: tokenizer(examples['text'], padding=True))
    encoded_dataset[:1]

    {'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
     'input_ids': [[101, 1714, 22233, 21365, 4515, 8618, 102]],
     'text': ['free instagram followers '],
     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]]}

What am I missing?

You are using the strategy that pads to the length of the longest sample in the batch, but you are passing your samples to the tokenizer one by one, so no padding happens.
If you want to pass several samples at once, use batched=True in your call to map. If you want to pad to a specific max_length, pass max_length=xxx and padding="max_length" to your call to the tokenizer.
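
For example, a minimal sketch assuming the same `tokenizer` and `train_dataset` from your post (the `max_length=128` value is just an arbitrary choice for illustration):

    # Option 1: pass batches to the tokenizer, so padding=True can pad
    # each batch to the length of its longest sample
    encoded_dataset = train_dataset.map(
        lambda examples: tokenizer(examples['text'], padding=True),
        batched=True,
    )

    # Option 2: pad every sample to a fixed length
    encoded_dataset = train_dataset.map(
        lambda examples: tokenizer(examples['text'], padding='max_length', max_length=128)
    )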


Thank you so much @sgugger! That makes sense now.