Not sure why padding isn't working for me

There doesn’t seem to be any padding occurring here:

    train_dataset = dataset.shard(10, 1)
    train_dataset.set_format(columns=['text'])
    train_dataset.cleanup_cache_files()

    encoded_dataset = train_dataset.map(lambda examples: tokenizer(examples['text'], padding=True))
    encoded_dataset[:1]

    {'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
     'input_ids': [[101, 1714, 22233, 21365, 4515, 8618, 102]],
     'text': ['free instagram followers '],
     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]]}

What am I missing?

You are using the strategy that pads to the length of the longest sample in the batch, but you are passing your samples to the tokenizer one by one, so no padding happens.
If you want to pass several samples at once, use batched=True in your call to map. If you want to pad to a specific max_length, pass max_length=xxx and padding="max_length" to your call to the tokenizer.
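
For example, a minimal sketch assuming the same `tokenizer` and `train_dataset` from your post (the `max_length=128` value is just an arbitrary choice for illustration):

    # Option 1: pass batches to the tokenizer, so padding=True can pad
    # each batch to the length of its longest sample
    encoded_dataset = train_dataset.map(
        lambda examples: tokenizer(examples['text'], padding=True),
        batched=True,
    )

    # Option 2: pad every sample to a fixed length
    encoded_dataset = train_dataset.map(
        lambda examples: tokenizer(examples['text'], padding='max_length', max_length=128)
    )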


Thank you so much @sgugger! That makes sense now.