There doesn’t seem to be any padding happening here, even though I’m passing padding=True to the tokenizer:
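# dataset is a datasets.Dataset with a 'text' column; tokenizer is a pretrained Hugging Face tokenizer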
train_dataset = dataset.shard(10, 1)
train_dataset.set_format(columns=['text'])
train_dataset.cleanup_cache_files()
encoded_dataset = train_dataset.map(lambda examples: tokenizer(examples['text'], padding=True))
encoded_dataset[:1]
{'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
'input_ids': [[101, 1714, 22233, 21365, 4515, 8618, 102]],
'text': ['free instagram followers '],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]]}
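For comparison, here is roughly what I expected, as a minimal sketch (assuming the same tokenizer; the second, longer sentence is just something I made up for illustration). Calling the tokenizer directly on a batch of two texts with padding=True should pad the shorter one to the length of the longer:

batch = tokenizer(
    ['free instagram followers ', 'free instagram followers right now please'],
    padding=True,  # pads to the longest sequence in the batch
)
# expected: batch['attention_mask'][0] ends in 0s and batch['input_ids'][0]
# is padded out with the tokenizer's pad token id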
What am I missing?