How can I make sure Tokenizer pads to a fixed length?

Hi, I’m trying to use DistilBERT as a layer in Keras; however, the tokenizer doesn’t pad to a fixed length but rather just to some minimum length that depends on the batch. Reading up on it, I guess that is expected.
However, that doesn’t work for me, since the input layer (I’m combining it with other inputs) needs a fixed length.
Can I somehow make sure the tokenizer always pads to max_length?

Thanks for any help and insights :slight_smile:

Hello and welcome to our forum :hugs:

When you’re passing your sequences, can you set padding to “max_length” and pass a value to the max_length argument like so:

tokenizer(sequence, return_tensors="tf", padding="max_length", max_length=15)
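
For a slightly fuller sketch (the distilbert-base-uncased checkpoint is just an assumption here, use whichever one you have), the output should always come back with exactly max_length tokens:

from transformers import AutoTokenizer

# assumed checkpoint; swap in the one you actually use
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer(
    ["a short sentence", "another one"],
    return_tensors="tf",
    padding="max_length",   # pad every sequence up to max_length
    truncation=True,        # cut anything longer than max_length
    max_length=15,
)
print(encoded["input_ids"].shape)  # -> (2, 15), regardless of the batch contents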

Let me know if it works :slightly_smiling_face:


Hi merve,
thanks a lot. This is something I tried, along with many other things, but it didn’t work.
I also tried padding=True with max_length=512 (or just some other fixed length).

Weirdly, the deprecated argument pad_to_max_length=True does pad, and the outputs come out at the full length. Very strange, but I’m happy this works now. :slight_smile:

I’m thinking the issue is maybe that none of the texts in the dataset are very long; if I set max_length to some small value, it works with just the padding/max_length args.

Edit: I’m also thinking, maybe this is intended behaviour and I just don’t quite understand it, lol. :hugs:
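
In case it helps anyone later, here is a small sketch of the difference as I understand it (checkpoint name assumed): padding=True only pads to the longest sequence in the batch, while padding="max_length" pads everything up to max_length.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
batch = ["a short text", "a slightly longer piece of text"]

# pads only to the longest sequence in this batch (max_length is just the truncation limit)
dynamic = tokenizer(batch, return_tensors="tf", padding=True, truncation=True, max_length=512)

# pads every sequence all the way up to max_length, which a Keras input layer needs
fixed = tokenizer(batch, return_tensors="tf", padding="max_length", truncation=True, max_length=512)

print(dynamic["input_ids"].shape)  # e.g. (2, 8) - depends on the batch contents
print(fixed["input_ids"].shape)    # (2, 512) - fixed, independent of the batch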
