How can I make sure Tokenizer pads to a fixed length?

Hi, I’m trying to use DistilBERT as a layer in Keras; however, the tokenizer doesn’t pad to a fixed length but rather just to some minimum length that depends on the batch. Reading up on it, I guess that is expected.
However, that doesn’t work for me, since the input layer (I’m combining it with other inputs) needs a fixed length.
Can I somehow make sure the tokenizer always pads to max_length?

Thanks for any help and insights :slight_smile:

Hello and welcome to our forum :hugs:

When you’re passing your sequences, can you set padding to “max_length” and pass a value to the max_length argument like so:

tokenizer(sequence, return_tensors="tf", padding="max_length", max_length=15)
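
For a slightly fuller sketch (the distilbert-base-uncased checkpoint is just an assumption here, use whichever one you have), the output should always come back with exactly max_length tokens:

from transformers import AutoTokenizer

# assumed checkpoint; swap in the one you actually use
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer(
    ["a short sentence", "another one"],
    return_tensors="tf",
    padding="max_length",   # pad every sequence up to max_length
    truncation=True,        # cut anything longer than max_length
    max_length=15,
)
print(encoded["input_ids"].shape)  # -> (2, 15), regardless of the batch contents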

Let me know if it works :slightly_smiling_face:


Hi merve,
thanks a lot. This is something I tried, along with many other things, but it didn’t work.
I also tried padding=True with max_length=512 (or just some other fixed length).

Weirdly, the deprecated argument pad_to_max_length=True does pad, and the outputs come out at the full length. Very strange, but I’m happy this works now. :slight_smile:

I’m thinking the issue is maybe that none of the texts in the dataset are very long; if I set max_length to some small value, it works with just the padding/max_length args.

Edit: I’m also thinking, maybe this is intended behaviour and I just don’t quite understand it, lol. :hugs:
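
In case it helps anyone later, here is a small sketch of the difference as I understand it (checkpoint name assumed): padding=True only pads to the longest sequence in the batch, while padding="max_length" pads everything up to max_length.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
batch = ["a short text", "a slightly longer piece of text"]

# pads only to the longest sequence in this batch (max_length is just the truncation limit)
dynamic = tokenizer(batch, return_tensors="tf", padding=True, truncation=True, max_length=512)

# pads every sequence all the way up to max_length, which a Keras input layer needs
fixed = tokenizer(batch, return_tensors="tf", padding="max_length", truncation=True, max_length=512)

print(dynamic["input_ids"].shape)  # e.g. (2, 8) - depends on the batch contents
print(fixed["input_ids"].shape)    # (2, 512) - fixed, independent of the batch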
