Hi, I’m trying to use DistilBERT as a layer in Keras. However, the tokenizer doesn’t pad to a fixed length, only to some minimum that depends on the batch. From what I’ve read, I guess that is expected.
That doesn’t work for me, though, since the input layer of the combined model needs a fixed length.
Can I somehow make sure the tokenizer always pads to max_length?
Thanks for any help and insights.
Hello and welcome to our forum!
When you’re passing your sequences, can you set padding to “max_length” and pass a value to the max_length argument, like so:
tokenizer(sequence, return_tensors="tf", padding="max_length", max_length=15)
Let me know if it works.
Thanks a lot. This is something I tried, along with many other variations, and it didn’t work.
I also tried padding=True with max_length=512 (or just some other fixed length).
Weirdly, the deprecated argument pad_to_max_length=True does work, and the outputs are padded to the full length. Very strange, but I’m happy this works now.
I’m thinking the issue may be that none of the texts in the dataset are very long; if I set max_length to some small value, it works with just the padding/max_length args.
Edit: I’m also thinking that maybe this is intended behaviour and I just don’t quite understand it, lol.
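For what it’s worth, the observations above look consistent with how the padding modes are documented: padding=True pads only to the longest sequence in the batch, while padding="max_length" pads to the value of max_length, and max_length only shortens anything when truncation is active. Below is a toy sketch of that logic in plain Python. It is not the library’s code; pad_batch, PAD_ID and the token ids are made up for illustration.

```python
PAD_ID = 0  # assumed padding token id, for illustration only


def pad_batch(batch, padding, max_length=None, truncation=False):
    """Toy model of the tokenizer's padding modes (hypothetical helper)."""
    # Truncation is what cuts long sequences down to max_length.
    if truncation and max_length is not None:
        batch = [ids[:max_length] for ids in batch]

    if padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs a max_length value")
        target = max_length  # fixed width, independent of the batch
    elif padding is True or padding == "longest":
        target = max(len(ids) for ids in batch)  # width depends on the batch
    else:
        return batch  # no padding requested

    # Sequences already at or above the target are left unchanged.
    return [ids + [PAD_ID] * (target - len(ids)) for ids in batch]


# Two short "tokenized" texts (made-up ids).
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]

longest = pad_batch(batch, padding=True, max_length=8)
fixed = pad_batch(batch, padding="max_length", max_length=8)
short = pad_batch(batch, padding=True, max_length=4, truncation=True)

print([len(ids) for ids in longest])
print([len(ids) for ids in fixed])
print([len(ids) for ids in short])
```

With short texts, padding=True never reaches max_length, which would explain the variable widths. A small max_length combined with truncation makes every row the same width, which matches the "small length works" observation. For a fixed Keras input shape, the equivalent real call would presumably be tokenizer(texts, padding="max_length", truncation=True, max_length=N, return_tensors="tf"), assuming a reasonably recent transformers version.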