Tokenizer truncation

I’m trying to run sequence classification with a trained DistilBERT model, but I can’t get truncation to work properly, and I keep getting:

RuntimeError: The size of tensor a (N) must match the size of tensor b (512) at non-singleton dimension 1.

I can work around it by manually truncating all the documents I pass into the classifier, but that’s really not ideal.
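Roughly, the manual workaround looks like this (a sketch only — the function name `truncate_ids` and the exact clipping rule are illustrative, not my actual code; DistilBERT’s 512-token limit is assumed):

```python
# Sketch of the manual workaround: clip each document's token ids to the
# model's maximum sequence length before it reaches the classifier.
MAX_LEN = 512  # DistilBERT's position-embedding limit

def truncate_ids(input_ids, max_len=MAX_LEN):
    """Keep at most max_len token ids, preserving the final token
    (typically [SEP]) so the sequence stays well-formed."""
    if len(input_ids) <= max_len:
        return list(input_ids)
    return list(input_ids[:max_len - 1]) + [input_ids[-1]]

# A 600-token document is clipped to exactly 512 ids.
doc = list(range(600))
clipped = truncate_ids(doc)
```

Doing this by hand for every input is exactly the boilerplate I’d expect the tokenizer’s truncation option to handle for me.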

Here is my setup for the pipeline:

from transformers import (
    AutoTokenizer,
    DistilBertConfig,
    DistilBertForSequenceClassification,
    TextClassificationPipeline,
)

model_dir = "./classifier_52522_3"
tokenizer = AutoTokenizer.from_pretrained(
    model_dir, model_max_length=512, max_length=512, padding="max_length", truncation=True
)
config = DistilBertConfig.from_pretrained(model_dir)
model = DistilBertForSequenceClassification(config)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
I have also tried adding the truncation params directly to the saved tokenizer_config.json file, but no dice.


Can you provide a bit more information about this, notably the call stack (or its most relevant subset)? I can’t offer a fix, but I’d like to compare it with an issue of my own that I think is similar.

From scouting this forum, there are probably quite a few beginners who get stuck on this as well while trying to adapt example fine-tuning workflows to their own data.

I get a similar error message when calling the train method on a Trainer. From a beginner’s perspective this is frustrating, as the value N of tensor a does not come from an obvious place.

I can see in debug mode in my case that this tensor dimension of size N arises from inputs_embeds = self.word_embeddings(input_ids) somewhere in transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Embeddings, but this is hardly comforting for the newbie (me).
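A toy reproduction of that mismatch (an illustration only, not DeBERTa’s actual code; the shapes are made up): the embedding layer produces a tensor whose sequence dimension is the untruncated input length N, and adding position embeddings that are fixed at 512 positions then fails with exactly this kind of error:

```python
import torch

# Token embeddings for an untruncated 600-token input
# (batch, sequence length N, hidden size)...
token_embeds = torch.zeros(1, 600, 16)
# ...plus position embeddings fixed at 512 positions.
pos_embeds = torch.zeros(1, 512, 16)

try:
    _ = token_embeds + pos_embeds
except RuntimeError as err:
    # Something like: "The size of tensor a (600) must match the size
    # of tensor b (512) at non-singleton dimension 1"
    print(err)
```

So the mysterious N is just the token count of the longest un-truncated input.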

I am not using the TextClassificationPipeline like you are, but I gather the exception probably occurs at an equivalent place (during training)?

I’ve just published a blog post, Lithology classification using Hugging Face, part 2 with much more details about this.