Tokenizer truncation

I’m trying to run sequence classification with a trained DistilBERT model, but I can’t get truncation to work properly, and I keep getting:

RuntimeError: The size of tensor a (N) must match the size of tensor b (512) at non-singleton dimension 1.

I can work around it by manually truncating all the documents I pass into the classifier, but that’s really not ideal.
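Roughly, the manual workaround looks like this (a sketch only — the function name `truncate_ids` and the exact clipping rule are illustrative, not my actual code; DistilBERT’s 512-token limit is assumed):

```python
# Sketch of the manual workaround: clip each document's token ids to the
# model's maximum sequence length before it reaches the classifier.
MAX_LEN = 512  # DistilBERT's position-embedding limit

def truncate_ids(input_ids, max_len=MAX_LEN):
    """Keep at most max_len token ids, preserving the final token
    (typically [SEP]) so the sequence stays well-formed."""
    if len(input_ids) <= max_len:
        return list(input_ids)
    return list(input_ids[:max_len - 1]) + [input_ids[-1]]

# A 600-token document is clipped to exactly 512 ids.
doc = list(range(600))
clipped = truncate_ids(doc)
```

Doing this by hand for every input is exactly the boilerplate I’d expect the tokenizer’s truncation option to handle for me.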

Here is my setup for the pipeline:

from transformers import (
    AutoTokenizer,
    DistilBertConfig,
    DistilBertForSequenceClassification,
    TextClassificationPipeline,
)

model_dir = "./classifier_52522_3"
tokenizer = AutoTokenizer.from_pretrained(
    model_dir, model_max_length=512, max_length=512, padding="max_length", truncation=True
)
config = DistilBertConfig.from_pretrained(model_dir)
model = DistilBertForSequenceClassification(config)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
I have also tried adding the truncation params directly to the saved tokenizer_config.json file, but no dice.


Can you provide a bit more information about this, notably the call stack (or its most relevant subset)? I can’t offer a fix, but I’d like to compare it with an issue of my own that I think is similar.

From scouting this forum, there are probably quite a few beginners who get stuck on this as well while trying to adapt example fine-tuning workflows to their own data.

I get a similar error message when calling the train method on a Trainer. From a beginner’s perspective this is frustrating, as the value N of tensor a does not come from an obvious place.

I can see in debug mode in my case that this tensor dimension of size N arises from inputs_embeds = self.word_embeddings(input_ids) somewhere in transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Embeddings, but this is hardly comforting for the newbie (me).
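A toy reproduction of that mismatch (an illustration only, not DeBERTa’s actual code; the shapes are made up): the embedding layer produces a tensor whose sequence dimension is the untruncated input length N, and adding position embeddings that are fixed at 512 positions then fails with exactly this kind of error:

```python
import torch

# Token embeddings for an untruncated 600-token input
# (batch, sequence length N, hidden size)...
token_embeds = torch.zeros(1, 600, 16)
# ...plus position embeddings fixed at 512 positions.
pos_embeds = torch.zeros(1, 512, 16)

try:
    _ = token_embeds + pos_embeds
except RuntimeError as err:
    # Something like: "The size of tensor a (600) must match the size
    # of tensor b (512) at non-singleton dimension 1"
    print(err)
```

So the mysterious N is just the token count of the longest un-truncated input.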

I am not using the TextClassificationPipeline like you are, but I gather the exception probably occurs at an equivalent place (during training)?

I’ve just published a blog post, Lithology classification using Hugging Face, part 2 with much more details about this.