Tokenizer truncation

afriedman412 · May 29, 2022, 8:20pm

I’m trying to run sequence classification with a trained DistilBERT model, but I can’t get truncation to work properly and I keep getting:

RuntimeError: The size of tensor a (N) must match the size of tensor b (512) at non-singleton dimension 1.

I can work around it by manually truncating all the documents I pass into the classifier, but that’s really not ideal.

Here is my setup for the pipeline:

from transformers import (
    AutoTokenizer,
    DistilBertConfig,
    DistilBertForSequenceClassification,
    TextClassificationPipeline,
)

model_dir = "./classifier_52522_3"
tokenizer = AutoTokenizer.from_pretrained(
    model_dir, model_max_length=512, max_length=512, padding="max_length", truncation=True
)
config = DistilBertConfig.from_pretrained(model_dir)
model = DistilBertForSequenceClassification(config)

pipe = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    return_all_scores=True
)

I have tried adding the truncation params directly to the saved tokenizer_config.json file too but no dice.

Thanks!
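
For anyone landing here with the same error, here is a rough sketch of one thing that may help. As far as I can tell, `truncation` and `padding` passed to `AutoTokenizer.from_pretrained` are not applied when the pipeline later calls the tokenizer, whereas recent versions of transformers forward extra keyword arguments given at call time to the tokenizer. The helper names `build_pipeline` and `classify_truncated` are made up for illustration, and the behaviour should be checked against your installed version. Note also that the setup above builds the model from the config alone, which leaves the weights randomly initialised; the sketch loads them with `from_pretrained` instead.

```python
def build_pipeline(model_dir):
    """Load the tokenizer and fine-tuned weights saved in model_dir (hypothetical helper)."""
    # Imported inside the function so classify_truncated can be tried
    # with any callable, even where transformers is not installed.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        TextClassificationPipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # from_pretrained restores the saved weights; building the model from
    # a config object alone would start from random weights.
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    return TextClassificationPipeline(model=model, tokenizer=tokenizer)


def classify_truncated(pipe, texts, max_length=512):
    """Call the pipeline with truncation requested at call time (hypothetical helper).

    Assumption: the pipeline forwards these keyword arguments to its tokenizer.
    """
    return pipe(texts, truncation=True, max_length=max_length)
```

Usage would look like `classify_truncated(build_pipeline("./classifier_52522_3"), documents)`, where `documents` is a list of strings. If it works as assumed, inputs longer than 512 tokens are cut before they reach the model, so manual truncation is no longer needed.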
