How to stop at 512 tokens when sending text to pipeline?

roxl12 · February 6, 2024, 2:26pm

Hi!

I want to test my model using the Transformers pipeline. My model is a pretrained BERT, which works great if the given text is shorter than 512 tokens. However, when I send a longer text to the pipeline, it breaks because the input is too long. I searched, but couldn’t figure out how to solve this issue.

This is my code:

from transformers import pipeline

def get_predicted_folder(text, model):
    # Build a text-classification pipeline from the saved model
    pipe = pipeline("text-classification", model=model)
    if text:
        predicted_folder = pipe(text)
        label = predicted_folder[0]['label']
        score = predicted_folder[0]['score']
        return label, score
    else:
        err = "Error: The provided text is empty."
        return err, None

my_saved_model = "model/danish_bert_model"  # it is saved locally
label, score = get_predicted_folder(text, my_saved_model)

The tokenizer inside the model looks like this:

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
dataset = dataset.map(lambda examples: tokenize_dataset(tokenizer, examples), batched=True)

It gives me this error:
RuntimeError: The size of tensor a (1593) must match the size of tensor b (512) at non-singleton dimension 1

I tried passing tokenizer=model to the pipeline, and creating tokenizer = AutoTokenizer.from_pretrained(model_ckpt) before calling the get_predicted_folder method, but it doesn’t solve the issue.

Can someone please help me?

Thanks so much in advance!

roxl12 · February 6, 2024, 6:11pm

I solved this. If anyone is curious how: I just added the tokenizer, maximum length, and truncation to the pipe:

pipe = pipeline("text-classification", model=model, tokenizer=model_path, max_length=512, truncation=True)
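For anyone who wants the whole thing in one place, here is a rough sketch of how that line might fit back into the function from the question. It assumes model_path is the same local folder that is passed as the model. The names build_pipe and MAX_TOKENS are made up for this example, and the pipeline is built once and passed in rather than rebuilt on every call. Treat it as one possible arrangement, not the only way to do it.

```python
MAX_TOKENS = 512  # the limit shown in the error message (tensor b)


def build_pipe(model_path):
    """Build a text-classification pipeline that truncates long inputs.

    build_pipe is a hypothetical helper name used for this sketch.
    """
    # Imported here so get_predicted_folder can be used on its own
    # (for example with a stand-in pipe) without loading transformers.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model=model_path,
        tokenizer=model_path,   # assumes the tokenizer lives in the same folder
        max_length=MAX_TOKENS,
        truncation=True,        # cut the input off at max_length tokens
    )


def get_predicted_folder(text, pipe):
    """Return (label, score) for text, or (error message, None) if empty."""
    if not text:
        return "Error: The provided text is empty.", None
    prediction = pipe(text)[0]
    return prediction["label"], prediction["score"]
```

With this layout you would call pipe = build_pipe("model/danish_bert_model") once and then get_predicted_folder(text, pipe) for each text. Keep in mind that truncation simply drops everything after the first 512 tokens, so the prediction is based only on the beginning of a long text.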

