How to stop at 512 tokens when sending text to pipeline?

Hi!

I want to test my model using the Transformers pipeline. My model is a pretrained BERT, which works great if the given text is shorter than 512 tokens. However, when I send a longer text to the pipeline, it breaks because the input is too long. I searched but couldn't figure out how to solve this issue.

This is my code:

from transformers import pipeline

def get_predicted_folder(text, model):
    # Build a text-classification pipeline from the locally saved model
    pipe = pipeline("text-classification", model=model)
    if text:
        predicted_folder = pipe(text)
        label = predicted_folder[0]['label']
        score = predicted_folder[0]['score']
        return label, score
    else:
        err = "Error: The provided text is empty."
        return err, None

my_saved_model = "model/danish_bert_model"  # the model is saved locally
label, score = get_predicted_folder(text, my_saved_model)

The tokenizer inside the model looks like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
dataset = dataset.map(lambda examples: tokenize_dataset(tokenizer, examples), batched=True)
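
The tokenize_dataset helper isn't shown above; a minimal sketch of what it might look like, assuming the dataset has a "text" column (the column name and truncation settings here are assumptions, not my exact code):

def tokenize_dataset(tokenizer, examples):
    # Hypothetical helper: tokenize the "text" column, truncating to
    # BERT's 512-token limit so training inputs stay within range
    return tokenizer(examples["text"], truncation=True, max_length=512)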

It gives me this error:
RuntimeError: The size of tensor a (1593) must match the size of tensor b (512) at non-singleton dimension 1
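
The 1593 in the message is the token length of my input, and 512 is BERT's maximum sequence length. A quick way to confirm the text is too long (assuming the tokenizer loaded above):

# Compare the tokenized length of the input with the model's limit
print(len(tokenizer(text)["input_ids"]))  # 1593 for my text
print(tokenizer.model_max_length)         # 512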

I tried passing tokenizer=model to the pipeline, and also loading tokenizer = AutoTokenizer.from_pretrained(model_ckpt) before calling the get_predicted_folder method, but neither solved the issue.

Can someone please help me?

Thanks so much in advance!

I solved this. If anyone is curious how: I just added the tokenizer, maximum length, and truncation to the pipe:

pipe = pipeline("text-classification", model=model, tokenizer=model, max_length=512, truncation=True)
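
For anyone who wants the complete picture, the whole function with the fix applied looks roughly like this (a sketch; the pipe now truncates anything longer than 512 tokens):

from transformers import pipeline

def get_predicted_folder(text, model):
    # truncation=True + max_length=512 keeps every input within
    # BERT's position-embedding limit, so long texts no longer crash
    pipe = pipeline("text-classification", model=model, tokenizer=model,
                    max_length=512, truncation=True)
    if text:
        predicted_folder = pipe(text)
        return predicted_folder[0]['label'], predicted_folder[0]['score']
    return "Error: The provided text is empty.", None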
