Why do Pipelines allow more than 512 tokens?

Oweys · November 7, 2022, 1:49pm

Hey,

I noticed that some pipelines accept inputs that exceed the 512-token limit.

For instance, trying to use a string with more than 35 000 tokens:

import requests

url = 'https://de.wikipedia.org/wiki/Gesch%C3%A4ftsbericht'
r = requests.get(url)
doc = r.text # raw page text, more than 35 000 tokens!
question = "Wie teuer ist ein Geschäftsbericht?"

This works perfectly fine with the pipeline, apparently without any truncation (the answer is at the end of the string):

from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="deepset/gelectra-base-germanquad",
    tokenizer="deepset/gelectra-base-germanquad"
)


qa_pipeline({
    "context": doc,
    'question': question
}) 

>>> [out]: 'Über 100 0000', score: 0.43

This does not work if the model is loaded and called directly:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/gelectra-base-germanquad")

model = AutoModelForQuestionAnswering.from_pretrained("deepset/gelectra-base-germanquad")
# No truncation is requested here, so all ~35 000 tokens are passed to the model
encoding = tokenizer(question, doc, return_tensors="pt", max_length=10000)
outputs = model(**encoding)

>>> [out]: ...
'RuntimeError: The size of tensor a (35664) must match the size of tensor b (512) at non-singleton dimension 1'

I know that BERT-style models are limited to 512 tokens and that longer inputs need to be truncated. But why does that not apply to the pipeline?
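One likely explanation, not confirmed anywhere in this thread: the question-answering pipeline does not feed the whole context to the model at once. It splits the context into overlapping windows that each fit the model, runs the model on every window, and returns the best-scoring answer. The pipeline exposes `max_seq_len` and `doc_stride` arguments for this. The sketch below only illustrates the windowing idea in plain Python; `split_into_windows` is a hypothetical helper, not the library's code, and the values 384 and 128 are assumed for illustration.

```python
from typing import List


def split_into_windows(tokens: List[str], max_len: int, stride: int) -> List[List[str]]:
    """Split `tokens` into windows of at most `max_len` tokens.

    Consecutive windows overlap by `stride` tokens, so an answer that
    straddles a window boundary is still fully contained in one window.
    Hypothetical helper for illustration only.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")

    step = max_len - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Toy input standing in for a long tokenized context.
tokens = [f"t{i}" for i in range(1000)]
windows = split_into_windows(tokens, max_len=384, stride=128)

for i, window in enumerate(windows):
    print(i, window[0], window[-1], len(window))
```

Calling the model directly, as in the snippet above, skips this step, which would explain why the 35 664-token input collides with the 512 position embeddings.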

Yanis · April 4, 2023, 6:26am

Hey @Oweys! Do you have an explanation for this? I was wondering the same thing: I fine-tuned a token classification model with a limit of 1024 tokens, but the pipeline object processes and detects tokens from the entire document!
