Tokenizer behaviour with pipeline

Hey all,

I am trying to use one of the Llama2 models to generate text (summarise docs). This means that I am dealing with long texts that need to be truncated. Upon using the Pipeline API I have discovered some weird behaviour.

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaConfig

tokenizer = AutoTokenizer.from_pretrained(
    local_model_dir,
    padding_side="right",
    model_max_length=MAX_LENGTH,
    truncation=True,
)

# the value in the source config is wrong; the context should be 4k
updated_config = LlamaConfig.from_pretrained(
    local_model_dir, max_position_embeddings=MAX_POS_EMBEDDINGS
)

# 8-bit precision so the model fits on the GPU -> maybe try 4-bit?
model = AutoModelForCausalLM.from_pretrained(
    local_model_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
    load_in_8bit=True,
    config=updated_config,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)

It seems to me that when I call pipeline(long_input), the input text is not truncated correctly. I checked this by calling pipeline.preprocess(long_input), which produces the following warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (12351 > 3896).
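For reference, this is roughly the check I mean. preprocess only runs the tokenization step of the pipeline, and the shape check below assumes it returns PyTorch tensors (which it does in my setup):

# Run only the preprocessing (tokenization) step of the pipeline and
# inspect how many tokens it produces for the long input.
model_inputs = pipeline.preprocess(long_input)  # emits the warning above
print(model_inputs["input_ids"].shape)          # torch.Size([1, 12351]) -> nothing was truncated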

I have come to understand that truncation only happens when calling the tokenizer directly, i.e. tokenizer(long_input, truncation=True, max_length=MAX_LENGTH). However, when I try to pass truncation to the pipeline call itself, i.e. pipeline(long_input, truncation=True), it throws an error:

The following model_kwargs are not used by the model: ['truncation'] (note: typos in the generate arguments will also show up in this list)
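For completeness, calling the tokenizer directly (as described above) does truncate the way I would expect. A minimal check, using the same MAX_LENGTH I passed as model_max_length:

# Tokenizing directly with explicit truncation caps the sequence length.
encoded = tokenizer(long_input, truncation=True, max_length=MAX_LENGTH)
print(len(encoded["input_ids"]))  # <= MAX_LENGTH, as expected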

So this all begs the question: how do I make truncation happen while using a pipeline? To me it seems intuitive that when truncation=True is set while initialising the tokenizer, and that tokenizer is passed to the pipeline, all downstream texts should be truncated accordingly.
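Of course I could truncate the text myself before handing it to the pipeline, along the lines of the rough sketch below, but that feels like it defeats the point of configuring the tokenizer in the first place:

# Rough workaround sketch: tokenize with truncation, then decode back to a
# string so the pipeline's own tokenization stays within the context window.
ids = tokenizer(long_input, truncation=True, max_length=MAX_LENGTH)["input_ids"]
truncated_text = tokenizer.decode(ids, skip_special_tokens=True)
summary = pipeline(truncated_text)

Is there a supported way to have the pipeline handle this instead, or am I missing an argument somewhere?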