I am trying to use one of the Llama2 models to generate text (summarise docs). This means that I am dealing with long texts that need to be truncated. Upon using the Pipeline API I have discovered some weird behaviour.
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaConfig

tokenizer = AutoTokenizer.from_pretrained(
    local_model_dir,
    padding_side="right",
    model_max_length=MAX_LENGTH,
    truncation=True,
)

# error in source, context should be 4k
updated_config = LlamaConfig(max_position_embeddings=MAX_POS_EMBEDDINGS)

# 8 bit precision to fit GPU -> maybe try 4 bits?
model = AutoModelForCausalLM.from_pretrained(
    local_model_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
    load_in_8bit=True,
    config=updated_config,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)
It seems to me that when I call
pipeline(long_input)
the input text is not correctly truncated. I checked this by calling
pipeline.preprocess(long_input)
which results in the following message:
Token indices sequence length is longer than the specified maximum sequence length for this model (12351 > 3896).
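For reference, this is roughly how I inspect the result of that call (preprocess is the tokenization step the pipeline runs before generation; names as in my setup above):

encoded = pipeline.preprocess(long_input)
# input_ids still has the full, untruncated length (12351 tokens here)
print(encoded["input_ids"].shape)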
I have come to understand that truncation is only done when calling the tokenizer directly:
tokenizer(long_input, truncation=True, max_length=MAX_LENGTH)
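A quick check along these lines (same tokenizer, MAX_LENGTH and long_input as above) shows the difference:

# Without explicit arguments the constructor-time settings are ignored
ids_plain = tokenizer(long_input)["input_ids"]
print(len(ids_plain))        # 12351, triggers the same warning as above

# With explicit truncation the sequence is cut to MAX_LENGTH
ids_truncated = tokenizer(long_input, truncation=True, max_length=MAX_LENGTH)["input_ids"]
print(len(ids_truncated))    # MAX_LENGTH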
However, when I try to pass truncation to the pipeline like this:
pipeline(long_input, truncation=True)
it throws an error:
The following model_kwargs are not used by the model: ['truncation'] (note: typos in the generate arguments will also show up in this list)
So this all begs the question: how do I make truncation happen while using a pipeline? To me it seems intuitive that when truncation=True is set while initialising the tokenizer and that tokenizer is added to the pipeline, all downstream texts should be truncated accordingly.
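For now, the only workaround I can think of is to truncate the text myself before handing it to the pipeline. A rough sketch using the names from my setup above (max_new_tokens is just a placeholder value), though I would prefer a proper pipeline-level solution:

# Truncate to MAX_LENGTH tokens and decode back to text
ids = tokenizer(long_input, truncation=True, max_length=MAX_LENGTH)["input_ids"]
truncated_input = tokenizer.decode(ids, skip_special_tokens=True)

# The pipeline now receives text that already fits the model's context window
summary = pipeline(truncated_input, max_new_tokens=256)[0]["generated_text"]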