Number of tokens (2331) exceeded maximum context length (512) error, even though the model supports an 8k context length

I've loaded Mistral-7B on AWS SageMaker; the model supports an 8k context length.

But I'm getting the error:

Number of tokens (2332) exceeded maximum context length (512).

on the line:

print(llm("""{Something with 3000 tokens}"""))

With the model's context length of 8k tokens, how can a "Number of tokens (2332) exceeded maximum context length (512)" error arise?

The code I'm using is below.
Imports:

# Base ctransformers with no GPU acceleration
!pip install ctransformers
# Or with CUDA GPU acceleration
!pip install ctransformers[cuda]
# Or with AMD ROCm GPU acceleration (Linux only)
!CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
# Or with Metal GPU acceleration for macOS systems only
!CT_METAL=1 pip install ctransformers --no-binary ctransformers

Code to load and run the model:

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GGUF",
    model_file="mistral-7b-v0.1.Q5_K_M.gguf",
    model_type="mistral",
    gpu_layers=0,
)
print(llm("""{Something with 3000 tokens}"""))

I'm using an AWS SageMaker Studio Lab notebook that provides a CPU instance:

instance - t3.xlarge
vCPUs - 4
memory - 16GB

By default the context is capped at 512 tokens (that's the limit shown in the error message); you can expand the context length with a config parameter.

You could try this:

from ctransformers import AutoModelForCausalLM,AutoConfig

config = AutoConfig.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF")
# Explicitly set the max_seq_len
config.max_seq_len = 4096
config.max_answer_len= 1024

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GGUF",
    model_file="mistral-7b-v0.1.Q5_K_M.gguf",
    model_type="mistral",
    gpu_layers=0,
    config=config,
)
print(llm("""{Something with 3000 tokens}"""))

I have the same issue even when using the scripts above. Has anybody found a solution? Here is my testing code and the results.
from ctransformers import AutoModelForCausalLM, AutoConfig

model_name = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_path = r"D:\Mistral\mistral-7b-instruct-v0.1.Q6_K.gguf"
config = AutoConfig.from_pretrained(model_name)
config.max_seq_len = 4096
config.max_answer_len = 1024
llm = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_path, model_type="mistral", gpu_layers=0, config=config)

prompt = """
Please summarize below article in one sentence.
####
... 600 words
"""
print(llm(prompt))

(myenv) C:\myenv\test >python C:\myenv\test\mytest.py
…
Number of tokens (893) exceeded maximum context length (512).
Number of tokens (894) exceeded maximum context length (512).
Number of tokens (895) exceeded maximum context length (512).
{'input': 'Summary: ', 'text': ' 10006 7, 8, 4 3\n\t2121212983'}
duration: 119.195716

Set the config like below:
config.config.max_new_tokens = 2048
config.config.context_length = 4096

It solved the issue for me.
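
For reference, here is a minimal sketch that combines this nested-config fix with the loading code from the original post (model and file names reused from above; treat it as an illustration of this reply, not tested output):

from ctransformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF")
# The attributes that take effect live on the inner config object
config.config.max_new_tokens = 2048
config.config.context_length = 4096

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GGUF",
    model_file="mistral-7b-v0.1.Q5_K_M.gguf",
    model_type="mistral",
    gpu_layers=0,
    config=config,
)
print(llm("""{Something with 3000 tokens}"""))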

It does not work for me either. I got "Number of tokens (692) exceeded maximum context length (512)".

This worked for me:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-beta-GGUF", 
                                           model_file="zephyr-7b-beta.Q5_K_M.gguf", 
                                           model_type="mistral", 
                                           gpu_layers=50,
                                           max_new_tokens = 1000,
                                           context_length = 6000)

No warnings output.

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGUF", model_file="llama-2-7b.Q6_K.gguf", model_type="llama", context_length=4096, max_new_tokens=4096, gpu_layers=0)

This worked for me.
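
If a setting doesn't seem to take effect, a quick sanity check is to compare the loaded model's effective context window against the token count of your prompt. A minimal diagnostic sketch, assuming your ctransformers build exposes llm.tokenize() and a context_length property (worth verifying for your version):

# Diagnostic: confirm the effective context window and the prompt size.
prompt = "..."  # your long prompt here
tokens = llm.tokenize(prompt)
print("effective context length:", llm.context_length)
print("prompt tokens:", len(tokens))
# Remember to leave headroom for generation (max_new_tokens) on top of the prompt
if len(tokens) > llm.context_length:
    print("Prompt alone exceeds the context window; raise context_length or shorten the prompt.")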

Also, I think I read somewhere that ctransformers only supports Llama and two other base models for context-length changes. Please correct me if I am wrong.