I am using CodeLlama on an Inference Endpoint, but I am unable to set `max_new_tokens` beyond roughly 1100 in the parameters:
"parameters": {
"max_new_tokens": 1024, # adjust this value to generate more tokens
"return_full_text": False,
}
It throws this error:

    {'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 345 `inputs` tokens and 1324 `max_new_tokens`', 'error_type': 'validation'}
Is there any way around this?
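
For context, the snippet above sits inside a standard POST to the endpoint; here is a minimal sketch of the full call, with the URL, token, and prompt as placeholders rather than my actual setup:

```python
# Minimal sketch of the request; URL, token, and prompt are placeholders.
import requests

API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer <hf_token>"}

payload = {
    "inputs": "def fibonacci(n):",  # illustrative prompt
    "parameters": {
        "max_new_tokens": 1024,  # adjust this value to generate more tokens
        "return_full_text": False,
    },
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
print(response.json())
```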
I thought I would have to tinker with the Endpoint side, but it looks like there is a way to do it.
Appreciate your response, but it is in fact a container configuration of the endpoint. I updated it and am now able to generate responses of up to 10,000 tokens.
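
For anyone hitting the same wall: assuming the endpoint runs text-generation-inference (the validation error format matches its checks), the ceiling comes from the container's max-total-tokens setting. A sketch of the relevant container environment variables; the values here are illustrative, not the exact ones from my endpoint:

```python
# Illustrative container environment for a TGI-backed Inference Endpoint.
# Names follow text-generation-inference's launcher options; values are
# examples only.
container_env = {
    "MAX_INPUT_LENGTH": "2048",   # upper bound on prompt tokens
    "MAX_TOTAL_TOKENS": "12288",  # prompt tokens + max_new_tokens must fit here
}
```

With the total-tokens limit raised, the `inputs` tokens + `max_new_tokens` check passes for longer generations.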