StarCoder generates output very slowly even for a simple prompt

Hi everyone,

I’m working with the StarCoder model on my local machine, but I’m seeing significant delays even for simple prompts. Here are the details:

System Information:

OS: Windows 11
Processor: [Insert your processor information, e.g., Intel i7/i9 or AMD equivalent]
RAM: 16 GB
GPU: NVIDIA [Insert model, e.g., RTX 3060, RTX 3070, etc.] with CUDA enabled
Python Version: 3.11
Transformers Version: [Insert your installed transformers version, e.g., 4.46.3]
Torch Version: [Add your PyTorch version, e.g., torch 2.1.0]

Issue

I’m using the following simple prompt for testing the StarCoder model:

“Write a Python function to calculate the factorial of a number.”

Despite the simplicity of this prompt, the model hangs for a long time after it logs Setting pad_token_id to eos_token_id. I waited over 40 minutes, and no response was generated. My cache seems set up correctly, and the model loads relatively quickly (approx. 7 seconds).

Here’s the full code I’m running:

import os
import time
import logging
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

# Logging setup
logging.basicConfig(
    filename="generation_log.txt",
    filemode="w",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
log = logging.getLogger()

# Timer utility
def timer_start():
    return time.time()

def timer_stop(start_time, name):
    elapsed = time.time() - start_time
    log.info(f"[{name}] completed in {elapsed:.2f} seconds.")

# Initialize variables
device = "cuda" if torch.cuda.is_available() else "cpu"
cache_dir = "./cache"
model_name = "bigcode/starcoder"
log.info(f"Using device: {device}")
log.info(f"Cache directory: {os.path.abspath(cache_dir)}")

# Load model and tokenizer
start = timer_start()
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
timer_stop(start, "Tokenizer load")

start = timer_start()
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    device_map="auto",
    offload_folder="./offload"
)
timer_stop(start, "Model load")

# Configure generation
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

prompt = "Write a Python function to calculate the factorial of a number."
log.info(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Text generation
start = timer_start()
try:
    output = model.generate(**inputs, generation_config=generation_config)
    timer_stop(start, "Text generation")
    result = tokenizer.decode(output[0], skip_special_tokens=True)
    log.info(f"Generated Result:\n{result}")
except Exception as e:
    log.error(f"Error during generation: {e}")

The code itself doesn’t seem to be the problem, and it’s also very likely that the CUDA build of torch is installed correctly (your script logs which device it picked, so you can confirm that in generation_log.txt).
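
For what it’s worth, a quick sanity check (independent of StarCoder) that the installed torch build can actually see the GPU:

import torch

print(torch.__version__)              # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())      # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    # Total VRAM of the first GPU, in GiB
    print(torch.cuda.get_device_properties(0).total_memory / 1024**3)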

The most likely cause of the slowness is insufficient GPU VRAM. StarCoder has about 15.5B parameters, so loaded at the default full precision it needs on the order of 60 GB of memory; even 40 GB of VRAM isn’t enough. With device_map="auto" and an offload_folder, the layers that don’t fit are placed in CPU RAM or on disk, and paging them back in for every generation step is what turns a simple prompt into a 40-minute wait.
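
If you want to confirm that offloading is what is slowing things down, one check (assuming the model object from your script, loaded with device_map="auto") is to look at where accelerate actually placed each module:

# Modules mapped to "cpu" or "disk" have to be paged in on every forward pass,
# which is extremely slow compared to keeping them in VRAM.
print(model.hf_device_map)

# Approximate size of the loaded weights, in GiB.
print(model.get_memory_footprint() / 1024**3)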