StarCoder generates output very slowly even for a simple prompt

Hi everyone,

I’m working with the StarCoder model on my local machine, but I’m seeing significant delays even for simple prompts. Here are the details:

System Information:

OS: Windows 11
Processor: [Insert your processor information, e.g., Intel i7/i9 or AMD equivalent]
RAM: 16 GB
GPU: NVIDIA [Insert model, e.g., RTX 3060, RTX 3070, etc.] with CUDA enabled
Python Version: 3.11
Transformers Version: [Insert your installed transformers version, e.g., 4.46.3]
Torch Version: [Add your PyTorch version, e.g., torch 2.1.0]

Issue

I’m using the following simple prompt for testing the StarCoder model:

“Write a Python function to calculate the factorial of a number.”

Despite the simplicity of this prompt, the model hangs for a long time after it logs Setting pad_token_id to eos_token_id. I waited over 40 minutes, and no response was generated. My cache seems set up correctly, and the model loads relatively quickly (approx. 7 seconds).

Here’s the full code I’m running:

import os
import time
import logging
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

# Logging setup
logging.basicConfig(
    filename="generation_log.txt",
    filemode="w",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
log = logging.getLogger()

# Timer utility
def timer_start():
    return time.time()

def timer_stop(start_time, name):
    elapsed = time.time() - start_time
    log.info(f"[{name}] completed in {elapsed:.2f} seconds.")

# Initialize variables
device = "cuda" if torch.cuda.is_available() else "cpu"
cache_dir = "./cache"
model_name = "bigcode/starcoder"
log.info(f"Using device: {device}")
log.info(f"Cache directory: {os.path.abspath(cache_dir)}")

# Load model and tokenizer
start = timer_start()
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
timer_stop(start, "Tokenizer load")

start = timer_start()
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    device_map="auto",
    offload_folder="./offload"
)
timer_stop(start, "Model load")

# Configure generation
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

prompt = "Write a Python function to calculate the factorial of a number."
log.info(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Text generation
start = timer_start()
try:
    output = model.generate(**inputs, generation_config=generation_config)
    timer_stop(start, "Text generation")
    result = tokenizer.decode(output[0], skip_special_tokens=True)
    log.info(f"Generated Result:\n{result}")
except Exception as e:
    log.error(f"Error during generation: {e}")

The code itself doesn’t seem to be the problem, and it’s also very likely that the CUDA build of torch is installed correctly (your script logs which device it picked, so you can confirm that in generation_log.txt).
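
For what it’s worth, a quick sanity check (independent of StarCoder) that the installed torch build can actually see the GPU:

import torch

print(torch.__version__)              # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())      # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    # Total VRAM of the first GPU, in GiB
    print(torch.cuda.get_device_properties(0).total_memory / 1024**3)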

The most likely cause of the slowness is insufficient GPU VRAM. StarCoder has about 15.5B parameters, so loaded at the default full precision it needs on the order of 60 GB of memory; even 40 GB of VRAM isn’t enough. With device_map="auto" and an offload_folder, the layers that don’t fit are placed in CPU RAM or on disk, and paging them back in for every generation step is what turns a simple prompt into a 40-minute wait.
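
If you want to confirm that offloading is what is slowing things down, one check (assuming the model object from your script, loaded with device_map="auto") is to look at where accelerate actually placed each module:

# Modules mapped to "cpu" or "disk" have to be paged in on every forward pass,
# which is extremely slow compared to keeping them in VRAM.
print(model.hf_device_map)

# Approximate size of the loaded weights, in GiB.
print(model.get_memory_footprint() / 1024**3)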