Hi everyone,
I’m working with the StarCoder model on my local machine, but I’m seeing very long generation times even for simple prompts. Here are the details:
System Information:
OS: Windows 11
Processor: [Insert your processor information, e.g., Intel i7/i9 or AMD equivalent]
RAM: 16 GB
GPU: NVIDIA [Insert model, e.g., RTX 3060, RTX 3070, etc.] with CUDA enabled
Python Version: 3.11
Transformers Version: [Insert your installed version of transformers, e.g., 4.46.3]
Torch Version: [Add your PyTorch version, e.g., torch 2.1.0]
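For reference, this small snippet prints the relevant versions and GPU details (these are standard torch/transformers attributes, nothing StarCoder-specific):

import torch
import transformers

# Print library versions and CUDA/GPU details for the report above.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", torch.cuda.get_device_properties(0).total_memory / 1024**3)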
Issue
I’m using the following simple prompt for testing the StarCoder model:
“Write a Python function to calculate the factorial of a number.”
Despite the simplicity of this prompt, the model hangs for an extended time after the "Setting pad_token_id to eos_token_id" log message appears. I waited over 40 minutes and no response was generated. My cache seems to be set up correctly, and model loading is relatively fast (about 7 seconds).
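One thing I suspect is that device_map="auto" is offloading most of the 15B model to CPU or disk, which would explain the slowness. If I understand the accelerate integration correctly, the placement is recorded in model.hf_device_map after loading, so a quick check like this should show where the layers actually ended up:

from collections import Counter

# hf_device_map records which device each module was placed on when the model
# is loaded with device_map="auto". Many "cpu" or "disk" entries would mean
# generation is bottlenecked by offloading rather than running on the GPU.
print(model.hf_device_map)
print(Counter(model.hf_device_map.values()))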
Here’s the full code I’m running:
import os
import logging
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
# Logging setup
logging.basicConfig(
    filename="generation_log.txt",
    filemode="w",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
log = logging.getLogger()
# Timer utility
import time
def timer_start():
    return time.time()

def timer_stop(start_time, name):
    elapsed = time.time() - start_time
    log.info(f"[{name}] completed in {elapsed:.2f} seconds.")
# Initialize variables
device = "cuda" if torch.cuda.is_available() else "cpu"
cache_dir = "./cache"
model_name = "bigcode/starcoder"
log.info(f"Using device: {device}")
log.info(f"Cache directory: {os.path.abspath(cache_dir)}")
# Load model and tokenizer
start = timer_start()
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
timer_stop(start, "Tokenizer load")
start = timer_start()
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    device_map="auto",
    offload_folder="./offload",
)
timer_stop(start, "Model load")
# Configure generation
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
prompt = "Write a Python function to calculate the factorial of a number."
log.info(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Text generation
start = timer_start()
try:
    output = model.generate(**inputs, generation_config=generation_config)
    timer_stop(start, "Text generation")
    result = tokenizer.decode(output[0], skip_special_tokens=True)
    log.info(f"Generated Result:\n{result}")
except Exception as e:
    log.error(f"Error during generation: {e}")
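To tell whether generation is merely very slow rather than completely stuck, I’m also thinking of attaching a streamer. As far as I know, TextStreamer can be passed directly to generate(), so a sketch like the one below (reusing the tokenizer, inputs, and generation_config from the code above) should print tokens as they are produced:

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated, so a slow but
# working generation is distinguishable from a genuine hang.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

start = timer_start()
output = model.generate(
    **inputs,
    generation_config=generation_config,
    streamer=streamer,
)
timer_stop(start, "Streaming generation")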