Llama 3 so much slower compared to Ollama

Hi,
I tried the Llama 3 Instruct version using both Hugging Face and Ollama; the Hugging Face version is about 10 times slower, with both running on GPU. Do you know what the problem is?

My first assumption would be the quantisation that’s being used. Ollama typically defaults to the smallest q4 quantised version of the model, so if you download, for example, the fp16 version of the model manually, this can skew the results and make it seem like one is faster than the other. You can go to the Ollama web page and find the fp16 version, or whichever version you are manually downloading, for a more accurate comparison.

If this doesn’t seem to be the issue, please share the instructions you’re following for both so that we can see your process and hopefully provide more assistance.

Well, I applied quantization too. Here is the code:

import time
from threading import Thread

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TextIteratorStreamer)

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("./llama3",
                                          padding_side="left")
model = AutoModelForCausalLM.from_pretrained("./llama3",
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = model.eval()
tokenizer.pad_token = tokenizer.eos_token
model.generation_config.pad_token_id = tokenizer.eos_token_id
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, timeout=30)
generation_config = dict(
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    num_return_sequences=1,
    do_sample=True,
    temperature=0.9,
    top_p=0.7,
    top_k=40,
    num_beams=1,
)

And the code for generating an answer:

def generate_wrapper(chat):

    t0_1 = time.time()
    inputs = tokenizer.apply_chat_template(chat, return_tensors="pt",
                                           add_generation_prompt=True,
                                           padding=True,
                                           return_dict=True).to("cuda")
    t0_2 = time.time()
    print("input tokenizer time:", t0_2 - t0_1)

    t1 = time.time()
    model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        streamer=streamer,
        **generation_config,
    )
    t2 = time.time()
    diff = t2 - t1

    print("time took:", diff)

So I call it like this:

chat = []
chat.append(system_prompt)
while True:
    user_input = input("")
    if user_input == "break":
        break
    chat.append({"role": "user", "content": user_input})

    thread = Thread(target=generate_wrapper, args=(chat,))
    thread.start()
    decoded_answer = ""
    for new_word in streamer:
        new_word = new_word.replace(tokenizer.eos_token, "")
        print(new_word, end="")
        decoded_answer += new_word
    thread.join()
    chat.append({"role": "assistant", "content": decoded_answer})

I downloaded Llama 3 locally, but for both Ollama and Hugging Face I am using ‘Llama 3 Instruct’. The Ollama quantization method is Q4_0.
I should correct myself: the speed is not 10 times slower, but Ollama is at least 3 times faster.


I see. My next guess would be that maybe the tokenizer isn’t a fast tokenizer? The fast tokenizers are written in Rust, whereas the standard tokenizers are written in Python, so that could be a source of the slowdown. Maybe print the type of the tokenizer and check whether its is_fast property is True; if it’s not a fast tokenizer, you might be able to use something like LlamaTokenizerFast.from_pretrained to ensure the fast tokenizer is being used.
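
For example, something along these lines (a quick sketch, assuming the same local "./llama3" path you used above):

from transformers import AutoTokenizer

# use_fast=True asks for the Rust-backed tokenizer if one is available for the model
tokenizer = AutoTokenizer.from_pretrained("./llama3", use_fast=True)
print(type(tokenizer))
print(tokenizer.is_fast)  # True means the fast (Rust) tokenizer is in use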

Thanks, but I printed its type and it was

transformers.tokenization_utils_fast.PreTrainedTokenizerFast

and tokenizer.is_fast returns True, so I don’t think that’s the problem.
By the way, how much time does it take you to generate, for example, a sentence of about 15 words? This generated sentence:

Hello! It’s great to chat with you. I’m doing well, thanks for asking

took about 3.5 seconds using Hugging Face, while generating a sentence with this number of tokens takes less than a second in Ollama.
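
For reference, a rough way to compare the two in tokens per second rather than wall time per sentence (a sketch reusing the model and tokenizer objects from my code above, with greedy decoding so the run is deterministic):

import time
import torch

chat = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(chat, return_tensors="pt",
                                       add_generation_prompt=True,
                                       return_dict=True).to("cuda")

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")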

So has no one else faced this issue?

Same issue here.

I have not yet tried using GGUF files directly via transformers, and I use Quanto (instead of BNB), but I see the same performance differences.

Edit: this does not help in (our?) case, since transformers can currently only import GGUF. Working directly with llama_cpp seems to be essentially the same in terms of performance; this integration is currently a work in progress (Add support for llama.cpp ¡ Issue #27712 ¡ huggingface/transformers ¡ GitHub). I am going to try combining the llama-cpp bindings (GitHub - abetlen/llama-cpp-python: Python bindings for llama.cpp) with the low-level transformers API next.
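
For the bindings on their own, the shape of it is roughly this (a minimal sketch with a hypothetical local GGUF path; n_gpu_layers=-1 offloads all layers to the GPU):

from llama_cpp import Llama

# hypothetical path to a quantized GGUF file
llm = Llama(model_path="./llama3/Meta-Llama-3-8B-Instruct.Q4_0.gguf",
            n_gpu_layers=-1,  # offload all layers to the GPU
            n_ctx=4096)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])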

I used a GGUF file with Hugging Face, but it converts it back to the normal format, so nothing changed in terms of performance; it also converts quantized GGUF files back to a dequantized version.
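
This is roughly the path I mean (a sketch with a hypothetical repo and file name): transformers can read the GGUF, but it dequantizes the weights back into a regular torch model on load, so the Q4_0 quantization does not carry over to inference speed.

from transformers import AutoModelForCausalLM, AutoTokenizer

# hypothetical repo and file name for a Llama 3 Instruct GGUF
model_id = "some-org/Meta-Llama-3-8B-Instruct-GGUF"
gguf_file = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"

# the GGUF is imported and dequantized into regular (unquantized) weights
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file, device_map="auto")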

On a different note, a major problem with Ollama was that it could not handle multiple requests: it puts requests in a queue, and it cannot do batch processing. Is that because of how llama.cpp is implemented, or something else?

Also, I tried vLLM, which was much faster than Hugging Face but still can’t beat Ollama.
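
For reference, the offline vLLM Python API looks roughly like this (a minimal sketch; the model name and sampling values here are placeholders, not my exact settings):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.9, top_p=0.7, max_tokens=512)

prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)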

I did not measure at all, but it feels as if the Python bindings for llama.cpp are essentially the same speed as Ollama. Integrating HF transformers or raw inputs is pretty tricky, though, so I abandoned that for now.

GGUF in HF - yes, that’s what I meant by “just import”. Sorry for being unclear.

Multiple requests: yeah, the actual resource usage of inference.

Hi,
I have observed the same as you. I think it stems from certain optimizations going on in the background of Ollama or llama.cpp. In my case I am running the bfloat16 versions of the models on an A10 GPU with 24 GB, and they are relatively fast, BUT when my co-worker in the research group runs the quantized versions with Ollama on a laptop with an RTX 3070 plus the CPU, it seems much faster. The average time per task for my co-worker is 1.87 s, while for me it is 5.90 s (for more context, we are using benchmarks like HumanEval to evaluate the models). I would expect the A10 to be faster, or at least that the difference in time would not be that big. I know there are a lot of variables in our case, but I still think the A10 should beat the setup of a commercial device such as my co-worker’s.

I switched to vLLM and used a quantized model; it is as fast as Ollama and can handle multiple requests.
This is my code in case anybody is struggling with the speed:

import os
import subprocess
import time

from openai import AsyncOpenAI

import utils  # helper module with the screen-session functions shown further below


class Vllm_Openai(Base_QuestionAnswerer):  # Base_QuestionAnswerer is defined elsewhere in my project

    def __init__(self, re_run_server=False):
        super().__init__()
        print("screen wipe:\n", subprocess.getoutput('screen -wipe'))

        os.environ['MKL_SERVICE_FORCE_INTEL'] = '1'
        screen_session_name = "vastai_openai"

        self.run_model(re_run_server, screen_session_name)

        print("vllm openai model loaded")

        self.client = AsyncOpenAI(
            base_url="http://localhost:8000/v1",
            api_key="token-what-a-day",
        )
        self.model_name = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"

    def run_model(self, re_run_server, screen_session_name):
        is_running = subprocess.getoutput(
            f'screen -ls | grep {screen_session_name}')
        if is_running:
            print("vllm openai model already running")

        if re_run_server or not is_running:
            utils.start_screen_session(
                "./startup_files/run_files", screen_session_name, "run_vllm_openai.sh")

        self.wait_for_model_load(1 if is_running else 10)

    async def get_answer_stream(self, chat):
        stream = await self.client.chat.completions.create(
            model=self.model_name,
            messages=chat,
            stream=True,
        )
        return stream

    def wait_for_model_load(self, init_wait):
        time.sleep(init_wait)

        log_file_path = "./startup_files/run_files/vllm_openai_logs.log"
        keyword = "metrics.py"
        check_interval = 5
        while True:
            try:
                # Open the log file and read the last line
                with open(log_file_path, 'r') as file:
                    lines = file.readlines()
                    if lines:
                        last_line = lines[-1].strip()
                    else:
                        last_line = ''

                # Check if the keyword is in the last line
                if keyword in last_line:
                    print(f"Found the keyword '{keyword}' in the last line.")
                    break
                else:
                    print(
                        f"Keyword '{keyword}' not found. Checking again in {check_interval} seconds...")

                # Wait for the specified interval before checking again
                time.sleep(check_interval)

            except FileNotFoundError:
                print(
                    f"File not found: {log_file_path}. Retrying in {check_interval} seconds...")
                time.sleep(check_interval)
            except Exception as e:
                print(
                    f"An error occurred: {e}. Retrying in {check_interval} seconds...")
                time.sleep(check_interval)

    async def first_call(self, chat):
        stream = await self.get_answer_stream(chat)
        print("init call:", flush=True)
        print("Assistant: ", end="", flush=True)
        async for chunk in stream:
            text = self.get_chunk_text(chunk)
            print(text, end="", flush=True)

    def get_chunk_text(self, chunk):
        return chunk.choices[0].delta.content or ""
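
The class can then be driven from an async context roughly like this (a sketch assuming the run_vllm_openai.sh script and the utils helpers shown below are in place):

import asyncio

answerer = Vllm_Openai()
chat = [{"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}]
asyncio.run(answerer.first_call(chat))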

And the shell file that runs the vLLM OpenAI-compatible server in the background:

#!/bin/bash

# Define the path to the log file
LOG_FILE="vllm_openai_logs.log"

# Redirect all output and errors to the log file
exec > >(tee -a $LOG_FILE) 2>&1

vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --dtype auto --api-key token-what-a-day --gpu-memory-utilization 0.7 --enable-prefix-caching

And the utils module:


import subprocess
import time


def give_run_permission(working_directory, bash_command):
    # make the startup script executable before launching it in a screen session
    full_command = f"chmod +x ./{bash_command}"

    result = subprocess.run(full_command,
                            cwd=working_directory,
                            shell=True,
                            capture_output=True,
                            text=True)
    print(result)


def start_screen_session(working_directory, custom_name, bash_command):
    give_run_permission(working_directory, bash_command)
    print("session name:", custom_name)
    print("session bash command:", bash_command)

    kill_screen_session(custom_name)
    time.sleep(5)
    print(f"starting screen session : {custom_name}")
    full_command = f"screen -dmS {custom_name} bash -c './{bash_command}'"

    print("start screen command:")
    print(full_command)
    result = subprocess.run(full_command,
                            cwd=working_directory,
                            shell=True,
                            capture_output=True,
                            text=True)
    print(result)
    
def kill_screen_session(custom_name):
    try:
        result = subprocess.run(f"screen -S {custom_name} -X quit",
                                shell=True,
                                capture_output=True,
                                text=True)

        print("kill screen reuslt:")
        print(result)

        if result.returncode == 0:
            print(f"Successfully killed screen session: {custom_name}")
        else:
            print(f"Failed to kill screen session: {custom_name}")

    except Exception as e:
        print(f"An error occurred: {e}")

Hi,

That’s because they are written in different programming languages. Transformers is written in Python, while Ollama uses llama.cpp as its backend, which is written in C++.

C++ is a faster programming language than Python.

As for deployment of LLMs on a GPU, frameworks like TGI, vLLM and NVIDIA TensorRT-LLM can be used, although perhaps nowadays you could even use llama.cpp for that.

Thanks for the information! However, doesn’t PyTorch also utilize C and C++ under the hood for its computations? I understand that the frontend is in Python, but since the heavy lifting is done by optimized C++ code, wouldn’t the impact on performance be minimal, adding only a constant latency from the Python layer?