HFLM Python integration way slower than CLI

Hello, I'm currently trying to evaluate some models. When I do it with the following script, it takes about 5 minutes (which is fine):

#!/bin/bash
#SBATCH --account bsc03
#SBATCH --qos=acc_debug
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=80
#SBATCH --time 00:15:00

export NUMEXPR_MAX_THREADS=128
export CUDA_VISIBLE_DEVICES=0

model="models/Llama-2-7b-hf"
task="arc_easy,arc_challenge,hellaswag,lambada_openai,lambada_standard,openbookqa,piqa,winogrande"
lm_eval --model hf \
        --model_args pretrained=$model,dtype=bfloat16,device_map=auto \
        --tasks $task \
        --batch_size 64

However, when I use the HFLM class it takes more than 20 minutes, even though I am presumably doing the same thing:

import transformers
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

eval_model = HFLM(
    pretrained="models/Llama-2-7b-hf",
    dtype="bfloat16",
    device_map="auto",
)
results = evaluator.simple_evaluate(
    model=eval_model,
    tasks=["arc_easy","arc_challenge","hellaswag","lambada_openai","lambada_standard","openbookqa","piqa","winogrande"],
    batch_size=64,
)
print(results['results'])

I want to use HFLM because I intend to pass an already-loaded model, and it's way more convenient than saving the model and calling the CLI script.

Any advice? What am I doing wrong? Is this a bug?
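One thing that might help narrow this down is timing each phase separately, to see whether the extra 15 minutes go into model loading, HFLM construction, or the evaluation itself. A minimal stdlib sketch (`timed` is a hypothetical helper, not part of lm-eval):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Wall-clock timer: wrap each phase to see which one is slow.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.1f}s")

# Usage sketch with the HFLM setup from above:
# with timed("HFLM init"):
#     eval_model = HFLM(pretrained="models/Llama-2-7b-hf", dtype="bfloat16", device_map="auto")
# with timed("simple_evaluate"):
#     results = evaluator.simple_evaluate(model=eval_model, tasks=[...], batch_size=64)
```

If the gap is entirely inside `simple_evaluate`, the model placement or batching differs from the CLI run; if it's in `HFLM` init, it's a loading issue.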

Thanks


This may help avoid some of the problems.

eval_model = HFLM(
    pretrained="models/Llama-2-7b-hf",
    dtype="bfloat16",
    device="cuda",
    #device_map="auto",
)
results = evaluator.simple_evaluate(
    model=eval_model,
    tasks=["arc_easy","arc_challenge","hellaswag","lambada_openai","lambada_standard","openbookqa","piqa","winogrande"],
    batch_size=64,
    max_batch_size=64,
)

Thank you for your message! However, your code still takes about 20 minutes to run. The CLI script takes about 5. I don’t know what is going on.

Thanks!


Hmm… It seems like the CLI one is running with accelerate, so how about trying to imitate that?

accelerate launch --num_processes 1 eval.py

Hi, I just tried it. Same thing: it still takes 20 minutes instead of 5.
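Since accelerate didn't change anything, it might be worth checking where `device_map="auto"` actually placed the weights: silent CPU offload of some layers would easily explain a 4x slowdown versus an all-GPU run. A minimal sketch (`device_histogram` is a hypothetical helper; it only assumes objects exposing `.device` and `.numel()`, like PyTorch parameters):

```python
from collections import Counter

def device_histogram(parameters):
    # Count parameters per device for any iterable of objects with
    # .device and .numel() (e.g. model.parameters() in PyTorch).
    # Any weights landing on "cpu" point to accelerate offloading.
    counts = Counter()
    for p in parameters:
        counts[str(p.device)] += p.numel()
    return dict(counts)

# Usage sketch (the attribute holding the underlying transformers model
# may differ between lm-eval versions, e.g. eval_model.model):
# print(device_histogram(eval_model.model.parameters()))
```

If everything reports `cuda:0`, placement is fine and the difference is more likely in batching or data handling.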
