I have a dataset of 100k+ documents, each 100 to 300 words on average, with a few running to several thousand words. I want to do feature extraction over the dataset, i.e. populate a SQL table using a self-hosted model.
What is the optimal way to do this?
One way to do it is simply to run a for loop over the dataset with batch_size=64. That works, but I suspect it leaves the GPU heavily underutilized.
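For concreteness, the naive loop I have in mind looks roughly like this (the encoder model and the CLS-pooling choice are just placeholders):

```python
# Minimal sketch of the naive batched loop; model name and pooling are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "BAAI/bge-base-en-v1.5"  # stand-in for whatever self-hosted model is used
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, torch_dtype=torch.float16).cuda().eval()

@torch.inference_mode()
def extract_features(docs, batch_size=64):
    feats = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        enc = tok(batch, padding=True, truncation=True, max_length=512,
                  return_tensors="pt").to("cuda")
        out = model(**enc).last_hidden_state[:, 0]  # CLS token as the "feature"
        feats.extend(out.float().cpu().tolist())
    return feats
```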
Many HF libraries offer a Trainer or similar, which in turn allows training without per-step Python overhead. However, I don't know of such a library for inference over an offline dataset.
After inference, we need to populate a SQL table, which means going back through Python for the schema handling, and that's slow.
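The write side I have in mind is something like this, with batched inserts rather than one Python round-trip per row (table name and columns are made up; sqlite3 just stands in for the actual backend):

```python
# Sketch of the SQL side: batch the inserts instead of inserting one row at a time.
# Table/column names are hypothetical; sqlite3 is only a stand-in for the real backend.
import json
import sqlite3

conn = sqlite3.connect("features.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS doc_features (doc_id INTEGER PRIMARY KEY, feature TEXT)"
)

def write_features(rows):
    # rows: iterable of (doc_id, feature) pairs
    conn.executemany(
        "INSERT OR REPLACE INTO doc_features (doc_id, feature) VALUES (?, ?)",
        ((doc_id, json.dumps(feat)) for doc_id, feat in rows),
    )
    conn.commit()
```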
What do you think? What is a good way to do this?
LangChain + vLLM async generation?
It's written in C++/CUDA or similar low-level languages.
Yeah. vLLM significantly accelerates inference, especially on GPUs. However, it isn't specifically optimized for handling large numbers of concurrent requests alongside other tasks. TEI and TGI are designed with larger scale in mind, so they might be more advantageous in that case.
That said, vLLM’s server mode might not differ much…
Anyway, this is less a vLLM issue and more a problem on the pure Python side, LangChain included. For batch processing over large numbers of files, running everything through a single Python script introduces overhead.
Offloading some of that to the OS (e.g. by setting up a local server, where the OS handles some of the resource management) can sometimes improve efficiency. It may get messy, though…
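As a rough sketch of what I mean: run the model as a separate server process (e.g. `vllm serve <model>`, which exposes an OpenAI-compatible API on localhost:8000) and have the Python side just fire a bounded number of concurrent requests at it. The model name and prompt below are placeholders:

```python
# Sketch of the server-mode variant: the inference engine runs as its own process,
# and this client only sends concurrent HTTP requests to it.
# Assumes an OpenAI-compatible server (e.g. vLLM) is already listening on localhost:8000.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def extract_one(sem, doc):
    async with sem:  # cap in-flight requests so the server stays saturated, not swamped
        resp = await client.completions.create(
            model="my-model",  # whatever name the server was started with
            prompt=f"Extract the key facts:\n{doc}",
            max_tokens=128,
        )
        return resp.choices[0].text

async def extract_all(docs, concurrency=64):
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(extract_one(sem, d) for d in docs))

# features = asyncio.run(extract_all(docs))
```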
If the dataset isn’t enormous, a single script is perfectly fine.
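For that case, a minimal single-script sketch with vLLM's offline API would already keep the GPU reasonably busy, since vLLM handles the batching and scheduling internally (model name and prompt are placeholders):

```python
# Single-script sketch: hand vLLM the whole list of prompts and let it schedule batches.
from vllm import LLM, SamplingParams

docs = ["example document one", "example document two"]  # in practice, the 100k+ documents

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
params = SamplingParams(max_tokens=128, temperature=0.0)

prompts = [f"Extract the key facts:\n{doc}" for doc in docs]
outputs = llm.generate(prompts, params)
features = [o.outputs[0].text for o in outputs]
# ...then bulk-insert `features` into the SQL table.
```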