I have a dataset of 100k+ documents, each 100 to 300 words on average, with a few running to several thousand words. I want to do feature extraction over the dataset, i.e. populate a SQL table using a self-hosted model.
What is the optimal way to do this?
One way to do it is simply to run a for loop over the dataset with batch_size=64. That works, but I suspect it leaves the GPU heavily underutilized.
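For concreteness, the naive loop I have in mind looks roughly like this (the encoder model and the CLS-pooling choice are just placeholders):

```python
# Minimal sketch of the naive batched loop; model name and pooling are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "BAAI/bge-base-en-v1.5"  # stand-in for whatever self-hosted model is used
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, torch_dtype=torch.float16).cuda().eval()

@torch.inference_mode()
def extract_features(docs, batch_size=64):
    feats = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        enc = tok(batch, padding=True, truncation=True, max_length=512,
                  return_tensors="pt").to("cuda")
        out = model(**enc).last_hidden_state[:, 0]  # CLS token as the "feature"
        feats.extend(out.float().cpu().tolist())
    return feats
```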
Many HF libraries offer a Trainer or similar, which in turn allows training without per-step Python overhead. However, I don't know of such a library for inference over an offline dataset.
After inference, we need to populate a SQL table, which means going back through Python for the schema handling, and that's slow.
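The write side I have in mind is something like this, with batched inserts rather than one Python round-trip per row (table name and columns are made up; sqlite3 just stands in for the actual backend):

```python
# Sketch of the SQL side: batch the inserts instead of inserting one row at a time.
# Table/column names are hypothetical; sqlite3 is only a stand-in for the real backend.
import json
import sqlite3

conn = sqlite3.connect("features.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS doc_features (doc_id INTEGER PRIMARY KEY, feature TEXT)"
)

def write_features(rows):
    # rows: iterable of (doc_id, feature) pairs
    conn.executemany(
        "INSERT OR REPLACE INTO doc_features (doc_id, feature) VALUES (?, ?)",
        ((doc_id, json.dumps(feat)) for doc_id, feat in rows),
    )
    conn.commit()
```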
What do you think? What is a good way to do this?
LangChain + vLLM async generation?
It's written in C++/CUDA or similar low-level languages.
Yeah. vLLM significantly accelerates inference, especially on GPUs. However, it isn't specifically optimized for handling large numbers of concurrent requests alongside other tasks. TEI and TGI are designed with larger scale in mind, so they might be more advantageous in that case.
That said, vLLM’s server mode might not differ much…
Anyway, this is less a vLLM issue and more a problem on the pure Python side, LangChain included. For batch processing over large numbers of files, running everything through a single Python script introduces overhead.
Offloading some of that to the OS (e.g. by setting up a local server, where the OS handles some of the resource management) can sometimes improve efficiency. It may get messy, though…
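As a rough sketch of what I mean: run the model as a separate server process (e.g. `vllm serve <model>`, which exposes an OpenAI-compatible API on localhost:8000) and have the Python side just fire a bounded number of concurrent requests at it. The model name and prompt below are placeholders:

```python
# Sketch of the server-mode variant: the inference engine runs as its own process,
# and this client only sends concurrent HTTP requests to it.
# Assumes an OpenAI-compatible server (e.g. vLLM) is already listening on localhost:8000.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def extract_one(sem, doc):
    async with sem:  # cap in-flight requests so the server stays saturated, not swamped
        resp = await client.completions.create(
            model="my-model",  # whatever name the server was started with
            prompt=f"Extract the key facts:\n{doc}",
            max_tokens=128,
        )
        return resp.choices[0].text

async def extract_all(docs, concurrency=64):
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(extract_one(sem, d) for d in docs))

# features = asyncio.run(extract_all(docs))
```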
If the dataset isn’t enormous, a single script is perfectly fine.
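For that case, a minimal single-script sketch with vLLM's offline API would already keep the GPU reasonably busy, since vLLM handles the batching and scheduling internally (model name and prompt are placeholders):

```python
# Single-script sketch: hand vLLM the whole list of prompts and let it schedule batches.
from vllm import LLM, SamplingParams

docs = ["example document one", "example document two"]  # in practice, the 100k+ documents

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
params = SamplingParams(max_tokens=128, temperature=0.0)

prompts = [f"Extract the key facts:\n{doc}" for doc in docs]
outputs = llm.generate(prompts, params)
features = [o.outputs[0].text for o in outputs]
# ...then bulk-insert `features` into the SQL table.
```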