Here is my issue. I have a model trained with Hugging Face (yay) that works well. Now I need to run inference and compute predictions on a very large dataset of several million observations.
What is the best way to proceed here? Should I write the prediction loop myself? Which routines in datasets would be useful here? My machine has lots of RAM (100 GB+), 20+ CPUs, and a big GPU card.
Hi! If you have enough RAM I guess you can use any tool to load your data.
However, if your dataset doesn't fit in RAM, I'd suggest using the datasets library, since it lets you load datasets without filling up your RAM and gives excellent performance.
Then I guess you can write your prediction loop yourself. Just make sure to pass batches of data to your model so that your GPU is fully utilized. You can also use my_dataset.map() with batched=True and set batch_size to a reasonable value, for example along the lines of the sketch below.
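A minimal sketch of such a batched prediction loop (the checkpoint name, classification head, data file, and "text" column are placeholders for whatever your model and data actually use):

```python
# Minimal sketch: batched GPU inference with datasets.map().
# "my-finetuned-model", "my_large_file.csv", and the "text" column are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-model")  # hypothetical checkpoint
model = AutoModelForSequenceClassification.from_pretrained("my-finetuned-model").to(device).eval()

def predict(batch):
    # Tokenize and run a whole batch through the model at once to keep the GPU busy.
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    batch["prediction"] = logits.argmax(dim=-1).cpu().tolist()
    return batch

dataset = load_dataset("csv", data_files="my_large_file.csv", split="train")  # hypothetical data
dataset = dataset.map(predict, batched=True, batch_size=64)
```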
I have a large dataset that I want to use for eval and other tasks that require a trained model to run inference on it. (For context: I am using a translation model to translate multiple SFT and DPO datasets from English into multiple other languages.)
I’ve been using the .map() function from datasets with batched=True, and batch_size specified.
The problem is that the inference model takes way too long to process even a couple of thousand examples.
I have lots of VRAM and multiple GPUs, so I could launch several instances of the same model on the same GPU, and spread instances across GPUs as well.
Is there a way to use the map() function for batched inference while utilising multiple instances of the model to gain more throughput (more samples processed per second)?
Something like multithreading/multiprocessing where each thread accesses a separate instance of the model.
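The usual pattern for scaling map() inference across GPUs is to pass with_rank=True and num_proc equal to the number of GPUs, so each worker process gets a rank it can use to pick its own device. A minimal sketch, where the translation checkpoint, data file, column names, and batch sizes are placeholders:

```python
# Sketch: one map() worker per GPU, each moving its own copy of the model to its device.
# Checkpoint, data file, and the "text" column are assumptions.
import torch
from multiprocess import set_start_method  # datasets uses the `multiprocess` package
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "Helsinki-NLP/opus-mt-en-de"  # example translation model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

def translate(batch, rank):
    # Each worker process receives a rank; use it to pick the GPU for that worker.
    device = f"cuda:{rank % torch.cuda.device_count()}"
    model.to(device)
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        generated = model.generate(**inputs)
    batch["translation"] = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return batch

if __name__ == "__main__":
    set_start_method("spawn")  # needed so each worker process can initialize CUDA
    dataset = load_dataset("json", data_files="sft_data.json", split="train")  # hypothetical data
    dataset = dataset.map(translate, batched=True, batch_size=64,
                          with_rank=True, num_proc=torch.cuda.device_count())
```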
@lhoestq thanks a ton for replying. My VRAM is quite large (8x A6000), and the model I am using for inference only requires 4 GB of VRAM, so I have orders of magnitude more VRAM sitting idle, as the example only scales up to the number of GPUs.
Is there a way to run multiple instances of the model on the same GPU until the VRAM is full, so that I can max out the full performance of my GPUs and get much faster processing (less total time)?
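One way to push further with the same map() pattern is to launch more worker processes than GPUs and map ranks onto devices with rank % num_gpus, so several workers share a card. Whether this actually helps depends on the GPU having spare compute and not just spare VRAM, so raising batch_size per worker is worth trying first. A sketch, reusing the hypothetical translate() function and dataset from above; INSTANCES_PER_GPU is an assumption to tune:

```python
# Sketch: several map() workers per GPU, each holding its own copy of the model.
# Reuses translate(batch, rank) and dataset from the sketch above.
import torch

NUM_GPUS = torch.cuda.device_count()   # e.g. 8 for 8x A6000
INSTANCES_PER_GPU = 4                  # hypothetical: a 4 GB model leaves plenty of VRAM headroom
NUM_PROC = NUM_GPUS * INSTANCES_PER_GPU

# Because translate() picks its device with `rank % NUM_GPUS`, ranks 0..NUM_PROC-1
# are assigned round-robin to the GPUs, putting several workers on each card.
dataset = dataset.map(translate, batched=True, batch_size=64,
                      with_rank=True, num_proc=NUM_PROC)
```

Note that workers sharing a GPU compete for compute as well as memory, so throughput gains flatten out once the card itself is saturated.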