Fastest way to do inference on a large dataset in huggingface?

Hello there!

Here is my issue. I have a model trained with Hugging Face (yay) that works well. Now I need to run inference and compute predictions on a very large dataset of several million observations.

What is the best way to proceed here? Should I write the prediction loop myself? Which routines in datasets would be useful here? My computer has lots of RAM (100GB+), 20+ CPUs and a big GPU card.

Thanks!

Hi ! If you have enough RAM, I guess you can use any tool to load your data.

However, if your dataset doesn’t fit in RAM, I’d suggest using the datasets library, since it lets you load datasets without filling up your RAM and gives excellent performance.
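Here is a minimal loading sketch, assuming your data lives in CSV files; the file path is a placeholder and the format argument ("csv", "json", "parquet", ...) should match your actual files:

```python
from datasets import load_dataset

# load_dataset converts the data to an Arrow file on disk and memory-maps it,
# so only the rows you actually access are pulled into RAM.
my_dataset = load_dataset("csv", data_files="path/to/my_large_file.csv", split="train")

print(my_dataset)  # shows the number of rows and the column names
```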

Then I guess you can write your prediction loop yourself. Just make sure to pass batches of data to your model so that your GPU is fully utilized. You can also use my_dataset.map() with batched=True and set batch_size to a reasonable value, as in the sketch below.
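A minimal sketch of a batched prediction loop with map(), assuming a Transformers sequence-classification model and a "text" column; the checkpoint name, column name, and batch_size are placeholders to adapt to your own model and data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-checkpoint")  # placeholder name
model = AutoModelForSequenceClassification.from_pretrained("my-finetuned-checkpoint").to(device)
model.eval()

def predict(batch):
    # Tokenize and run the whole batch through the model in one forward pass
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    batch["prediction"] = logits.argmax(dim=-1).cpu().tolist()
    return batch

# batched=True passes slices of batch_size rows to predict(), which keeps the GPU busy
my_dataset = my_dataset.map(predict, batched=True, batch_size=64)
```

The new "prediction" column is written to the dataset's Arrow cache on disk, so the results don't have to fit in RAM either.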
