Batching large csv for embedding

s00060942p · October 30, 2023, 10:48am

I have a large CSV file (35M rows) in the following format:

id, sentence, description

Normally, in inference mode, I'd use the model like so:

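from sentence_transformers import SentenceTransformer

# Load the model once, outside the loop (gpu_id is a placeholder, e.g. 'cuda:0')
model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device=gpu_id)
for row in iter_through_csv:  # placeholder for any row-by-row CSV reader
    encs = model.encode(row[1], normalize_embeddings=True)  # row[1] is the sentence column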

But since I have GPUs, I'd like to encode in batches. However, the file is large (35M rows), so I don't want to read it all into memory before batching.

I am struggling to find a template for batching a CSV on Hugging Face.
What is the most efficient way to do this?
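
To make the question concrete, here is a minimal sketch of the kind of streaming batch loop I have in mind. It is not an official template. It assumes the file has a header row, that the sentence is the second column, and that fields are separated by a comma plus an optional space. The names iter_csv_batches, embed_csv and make_encoder are hypothetical helpers made up for this illustration; the batch sizes are arbitrary.

```python
import csv
from itertools import islice
from typing import Callable, Iterator, List, Tuple


def iter_csv_batches(
    path: str, batch_size: int, has_header: bool = True
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (ids, sentences) batches, reading the file lazily."""
    with open(path, newline="", encoding="utf-8") as f:
        # skipinitialspace handles the "id, sentence, description" layout
        reader = csv.reader(f, skipinitialspace=True)
        if has_header:  # assumption: the first line is a header
            next(reader, None)
        while True:
            # islice pulls at most batch_size rows, so only one batch is in memory
            rows = list(islice(reader, batch_size))
            if not rows:
                break
            yield [r[0] for r in rows], [r[1] for r in rows]


def embed_csv(path: str, encode: Callable, batch_size: int = 10_000):
    """Yield (ids, embeddings) per batch; encode maps a list of sentences to embeddings."""
    for ids, sentences in iter_csv_batches(path, batch_size):
        yield ids, encode(sentences)


def make_encoder(model_name: str, device: str) -> Callable:
    """Build an encode function backed by sentence-transformers (imported lazily)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device=device)  # loaded once

    def encode(sentences: List[str]):
        # batch_size here is the GPU mini-batch size, not the CSV chunk size
        return model.encode(sentences, batch_size=256, normalize_embeddings=True)

    return encode
```

Usage would look roughly like `for ids, embs in embed_csv('data.csv', make_encoder('flax-sentence-embeddings/some_model_here', 'cuda:0')): ...`, writing each batch of embeddings to disk before moving on, so that neither the input rows nor the outputs accumulate in memory. The file name and device string are placeholders.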
