What can I do to optimize this process?

Hi all,

First time posting, and my first time using the Hugging Face libraries.

I am trying to get CLIP embeddings for a series of strings in a pandas dataframe. However, some entries exceed CLIP's maximum sequence length of 77 tokens, and I'm not sure how to deal with them.
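One idea I had for the long entries is to pre-truncate them with the matching CLIP tokenizer before encoding, so no row can blow past the limit. A rough sketch (I'm assuming openai/clip-vit-large-patch14 is the tokenizer behind clip-ViT-L-14, and truncate_to_clip_limit is just a name I made up):

from transformers import CLIPTokenizerFast

# Tokenizer that (I believe) matches clip-ViT-L-14
tokenizer = CLIPTokenizerFast.from_pretrained('openai/clip-vit-large-patch14')

def truncate_to_clip_limit(text, max_length=77):
    # Tokenize with truncation, then decode back to a string that fits
    # inside CLIP's 77-token window (decoding isn't perfectly lossless,
    # but it should be close enough for embedding purposes)
    ids = tokenizer(text, truncation=True, max_length=max_length)['input_ids']
    return tokenizer.decode(ids, skip_special_tokens=True)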

This is roughly what I'm doing now, and it seems to work:

import torch
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
tqdm.pandas()

model = SentenceTransformer('clip-ViT-L-14')

def encode_or_else(text):
    try:
        return model.encode(sentences=text, device=torch.device('cuda'))
    except Exception as e:
        # Return the exception so I can spot the failing rows afterwards
        return e

df = some_relatively_big_dataframe

new_df = df.progress_apply(encode_or_else)


But it is extremely slow (upward of 10 hours on my rig), because it passes one sentence at a time to model.encode(). If I pass a batch or chunk instead, I'm not sure how to handle the errors.
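What I'm considering is encoding in chunks and only falling back to one-sentence-at-a-time encoding when a chunk fails, roughly like this ('text' stands in for my real column name, and the chunk/batch sizes are guesses):

texts = df['text'].tolist()  # 'text' is a placeholder for the actual column

chunk_size = 1024
results = []
for start in tqdm(range(0, len(texts), chunk_size)):
    chunk = texts[start:start + chunk_size]
    try:
        # encode() accepts a list and batches it internally on the GPU
        results.extend(model.encode(chunk, batch_size=128, device='cuda'))
    except Exception:
        # Retry the failing chunk one sentence at a time so a single
        # bad row doesn't cost the whole chunk
        results.extend(encode_or_else(t) for t in chunk)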

So what can I do to make it faster?