What can I do to optimize this process?

Hi all,

First time posting, and my first time using the Hugging Face libraries.

I am trying to get CLIP embeddings for a series of strings in a pandas dataframe. However, some entries exceed CLIP's maximum sequence length of 77 tokens, and I'm not sure how to deal with them.
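One idea I had for the long entries is to pre-truncate them with the matching CLIP tokenizer before encoding, so no row can blow past the limit. A rough sketch (I'm assuming openai/clip-vit-large-patch14 is the tokenizer behind clip-ViT-L-14, and truncate_to_clip_limit is just a name I made up):

from transformers import CLIPTokenizerFast

# Tokenizer that (I believe) matches clip-ViT-L-14
tokenizer = CLIPTokenizerFast.from_pretrained('openai/clip-vit-large-patch14')

def truncate_to_clip_limit(text, max_length=77):
    # Tokenize with truncation, then decode back to a string that fits
    # inside CLIP's 77-token window (decoding isn't perfectly lossless,
    # but it should be close enough for embedding purposes)
    ids = tokenizer(text, truncation=True, max_length=max_length)['input_ids']
    return tokenizer.decode(ids, skip_special_tokens=True)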

This is roughly what I'm doing now, and it seems to work:

import torch
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
tqdm.pandas()

model = SentenceTransformer('clip-ViT-L-14')

def encode_or_else(text):
    try:
        return model.encode(sentences=text, device=torch.device('cuda'))
    except Exception as e:
        # Return the exception so I can spot the failing rows afterwards
        return e

df = some_relatively_big_dataframe

new_df = df.progress_apply(encode_or_else)


But it is extremely slow (upward of 10 hours on my rig), because it passes one sentence at a time to model.encode(). If I pass a batch or chunk instead, I'm not sure how to handle the errors.
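What I'm considering is encoding in chunks and only falling back to one-sentence-at-a-time encoding when a chunk fails, roughly like this ('text' stands in for my real column name, and the chunk/batch sizes are guesses):

texts = df['text'].tolist()  # 'text' is a placeholder for the actual column

chunk_size = 1024
results = []
for start in tqdm(range(0, len(texts), chunk_size)):
    chunk = texts[start:start + chunk_size]
    try:
        # encode() accepts a list and batches it internally on the GPU
        results.extend(model.encode(chunk, batch_size=128, device='cuda'))
    except Exception:
        # Retry the failing chunk one sentence at a time so a single
        # bad row doesn't cost the whole chunk
        results.extend(encode_or_else(t) for t in chunk)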

So what can I do to make it faster?