Hi all,
First time posting and my first time using the Hugging Face libraries.
I am trying to get the CLIP embeddings of a series of strings in a pandas dataframe. However, some entries exceed CLIP's maximum token length of 77, and I'm not sure how to deal with them.
This is roughly what I intend to do and seems to work:
```python
import torch
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

tqdm.pandas()

model = SentenceTransformer('clip-ViT-L-14')

def encode_or_else(text):
    # Return the exception instead of raising, so one bad row
    # doesn't abort the whole apply
    try:
        return model.encode(sentences=text, device=torch.device('cuda'))
    except Exception as e:
        return e

df = some_relatively_big_dataframe
new_df = df.progress_apply(encode_or_else)
```
But it is extremely slow (upwards of 10 hours on my rig), since it passes only one sentence at a time to model.encode(). If I pass in a batch or chunk instead, I'm just not sure how to handle the errors.
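If I were to batch it, this is roughly the pattern I have in mind: try the whole batch in one call, and only fall back to encoding one item at a time when a batch fails. To sanity-check the chunking logic without a GPU, `fake_encode` below is a hypothetical stand-in for `model.encode` (it "embeds" a string as its length and raises on long strings, mimicking the 77-token limit):

```python
def encode_in_batches(texts, encode_fn, batch_size=256):
    """Encode texts batch by batch; if a batch raises, retry that batch
    one item at a time so a single bad entry is isolated."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            results.extend(encode_fn(batch))
        except Exception:
            for text in batch:
                try:
                    results.extend(encode_fn([text]))
                except Exception as e:
                    results.append(e)  # keep the error as a placeholder row
    return results

# Hypothetical stand-in for model.encode, only for testing the chunking:
# returns one "embedding" (the length) per string, raises past 5 characters
def fake_encode(batch):
    out = []
    for t in batch:
        if len(t) > 5:
            raise ValueError("too long")
        out.append(len(t))
    return out
```

In the real version I would call something like `encode_in_batches(list_of_strings, model.encode)`, but I don't know whether this is actually the right approach, or whether the per-item fallback would dominate the runtime when many strings are over the limit.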
So what can I do to make it faster?