I am trying to use the pre-trained model "openai/clip-vit-large-patch14" to generate text embeddings (reference code attached below) and am running into the following errors:
```
Token indices sequence length is longer than the specified maximum sequence length for this model (84 > 77).
The size of tensor a (84) must match the size of tensor b (77) at non-singleton dimension 1
```
From the error message I understand that the sequence length of my input text exceeds what the pre-trained model can handle (i.e., 77 tokens). I looked into a few threads that suggest truncating the data manually, but I want to check whether there is a parameter we can configure so the model (or processor) handles the truncation on its own.
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

inputs = processor(text=<some large text>, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
text_embeds = outputs["text_embeds"]
```
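For context, this is the kind of truncation I am hoping the processor can perform internally: cap the token sequence at the model's maximum of 77 while keeping the end-of-text marker at the end. A minimal sketch of that behavior (the `truncate_ids` helper and the EOS id are illustrative, not the actual CLIP tokenizer API):

```python
def truncate_ids(token_ids, max_length=77, eos_id=49407):
    """Cap a token id sequence at max_length, preserving the EOS token.

    Illustrative only: mimics what tokenizer-side truncation would do,
    not the real CLIP tokenizer implementation.
    """
    # If the sequence already fits, leave it unchanged.
    if len(token_ids) <= max_length:
        return token_ids
    # Otherwise keep the first max_length - 1 tokens and re-append EOS
    # so the sequence still ends with the end-of-text marker.
    return token_ids[:max_length - 1] + [eos_id]

ids = list(range(84))        # stand-in for an 84-token sequence
out = truncate_ids(ids)
print(len(out))              # 77
```

If the processor/tokenizer exposed a switch for this, the call site would stay unchanged apart from one extra argument.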