I am trying to batch-encode the frames of a video with the CLIP model. This works fine using the clip API from OpenAI:
import math
import torch
import clip

# Load the OpenAI CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

batch_size = 256
batches = math.ceil(len(video_frames) / batch_size)

# Process each batch
for i in range(batches):
    print(f"Processing batch {i+1}/{batches}")

    # Get the relevant frames
    batch_frames = video_frames[i*batch_size : (i+1)*batch_size]

    # Preprocess the images for the batch
    batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
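For context, the rest of my loop (cut off in the traceback below) encodes each preprocessed batch with model.encode_image and L2-normalizes the embeddings. The normalization step on its own looks like this (a minimal sketch with dummy tensors standing in for the encoder output, so it runs without loading the model):

```python
import torch

# Dummy stand-in for model.encode_image(batch_preprocessed):
# 4 frames, 512-dim embeddings (ViT-B/32 output size)
batch_features = torch.randn(4, 512)

# L2-normalize each embedding so cosine similarity becomes a plain dot product
batch_features = batch_features / batch_features.norm(dim=-1, keepdim=True)

print(batch_features.norm(dim=-1))  # every row now has unit norm
```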
I want to do the same thing using the transformers API. So instead of model, preprocess = clip.load("ViT-B/32", device=device) I wrote:
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
preprocess = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
But it throws an error on the batch preprocessing line. The full stack trace is here:
Processing batch 1/4
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-ff1a9e68dce5> in <module>()
18
19 # Preprocess the images for the batch
---> 20 batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
21
22 # Encode with CLIP and normalize
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2450 if not _is_valid_text_input(text):
2451 raise ValueError(
-> 2452 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
2453 "or `List[List[str]]` (batch of pretokenized examples)."
2454 )
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
So it looks like preprocess is expecting text input. Maybe I am not choosing the right transformers API for CLIP? Please suggest.