Encoding video frames using CLIP

I am trying to batch-encode the frames of a video with the CLIP model. This works fine using the clip package from OpenAI:

import math

import clip
import torch

# Load the OpenAI CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

batch_size = 256
batches = math.ceil(len(video_frames) / batch_size)

# Process each batch
for i in range(batches):
  print(f"Processing batch {i+1}/{batches}")

  # Get the relevant frames
  batch_frames = video_frames[i*batch_size : (i+1)*batch_size]
  
  # Preprocess the images for the batch
  batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
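  # (Sketch of the rest of the loop, which the traceback below refers to:
  # encode the preprocessed batch with CLIP and L2-normalize the embeddings.
  # `video_features` is a hypothetical list initialized before the loop.)
  with torch.no_grad():
    batch_features = model.encode_image(batch_preprocessed)
  batch_features /= batch_features.norm(dim=-1, keepdim=True)
  video_features.append(batch_features.cpu())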

I want to do this using the Hugging Face transformers API. So instead of model, preprocess = clip.load("ViT-B/32", device=device) I wrote:

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
preprocess = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

But it throws an error on the preprocessing line inside the batch loop. The full stack trace is below:

Processing batch 1/4

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-11-ff1a9e68dce5> in <module>()
     18 
     19   # Preprocess the images for the batch
---> 20   batch_preprocessed = torch.stack([preprocess(frame) for frame in batch_frames]).to(device)
     21 
     22   # Encode with CLIP and normalize

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2450         if not _is_valid_text_input(text):
   2451             raise ValueError(
-> 2452                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2453                 "or `List[List[str]]` (batch of pretokenized examples)."
   2454             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

So it looks like the preprocess call is expecting text input. Maybe I am not choosing the right transformers class for CLIP?
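For reference, my current guess (unverified, so the keyword names below are my assumption) is that CLIPProcessor wraps both a tokenizer and an image processor, and that its first positional argument is text, which would explain the error. If that is right, the frames would have to go through the images= argument and the encoding through get_image_features, roughly like this sketch:

# Sketch only: my guess at the transformers equivalent of the clip-package code above
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
preprocess = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# batch_frames is a list of PIL images, as in the batch loop above
inputs = preprocess(images=batch_frames, return_tensors="pt")
pixel_values = inputs["pixel_values"].to(device)

with torch.no_grad():
  batch_features = model.get_image_features(pixel_values=pixel_values)
batch_features = batch_features / batch_features.norm(dim=-1, keepdim=True)

Is something like this the intended way to replace preprocess, or is there a more direct image-only preprocessor I should be using?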

Please suggest.