Is it possible to tokenize multiple text modalities?

Do the tokenizers in the transformers package support tokenization of triplets?
For example:

Let's assume we're dealing with a VQA dataset. Each entry in the dataset contains the following information:

  1. Image name
  2. Question
  3. 5 possible answers (1 is correct)
  4. Image captions

I would like to be able to represent each input as:
[CLS] Q + [SEP] + A + [SEP] + CAPTIONS

Did you figure out an answer to this?
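
One possible approach, as a rough sketch rather than an official recipe: the tokenizers in transformers take a single text or a pair (`text`, `text_pair`) when called, so a triplet can be approximated by joining the answer and the captions with the tokenizer's own `sep_token` and passing the result as the second segment. The helper name `encode_triplet`, the `demo` function, the example strings, and the `bert-base-uncased` checkpoint are my own assumptions for illustration. Note that with a BERT-style tokenizer the answer and the captions would then share the same segment id in `token_type_ids`.

```python
def encode_triplet(tokenizer, question, answer, captions, **kwargs):
    """Encode (question, answer, captions) as a single sequence.

    `encode_triplet` is a hypothetical helper, not part of transformers.
    The answer and captions are joined with the tokenizer's separator
    token and passed together as the second segment of a text pair.
    """
    second_segment = f"{answer} {tokenizer.sep_token} {captions}"
    # Extra keyword arguments (e.g. truncation, max_length) are forwarded as-is.
    return tokenizer(question, second_segment, **kwargs)


def demo():
    # Imported lazily so the helper above has no hard dependency.
    # Assumes the `bert-base-uncased` checkpoint can be downloaded.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = encode_triplet(
        tok,
        "what color is the cat?",      # question (made-up example)
        "black",                       # one candidate answer
        "a cat sitting on a sofa",     # image captions
        truncation=True,
    )
    # Decode to inspect where the special tokens ended up.
    print(tok.decode(enc["input_ids"]))
```

Since each entry has 5 candidate answers, one option would be to call `encode_triplet` once per candidate and score the five resulting sequences separately.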