PerceiverIO Output Query Array Doubts

I’m interested in using HuggingFace’s PerceiverIO implementation for (video) autoencoders (btw, thank you for the excellent blog post).

My main doubt is the “Output query array” (see the image). Let’s call it OQA for short. Let’s call the “Input array” INP.

I have no idea how to construct this properly or what it actually means (I’m not the only one struggling with this).

The OQA is constructed by the (abstract) decoder_query method that should be implemented for each typical decoder use case.

Optical Flow Example

(sorry, I would add direct links to code but this forum doesn’t let me)

Let’s first take a look at the optical flow case as it seems to be the most straightforward. It encodes a “video” of just two frames:

class PerceiverOpticalFlowDecoder(PerceiverAbstractDecoder):
    """Cross-attention based optical flow decoder."""
    ...
    def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
        if subsampled_points is not None:
            raise ValueError("FlowDecoder doesn't support subsampling yet.")
        return inputs

OK, so the OQA is the same thing as INP. Let’s see what happens when we call the decoder:

def forward(
        self,
        query: torch.Tensor,
        z: torch.FloatTensor,
        query_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> PerceiverDecoderOutput:
        # NOTE: self.decoder is an instance of PerceiverBasicDecoder
        decoder_outputs = self.decoder(query, z, output_attentions=output_attentions)
        preds = decoder_outputs.logits
        # Output flow and rescale.
        preds /= self.rescale_factor
        preds = preds.reshape([preds.shape[0]] + list(self.output_image_shape) + [preds.shape[-1]])
        return PerceiverDecoderOutput(logits=preds, cross_attentions=decoder_outputs.cross_attentions)

self.decoder(query, z, output_attentions=output_attentions) looks like this:

class PerceiverBasicDecoder(PerceiverAbstractDecoder):
    ...
    def forward(
        self,
        query: torch.Tensor,
        z: torch.FloatTensor,
        query_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> PerceiverDecoderOutput:
        # Cross-attention decoding.
        # key, value: B x N x K; query: B x M x K
        # Attention maps -> B x N x M
        # Output -> B x M x K
        cross_attentions = () if output_attentions else None

        layer_outputs = self.decoding_cross_attention(
            query,
            attention_mask=query_mask,
            head_mask=None,
            inputs=z,
            inputs_mask=None,
            output_attentions=output_attentions,
        )
        ...

So here query = OQA = INP and z is the latent coming from the backbone.

So, why is OQA = INP ?

The input has pixel values, position information and also the desired output dimensions, so it is a convenient candidate for the OQA. Is this the only reason?
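To convince myself, here is a minimal single-head cross-attention sketch (plain torch, not the actual HF module) showing why the query matters: whatever we pass as the OQA dictates the token dimension of the decoder output, while the keys/values come from the latent z.

import torch

# Minimal single-head cross-attention sketch (toy code, not the HF implementation):
# the output always has as many rows as the query, so the decoder query (OQA)
# decides the index dimension of whatever comes out of the decoder.
B, M, N, D = 1, 50176, 256, 64           # batch, query tokens, latent tokens, head dim
query = torch.randn(B, M, D)             # OQA, already projected to the attention dim
z = torch.randn(B, N, D)                 # latents, already projected to the attention dim

attn = torch.softmax(query @ z.transpose(1, 2) / D**0.5, dim=-1)  # B x M x N
out = attn @ z                                                     # B x M x D
print(out.shape)  # torch.Size([1, 50176, 64]) -> index dim comes from the query

So for optical flow, reusing INP as the query gives an output with one vector per input position, which is exactly the shape the flow prediction is reshaped into in forward() above.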

Video Autoencoding

Let’s first get the dimensions straight. Assume a video of 16 frames at 224x224 resolution with 3 color channels, i.e. a tensor of shape torch.Size([16, 3, 224, 224]).

After the “space-to-depth” transformation (i.e. creating blocks, each block containing a small xy patch, the color channels and some time slice) we get torch.Size([1, 16, 56, 56, 48]), where 16, 56, 56 are the dimensions indexing the blocks and each block holds 48 values.

After flattening the index dimensions and concatenating Fourier positional encodings, we get torch.Size([1, 50176, 243]), so in terms of LLMs & transformers, that’s 50176 “tokens” embedded into a vector space of dim 243. This is the input tensor INP.
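For reference, here is how I reproduce that shape arithmetic (my own sketch; the 4x4 spatial patches, no temporal downsampling, and a 3-D Fourier encoding with 32 bands, sines+cosines plus the concatenated raw positions are my assumptions, chosen because they reproduce the 48 and 243 I see):

# Shape arithmetic only (my own sketch, not the HF preprocessor).
frames, channels, height, width = 16, 3, 224, 224
patch = 4            # assumed 4x4 spatial patches, no temporal downsampling
num_bands = 32       # assumed number of Fourier bands per index dimension

blocks = (frames, height // patch, width // patch)      # (16, 56, 56) block index dims
block_len = channels * patch * patch                    # 48 values per block
num_tokens = blocks[0] * blocks[1] * blocks[2]          # 50176 "tokens"
pos_dim = len(blocks) * 2 * num_bands + len(blocks)     # sin+cos per band per dim, plus raw positions = 195
print(num_tokens, block_len + pos_dim)                  # 50176 243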

During the crunching in the self-attention layers, the latent hidden dimensions are always torch.Size([1, 256, 1278]). So far so good.

But then comes the decoding part. Let’s see what PerceiverBasicVideoAutoencodingDecoder does:

class PerceiverBasicVideoAutoencodingDecoder(PerceiverAbstractDecoder):
    """
    Cross-attention based video-autoencoding decoder. Light-weight wrapper of [*PerceiverBasicDecoder*] with video
    reshaping logic.
    ...
    """
    ...
    def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
        # NOTE: self.decoder is an instance of PerceiverBasicDecoder
        return self.decoder.decoder_query(
            inputs,
            modality_sizes=modality_sizes,
            inputs_without_pos=inputs_without_pos,
            subsampled_points=subsampled_points,
        )
    ...

Comparing this to the optical flow case, why all the fuss? Why not just return inputs, like PerceiverOpticalFlowDecoder does?

I mean, the inputs are already torch.Size([1, 50176, 243]), i.e. properly organized and with added positional encoding dims.

The call to self.decoder.decoder_query() does lots of things I do not get:

class PerceiverBasicDecoder(PerceiverAbstractDecoder):
    ...
    def decoder_query(self, inputs, modality_sizes=None, inputs_without_pos=None, subsampled_points=None):
        ...
        else:
            # we end up in this section..
            batch_size = inputs.shape[0]
            index_dims = inputs.shape[2:]  # BUG here? shouldn't index_dims be = inputs.shape[1:2] ??
            # Construct the position encoding
            # NOTE: constructs index dimensions again / independently of the inputs
            if self.position_encoding_type == "trainable":
                pos_emb = self.output_position_encodings(batch_size)
            elif self.position_encoding_type == "fourier":  # this is where we end up
                pos_emb = self.output_position_encodings(
                    index_dims, batch_size, device=inputs.device, dtype=inputs.dtype
                )

With my example dimensions, all of this results in an OQA with the weird shape torch.Size([1, 243, 193]), i.e. the new index/token dimension is 243, which was the channel dimension of INP, etc.
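To make the confusion concrete, here is the slicing as I read it (plain torch, my own reading of the code above):

import torch

# With INP of shape [1, 50176, 243], shape[2:] treats the 243 channels as the
# index dims, so the output position encoding is built over 243 "positions"
# instead of the 50176 actual tokens.
inputs = torch.empty(1, 50176, 243)
print(tuple(inputs.shape[2:]))   # (243,)   <- what the code uses as index_dims
print(tuple(inputs.shape[1:2]))  # (50176,) <- what I would have expected

The 193 channels would then presumably come from a 1-D Fourier encoding with sines, cosines and the concatenated raw position: 2 * num_bands + 1 = 193 if num_bands is 96.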

Does this make any sense? And if so, why? Why not just use OQA = INP?

Btw, at the decoding stage the code crashes with

RuntimeError: Given normalized_shape=[128], expected input with shape [*, 128], but got input of size[1, 243, 193]

(presumably the decoder’s cross-attention applies a LayerNorm that expects a 128-channel query, but the constructed OQA has 193 channels).

I hope these concrete questions and examples have made clear the weak points in my understanding.

Please help!

I have been clearing up my doubts by myself so far… so I guess I’ll continue this (lonely :disappointed_relieved: ) monologue a bit more.

The decoder query matrix - being a query (Q) matrix - defines the first index dimension of the matrix that comes out of the decoder. So you can simply think of it as something that must have the index dimension of your desired output, and that relates - in a common-sense way - to your specific input. For optical flow it is the (preprocessed) input itself, and for the video autoencoder it is the Fourier positional embeddings. Makes sense.
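So if I want one output vector per video block, I need an OQA with one row per block. Here is a minimal sketch of building such a query from Fourier features over the desired output grid (my own toy code, not the HF position-encoding utilities):

import math
import torch

def fourier_query(index_dims, num_bands=32):
    """Toy Fourier positional features over a grid of index_dims,
    one row per desired output position (my sketch, not the HF utility)."""
    # Normalized coordinates in [-1, 1] for every grid position.
    axes = [torch.linspace(-1.0, 1.0, steps=d) for d in index_dims]
    pos = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
    pos = pos.reshape(-1, len(index_dims))                       # (prod(index_dims), n_dims)
    freqs = torch.linspace(1.0, max(index_dims) / 2.0, steps=num_bands)
    angles = math.pi * pos[..., None] * freqs                    # (N, n_dims, num_bands)
    feats = torch.cat([pos, torch.sin(angles).flatten(1), torch.cos(angles).flatten(1)], dim=-1)
    return feats.unsqueeze(0)                                    # (1, N, channels)

oqa = fourier_query((16, 56, 56))
print(oqa.shape)  # one query row per output block: torch.Size([1, 50176, 195])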

Finally, I think there’s a bug in the way the preprocessor, perceiver, decoder and output processor are organized in this module: there’s no way to create consistent Fourier positional embeddings in the decoder part, say, for video, because the index dimension information is lost along the call chain preprocessor → perceiver → decoder (there’s no way to know what the original index dimensions were at the beginning).
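The workaround I can think of is to remember the original index dims myself and build the query outside the library, then hand it to the decoder’s cross-attention. A sketch under my assumptions (toy projections instead of the real decoder; the stand-in latent shape matches what I observed above):

import torch

# Sketch of the workaround (toy code, not the HF decoder): remember the index
# dims from preprocessing, build the OQA myself, and check that cross-attending
# into the latents gives one output vector per video block.
index_dims = (16, 56, 56)                 # known at preprocessing time, lost later in the call chain
query = fourier_query(index_dims)         # toy helper from the sketch above -> [1, 50176, 195]
z = torch.randn(1, 256, 1278)             # stand-in for the latents from the backbone

# Project both sides to a common attention dim; in the real decoder this is done
# by learned projections, and presumably the 128-channel LayerNorm from the crash
# above sits on this query path.
d = 128
q = torch.nn.Linear(query.shape[-1], d)(query)
kv = torch.nn.Linear(z.shape[-1], d)(z)
out = torch.softmax(q @ kv.transpose(1, 2) / d**0.5, dim=-1) @ kv   # [1, 50176, 128]
print(out.reshape(1, *index_dims, d).shape)  # torch.Size([1, 16, 56, 56, 128])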