Transformer for asynchronous multi-stream image time-series with online prediction?

I have two streams of images, each stream corresponding to a different “channel” (e.g. different sensor modality). The streams are not synchronized — at any given moment, a new image arrives from one stream or the other, each with a real-valued timestamp. I want to classify the sequence online, i.e. produce an updated prediction after every new incoming image.

Key constraints:

  • Spatial features within each image matter (not just a scalar summary)

  • Timestamps are irregular and not aligned across streams

  • Prediction must improve causally as more observations arrive

The natural design seems to be: ViT encoder per image → causal transformer over the merged token stream, with real-valued timestamp embeddings (e.g. Time2Vec) replacing positional indices, and band/channel ID as an additional embedding.
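
Concretely, something like this untested sketch, where `img_encoder` stands for any backbone that maps one image to a single vector (all names here are my own placeholders):

```python
import torch
import torch.nn as nn

class AsyncStreamClassifier(nn.Module):
    # Untested sketch: one token per image, merged across both streams in arrival order.
    def __init__(self, img_encoder, d_model=256, n_streams=2, n_classes=10, time_dim=16):
        super().__init__()
        self.img_encoder = img_encoder                 # any backbone: (N, C, H, W) -> (N, d_model)
        self.stream_emb = nn.Embedding(n_streams, d_model)
        self.time_freq = nn.Linear(1, time_dim)        # Time2Vec-style learned frequencies (periodic part only)
        self.time_proj = nn.Linear(time_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, images, timestamps, stream_ids):
        # images: (B, T, C, H, W); timestamps: (B, T) float; stream_ids: (B, T) long
        B, T = timestamps.shape
        x = self.img_encoder(images.flatten(0, 1)).view(B, T, -1)
        t = torch.sin(self.time_freq(timestamps.unsqueeze(-1)))       # (B, T, time_dim)
        x = x + self.time_proj(t) + self.stream_emb(stream_ids)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.temporal(x, mask=causal)                             # causal over arrival order
        return self.head(h)                                           # one prediction per arrival
```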

Is there an existing architecture or paper that handles this exact setup? Or is this a known gap?

Seems a known gap?


I did not find a widely recognized, exact-match architecture for your full setup: two raw image streams, asynchronous real-valued timestamps, no forced alignment, spatial structure preserved inside each image, and causal online classification after every arrival. What I found is a set of strong neighboring lines of work that each cover part of it; the intersection still looks like a real gap rather than a settled standard.

  • MulT covers unaligned multimodal attention.
  • StreaMulT covers streaming unaligned multimodal inference with memory.
  • TSViT covers image time series with time-aware visual tokenization.
  • ContiFormer and Transformer Hawkes Process cover continuous-time irregular sequences.
  • AnytimeFormer covers irregular asynchronous fusion of two image modalities, but for reconstruction rather than online classification.
  • RAMNet and SODFormer cover asynchronous visual streams with online updates, but in event-plus-frame settings and on different tasks.
  • Time-IMM reinforces the broader point that realistic irregular, asynchronous, multimodal settings are still under-served in current benchmarks and methods. (ACL Anthology, arXiv)

The background

Most transformer work on time series grew out of one of three easier settings:

  • regularly sampled sequences
  • feature-level multimodal streams
  • video-like synchronized visual inputs

Your problem sits outside all three. You have visual observations, not just vectors. They arrive at irregular real times. The two streams are not aligned. And you need online causal updates, not one final prediction after the sequence ends. That combination is exactly why no single canonical paper shows up. The literature is rich on each axis separately, but sparse at the full intersection. (ACL Anthology, arXiv)

What already exists, and how close it is

1. Unaligned multimodal transformers

MulT is the classic reference for unaligned multimodal sequences. Its key idea is directional crossmodal attention that lets one modality attend to another across distinct time steps without explicit alignment. That is very relevant to your two unsynchronized streams. But MulT was developed for low-level modality feature sequences, not raw image patch streams, and it is not an online streaming vision model. (ACL Anthology, arXiv)

StreaMulT is closer in deployment spirit. It explicitly defines a setting where the goal is prediction across time from heterogeneous multimodal sequential data in a streaming fashion, and it uses crossmodal attention plus a memory bank to handle unaligned input streams and arbitrarily long inputs. That is the closest existing transformer framing to your online requirement. The mismatch is that it is still not a raw-image-first architecture. (arXiv)

2. Irregular continuous-time sequence models

Transformer Hawkes Process is important conceptually because it treats the input as a continuous-time event sequence and explicitly argues that vanilla transformer machinery is not directly suited to continuous-time event data. It adapts self-attention to that setting and makes the case for attention-based modeling of both short- and long-range event dependencies.

ContiFormer pushes the same idea further. It argues that ordinary recurrent and transformer models are limited by their discrete nature when applied to irregularly sampled continuous-time data, and it extends transformer relation modeling into the continuous-time domain. That makes it one of the strongest references for your timestamp problem. (arXiv)

Time2Vec is not a full architecture, but it is still one of the cleanest timestamp components. It is explicitly proposed as a model-agnostic vector representation of time for synchronous and asynchronous events. That makes it a natural candidate for event-time embeddings in your setup. (OpenReview)
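
For concreteness, the Time2Vec map itself is tiny: one learned linear term plus k learned sinusoidal terms. A minimal sketch, using sine as the periodic function as in the paper:

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    # t2v(tau)[0] = w0 * tau + b0 (linear trend); t2v(tau)[i] = sin(wi * tau + bi) for i >= 1.
    def __init__(self, k: int = 15):
        super().__init__()
        self.linear = nn.Linear(1, 1)       # the single non-periodic component
        self.periodic = nn.Linear(1, k)     # k periodic components with learned frequency and phase

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (..., 1) real-valued timestamps -> (..., k + 1) embedding
        return torch.cat([self.linear(tau), torch.sin(self.periodic(tau))], dim=-1)
```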

3. Visual time-series models that preserve spatial structure

TSViT is probably the most relevant visual paper if you care about keeping spatial image structure instead of collapsing each image to one scalar or one vector too early. It builds a factorized temporo-spatial encoder for satellite image time series and introduces acquisition-time-specific temporal positional encodings. This is strong evidence that image time series benefit from explicit timestamp-aware modeling while still treating the input as images, not just tabular points. (CVF Open Access)

S-ViT is relevant for a different reason. It uses a memory-enabled temporally aware spatial encoder to produce frame-level features, then sends those features to a temporal decoder. That separation is useful for your case because it points away from one giant flat spatiotemporal token stream and toward a more scalable “encode image first, fuse over time second” design. (CVF Open Access)

4. Asynchronous visual streaming systems

RAMNet is one of the strongest near-matches for the online semantics you want. It is not transformer-based, but it is explicitly built for asynchronous and irregular data from multiple sensors, keeps a hidden state that is updated asynchronously, and can be queried at any time for a prediction. The mismatch is that it works on events and frames for monocular depth, not two ordinary image streams for classification.

SODFormer is another very relevant near-match. It fuses asynchronous events and frames, uses a spatiotemporal transformer, and says it can continuously detect objects in an asynchronous manner. Its fusion module can be queried at any time, specifically to avoid the bottleneck of synchronized frame-based fusion. Again, the mismatch is task and modality type rather than the core streaming idea. (arXiv)

5. Two image modalities with irregular timestamps

AnytimeFormer is the closest paper I found to your raw input shape. It takes Sentinel-2 optical and Sentinel-1 SAR observations together with their timestamps, uses a time-align attention module to adaptively align temporally asynchronous multi-modal time series, and avoids extra alignment preprocessing. That is very close to “two image channels with irregular timestamps.” The mismatch is that the task is reconstruction at arbitrary times, not online sequence classification after each arrival. (ScienceDirect)

6. A useful warning paper

MICA is important because it points out a failure mode that matters a lot in your case. Its argument is that asynchronous multimodal fusion is not just a timing problem. It is also a distribution discrepancy problem. If the two modalities live in different feature distributions, plain cross-attention can become unreliable, so it performs attention in a more modality-invariant space. If your two streams come from genuinely different sensors, this paper is very relevant to architecture design. (CVF Open Access)

So is this a known gap?

Yes. That is the most accurate summary.

The field clearly knows about:

  • unaligned multimodal streams (ACL Anthology, arXiv)
  • continuous-time irregular sequences (arXiv)
  • timestamp-aware visual time series (CVF Open Access)
  • asynchronous online visual fusion

But those pieces are usually studied in different communities. Time-IMM makes the broader point explicitly: real-world time series are often irregular, multimodal, asynchronous, and messy, while many benchmarks and methods still assume cleaner, more regular settings. That supports the claim that your problem is not a solved standard benchmark case. (arXiv)

What I think about your proposed design

Your instinct is good. The main refinement is architectural.

What is right in your idea

These parts are solid:

  • image encoder first
  • explicit real-valued time embeddings
  • channel or modality ID embeddings
  • causal prediction after every new observation

Those choices line up well with the existing literature. MulT and StreaMulT support the unaligned multimodal part. Time2Vec, THP, and ContiFormer support treating time explicitly. TSViT supports the idea that timestamps belong in visual time-series modeling. (ACL Anthology, arXiv)

What I would change

I would not literally replace all positional indices with time embeddings.

Inside each image, you still need 2D spatial positional information. Across images, you need event-time information. Those are different roles. TSViT is a strong precedent for keeping image-space modeling explicit while adding time-aware encodings. So I would use:

  • 2D spatial positions inside the per-image encoder
  • continuous-time embeddings at the event level
  • modality embeddings at the event level (CVF Open Access)

I would also not start with a model that sends every patch token from every image into one ever-growing causal transformer. That is elegant, but it is also the most likely place to hit compute and memory problems. StreaMulT’s use of memory banks and S-ViT’s separation of spatial and temporal stages both point toward a more scalable design. (arXiv)

The design I would actually recommend

I would treat each arrival as an event made of:

  • the image
  • its real timestamp
  • its stream ID

Then I would use this pipeline:

1. Per-image visual encoder

Encode each image with a ViT-like or CNN-plus-transformer backbone that preserves spatial structure. Keep normal 2D patch positions here. If the two sensor modalities are very different, use separate stems or adapters, because MICA shows that crossmodal attention can become unreliable when modality distributions differ too much. (CVF Open Access)
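
To make the separate-stem idea concrete, here is a minimal sketch; the class name, the stem choice, and the assumption of 224×224 inputs with 16×16 patches (196 patch tokens) are all mine:

```python
import torch
import torch.nn as nn

class PerImageEncoder(nn.Module):
    # Sketch: modality-specific patch stems project into a shared space,
    # then a shared transformer body models 2D spatial structure within one image.
    def __init__(self, d_model=256, patch=16, in_ch=(3, 1), n_patches=196):
        super().__init__()
        self.stems = nn.ModuleList(
            nn.Conv2d(c, d_model, kernel_size=patch, stride=patch) for c in in_ch
        )
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))  # 2D spatial positions, flattened
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, image: torch.Tensor, stream_id: int) -> torch.Tensor:
        # image: (B, C, H, W) from one stream -> patch tokens (B, N, d_model)
        tokens = self.stems[stream_id](image).flatten(2).transpose(1, 2)
        return self.body(tokens + self.pos)
```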

2. Per-image token compression

Do not export all patches into the temporal model. Export either:

  • one global token, or
  • a small set of learned latent summary tokens

This keeps more spatial information than one scalar summary, but avoids a temporal patch-history explosion. TSViT and S-ViT both support this kind of factorized thinking. (CVF Open Access)
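
A sketch of the compression step, using a small set of learned query tokens that cross-attend into the patch tokens (Perceiver-style pooling; the names and the choice of 8 latents are placeholders of mine):

```python
import torch
import torch.nn as nn

class LatentSummarizer(nn.Module):
    # Sketch: compress (B, N, d) patch tokens into (B, K, d) summary tokens, with K << N.
    def __init__(self, d_model=256, n_latents=8, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, n_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        q = self.latents.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)   # latents attend to the patches
        return self.norm(q + pooled)                           # K summary tokens per image
```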

3. Event-time encoding

Add:

  • absolute timestamp embedding
  • time since previous event
  • time since previous event from the same stream
  • time since previous event from the other stream
  • stream ID embedding

That exact combination is my recommendation, not a named paper module, but it follows naturally from the continuous-time event perspective in Time2Vec, THP, and ContiFormer. (OpenReview)
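
A sketch of how those five pieces could be packed into one per-event embedding; this exact module is my suggestion rather than a published component:

```python
import torch
import torch.nn as nn

class EventTimeEncoder(nn.Module):
    # Sketch: embed absolute time, three time deltas, and the stream ID for one event.
    def __init__(self, d_model=256, n_streams=2, k=15):
        super().__init__()
        self.t_lin = nn.Linear(4, d_model)            # linear part of a Time2Vec-style map
        self.t_per = nn.Linear(4, k)                  # periodic part (sine of learned frequencies)
        self.t_out = nn.Linear(d_model + k, d_model)
        self.stream_emb = nn.Embedding(n_streams, d_model)

    def forward(self, t_abs, dt_any, dt_same, dt_other, stream_id):
        # all time inputs: (B,) floats; stream_id: (B,) long
        t = torch.stack([t_abs, dt_any, dt_same, dt_other], dim=-1)      # (B, 4)
        feats = torch.cat([self.t_lin(t), torch.sin(self.t_per(t))], dim=-1)
        return self.t_out(feats) + self.stream_emb(stream_id)            # (B, d_model)
```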

4. Streaming fusion with memory

Instead of a single merged stream, keep:

  • memory for stream A
  • memory for stream B
  • fused memory for prediction

When a new A image arrives, update A memory, let it attend into recent B memory, then update fused state and emit a prediction. That is much closer to the logic of MulT and StreaMulT than to a monolithic merged token list. (ACL Anthology, arXiv)
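
A rough sketch of that update rule, with fixed-size rolling memories per stream; this is a strong simplification of StreaMulT-style memory banks, and every name and size in it is a placeholder of mine:

```python
import torch
import torch.nn as nn

class StreamingFuser(nn.Module):
    # Sketch: rolling per-stream memories of event embeddings plus a fused running state.
    # Two streams (A = 0, B = 1) are assumed throughout.
    def __init__(self, d_model=256, mem_len=64, n_heads=8, n_classes=10):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.GRUCell(d_model, d_model)       # running fused state used for prediction
        self.head = nn.Linear(d_model, n_classes)
        self.mem_len = mem_len
        self.memories = [torch.zeros(1, 0, d_model), torch.zeros(1, 0, d_model)]
        self.state = torch.zeros(1, d_model)

    @torch.no_grad()  # inference-time stepping; training would unroll this loop with gradients
    def step(self, event_emb: torch.Tensor, stream_id: int) -> torch.Tensor:
        # event_emb: (1, d_model) embedding of the newly arrived image event
        mem = self.memories[stream_id]
        self.memories[stream_id] = torch.cat([mem, event_emb.unsqueeze(1)], dim=1)[:, -self.mem_len:]
        other = self.memories[1 - stream_id]
        q = event_emb.unsqueeze(1)
        if other.size(1) > 0:                          # attend into the other stream's recent memory
            ctx, _ = self.cross(q, other, other)
            q = q + ctx
        self.state = self.fuse(q.squeeze(1), self.state)
        return self.head(self.state)                   # prediction emitted after this arrival
```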

5. Prefix-level supervision

Because you need online classification, train the model to be correct not only at the end, but after each arrival. RAMNet and StreaMulT are both useful here as operational references: they are built around state updates and predictions across time, not only one terminal decision. (arXiv)
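
In training terms this just means applying the classification loss after every arrival rather than only at the end. A minimal sketch, assuming the model emits per-event logits as in the sketches above (the discounting scheme is my own suggestion):

```python
import torch
import torch.nn.functional as F

def prefix_supervised_loss(logits: torch.Tensor, labels: torch.Tensor, discount: float = 1.0) -> torch.Tensor:
    """logits: (B, T, C), one prediction per arrival; labels: (B,), sequence-level class.
    Averages cross-entropy over all prefixes; discount < 1 up-weights later, better-informed steps."""
    B, T, C = logits.shape
    per_step = F.cross_entropy(
        logits.reshape(B * T, C),
        labels.unsqueeze(1).expand(B, T).reshape(B * T),
        reduction="none",
    ).view(B, T)
    steps = torch.arange(T - 1, -1, -1, device=logits.device, dtype=logits.dtype)
    weights = torch.as_tensor(discount, device=logits.device, dtype=logits.dtype) ** steps
    return (per_step * weights).sum(dim=1).mean() / weights.sum()
```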

My shortest answer

There is no standard named architecture that already solves your exact setting end to end. The nearest pieces are:

  • MulT for unaligned multimodal attention (ACL Anthology, arXiv)
  • StreaMulT for streaming unaligned multimodal inference with memory (arXiv)
  • TSViT for timestamp-aware image time-series transformers (CVF Open Access)
  • ContiFormer and Transformer Hawkes Process for continuous-time irregular event modeling (arXiv)
  • AnytimeFormer for asynchronous two-image-modality fusion with timestamps (ScienceDirect)
  • RAMNet and SODFormer for asynchronous online visual prediction semantics

So the best classification of your idea is:

not solved as a standard recipe, but strongly supported by existing components. (ACL Anthology, arXiv)

What I would call it

If you need a label for the architecture family, I would call it one of these:

  • streaming multimodal continuous-time vision transformer
  • event-driven multimodal vision transformer
  • marked-event transformer for asynchronous visual streams

Those are descriptive names, not established paper titles.

Final judgment

Your proposed direction is not only reasonable. It is probably the right research direction for this problem. The only strong caveat is this:

keep space inside the image encoder, and keep time at the event level.

Do that, and your design becomes a clean synthesis of the strongest ideas already in the literature, while still addressing a real gap.
