Time Series Transformer: lagged values and time alignment

Hello, I’m relatively new to working with transformers, and I’ve been exploring the implementation of a Time Series Transformer. However, I’m struggling to grasp how time features align with lagged values in the training phase.

Upon examining the code of create_network_inputs() in TimeSeriesTransformerModel alongside the generate() function in TimeSeriesTransformerForPrediction, I observe that 1 is a mandatory value in config.lags_sequence; otherwise inference would not work, because there would be missing values between the observed context and the forecast. It is also worth mentioning that the minimum value in config.lags_sequence should never fall below one (zero or negative). The first function calls get_lagged_subsequences() with shift=0, while the second uses shift=1.

It is apparent that, during inference, the last observed value must be used as the initial token to start the incremental, step-by-step greedy forecast, which makes sense. A lag below one would instead require future values that are not yet available, which is a practical constraint.

def get_lagged_subsequences(
    self, sequence: torch.Tensor, subsequences_length: int, shift: int = 0
) -> torch.Tensor:
    sequence_length = sequence.shape[1]
    # shift=0 during training, shift=1 during generation
    indices = [lag - shift for lag in self.config.lags_sequence]

    lagged_values = []
    for lag_index in indices:
        # slice of length `subsequences_length`, ending `lag_index` steps before the end
        begin_index = -lag_index - subsequences_length
        end_index = -lag_index if lag_index > 0 else None
        lagged_values.append(sequence[:, begin_index:end_index, ...])
    # stack along a new last dimension: one channel per lag index
    return torch.stack(lagged_values, dim=-1)
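To make the shift behavior concrete, here is a pure-Python sketch of the same slicing logic as a free-standing function (the name and list-based signature are my own simplification; the real method works on tensors):

```python
def get_lagged_subsequences(sequence, subsequences_length, lags_sequence, shift=0):
    # Pure-Python sketch of the slicing above: `sequence` is a plain list.
    indices = [lag - shift for lag in lags_sequence]
    lagged_values = []
    for lag_index in indices:
        # slice of length subsequences_length, ending lag_index steps before the end
        begin = len(sequence) - lag_index - subsequences_length
        end = len(sequence) - lag_index if lag_index > 0 else len(sequence)
        lagged_values.append(sequence[begin:end])
    return lagged_values

seq = [0, 1, 2, 3, 4, 5]  # x_0 .. x_5
# training path (shift=0): lag 1 selects the values one step behind
train = get_lagged_subsequences(seq, 3, [1], shift=0)   # [[2, 3, 4]]
# generation path (shift=1): lag 1 becomes index 0, so the last observed
# value itself is included, letting it serve as the initial token
infer = get_lagged_subsequences(seq, 3, [1], shift=1)   # [[3, 4, 5]]
```

This is why a lag of 1 must be present: with shift=1 it turns into index 0 and exposes the last observed value to the decoder.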

During the training phase in create_network_inputs(), it appears that for lag_index = 1 the values lagged by one step, lagged_values.append(sequence[:, -self.config.context_length - max(self.config.lags_sequence) - 1 : -1, ...]), are aligned with the time features of the last observed context, past_time_features[:, -self.config.context_length - max(self.config.lags_sequence) :, ...]. This seems to be misaligned.

Wouldn’t it be more appropriate for values to be aligned with time features in sync? In cases where steps are not equally spaced, the current alignment could omit relevant information from the features of the last observed value. Adjusting this alignment could enhance the model’s ability to capture relevant temporal patterns.
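A toy illustration of the one-step offset being described, assuming lags_sequence = [1], shift = 0, and ignoring the prediction window for simplicity:

```python
values = [f"x_{t}" for t in range(6)]   # raw series x_0 .. x_5
feats = [f"c_{t}" for t in range(6)]    # per-step time features c_0 .. c_5

context_length = 3
max_lag = 1  # assumed: lags_sequence = [1], training path (shift = 0)

# lag-1 values: one step behind each encoder position
lagged = values[-context_length - max_lag : -max_lag]   # ['x_2', 'x_3', 'x_4']
# time features taken for those same positions: the *current* steps
time_feat = feats[-context_length:]                     # ['c_3', 'c_4', 'c_5']

pairs = list(zip(lagged, time_feat))
# each position pairs x_{t-1} with c_t:
# [('x_2', 'c_3'), ('x_3', 'c_4'), ('x_4', 'c_5')]
```

So position i carries the value of step t-1 next to the covariates of step t, which is the pairing the question is asking about.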

def create_network_inputs(...):
    # time features: context-window slice of the past features, concatenated
    # with the future features when training (i.e. future_values available)
    time_feat = (
        torch.cat(
            (
                past_time_features[:, self._past_length - self.config.context_length :, ...],
                future_time_features,
            ),
            dim=1,
        )
        if future_values is not None
        else past_time_features[:, self._past_length - self.config.context_length :, ...]
    )

    # lagged features
    subsequences_length = (
        self.config.context_length + self.config.prediction_length
        if future_values is not None
        else self.config.context_length
    )
    lagged_sequence = self.get_lagged_subsequences(
        sequence=inputs, subsequences_length=subsequences_length
    )  # shift = 0, default

    return transformer_inputs, loc, scale, static_feat

This question applies to the Time Series Transformer, Informer, and Autoformer, because they share the same code.

Thanks to anyone who can help here.



Thanks @zulok33 for the questions…

To begin with, let’s separate out the issue of the lags from that of the temporal covariates.

The lags allow us to create a sequence of vectors when all we have is a 1-d array of numbers, because transformers operate on a sequence of vectors that typically carry some semantic meaning in NLP. Similarly, in the time-series setting we would like temporal semantic representations (sequences of vectors), which we get via the lag operation.

Now, the lag operation keeps the time steps of the incoming 1-d sequence intact: the very first coordinate of each lag vector is the actual input time series at a particular time point, say t.

So once we have a sequence of lag vectors, each containing the value x_t and some chosen x_{t-lag_index} values, we can concat the time features to this vector.

The model we are learning is the distribution of the next time step given the past and the current covariates, i.e. p(x_t | x_{t-1}, x_{t-2}, …, c_t). Now, the lag vector at time t-1 contains some representation of t-1, and the time feature contains the covariates of t, which allow us to condition the forecast as we wanted.
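A minimal sketch of the input construction described here, with hypothetical names and toy data (the real model builds these slices as tensors and also appends scaling statistics):

```python
lags = [1, 2, 7]          # example lag indices
x = list(range(20))       # toy 1-d series: x_t = t

def lag_vector(t):
    # temporal "token" for step t: the chosen past values x_{t-lag}
    return [x[t - lag] for lag in lags]

def model_input(t, c_t):
    # transformer input at step t: lag vector ++ current covariates c_t
    return lag_vector(t) + c_t

# e.g. at t = 10 with a scalar time feature c_10 = [0.5]:
inp = model_input(10, [0.5])   # [x_9, x_8, x_3, 0.5] == [9, 8, 3, 0.5]
```

The key point is visible in the last line: the values entering the input are strictly from the past (x_9, x_8, x_3), while the covariate 0.5 belongs to the step being represented.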

Just wanted to ask: is all this clear up till here?

Hi @kashif,

I sincerely appreciate your swift response to my inquiry. I comprehend that lagging is responsible for generating the mentioned vectors, as elucidated in your explanation.


  1. Considering that self-attention already cross-multiplies the various positions, the lagging operation introduces additional computational complexity into the self-attention mechanism (the multiplications between different steps are already done in self-attention). An alternative approach might be to increase dimensionality with a Linear layer or an Embedding layer. But I guess that in practice, tuning gives the best performance with lagging.

  2. Furthermore, x_t seems not to be available in p(x_t | x_{t-1}, x_{t-2}, …, c_t), because we lag the series by at least one step (at least in training).

Thank you once again for your assistance.

Best regards,


so regarding

  1. Yes, your intuition is correct: what the lag features do, in some sense, is offer a trade-off between sequence length and an inductive bias (coming from the chosen lag indices). Since attention is quadratic, this helps. The lag features bring in information from the very distant past which, without them, would require a very long sequence length. Thus, with a shorter sequence, we are packing information into the vector’s coordinates. Yes, I suppose a linear “tokenization” would work, but even with a 1-d time series there are the mu and sigma of the window that need to be passed in, so I found lagging to be the most straightforward thing to do for time-series tokenization. The downside of lag features is that they require an initially larger context window from which to create the context-window vectors… another downside is in the case of multivariate inputs: depending on the size of the multivariate vector, with lags the resulting input becomes some multiple of it, and such a large input size to the transformer or any other layer might not be practical…
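Both downsides can be quantified with a quick sketch (the function names here are mine; the history requirement mirrors the model’s internal _past_length = context_length + max(lags_sequence), which I am assuming from the slices quoted above):

```python
def required_history(context_length, lags_sequence):
    # raw observations needed to build `context_length` lag vectors:
    # the oldest vector reaches back max(lags_sequence) extra steps
    return context_length + max(lags_sequence)

def lag_input_dim(n_variates, lags_sequence, n_time_features=0):
    # multivariate blow-up: each lag index contributes a full variate vector
    return n_variates * len(lags_sequence) + n_time_features

required_history(24, [1, 2, 3, 7, 14])     # 38 raw observations needed
lag_input_dim(100, [1, 2, 3, 7, 14], 4)    # 504-dim input per position
```

So a 100-variate series with five lags already produces a 504-dimensional input per position, which illustrates why lagging becomes impractical for large multivariate vectors.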

  2. Yes, the x_t is in the loss(…), while the conditioning is what goes into the input of the transformer to get the representation. That is all I wanted to say with that part: the input has the values of x_{t-1} together with the time features c_t.

please let me know if you have more questions!

Dear @kashif,

I greatly appreciate your assistance.

I have a couple more questions:

  1. If my understanding is correct, it seems that we exclude the last step of the observed context past_time_features when passing it to the encoder. I presume this is due to the constraint that the smallest value of lags_sequence must be 1.

  2. I’m curious whether this transformer can be applied to a multivariate signal where the goal is to predict one of its X features (the N:1 case).

Thank you for your guidance.

Best regards,