I wonder why you chose 2 as the embedding size.
Is there a specific reason?

from transformers import TimeSeriesTransformerConfig, TimeSeriesTransformerForPrediction

config = TimeSeriesTransformerConfig(
    prediction_length=prediction_length,
    # context length:
    context_length=prediction_length * 2,
    # lags coming from helper given the freq:
    lags_sequence=lags_sequence,
    # we'll add 2 time features ("month of year" and "age", see further):
    num_time_features=len(time_features) + 1,
    # we have a single static categorical feature, namely time series ID:
    num_static_categorical_features=1,
    # it has 366 possible values:
    cardinality=[len(train_dataset)],
    # the model will learn an embedding of size 2 for each of the 366 possible values:
    embedding_dimension=[2],
    # transformer params:
    encoder_layers=4,
    decoder_layers=4,
    d_model=32,
)
model = TimeSeriesTransformerForPrediction(config)

And I don't understand the comment about the cardinality.
Why are there 366 possible values? I thought it was 366 different regions, not 366 measurements of the same region.

The idea of using categorical features is that one converts each specific ID to a vector whose size we can choose. The vectors for the individual IDs start out random, and through the learning process the network moves them around as appropriate. This is achieved via the nn.Embedding layer in PyTorch. To initialize this layer we need to specify the total number of unique IDs (the cardinality of the categorical feature, which depends on the dataset and is not something we control) and the size of the resulting vector (which, as mentioned, we can control). Thus the [2] is the size of the embedding vector and can potentially be tuned. The cardinality is 366 since this particular dataset has 366 time series, and we choose to give them the IDs 0, 1, …, 365.
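To make the two numbers concrete, here is a minimal sketch of such an embedding layer in plain PyTorch. It is only an illustration of nn.Embedding with the values from this thread (366 IDs, vectors of size 2), not the library's internal code; the variable names are made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

num_series = 366     # cardinality: one ID per time series (0..365), fixed by the dataset
embedding_dim = 2    # size of each learned vector, a choice we can tune

# A lookup table with one trainable row per ID, randomly initialized.
embedding = nn.Embedding(num_embeddings=num_series, embedding_dim=embedding_dim)

# Look up the vectors for the first, second and last time series.
series_ids = torch.tensor([0, 1, 365])
vectors = embedding(series_ids)

print(embedding.weight.shape)  # the full table: one row per time series
print(vectors.shape)           # one vector of size 2 per requested ID
```

The rows of embedding.weight are ordinary trainable parameters, so the optimizer updates them together with the rest of the network.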

Wow, I did not expect the author to answer the question.
Thank you very much!!!

But I am still confused.

In this sentence, what does "the specific ID" mean?
Is it just an index ID? (It seems so to me.) Then what exactly is it learning from the time series values?

It would be great if I could read some resources about this, such as papers or articles.
I am curious about the logic behind it.
Thx!!

What are the learned parameters you refer to (mean, std, seasonal data, trend, etc.)?
And what is the probability distribution you refer to? Specifically, what is on its x-axis?

This seems to be related to my question, I think.
Thx!

So, apart from the target 1-d data that we wish to predict into the future, the datasets typically come with extra information. For one, we know the date-time of each value in the 1-d array, and we can incorporate it into the model. Similarly, we can assign each time series in the dataset an ID, e.g. 0 for the first time series, 1 for the next, etc., and also give the model this information while training, since, if you recall, we have a single model which we train over all the time series of a dataset.

What will the model learn? Well, initially the model will assign random vectors to each ID, but as training progresses, similar time series might end up with vectors pointing in similar directions, etc. Of course, it could also be that the model performs better without these embeddings, and hence they are optional.

The logic here is to give the model the ability to incorporate all the information available to it.

Suppose we say that the distribution at each time point is a Gaussian, in which case the model's emission head returns the mean and std that define the Gaussian distribution. Given a distributional object, e.g. gauss = Normal(mu, sigma), one can then use the negative log-likelihood as the loss during training: -gauss.log_prob(x_{t+1}), or at inference time sample from it: gauss.sample((100,)).
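As a rough sketch of the Gaussian case just described: the GaussianHead class below is hypothetical and written only for this example (it is not the library's implementation), and the random tensors stand in for real decoder outputs and observed values. It assumes a linear projection to two numbers per time step, with softplus used to keep sigma positive.

```python
import torch
from torch import nn
from torch.distributions import Normal

torch.manual_seed(0)


class GaussianHead(nn.Module):
    """Hypothetical emission head: maps a hidden state to a Normal(mu, sigma)."""

    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, 2)

    def forward(self, hidden: torch.Tensor) -> Normal:
        mu, raw_sigma = self.proj(hidden).unbind(dim=-1)
        # softplus maps any real number to a positive one; the small
        # constant guards against sigma collapsing to exactly zero.
        sigma = nn.functional.softplus(raw_sigma) + 1e-6
        return Normal(mu, sigma)


hidden = torch.randn(4, 32)  # stand-in for decoder outputs: batch of 4, d_model=32
x_next = torch.randn(4)      # stand-in for the observed next values

head = GaussianHead(d_model=32)
gauss = head(hidden)

# Training: negative log-likelihood of the observed next value.
loss = -gauss.log_prob(x_next).mean()
loss.backward()

# Inference: draw 100 possible next values for each of the 4 series.
samples = gauss.sample((100,))
print(loss.item(), samples.shape)
```

On the x-axis question: here the x-axis of each distribution is the value of the time series at the next time step, and the 100 samples are 100 plausible values for it.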

In this way the model outputs the appropriate parameters of the distribution given the past, very similar to the NLP case, where the model outputs the parameters of a Categorical distribution (over all the tokens in the vocabulary).
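The NLP analogy can be checked in a few lines. This is a small illustration with made-up logits and token IDs (the vocabulary size of 10 is arbitrary): the negative log-likelihood under a Categorical built from the logits matches the usual cross-entropy loss.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)

vocab_size = 10                        # arbitrary toy vocabulary size
logits = torch.randn(3, vocab_size)    # stand-in for model outputs at 3 positions
next_tokens = torch.tensor([2, 7, 0])  # stand-in for the observed next tokens

# The logits are the parameters of a Categorical over the vocabulary.
cat = Categorical(logits=logits)

nll = -cat.log_prob(next_tokens).mean()    # same recipe as the Gaussian case
ce = F.cross_entropy(logits, next_tokens)  # the standard language-modeling loss

sampled = cat.sample()                     # one sampled token ID per position
print(nll.item(), ce.item(), sampled)
```

So the time series model follows the same recipe as a language model; only the family of distributions changes from Categorical to something suited to numeric data.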

Right, the distribution can be real-valued when the data is real-valued (e.g. Gaussian or Student-T), or, if the data is integer-valued count data, e.g. sales of articles, one can select the Negative-Binomial head. Those are the 3 heads I have implemented, but of course one can use any of the other distributions available in PyTorch if needed (see "Probability distributions - torch.distributions" in the PyTorch 2.1 documentation), or one can also cook up mixtures of distributions.
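For a feel of how the three families differ, here is a small sketch using torch.distributions directly. The parameter values and observations are invented for illustration; in the model they would come from the emission head. In the Hugging Face config, the choice is made through the distribution_output argument ("student_t", "normal" or "negative_binomial").

```python
import torch
from torch.distributions import NegativeBinomial, Normal, StudentT

torch.manual_seed(0)

real_obs = torch.tensor([0.3, -1.2, 2.5])   # real-valued data
count_obs = torch.tensor([0.0, 3.0, 7.0])   # integer-valued count data

# Illustrative, hand-picked parameters (normally produced by the model).
candidates = {
    "normal": (Normal(loc=0.0, scale=1.0), real_obs),
    "student_t": (StudentT(df=3.0, loc=0.0, scale=1.0), real_obs),
    "negative_binomial": (
        NegativeBinomial(total_count=torch.tensor(5.0), logits=torch.tensor(0.0)),
        count_obs,
    ),
}

nlls = {}
for name, (dist, obs) in candidates.items():
    # Same loss for every family: mean negative log-likelihood.
    nlls[name] = -dist.log_prob(obs).mean()
    print(name, nlls[name].item())

# Samples from the count distribution are non-negative whole numbers.
count_samples = candidates["negative_binomial"][0].sample((5,))
print(count_samples)
```

The training loop does not change between families: only the distribution object, and the number of parameters the head has to emit, differ.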

One is still specifying the distribution here, and the data might not fit it, in which case one can also cook up a head that learns the shape of the quantile function of the data… but yeah, this is more advanced stuff, I suppose.