I am working on a time series problem. I have a dataset captured over several usage sessions of a machine. The input is 7 feature time series and the output is 3 target time series. The output at time step t is directly determined by the input at time step t-1. However, the machine's internal physical characteristics usually change over time (e.g. it expands or contracts), which can indirectly affect the output. This change is very tiny and has only a very tiny impact on the input. Apart from that, sometimes during usage of the machine I don't get to see the actual output, so I cannot always feed past ground truth output to the model when predicting the next output.
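For concreteness, this is roughly how one usage session is laid out (a minimal sketch with hypothetical array names, not my actual loading code):

```python
import numpy as np

# One usage session (illustrative shapes): 7 feature series and 3 target series,
# where targets[t] is driven mainly by features[t-1].
n_steps = 5000
features = np.random.randn(n_steps, 7).astype(np.float32)   # inputs
targets = np.random.randn(n_steps, 3).astype(np.float32)    # outputs
# Some targets are unobserved during usage; mark them with a mask.
target_observed = np.random.rand(n_steps) > 0.1              # True where ground truth exists
```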
I tried an LSTM model that accepts the feature time series as inputs and predicts the target time series. It worked, but not satisfactorily: for some usage sessions it still gives wrong predictions.
The LSTM consumes all 24 GB of GPU memory during training (especially due to unrolling over a time window of size 200). So I was exploring other, smarter approaches, especially time series transformer approaches.
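For reference, the LSTM setup looks roughly like this (a minimal sketch, not my exact code; the hidden size and layer count are illustrative, only the 200-step window matches what I described above):

```python
import torch
import torch.nn as nn

class SeqRegressor(nn.Module):
    """Maps a window of 7 feature series to 3 target values at each step."""
    def __init__(self, n_features=7, n_targets=3, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):            # x: (batch, 200, 7)
        out, _ = self.lstm(x)        # (batch, 200, hidden)
        return self.head(out)        # (batch, 200, 3)

model = SeqRegressor()
x = torch.randn(32, 200, 7)
y_hat = model(x)                     # predictions for every step in the window
```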
To start with transformers, I tried the PatchTSTForRegression implementation from the Hugging Face library. It worked somewhat, but worse than the LSTM. (The official blog explains how to use PatchTSTForPrediction. I guess prediction means forecasting the input time series for future time steps. My input and output time series are different, so I felt I had to opt for PatchTSTForRegression.)
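This is roughly how I wired it up (a sketch; the patch length, stride, and other hyperparameters here are illustrative assumptions, not tuned values):

```python
import torch
from transformers import PatchTSTConfig, PatchTSTForRegression

config = PatchTSTConfig(
    num_input_channels=7,   # the 7 feature series
    num_targets=3,          # the 3 target values per window
    context_length=200,
    patch_length=16,        # illustrative
    patch_stride=8,         # illustrative
)
model = PatchTSTForRegression(config)

past_values = torch.randn(32, 200, 7)    # (batch, context_length, channels)
target_values = torch.randn(32, 3)       # one 3-dim target per window
out = model(past_values=past_values, target_values=target_values)
print(out.loss, out.regression_outputs.shape)
```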
I went through the PatchTST paper and found that the Hugging Face implementation has many concepts that are not discussed in the PatchTST paper (for example, output distributions). So I thought I had better try the official PatchTST implementation. It turns out that the official repo also implements mainly prediction. It has three prediction modes:
- MM (multiple input and output time series)
- SS (single input and output time series)
- MS (multiple input time series and single output time series): in this mode too, the model takes both feature and target time series as input and outputs all time series, but when calculating the loss it only uses the last time series in the output (hence the "S" in MS); see the sketch after this list.
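If I read the official training loop correctly, the MS behaviour boils down to slicing the channel dimension before computing the loss. The snippet below is my paraphrase with dummy tensors, not a copy of the repo code:

```python
import torch
import torch.nn as nn

features = 'MS'
pred_len = 96
criterion = nn.MSELoss()

# Fake model outputs and ground truth with 10 channels (features + targets).
outputs = torch.randn(8, pred_len, 10)
batch_y = torch.randn(8, pred_len, 10)

# In 'MS' mode the model still predicts all channels,
# but only the last channel contributes to the loss.
f_dim = -1 if features == 'MS' else 0
loss = criterion(outputs[:, -pred_len:, f_dim:], batch_y[:, -pred_len:, f_dim:])
```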
So it requires ground truth targets at time step t in order to predict the targets at future time steps (t+1 to t+prediction_window). But I want to predict the target at time step t using current (t) and past (back to t-sequence_length) features, with windows built roughly as in the sketch below.
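Here is a minimal sketch of the kind of windowing I mean (hypothetical helper name, not code from either repo):

```python
import numpy as np

def make_windows(features, targets, seq_len=200):
    """Build (X, y) pairs where X is features[t-seq_len+1 .. t] and y is targets[t]."""
    X, y = [], []
    for t in range(seq_len - 1, len(features)):
        X.append(features[t - seq_len + 1 : t + 1])   # (seq_len, 7)
        y.append(targets[t])                          # (3,)
    return np.stack(X), np.stack(y)

features = np.random.randn(1000, 7).astype(np.float32)
targets = np.random.randn(1000, 3).astype(np.float32)
X, y = make_windows(features, targets)
print(X.shape, y.shape)   # (801, 200, 7) (801, 3)
```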
I tried modifying Flatten_Head to output 3 time series, but it did not learn at all to predict the target time series for the next single time step.
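This is roughly what I mean by modifying the head (a sketch, assuming the backbone output follows the paper's (batch, n_vars, d_model, num_patches) layout; dimensions are illustrative and this is not my exact code):

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Replaces Flatten_Head: flattens all channels and patches and
    emits 3 target values for a single step instead of a forecast window."""
    def __init__(self, n_vars=7, d_model=128, num_patches=24, n_targets=3, dropout=0.1):
        super().__init__()
        self.flatten = nn.Flatten(start_dim=1)      # -> (B, n_vars * d_model * num_patches)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(n_vars * d_model * num_patches, n_targets)

    def forward(self, z):                           # z: (B, n_vars, d_model, num_patches)
        return self.linear(self.dropout(self.flatten(z)))   # (B, 3)

head = RegressionHead()
z = torch.randn(16, 7, 128, 24)
print(head(z).shape)                                # torch.Size([16, 3])
```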
Since I have target time series values for all time steps in the training dataset, I also tried passing the feature time series values from t to t-sequence_length together with the past ground truth targets (t-1 to t-sequence_length-1), 10 time series in total, built roughly as in the sketch below. It still did not beat the LSTM's performance. (My plan was to pass past predictions instead of ground truth during the last few epochs and at inference.)
Now I am thinking of trying the same (passing past target ground truth) with the Hugging Face implementation. I may also try PatchTSMixerForRegression. I also thought of trying a vanilla transformer, but it might take more time to implement from scratch (compared to existing time series transformer implementations like PatchTST and PatchTSMixer) and may still end up with poorer performance.
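For PatchTSMixerForRegression I would expect the wiring to look roughly like this (a sketch; all hyperparameter values are illustrative assumptions):

```python
import torch
from transformers import PatchTSMixerConfig, PatchTSMixerForRegression

config = PatchTSMixerConfig(
    context_length=200,
    num_input_channels=7,
    patch_length=8,        # illustrative
    patch_stride=8,        # illustrative
    num_targets=3,
)
model = PatchTSMixerForRegression(config)

past_values = torch.randn(32, 200, 7)    # (batch, context_length, channels)
target_values = torch.randn(32, 3)       # one 3-dim target per window
out = model(past_values=past_values, target_values=target_values)
print(out.loss, out.regression_outputs.shape)
```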
I have spent many months on this problem and am now wondering what I should do to quickly beat the LSTM's performance. I have the following doubts:
- What other network architecture / model options do I have?
- Will feeding past targets (ground truth and/or past predictions) along with the features give the same effect as teacher forcing, especially since in teacher forcing the past targets are fed to the decoder and not the encoder, and PatchTST is an encoder-only model?