Train BERT on time-series data

Hello everyone! I’d like to train a BERT model on time-series data. Let me briefly describe the data I’m using before getting to the issue I’m facing.

I’m working with 90-second windows, and I have access to a 100-dim embedding for each second (i.e. 90 embeddings of size 100). My goal is to predict a binary label (0 or 1) for each second (i.e. a vector of 0s and 1s of length 90).

My first idea was to approach this as a multi-label classification problem, where BERT would produce a vector of size 90 filled with numbers between 0 and 1, trained with nn.BCELoss. A simple analogy is to consider each second as a word, and the 100-dim embedding I have access to as the corresponding word embedding. I would then train BERT from scratch on these sequences of 100-dim embeddings (all sequences have the same length: 90).
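For concreteness, here is a rough sketch of how I picture this setup, assuming the Hugging Face `transformers` `BertModel` (which accepts `inputs_embeds` directly) and using `BCEWithLogitsLoss` instead of a sigmoid + `nn.BCELoss` for numerical stability. The class name, projection layer, and the hidden size / layer counts below are placeholders I made up, not anything prescribed:

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class SecondLevelBert(nn.Module):
    """Per-second binary classification over 90-step windows of 100-dim embeddings."""
    def __init__(self, input_dim=100, hidden_dim=256, num_heads=4, num_layers=4):
        super().__init__()
        config = BertConfig(
            hidden_size=hidden_dim,
            num_attention_heads=num_heads,
            num_hidden_layers=num_layers,
            intermediate_size=4 * hidden_dim,
            max_position_embeddings=128,  # >= 90 (+2 if CLS/SEP positions are added later)
            vocab_size=1,                 # unused: we bypass the token embedding lookup
        )
        self.proj = nn.Linear(input_dim, hidden_dim)  # map 100-dim inputs to the model width
        self.bert = BertModel(config)
        self.classifier = nn.Linear(hidden_dim, 1)    # one logit per second

    def forward(self, x, labels=None):
        # x: (batch, 90, 100), labels: (batch, 90) with 0/1 entries
        hidden = self.bert(inputs_embeds=self.proj(x)).last_hidden_state  # (batch, 90, hidden_dim)
        logits = self.classifier(hidden).squeeze(-1)                      # (batch, 90)
        loss = None
        if labels is not None:
            loss = nn.BCEWithLogitsLoss()(logits, labels.float())
        return loss, logits

# Usage on dummy data:
model = SecondLevelBert()
x = torch.randn(8, 90, 100)             # batch of 8 windows
labels = torch.randint(0, 2, (8, 90))
loss, logits = model(x, labels)
loss.backward()
```

This treats the task as per-position (token-level) classification rather than a single multi-label prediction, but the loss and shapes end up the same as in the description above.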

The problem: with textual inputs, we simply add the CLS and SEP tokens to the input sequences and let the tokenizer and the model do the rest. When training directly on embeddings, what should we do to account for the CLS and SEP tokens?

One thing that came to mind was to add a 100-dim embedding at position 0 standing for the CLS token, as well as a 100-dim embedding at position 90+1=91 standing for the SEP token. But I don’t know which embeddings I should use for these two tokens, and I’m not sure that’s a good solution either.
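If explicit CLS/SEP positions do turn out to be useful, one option I’ve been considering is to make those two embeddings learnable parameters, initialized randomly and trained with the rest of the model, much like BERT’s own special-token embeddings are learned. A minimal sketch (the names `cls_emb`/`sep_emb` are mine, and this assumes the projection-based model sketched above):

```python
import torch
import torch.nn as nn

class AddSpecialTokens(nn.Module):
    """Prepend a learnable CLS vector and append a learnable SEP vector
    to a (batch, 90, 100) sequence, giving (batch, 92, 100)."""
    def __init__(self, dim=100):
        super().__init__()
        # Randomly initialized and learned jointly with the rest of the model.
        self.cls_emb = nn.Parameter(torch.randn(dim) * 0.02)
        self.sep_emb = nn.Parameter(torch.randn(dim) * 0.02)

    def forward(self, x):
        batch = x.size(0)
        cls = self.cls_emb.view(1, 1, -1).expand(batch, -1, -1)  # (batch, 1, 100)
        sep = self.sep_emb.view(1, 1, -1).expand(batch, -1, -1)  # (batch, 1, 100)
        return torch.cat([cls, x, sep], dim=1)                   # (batch, 92, 100)
```

With this, the labels only cover the 90 middle positions, so the corresponding outputs would be sliced (e.g. `logits[:, 1:-1]`) before computing the loss. Whether the CLS/SEP positions actually help for per-second classification is exactly what I’m unsure about.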

Thank you!

@clems I am running a similar experiment and have posted my thoughts on it here. The largest difference from your setup is that I don’t have to run the self-supervised training that BERT did, since I have labels. Since it seems you are running the self-supervised training, were you able to obtain any results with your initial suggestion? I’d be interested to hear your findings.