Hello everyone! I’d like to train a BERT model on time-series data. Let me briefly describe the data I’m using before getting to the issue I’m facing.

I’m working with 90-second windows, and I have a 100-dim embedding for each second (i.e. 90 embeddings of size 100 per window). My goal is to predict a binary label (0 or 1) for each second (i.e. a vector of 0s and 1s of length 90).

My first idea was to approach this as a multi-label classification problem: BERT would produce a vector of size 90 filled with numbers between 0 and 1, and I would train it with nn.BCELoss against the 0/1 targets. A simple analogy would be to consider each second as a *word*, and the 100-dim embedding I have access to as the corresponding *word embedding*. I would then like to train BERT (from scratch) on these sequences of 100-dim embeddings (all sequences have the same length: 90).
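To make the setup concrete, here is a minimal sketch of what I have in mind. It is not actual BERT: it uses PyTorch's `nn.TransformerEncoder` as a BERT-style stand-in, and the class name `PerSecondTagger`, the hidden size of 128, and the layer/head counts are assumptions made up for illustration. It outputs one logit per second and uses `nn.BCEWithLogitsLoss`, which applies the sigmoid internally and is the more numerically stable equivalent of a sigmoid followed by `nn.BCELoss`.

```python
import torch
import torch.nn as nn

SEQ_LEN, EMB_DIM, HIDDEN = 90, 100, 128  # HIDDEN is an assumed model width


class PerSecondTagger(nn.Module):
    """Hypothetical BERT-style encoder that consumes embeddings directly."""

    def __init__(self, emb_dim=EMB_DIM, hidden=HIDDEN, n_layers=2,
                 n_heads=4, max_len=SEQ_LEN):
        super().__init__()
        # Project the 100-dim inputs to the model width (replaces the
        # word-embedding lookup that BERT would do on token ids).
        self.input_proj = nn.Linear(emb_dim, hidden)
        # Learned position embeddings, one per second in the window.
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads,
            dim_feedforward=4 * hidden, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per position, i.e. one binary prediction per second.
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        # x: (batch, seq_len, emb_dim)
        positions = torch.arange(x.size(1), device=x.device)
        h = self.input_proj(x) + self.pos_emb(positions)
        h = self.encoder(h)
        return self.head(h).squeeze(-1)  # (batch, seq_len) logits


# Random stand-in data, only to check shapes and that the loss backpropagates.
torch.manual_seed(0)
model = PerSecondTagger()
x = torch.randn(8, SEQ_LEN, EMB_DIM)
y = torch.randint(0, 2, (8, SEQ_LEN)).float()

logits = model(x)
loss = nn.BCEWithLogitsLoss()(logits, y)
loss.backward()
```

At inference time, `torch.sigmoid(logits)` would give the 90 values between 0 and 1 described above.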

The problem: when dealing with textual inputs, we simply add the CLS and SEP tokens to the input sequences, and let the tokenizer and the model do the rest. When training directly on embeddings, how should the CLS and SEP tokens be handled?

One idea that came to mind was to add a 100-dim embedding at position 0 standing for the CLS token, as well as a 100-dim embedding at position 91 (90+1) standing for the SEP token. But I don’t know what embeddings I should use for these two tokens, and I’m not sure that’s a good solution either.
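One possible way to do this, sketched below, would be to make the two vectors learnable parameters that are trained together with the rest of the model, rather than picking fixed values by hand. The module name `SpecialTokenWrapper` and the 0.02 initialization scale are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class SpecialTokenWrapper(nn.Module):
    """Hypothetical helper: adds learnable CLS/SEP vectors around a sequence."""

    def __init__(self, emb_dim=100):
        super().__init__()
        # Small random init (assumed scale); both vectors are updated by
        # backprop like any other weight.
        self.cls = nn.Parameter(torch.randn(1, 1, emb_dim) * 0.02)
        self.sep = nn.Parameter(torch.randn(1, 1, emb_dim) * 0.02)

    def forward(self, x):
        # x: (batch, 90, emb_dim)
        batch = x.size(0)
        cls = self.cls.expand(batch, -1, -1)
        sep = self.sep.expand(batch, -1, -1)
        # Result: CLS at position 0, the 90 seconds at 1..90, SEP at 91.
        return torch.cat([cls, x, sep], dim=1)


wrapper = SpecialTokenWrapper(emb_dim=100)
x = torch.randn(4, 90, 100)
out = wrapper(x)  # (4, 92, 100)
```

With this layout, the per-second predictions would be read from positions 1 to 90 of the encoder output, and the outputs at positions 0 and 91 would be ignored in the loss. Since the label here is predicted per position rather than per sequence, the two special tokens may not be strictly necessary at all.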

Thank you!