Generate embeddings with custom (non-text) dataset

Long story short, I have a dataset of items represented as variable-length integer sequences (e.g. [34, 827, 4011, 5, ...]), analogous to text after tokenization and numericalization. I want to use masked language modeling to learn self-supervised embeddings of these integer sequences. What would be the best way to do this?
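For concreteness, here's a toy version of what the data looks like and how I'm currently batching it (the `PAD_ID` and the example sequences are made up, not from the real dataset):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assuming id 0 is reserved for padding

sequences = [
    [34, 827, 4011, 5],
    [12, 9, 77, 301, 8, 4500],
    [6021, 3],
]

def collate(batch):
    # pad variable-length sequences to the longest one in the batch
    tensors = [torch.tensor(seq, dtype=torch.long) for seq in batch]
    input_ids = pad_sequence(tensors, batch_first=True, padding_value=PAD_ID)
    attention_mask = (input_ids != PAD_ID).long()
    return input_ids, attention_mask

input_ids, attention_mask = collate(sequences)
```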

I can see how to rig up a BERT-style model that will randomly mask tokens and predict them, but I'm stuck on the best way to generate the final embedding (the thing I actually want). Sure, I can chuck a [CLS] token into each sequence, but I don't know how useful that would be without the next-sentence-prediction task used by BERT. Maybe average the token representations from one of the layers during inference? Does anyone have ideas on this?
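To make the question concrete, these are the two pooling strategies I'm weighing at inference time; the `encoder` call and its output shape are placeholders, not any particular library's API:

```python
import torch

def cls_embedding(last_hidden_state):
    # assumes a [CLS]-style token was prepended at position 0 of every sequence
    return last_hidden_state[:, 0, :]

def mean_pooled_embedding(last_hidden_state, attention_mask):
    # average token representations, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# last_hidden_state = encoder(input_ids, attention_mask)  # (batch, seq_len, hidden)
# embedding = mean_pooled_embedding(last_hidden_state, attention_mask)
```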

Extra info:
The data doesn’t have a “paired data” framework, so pair-based SSL isn’t possible.
There isn’t a clear concept of “data augmentation” for this data, so augmentation-based SSL methods like BYOL or SimSiam are out.
Masked token prediction makes a lot of sense for this task, because the unmasked tokens carry contextual information that should enable correct prediction of the masked tokens (rough sketch of what I have in mind below).
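For what it's worth, this is roughly the masking/loss step I'm picturing: a minimal sketch assuming `MASK_ID` is a reserved id in the vocabulary and the model outputs per-token logits over the item vocabulary (names here are placeholders):

```python
import torch
import torch.nn.functional as F

MASK_PROB = 0.15

def mask_tokens(input_ids, attention_mask, mask_id):
    labels = input_ids.clone()
    # choose ~15% of real (non-pad) positions to mask
    probs = torch.full(input_ids.shape, MASK_PROB)
    masked = torch.bernoulli(probs).bool() & attention_mask.bool()
    labels[~masked] = -100              # only score the masked positions
    masked_ids = input_ids.clone()
    masked_ids[masked] = mask_id        # (BERT also keeps/randomizes some; omitted here)
    return masked_ids, labels

# logits = model(masked_ids, attention_mask)   # (batch, seq_len, vocab)
# loss = F.cross_entropy(
#     logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
# )
```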