Is it reasonable to pretrain by masking certain dimensions of each vector, rather than individual tokens?

Let’s say I want to adapt Transformers to a non-NLP task, like financial data or a multiplayer online video game. You can imagine that the high-dimensional vector for each input will contain information that pertains to different events. For example, the first 10 dimensions might describe player 1, and the next 10 dimensions might describe player 2.

If I were to extend the pre-training exercise to these non-NLP tasks, I think it could be reasonable to mask the actions of certain players and train the model to predict those actions back. This would essentially mean masking certain dimensions of a vector rather than masking the entire “input”.
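To make the idea concrete, here’s a minimal sketch of what I mean (the layout, shapes, and mask-flag trick are all hypothetical, just for illustration): zero out one player’s block of dimensions and keep the original values as the reconstruction target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each timestep is a 20-dim vector,
# dims 0-9 describe player 1, dims 10-19 describe player 2.
seq_len, dim = 8, 20
player1_dims = slice(0, 10)

x = rng.normal(size=(seq_len, dim))

# Mask player 1's dimensions: zero them out and append a mask flag
# so the model can tell "masked" apart from a genuine zero value.
masked = x.copy()
masked[:, player1_dims] = 0.0
mask_flag = np.ones((seq_len, 1))  # player 1 is masked at every timestep here
model_input = np.concatenate([masked, mask_flag], axis=1)

# The reconstruction loss would only cover the masked dimensions,
# e.g. mean squared error against the original player-1 block.
target = x[:, player1_dims]
print(model_input.shape)  # (8, 21)
print(target.shape)       # (8, 10)
```

In practice you’d presumably sample which player to mask per example, analogous to sampling which tokens to mask in masked language modeling.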

My question is: is this reasonable to do and is this even the right approach?

I don’t know what kind of input embeddings you’d be working with in that case, but the problem you’ll probably run into is that latent embeddings are usually not as nicely disentangled as you’ve described here. We sometimes talk about them that way for illustrative purposes, but in reality the information describing “player 1” is probably distributed across the entire vector rather than confined to some subset of vector positions.

Hi Joeddav,

Naive question here, but rather than using learned embeddings as in the case of words, if I directly construct the input vector so that player1’s actions are described by dimensions 1-5, player2’s actions by dimensions 6-10, and so on, does that mean that each player’s information is disentangled by design?
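As a sketch of what I mean by “disentangled by design” (the specific features are made up), each player’s features would be written into a fixed slice of the vector:

```python
import numpy as np

# Hypothetical per-player features: 5 numbers each,
# e.g. x position, y position, health, action id, score.
def encode_state(player1_feats, player2_feats):
    vec = np.zeros(10)
    vec[0:5] = player1_feats    # dims 1-5: player 1
    vec[5:10] = player2_feats   # dims 6-10: player 2
    return vec

v = encode_state([1.0, 2.0, 0.9, 3.0, 10.0],
                 [4.0, 5.0, 0.5, 1.0, 7.0])

# With this layout, "mask player 1" is just zeroing vec[0:5];
# the positions themselves carry the meaning.
print(v)
```

By construction the slices don’t overlap in the input, even if the model’s internal representations might still entangle them.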

If so, would my question become a reasonable one, or is there another way to encode multi-player information in a Transformer model?

Thanks so much

Sure, that sounds like a reasonable thing to try. Let us know how it goes – I’m sure we’d all learn something :slight_smile: