Hi, I am using BERT to embed word similarities in a model I am working on. Each sample has a list of object labels of variable length associated with it, for example:
['dog', 'cat', 'car', 'bus']
My implementation so far has been to join all the object labels into a single string separated by
'[SEP]' and pass this through the BERT network:
['dog', 'cat', 'car', 'bus'] – combine →
'dog [SEP] cat [SEP] car [SEP] bus' – tokenize → bert()
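To make the current approach concrete, here is a minimal sketch. The Hugging Face `transformers` library and the `bert-base-uncased` checkpoint are my assumptions, not something stated in the original setup:

```python
# Sketch of the concatenation approach. The library and checkpoint
# names below are assumptions for illustration only.
labels = ["dog", "cat", "car", "bus"]
text = " [SEP] ".join(labels)
print(text)  # dog [SEP] cat [SEP] car [SEP] bus

# With `transformers` installed, the joined string would then be
# tokenized and run through the model in one call:
# from transformers import BertTokenizer, BertModel
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# model = BertModel.from_pretrained("bert-base-uncased")
# inputs = tokenizer(text, return_tensors="pt")
# hidden = model(**inputs).last_hidden_state  # [1, seq_len, 768]
```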
I am not sure if this is the best way to handle this situation, and was thinking of passing each object label separately through the model and performing some sort of combination of the embeddings at the end, for example:
[bert('dog'), bert('cat'), bert('car'), bert('bus')] → combine
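One simple way to do that final combination is mean pooling over the per-label vectors (max pooling or summing would work the same way). A minimal sketch in PyTorch, where random vectors stand in for the actual `bert(...)` outputs:

```python
import torch

def combine_embeddings(label_embeddings):
    """Average a variable-length list of per-label embeddings into
    one fixed-size vector. Mean pooling is just one option; max
    pooling or summation are drop-in alternatives."""
    return torch.stack(label_embeddings).mean(dim=0)

# Stand-ins for bert('dog'), bert('cat'), ...: one 768-dim vector
# per label (768 assumes a base-size BERT hidden dimension).
embs = [torch.randn(768) for _ in range(4)]
combined = combine_embeddings(embs)
print(combined.shape)  # torch.Size([768])
```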
For this method I have modified my code so that a tensor of shape
batch_size x T x (tokens per object label) is generated, where
T is the number of objects for the sample and can vary. With padding, the shape of this tensor is
[192, 8, 5]. I am unsure how I would pass this through to
bert(), though, since each of the T tokenized object labels would have to be passed separately.
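One common way around the per-label loop is to fold the object dimension into the batch dimension, run BERT once over all labels, and reshape back. A sketch under the shapes from above, with a random tensor standing in for the model's pooled output:

```python
import torch

# Shapes taken from the post: [192, 8, 5] =
# batch_size x T (objects per sample) x tokens per object label.
batch_size, T, L = 192, 8, 5
token_ids = torch.randint(0, 30522, (batch_size, T, L))

# Fold the object dimension into the batch dimension so every
# padded label becomes its own "sentence":
flat_ids = token_ids.view(batch_size * T, L)  # [1536, 5]

# flat_ids (plus a similarly flattened attention mask) could now be
# passed to bert() in a single forward call; a random tensor stands
# in for the pooled model output here.
hidden_dim = 768  # assumed base-size BERT hidden dimension
flat_emb = torch.randn(batch_size * T, hidden_dim)

# Unfold back to one embedding per object per sample:
per_object = flat_emb.view(batch_size, T, hidden_dim)
print(per_object.shape)  # torch.Size([192, 8, 768])
```

Since T is padded to 8, samples with fewer objects will produce embeddings for the padding slots; those should be masked out before any pooling across objects.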
I don't have a great deal of experience with BERT, so I was hoping someone more experienced might be able to give me some advice. I greatly appreciate any help/suggestions.