Feed output from one transformer model as input to another

I am trying to train an automated essay scoring system that combines the loss for predicting essay scores with the loss for predicting whether each sentence is grammatically correct. To do this I have wrapped each sentence in the essay with its own [cls] and [sep] token, so that an essay is fed into Bert like this:

essay 1 : [cls] … sent 1 … [sep][cls] … sent 2 … [sep][cls] … sent 3 … [sep][cls] … etc
essay 2 : [cls] … sent 1 … [sep][cls] … sent 2 … [sep][cls] … sent 3 … [sep][cls] … etc

Along with this, each essay has a list of labels saying whether each sentence contains a grammatical error or not, plus a score for the essay, i.e.

essay 1 : labels: [1,0,0,1,etc…],score:38
essay 2 : labels: [1,1,0,1,etc…],score:24

(For the list of labels to be passed into a dataset it must have the same length as the input_ids, attention_mask, etc., so I have padded it with a 2 at every token that is not a [cls] token. The labels therefore actually look something like this:

[1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 1, etc…]

which corresponds to a token sequence

[cls, w, w, cls, w, w, w, cls, w, w, w, w, cls, …]

where w = word.)
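For reference, here is a rough sketch of how one essay could be encoded this way (simplified, the helper is just illustrative, and for distilroberta-base the [cls]/[sep] equivalents are actually <s>/</s>):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

def encode_essay(sentences, sent_labels, max_length=512):
    # wrap every sentence in its own [cls] … [sep] pair and pad the
    # per-sentence labels with 2's so they line up with the input_ids
    input_ids, labels = [], []
    for sent, lab in zip(sentences, sent_labels):
        sent_ids = tokenizer.encode(sent)                 # [cls] … sentence … [sep]
        input_ids.extend(sent_ids)
        labels.extend([lab] + [2] * (len(sent_ids) - 1))  # label only the [cls] position
    # truncate and pad everything to max_length
    input_ids, labels = input_ids[:max_length], labels[:max_length]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad
    attention_mask += [0] * pad
    labels += [2] * pad
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'label': labels}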

I then use this to get the indices of the [cls] tokens in each essay.
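For example, with the toy label sequence from above (cut off at the fourth [cls]):

import torch

labels = torch.tensor([[1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 1]])

cls_mask = labels < 2               # True exactly at the [cls] positions
cls_positions = cls_mask.nonzero()  # (essay index, token index) pairs
print(cls_positions)
'''
tensor([[ 0,  0],
        [ 0,  3],
        [ 0,  7],
        [ 0, 12]])
'''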

With this setup, the model's performance on grammatical error detection is comparable to feeding each sentence into Bert on its own:

sent 1 : [cls] … sent 1 … [sep]
sent 2 : [cls] … sent 2 … [sep]

However, unsurprisingly, the additional [sep] and [cls] tokens decrease the model's performance on essay scoring compared to feeding an essay into Bert normally:

essay 1 : [cls] … essay 1 … [sep]
essay 2 : [cls] … essay 2 … [sep]

To combat this, I am trying to use the vector representation of each [cls] token in the output for each essay as input to another, smaller transformer model, as done in this paper (https://arxiv.org/pdf/1903.10318.pdf). However, I cannot figure out how to do this. I have tried feeding the output into an embedding layer with inputs_embeds=True, but this does not work.
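As I understand it, in that paper the [cls] vectors are fed straight into a small inter-sentence transformer as embeddings, with no token-embedding lookup at all. Here is a rough sketch of that idea using torch.nn.TransformerEncoder, just to show what I am aiming for (the layer and head counts are placeholders, not the paper's settings):

import torch
import torch.nn as nn

hidden = 768

# a small encoder that operates directly on vectors instead of token ids
sent_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=2,
)

cls_vectors = torch.randn(2, 12, hidden)  # (batch, n_sentences, hidden) gathered from the Bert output
sent_reprs = sent_encoder(cls_vectors)    # same shape, now contextualised across sentences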

Here is a simplified version of my code so far, using just a mini batch as a test:

# encoded_dataset_train is my training dataset (type: datasets.arrow_dataset.Dataset)
mini_batch = encoded_dataset_train[:2]
print(mini_batch)
'''{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'input_ids': tensor([[    0, 23314,  5348,  ...,     1,     1,     1],
         [    0,  1360,    73,  ...,     1,     1,     1]]),
 'label': tensor([[1, 2, 2,  ..., 2, 2, 2],
         [1, 2, 2,  ..., 2, 2, 2]]),
 'score': tensor([31, 23])}'''

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('distilroberta-base')

# get the hidden states of the final layer of the model
input_ids = mini_batch['input_ids']
attention_mask = mini_batch['attention_mask']
output = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

print(output.shape)
'''
torch.Size([2, 512, 768])
'''

# get a mask that is 1 at each cls token in each essay and 0 at all other tokens
bs, tok_len = mini_batch['label'].shape

active_labels = torch.where(mini_batch['label'] < 2, 1, 0).reshape(bs, tok_len, 1).expand(bs, tok_len, 768)

print(active_labels.shape)
'''
torch.Size([2, 512, 768])
'''

# multiply the output by the mask to zero out every non-cls position
active_loss = output * active_labels

print(active_loss.shape)
'''
torch.Size([2, 512, 768])
'''

# Create my smaller model (so far I have only extracted the layers, I have not combined them yet)

embeds = model.embeddings
layers = model.encoder.layer[:2]
# a classification head would be added on top later (the base AutoModel has no classifier attribute)

# Result of trying to pass the output into the embedding layer

embeds.forward(input_ids=active_loss, inputs_embeds=True)

'''
RuntimeError: The size of tensor a (512) must match the size of tensor b (768) at non-singleton dimension 2
'''

Cheers in advance

Is this possible, or do I have to perform pooling over each sentence's token vectors to get a single vector representing that sentence and use that as the input?
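By pooling I mean something like taking the mean of each sentence's token vectors, using the [cls] positions as sentence boundaries, roughly:

import torch

def mean_pool_sentences(hidden_states, cls_positions):
    # hidden_states: (seq_len, hidden) for one essay
    # cls_positions: sorted 1-D tensor of the [cls] indices in that essay
    # (padding tokens would need to be excluded from the last span in practice)
    bounds = torch.cat([cls_positions, torch.tensor([hidden_states.shape[0]])])
    return torch.stack([
        hidden_states[start:end].mean(dim=0)  # one vector per sentence span
        for start, end in zip(bounds[:-1], bounds[1:])
    ])

hidden_states = torch.randn(13, 768)          # toy essay of 13 tokens
cls_positions = torch.tensor([0, 3, 7, 12])
print(mean_pool_sentences(hidden_states, cls_positions).shape)
'''
torch.Size([4, 768])
'''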