I have seen several papers using a language model like BERT in combination with an LSTM. I am wondering whether, and how, padded sequences are handled in that approach.
As a small example (without full torch module and training loop):
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# last_hidden_state is (batch, seq_len, hidden), so the LSTM needs batch_first=True
lstm = torch.nn.LSTM(768, 256, batch_first=True)

inputs = tokenizer(['this is a text'], max_length=10,
                   padding='max_length', return_tensors='pt')
outputs = model(**inputs)
lstm_out, (h_n, c_n) = lstm(outputs.last_hidden_state)
```
Will the embeddings of padding tokens then “distort” the output of the LSTM layer? If so, how can I avoid that?
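One idea I had (I'm not sure it's the right approach) is to use `torch.nn.utils.rnn.pack_padded_sequence` with the lengths taken from the tokenizer's `attention_mask`, so the LSTM never sees the padded positions. Here is a minimal sketch with random tensors standing in for BERT's `last_hidden_state`, to keep it self-contained:

```python
import torch

# Stand-ins for what BERT would return: (batch, seq_len, hidden) hidden states
# plus the attention_mask produced by the tokenizer (1 = real token, 0 = pad).
hidden = torch.randn(2, 10, 768)
attention_mask = torch.tensor([[1]*4 + [0]*6,
                               [1]*7 + [0]*3])

lstm = torch.nn.LSTM(768, 256, batch_first=True)

# Real sequence lengths come straight from the attention mask.
lengths = attention_mask.sum(dim=1)

# Packing makes the LSTM skip the padded timesteps entirely, so the
# padding embeddings cannot influence the hidden state.
packed = torch.nn.utils.rnn.pack_padded_sequence(
    hidden, lengths.cpu(), batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded (batch, seq_len, 256) tensor if needed;
# positions beyond each sequence's length are zero-filled.
out, out_lengths = torch.nn.utils.rnn.pad_packed_sequence(
    packed_out, batch_first=True, total_length=10)

# h_n[-1] holds the output at the last *real* timestep of each sequence.
print(out.shape, h_n.shape, out_lengths.tolist())
```

Would that be the correct way to keep the padding embeddings from affecting the LSTM output, or is there a more standard pattern?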