Multi-label sequence labeling (e.g., multi-label NER)

I am trying to create a multi-label model for sequence labeling (e.g., multi-label NER), where each token in the input can have multiple labels. For example, given the sentence “George Washington University”, the token “George” can have two labels, “B-Per” and “B-Loc”.
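
To make that concrete, assuming (purely for illustration) a 5-label scheme ["B-Per", "I-Per", "B-Loc", "I-Loc", "O"], the target for “George” would be a multi-hot vector:

label_names = ["B-Per", "I-Per", "B-Loc", "I-Loc", "O"]   # illustrative label set (5 labels)
george_target = [1, 0, 1, 0, 0]                           # both "B-Per" and "B-Loc" are active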

So far, I have implemented my model using:

import torch
from transformers import BertTokenizerFast, BertConfig, BertModel
from transformers.modeling_outputs import SequenceClassifierOutput

class seq2SeqBERT(torch.nn.Module):
	def __init__(self):
		super(seq2SeqBERT, self).__init__()
		configuration = BertConfig()
		self.bert = BertModel(configuration)           # randomly initialised BERT encoder (default config)
		self.classifier = torch.nn.Linear(768, 5)      # one logit per label, 5 labels per token
		self.criterion = torch.nn.BCEWithLogitsLoss()  # independent sigmoid per label -> multi-label loss

	def forward(self, input_ids, attention_mask, labels=None):
		embeddings = self.bert(input_ids=input_ids, attention_mask=attention_mask)
		logits = self.classifier(embeddings['last_hidden_state'])  # (batch_size, seq_len, 5)
		loss_ = None
		if labels is not None:
			# drop positions marked -100 (special tokens / sub-word continuations) before the loss
			flat_outputs = logits[labels != -100]
			flat_labels = labels[labels != -100]
			loss_ = self.criterion(flat_outputs, flat_labels)
		return SequenceClassifierOutput(loss=loss_, logits=logits,
		                                attentions=embeddings.attentions)  # None unless output_attentions=True
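
For context, a single training step looks roughly like this (the optimizer choice and the batch variables are illustrative; the labels tensor is the multi-hot format I describe below):

model = seq2SeqBERT()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # illustrative optimizer / learning rate

# batch["input_ids"], batch["attention_mask"]: (batch_size, seq_len) tensors from the tokenizer
# batch["labels"]: (batch_size, seq_len, 5) float tensor of multi-hot targets, -100 on ignored positions
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()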

My input and target sequences look something like:
Input:  [word1, word2, …, wordN]
Output: [[0,0,0,0,0], [1,0,1,0,1], …, [0,1,1,0,0]], i.e., each word is associated with a multi-hot vector over the 5 labels.
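
For reference, this is roughly how I build those label tensors (the checkpoint name, the word-level tag lists, and the helper name are just illustrative); sub-word continuations and special tokens get -100 so that the labels != -100 filter in the forward pass masks them out:

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")   # example checkpoint

def encode_words(words, word_tags):
	# word_tags: one multi-hot list of length 5 per word, e.g. [[1, 0, 1, 0, 0], ...] (illustrative)
	enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
	labels, previous = [], None
	for word_id in enc.word_ids(batch_index=0):
		if word_id is None or word_id == previous:   # special tokens and sub-word continuations
			labels.append([-100.0] * 5)
		else:
			labels.append([float(x) for x in word_tags[word_id]])
		previous = word_id
	enc["labels"] = torch.tensor(labels).unsqueeze(0)    # (1, seq_len, 5)
	return enc

The resulting enc["input_ids"], enc["attention_mask"], and enc["labels"] are what go into the forward pass above.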

I can see that the loss is going down during training, but when I try to run inference with the following:

outputs = model(input_ids=input_ids, attention_mask=attention_mask)
logits = outputs['logits']                  # (batch_size, seq_len, 5)
predictions = torch.sigmoid(logits)         # per-label probabilities for every token

I am getting predictions of [1, 1, 1, 1, 1] for every word (i.e., the model is predicting all the classes for all the words).
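
To be explicit, I then threshold these probabilities to read off the predicted label set per token (the 0.5 cut-off is just my choice), and it is these multi-hot rows that come out as all ones:

# predictions: (batch_size, seq_len, 5) per-label probabilities from the sigmoid above
preds = (predictions > 0.5).long()    # 1 = label assigned to the token, 0 = not assigned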

Can someone please guide me toward the correct implementation of the model? Even a few suggestions would be very helpful.