VisualBert Embeddings

Hi! I am trying to fine-tune VisualBert for a classification task, but right now it is randomly predicting only one of the two classes that I have. I suspect the problem might be the way I am retrieving the visual embeddings. I am using resnet50 and I get the features with this code:
import torch
import torchvision
detector = torchvision.models.resnet50(pretrained=True)
detector = torch.nn.Sequential(*list(detector.children())[:-1])  # drop the final fc layer

Does anyone know whether these embeddings actually work with VisualBert? I read that it typically expects embeddings from an object detector, but since I am only classifying image-sentence pairs, I thought this network could also work. Thanks!