Poor performance when evaluating on the Hugging Face GLUE test set

I get decent performance on CoLA, around 0.58-0.60 on the dev set, but when I evaluate on the test set I get a very poor score of around 0.0. Why is that? What am I missing?
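
One sanity check that may be worth running is sketched here, under stated assumptions. CoLA is scored with Matthews correlation, and that metric comes out as 0.0 whenever every reference label has the same value. The GLUE test splits distributed through the `datasets` library carry a placeholder label of -1 instead of the real labels, so a test score of 0.0 could come from the labels and not the model. The toy lists and the hand-written `matthews_corrcoef` helper below are illustrative stand-ins, not the actual data or evaluation code.

```python
from collections import Counter
from math import sqrt


def matthews_corrcoef(references, predictions, positive=1):
    """Binary Matthews correlation; returns 0.0 when the denominator is zero."""
    tp = fp = tn = fn = 0
    for ref, pred in zip(references, predictions):
        if pred == positive and ref == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif ref == positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical model predictions, reused for both splits.
predictions = [1, 0, 1, 1, 0, 1]

# Hypothetical dev labels: real 0/1 values, so the metric is informative.
dev_labels = [1, 0, 1, 0, 0, 1]

# Assumed shape of the test labels: every entry is the placeholder -1.
test_labels = [-1] * len(predictions)

dev_score = matthews_corrcoef(dev_labels, predictions)
test_score = matthews_corrcoef(test_labels, predictions)

# The label histogram shows whether a split has usable labels at all.
print("dev label counts: ", Counter(dev_labels), "score:", round(dev_score, 4))
print("test label counts:", Counter(test_labels), "score:", round(test_score, 4))
```

If the label counts for the real test split show only -1, the score on that split says nothing about the model. In that case the test set would need to be scored through the GLUE submission server, or a labelled portion of the train or dev data held out for testing.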