Output of Pyramid Vision Transformer

What does the output of PVTModel represent? Is it a sequence of image patch tokens, as in ViT, or a spatial feature map, as in a CNN?
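
To make the question concrete, here is a minimal sketch of the shape arithmetic involved. It assumes the four-stage layout with cumulative strides 4, 8, 16, and 32 described in the PVT paper. The helper names and the example of 50 tokens with 512 channels are hypothetical illustrations, not values read from the library's actual output.

```python
# Rough shape arithmetic for a PVT-style backbone (assumed values, not read from the library).
# Assumption: four stages with cumulative strides 4, 8, 16, 32, as described in the PVT paper.
STRIDES = (4, 8, 16, 32)


def stage_grids(height, width, strides=STRIDES):
    """Return the (rows, cols) token grid of each stage for a given input size."""
    return [(height // s, width // s) for s in strides]


def tokens_to_feature_map_shape(num_tokens, channels, grid, has_cls_token=False):
    """Shape a (num_tokens, channels) sequence would have as a (channels, rows, cols) map."""
    rows, cols = grid
    if has_cls_token:
        num_tokens -= 1  # drop the extra classification token before reshaping
    if num_tokens != rows * cols:
        raise ValueError("token count does not match the spatial grid")
    return (channels, rows, cols)


grids = stage_grids(224, 224)
print(grids)  # [(56, 56), (28, 28), (14, 14), (7, 7)]

# Hypothetical last-stage output: 49 patch tokens plus one class token, 512 channels.
print(tokens_to_feature_map_shape(50, 512, grids[-1], has_cls_token=True))  # (512, 7, 7)
```

If the output is a token sequence, a reshape like this would be needed before feeding it to a CNN-style head. If it is already a feature map, no reshape would be necessary.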