Performance issue: SegFormer on Cityscapes (1024×1024)

First, thanks for sharing the SegFormer weights on Cityscapes here. I am having difficulty replicating (or even approaching) the SegFormer paper's performance on Cityscapes. The first issue I observed is that, among the B0 to B5 image processors, only B1 and B5 resize the input image to 1024×1024; the others resize it to 512×512, as you can see in the attached screenshot.
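For reference, this is how I inspected the processors (a minimal sketch; I am assuming the Hub checkpoints follow the nvidia/segformer-bX-finetuned-cityscapes-1024-1024 naming, and that passing size to from_pretrained is a valid way to override the resize target):

from transformers import AutoImageProcessor

for variant in ["b0", "b1", "b2", "b3", "b4", "b5"]:
    # checkpoint names assumed to follow the nvidia/segformer-bX-finetuned-cityscapes-1024-1024 pattern
    name = f"nvidia/segformer-{variant}-finetuned-cityscapes-1024-1024"
    processor = AutoImageProcessor.from_pretrained(name)
    print(variant, processor.size)

# Overriding the resize target so a given variant is processed at 1024x1024:
processor = AutoImageProcessor.from_pretrained(
    "nvidia/segformer-b0-finetuned-cityscapes-1024-1024",
    size={"height": 1024, "width": 1024},
)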

Moreover, with a standard PyTorch Dataset (see below; it uses the labelTrainIds produced by running cityscapesscripts) that loads the image and label as PIL images and passes both through the image processor, I obtain an mIoU of 58 on the Cityscapes validation set.
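For context, that mIoU is computed roughly as follows (my own sketch, not an official evaluation script; val_dataset is the CityscapesDataset defined below with split="val", and the evaluate library's mean_iou metric is my choice):

import evaluate
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import SegformerForSemanticSegmentation

model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
).eval().cuda()
metric = evaluate.load("mean_iou")

loader = DataLoader(val_dataset, batch_size=1)  # val_dataset: CityscapesDataset(..., split="val")
with torch.no_grad():
    for batch in loader:
        logits = model(pixel_values=batch["pixel_values"].cuda()).logits
        # SegFormer logits come out at 1/4 of the input resolution;
        # upsample to the label size before taking the argmax
        logits = F.interpolate(logits, size=batch["labels"].shape[-2:],
                               mode="bilinear", align_corners=False)
        preds = logits.argmax(dim=1).cpu().numpy()
        metric.add_batch(predictions=preds, references=batch["labels"].numpy())

print(metric.compute(num_labels=19, ignore_index=255)["mean_iou"])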

When I instead use a custom albumentations transform pipeline of Resize(1024, 1024), Normalize, and ToTensorV2, the result improves to a mean IoU of 68, still far from the 76.2 reported in the paper.
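Concretely, that pipeline is the following (used with image_processor=None, so the else branch of __getitem__ below is taken; the ImageNet mean/std are my assumption):

import albumentations as A
from albumentations.pytorch import ToTensorV2

val_transform = A.Compose([
    A.Resize(1024, 1024),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # ImageNet statistics
    ToTensorV2(),
])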

According to the paper and mmsegmentation, testing was done with a 1024×1024 sliding window and a stride of 768. Were the results here replicated with this type of inference? It is unclear from the image processor attributes. Is there an available implementation of the Cityscapes inference pipeline for this model implementation?

If not, what results were achieved, and which pre-processing pipeline was used?
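To make the question concrete, this is the kind of sliding-window evaluation I mean (a minimal sketch of the 1024×1024 / stride-768 protocol, written by me and not taken from this repo; pixel_values is the full-resolution, already-normalized image, e.g. 1×3×1024×2048 for Cityscapes):

import torch
import torch.nn.functional as F


def _starts(length, crop, stride):
    """Window start offsets along one axis; the last window is flush with the border."""
    last = max(length - crop, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return starts


@torch.no_grad()
def slide_inference(model, pixel_values, num_classes=19, crop=1024, stride=768):
    """Average per-pixel logits over crop x crop windows taken with the given stride."""
    B, _, H, W = pixel_values.shape
    logits_sum = pixel_values.new_zeros((B, num_classes, H, W))
    count = pixel_values.new_zeros((B, 1, H, W))
    for y1 in _starts(H, crop, stride):
        for x1 in _starts(W, crop, stride):
            window = pixel_values[:, :, y1:y1 + crop, x1:x1 + crop]
            logits = model(pixel_values=window).logits  # 1/4 of the window resolution
            logits = F.interpolate(logits, size=window.shape[-2:],
                                   mode="bilinear", align_corners=False)
            y2, x2 = y1 + window.shape[-2], x1 + window.shape[-1]
            logits_sum[:, :, y1:y2, x1:x2] += logits
            count[:, :, y1:y2, x1:x2] += 1
    return logits_sum / count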

Code for the Cityscapes Dataset:

import os
from pathlib import Path
from typing import Dict, Optional

import albumentations as A
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers.image_processing_utils import BaseImageProcessor


class CityscapesDataset(Dataset):
    """Cityscapes dataset built from the raw leftImg8bit / gtFine data."""

    def __init__(
        self,
        root_dir: os.PathLike,
        image_processor: Optional[BaseImageProcessor] = None,
        transform: Optional[A.Compose] = None,
        split: str = "train",
    ):  # TODO: we could specify a callable for transform
        """Initialize the dataset Object.

        Args:
            root_dir (os.PathLike): Local path to raw data.
            image_processor (Optional[BaseImageProcessor], optional): HuggingFace Image Processor. Defaults to None.
            transform (Optional[A.Compose], optional): Set of transforms for processing images and masks. Defaults to None.
            split (str, optional): Dataset split to load. Defaults to "train".
        """
        self.root_dir = Path(root_dir)  # ensure pathlib semantics for the "/" joins below
        self.image_processor = image_processor
        self.split = split
        self.transform = transform

        self.images_dir = self.root_dir / "leftImg8bit" / self.split
        self.labels_dir = self.root_dir / "gtFine" / self.split

        self.images = []
        self.labels = []
        self.ids = []

        for city in self.images_dir.iterdir():
            for image_path in city.iterdir():
                self.images.append(image_path)
                self.labels.append(
                    self.labels_dir / city.name / image_path.name.replace("leftImg8bit", "gtFine_labelTrainIds")
                )
                self.ids.append(image_path.name.replace("_leftImg8bit.png", ""))

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx) -> Dict[str, torch.Tensor]:
        image = Image.open(self.images[idx]).convert("RGB")
        label = Image.open(self.labels[idx])

        if self.transform is not None:
            transformed = self.transform(image=np.asarray(image).copy(), mask=np.asarray(label).copy())
            image = transformed["image"]
            label = transformed["mask"]

        if self.image_processor is not None:
            processed = self.image_processor.preprocess(
                images=image, segmentation_maps=label, return_tensors="pt"
            )
            # preprocess returns batched tensors; drop the batch dimension of 1
            encoded_inputs = {k: v.squeeze() for k, v in processed.items()}

        else:
            encoded_inputs = {"pixel_values": image, "labels": label}

        encoded_inputs["id"] = self.ids[idx]
        # encoded_inputs["labels"].apply_(lambda x: CityscapesDataset.mapping_ids[x])

        return encoded_inputs
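
And this is how I instantiate it for validation (the root path is just my local setup):

from pathlib import Path

from torch.utils.data import DataLoader
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("nvidia/segformer-b5-finetuned-cityscapes-1024-1024")
val_dataset = CityscapesDataset(
    root_dir=Path("/data/cityscapes"),  # hypothetical local path to the extracted archives
    image_processor=processor,
    split="val",
)
val_loader = DataLoader(val_dataset, batch_size=2, num_workers=4)
batch = next(iter(val_loader))
print(batch["pixel_values"].shape, batch["labels"].shape)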