Using Hugging Face models without any other Hugging Face support?

Hello, I am a beginner to huggingface.

I am working on a project to see how some models/architectures perform on my custom dataset for semantic segmentation. I want to train my models from scratch, with no pretrained weights. I am comparing models like ResNet50, SegFormer and Mask2Former. I load all my images and masks using a DataLoader. For ResNet50, I use the ResNet50 FCN provided by torchvision (fcn_resnet50).

For SegFormer, I found that Hugging Face provides a SegFormer model, so I am just using that. The performance isn't that great, so I am wondering whether I am doing something wrong. Is it fine to use Hugging Face models without any of the other Hugging Face utilities, such as AutoImageProcessor?
I load the model using:

configuration = SegformerConfig(**dict(arch['args'])) 
configuration.num_labels = data['num_classes']
model = SegformerForSemanticSegmentation(configuration)

Then I get the results using:

for _, images, masks in dataloader:
    images = images.to(device, non_blocking=True)
    masks = masks.to(device, non_blocking=True)
    outputs = model(pixel_values=images, labels=masks).logits
    outputs = nn.functional.interpolate(outputs, size=masks.shape[-2:], mode="bilinear", align_corners=False)
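
For reference, a minimal sketch of the upsample-then-loss step in the loop above, using random tensors instead of a real model. It assumes the logits come out at a quarter of the input resolution and that 255 marks pixels to ignore; `upsample_and_loss` is a made-up helper name, not a library function. As far as I understand, when `labels` is passed to SegformerForSemanticSegmentation the returned output also carries a `.loss`, so this only spells out the equivalent computation by hand.

```python
import torch
import torch.nn.functional as F


def upsample_and_loss(logits, masks, ignore_index=255):
    """Upsample low-resolution logits to the mask size and compute cross-entropy.

    logits: (B, num_classes, h, w) float tensor
    masks:  (B, H, W) long tensor of class indices
    """
    upsampled = F.interpolate(
        logits, size=masks.shape[-2:], mode="bilinear", align_corners=False
    )
    loss = F.cross_entropy(upsampled, masks, ignore_index=ignore_index)
    preds = upsampled.argmax(dim=1)  # (B, H, W) predicted class per pixel
    return loss, preds


torch.manual_seed(0)
num_classes = 4
# Stand-in for model(pixel_values=images).logits on 64x64 inputs (assumed 1/4 scale).
logits = torch.randn(2, num_classes, 16, 16, requires_grad=True)
masks = torch.randint(0, num_classes, (2, 64, 64))

loss, preds = upsample_and_loss(logits, masks)
loss.backward()  # gradients flow back through the interpolation
```

The masks need to be integer class indices of shape (B, H, W), not one-hot or float tensors, for the cross-entropy call to work this way.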

I am also trying to train Mask2Former from scratch, but for post-processing I need to use Mask2FormerImageProcessor to get the semantic segmentation. I have already processed my images in my DataLoader. What do I do here to use just Mask2Former with my own data?


So for Mask2Former, I am doing this:

configuration = Mask2FormerConfig(**dict(arch['args']))
configuration.num_queries = data['num_classes']
model = Mask2FormerForUniversalSegmentation(configuration)

for _, images, masks in dataloader:
    images = images.to(device, non_blocking=True)
    masks = masks.to(device, non_blocking=True)
    outputs = model(images,
    outputs = outputs.masks_queries_logits
    outputs = nn.functional.interpolate(outputs, size=masks.shape[-2:], mode="bilinear", align_corners=False)

Is this the correct way to use Mask2Former (model only) for semantic segmentation? It's not performing that well…
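
For comparison, here is a rough sketch of how I understand the semantic map is normally built from Mask2Former outputs without the image processor: the per-query class probabilities (dropping the last "no object" class) are combined with the per-query mask probabilities, and the argmax is taken over classes. It uses random tensors as stand-ins for `class_queries_logits` and `masks_queries_logits`; the helper name `semantic_map_from_queries` and all shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def semantic_map_from_queries(class_queries_logits, masks_queries_logits, target_size):
    """Combine per-query class and mask predictions into a per-pixel class map.

    class_queries_logits: (B, Q, C + 1), last channel assumed to be "no object"
    masks_queries_logits: (B, Q, h, w), one low-resolution mask per query
    target_size:          (H, W) of the ground-truth masks
    """
    masks = F.interpolate(
        masks_queries_logits, size=target_size, mode="bilinear", align_corners=False
    )
    class_probs = class_queries_logits.softmax(dim=-1)[..., :-1]  # (B, Q, C)
    mask_probs = masks.sigmoid()                                  # (B, Q, H, W)
    # Weight every query's mask by its class probabilities and sum over queries.
    scores = torch.einsum("bqc,bqhw->bchw", class_probs, mask_probs)
    return scores.argmax(dim=1)                                   # (B, H, W)


torch.manual_seed(0)
batch, num_queries, num_classes = 2, 10, 4
class_logits = torch.randn(batch, num_queries, num_classes + 1)  # stand-in output
mask_logits = torch.randn(batch, num_queries, 16, 16)            # stand-in output

pred = semantic_map_from_queries(class_logits, mask_logits, target_size=(64, 64))
```

Note that `masks_queries_logits` on its own has one channel per query rather than one per class, so interpolating it directly only lines up with the class masks if each query is forced to correspond to exactly one class.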

Hi, can someone check if I have implemented this correctly? I am not getting the best results. Thanks