Standard procedure for adapting vision encoders to semantic segmentation

I notice that some of the vision encoders do not have a ModelForSemanticSegmentation class, only one for image classification, even though most of them could in principle be trained for semantic segmentation by adding a simple decoder. I don't want to reinvent the wheel, so I was wondering whether there is a standard procedure for creating my own ModelXForSemanticSegmentation that I could follow. (I checked the implementations of the models that do have a semantic segmentation class, but can I simply follow the same recipe?) For example, I'd like to add a semantic segmentation decoder to the ConvNeXt encoder series.
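
To make the question concrete, here is a minimal sketch of what I imagine such a class could look like for ConvNeXt. The class name, the num_labels argument, and the decode head design are all my own placeholders, loosely modeled on what existing heads like SegformerForSemanticSegmentation seem to do: run the backbone, project the coarse features to per-class logits, upsample them bilinearly to the input resolution, and compute a cross-entropy loss when labels are given.

```python
import torch
import torch.nn as nn
from transformers import ConvNextConfig, ConvNextModel
from transformers.modeling_outputs import SemanticSegmenterOutput


class ConvNextForSemanticSegmentation(nn.Module):
    # NOTE: class name, num_labels argument, and decode head are my own
    # placeholders, not an existing transformers API.
    def __init__(self, config: ConvNextConfig, num_labels: int):
        super().__init__()
        self.convnext = ConvNextModel(config)
        # ConvNeXt downsamples by 32x; the final feature map has
        # config.hidden_sizes[-1] channels.
        self.decode_head = nn.Sequential(
            nn.Conv2d(config.hidden_sizes[-1], 256, kernel_size=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, num_labels, kernel_size=1),
        )

    def forward(self, pixel_values, labels=None):
        # last_hidden_state: (batch, hidden_sizes[-1], H/32, W/32)
        features = self.convnext(pixel_values).last_hidden_state
        logits = self.decode_head(features)
        # Upsample the coarse logits back to the input resolution
        logits = nn.functional.interpolate(
            logits, size=pixel_values.shape[-2:],
            mode="bilinear", align_corners=False,
        )
        loss = None
        if labels is not None:
            # 255 is the usual ignore_index for segmentation masks
            loss = nn.functional.cross_entropy(logits, labels, ignore_index=255)
        return SemanticSegmenterOutput(loss=loss, logits=logits)


# quick shape check
config = ConvNextConfig()
model = ConvNextForSemanticSegmentation(config, num_labels=19)
pixel_values = torch.randn(1, 3, 224, 224)
labels = torch.randint(0, 19, (1, 224, 224))
out = model(pixel_values, labels=labels)
print(out.logits.shape, out.loss)  # torch.Size([1, 19, 224, 224]), scalar loss
```

In a "real" version I assume one would subclass ConvNextPreTrainedModel rather than nn.Module so that from_pretrained works, and initialise the backbone from a classification checkpoint such as facebook/convnext-tiny-224. Is that the intended recipe, or is there a more standard pattern to follow?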