Fine-tuning SAM with 256x256 input images

Hello all
I am trying to fine-tune the Segment Anything (SAM) model.
I would like to load a pretrained model such as "sam-vit-base"
so that the pretrained encoder is frozen and only the decoder is fine-tuned.
My input images are 256x256 and so are the segmentation labels.
I see that, by default, the model resizes the input to 1024x1024 and outputs 256x256 masks.
Is it possible to use a pretrained model for initialization but feed in the 256x256 images without resizing them to 1024x1024? If so, what code lines and parameters should I use?
Once I do:
model = SamModel.from_pretrained("facebook/sam-vit-base")
How can I set the image_size parameter to 256?
Also, I saw that once I change image_size to 256 without loading a pretrained model, the output mask shrinks to 64x64. I want to keep the output at 256x256.
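
For reference, this is roughly how I changed the image size without pretrained weights; a minimal sketch of what I mean, not necessarily the right approach:

from transformers import SamConfig, SamModel, SamVisionConfig

# Build SAM from scratch with a 256x256 vision encoder (no pretrained weights).
vision_config = SamVisionConfig(image_size=256)
config = SamConfig(vision_config=vision_config)
model = SamModel(config)
# With image_size=256 the predicted (low-resolution) masks come out 64x64
# rather than the 256x256 I get with the default 1024x1024 input size.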

In summary, I would like to start training from a pretrained model, use an input size of 256 (without resizing to 1024), and get an output size of 256.
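
To be clear, the freezing part is not my problem; the sketch below (using the sub-module names of the transformers SamModel) is what I have in mind, and my question is only about the input and output sizes:

from transformers import SamModel

model = SamModel.from_pretrained("facebook/sam-vit-base")

# Freeze the image encoder and prompt encoder so that only the mask decoder
# is updated during fine-tuning.
for name, param in model.named_parameters():
    if name.startswith("vision_encoder") or name.startswith("prompt_encoder"):
        param.requires_grad_(False)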

Thank you for your assistance

Oded

Hello,

I have successfully fine-tuned SAM from transformers with 256x256 images and masks. The masks were loaded, converted to grayscale, and turned into 256x256 numpy arrays. I used the functions below to get the numpy arrays into the format the fine-tune tutorial expects.

import os
import numpy as np
from skimage.io import imread
from skimage.transform import resize
from skimage.color import rgb2gray

def load_and_resize_and_grayscale_images_from_dir(directory, new_shape, threshold=0.0005):
    images = []
    for filename in os.listdir(directory):
        if filename.endswith(".png") or filename.endswith(".jpg"):
            img_path = os.path.join(directory, filename)
            # Remove the first two characters ("._") from the filename... somehow the mac version kept this
            #img_path = os.path.join(directory, filename[2:])
            img = imread(img_path)
            resized_img = resize(img, new_shape, preserve_range=True, anti_aliasing=True)
            grayscaled_img = rgb2gray(resized_img)
            # Apply thresholding to convert grayscale values to 0's and 1's
            thresholded_img = (grayscaled_img > threshold).astype(np.int32)
            images.append(thresholded_img)
    return np.array(images)

def load_and_resize_images_from_dir(directory, new_shape):
    images = []
    for filename in os.listdir(directory):
        if filename.endswith(".png") or filename.endswith(".jpg"):
            img_path = os.path.join(directory, filename)
            # Remove the first two characters ("._") from the filename... somehow the mac version kept this
            #img_path = os.path.join(directory, filename[2:])
            img = imread(img_path)
            resized_img = resize(img, new_shape, preserve_range=True, anti_aliasing=True).astype(np.uint8)
            images.append(resized_img)
    return np.array(images)
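
For completeness, this is roughly how I plugged the arrays into the processing step of the tutorial; the directory names and the bounding-box helper are placeholders from my own setup, so adapt them as needed:

import numpy as np
from transformers import SamProcessor

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

def bbox_from_mask(mask, jitter=5):
    # Derive a loose bounding-box prompt from a binary ground-truth mask.
    ys, xs = np.where(mask > 0)
    x_min, x_max = int(xs.min()) - jitter, int(xs.max()) + jitter
    y_min, y_max = int(ys.min()) - jitter, int(ys.max()) + jitter
    return [float(x_min), float(y_min), float(x_max), float(y_max)]

images = load_and_resize_images_from_dir("images/", (256, 256))
masks = load_and_resize_and_grayscale_images_from_dir("masks/", (256, 256))

# Prepare one example: the processor resizes the image to the model's
# expected input size and formats the box prompt.
inputs = processor(images[0], input_boxes=[[bbox_from_mask(masks[0])]], return_tensors="pt")
inputs["ground_truth_mask"] = masks[0]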

Thank you for your reply, but I don't see how this feedback relates to my query.
As I explained, I do not want to resize the images from 256x256; I want the ViT encoder to accept them as is.

Hey again, sorry for the delayed response. Is there a specific reason why you don't want the encoder to resize the images to 1024x1024? Since SAM was trained on 1024x1024 images, my understanding is that this resizing is necessary. If you're looking for a smaller model, I think TinySAM may help.

The interpolation affects edges, especially for the small objects I am trying to segment in the image. But if this is a limitation of the ViT architecture that I was not aware of, then I get it :).