Fine-tuning SAM with 256x256 input images

Hello all,
I am trying to fine-tune the Segment Anything (SAM) model.
I would like to load a pretrained checkpoint such as "sam-vit-base",
keep the pretrained encoder frozen, and fine-tune only the decoder.
My input images are 256x256, and so are the segmentation labels.
I see that, by default, the model resizes the input to 1024x1024 and outputs 256x256 masks.
Is it possible to use a pretrained model for initialization but feed the 256x256 images directly, without resizing them to 1024x1024? If so, which code lines and parameters should I use?
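For the freezing part I think I know what to do; something along these lines (a sketch — the `vision_encoder` / `prompt_encoder` prefixes are my assumption about the HF parameter naming, and the dummy module below is only a stand-in so the snippet runs without downloading a checkpoint):

```python
import torch.nn as nn

def freeze_all_but_decoder(model, frozen_prefixes=("vision_encoder", "prompt_encoder")):
    # Disable gradients for every parameter whose name starts with a frozen prefix.
    for name, param in model.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False
    return model

# Stand-in module with SAM-like top-level submodule names; with transformers
# I would pass the SamModel instance instead.
class DummySam(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(4, 4)
        self.prompt_encoder = nn.Linear(4, 4)
        self.mask_decoder = nn.Linear(4, 4)

model = freeze_all_but_decoder(DummySam())
trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # ['mask_decoder.bias', 'mask_decoder.weight']
```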
Once I do:
model = SamModel.from_pretrained("facebook/sam-vit-base")
How can I set the image_size parameter to 256?
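Concretely, this is the route I experimented with (a sketch; whether `vision_config.image_size` is the right knob is exactly my question):

```python
from transformers import SamConfig

# Start from the architecture's default config; image_size defaults to 1024.
config = SamConfig()

# Ask the vision encoder to accept 256x256 inputs instead of 1024x1024.
config.vision_config.image_size = 256

# Loading pretrained weights on top of this config would hit a shape mismatch
# in the positional embeddings; ignore_mismatched_sizes=True re-initializes
# just those tensors (commented out here to avoid the checkpoint download):
# from transformers import SamModel
# model = SamModel.from_pretrained(
#     "facebook/sam-vit-base", config=config, ignore_mismatched_sizes=True
# )
print(config.vision_config.image_size)  # 256
```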
Also, I saw that if I change image_size to 256 without loading a pretrained model, the output mask shrinks to 64x64. I want to keep the output at 256x256.
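If my shape bookkeeping is right (an assumption on my part), the 64 comes from the decoder upsampling the ViT patch grid by a fixed 4x:

```python
patch_size = 16                  # ViT-B patch size in SAM
image_size = 256                 # my input size
grid = image_size // patch_size  # 16x16 patch grid out of the encoder
mask_size = grid * 4             # the mask decoder upsamples the grid 4x
print(grid, mask_size)           # 16 64

# With the default 1024x1024 input the same arithmetic gives the usual 256:
assert (1024 // patch_size) * 4 == 256
```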

In summary, I would like to start training from a pretrained model, use a 256x256 input (without resizing to 1024x1024), and get a 256x256 output.

Thank you for your assistance

Oded

Hello,

I have successfully fine-tuned SAM from transformers with 256x256 images and masks. The masks were loaded, converted to 256x256 grayscale numpy arrays, and thresholded to binary. I used the functions below to produce the numpy arrays expected by the fine-tuning tutorial.

import os

import numpy as np
from skimage.color import rgb2gray
from skimage.io import imread
from skimage.transform import resize

def load_and_resize_and_grayscale_images_from_dir(directory, new_shape, threshold=0.0005):
    """Load mask images from a directory, resize to new_shape, and binarize."""
    images = []
    for filename in os.listdir(directory):
        if filename.endswith((".png", ".jpg")):
            img_path = os.path.join(directory, filename)
            # Remove the first two characters ("._") from the filename... somehow the mac version kept this
            # img_path = os.path.join(directory, filename[2:])
            img = imread(img_path)
            resized_img = resize(img, new_shape, preserve_range=True, anti_aliasing=True)
            grayscaled_img = rgb2gray(resized_img)
            # Threshold the grayscale values into a binary 0/1 mask
            thresholded_img = (grayscaled_img > threshold).astype(np.int32)
            images.append(thresholded_img)
    return np.array(images)

def load_and_resize_images_from_dir(directory, new_shape):
    """Load RGB images from a directory and resize them to new_shape."""
    images = []
    for filename in os.listdir(directory):
        if filename.endswith((".png", ".jpg")):
            img_path = os.path.join(directory, filename)
            # Remove the first two characters ("._") from the filename... somehow the mac version kept this
            # img_path = os.path.join(directory, filename[2:])
            img = imread(img_path)
            resized_img = resize(img, new_shape, preserve_range=True, anti_aliasing=True).astype(np.uint8)
            images.append(resized_img)
    return np.array(images)
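For reference, the resize-then-threshold step can be sanity-checked on a synthetic mask instead of real files (a self-contained sketch; the white-square array is made up for the demo):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

# Synthetic 128x128 RGB mask: a white square on a black background.
mask = np.zeros((128, 128, 3), dtype=np.uint8)
mask[32:96, 32:96] = 255

# Same pipeline as the loader above: resize, grayscale, threshold to 0/1.
resized = resize(mask, (256, 256), preserve_range=True, anti_aliasing=True)
binary = (rgb2gray(resized) > 0.0005).astype(np.int32)

print(binary.shape, sorted(np.unique(binary)))  # (256, 256) [0, 1]
```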

Thank you for your reply, but I don't see how this feedback relates to my query.
As I explained, I do not want to resize the images from 256x256; I want the ViT encoder to accept them as-is.