Need help in determining model quality

Model performance as in decreasing loss? What about the MIoU though? Where should that metric fit into our discussion so far? I know it compares the intersection over union, so a bigger number is definitely better, since it means the model's predictions match the ground truth really well.

1 Like

Model performance as in decreasing loss?

No. As in your actual inference results.
If the model's actual recognition ability is improving while the metric says otherwise, then it is the evaluation function that is wrong.

Regarding the evaluation function: for example, if you mix images of rivers from different angles and satellite images into the evaluation dataset, you can also detect when the model's responses to those images degrade. If you trust the evaluation function up front, you also have the option of ultimately adopting the checkpoint that scored best on evaluation. That makes it easy to prevent overfitting.
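
For illustration, here is a minimal sketch of that idea, assuming you tag each evaluation image with the subset it came from (the subset names, num_labels, and ignore_index values below are placeholders):

import evaluate

metric = evaluate.load("mean_iou")

def miou_per_subset(pred_masks, ref_masks, subset_names, num_labels=2):
    # pred_masks / ref_masks: lists of (H, W) integer arrays
    # subset_names: one tag per image, e.g. "same_angle", "other_angle", "satellite"
    results = {}
    for name in sorted(set(subset_names)):
        preds = [p for p, n in zip(pred_masks, subset_names) if n == name]
        refs = [r for r, n in zip(ref_masks, subset_names) if n == name]
        scores = metric._compute(
            predictions=preds,
            references=refs,
            num_labels=num_labels,
            ignore_index=255,  # placeholder: a label value that never occurs
        )
        results[name] = scores["mean_iou"]
    return results

If the satellite subset's score drops while the others stay flat, you know exactly which kind of image the model is getting worse on.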

Interesting, I think I wanna try that on my next training.

So TL;DR, the model is good at one thing and bad at another because:

  1. all images, for both validation and training, come from the same camera angle and position at one place
  2. the evaluation function is probably unfit, programmatically, so I need to improve that somehow as well for a broader dataset.

The accuracy is weird, but it's fine because the loss is decreasing and the MIoU looks proper as well.

Anything I left out of this conclusion? Thanks again John, this has been a really good journey for my first model.

1 Like

I don't think there are any omissions. As for the angle, the model simply learned it faithfully because that's what the dataset was like.
In that sense, it's the expected result, so there's no problem. :grinning:

1 Like

Oh sorry, one more thing: do you have any comments on why the “not water” IoU is nan? Is it because of the binary id2label?

1 Like

I think you might need to configure the IoU metric, like whether the labels are zero-indexed or one-indexed. I've never used it before, so I don't know without digging through the manual.
Or maybe some pre-/post-processing of the model output and the labels before they are passed to the IoU metric is also needed.

Okay got it, this is the first time I found out MIoU can be unreliable as a metric too, tbh. I didn't know comparing intersections could be that complex for bigger regions. Luckily the dataset is 512-sized and only assigns one label, so I hope that isn't affected by the problem that paper brought up.

1 Like

This time we didn't do any training that relied on the evaluation function, so even if the evaluation is wrong, there is no real harm. That's lucky.

Is it? Isn't the evaluation function used in this one?

import torch
from torch import nn
import evaluate

metric = evaluate.load("mean_iou")

# num_labels and id2label are defined earlier in the script (binary: "not water" / "water")
def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        # scale the logits to the size of the label
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()
        # currently using _compute instead of compute
        # see this issue for more info: https://github.com/huggingface/evaluate/pull/328#issuecomment-1286866576
        metrics = metric._compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_labels,
            ignore_index=0,
        )

        # add per category metrics as individual key-value pairs
        per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
        per_category_iou = metrics.pop("per_category_iou").tolist()

        metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
        metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})

        return metrics

print(len(id2label))
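
For what it's worth, a toy sketch of one way a nan can show up with this metric: if label 0 is the “not water” class, ignore_index=0 masks those pixels out of the reference, so that class can end up with an empty union (0/0). The arrays below are made up just to show the effect:

import numpy as np
import evaluate

metric = evaluate.load("mean_iou")

# toy 4x4 masks: 0 = "not water", 1 = "water"
ref = np.array([[0, 0, 1, 1]] * 4)
pred = np.array([[0, 1, 1, 1]] * 4)

# ignoring label 0 removes every "not water" pixel from the reference,
# so that class can come back as nan
print(metric._compute(predictions=[pred], references=[ref],
                      num_labels=2, ignore_index=0)["per_category_iou"])

# with an ignore_index value that never occurs, both classes get a score
print(metric._compute(predictions=[pred], references=[ref],
                      num_labels=2, ignore_index=255)["per_category_iou"])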

Btw, about this function: I did a big dig into it yesterday, and it seems I was also almost using deprecated functions here instead of the new ones. Idk why the old page came up first in the Google search; lucky I did some digging, because the hyperlink at the top of it doesn't work at all. Is there any way to contact huggingface about it?


The only difference is that this one has an underscore on the method, if I'm not wrong.

1 Like

We're using and applying the evaluation function, but we're not using it to select or replace the actual model. If you turned off evaluation, the training results would not have changed. If you were using the option below, the evaluation function would have a greater impact.

load_best_model_at_end (bool, optional, defaults to False) — Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved. See save_total_limit for more.
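
As a rough sketch of what that would look like with the Trainer (the argument values here are placeholders, and on older transformers versions the first option is spelled evaluation_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="water-segmentation",   # placeholder
    eval_strategy="epoch",             # run compute_metrics every epoch
    save_strategy="epoch",             # must match the eval strategy
    load_best_model_at_end=True,       # reload the best checkpoint at the end
    metric_for_best_model="mean_iou",  # key returned by compute_metrics
    greater_is_better=True,            # higher mean IoU is better
    save_total_limit=2,                # keep only a couple of checkpoints
)

With that enabled, a broken evaluation function really would change which model you end up with.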

Is there any way to contact huggingface about it?

You can use email (website@huggingface.co), the HF feedback section on the HF Discord, GitHub issues, the Discussions section of the relevant HF repo if it's a problem with that repo, or you can also mention them from here.
Which one to use depends on the content. Can you be more specific?

Edit:
I understand now. These old pages OFTEN appear for me, too (datasets, transformers, huggingface_hub…).
For large-scale, long-term issues like this, it might be a good idea to make a request in the post below.

To be clear from this, you're just lightly finetuning a model. I don't know how big your training dataset or model is, but you're starting to saturate after just ~2k steps, which is going to be tiny compared to even the weights of a very small model. You're not going to get some fundamentally deeper understanding out of 2k steps of finetuning, like the ability to recognize rivers in satellite images; you're basically just tuning the input and output layers for the most part. And jumping from camera-level images to satellites is a huge leap in terms of the required understanding; in no situation would I expect that to work without representative samples in the dataset.

If your dataset is larger than where you’re starting to saturate at (I don’t see obvious signs in the graph of epoch jumps), you may want to consider a lower LR, esp. during warmup, and could even add a bit of dropout into the mix. You can also train for many epochs (far past the point of memorization), to try to improve grokking. But really, when you’re measuring your training time in single-digit thousands of steps, hoping for radically out-of-domain success is too ambitious. Broaden your domain if you want that; make sure your training data is representative of the entire scope.
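
On the "lower LR, esp. during warmup" point: the schedulers in transformers already vary the learning rate over the run. A toy sketch, with made-up numbers, just to show the shape of the schedule:

import torch
from transformers import get_cosine_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=6e-5)  # placeholder base LR
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,     # LR ramps up from 0 over the first 200 steps
    num_training_steps=2000,  # then decays toward 0 by the last step
)

for step in range(2000):
    optimizer.step()
    scheduler.step()
    if step in (0, 100, 199, 1000, 1999):
        print(step, scheduler.get_last_lr()[0])

With the Trainer, the same thing is usually controlled through the learning_rate, warmup_ratio (or warmup_steps), and lr_scheduler_type arguments rather than by building the scheduler by hand.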

As for whether to rotate, that depends. Do you want detections on heavily rotated images? Do you want them to happen just as readily as with non-rotated images? I don’t think throwing in some rotation would hurt, especially small rotations, but I’d still try to keep the majority of the training dataset representative (reasonable orientations) - again, esp. given how few steps we’re talking about here.
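
If you do add a little rotation, here is a minimal sketch of the kind of augmentation meant here, assuming torchvision-style image/mask pairs (the probability and angle range are placeholders):

import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def maybe_rotate(image, mask, p=0.3, max_deg=15):
    # apply the same small random rotation to the image and its mask;
    # most samples pass through untouched, so the bulk of the training
    # data keeps its original orientation
    if random.random() < p:
        angle = random.uniform(-max_deg, max_deg)
        image = TF.rotate(image, angle, interpolation=InterpolationMode.BILINEAR)
        # nearest interpolation so label IDs don't get blended together
        mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)
    return image, mask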

2 Likes

esp. during warmup, and could even add a bit of dropout into the mix

You can have a different learning rate across the training process?

Broaden your domain if you want that; make sure your training data is representative of the entire scope.

That means adding more variety of images, including other river viewpoints or rotational positions. Btw, as we are talking about image variety: since these 512 x 512 images are crops of a larger image, will it count as “adding other variety” if I also add the 1080 x 2560 full image to the dataset, or is there any reason not to do that, as in I would just desaturate the neural network?

As for whether to rotate, that depends. Do you want detections on heavily rotated images? Do you want them to happen just as readily as with non-rotated images? I don’t think throwing in some rotation would hurt, especially small rotations, but I’d still try to keep the majority of the training dataset representative (reasonable orientations) - again, esp. given how few steps we’re talking about here.

Does “reasonable orientations” mean I should stick to the logic of where the dataset is located, or is including as many kinds of rotation as possible also good? Btw, how do we decide whether adding more data won't add anything? Is it from the loss plot? Is there any reading on that?

Thanks in advance for the input!

1 Like