Correct way to define outputs for an Image Model

Hey! I am making an image AutoEncoder of type PreTrainedModel so that it’s compatible with the Trainer class. I understand that the output should be in a specific format so the Trainer can automatically infer the output, but what I don’t understand is which format I should adopt. For example, I am feeding in images of shape batch_size x 3 x 256 x 256 and my output is another tensor with the same dimensions, so I return the loss and logits as part of a dictionary. While that works during the training phase, the Trainer fails during the evaluation phase and gives me the following error:

  9%|▉         | 29/320 [00:27<01:59,  2.44it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
  9%|▉         | 30/320 [00:27<01:57,  2.46it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
 10%|▉         | 31/320 [00:28<01:57,  2.46it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
 10%|█         | 32/320 [00:28<01:32,  3.13it/s]***** Running Evaluation *****
  Num examples = 200
  Batch size = 32

  0%|          | 0/7 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\nisha\Documents\Imagine\", line 60, in <module>
  File "C:\Users\nisha\Documents\Imagine\", line 52, in main
    atrain(args.dataset, args.subdatasets)
  File "C:\Users\nisha\Documents\Imagine\", line 91, in train
    train_single_asset(subdataset, dataset_tag)
  File "C:\Users\nisha\Documents\Imagine\", line 75, in train_single_asset
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 1455, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 1565, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 2208, in evaluate
    output = eval_loop(
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 2394, in evaluation_loop
    preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 106, in nested_concat
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 106, in <genexpr>
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 108, in nested_concat
    return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\", line 69, in torch_pad_and_concatenate
    if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]:

Upon inspection, it seems that the output doesn’t match what the Trainer expects. I should probably mention that, since this is an AutoEncoder, I don’t have an explicit label: the input itself is the label. Any help would be much appreciated!
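
For readers trying to reproduce this, here is a minimal sketch of the setup described above. It is not the original model: it assumes a plain torch.nn.Module as a stand-in for the PreTrainedModel subclass, and the class name TinyAutoEncoder, the two layers, and the pixel_values argument name are all hypothetical. It only illustrates the shapes of the two entries in the returned dictionary.

```python
import torch
from torch import nn


class TinyAutoEncoder(nn.Module):
    """Hypothetical stand-in for the autoencoder described in the post."""

    def __init__(self):
        super().__init__()
        # Encoder halves the spatial size (256 -> 128); decoder restores it.
        self.encoder = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.decoder = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, pixel_values):
        # `pixel_values` is an assumed argument name, not taken from the post.
        reconstruction = self.decoder(torch.relu(self.encoder(pixel_values)))
        # No separate label: the input itself is the reconstruction target.
        loss = nn.functional.mse_loss(reconstruction, pixel_values)
        return {"loss": loss, "logits": reconstruction}


model = TinyAutoEncoder()
batch = torch.rand(2, 3, 256, 256)  # batch_size x 3 x 256 x 256
outputs = model(batch)

print(outputs["logits"].shape)  # same shape as the input batch
print(outputs["loss"].shape)    # 0-dimensional (scalar) tensor
```

In this sketch the loss is a 0-dimensional tensor while the logits are 4-dimensional. The last traceback line reads shape[1], which does not exist on a 0-dimensional tensor. Whether the loss entry is what actually reaches that check during evaluation is an assumption; the traceback as pasted does not confirm it.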