Using a fine-tuned model that wasn't explicitly saved

Hi, I’ve got a definite beginner situation/question…

I trained a model overnight with the Trainer API. It seemed to finish, since I got the "Training completed. Do not forget to share your model on huggingface.co/models =)" message.

I then wanted to run some predictions within the same script. However, that didn't appear to happen. The script stopped, and the last thing in my log is the "leaked semaphore objects" warning.

I forgot to include an explicit trainer.save_model call in the script.

My questions:

  • Is there a way to retrieve the trained model and use it for predictions? I can see checkpoints in my test-trainer folder and a reference to the model in the cache.

  • Why did I get the leaked semaphore objects warning if the training finished?

Any help would be appreciated.

In case more detail helps, I've copied the tail of my log, starting from the "Training completed" message (with unhelpful lines removed).

100%|██████████| 911/911 [32:52<00:00,  1.92s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)



                                                      

100%|██████████| 3984/3984 [15:16:58<00:00, 12.06s/it]
100%|██████████| 3984/3984 [15:16:58<00:00, 13.81s/it]
The following columns in the test set  don't have a corresponding argument in `XLNetForSequenceClassification.forward` and have been ignored: hypothesis, idx, premise. If hypothesis, idx, premise are not expected by `XLNetForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7285
  Batch size = 8
{'eval_loss': 0.06020349636673927, 'eval_runtime': 1975.7072, 'eval_samples_per_second': 3.687, 'eval_steps_per_second': 0.461, 'epoch': 3.0}
{'train_runtime': 55018.864, 'train_samples_per_second': 0.579, 'train_steps_per_second': 0.072, 'train_loss': 0.05098113381719015, 'epoch': 3.0}

  0%|          | 0/911 [00:00<?, ?it/s]
  0%|          | 2/911 [00:03<22:50,  1.51s/it]
  0%|          | 3/911 [00:05<32:03,  2.12s/it]
  0%|          | 4/911 [00:08<35:46,  2.37s/it]
  1%|          | 5/911 [00:11<39:23,  2.61s/it]
  1%|          | 6/911 [00:14<41:11,  2.73s/it]
  1%|          | 7/911 [00:16<33:51,  2.25s/it]
  1%|          | 8/911 [00:17<28:34,  1.90s/it]
  1%|          | 9/911 [00:18<24:58,  1.66s/it]
  1%|          | 10/911 [00:19<22:09,  1.48s/it]
  ...
 99%|█████████▉| 906/911 [32:10<00:10,  2.02s/it]
100%|█████████▉| 907/911 [32:12<00:08,  2.05s/it]
100%|█████████▉| 908/911 [32:14<00:06,  2.12s/it]
100%|█████████▉| 909/911 [32:16<00:04,  2.19s/it]
100%|█████████▉| 910/911 [32:19<00:02,  2.21s/it]
100%|██████████| 911/911 [32:20<00:00,  1.99s/it]
loading configuration file https://huggingface.co/ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli/resolve/main/config.json from cache at /Users/<USER>/.cache/huggingface/transformers/6c94c94c14efab475d1f94dd6e8db89c88d795ee247d6d8bc2abcf08e0a0ffd0.daa7acdd41354d5f480660f3a1afeaf69ccac9a5c013e173e2f5b557a777eaa4
Model config XLNetConfig {
  "_name_or_path": "ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli",
  "architectures": [
    "XLNetForSequenceClassification"
  ],
  "attn_type": "bi",
  "bi_data": false,
  "bos_token_id": 1,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 4096,
  "d_model": 1024,
  "dropout": 0.1,
  "end_n_top": 5,
  "eos_token_id": 2,
  "ff_activation": "gelu",
  "id2label": {
    "0": "entailment",
    "1": "neutral",
    "2": "contradiction"
  },
  "initializer_range": 0.02,
  "label2id": {
    "contradiction": 2,
    "entailment": 0,
    "neutral": 1
  },
  "layer_norm_eps": 1e-12,
  "mem_len": null,
  "model_type": "xlnet",
  "n_head": 16,
  "n_layer": 24,
  "pad_token_id": 5,
  "reuse_len": null,
  "same_length": false,
  "start_n_top": 5,
  "summary_activation": "tanh",
  "summary_last_dropout": 0.1,
  "summary_type": "last",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 250
    }
  },
  "transformers_version": "4.17.0",
  "untie_r": true,
  "use_mems_eval": true,
  "use_mems_train": false,
  "vocab_size": 32000
}

loading weights file https://huggingface.co/ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli/resolve/main/pytorch_model.bin from cache at /Users/<USER>/.cache/huggingface/transformers/497700d1fcba7e3645179d764fdfb1876debe562dc47f01626383396694a7a44.8fd318050b4dc29b0a5933a2c5a73385fe6607522f2ca1622e82339745b920b8
All model checkpoint weights were used when initializing XLNetForSequenceClassification.

All the weights of XLNetForSequenceClassification were initialized from the model checkpoint at ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use XLNetForSequenceClassification for predictions without further training.
/Users/<USER>/opt/anaconda3/envs/nlu/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Look at the Trainer documentation. By default it may save weights every epoch or every X steps. You can load these checkpoints (wherever they were saved) directly. The Trainer docs also cover prediction: you can evaluate either a single sample or a whole dataset.
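To make the checkpoint suggestion concrete, here is a minimal sketch rather than a definitive recipe. It assumes the Trainer wrote folders named checkpoint-<step> into the test-trainer output directory mentioned in the question, and that each folder holds a config and weights that from_pretrained can read. The helper names find_latest_checkpoint and load_for_prediction are made up for illustration. I also assume the tokenizer was not saved with the checkpoint, so it is reloaded from the base model name shown in the log.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: Trainer checkpoints are folders named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint folder with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically so that checkpoint-1000 ranks above checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]


def load_for_prediction(checkpoint_dir: Path, tokenizer_name: str):
    """Load the fine-tuned weights from a checkpoint folder (hypothetical helper)."""
    # Imported lazily so the folder lookup above works without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained(str(checkpoint_dir))
    # Assumption: the tokenizer was not saved in the checkpoint, so reuse the base one.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model.eval()  # switch off dropout for inference
    return model, tokenizer


if __name__ == "__main__":
    latest = find_latest_checkpoint("test-trainer")
    if latest is None:
        print("No checkpoint-* folders found in test-trainer")
    else:
        print(f"Loading {latest}")
        model, tokenizer = load_for_prediction(
            latest, "ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli"
        )
```

Keep in mind that the newest checkpoint is not necessarily the model as it stood at the very end of training. If the script used the default step-based saving, the last checkpoint may sit a few hundred steps short of step 3984. The transformers library also ships a similar lookup, get_last_checkpoint in transformers.trainer_utils, which could replace the hand-written helper here.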

I am facing the same "leaked semaphore objects" issue. Any idea why it started right after I upgraded to macOS Sonoma 14.0?