Training doesn't finish properly; the machine shuts down with no error message

System Info

  • transformers version: 4.37.0.dev0
  • Platform: Windows-10-10.0.19045-SP0
  • Python version: 3.11.5
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.0
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sgugger @ArthurZ

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. I run the training on my own machine, which I access remotely via AnyDesk.
  2. Some warning messages appear, such as “You’re using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.” and “The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.float16.”
  3. Training goes well, but after it finishes the train loss is shown and then the machine suddenly stops. All programs, including Visual Studio, shut down and I get disconnected from the machine; I think it restarts automatically. Saving does not complete correctly, and the output isn’t saved correctly either. Specifically, the .json files such as config.json and tokenizer.json, as well as README.md, are all broken: they exist, but they are corrupted. I cannot check the safetensors file. Nothing is saved in the output_dir declared in training_params.
  4. To add, I ran the code with the trainer.model.save_pretrained line commented out, but it still crashes. Surprisingly, though, the .json files related to the tokenizer are then saved properly, not broken. The PC crash still happens, but commenting out the model-saving line partially solves the broken-file problem.
  5. Here is the code
import transformers
from transformers import (BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments, logging)
import torch
import os
from datasets import load_dataset, concatenate_datasets
import json
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM


def main():
	base_model = "mistralai/Mistral-7B-Instruct-v0.2"
	new_model = "Mistral-7B-Instruct-v0.2_newmodel"

	compute_dtype = getattr(torch, "float16")

	quant_config = BitsAndBytesConfig(
		load_in_4bit=True,
		bnb_4bit_quant_type="nf4",
		bnb_4bit_compute_dtype=compute_dtype,
		bnb_4bit_use_double_quant=False
	)

	tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True, padding_side = "right")
	model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config = quant_config, attn_implementation = "flash_attention_2", device_map = {"": 0})
	model.config.use_cache = False
	model.config.pretraining_tp = 1

	if tokenizer.pad_token is None:
		tokenizer.pad_token = tokenizer.eos_token

	
	train_dataset = load_dataset('json', data_files = './dataset/mixed_train.json', split = 'train')
	eval_dataset = load_dataset('json', data_files = './dataset/mixed_val.json', split = 'train')
	print(f"train dataset size: {len(train_dataset)}, eval dataset size: {len(eval_dataset)}")

	training_params = TrainingArguments(
		output_dir="./FT_newmodel",
		num_train_epochs=1,
		per_device_train_batch_size=2,
		per_device_eval_batch_size= 1,
		evaluation_strategy='steps',
		eval_steps=25,
		gradient_accumulation_steps=4,
		optim="paged_adamw_32bit",
		logging_steps=25,
		learning_rate=2e-5,
		weight_decay=0.001,
		fp16=False,
		bf16=False,
		max_grad_norm=0.3,
		max_steps=-1,
		warmup_ratio=0.03,
		group_by_length=True,
		lr_scheduler_type="constant",
		report_to="tensorboard"
	)

	peft_config = LoraConfig(
		lora_alpha=16,
		lora_dropout=0.1,
		r=64,
		bias="none",
		target_modules=[
			"q_proj",
			"k_proj",
			"v_proj",
			"o_proj",
			"gate_proj",
			"up_proj",
			"down_proj",
			"lm_head",
		],
		task_type="CAUSAL_LM"
	)

	trainer = SFTTrainer(
		model = model,
		tokenizer = tokenizer,
		train_dataset = train_dataset,
		eval_dataset = eval_dataset,
		dataset_text_field= "text",
		args = training_params, 
		peft_config = peft_config,
		max_seq_length = 512,
		packing = False,
		neftune_noise_alpha = 5
	)

	trainer.train()
	trainer.model.save_pretrained(new_model)
	trainer.tokenizer.save_pretrained(new_model)


if __name__ == "__main__":
	print("Training starts")
	main()
	print("Training ended")

  6. The funny part is that the machine actually stops after “Training ended” is printed. There is no error message; the machine just stops. I really can’t figure out the problem. Please help…
  • For an OOM issue, do I have to check the hard disk space? If so, more than 300 GB is left, so I guess that’s not the problem.
    Since I’m on Windows, I installed a bitsandbytes build compiled for Windows by an individual (not an official release), so I suspected some unknown incompatibility. I then set up a new WSL2 Linux environment and ran the code there, which led me to “Error out of memory at line 380 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c”. Does this indicate something?

Expected behavior

I just want it to end properly by saving the fine-tuned model.

The “Please note that with a fast tokenizer…” warning is totally okay. It’s been discussed extensively elsewhere, e.g., Get “using the __call__ method is faster” warning with DataCollatorWithPadding. You can disable it (see link).
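For example, a minimal sketch of two ways to silence it (the deprecation_warnings key is an internal detail of the tokenizer and may change between versions):

from transformers import AutoTokenizer, logging

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Option 1: lower transformers' global log level so warnings are hidden.
logging.set_verbosity_error()

# Option 2: mark this specific warning as already shown so the tokenizer's
# pad() call stays quiet (internal flag, key may differ between versions).
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True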

I’m a little confused. Did the program OOM or did the training finish? #3 indicates that the program crashed, but #6 suggests the program finished.

I’m also confused by #3. You say nothing was saved in the output_dir, but the output files were corrupted when saving. In other words, if nothing was saved, how could the saved output be corrupted?

Can you post the command used to execute the script as well as the terminal output? Also you mention this is your own task or dataset. Please elaborate on that.

Sorry for the unclear explanation.

OOM
The terminal doesn’t show an OOM directly; that was just my speculation. The terminal actually prints “Training ended”, but right then the machine stops. I asked ChatGPT about this, and it says that since releasing memory takes some time, the system can crash even after “Training ended” is printed.

Output Files
For the output files, I meant that nothing was saved in ./FT_newmodel, which is the output_dir set in training_params; I think only the training log files end up in that directory. The corrupted output is saved in ./Mistral-7B-Instruct-v0.2_newmodel, which is the directory the model and tokenizer configuration are saved to by:

trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)
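
As an aside, the Trainer only writes checkpoints into output_dir at whatever interval save_steps specifies (500 steps by default, if I understand correctly), so a short run can finish without ever checkpointing there. A minimal sketch of saving more often, assuming the standard TrainingArguments checkpointing options, so that something survives even if the machine dies during the final save:

from transformers import TrainingArguments

# Sketch: checkpoint into output_dir while training runs (other arguments unchanged).
training_params = TrainingArguments(
    output_dir="./FT_newmodel",
    save_strategy="steps",   # write checkpoints during training
    save_steps=25,           # every 25 optimizer steps
    save_total_limit=2,      # keep only the two most recent checkpoints
    # ... remaining arguments exactly as in the original script ...
)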

Other Information
I just executed the code with python fine_tuning.py, which is the Python script’s file name. The terminal shows nothing more than what I mentioned: the two warning messages, the progress bar, “Training starts”, and “Training ended”.
I’m just practicing instruction-based fine-tuning, using a medical dataset that I curated myself, sourced from Hugging Face and PubMed.
The instructions look something like this: “[INST] Write an appropriate title for following medical abstract: [abstract][/INST] [title]”. There are also a sentence start token and a sentence end token in the text, which I guess are not shown here.
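
For reference, a minimal sketch of how one training example could be assembled into the dataset’s text field; build_example is a hypothetical helper, and <s>/</s> are the sentence start and end tokens mentioned above:

def build_example(abstract: str, title: str) -> str:
    # Hypothetical helper: wrap an (abstract, title) pair in the
    # [INST] ... [/INST] instruction format with explicit start/end tokens.
    return (
        "<s>[INST] Write an appropriate title for following medical abstract: "
        f"{abstract}[/INST] {title}</s>"
    )

record = {"text": build_example("Background: ...", "A placeholder title")}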

Okay, I understand better. This is indeed unusual. Have you watched the memory usage of the process using top? That could be helpful to assess the likelihood of OOM. I would suggest the same for the GPU utilization using nvidia-smi, just to get a sense of how resources are being used. Finally, I’d recommend ensuring that your hard drive isn’t full.
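
For instance, a minimal sketch of printing GPU memory from inside the script (torch.cuda reports only what this process holds; nvidia-smi in a second terminal shows the machine-wide picture):

import torch

def report_gpu_memory(tag: str) -> None:
    # Print how much CUDA memory this process has allocated/reserved and how
    # much is free on the device.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    free, total = torch.cuda.mem_get_info()
    print(f"[{tag}] allocated {allocated:.1f} GiB, reserved {reserved:.1f} GiB, "
          f"free {free / 2**30:.1f} / {total / 2**30:.1f} GiB")

Calling report_gpu_memory right before and after trainer.train() and each save_pretrained call would show whether usage spikes at the very end.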

Are you able to get the program to work properly using a slightly different configuration, e.g., by using a different (smaller) model or a different (simpler) dataset?
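
For example, a minimal sanity-check run along these lines, where the smaller model name, the data slice, and the step count are only illustrative placeholders:

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Illustrative smaller run: a small causal LM, a slice of the data, and only a
# few steps, just to see whether the machine still crashes at the end.
small_args = TrainingArguments(
    output_dir="./FT_sanity_check",
    max_steps=20,
    per_device_train_batch_size=1,
    logging_steps=5,
    report_to="none",
)
small_trainer = SFTTrainer(
    model="facebook/opt-350m",  # placeholder small model
    args=small_args,
    train_dataset=load_dataset(
        "json", data_files="./dataset/mixed_train.json", split="train[:200]"
    ),
    dataset_text_field="text",
    max_seq_length=512,
)
small_trainer.train()
small_trainer.model.save_pretrained("./sanity_check_model")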


That was very foolish of me. I guess it was indeed an OOM problem. I changed r in the PEFT configuration from 64 to 32 and the problem is solved. The run now uses only about 13 GB of the 24 GB of dedicated GPU memory, and training got much faster… Thanks for the help!
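
For anyone who lands on the same problem, the change amounts to this (only r differs from the configuration in the original script):

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=32,  # reduced from 64, which roughly halves the LoRA parameters and their optimizer state
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    task_type="CAUSAL_LM",
)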
