System Info
- `transformers` version: 4.37.0.dev0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.11.5
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.0
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- I ran the code on my own machine, accessed remotely via AnyDesk.
- During training, some warnings appear: “You’re using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.” and “The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.float16.” (A short sketch of the recommended `__call__` usage is included at the end of this section.)
- Training itself goes well, but after training finishes and the train loss is printed, the machine suddenly stops. Every program, including Visual Studio, closes and I get disconnected from the machine, so I think it restarts automatically. Saving is not performed correctly and the output is not saved correctly either. Specifically, the .json files such as config.json and tokenizer.json, as well as README.md, all exist but are corrupted. I cannot check the safetensors file. Nothing is saved in the output_dir declared in training_params.
- Additionally, I ran the code with the `trainer.model.save_pretrained` line commented out, and the machine still crashes. Surprisingly, though, the tokenizer-related .json files are then saved properly and are not corrupted. So the PC crash still happens, but commenting out the model-saving line partially solves the corrupted-file problem. (A hedged checkpointing sketch is included at the end of this section.)
- Here is the code:

```python
import transformers
from transformers import (BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments, logging)
import torch
import os
from datasets import load_dataset, concatenate_datasets
import json
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM


def main():
    base_model = "mistralai/Mistral-7B-Instruct-v0.2"
    new_model = "Mistral-7B-Instruct-v0.2_newmodel"

    compute_dtype = getattr(torch, "float16")

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, padding_side="right")
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=quant_config,
        attn_implementation="flash_attention_2",
        device_map={"": 0},
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    train_dataset = load_dataset('json', data_files='./dataset/mixed_train.json', split='train')
    eval_dataset = load_dataset('json', data_files='./dataset/mixed_val.json', split='train')
    print(f"train dataset size: {len(train_dataset)}, eval dataset size: {len(eval_dataset)}")

    training_params = TrainingArguments(
        output_dir="./FT_newmodel",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=1,
        evaluation_strategy='steps',
        eval_steps=25,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        logging_steps=25,
        learning_rate=2e-5,
        weight_decay=0.001,
        fp16=False,
        bf16=False,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="constant",
        report_to="tensorboard",
    )

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
        task_type="CAUSAL_LM",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        args=training_params,
        peft_config=peft_config,
        max_seq_length=512,
        packing=False,
        neftune_noise_alpha=5,
    )

    trainer.train()
    trainer.model.save_pretrained(new_model)
    trainer.tokenizer.save_pretrained(new_model)


if __name__ == "__main__":
    print("Training starts")
    main()
    print("Training ended")
```
- The strange part is that the machine actually stops after “Training ended” is printed. There is no error message; the machine just stops. I really can’t figure out the problem. Please help…
- Regarding a possible OOM issue, do I have to check hard-disk space? If so, more than 300 GB is free, but I guess that’s not the point.
- Since I’m using Windows, I installed a bitsandbytes build compiled for Windows by an individual, not an official release. Suspecting some unknown incompatibility, I set up a new WSL2 Linux environment and ran the code there, which led to “Error out of memory at line 380 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c”. Does this indicate something? (A small memory-check sketch follows below.)
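For context, here is a small diagnostic sketch (not from my script, and assuming a single CUDA device) that prints free GPU memory and free disk space; my understanding is that this kind of bitsandbytes error usually means GPU memory ran out, not disk:

```python
import shutil
import torch

# Free vs. total memory on GPU 0; a bitsandbytes CUDA OOM refers to GPU memory, not disk.
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"GPU 0: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")
    print(f"Allocated by this process: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")

# Free disk space on the current drive (only relevant for saving checkpoints).
disk = shutil.disk_usage(".")
print(f"Disk: {disk.free / 1e9:.1f} GB free of {disk.total / 1e9:.1f} GB")
```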
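On the broken output files, here is a hedged sketch of what I may try next (not what the script above currently does): turning on periodic checkpointing in `TrainingArguments` and saving through `trainer.save_model()`, so that a usable adapter is on disk even if the machine dies at the very end. The `save_steps` interval and directory names are just placeholders:

```python
from transformers import TrainingArguments

# Same output_dir as above, plus periodic checkpoints so partial output survives a crash.
training_params = TrainingArguments(
    output_dir="./FT_newmodel",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    eval_steps=25,
    logging_steps=25,
    save_strategy="steps",    # write a checkpoint every save_steps
    save_steps=25,            # placeholder interval
    save_total_limit=2,       # keep only the two most recent checkpoints
    optim="paged_adamw_32bit",
    learning_rate=2e-5,
    report_to="tensorboard",
)

# After training (with the same SFTTrainer as above in scope), save the adapter and tokenizer
# into a fresh directory instead of relying only on the final save_pretrained call:
# trainer.save_model("./FT_newmodel/final_adapter")
# tokenizer.save_pretrained("./FT_newmodel/final_adapter")
```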
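And for the first warning, a minimal illustration (made-up texts, not from my dataset) of what I understand the message to recommend: padding by calling the fast tokenizer directly instead of encoding and then calling `pad`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", padding_side="right")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A single __call__ tokenizes, truncates, and pads the whole batch in one pass,
# which is what the LlamaTokenizerFast warning suggests.
batch = tokenizer(
    ["example prompt one", "a longer example prompt two"],  # made-up texts
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```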
Expected behavior
I just want the run to end properly and save the fine-tuned model.