I just want to confirm if this is normal behavior for the split step, and why the actual training might not be starting.
Yeah, maybe. I think AutoTrain is designed to fail fast, stopping as soon as any error occurs.
BTW, those JSON settings may correspond to the parameters of an older version of the Trainer. How about something like this (in YAML)?
```yaml
task: llm
base_model: nothingiisreal/MN-12B-Celeste-V1.9
project_name: mn12b-celeste-espanol-stage1
log: tensorboard

data:
  path: josecannete/large_spanish_corpus
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  trainer: sft
  block_size: -1
  model_max_length: 4096
  epochs: 1
  batch_size: 1
  gradient_accumulation: 16
  lr: 5e-5
  warmup_ratio: 0.1
  optimizer: adamw_torch
  scheduler: linear
  weight_decay: 0.01
  logging_steps: 25
  eval_strategy: epoch
  save_total_limit: 2
  mixed_precision: fp16
  # QLoRA
  peft: true
  quantization: int4
  target_modules: all-linear
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.10
  padding: right
  seed: 42

hub:
  username: SlayerL99
  push_to_hub: true
```
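If you save that as, say, `config.yml` (the filename is just an example), you should be able to launch the run with `autotrain --config config.yml` in an environment where `autotrain-advanced` is installed. If training still stops right after the split step, the CLI output around that point usually contains the actual error.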