A10G not using VRAM after generating training split in AutoTrain

Hi everyone,

I recently purchased a Hugging Face AutoTrain Space with an NVIDIA A10G (24 GB, of which about 22.49 GiB is reported as usable) for fine-tuning nothingiisreal/MN-12B-Celeste-V1.9 on josecannete/large_spanish_corpus.

When I start training, AutoTrain first generates the training split. During that step, VRAM usage is basically zero (around 2.88 MiB out of 22.49 GiB). After the split finishes, the process just stops: no training actually begins, and GPU usage never increases.

I expected VRAM usage to spike when training started, but it seems the job never reaches that stage.

Has anyone else experienced this with AutoTrain + A10G?
Could this be an issue with:

  • Dataset size or format?

  • The LoRA/PEFT + quantization setup I’m using?

  • Some AutoTrain pipeline bug for large models?

Any help would be appreciated. I just want to confirm if this is normal behavior for the split step, and why the actual training might not be starting.

Thanks in advance!

I want to train Celeste V1.9 to learn Spanish, then Spanish books with PleIAs/Spanish-PD-Books, and then Argentine Spanish with ylacombe/google-argentinian-spanish. But I'm not sure whether my current JSON for the Spanish corpus is OK, or how to configure the JSON for the next steps.

This is my JSON (I switched from bf16 to fp16 and disabled fp4 because I plan to run the result on an RTX 3060 as GGUF later, and set 4 dataloader workers assuming a Linux/WSL environment):

{
  "model": "nothingiisreal/MN-12B-Celeste-V1.9",
  "data": "josecannete/large_spanish_corpus",
  "task": "text-generation",

  "hub_model_id": "SlayerL99/mn12b-celeste-espanol-stage1",

  "training_parameters": {
    "learning_rate": 0.00005,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "warmup_steps": 100,
    "max_seq_length": 2048,
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine",
    "seed": 42,
    "fp16": true,                // switch bf16 to fp16 for better compatibility with 3060 downstream
    "bf16": false,
    "gradient_checkpointing": true,
    "dataloader_num_workers": 4,  // boost data loading speed (assuming Linux/WSL)
    "push_to_hub": true,
    "save_total_limit": 2,
    "logging_steps": 25,
    "save_steps": 200,
    "eval_steps": 200,
    "evaluation_strategy": "steps",
    "load_best_model_at_end": true,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": false,
    "report_to": "tensorboard"
  },

  "peft_parameters": {
    "use_peft": true,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear"
  },

  "quantization_parameters": {
    "use_int4": true,
    "use_int8": false,
    "use_fp4": false,           // disable fp4 for compatibility/stability on RTX 3060 GGUF
    "use_double_quant": true,
    "bnb_4bit_quant_type": "nf4"
  }
}

I just want to confirm if this is normal behavior for the split step, and why the actual training might not be starting.

Yeah. Maybe. I think AutoTrain is designed to stop as quickly as possible when any error occurs.

BTW, those JSON settings look like parameters for an older version of the Trainer. How about something like this (in YAML)?

task: llm
base_model: nothingiisreal/MN-12B-Celeste-V1.9
project_name: mn12b-celeste-espanol-stage1
log: tensorboard

data:
  path: josecannete/large_spanish_corpus
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  trainer: sft
  block_size: -1
  model_max_length: 4096
  epochs: 1
  batch_size: 1
  gradient_accumulation: 16
  lr: 5e-5
  warmup_ratio: 0.1
  optimizer: adamw_torch
  scheduler: linear
  weight_decay: 0.01
  logging_steps: 25
  eval_strategy: epoch
  save_total_limit: 2
  mixed_precision: fp16

  # QLoRA
  peft: true
  quantization: int4
  target_modules: all-linear
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.10

  padding: right
  seed: 42

hub:
  username: SlayerL99
  push_to_hub: true
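
For the later stages, mostly just the data section, project_name, and base_model would change. Here is a rough sketch for the PleIAs/Spanish-PD-Books stage, assuming you continue from your stage-1 output and that the dataset exposes a single plain-text column (check the dataset viewer for the real column name; "text" below is only a placeholder):

task: llm
base_model: SlayerL99/mn12b-celeste-espanol-stage1  # stage-1 output; if it's an adapter-only repo you may need to merge it first, or keep the original base model
project_name: mn12b-celeste-espanol-stage2
log: tensorboard

data:
  path: PleIAs/Spanish-PD-Books
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text  # placeholder; replace with the dataset's actual text column

# params and hub sections same as above

The same pattern should work for the Argentine Spanish stage, as long as the dataset you point it at has a plain text column.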

Ah, that may be it.

So I just transform that into JSON, paste it into the parameters, and try?

So I just transform that into JSON, paste it into the parameters, and try?

Maybe okay. I think they were treated the same internally anyway…
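
For reference, the params block of that YAML as flat JSON would look roughly like this. It's just a sketch: the base model, dataset, column mapping, and hub settings come from the rest of the form, so I'd keep only the training parameters in the JSON box, and I'm not 100% sure it accepts every one of these keys.

{
  "trainer": "sft",
  "block_size": -1,
  "model_max_length": 4096,
  "epochs": 1,
  "batch_size": 1,
  "gradient_accumulation": 16,
  "lr": 5e-5,
  "warmup_ratio": 0.1,
  "optimizer": "adamw_torch",
  "scheduler": "linear",
  "weight_decay": 0.01,
  "logging_steps": 25,
  "eval_strategy": "epoch",
  "save_total_limit": 2,
  "mixed_precision": "fp16",
  "peft": true,
  "quantization": "int4",
  "target_modules": "all-linear",
  "lora_r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "padding": "right",
  "seed": 42
}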

Heyo! I converted that YAML into JSON and tried it, but the same thing happens. AutoTrain gives "Error fetching trainer status" and just stops :confused:

"Error fetching trainer status"

I think this error is caused by something going wrong in the Accelerate library. Could it be that the Accelerate version on the newly created Space is outdated? :thinking: