My dataset is 250 docs. Is multiple hours of tuning normal?

I’m tuning a MythoMax 13B model with a dataset of 250 questions and answers on an L40S. It ran for 3.5 hours before I killed it. Is this normal?
ChatGPT estimated 10 minutes.

(from autotrain-advanced) (4.67.1)
Requirement already satisfied: werkzeug==3.1.3 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (3.1.3)
Requirement already satisfied: xgboost==2.1.3 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.1.3)
Requirement already satisfied: huggingface-hub==0.27.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.27.0)
Requirement already satisfied: requests==2.32.3 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.32.3)
Requirement already satisfied: einops==0.8.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.8.0)
Requirement already satisfied: packaging==24.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (24.2)
Requirement already satisfied: cryptography==44.0.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (44.0.0)
Requirement already satisfied: nvitop==1.3.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.3.2)
Requirement already satisfied: tensorboard==2.18.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.18.0)
Requirement already satisfied: peft==0.14.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.14.0)
Requirement already satisfied: trl==0.13.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.13.0)
Requirement already satisfied: tiktoken==0.8.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.8.0)
Requirement already satisfied: transformers==4.48.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (4.48.0)
Requirement already satisfied: accelerate==1.2.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.2.1)
Requirement already satisfied: rouge-score==0.1.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.1.2)
Requirement already satisfied: py7zr==0.22.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.22.0)
Requirement already satisfied: fastapi==0.115.6 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.115.6)
Requirement already satisfied: uvicorn==0.34.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.34.0)
Requirement already satisfied: python-multipart==0.0.20 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.0.20)
Requirement already satisfied: pydantic==2.10.4 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.10.4)
Requirement already satisfied: matplotlib>=2.1.0 in ./env/lib/python3.10/site-packages (from pycocotools==2.0.8->autotrain-advanced) (3.10.0)
Requirement already satisfied: annotated-types>=0.6.0 in ./env/lib/python3.10/site-packages (from pydantic==2.10.4->autotrain-advanced) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.2 in ./env/lib/python3.10/site-packages (from pydantic==2.10.4->autotrain-advanced) (2.27.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./env/lib/python3.10/site-packages (from requests==2.32.3->autotrain-advanced) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./env/lib/python3.10/site-packages (from requests==2.32.3->autotrain-advanced) (2.2.3)
Requirement already satisfied: absl-py in ./env/lib/python3.10/site-packages (from rouge-score==0.1.2->autotrain-advanced) (2.1.0)
Requirement already satisfied: six>=1.14.0 in ./env/lib/python3.10/site-packages (from rouge-score==0.1.2->autotrain-advanced) (1.17.0)
Requirement already satisfied: threadpoolctl>=3.1.0 in ./env/lib/python3.10/site-packages (from scikit-learn==1.6.0->autotrain-advanced) (3.5.0)
Requirement already satisfied: grpcio>=1.48.2 in ./env/lib/python3.10/site-packages (from tensorboard==2.18.0->autotrain-advanced) (1.69.0)
Requirement already satisfied: markdown>=2.6.8 in ./env/lib/python3.10/site-packages (from tensorboard==2.18.0->autotrain-advanced) (3.7)
Requirement already satisfied: protobuf!=4.24.0,>=3.19.6 in ./env/lib/python3.10/site-packages (from tensorboard==2.18.0->autotrain-advanced) (5.29.3)
Requirement already satisfied: setuptools>=41.0.0 in ./env/lib/python3.10/site-packages (from tensorboard==2.18.0->autotrain-advanced) (75.1.0)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in ./env/lib/python3.10/site-packages (from tensorboard==2.18.0->autotrain-advanced) (0.7.2)
Requirement already satisfied: torchvision in ./env/lib/python3.10/site-packages (from timm==1.0.12->autotrain-advanced) (0.19.0)
Requirement already satisfied: lightning-utilities>=0.8.0 in ./env/lib/python3.10/site-packages (from torchmetrics==1.6.0->autotrain-advanced) (0.11.9)
Requirement already satisfied: tokenizers<0.22,>=0.21 in ./env/lib/python3.10/site-packages (from transformers==4.48.0->autotrain-advanced) (0.21.0)
Requirement already satisfied: rich in ./env/lib/python3.10/site-packages (from trl==0.13.0->autotrain-advanced) (13.9.4)
Requirement already satisfied: h11>=0.8 in ./env/lib/python3.10/site-packages (from uvicorn==0.34.0->autotrain-advanced) (0.14.0)
Requirement already satisfied: MarkupSafe>=2.1.1 in ./env/lib/python3.10/site-packages (from werkzeug==3.1.3->autotrain-advanced) (2.1.3)
Requirement already satisfied: nvidia-nccl-cu12 in ./env/lib/python3.10/site-packages (from xgboost==2.1.3->autotrain-advanced) (2.24.3)
Requirement already satisfied: stringzilla>=3.10.4 in ./env/lib/python3.10/site-packages (from albucore==0.0.21->albumentations==1.4.23->autotrain-advanced) (3.11.3)
Requirement already satisfied: simsimd>=5.9.2 in ./env/lib/python3.10/site-packages (from albucore==0.0.21->albumentations==1.4.23->autotrain-advanced) (6.2.1)
Requirement already satisfied: pyarrow>=15.0.0 in ./env/lib/python3.10/site-packages (from datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (18.1.0)
Requirement already satisfied: aiohttp in ./env/lib/python3.10/site-packages (from datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (3.11.11)
Requirement already satisfied: Mako in ./env/lib/python3.10/site-packages (from alembic>=1.5.0->optuna==4.1.0->autotrain-advanced) (1.3.8)
Requirement already satisfied: pycparser in ./env/lib/python3.10/site-packages (from cffi>=1.12->cryptography==44.0.0->autotrain-advanced) (2.22)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (2.4.4)
Requirement already satisfied: aiosignal>=1.1.2 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (1.3.2)
Requirement already satisfied: async-timeout<6.0,>=4.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (5.0.1)
Requirement already satisfied: attrs>=17.3.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (24.3.0)
Requirement already satisfied: frozenlist>=1.1.1 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (6.1.0)
Requirement already satisfied: propcache>=0.2.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (0.2.1)
Requirement already satisfied: yarl<2.0,>=1.17.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=3.2.0->datasets[vision]~=3.2.0->autotrain-advanced) (1.18.3)
Requirement already satisfied: contourpy>=1.0.1 in ./env/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->autotrain-advanced) (1.3.1)
Requirement already satisfied: cycler>=0.10 in ./env/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->autotrain-advanced) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in ./env/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->autotrain-advanced) (4.55.3)
Requirement already satisfied: kiwisolver>=1.3.1 in ./env/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->autotrain-advanced) (1.4.8)
Requirement already satisfied: pyparsing>=2.3.1 in ./env/lib/python3.10/site-packages (from matplotlib>=2.1.0->pycocotools==2.0.8->autotrain-advanced) (3.2.1)
Requirement already satisfied: greenlet!=0.4.17 in ./env/lib/python3.10/site-packages (from sqlalchemy>=1.4.2->optuna==4.1.0->autotrain-advanced) (3.1.1)
Requirement already satisfied: exceptiongroup>=1.0.2 in ./env/lib/python3.10/site-packages (from anyio->httpx==0.28.1->autotrain-advanced) (1.2.2)
Requirement already satisfied: sniffio>=1.1 in ./env/lib/python3.10/site-packages (from anyio->httpx==0.28.1->autotrain-advanced) (1.3.1)
Requirement already satisfied: sympy in ./env/lib/python3.10/site-packages (from torch>=1.10.0->accelerate==1.2.1->autotrain-advanced) (1.13.3)
Requirement already satisfied: networkx in ./env/lib/python3.10/site-packages (from torch>=1.10.0->accelerate==1.2.1->autotrain-advanced) (3.2.1)
Requirement already satisfied: jinja2 in ./env/lib/python3.10/site-packages (from torch>=1.10.0->accelerate==1.2.1->autotrain-advanced) (3.1.4)
Requirement already satisfied: markdown-it-py>=2.2.0 in ./env/lib/python3.10/site-packages (from rich->trl==0.13.0->autotrain-advanced) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in ./env/lib/python3.10/site-packages (from rich->trl==0.13.0->autotrain-advanced) (2.19.1)
Requirement already satisfied: mdurl~=0.1 in ./env/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich->trl==0.13.0->autotrain-advanced) (0.1.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in ./env/lib/python3.10/site-packages (from sympy->torch>=1.10.0->accelerate==1.2.1->autotrain-advanced) (1.3.0)
Downloading autotrain_advanced-0.8.36-py3-none-any.whl (341 kB)
Installing collected packages: autotrain-advanced
Successfully installed autotrain-advanced-0.8.36
INFO     | 2025-01-25 03:35:42 | autotrain.app.ui_routes:<module>:31 - Starting AutoTrain...
INFO     | 2025-01-25 03:35:44 | autotrain.app.ui_routes:<module>:315 - AutoTrain started successfully
INFO     | 2025-01-25 03:35:44 | autotrain.app.app:<module>:13 - Starting AutoTrain...
INFO     | 2025-01-25 03:35:44 | autotrain.app.app:<module>:23 - AutoTrain version: 0.8.36
INFO     | 2025-01-25 03:35:44 | autotrain.app.app:<module>:24 - AutoTrain started successfully
INFO:     Started server process [53]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
INFO:     10.16.46.168:22511 - "GET /?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJyZWFkIjp0cnVlLCJwZXJtaXNzaW9ucyI6eyJyZXBvLmNvbnRlbnQucmVhZCI6dHJ1ZX0sIm9uQmVoYWxmT2YiOnsia2luZCI6InVzZXIiLCJfaWQiOiI2NmM5MTJiOTc0NmNkZGQ5NTllM2RkNjciLCJ1c2VyIjoiZ29rc3RhZCJ9LCJpYXQiOjE3Mzc3NzYxNDUsInN1YiI6Ii9zcGFjZXMvZ29rc3RhZC9hdXRvdHJhaW4tbXl0aG9tYXgiLCJleHAiOjE3Mzc4NjI1NDUsImlzcyI6Imh0dHBzOi8vaHVnZ2luZ2ZhY2UuY28ifQ.d5y_fbriLBfpQr0l4lyFOUX1hQEZwvOuFRLavEJH2Bo0MMxZd6SB62vIyhipLbV7WkHeSDrGyp3eo2XZGx_VDQ HTTP/1.1" 307 Temporary Redirect
INFO:     10.16.46.168:22511 - "GET /?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJyZWFkIjp0cnVlLCJwZXJtaXNzaW9ucyI6eyJyZXBvLmNvbnRlbnQucmVhZCI6dHJ1ZX0sIm9uQmVoYWxmT2YiOnsia2luZCI6InVzZXIiLCJfaWQiOiI2NmM5MTJiOTc0NmNkZGQ5NTllM2RkNjciLCJ1c2VyIjoiZ29rc3RhZCJ9LCJpYXQiOjE3Mzc3NzYxNDUsInN1YiI6Ii9zcGFjZXMvZ29rc3RhZC9hdXRvdHJhaW4tbXl0aG9tYXgiLCJleHAiOjE3Mzc4NjI1NDUsImlzcyI6Imh0dHBzOi8vaHVnZ2luZ2ZhY2UuY28ifQ.d5y_fbriLBfpQr0l4lyFOUX1hQEZwvOuFRLavEJH2Bo0MMxZd6SB62vIyhipLbV7WkHeSDrGyp3eo2XZGx_VDQ HTTP/1.1" 307 Temporary Redirect
INFO:     10.16.21.179:2795 - "GET /ui/?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJyZWFkIjp0cnVlLCJwZXJtaXNzaW9ucyI6eyJyZXBvLmNvbnRlbnQucmVhZCI6dHJ1ZX0sIm9uQmVoYWxmT2YiOnsia2luZCI6InVzZXIiLCJfaWQiOiI2NmM5MTJiOTc0NmNkZGQ5NTllM2RkNjciLCJ1c2VyIjoiZ29rc3RhZCJ9LCJpYXQiOjE3Mzc3NzYxNDUsInN1YiI6Ii9zcGFjZXMvZ29rc3RhZC9hdXRvdHJhaW4tbXl0aG9tYXgiLCJleHAiOjE3Mzc4NjI1NDUsImlzcyI6Imh0dHBzOi8vaHVnZ2luZ2ZhY2UuY28ifQ.d5y_fbriLBfpQr0l4lyFOUX1hQEZwvOuFRLavEJH2Bo0MMxZd6SB62vIyhipLbV7WkHeSDrGyp3eo2XZGx_VDQ HTTP/1.1" 200 OK
INFO:     10.16.39.228:31650 - "GET /static/scripts/fetch_data_and_update_models.js?cb=2025-01-25%2003:35:46 HTTP/1.1" 200 OK
INFO:     10.16.46.168:22511 - "GET /static/scripts/listeners.js?cb=2025-01-25%2003:35:46 HTTP/1.1" 200 OK
INFO:     10.16.39.228:31650 - "GET /static/scripts/utils.js?cb=2025-01-25%2003:35:46 HTTP/1.1" 200 OK
INFO:     10.16.17.134:6478 - "GET /static/scripts/poll.js?cb=2025-01-25%2003:35:46 HTTP/1.1" 200 OK
INFO:     10.16.21.179:2795 - "GET /static/scripts/logs.js?cb=2025-01-25%2003:35:46 HTTP/1.1" 200 OK
INFO:     10.16.17.134:6478 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO:     10.16.46.168:22511 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO     | 2025-01-25 03:35:46 | autotrain.app.ui_routes:fetch_params:415 - Task: llm:sft
INFO:     10.16.39.228:28699 - "GET /ui/params/llm%3Asft/basic HTTP/1.1" 200 OK
INFO:     10.16.39.228:31650 - "GET /ui/model_choices/llm%3Asft HTTP/1.1" 200 OK
INFO     | 2025-01-25 03:38:35 | autotrain.app.ui_routes:handle_form:540 - hardware: local-ui
INFO     | 2025-01-25 03:38:35 | autotrain.backends.local:create:20 - Starting local training...
INFO     | 2025-01-25 03:38:35 | autotrain.commands:launch_command:514 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-t7kn1-a6v62/training_params.json']
INFO     | 2025-01-25 03:38:35 | autotrain.commands:launch_command:515 - {'model': 'Gryphe/MythoMax-L2-13b', 'project_name': 'autotrain-t7kn1-a6v62', 'data_path': 'gokstad/13winters', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 32, 'model_max_length': 128, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'lr': 0.0001, 'epochs': 1, 'batch_size': 16, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': 'none', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': 'prompt', 'text_column': '{     "input": "input",     "output": "response" }', 'rejected_text_column': 'rejected_text', 'push_to_hub': True, 'username': 'gokstad', 'token': '*****', 'unsloth': False, 'distributed_backend': 'ddp'}
INFO     | 2025-01-25 03:38:35 | autotrain.backends.local:create:25 - Training PID: 69
INFO:     10.16.21.179:5168 - "POST /ui/create_project HTTP/1.1" 200 OK
INFO:     10.16.5.242:21957 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO:     10.16.21.179:5168 - "GET /ui/accelerators HTTP/1.1" 200 OK
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
INFO:     10.16.5.242:9349 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO     | 2025-01-25 03:38:42 | autotrain.trainers.clm.train_clm_sft:train:11 - Starting SFT training...

Generating train split:   0%|          | 0/183 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 183/183 [00:00<00:00, 97703.36 examples/s]
INFO     | 2025-01-25 03:38:43 | autotrain.trainers.clm.utils:process_input_data:550 - Train data: Dataset({
    features: ['input', 'response'],
    num_rows: 183
})
INFO     | 2025-01-25 03:38:43 | autotrain.trainers.clm.utils:process_input_data:551 - Valid data: None
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:configure_logging_steps:671 - configuring logging steps
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:configure_logging_steps:684 - Logging steps: 2
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:configure_training_args:723 - configuring training args
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:configure_block_size:801 - Using block size 32
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:get_model:877 - Can use unsloth: False
WARNING  | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:get_model:919 - Unsloth not available, continuing without it...
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:get_model:921 - loading model config...
INFO     | 2025-01-25 03:38:44 | autotrain.trainers.clm.utils:get_model:929 - loading model...
`low_cpu_mem_usage` was None, now default to True since model is quantized.

Downloading shards: 100%|██████████| 13/13 [01:23<00:00,  6.42s/it]

Loading checkpoint shards: 100%|██████████| 13/13 [00:10<00:00,  1.25it/s]
INFO     | 2025-01-25 03:40:19 | autotrain.trainers.clm.utils:get_model:960 - model dtype: torch.float16
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
INFO     | 2025-01-25 03:40:50 | autotrain.trainers.clm.train_clm_sft:train:39 - creating trainer

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 0 examples [00:00, ? examples/s]
ERROR    | 2025-01-25 03:40:51 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/datasets/builder.py", line 1607, in _prepare_split_single
    for key, record in generator:
  File "/app/env/lib/python3.10/site-packages/datasets/packaged_modules/generator/generator.py", line 33, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/app/env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 492, in data_generator
    yield from constant_length_iterator
  File "/app/env/lib/python3.10/site-packages/trl/trainer/utils.py", line 648, in __iter__
    buffer.append(self.formatting_func(next(iterator)))
  File "/app/env/lib/python3.10/site-packages/trl/trainer/utils.py", line 623, in <lambda>
    self.formatting_func = lambda x: x[dataset_text_field]
KeyError: '{     "input": "input",     "output": "response" }'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 495, in _prepare_packed_dataloader
    packed_dataset = Dataset.from_generator(
  File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1117, in from_generator
    ).read()
  File "/app/env/lib/python3.10/site-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/app/env/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/app/env/lib/python3.10/site-packages/datasets/builder.py", line 1648, in _download_and_prepare
    super()._download_and_prepare(
  File "/app/env/lib/python3.10/site-packages/datasets/builder.py", line 1000, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/app/env/lib/python3.10/site-packages/datasets/builder.py", line 1486, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/app/env/lib/python3.10/site-packages/datasets/builder.py", line 1643, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/__main__.py", line 28, in train
    train_sft(config)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/train_clm_sft.py", line 46, in train
    trainer = SFTTrainer(
  File "/app/env/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 265, in __init__
    train_dataset = self._prepare_dataset(
  File "/app/env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 391, in _prepare_dataset
    return self._prepare_packed_dataloader(
  File "/app/env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 499, in _prepare_packed_dataloader
    raise ValueError(
ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.

ERROR    | 2025-01-25 03:40:51 | autotrain.trainers.common:wrapper:216 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
INFO     | 2025-01-25 03:40:51 | autotrain.trainers.common:pause_space:156 - Pausing space...

Something is probably wrong. I don’t know if it’s the model, the dataset, the training settings, or a bug, but 48GB of VRAM should generally be enough (depending on the training algorithm), and with the default settings it should probably take less than 10 minutes, as ChatGPT says. There’s not much data.

If you changed the default settings, that might be the cause. Also, AutoTrainAdvanced automates as much of the training preparation as possible, but there may have been parts that couldn’t be automated successfully because the structure of the model or dataset was outside what the program expects.


Hey there!

It sounds like you’re trying to fine-tune the MythoMax 13B model, and yeah, 3.5 hours does feel like a long time, especially for 250 questions and answers. Let me break it down for you in a casual way:

  1. Big model = Big work: The MythoMax 13B is huge (13 billion parameters!), so it’s no surprise it takes some serious muscle to fine-tune.

  2. Your GPU (L40S): It’s a strong GPU, but if you’re doing full fine-tuning (changing all the model’s parameters), it’s going to be slow. A lot slower than you’d like.

  3. Why 10 minutes? If ChatGPT estimated 10 minutes, it was probably thinking you were using a faster method like LoRA (only fine-tuning parts of the model) or had a multi-GPU setup. Full fine-tuning and single GPU? Yeah, 3.5 hours makes sense.

  4. How to make it faster:

    • Use LoRA or QLoRA for tuning (rough sketch after this list). These make it so much quicker and use way less power.
    • Stick to FP16 (half precision) instead of full precision to save time.
    • Check if you’re running too many epochs—maybe you only need one or two.
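
If it helps, here’s a rough sketch of what that QLoRA-style setup looks like outside of AutoTrain, using peft + transformers. The model id comes from the training log earlier in the thread; everything else (hyperparameters, quantization settings) is a placeholder, not your exact AutoTrain configuration:

# Hedged sketch: load the base model in 4-bit (QLoRA-style) and attach a LoRA adapter.
# Requires bitsandbytes; the hyperparameters here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Gryphe/MythoMax-L2-13b"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb,
                                             device_map="auto")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter trains, so each step is much cheaper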

So yeah, it’s pretty normal, but you can definitely speed things up with some tweaks. Let me know if you need help setting up LoRA or anything else. Hope this helps! :blush:


These are Q&As about one specific band the model has horrendously bad knowledge of (it says the band is a terrorist group responsible for a bombing, lol). So the Q&A set is positive and negative reinforcement that it is a band, based on their entire discography, biographies, reviews, tours, members, “no, they are not affiliated with terrorism,” and so on.
So I am trying to tune a specific area of the model.
I guess I do need help setting up LoRA then.


In the case of AutoTrainAdvanced, it may be enough to just set PEFT/LoRA to true. In the case of a normal Trainer, you will need to write a few lines of code (rough sketch below).
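
With trl it’s roughly a few lines like this (a minimal sketch, not AutoTrain’s internals; the file name and prompt format are just examples):

# Minimal SFT + LoRA sketch with trl; adjust the file name and formatting to your data.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("json", data_files="data.jsonl", split="train")
# Fold the two columns into a single text field so the trainer has one column to read.
ds = ds.map(lambda ex: {"text": f"### Question: {ex['input']}\n### Answer: {ex['output']}"})

trainer = SFTTrainer(
    model="Gryphe/MythoMax-L2-13b",
    train_dataset=ds,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(output_dir="mythomax-lora", dataset_text_field="text",
                   max_seq_length=256, num_train_epochs=1,
                   per_device_train_batch_size=2, gradient_accumulation_steps=4),
)
trainer.train()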

Mind you, my browser only shows me a little window of the trainer UI, so I have to work within a 150-pixel-tall window. Pfff, Brave, amirite?

I see by default PEFT/LoRA is “true”; doesn’t this mean it’s running a LoRA? Or is true the PEFT and false the LoRA? (That defies the logical labeling of a switch, but you never know these days.) I now notice the menu to the left has more options, including “Task”, which looks like a set of presets. I’m not sure which one is best for what I’m doing here, or if I should just go with the default “LLM SFT” and work with the settings. I’m going through the page you gave me.


I see by default PEFT/LoRA is “true”, doesn’t this mean it’s running a LoRA?

I think that’s fine… probably.
Is the default SFT? SFT is quite heavy, so that might be the reason.


It would seem so. When I start the Space, that’s what it’s set to. I had Chat explain Unsloth to me, which by default is set to false. By the sounds of it, that’s something I want set to true, since I’m just tuning a single topic. I also changed my dataset labeling so it is now “input” and “output”; this way the column mapping can’t be mixed up. I wish HF would do what WordPress and other sites do with fields, like “(Separate tags with commas, e.g. tiktok, dance, trend)”, you know, give a little example to follow.

I can’t get a tuning run to work again. It’s almost like that one run was a fluke… well, it was, since I had to stop it.
I keep getting the message that my column mapping is wrong, even though I changed my data’s labeling and now enter:


{"input": "input", "output": "output"}

as my column mapping. Here’s an actual line from the dataset:

{"input": "When was 13 Winters formed?", "output": "13 Winters was formed in 2001 in the black woods of southern Maine."}

I also get this notification: “ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.”
Do I need to double the number of documents in my jsonl? I could have GPT generate 250 more Q&As to bump it up to 500.


I think there aren’t many data samples, but I’m not sure if that’s the reason or if it’s because of the following bugs…
It might be more reliable to pick up a sample dataset from somewhere on the web and try running with that. I think that will help isolate the problem.
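
If you want to isolate it, loading your own file the same way datasets does and printing what it sees is usually enough to spot a column or format problem (the file name here is just an example):

# Load the JSONL the same way the trainer would and inspect it.
from datasets import load_dataset

ds = load_dataset("json", data_files="my_data.jsonl", split="train")
print(ds.column_names)  # should be exactly the names you put in the column mapping
print(ds[0])            # one full record, so you can check the fields and quoting
print(len(ds))          # number of Q&A samples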

After going back through my dataset, I learned that I was sold short by ChatGPT. It only gave me 185 Q&As in total, lol. It’s hard to count them when they are in JSON format. I just fed it more info plus all the old info and had it go on a generating spree. I also asked MythoMax to explain to me in detail who the band is and gave that to ChatGPT with “All of this info is wrong. Generate 100 Q&As refuting and correcting the info.” Now I have 534 bona fide lines of JSONL.

Here are my notes as I go. I saved the column mapping and the JSON settings.

{"input": "input", "output": "output"}


{
    "auto_find_batch_size": "false",
    "chat_template": "none",
    "disable_gradient_checkpointing": "false",
    "distributed_backend": "ddp",
    "eval_strategy": "epoch",
    "merge_adapter": "false",
    "mixed_precision": "fp16",
    "optimizer": "adamw_torch",
    "peft": "true",
    "padding": "right",
    "quantization": "int4",
    "scheduler": "linear",
    "unsloth": "true",
    "use_flash_attention_2": "false",
    "batch_size": "2",
    "block_size": "32",
    "epochs": "1",
    "gradient_accumulation": "4",
    "lr": "0.0001",
    "logging_steps": "-1",
    "lora_alpha": "32",
    "lora_dropout": "0.05",
    "lora_r": "16",
    "max_grad_norm": "1",
    "model_max_length": "256",
    "save_total_limit": "1",
    "seed": "42",
    "warmup_ratio": "0.1",
    "weight_decay": "0",
    "target_modules": "all-linear"
  }

I get frustrated with ChatGPT to the point that I fully believe it is designed by OpenAI to not actually help people with this. I mean, it would make sense that they would want to protect their product. Why sell a 3D printer that can print 3D printers, right? Because otherwise this confused bot was just put in charge of an entire country’s defense system, yikes. Here’s Chat telling me Hugging Face doesn’t have Unsloth installed.

From the error log and your configuration settings, it seems the problem lies in how the dataset is being processed and packed during the training phase. Specifically:

Key Error:
Error Message: KeyError: '{"input": "input", "output": "output"}'
This suggests that the dataset's column mapping is not being interpreted correctly by the system. The mapping you provided in JSON format ({"input": "input", "output": "output"}) is not directly compatible with how the trl library expects the columns to be defined.

Recommended Fixes:

  1. Adjust the Column Mapping: The trl library expects the column mapping to be passed as separate arguments rather than as a JSON string. You may need to adjust how you pass the column mapping. Instead of {"input": "input", "output": "output"}, try explicitly setting the input and output columns in the interface or config, like:

input_column: input
output_column: output

Ensure the dataset's columns match these names.

  2. Modify the Block Size: The block size of 32 might be too small for packing sequences effectively. Increase the block size to, say, 128 or 256, depending on the average length of your input-output pairs.

  3. Batch Size Adjustments: With a batch size of 2 and gradient_accumulation=4, you are simulating an effective batch size of 8. Ensure your GPU has enough memory to handle this configuration. If memory is a concern, try reducing gradient_accumulation to 2.

  4. Unsloth Dependency: The warnings about Unsloth indicate it's not installed. Since Unsloth optimizes performance, consider installing it with:

pip install unsloth_zoo

I think AutoTrainAdvanced works even if you don’t install Unsloth. Or rather, that library is generally excellent and does things automatically, but it’s too advanced and you have to read GitHub to understand the contents…
It’s a pain to count large amounts of data like JSON or CSV by hand, so it’s easier to write a small program to count it and reuse it as needed (example below). Now, if it’s a problem with numbers, it should be fine since you’re over 500. If it’s still not working, it’s not a problem with numbers but with the structure of the data (and even then, I think it’s something small, like the format being different from what the program expects), or a problem with the model.
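
For example (the file name is a placeholder):

# Count and parse-check a JSONL file; any bad line is reported with its number.
import json

rows = []
with open("data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as e:
            print(f"line {i} is not valid JSON: {e}")
print(len(rows), "records")
print(rows[0].keys() if rows else "no records")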
By the way, the safe JSON format for Hugging Face is something like this.
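
For instance, one complete JSON object per line, with keys and string values in double quotes (illustrative values):

{"input": "question text goes here", "output": "answer text goes here"}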

I provided a line from my dataset a few responses back. This person in that thread looks like they are doing the exact same formatting I did. https://discuss.huggingface.co/t/load-dataset-fail-for-custom-json-format/30350/4?u=gokstad

My file is a jsonl and the lines are this:

{"input": "When was 13 Winters formed?", "output": "13 Winters was formed in 2001 in the black woods of southern Maine."}

But looking again I see their format is like this:

{"input": When was 13 Winters formed?, "output": 13 Winters was formed in 2001 in the black woods of southern Maine.}

Grok and GPT both say that the strings should be enclosed in " ", which is what I always thought, so I don’t know what’s right.


Grok and GPT both say that the strings should be enclosed in " "

Me too.

See, that’s what I don’t get with the other thread then. Someone posted their data without the " " and no one pointed it out, and that was an active thread.


{ “A” : string, “B”: list of string, “C”: list of list of bool }

I see. That’s probably because everyone thought, “This is a concept, not actual data.”
The original post above probably means that the actual data would look like this in Python.

{"A": "string", "B": ["string", "string"], "C": [[False, True, False], [False, True]]}

Hi. You definitely want those strings all enclosed in quotation marks. You want it that way even if someone else tells you it will work without them, and even if they have done it.
Enclosing those strings in double quotes is the correct thing to do, and not just for the data we are discussing here.
Without getting into it too much beyond that: if you have numbers that you intend to use as strings, put them in double quotes. In JSON and YAML there are different data types that can give different results if you aren’t paying attention. That’s not quite what we’re talking about here, but there’s a quick illustration below. : )
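
A tiny illustration of that type difference, with a made-up field:

# The same digits parse as different types depending on the quotes.
import json

print(json.loads('{"year": 2001}')["year"] + 1)       # 2002  -> parsed as an int
print(json.loads('{"year": "2001"}')["year"] + "!")   # 2001! -> parsed as a str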
