AutoTrain Advanced Error on Massive Dataset

I am attempting to use AutoTrain Advanced to fine-tune Mistral-7B-Instruct-v0.2 on a dataset of 2.2 million samples in JSONL format. The dataset is 6.5 GB. I am using Nvidia 8xL40S as the space hardware, the most powerful option available. Partway through generating the train split, before training begins, I receive these errors:

...
Generating train split: 355839 examples [09:57, 466.04 examples/s]
Generating train split: 356339 examples [09:57, 638.27 examples/s]
Generating train split: 356831 examples [09:59, 471.38 examples/s]
Generating train split: 357342 examples [09:59, 637.19 examples/s]
[rank2]:[W701 21:58:39.480892597 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank1]:[W701 21:58:39.491554272 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank7]:[W701 21:58:39.584430192 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank6]:[W701 21:58:39.656034372 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank5]:[W701 21:58:39.666387163 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank3]:[W701 21:58:39.780078100 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank4]:[W701 21:58:39.872609585 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Generating train split: 357808 examples [10:01, 465.60 examples/s]
Generating train split: 358336 examples [10:01, 648.42 examples/s]
Generating train split: 358798 examples [10:02, 466.83 examples/s]
Generating train split: 359000 examples [10:02, 525.05 examples/s]
Generating train split: 359420 examples [10:04, 396.80 examples/s]
Generating train split: 360000 examples [10:04, 607.53 examples/s]
...

Shortly after, the space automatically paused. When I attempted to restart the space, I received this error.

I assume that both of these issues are due to the large size of the dataset. I was able to reach the beginning of the training step using only the first 500k samples of the dataset, so the problem appears somewhere between 500k and 2.2 million samples. Is my assumption correct, or could the space have paused for some other reason? Below are the fine-tuning parameters used; they are all at their default values except merge_adapter, which is set to true.

{
  "auto_find_batch_size": "false",
  "chat_template": "none",
  "disable_gradient_checkpointing": "false",
  "distributed_backend": "ddp",
  "eval_strategy": "epoch",
  "merge_adapter": "true",
  "mixed_precision": "fp16",
  "optimizer": "adamw_torch",
  "peft": "true",
  "padding": "right",
  "quantization": "int4",
  "scheduler": "linear",
  "unsloth": "false",
  "use_flash_attention_2": "false",
  "batch_size": "2",
  "block_size": "1024",
  "epochs": "3",
  "gradient_accumulation": "4",
  "lr": "0.00003",
  "logging_steps": "-1",
  "lora_alpha": "32",
  "lora_dropout": "0.05",
  "lora_r": "16",
  "max_grad_norm": "1",
  "model_max_length": "2048",
  "save_total_limit": "1",
  "seed": "42",
  "warmup_ratio": "0.1",
  "weight_decay": "0",
  "target_modules": "all-linear"
}

Are there any steps I could take to still use AutoTrain with a dataset this large? Is it possible to pre-tokenize the dataset, enable streaming, or anything similar? Is there a better way than AutoTrain Advanced to fine-tune a model on a very large dataset (6.5 GB)? Please let me know if I can provide any additional information that would help. Thanks!
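
Not an AutoTrain-specific answer, but for reference, here is a minimal sketch of what streaming with on-the-fly tokenization looks like when the datasets and transformers libraries are used directly. The file name data.jsonl and the "text" column are assumptions; adjust them to the actual dataset.

```python
# Minimal sketch (outside AutoTrain): stream the 6.5 GB JSONL lazily and
# tokenize on the fly instead of materializing the whole train split first.
# "data.jsonl" and the "text" column are placeholders for the real dataset.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# streaming=True returns an IterableDataset; records are read lazily, so there
# is no long "Generating train split" pass over the full 2.2M examples.
stream = load_dataset("json", data_files="data.jsonl", split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = stream.map(tokenize, batched=True, remove_columns=["text"])

# Pull one example to confirm the pipeline runs end to end.
print(next(iter(tokenized)).keys())
```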


It would be easier if streaming (IterableDataset) could be used, but I don’t know how to enable streaming in AutoTrain Advanced…
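
Since AutoTrain does not seem to expose a streaming option, one fallback is to drive the streamed dataset with the plain transformers Trainer instead. The sketch below is hedged: "data.jsonl", the "text" column, and max_steps are placeholders, and the LoRA / 4-bit setup that AutoTrain would normally apply is omitted for brevity. max_steps has to be set explicitly because a streamed dataset has no known length.

```python
# Hedged sketch (outside AutoTrain): training on a streamed IterableDataset with
# the plain transformers Trainer. "data.jsonl" and the "text" column are
# placeholders; the LoRA / 4-bit setup AutoTrain would normally apply is omitted.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Same streamed pipeline as in the previous sketch.
stream = load_dataset("json", data_files="data.jsonl", split="train", streaming=True)
stream = stream.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    max_steps=10_000,   # required: a streamed dataset has no known length
    logging_steps=50,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=stream,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```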

It might be quicker to split the dataset into several smaller datasets in advance. :sweat_smile:
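
For the splitting approach, a quick sketch: shard the JSONL into chunks of, say, 500k lines each, the size that reportedly reached the training step. The file names here are illustrative.

```python
# Quick sketch of splitting the 2.2M-line JSONL into smaller shards up front.
# 500k lines per shard matches the size that reportedly reached the training
# step; file names are illustrative.
import itertools

def split_jsonl(path, lines_per_shard=500_000):
    with open(path, "r", encoding="utf-8") as src:
        for shard in itertools.count():
            chunk = list(itertools.islice(src, lines_per_shard))
            if not chunk:
                break
            with open(f"data_shard_{shard:02d}.jsonl", "w", encoding="utf-8") as dst:
                dst.writelines(chunk)

split_jsonl("data.jsonl")
```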