Autotrain ValueError: num_samples should be a positive integer value, but got num_samples=0

I have tried running AutoTrain both on Hugging Face and in Google Colab, but both times I just get this ValueError:
:x: ERROR | 2023-12-22 05:01:02 | autotrain.trainers.common:wrapper:90 - train has failed due to an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/common.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/clm/__main__.py", line 446, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 800, in get_train_dataloader
    dataloader_params["sampler"] = self._get_train_sampler()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 770, in _get_train_sampler
    return RandomSampler(self.train_dataset)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

:x: ERROR | 2023-12-22 05:01:02 | autotrain.trainers.common:wrapper:91 - num_samples should be a positive integer value, but got num_samples=0

I have uploaded the dataset to Hugging Face:
Maxx0/DatasetProfy

I think something is wrong with it, but I don't know what.
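One thing worth ruling out before blaming AutoTrain is that the data file itself is non-empty and actually contains the column you mapped as the text field. Here's a minimal local sanity check, assuming the training data is a CSV with a `text` column (the path and column name are placeholders; adjust them to your dataset):

```python
import csv

def check_training_file(path, text_column="text"):
    """Verify the file has data rows and the expected text column."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # No rows at all -> the trainer will see an empty dataset (num_samples=0)
    assert rows, "no data rows found"
    assert text_column in rows[0], f"missing column {text_column!r}"
    # Blank text cells also contribute nothing to training
    nonempty = [r for r in rows if r[text_column].strip()]
    print(f"{len(nonempty)} usable rows out of {len(rows)}")
    return nonempty
```

If this passes locally but training still fails, the problem is more likely in the column mapping or in preprocessing rather than the file itself.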


I can’t get past this either. Here’s the log:

🚀 INFO   | 2023-12-26 04:11:22 | __main__:process_input_data:41 - loading dataset from disk
🚀 INFO   | 2023-12-26 04:11:22 | __main__:process_input_data:82 - Train data: Dataset({
    features: ['autotrain_text', '__index_level_0__'],
    num_rows: 268
})
🚀 INFO   | 2023-12-26 04:11:22 | __main__:process_input_data:83 - Valid data: None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:12<00:00,  4.19s/it]
🚀 INFO   | 2023-12-26 04:11:36 | __main__:train:271 - Using block size 1024
Running tokenizer on train dataset: 100%|█| 268/268 [00:00<00:00, 41504.76 examp
Grouping texts in chunks of 1024 (num_proc=4): 100%|█| 268/268 [00:01<00:00, 198
🚀 INFO   | 2023-12-26 04:11:41 | __main__:train:333 - creating trainer
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
❌ ERROR  | 2023-12-26 04:11:47 | autotrain.trainers.common:wrapper:90 - train has failed due to an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/common.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/clm/__main__.py", line 469, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 800, in get_train_dataloader
    dataloader_params["sampler"] = self._get_train_sampler()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 770, in _get_train_sampler
    return RandomSampler(self.train_dataset)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/sampler.py", line 141, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

❌ ERROR  | 2023-12-26 04:11:47 | autotrain.trainers.common:wrapper:91 - num_samples should be a positive integer value, but got num_samples=0

This makes no sense, because num_samples should default to the length of the entire dataset, which is 268 in my case:

@property
def num_samples(self) -> int:
    # dataset size might change at runtime
    if self._num_samples is None:
        return len(self.data_source)
    return self._num_samples
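It does default to the dataset length, but that length is measured after preprocessing. A plausible culprit, given the "Grouping texts in chunks of 1024" line in the log, is the standard causal-LM grouping step: all token ids are concatenated and cut into fixed-size blocks, and any remainder shorter than the block size is dropped. If the whole corpus tokenizes to fewer than block_size tokens, every row lands in the dropped remainder. A stripped-down sketch of that logic (my simplification, not the actual AutoTrain code):

```python
def group_into_blocks(token_ids, block_size):
    """Cut a flat token stream into fixed-size blocks, dropping the
    trailing remainder shorter than block_size (as CLM grouping does)."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, total, block_size)]

# 268 short lines might tokenize to well under 1024 tokens in total,
# in which case everything is remainder and zero blocks survive:
short_corpus = list(range(900))  # stand-in for ~900 tokens of text
print(len(group_into_blocks(short_corpus, 1024)))  # -> 0, hence num_samples=0
```

So a dataset that shows num_rows: 268 before tokenization can still be empty by the time the Trainer builds its sampler. Reducing the block size (or adding more text) would be the obvious thing to try under that assumption.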

cc @abhishek

You should read the GitHub issues before posting here that it makes no sense: ❌ ERROR | 2023-12-15 14:04:00 | autotrain.trainers.common:wrapper:80 - num_samples should be a positive integer value, but got num_samples=0 · Issue #397 · huggingface/autotrain-advanced · GitHub

Well, this just doesn't work, plain and simple. On HF there's a "TypeError: object of type 'NoneType' has no len()" exception, and it's unclear why that happens. When running on RunPod with the block size reduced to 4, it still produces a CUDA out-of-memory error. What's the point of this when you can't train a 7B model (about 4 GB with 4-bit quantization) on 268 lines of training data on an H100 with 80 GB of memory? Obviously better documentation is needed from those with the knowledge. Hang on, let me dig through the hundreds of posts…