### Describe the bug
I am trying to run some trainings on Google's TPUs using Hugging Face's DataLoader on [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B) and [c4](https://huggingface.co/datasets/allenai/c4), but I run into a `429 Client Error: Too Many Requests for url` error when I call `load_dataset`. The odd part is that I am able to successfully run trainings with the [wikitext dataset](https://huggingface.co/datasets/Salesforce/wikitext). Is there something I need to set up specifically to train with SlimPajama or C4 on TPUs? I am not clear on why I am getting these errors.
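For reference, the call that fails in `input_loader.py` boils down to roughly the following (a simplified sketch; the exact arguments my loader passes differ, and the split value here is only illustrative):

```python
from datasets import load_dataset

# Simplified stand-in for the call at input_loader.py:395. The split is
# illustrative; per the traceback, the 429 is raised while `datasets`
# resolves the repo's file patterns via the Hub API, before any data
# is actually downloaded.
dataset = load_dataset("cerebras/SlimPajama-627B", split="train")
# The same error shows up with "allenai/c4" (using one of its configs, e.g. "en").
```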
### Steps to reproduce the bug
These are the commands you can run to reproduce the error below. Note that you will need a ClearML account (you can create one [here](https://app.clear.ml/login?redirect=%2Fdashboard)) with a queue set up to run on Google TPUs:
```bash
git clone https://github.com/clankur/muGPT.git
cd muGPT
python -m train --config-name=slim_v4-32_84m.yaml +training.queue={NAME_OF_CLEARML_QUEUE}
```
The error I see:
```
Traceback (most recent call last):
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 1037, in main
main_contained(config, logger)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 840, in main_contained
loader = get_loader("train", config.training_data, config.training.tokens)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 549, in get_loader
return HuggingFaceDataLoader(split, config, token_batch_params)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 395, in __init__
self.dataset = load_dataset(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 2112, in load_dataset
builder_instance = load_dataset_builder(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1798, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1495, in dataset_module_factory
raise e1 from None
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1479, in dataset_module_factory
).get_module()
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1034, in get_module
else get_data_patterns(base_path, download_config=self.download_config)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 457, in get_data_patterns
return _get_data_files_patterns(resolver)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 248, in _get_data_files_patterns
data_files = pattern_resolver(pattern)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 340, in resolve_pattern
for filepath, info in fs.glob(pattern, detail=True).items()
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 409, in glob
return super().glob(path, **kwargs)
File "/home/clankur/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/fsspec/spec.py", line 602, in glob
allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 429, in find
out = self._ls_tree(path, recursive=True, refresh=refresh, revision=resolved_path.revision, **kwargs)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 358, in _ls_tree
self._ls_tree(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 375, in _ls_tree
for path_info in tree:
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3080, in list_repo_tree
for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_pagination.py", line 46, in paginate
hf_raise_for_status(r)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/cerebras/SlimPajama-627B/tree/2d0accdd58c5d5511943ca1f5ff0e3eb5e293543?recursive=True&expand=True&cursor=ZXlKbWFXeGxYMjVoYldVaU9pSjBaWE4wTDJOb2RXNXJNUzlsZUdGdGNHeGxYMmh2YkdSdmRYUmZPVFEzTG1wemIyNXNMbnB6ZENKOTo2MjUw (Request ID: Root=1-67673de9-1413900606ede7712b08ef2c;1304c09c-3e69-4222-be14-f10ee709d49c)
maximum queue size reached
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
### Expected behavior
I'd expect the DataLoader to load the SlimPajama-627B and c4 datasets without issue.
### Environment info
- `datasets` version: 2.14.4
- Platform: Linux-5.8.0-1035-gcp-x86_64-with-glibc2.31
- Python version: 3.10.16
- Huggingface_hub version: 0.26.5
- PyArrow version: 18.1.0
- Pandas version: 2.2.3