### Describe the bug
I am trying to run some trainings on Google's TPUs using Hugging Face's DataLoader on [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B) and [c4](https://huggingface.co/datasets/allenai/c4), but I run into a `429 Client Error: Too Many Requests for url` error when I call `load_dataset`. The odd part is that I am able to successfully run trainings with the [wikitext dataset](https://huggingface.co/datasets/Salesforce/wikitext). Is there something I need to set up specifically to train with SlimPajama or C4 on TPUs? I am not clear on why I am getting these errors.
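For reference, the call that fails in `input_loader.py` boils down to roughly the following (a simplified sketch; the exact arguments my loader passes differ, and the split value here is only illustrative):

```python
from datasets import load_dataset

# Simplified stand-in for the call at input_loader.py:395. The split is
# illustrative; per the traceback, the 429 is raised while `datasets`
# resolves the repo's file patterns via the Hub API, before any data
# is actually downloaded.
dataset = load_dataset("cerebras/SlimPajama-627B", split="train")
# The same error shows up with "allenai/c4" (using one of its configs, e.g. "en").
```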
### Steps to reproduce the bug
These are the commands you can run to reproduce the error below. Note that you will need a ClearML account (you can create one [here](https://app.clear.ml/login?redirect=%2Fdashboard)) with a queue set up to run on Google TPUs:
```bash
git clone https://github.com/clankur/muGPT.git
cd muGPT
python -m train --config-name=slim_v4-32_84m.yaml +training.queue={NAME_OF_CLEARML_QUEUE}
```
The error I see:
```
Traceback (most recent call last):
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 1037, in main
main_contained(config, logger)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 840, in main_contained
loader = get_loader("train", config.training_data, config.training.tokens)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 549, in get_loader
return HuggingFaceDataLoader(split, config, token_batch_params)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 395, in __init__
self.dataset = load_dataset(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 2112, in load_dataset
builder_instance = load_dataset_builder(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1798, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1495, in dataset_module_factory
raise e1 from None
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1479, in dataset_module_factory
).get_module()
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1034, in get_module
else get_data_patterns(base_path, download_config=self.download_config)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 457, in get_data_patterns
return _get_data_files_patterns(resolver)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 248, in _get_data_files_patterns
data_files = pattern_resolver(pattern)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 340, in resolve_pattern
for filepath, info in fs.glob(pattern, detail=True).items()
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 409, in glob
return super().glob(path, **kwargs)
File "/home/clankur/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/fsspec/spec.py", line 602, in glob
allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 429, in find
out = self._ls_tree(path, recursive=True, refresh=refresh, revision=resolved_path.revision, **kwargs)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 358, in _ls_tree
self._ls_tree(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 375, in _ls_tree
for path_info in tree:
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3080, in list_repo_tree
for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_pagination.py", line 46, in paginate
hf_raise_for_status(r)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/cerebras/SlimPajama-627B/tree/2d0accdd58c5d5511943ca1f5ff0e3eb5e293543?recursive=True&expand=True&cursor=ZXlKbWFXeGxYMjVoYldVaU9pSjBaWE4wTDJOb2RXNXJNUzlsZUdGdGNHeGxYMmh2YkdSdmRYUmZPVFEzTG1wemIyNXNMbnB6ZENKOTo2MjUw (Request ID: Root=1-67673de9-1413900606ede7712b08ef2c;1304c09c-3e69-4222-be14-f10ee709d49c)
maximum queue size reached
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
### Expected behavior
I'd expect the DataLoader to load the SlimPajama-627B and c4 datasets without issue.
### Environment info
- `datasets` version: 2.14.4
- Platform: Linux-5.8.0-1035-gcp-x86_64-with-glibc2.31
- Python version: 3.10.16
- Huggingface_hub version: 0.26.5
- PyArrow version: 18.1.0
- Pandas version: 2.2.3