Processing big LLM datasets (e.g. FineWeb)

Hi, I’m trying to do some simple filtering of the FineWeb dataset. The dataset page provides examples of how to process it using datatrove, but datatrove only supports Slurm as a distributed executor, and I don’t have access to Slurm. I do have access to a Ray cluster, so I wrote a ray.data pipeline that processes the FineWeb parquet files and saves the results to a GCS bucket (without downloading the dataset to the cluster nodes).
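A stripped-down version of the pipeline looks roughly like this (the column list, the filter condition, and the bucket path are placeholders, the real job does a bit more):

```python
import ray
import pyarrow.fs
from huggingface_hub import HfFileSystem

# Wrap the Hub's fsspec filesystem so Ray/pyarrow can read the parquet
# files straight from the repo, without downloading them first.
hf_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(HfFileSystem()))

ds = ray.data.read_parquet(
    "datasets/HuggingFaceFW/fineweb/sample/10BT",
    filesystem=hf_fs,
    columns=["id", "url", "text", "language_score"],
)

# Simple row-level filtering, then write straight into the GCS bucket,
# so nothing is materialised on the cluster nodes' local disks.
ds = ds.filter(lambda row: row["language_score"] > 0.9)
ds.write_parquet("gs://my-bucket/fineweb-filtered/")
```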

It’s working fine for the 10BT sample of the dataset with 16 parallel tasks. However, when I tried to scale up the number of parallel tasks for a bigger version of the dataset, I seemed to be getting rate-limited: after a few minutes of runtime I started getting “Bad request” errors from Ray’s read_parquet.

So my questions are:

  1. Is there any rate limit on access/reads of the FineWeb dataset?
  2. Is there a way to copy/clone the dataset directly to GCS without going through the local file system? (A sketch of what I mean is below.)
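
For question 2, this is roughly the kind of thing I have in mind: streaming each file from the Hub straight into GCS so it never touches local disk (bucket path is a placeholder):

```python
from huggingface_hub import HfFileSystem
import gcsfs

hf = HfFileSystem()
gcs = gcsfs.GCSFileSystem()

# Stream every parquet file of the 10BT sample into the bucket in chunks,
# so nothing is written to the local file system.
for path in hf.ls("datasets/HuggingFaceFW/fineweb/sample/10BT", detail=False):
    dest = "gs://my-bucket/fineweb/sample/10BT/" + path.split("/")[-1]
    with hf.open(path, "rb") as src, gcs.open(dest, "wb") as dst:
        while chunk := src.read(64 * 1024 * 1024):
            dst.write(chunk)
```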

I’ve tried following the docs on cloud storage, but no matter what I do, every code example first downloads the dataset locally (which is “problematic” for a dataset of this size).
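For reference, this is roughly what I ran following those docs (config, bucket and project names are mine and may not be exact):

```python
from datasets import load_dataset

# This works, but load_dataset() first materialises the whole split in the
# local HF cache, which is exactly the step I'm trying to avoid.
# Bucket and project names are placeholders.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")
ds.save_to_disk(
    "gs://my-bucket/fineweb-10BT",
    storage_options={"project": "my-gcp-project"},
)
```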

I would also appreciate any best practices for processing datasets of this size hosted on Hugging Face :slight_smile:
