Processing big LLM datasets (e.g. FineWeb)

Hi, I’m trying to do some simple filtering of the FineWeb dataset. The dataset page provides examples of how to process it using datatrove, but datatrove only supports Slurm as a distributed executor, and I don’t have access to Slurm. I do have access to a Ray cluster, so I wrote a ray.data pipeline that processes the FineWeb parquet files and saves the results to a GCS bucket (without downloading the dataset to the cluster nodes).
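A stripped-down version of the pipeline looks roughly like this (the column list, the filter condition, and the bucket path are placeholders, the real job does a bit more):

```python
import ray
import pyarrow.fs
from huggingface_hub import HfFileSystem

# Wrap the Hub's fsspec filesystem so Ray/pyarrow can read the parquet
# files straight from the repo, without downloading them first.
hf_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(HfFileSystem()))

ds = ray.data.read_parquet(
    "datasets/HuggingFaceFW/fineweb/sample/10BT",
    filesystem=hf_fs,
    columns=["id", "url", "text", "language_score"],
)

# Simple row-level filtering, then write straight into the GCS bucket,
# so nothing is materialised on the cluster nodes' local disks.
ds = ds.filter(lambda row: row["language_score"] > 0.9)
ds.write_parquet("gs://my-bucket/fineweb-filtered/")
```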

It’s working fine for the 10BT sample of the dataset with 16 parallel tasks. However, when I tried to scale up the number of parallel tasks for a bigger version of the dataset, I seemed to be getting rate-limited: after a few minutes of runtime I started getting “Bad request” errors from Ray’s read_parquet.

So my questions are:

  1. Is there any rate limit on access/reads of the FineWeb dataset?
  2. Is there a way to copy/clone the dataset directly to GCS without going through the local file system? (A sketch of what I mean is below.)
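
For question 2, this is roughly the kind of thing I have in mind: streaming each file from the Hub straight into GCS so it never touches local disk (bucket path is a placeholder):

```python
from huggingface_hub import HfFileSystem
import gcsfs

hf = HfFileSystem()
gcs = gcsfs.GCSFileSystem()

# Stream every parquet file of the 10BT sample into the bucket in chunks,
# so nothing is written to the local file system.
for path in hf.ls("datasets/HuggingFaceFW/fineweb/sample/10BT", detail=False):
    dest = "gs://my-bucket/fineweb/sample/10BT/" + path.split("/")[-1]
    with hf.open(path, "rb") as src, gcs.open(dest, "wb") as dst:
        while chunk := src.read(64 * 1024 * 1024):
            dst.write(chunk)
```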

I’ve tried following the docs on cloud storage, but no matter what I do, every code example first downloads the dataset locally (which is “problematic” for a dataset of this size).
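For reference, this is roughly what I ran following those docs (config, bucket and project names are mine and may not be exact):

```python
from datasets import load_dataset

# This works, but load_dataset() first materialises the whole split in the
# local HF cache, which is exactly the step I'm trying to avoid.
# Bucket and project names are placeholders.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")
ds.save_to_disk(
    "gs://my-bucket/fineweb-10BT",
    storage_options={"project": "my-gcp-project"},
)
```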

I would also appreciate any best practices for processing datasets of this size hosted on Hugging Face :slight_smile:
