Drive Stats data set - upload files to dataset repo or link to them?

Hi there - I’m looking at adding the Drive Stats data set to Hugging Face. Drive Stats comprises over 10 years of metrics on hard drives spinning in Backblaze’s data centers; nearly 390 million records covering 340,000 drives, each record containing the date, drive model number, drive serial number, drive capacity, the reported SMART attributes and whether the drive failed on that day. You can read more on the embryonic data set page: backblaze/Drive_Stats · Datasets at Hugging Face

The data is currently hosted in a Backblaze B2 (S3-compatible) bucket, in a number of zip files. Each zip file contains a calendar quarter of data in CSV files, one per day.

The 33 zip files add up to about 21 GB; the 3749 unzipped CSV files about 128 GB.

There are a few options I can think of:

  • Upload a file (what format?) containing links to the zips.
  • Upload the zip files.
  • Upload the CSV files, either ‘flat’ in a single directory or partitioned into subdirectories by year and month.

Is there any ‘best practice’ here?

We usually recommend sharding big datasets into compressed files of less than 1 GB each. Compression keeps downloads fast, and having many shards enables parallel downloads and processing.
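As a rough sanity check against that 1 GB guideline, here is a back-of-the-envelope calculation using only the totals quoted in the question. It assumes gzip would compress about as well as the existing zips (both typically use DEFLATE) and a 30-day month; these are averages, so individual files, especially more recent ones, may well be larger.

```python
# Totals quoted in the question.
ZIP_TOTAL_GB = 21     # 33 quarterly zip files
ZIP_COUNT = 33
CSV_TOTAL_GB = 128    # 3749 daily CSV files, unzipped
CSV_COUNT = 3749

# Average size of one quarterly zip: roughly 0.64 GB.
avg_zip_gb = ZIP_TOTAL_GB / ZIP_COUNT

# Overall compression ratio of the existing zips: roughly 6.1x.
compression_ratio = CSV_TOTAL_GB / ZIP_TOTAL_GB

# Average compressed size of one daily CSV: roughly 5.6 MB (1 GB taken as 1000 MB).
avg_daily_compressed_mb = ZIP_TOTAL_GB * 1000 / CSV_COUNT

# Assumption: a 30-day month. Average compressed month: roughly 170 MB.
avg_month_compressed_mb = avg_daily_compressed_mb * 30

print(f"average quarterly zip:    {avg_zip_gb:.2f} GB")
print(f"compression ratio:        {compression_ratio:.1f}x")
print(f"average compressed day:   {avg_daily_compressed_mb:.1f} MB")
print(f"average compressed month: {avg_month_compressed_mb:.0f} MB")
```

On these averages, both per-day and per-month compressed files would sit comfortably under 1 GB, while a whole quarter per file is closer to the limit.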

In your specific case, if you anticipate that users will want to download subsets of the dataset by year and month, I’d recommend partitioning the data by year and month. You probably want to compress the data in any case (using zip or gzip, for example).
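A minimal sketch of that layout, assuming the daily CSV files are named like `YYYY-MM-DD.csv` and that one gzip file per day under `YYYY/MM/` subdirectories is acceptable. The function names and directory layout here are hypothetical, not something the Hub requires.

```python
import gzip
import shutil
from pathlib import Path


def partition_path(csv_name: str) -> Path:
    """Map 'YYYY-MM-DD.csv' to 'YYYY/MM/YYYY-MM-DD.csv.gz' (assumed naming)."""
    year, month, _day = Path(csv_name).stem.split("-")
    return Path(year) / month / f"{csv_name}.gz"


def compress_into_partitions(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Gzip every CSV in src_dir into year/month subdirectories of dst_dir."""
    written = []
    for csv_file in sorted(src_dir.glob("*.csv")):
        target = dst_dir / partition_path(csv_file.name)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Stream the copy so large files are never loaded fully into memory.
        with open(csv_file, "rb") as fin, gzip.open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written
```

Calling `compress_into_partitions(Path("unzipped"), Path("data"))` would write, for example, `data/2023/01/2023-01-15.csv.gz`; the directory names `unzipped` and `data` are placeholders.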


More details here: Repository limitations and recommendations
