Push to Hub - HTTPSConnectionPool(host='huggingface.co', port=443)

Hello,

I have been working on recreating the OPT pre-training corpus to upload to the Hugging Face dataset hub. I have been processing the Pile dataset in chunks and filtering it down to only the subsets specified in the paper, such as CommonCrawl, DM Mathematics, etc. Since I am working under a hardware constraint of 32 GB of RAM, I save each filtered chunk to disk. The filtered Pile-CC (Common Crawl) subset is about 220 GB of data. Below is a minimal reproducible script:

from datasets import load_dataset

# Split the dataset into chunks that can each be processed within 32 GB of RAM
chunks = [
    'train[0:20000000]',
    'train[20000000:40000000]',
    'train[40000000:60000000]',
    'train[60000000:80000000]',
    'train[80000000:100000000]',
    'train[100000000:120000000]',
    'train[120000000:140000000]',
    'train[140000000:160000000]',
    'train[160000000:180000000]',
    'train[180000000:200000000]',
    'train[200000000:]'
]

# Loop through the chunks and filter each one down to the specified Pile subset
for chunk_number, chunk in enumerate(chunks):
    # Load the Pile dataset chunk
    pile_dataset = load_dataset('the_pile', 'all', split=chunk)
    # Filter the chunk by metadata to keep only the Pile-CC subset
    filtered_pile = pile_dataset.filter(lambda x: x['meta']['pile_set_name'] == 'Pile-CC', num_proc=8)
    # Save each filtered chunk to a directory named after its chunk number
    filtered_pile.save_to_disk('../datasets/pile_cc/{}'.format(chunk_number))

I then concatenate all of the chunks together to get the full Pile-CC subset and push the dataset to the hub.

from datasets import concatenate_datasets, load_from_disk

# Load all of the saved chunks from disk
pile_cc_chunks = [load_from_disk('../datasets/pile_cc/{}'.format(i)) for i in range(11)]

# Concatenate the chunks together to get the full Pile-CC subset
pile_cc = concatenate_datasets(pile_cc_chunks)

# Push the full subset to the Hugging Face Hub
pile_cc.push_to_hub('conceptofmind/pile-cc')

After about 2 hours, a third of the data (around 34%) was successfully uploaded before I received this error message:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/conceptofmind/pile-cc/upload/main/data/train-00157-of-00445.parquet

I then reran the script, but the files started uploading from the start again. It would take about 6 hours in total to fully upload this Pile-CC subset.

This error appeared during the second attempt:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/whoami-v2 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f54241c0520>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Does anyone have any idea why these errors occurred? Should I be uploading the data to the hub in a different way?

I may also try saving the full Pile-CC subset to disk and manually uploading the files to the hub by drag and drop.
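Roughly, that fallback would look something like this (the output paths are placeholders, and the parquet output would likely have to be split into smaller shards before a manual upload):

# Save the concatenated Pile-CC subset to disk so progress is not lost
pile_cc.save_to_disk('../datasets/pile_cc_full')

# Export a parquet copy that could be uploaded manually through the web interface
pile_cc.to_parquet('../datasets/pile_cc_full.parquet')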

Any advice would be greatly appreciated.

Thank you,

Enrico

Sources: https://arxiv.org/pdf/2205.01068.pdf

I tried to upload the dataset to the hub without chunking or saving to disk:

from datasets import load_dataset
 
pile_dataset = load_dataset('the_pile', 'all', split='train')
filtered_pile = pile_dataset.filter(lambda x: x['meta']['pile_set_name'] == 'Pile-CC', num_proc=8)
filtered_pile.push_to_hub('conceptofmind/pile_cc')

The dataset was uploaded about 64% of the way before I received the same error as before:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/conceptofmind/pile_cc/upload/main/data/train-00400-of-00623.parquet

Uploading always seems to start over from zero instead of resuming from the point of failure, so all progress is lost.

Hi, as a workaround you can try manually chunking the dataset and saving the parquet files in a clone of the destination dataset repository, then pushing them to the hub using git add/commit inside that folder. You can find the code at the end of this file: github_preprocessing.py · codeparrot/github-code at main
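Roughly, the idea is something like this (a sketch only; the local paths, repo name, and shard count are placeholders to adapt to your setup):

import os
from datasets import load_from_disk

# Assumes the destination repo has already been cloned locally, e.g.:
#   git lfs install
#   git clone https://huggingface.co/datasets/conceptofmind/pile-cc
repo_dir = 'pile-cc'                      # local clone of the dataset repo (placeholder)
data_dir = os.path.join(repo_dir, 'data')
os.makedirs(data_dir, exist_ok=True)

# Load the concatenated subset and write it out as parquet shards inside the clone
pile_cc = load_from_disk('../datasets/pile_cc_full')   # placeholder path
num_shards = 445                          # placeholder shard count
for index in range(num_shards):
    shard = pile_cc.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(os.path.join(data_dir, 'train-{:05d}-of-{:05d}.parquet'.format(index, num_shards)))

# Then, from inside the clone:
#   git add data/*.parquet
#   git commit -m "Add Pile-CC parquet shards"
#   git push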

Hi! We’ve recently added support for resuming an upload in push_to_hub to address this issue, so now you can just rerun the push_to_hub line after an error to resume the upload.
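For example, something like this (just a sketch with minimal error handling) should pick up where the previous attempt left off:

import time
from datasets import load_from_disk

pile_cc = load_from_disk('../datasets/pile_cc_full')  # placeholder path

# Keep retrying push_to_hub; shards that were already uploaded are skipped,
# so each retry resumes the upload instead of starting over.
while True:
    try:
        pile_cc.push_to_hub('conceptofmind/pile-cc')
        break
    except Exception as error:
        print('Upload interrupted ({}), retrying in 30 seconds...'.format(error))
        time.sleep(30)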


Hi @loubnabnl ,

I will take a look.

Thank you,

Enrico

Hi @mariosasko ,

I appreciate the help.

I will try uploading again now.

Thank you,

Enrico