Hello,
I have been working on recreating the OPT pre-training corpus to upload to the Huggingface dataset hub. I have been processing the Pile dataset into chunks and filtering the subsets to only include the specified sections such as CommonCrawl, DM Mathematics, etc. in the paper. I am working with a hardware constraint of 32 GB of RAM. I save the subsets to disk. The filtered subset of Pile Common Crawl is about 220 GB of data. Below is a minimum reproducible script:
# Split the dataset into chunks that can fit and be processed all in 32 GB of RAM
chunks = [
'train[0:20000000]',
'train[20000001:40000000]',
'train[40000001:60000000]',
'train[60000001:80000000]',
'train[80000001:100000000]',
'train[100000001:120000000]',
'train[120000001:140000000]',
'train[140000001:160000000]',
'train[160000001:180000000]',
'train[180000001:200000000]',
'train[200000001:]'
]
# Loop through chunks and process data by specified filtered subset of the Pile
for chunk_number, chunk in enumerate(chunks):
#Load the pile dataset chunk
pile_dataset = load_dataset('the_pile', 'all', split = chunk)
#Filter the pile dataset by meta-data to get the specified subset
filtered_pile = pile_dataset.filter(lambda x: x['meta']['pile_set_name'] == 'Pile-CC', num_proc=8)
# Directory name should be the same as chunk number. If chunk_number is 0 directory name should also be 0. Save the file chunks to disk.
filtered_pile.save_to_disk('../datasets/pile_cc/{}'.format(chunk_number))
I then concatenate all of the chunks together to get the full Pile-CC subset and push the dataset to the hub.
#Loading all of the chunks
pile_cc_chunk_0 = load_from_disk('../datasets/pile_cc/0')
pile_cc_chunk_1 = load_from_disk('../datasets/pile_cc/1')
pile_cc_chunk_2 = load_from_disk('../datasets/pile_cc/2')
pile_cc_chunk_3 = load_from_disk('../datasets/pile_cc/3')
pile_cc_chunk_4 = load_from_disk('../datasets/pile_cc/4')
pile_cc_chunk_5 = load_from_disk('../datasets/pile_cc/5')
pile_cc_chunk_6 = load_from_disk('../datasets/pile_cc/6')
pile_cc_chunk_7 = load_from_disk('../datasets/pile_cc/7')
pile_cc_chunk_8 = load_from_disk('../datasets/pile_cc/8')
pile_cc_chunk_9 = load_from_disk('../datasets/pile_cc/9')
pile_cc_chunk_10 = load_from_disk('../datasets/pile_cc/10')
#Concatenate the chunks together to get the full Pile-CC subset
pile_cc = concatenate_datasets([pile_cc_chunk_0, pile_cc_chunk_1, pile_cc_chunk_2, pile_cc_chunk_3, pile_cc_chunk_4, pile_cc_chunk_5, pile_cc_chunk_6, pile_cc_chunk_7, pile_cc_chunk_8, pile_cc_chunk_9, pile_cc_chunk_10])
#Push the full subset to the Huggingface hub
pile_cc.push_to_hub('conceptofmind/pile-cc')
After about 2 hours, a third of the data (around 34%) was successfully uploaded before I received this error message:
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/conceptofmind/pile-cc/upload/main/data/train-00157-of-00445.parquet
I then reran the script but the files started uploading from start again. It would take about 6 hours in total to fully upload this Pile-CC subset.
This error appeared during the second attempt:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host=‘huggingface.co’, port=443): Max retries exceeded with url: /api/whoami-v2 (Caused by NewConnectionError(’<urllib3.connection.HTTPSConnection object at 0x7f54241c0520>: Failed to establish a new connection: [Errno -2] Name or service not known’))
Does anyone have any idea why these errors occurred? Should I be uploading the data to the hub in a different way?
I may try saving the full Pile-CC subset to disk as well and manually drag uploading the file to the hub.
Any advice would be greatly appreciated.
Thank you,
Enrico
Sources: https://arxiv.org/pdf/2205.01068.pdf