rkarimi
October 25, 2022, 12:05am
Hi @lvwerra
I am running the CodeParrot training script provided by Hugging Face, using this command:
accelerate launch scripts/codeparrot_training.py \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train ./data/codeparrot-clean-train \
--dataset_name_valid ./data/codeparrot-clean-valid \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 2000 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 150000 \
--save_checkpoint_steps 15000
The code freezes in the multi-GPU setting. I see similar reports in the datasets library here: datasets freezes with streaming mode in multiple-gpu · Issue #5123 · huggingface/datasets · GitHub
Is any specific setting needed to run this script? Have you encountered this freezing with this script?
Thanks for any help on this.
Hi @rkarimi
Could this be related to this issue:
opened 06:50PM - 11 Feb 22 UTC · bug
## Describe the bug
Loading a JSON dataset with `load_dataset` can get stuck when running on a machine with many CPUs. This is especially an issue when loading a large dataset on a large machine.
## Steps to reproduce the bug
I originally created the following script to reproduce the issue:
```python
from datasets import load_dataset
from multiprocessing import Process
from tqdm import tqdm
import datasets
from transformers import set_seed

def run_tasks_in_parallel(tasks, ds_list):
    for _ in tqdm(range(1000)):
        print('new batch')
        running_tasks = [Process(target=task, args=(ds, i)) for i, (task, ds) in enumerate(zip(tasks, ds_list))]
        for running_task in running_tasks:
            running_task.start()
        for running_task in running_tasks:
            running_task.join()

def get_dataset():
    dataset_name = 'transformersbook/codeparrot'
    ds = load_dataset(dataset_name+'-train', split="train", streaming=True)
    ds = ds.shuffle(buffer_size=1000, seed=1)
    return iter(ds)

def get_next_element(ds, process_id, N=10000):
    for _ in range(N):
        _ = next(ds)['content']
    print(f'process {process_id} done')
    return

set_seed(1)
datasets.utils.logging.set_verbosity_debug()

n_processes = 8
tasks = [get_next_element for _ in range(n_processes)]
args = [get_dataset() for _ in range(n_processes)]

run_tasks_in_parallel(tasks, args)
```
Today I noticed that it can happen when running it on a single process on a machine with many cores without streaming. So just `load_dataset("transformersbook/codeparrot-train")` alone might cause the issue after waiting long enough or trying many times. It's a slightly random process which makes it especially hard to track down. When I encountered it today it had already processed 17GB of data (the size of the cache folder when it got stuck) before getting stuck.
Here's my current understanding of the error. As far as I can tell it happens in the following block: https://github.com/huggingface/datasets/blob/be701e9e89ab38022612c7263edc015bc7feaff9/src/datasets/packaged_modules/json/json.py#L119-L139
When the try on line 121 fails and the `block_size` is increased it can happen that it can't read the JSON again and gets stuck indefinitely. A hint that points in that direction is that increasing the `chunksize` argument decreases the chance of getting stuck and vice versa. Maybe it is an issue with a lock on the file that is not properly released.
## Expected results
Read a JSON before the end of the universe.
## Actual results
Read a JSON not before the end of the universe.
## Environment info
- `datasets` version: 1.18.3
- Platform: Linux-4.19.0-18-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.10
- PyArrow version: 7.0.0
@lhoestq we discussed this a while ago. @albertvillanova we discussed this today :)
Can you try to increase the `chunk_size`?
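For reference, here is a minimal sketch of what passing a larger chunk size could look like. It assumes the data is read through the packaged JSON builder (which exposes this as the `chunksize` parameter, in bytes); the file pattern below is a placeholder for your local files:

```python
from datasets import load_dataset

# Sketch: raise the JSON reader's chunk size from the 10 MB default to 100 MB.
# Per the issue above, a larger chunk size reduces the chance of the
# block_size-doubling retry loop getting stuck on a long record.
ds = load_dataset(
    "json",
    data_files="data/codeparrot-clean-train/*.json.gz",  # placeholder path
    split="train",
    chunksize=100 << 20,  # bytes per read chunk
)
```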
cc @loubnabnl
Can you make sure you're using PyTorch 1.11? It seems that `ShuffleIterDataPipe` changed in recent versions, and this shuffling makes the dataloader get stuck.
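As a quick sanity check of which PyTorch version the training environment actually picks up (not specific to this script):

```python
import torch

# The shuffling hang was reported to depend on changes to
# ShuffleIterDataPipe across PyTorch versions.
print(torch.__version__)
```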
And does it work for you in non-streaming mode? I think the issue should persist even without streaming.
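For comparison, a minimal sketch of the two loading modes (the dataset name and the `content` column follow the CodeParrot setup used above; adjust to your local paths if needed):

```python
from datasets import load_dataset

# Streaming mode: iterates over the raw files without building an Arrow cache.
streamed = load_dataset("codeparrot/codeparrot-clean-train", split="train", streaming=True)
print(next(iter(streamed))["content"][:80])

# Non-streaming mode: downloads and preprocesses the whole dataset first.
# If the hang comes from the JSON reader rather than the dataloader shuffle,
# it should also show up here.
regular = load_dataset("codeparrot/codeparrot-clean-train", split="train")
print(regular[0]["content"][:80])
```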