rkarimi
October 25, 2022, 12:05am
Hi @lvwerra
I am running the CodeParrot training script provided by Hugging Face, using this command:
accelerate launch scripts/codeparrot_training.py \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train ./data/codeparrot-clean-train \
--dataset_name_valid ./data/codeparrot-clean-valid \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 2000 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 150000 \
--save_checkpoint_steps 15000
The code freezes in the multi-GPU setting. I see similar reports in the datasets library here: datasets freezes with streaming mode in multiple-gpu · Issue #5123 · huggingface/datasets · GitHub
Is any specific setting needed to run this script? Have you encountered this freezing with this script?
Thanks for any help on this.
Hi @rkarimi
Could this be related to this issue:
opened 06:50PM - 11 Feb 22 UTC · bug
## Describe the bug
Loading a JSON dataset with `load_dataset` can get stuck when running on a machine with many CPUs. This is especially an issue when loading a large dataset on a large machine.
## Steps to reproduce the bug
I originally created the following script to reproduce the issue:
```python
from datasets import load_dataset
from multiprocessing import Process
from tqdm import tqdm
import datasets
from transformers import set_seed

def run_tasks_in_parallel(tasks, ds_list):
    for _ in tqdm(range(1000)):
        print('new batch')
        running_tasks = [Process(target=task, args=(ds, i)) for i, (task, ds) in enumerate(zip(tasks, ds_list))]
        for running_task in running_tasks:
            running_task.start()
        for running_task in running_tasks:
            running_task.join()

def get_dataset():
    dataset_name = 'transformersbook/codeparrot'
    ds = load_dataset(dataset_name+'-train', split="train", streaming=True)
    ds = ds.shuffle(buffer_size=1000, seed=1)
    return iter(ds)

def get_next_element(ds, process_id, N=10000):
    for _ in range(N):
        _ = next(ds)['content']
    print(f'process {process_id} done')
    return

set_seed(1)
datasets.utils.logging.set_verbosity_debug()

n_processes = 8
tasks = [get_next_element for _ in range(n_processes)]
args = [get_dataset() for _ in range(n_processes)]

run_tasks_in_parallel(tasks, args)
```
Today I noticed that it can happen when running it on a single process on a machine with many cores without streaming. So just `load_dataset("transformersbook/codeparrot-train")` alone might cause the issue after waiting long enough or trying many times. It's a slightly random process which makes it especially hard to track down. When I encountered it today it had already processed 17GB of data (the size of the cache folder when it got stuck) before getting stuck.
Here's my current understanding of the error. As far as I can tell it happens in the following block: https://github.com/huggingface/datasets/blob/be701e9e89ab38022612c7263edc015bc7feaff9/src/datasets/packaged_modules/json/json.py#L119-L139
When the try on line 121 fails and the `block_size` is increased it can happen that it can't read the JSON again and gets stuck indefinitely. A hint that points in that direction is that increasing the `chunksize` argument decreases the chance of getting stuck and vice versa. Maybe it is an issue with a lock on the file that is not properly released.
## Expected results
Read a JSON before the end of the universe.
## Actual results
Read a JSON not before the end of the universe.
## Environment info
- `datasets` version: 1.18.3
- Platform: Linux-4.19.0-18-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.10
- PyArrow version: 7.0.0
@lhoestq we discussed this a while ago. @albertvillanova we discussed this today :)
Can you try to increase the `chunk_size`?
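For reference, here is a minimal sketch of what passing a larger chunk size could look like. It assumes the data is read through the packaged JSON builder (which exposes this as the `chunksize` parameter, in bytes); the file pattern below is a placeholder for your local files:

```python
from datasets import load_dataset

# Sketch: raise the JSON reader's chunk size from the 10 MB default to 100 MB.
# Per the issue above, a larger chunk size reduces the chance of the
# block_size-doubling retry loop getting stuck on a long record.
ds = load_dataset(
    "json",
    data_files="data/codeparrot-clean-train/*.json.gz",  # placeholder path
    split="train",
    chunksize=100 << 20,  # bytes per read chunk
)
```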
cc @loubnabnl
Can you make sure you're using PyTorch 1.11? It seems that `ShuffleIterDataPipe` changed in recent versions, and this shuffling makes the dataloader get stuck.
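As a quick sanity check of which PyTorch version the training environment actually picks up (not specific to this script):

```python
import torch

# The shuffling hang was reported to depend on changes to
# ShuffleIterDataPipe across PyTorch versions.
print(torch.__version__)
```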
And does it work for you in non-streaming mode? I think the issue should persist even without streaming.
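For comparison, a minimal sketch of the two loading modes (the dataset name and the `content` column follow the CodeParrot setup used above; adjust to your local paths if needed):

```python
from datasets import load_dataset

# Streaming mode: iterates over the raw files without building an Arrow cache.
streamed = load_dataset("codeparrot/codeparrot-clean-train", split="train", streaming=True)
print(next(iter(streamed))["content"][:80])

# Non-streaming mode: downloads and preprocesses the whole dataset first.
# If the hang comes from the JSON reader rather than the dataloader shuffle,
# it should also show up here.
regular = load_dataset("codeparrot/codeparrot-clean-train", split="train")
print(regular[0]["content"][:80])
```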