Datasets 'ChunkedEncodingError: Connection broken (IncompleteRead)'

Hey everyone o/
I’m still trying to get decent results training SpeechT5 on Japanese and wanted to try switching to the ReazonSpeech (medium) dataset.
Sadly, I’m running into some strange behaviour when downloading this particular dataset. Once started, the download picks up speed, but after the first couple of MB (exactly when is inconsistent) it slows down to double-digit kB/s before crashing with ‘ChunkedEncodingError: ('Connection broken: IncompleteRead(…)')’.
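
For reference, the call that triggers the error boils down to this (line 33 of my tts_fine-tune.py, as shown in the traceback below):

from datasets import load_dataset

# Loading the 'medium' configuration of the ReazonSpeech corpus
dataset_train = load_dataset("reazon-research/reazonspeech", "medium", split="train")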

Here is my output:

E:\Programming\python\projects\SpeechT5-jp\venv\Scripts\python.exe E:\Programming\python\projects\SpeechT5-jp\tts_fine-tune.py 

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\Maximilian\.cache\huggingface\token
Login successful
Found cached dataset common_voice_13_0 (C:/Users/Maximilian/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0/ja/13.0.0/2506e9a8950f5807ceae08c2920e814222909fd7f477b74f5d225802e9f04055)
Downloading and preparing dataset reazonspeech/medium to C:/Users/Maximilian/.cache/huggingface/datasets/reazon-research___reazonspeech/medium/1.0.0/00f9d8f336dd718ea4c26dba7be9a2ce3795b9d92903c626baa912de3021ba2d...
Downloading data files:   0%|          | 0/64 [00:00<?, ?it/s]
Downloading data:   0%|          | 0.00/328M [00:00<?, ?B/s]
Downloading data:   0%|          | 4.10k/328M [00:00<5:49:20, 15.7kB/s]
Downloading data:   0%|          | 42.0k/328M [00:00<59:53, 91.4kB/s]  
Downloading data:   0%|          | 91.1k/328M [00:00<40:23, 135kB/s] 
Downloading data:   0%|          | 206k/328M [00:01<21:27, 255kB/s] 
Downloading data:   0%|          | 435k/328M [00:01<11:25, 479kB/s]
Downloading data:   0%|          | 697k/328M [00:01<08:19, 656kB/s]
Downloading data:   0%|          | 1.22M/328M [00:01<04:58, 1.10MB/s]
Downloading data:   1%|          | 2.29M/328M [00:02<02:40, 2.04MB/s]
Downloading data:   1%|          | 3.58M/328M [00:02<01:29, 3.65MB/s]
Downloading data:   1%|▏         | 4.12M/328M [00:13<28:47, 188kB/s] 
Downloading data:   1%|▏         | 4.14M/328M [00:15<32:56, 164kB/s]
Downloading data:   1%|▏         | 4.28M/328M [00:27<32:55, 164kB/s]
Downloading data:   1%|▏         | 4.29M/328M [00:28<1:25:53, 62.9kB/s]
Downloading data:   1%|▏         | 4.30M/328M [00:29<1:27:27, 61.8kB/s]
Downloading data:   1%|▏         | 4.56M/328M [00:35<1:39:41, 54.1kB/s]
Downloading data:   2%|▏         | 6.07M/328M [00:36<31:08, 173kB/s]   
Downloading data:   2%|▏         | 6.20M/328M [00:47<31:07, 173kB/s]
Downloading data:   2%|▏         | 6.20M/328M [00:48<1:11:01, 75.6kB/s]
Downloading data:   2%|▏         | 6.22M/328M [00:50<1:17:19, 69.4kB/s]
Downloading data:   2%|▏         | 6.35M/328M [01:00<2:03:35, 43.4kB/s]
Downloading data:   2%|▏         | 6.44M/328M [01:11<3:07:18, 28.6kB/s]
Downloading data:   2%|▏         | 6.45M/328M [01:13<3:28:36, 25.7kB/s]
Downloading data:   2%|▏         | 6.51M/328M [01:17<3:42:53, 24.1kB/s]
Downloading data:   2%|▏         | 6.56M/328M [01:23<4:45:41, 18.8kB/s]
Downloading data:   2%|▏         | 6.59M/328M [01:26<5:02:33, 17.7kB/s]
Downloading data:   2%|▏         | 6.62M/328M [01:29<5:47:07, 15.4kB/s]
Downloading data:   2%|▏         | 6.63M/328M [01:30<5:46:45, 15.5kB/s]
Downloading data:   2%|▏         | 6.64M/328M [01:31<6:05:06, 14.7kB/s]
Downloading data:   2%|▏         | 6.66M/328M [01:32<5:48:51, 15.4kB/s]
Downloading data:   2%|▏         | 6.68M/328M [01:35<7:34:16, 11.8kB/s]
Downloading data:   2%|▏         | 6.69M/328M [01:36<6:54:30, 12.9kB/s]
Downloading data:   2%|▏         | 6.71M/328M [01:38<8:14:41, 10.8kB/s]
Downloading data:   2%|▏         | 6.73M/328M [01:41<9:42:48, 9.20kB/s]
Downloading data:   2%|▏         | 6.74M/328M [01:41<7:34:59, 11.8kB/s]
Downloading data:   2%|▏         | 6.76M/328M [01:41<6:18:47, 14.2kB/s]
Downloading data:   2%|▏         | 6.78M/328M [01:44<8:31:51, 10.5kB/s]
Downloading data:   2%|▏         | 6.79M/328M [01:46<9:45:47, 9.15kB/s]
Downloading data:   2%|▏         | 6.81M/328M [01:47<8:11:04, 10.9kB/s]
Downloading data:   2%|▏         | 6.82M/328M [01:50<10:22:57, 8.60kB/s]
Downloading data:   2%|▏         | 6.84M/328M [01:51<9:25:45, 9.47kB/s] 
Downloading data:   2%|▏         | 6.86M/328M [01:52<7:29:03, 11.9kB/s]
Downloading data:   2%|▏         | 6.87M/328M [01:53<6:57:28, 12.8kB/s]
Downloading data:   2%|▏         | 6.89M/328M [01:56<9:34:03, 9.33kB/s]
Downloading data:   2%|▏         | 6.91M/328M [01:57<8:24:59, 10.6kB/s]
Downloading data:   2%|▏         | 6.92M/328M [01:58<7:10:51, 12.4kB/s]
Downloading data:   2%|▏         | 6.94M/328M [01:59<7:35:52, 11.8kB/s]
Downloading data:   2%|▏         | 6.96M/328M [02:02<10:01:34, 8.90kB/s]
Downloading data:   2%|▏         | 6.97M/328M [02:03<8:18:23, 10.7kB/s] 
Downloading data:   2%|▏         | 6.99M/328M [02:05<9:40:12, 9.23kB/s]
Downloading data:   2%|▏         | 7.01M/328M [02:07<10:11:46, 8.75kB/s]
Downloading data:   2%|▏         | 7.02M/328M [02:08<8:25:23, 10.6kB/s] 
Downloading data:   2%|▏         | 7.04M/328M [02:09<6:45:15, 13.2kB/s]
Downloading data:   2%|▏         | 7.05M/328M [02:11<9:00:33, 9.91kB/s]
Downloading data:   2%|▏         | 7.07M/328M [02:13<9:18:13, 9.59kB/s]
Downloading data:   2%|▏         | 7.09M/328M [02:14<8:13:34, 10.8kB/s]
Downloading data:   2%|▏         | 7.10M/328M [02:17<10:28:12, 8.52kB/s]
Downloading data:   2%|▏         | 7.12M/328M [02:18<8:36:43, 10.4kB/s] 
Downloading data:   2%|▏         | 7.14M/328M [02:19<7:18:50, 12.2kB/s]
Downloading data:   2%|▏         | 7.15M/328M [02:20<7:15:32, 12.3kB/s]
Downloading data:   2%|▏         | 7.17M/328M [02:23<9:47:37, 9.11kB/s]
Downloading data:   2%|▏         | 7.19M/328M [02:23<7:42:37, 11.6kB/s]
Downloading data:   2%|▏         | 7.20M/328M [02:26<10:06:27, 8.83kB/s]
Downloading data:   2%|▏         | 7.22M/328M [02:29<10:55:36, 8.16kB/s]
Downloading data:   2%|▏         | 7.23M/328M [02:29<8:30:16, 10.5kB/s] 
Downloading data:   2%|▏         | 7.25M/328M [02:30<6:48:33, 13.1kB/s]
Downloading data:   2%|▏         | 7.27M/328M [02:33<9:47:49, 9.10kB/s]
Downloading data:   2%|▏         | 7.28M/328M [02:36<12:27:32, 7.16kB/s]
Downloading data:   2%|▏         | 7.30M/328M [02:40<14:51:08, 6.00kB/s]
Downloading data:   2%|▏         | 7.32M/328M [02:43<16:00:56, 5.57kB/s]
Downloading data:   2%|▏         | 7.33M/328M [02:47<16:49:50, 5.30kB/s]
Downloading data:   2%|▏         | 7.35M/328M [02:51<17:54:42, 4.98kB/s]
Downloading data:   2%|▏         | 7.37M/328M [02:54<18:09:36, 4.91kB/s]
Downloading data:   2%|▏         | 7.38M/328M [02:57<18:19:16, 4.87kB/s]
Downloading data:   2%|▏         | 7.40M/328M [03:01<18:26:22, 4.83kB/s]
Downloading data:   2%|▏         | 7.41M/328M [03:05<19:02:48, 4.68kB/s]
Downloading data:   2%|▏         | 7.43M/328M [03:08<18:57:13, 4.70kB/s]
Downloading data:   2%|▏         | 7.45M/328M [03:12<18:53:02, 4.72kB/s]
Downloading data:   2%|▏         | 7.46M/328M [03:15<19:21:29, 4.60kB/s]
Downloading data:   2%|▏         | 7.48M/328M [03:19<19:09:43, 4.65kB/s]
Downloading data:   2%|▏         | 7.50M/328M [03:22<19:01:21, 4.69kB/s]
Downloading data:   2%|▏         | 7.51M/328M [03:26<19:25:14, 4.59kB/s]
Downloading data:   2%|▏         | 7.53M/328M [03:29<19:11:59, 4.64kB/s]
Downloading data:   2%|▏         | 7.55M/328M [03:33<19:03:17, 4.68kB/s]
Downloading data:   2%|▏         | 7.56M/328M [03:37<19:28:06, 4.58kB/s]
Downloading data:   2%|▏         | 7.58M/328M [03:40<19:14:08, 4.63kB/s]
Downloading data:   2%|▏         | 7.60M/328M [03:43<19:04:41, 4.67kB/s]
Downloading data:   2%|▏         | 7.61M/328M [03:47<18:57:47, 4.70kB/s]
Downloading data:   2%|▏         | 7.63M/328M [03:51<19:24:04, 4.59kB/s]
Downloading data:   2%|▏         | 7.64M/328M [03:54<19:11:02, 4.64kB/s]
Downloading data:   2%|▏         | 7.66M/328M [03:57<19:02:39, 4.68kB/s]
Downloading data:   2%|▏         | 7.68M/328M [04:01<19:26:22, 4.58kB/s]
Downloading data:   2%|▏         | 7.69M/328M [04:05<19:13:04, 4.63kB/s]
Downloading data:   2%|▏         | 7.71M/328M [04:08<19:03:07, 4.67kB/s]
Downloading data:   2%|▏         | 7.73M/328M [04:12<19:27:00, 4.58kB/s]
Downloading data:   2%|▏         | 7.74M/328M [04:15<19:13:19, 4.63kB/s]
Downloading data:   2%|▏         | 7.76M/328M [04:19<19:04:06, 4.67kB/s]
Downloading data:   2%|▏         | 7.78M/328M [04:21<17:07:55, 5.20kB/s]
Downloading data:   2%|▏         | 7.79M/328M [04:21<12:25:10, 7.17kB/s]
Downloading data:   2%|▏         | 7.81M/328M [04:22<9:07:10, 9.76kB/s] 
Downloading data:   2%|▏         | 7.86M/328M [04:22<4:15:23, 20.9kB/s]
Downloading data:   2%|▏         | 7.96M/328M [04:22<1:46:42, 50.0kB/s]
Downloading data:   2%|▏         | 8.14M/328M [04:22<45:35, 117kB/s]   
Downloading data:   3%|▎         | 8.51M/328M [04:23<18:24, 290kB/s]
Downloading data:   3%|▎         | 9.25M/328M [04:23<07:45, 686kB/s]
Downloading data:   3%|▎         | 10.5M/328M [04:23<03:16, 1.61MB/s]
Downloading data:   3%|▎         | 11.0M/328M [04:23<02:58, 1.78MB/s]
Downloading data:   4%|▎         | 12.3M/328M [04:23<1:53:19, 46.5kB/s]
Traceback (most recent call last):
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\urllib3\response.py", line 710, in _error_catcher
    yield
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\urllib3\response.py", line 835, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
urllib3.exceptions.IncompleteRead: IncompleteRead(12264344 bytes read, 316091496 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\requests\models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\urllib3\response.py", line 940, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\urllib3\response.py", line 911, in read
    data = self._raw_read(amt)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\urllib3\response.py", line 835, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "C:\Users\Maximilian\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\urllib3\response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(12264344 bytes read, 316091496 more expected)', IncompleteRead(12264344 bytes read, 316091496 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\Programming\python\projects\SpeechT5-jp\tts_fine-tune.py", line 33, in <module>
    dataset_train = load_dataset("reazon-research/reazonspeech", "medium", split="train")
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\load.py", line 1809, in load_dataset
    builder_instance.download_and_prepare(
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\builder.py", line 909, in download_and_prepare
    self._download_and_prepare(
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\builder.py", line 1670, in _download_and_prepare
    super()._download_and_prepare(
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\builder.py", line 982, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "C:\Users\Maximilian\.cache\huggingface\modules\datasets_modules\datasets\reazon-research--reazonspeech\00f9d8f336dd718ea4c26dba7be9a2ce3795b9d92903c626baa912de3021ba2d\reazonspeech.py", line 84, in _split_generators
    archive_paths = dl_manager.download(url)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\download\download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\utils\py_utils.py", line 444, in map_nested
    mapped = [
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\utils\py_utils.py", line 445, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\utils\py_utils.py", line 347, in _single_map_nested
    return function(data_struct)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\download\download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\utils\file_utils.py", line 182, in cached_path
    output_path = get_from_cache(
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\utils\file_utils.py", line 610, in get_from_cache
    http_get(
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\datasets\utils\file_utils.py", line 402, in http_get
    for chunk in response.iter_content(chunk_size=1024):
  File "E:\Programming\python\projects\SpeechT5-jp\venv\lib\site-packages\requests\models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(12264344 bytes read, 316091496 more expected)', IncompleteRead(12264344 bytes read, 316091496 more expected))
Downloading data files:  14%|█▍        | 9/64 [04:26<27:06, 29.57s/it]

Process finished with exit code 1

I would love to hear your thoughts on this. If you need any additional information, feel free to ask and I’ll provide it.


This issue is more related to the requests library (and your connection / the server hosting the files) than to datasets, so I think the only real solution is to retry load_dataset until it succeeds. It may also help to reduce the number of parallel downloads by passing download_config=DownloadConfig(num_proc=<num_proc>) to load_dataset to avoid network congestion; see the sketch below.
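
Here is a minimal sketch of that idea, assuming the same dataset and config as above; load_with_retries, the retry count, the wait time, and num_proc=2 are placeholder names/values of my own, not anything prescribed by datasets:

import time

from datasets import DownloadConfig, load_dataset

def load_with_retries(max_retries=5, wait_seconds=30):
    # Cap parallel downloads to avoid congesting the connection;
    # num_proc=2 is just an example value.
    download_config = DownloadConfig(num_proc=2)
    for attempt in range(1, max_retries + 1):
        try:
            return load_dataset(
                "reazon-research/reazonspeech",
                "medium",
                split="train",
                download_config=download_config,
            )
        except Exception as error:  # e.g. the ChunkedEncodingError above
            print(f"Attempt {attempt}/{max_retries} failed: {error!r}")
            time.sleep(wait_seconds)
    raise RuntimeError("Download still failing after all retries")

dataset_train = load_with_retries()

Files that already finished downloading stay in the Hugging Face cache, so each retry should only have to re-fetch the archives that failed.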
