Crash during training - rate limit

emilzak · December 19, 2023, 7:53am

What am I doing wrong?


==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

> INFO    AUTOTRAIN_USERNAME: emilzak
> INFO    PROJECT_NAME: date_parser_v9-0
> INFO    TASK_ID: 28
> INFO    DATA_PATH: emilzak/autotrain-data-date_parser_v9
> INFO    MODEL: facebook/m2m100_418M
> INFO    OUTPUT_MODEL_REPO: emilzak/date_parser_v9-0
INFO:     Started server process [34]
INFO:     Waiting for application startup.
> INFO    {'data_path': 'emilzak/autotrain-data-date_parser_v9', 'model': 'facebook/m2m100_418M', 'username': 'emilzak', 'seed': 42, 'train_split': 'train', 'valid_split': 'validation', 'project_name': 'date_parser_v9-0', 'token': 'hf_**********************************', 'push_to_hub': True, 'text_column': 'autotrain_text', 'target_column': 'autotrain_label', 'repo_id': 'emilzak/date_parser_v9-0', 'lr': 5e-05, 'epochs': 3, 'max_seq_length': 128, 'max_target_length': 128, 'batch_size': 8, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'save_total_limit': 1, 'save_strategy': 'epoch', 'peft': False, 'quantization': None, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'target_modules': []}
> INFO    ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.seq2seq', '--training_config', '/tmp/model/training_params.json']
> INFO    Started training with PID 40
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 21.9MB/s]
🚀 INFO   | 2023-12-19 07:49:35 | __main__:train:47 - Starting training...
🚀 INFO   | 2023-12-19 07:49:35 | __main__:train:48 - Training config: {'data_path': 'emilzak/autotrain-data-date_parser_v9', 'model': 'facebook/m2m100_418M', 'username': 'emilzak', 'seed': 42, 'train_split': 'train', 'valid_split': 'validation', 'project_name': '/tmp/model', 'token': '*****', 'push_to_hub': True, 'text_column': 'autotrain_text', 'target_column': 'autotrain_label', 'repo_id': 'emilzak/date_parser_v9-0', 'lr': 5e-05, 'epochs': 3, 'max_seq_length': 128, 'max_target_length': 128, 'batch_size': 8, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'save_total_limit': 1, 'save_strategy': 'epoch', 'peft': False, 'quantization': None, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'target_modules': []}
Downloading readme:   0%|          | 0.00/617 [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 617/617 [00:00<00:00, 6.19MB/s]
Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading data:   0%|          | 0.00/14.7k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 14.7k/14.7k [00:00<00:00, 84.3kB/s]
Downloading data: 100%|██████████| 14.7k/14.7k [00:00<00:00, 84.2kB/s]
Downloading data files:  50%|█████     | 1/2 [00:00<00:00,  5.67it/s]
Downloading data:   0%|          | 0.00/5.16k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 5.16k/5.16k [00:00<00:00, 98.2kB/s]
Downloading data files: 100%|██████████| 2/2 [00:00<00:00,  8.65it/s]
Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 1665.73it/s]
Generating train split:   0%|          | 0/544 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 544/544 [00:00<00:00, 126634.55 examples/s]
Generating validation split:   0%|          | 0/136 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 136/136 [00:00<00:00, 98860.54 examples/s]
config.json:   0%|          | 0.00/908 [00:00<?, ?B/s]
config.json: 100%|██████████| 908/908 [00:00<00:00, 7.51MB/s]
pytorch_model.bin:   0%|          | 0.00/1.94G [00:00<?, ?B/s]
pytorch_model.bin:   1%|          | 10.5M/1.94G [00:01<03:26, 9.34MB/s]
pytorch_model.bin:   3%|▎         | 62.9M/1.94G [00:01<00:28, 65.4MB/s]
pytorch_model.bin:   5%|▍         | 94.4M/1.94G [00:02<00:57, 31.8MB/s]
pytorch_model.bin:   6%|▌         | 115M/1.94G [00:03<00:44, 40.7MB/s] 
pytorch_model.bin:   7%|▋         | 136M/1.94G [00:03<00:34, 51.9MB/s]
pytorch_model.bin:  15%|█▍        | 283M/1.94G [00:03<00:09, 184MB/s] 
pytorch_model.bin:  20%|██        | 388M/1.94G [00:03<00:05, 275MB/s]
pytorch_model.bin:  26%|██▌       | 503M/1.94G [00:03<00:03, 397MB/s]
pytorch_model.bin:  33%|███▎      | 629M/1.94G [00:03<00:02, 513MB/s]
pytorch_model.bin:  37%|███▋      | 724M/1.94G [00:04<00:04, 287MB/s]
pytorch_model.bin:  41%|████      | 786M/1.94G [00:04<00:04, 278MB/s]
pytorch_model.bin:  44%|████▍     | 849M/1.94G [00:04<00:03, 320MB/s]
pytorch_model.bin:  54%|█████▎    | 1.04G/1.94G [00:04<00:01, 547MB/s]
pytorch_model.bin:  59%|█████▊    | 1.13G/1.94G [00:05<00:01, 526MB/s]
pytorch_model.bin:  63%|██████▎   | 1.22G/1.94G [00:05<00:01, 494MB/s]
pytorch_model.bin:  75%|███████▌  | 1.46G/1.94G [00:05<00:00, 820MB/s]
pytorch_model.bin:  93%|█████████▎| 1.81G/1.94G [00:05<00:00, 1.32GB/s]
pytorch_model.bin: 100%|█████████▉| 1.94G/1.94G [00:05<00:00, 350MB/s] 
generation_config.json:   0%|          | 0.00/233 [00:00<?, ?B/s]
generation_config.json: 100%|██████████| 233/233 [00:00<00:00, 1.79MB/s]
tokenizer_config.json:   0%|          | 0.00/272 [00:00<?, ?B/s]
tokenizer_config.json: 100%|██████████| 272/272 [00:00<00:00, 2.03MB/s]
vocab.json:   0%|          | 0.00/3.71M [00:00<?, ?B/s]
vocab.json: 100%|██████████| 3.71M/3.71M [00:00<00:00, 20.9MB/s]
vocab.json: 100%|██████████| 3.71M/3.71M [00:00<00:00, 20.7MB/s]
sentencepiece.bpe.model:   0%|          | 0.00/2.42M [00:00<?, ?B/s]
sentencepiece.bpe.model: 100%|██████████| 2.42M/2.42M [00:00<00:00, 31.4MB/s]
special_tokens_map.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]
special_tokens_map.json: 100%|██████████| 1.14k/1.14k [00:00<00:00, 9.90MB/s]
  0%|          | 0/204 [00:00<?, ?it/s]❌ ERROR  | 2023-12-19 07:49:49 | autotrain.trainers.common:wrapper:79 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 76, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/seq2seq/__main__.py", line 216, in train
    trainer.train()
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/app/env/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/app/src/autotrain/trainers/seq2seq/dataset.py", line 18, in __getitem__
    labels = self.tokenizer(text_target=target, max_length=self.max_len_target, truncation=True)
  File "/app/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2804, in __call__
    self._switch_to_target_mode()
  File "/app/env/lib/python3.10/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py", line 361, in _switch_to_target_mode
    self.set_tgt_lang_special_tokens(self.tgt_lang)
  File "/app/env/lib/python3.10/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py", line 372, in set_tgt_lang_special_tokens
    lang_token = self.get_lang_token(tgt_lang)
  File "/app/env/lib/python3.10/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py", line 378, in get_lang_token
    return self.lang_code_to_token[lang]
KeyError: None

❌ ERROR  | 2023-12-19 07:49:49 | autotrain.trainers.common:wrapper:80 - None
🚀 INFO   | 2023-12-19 07:49:49 | autotrain.trainers.common:pause_space:44 - Pausing space...
Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 76, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/seq2seq/__main__.py", line 216, in train
    trainer.train()
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/app/env/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/app/env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/app/src/autotrain/trainers/seq2seq/dataset.py", line 18, in __getitem__
    labels = self.tokenizer(text_target=target, max_length=self.max_len_target, truncation=True)
  File "/app/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2804, in __call__
    self._switch_to_target_mode()
  File "/app/env/lib/python3.10/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py", line 361, in _switch_to_target_mode
    self.set_tgt_lang_special_tokens(self.tgt_lang)
  File "/app/env/lib/python3.10/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py", line 372, in set_tgt_lang_special_tokens
    lang_token = self.get_lang_token(tgt_lang)
  File "/app/env/lib/python3.10/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py", line 378, in get_lang_token
    return self.lang_code_to_token[lang]
KeyError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/app/env/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/spaces/emilzak/autotrain-date_parser_v9-0/discussions

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/app/env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/src/autotrain/trainers/seq2seq/__main__.py", line 248, in <module>
    train(config)
  File "/app/src/autotrain/trainers/common.py", line 81, in wrapper
    pause_space(config, is_failure=True)
  File "/app/src/autotrain/trainers/common.py", line 55, in pause_space
    api.create_discussion(
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 5126, in create_discussion
    hf_raise_for_status(resp)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 330, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/spaces/emilzak/autotrain-date_parser_v9-0/discussions (Request ID: Root=1-65814b1d-7783c5e03a0006782cf21ce8;91d498ce-240b-485b-b65a-3b52ae9b3735)

Oops 😱 You've been rate limited. For safety reasons, we limit the number of write operations for new users. Please try again in 24 hours or get in touch with us at website@huggingface.co if you need access now.
  0%|          | 0/204 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/app/env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/app/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/app/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/app/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/app/env/bin/python', '-m', 'autotrain.trainers.seq2seq', '--training_config', '/tmp/model/training_params.json']' returned non-zero exit status 1.
> INFO    Process 40 is already completed. Skipping...
> INFO    No running jobs found. Shutting down the server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [34]
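
Reading the log from the top, the first failure is the `KeyError: None` raised in `get_lang_token`, reached through `_switch_to_target_mode` with `self.tgt_lang`. The 429 rate-limit error appears only afterwards, while `pause_space` tries to call `create_discussion` to report that failure. The sketch below is a stand-in for that dictionary lookup, not the real tokenizer: the class name `TinyLangTokenizer`, the helper `try_target_prefix`, and the two-entry mapping are hypothetical and exist only to show why an unset target language ends in `KeyError: None`.

```python
# Stand-in for the lookup shown in the traceback (tokenization_m2m_100.py,
# get_lang_token). This is NOT the real M2M100 tokenizer; the class name and
# the two-entry mapping below are hypothetical, for illustration only.
class TinyLangTokenizer:
    def __init__(self, src_lang="en", tgt_lang=None):
        self.lang_code_to_token = {"en": "__en__", "fr": "__fr__"}
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang  # stays None unless the caller sets it

    def get_lang_token(self, lang):
        # Same dictionary access as the last frame of the traceback.
        return self.lang_code_to_token[lang]

    def target_prefix(self):
        # Mirrors the chain _switch_to_target_mode -> set_tgt_lang_special_tokens.
        return self.get_lang_token(self.tgt_lang)


def try_target_prefix(tokenizer):
    """Return the target language token, or a description of the failure."""
    try:
        return tokenizer.target_prefix()
    except KeyError as err:
        return f"KeyError: {err}"


# Target language never set: the lookup key is None, as in the log.
print(try_target_prefix(TinyLangTokenizer()))

# Target language set explicitly: the lookup succeeds.
print(try_target_prefix(TinyLangTokenizer(tgt_lang="en")))
```

With the real tokenizer, the documented way to set both languages is `M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")`; the language codes here are placeholders. The log does not show whether AutoTrain exposes a setting for this, so treat it as a pointer for where to look rather than a confirmed fix.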
