Hello,
I am trying to fine-tune the google-bert/bert-base-uncased model on the lhoestq/squad dataset, following the Extractive Question Answering example in the documentation linked below:
I have set up my AutoTrain UI exactly as shown in the screenshot in the above link, and I have checked that my column names are correct (see the sanity check after the log below). I have tried running this on both a CPU and a small T4 GPU, but in both cases I get the following errors:
INFO | 2025-05-31 16:54:18 | autotrain.app.ui_routes::31 - Starting AutoTrain…
INFO | 2025-05-31 16:54:20 | autotrain.app.ui_routes::315 - AutoTrain started successfully
INFO | 2025-05-31 16:54:20 | autotrain.app.app::13 - Starting AutoTrain…
INFO | 2025-05-31 16:54:20 | autotrain.app.app::23 - AutoTrain version: 0.8.36
INFO | 2025-05-31 16:54:20 | autotrain.app.app::24 - AutoTrain started successfully
INFO: Started server process [49]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
INFO: 10.16.19.229:12486 - "GET /?__sign=eyJhbGciOiJFZERTQSJ9.eyJyZWFkIjp0cnVlLCJwZXJtaXNzaW9ucyI6eyJyZXBvLmNvbnRlbnQucmVhZCI6dHJ1ZX0sIm9uQmVoYWxmT2YiOnsia2luZCI6InVzZXIiLCJfaWQiOiI2N2VlNTdmZDM1NDdmODIzMTAyNTI5M2MiLCJ1c2VyIjoiaWFubWQiLCJzZXNzaW9uSWQiOiI2ODNhY2NkMjFhYjk5N2VlMjZkZThjZjkifSwiaWF0IjoxNzQ4NzEwNTg2LCJzdWIiOiIvc3BhY2VzL2lhbm1kL2F1dG90cmFpbi10ZXN0aW5nIiwiZXhwIjoxNzQ4Nzk2OTg2LCJpc3MiOiJodHRwczovL2h1Z2dpbmdmYWNlLmNvIn0.sExea1b6OSWrCBCUfS_3I9DmYqaIclQC9dNG4pukT00UNEB_2x8uq3bt-Culu03y-zIoAfhT94RQR_IAEfwxCw HTTP/1.1" 307 Temporary Redirect
INFO | 2025-05-31 16:56:27 | autotrain.app.ui_routes:fetch_params:415 - Task: llm:sft
INFO | 2025-05-31 16:56:38 | autotrain.app.ui_routes:fetch_params:415 - Task: extractive-qa
INFO | 2025-05-31 16:58:15 | autotrain.app.ui_routes:handle_form:540 - hardware: local-ui
INFO | 2025-05-31 16:58:15 | autotrain.backends.local:create:20 - Starting local training…
INFO | 2025-05-31 16:58:15 | autotrain.commands:launch_command:514 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.extractive_question_answering', '--training_config', 'autotrain-vfbpf-ju79s/training_params.json']
INFO | 2025-05-31 16:58:15 | autotrain.commands:launch_command:515 - {'data_path': 'lhoestq/squad', 'model': 'google-bert/bert-base-uncased', 'lr': 5e-05, 'epochs': 3, 'max_seq_length': 512, 'max_doc_stride': 128, 'batch_size': 8, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'train_split': 'train', 'valid_split': 'validation', 'text_column': 'context', 'question_column': 'question', 'answer_column': 'answers', 'logging_steps': -1, 'project_name': 'autotrain-vfbpf-ju79s', 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'save_total_limit': 1, 'token': '*****', 'push_to_hub': True, 'eval_strategy': 'epoch', 'username': 'ianmd', 'log': 'tensorboard', 'early_stopping_patience': 5, 'early_stopping_threshold': 0.01}
INFO | 2025-05-31 16:58:15 | autotrain.backends.local:create:25 - Training PID: 65
The following values were not passed to accelerate launch and had defaults used instead:
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Traceback (most recent call last):
  File "/app/env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/app/env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/extractive_question_answering/__main__.py", line 30, in <module>
    from autotrain.trainers.extractive_question_answering import utils
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/extractive_question_answering/utils.py", line 6, in <module>
    from datasets import load_metric
ImportError: cannot import name 'load_metric' from 'datasets' (/app/env/lib/python3.10/site-packages/datasets/__init__.py)
Traceback (most recent call last):
  File "/app/env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/app/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/app/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/app/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/app/env/bin/python', '-m', 'autotrain.trainers.extractive_question_answering', '--training_config', 'autotrain-vfbpf-ju79s/training_params.json']' returned non-zero exit status 1.
INFO | 2025-05-31 16:58:27 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 65
INFO | 2025-05-31 16:58:27 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 65
Device 0: Tesla T4 - 2.88MiB/15360MiB
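This is roughly how I checked the column names mentioned above; just a quick sanity check with the datasets library on my own machine, and the expected columns in the comments are my assumption based on the standard SQuAD format, not something taken from AutoTrain:

    from datasets import load_dataset

    # Load the same dataset AutoTrain is pointed at and inspect its columns.
    ds = load_dataset("lhoestq/squad", split="train")
    print(ds.column_names)   # I expect something like ['id', 'title', 'context', 'question', 'answers']
    print(ds[0]["answers"])  # I expect {'text': [...], 'answer_start': [...]}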
I have also tried other QA datasets from Hugging Face but get the same errors. Fine-tuning a text classification model on a CPU works fine. I have spent hours trying to figure this out and have searched online for why this is happening. The load_metric import appears to be the problem, but I do not understand why I am hitting it when my setup is exactly the same as the example in the documentation (although I do not know what hardware that example was run on).
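For what it's worth, the failing import from the traceback can be reproduced on its own outside AutoTrain. My understanding (which may be wrong) is that load_metric was removed from newer datasets releases and the separate evaluate library is the replacement; the snippet below is only a sketch of that check, and the evaluate fallback is my assumption rather than anything the AutoTrain trainer actually does:

    import datasets
    print(datasets.__version__)  # the Space presumably has a release where load_metric no longer exists

    try:
        # This is the import that extractive_question_answering/utils.py attempts (per the traceback).
        from datasets import load_metric
        squad_metric = load_metric("squad")
    except ImportError:
        # My assumption: in newer datasets releases the metric lives in the separate evaluate package.
        import evaluate
        squad_metric = evaluate.load("squad")

    print(squad_metric.compute(
        predictions=[{"id": "1", "prediction_text": "Paris"}],
        references=[{"id": "1", "answers": {"text": ["Paris"], "answer_start": [0]}}],
    ))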
Has anyone had a similar issue? I would appreciate any pointers on this.
Thanks very much
ian