Error using deepspeed for sftconfig

sai-santhosh · April 20, 2025, 1:26pm

from trl import SFTConfig
import wandb
wandb.init(mode="disabled")

# Configure training arguments
training_args = SFTConfig(
    output_dir="qwen_sign_language_interpretation",  # Directory to save the model
    max_steps=250,
    per_device_train_batch_size=1,  # Batch size for training
    per_device_eval_batch_size=1,  # Batch size for evaluation
    gradient_accumulation_steps=16,  # Steps to accumulate gradients
    gradient_checkpointing=True,  # Enable gradient checkpointing for memory efficiency
    # Optimizer and scheduler settings
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=2e-4,  # Learning rate for training
    lr_scheduler_type="constant",  # Type of learning rate scheduler
    # Logging and evaluation
    logging_steps=10,  # Steps interval for logging
    # eval_steps=10,  # Steps interval for evaluation
    # eval_strategy="steps",  # Strategy for evaluation
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Steps interval for saving
    metric_for_best_model="train_loss",  # Metric to evaluate the best model
    greater_is_better=False,  # Whether higher metric values are better
    # Mixed precision and gradient settings
    bf16=True,  # Use bfloat16 precision
    deepspeed="/content/zero_stage3_offload_config.json",  # Path to the DeepSpeed config file
    max_grad_norm=0.3,  # Maximum norm for gradient clipping
    warmup_ratio=0.03,  # Ratio of total steps for warmup
    report_to="none",  # Reporting tool for tracking metrics
    # Gradient checkpointing settings
    # gradient_checkpointing_kwargs={"use_reentrant": False},  # Options for gradient checkpointing
    # Dataset configuration
    dataset_text_field="",  # Text field in dataset
    dataset_kwargs={"skip_prepare_dataset": True},  # Additional dataset options
    max_seq_length=1024  # Maximum sequence length for input
)

training_args.remove_unused_columns = False  # Keep unused columns in dataset

This is my code. I'm getting the error below. Please help me resolve it.

John6666 · April 21, 2025, 5:50am

This may be an unresolved issue…

github.com/huggingface/transformers

Problem initializing Deepspeed with Trainer

opened 05:30PM - 24 Aug 23 UTC

closed 08:07AM - 11 Oct 23 UTC

lhallee

### System Info

```python
2023-08-24 17:23:17.908613: W tensorflow/compiler/tf…2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-08-24 17:23:20,478] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/transformers/commands/env.py:100: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-08-24 17:23:29.664543: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
```

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.32.0
- Platform: Linux-5.15.109+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.3
- Accelerate version: 0.22.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.7.2 (gpu)
- Jax version: 0.4.14
- JaxLib version: 0.4.14
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

@pacman

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

1. Load deepspeed config into json file
2. Pass into TrainingArguments
3. Get error

Here is my code:

```python
import json

deepspeed_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True
    },
    "steps_per_print": 100,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True
    },
    "wall_clock_breakdown": False
}

config_filename = "deepspeed_config.json"
with open(config_filename, 'w') as f:
    json.dump(deepspeed_config, f)

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=TrainingArguments(
        num_train_epochs=num_epochs,
        evaluation_strategy='steps' if val_set_size > 0 else 'no',
        save_strategy='steps',
        eval_steps=eval_steps if val_set_size > 0 else None,
        save_steps=save_steps,
        output_dir=output_dir,
        save_total_limit=save_total_limit,
        load_best_model_at_end=True if val_set_size > 0 else False,
        deepspeed='./deepspeed_config.json',
    ),
    data_collator=DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors='pt', padding=True),
    callbacks=[print_callback]
)
model.config.use_cache = False
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
model.save_pretrained(output_dir)
```

Here is my error:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-55-5c7ff182bcf8> in <cell line: 1>()
      3     train_dataset=train_data,
      4     eval_dataset=val_data,
----> 5     args=TrainingArguments(
      6         num_train_epochs=num_epochs,
      7         evaluation_strategy='steps' if val_set_size > 0 else 'no',

3 frames
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py in __init__(self, config_file_or_dict)
     64         dep_version_check("accelerate")
     65         dep_version_check("deepspeed")
---> 66         super().__init__(config_file_or_dict)
     67
     68

TypeError: object.__init__() takes exactly one argument (the instance to initialize)
```

### Expected behavior

Trainer loads and runs.

PS. This is my first ever issue reported, I'm a domain scientist sorry if this isn't the normal way to report things.
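For what it's worth, here is one possible reading of that traceback. It is an assumption on my part, not a confirmed diagnosis: the message would make sense if the class in `transformers/deepspeed.py` ended up with plain `object` as its parent, for example because the DeepSpeed-backed base class could not be imported in that environment. The sketch below reproduces the same `TypeError` in isolation. `FallbackBase`, `ConfigWrapper` and `try_build` are hypothetical names for illustration only, not real transformers classes.

```python
# Hypothetical stand-in: pretend the optional DeepSpeed base class could not
# be imported, so the name falls back to the builtin `object`.
FallbackBase = object


class ConfigWrapper(FallbackBase):
    """Illustrative wrapper (not a real transformers class)."""

    def __init__(self, config_file_or_dict):
        # With `object` as the base, this call passes an extra argument to
        # object.__init__(), which accepts none besides the instance.
        super().__init__(config_file_or_dict)


def try_build(path):
    """Return the error message raised while building the wrapper, or None."""
    try:
        ConfigWrapper(path)
    except TypeError as err:
        return str(err)
    return None


message = try_build("./deepspeed_config.json")
print(message)
```

If this reading is right, the config file itself is not the problem, and checking that `deepspeed` and `accelerate` both import cleanly in the same runtime would be the first thing to try.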

I don't have a fix, but you could try a 0.15.x version of DeepSpeed. I don't get this error with that version.
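To see which versions are actually installed before trying that, something like the following may help. It is only a sketch: the package list and the `0.15` series come from this thread, the helper names `installed_version` and `in_release_series` are made up for illustration, and the suggested pin is an assumption about how you would stay on 0.15.x.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def in_release_series(version_string, series):
    """True if `version_string` belongs to `series`, e.g. "0.15.4" is in "0.15"."""
    return version_string == series or version_string.startswith(series + ".")


if __name__ == "__main__":
    # Report the packages involved in the DeepSpeed integration.
    for name in ("deepspeed", "accelerate", "transformers", "trl"):
        print(f"{name}: {installed_version(name) or 'not installed'}")

    ds = installed_version("deepspeed")
    if ds is not None and not in_release_series(ds, "0.15"):
        # Assumed pin for staying on the 0.15.x series mentioned above.
        print('Consider trying: pip install "deepspeed>=0.15,<0.16"')
```

In Colab, remember to restart the runtime after reinstalling so the new version is the one that gets imported.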
