DeepSpeed

Topic	Replies	Views	Last activity
About the DeepSpeed category	1	805	October 30, 2021
Prakash Hinduja, Geneva (Swiss) How can I ask effective technical questions on the Hugging Face forum?	1	33	August 4, 2025
AttributeError: 'ORTTrainingArguments' object has no attribute 'deepspeed_plugin'	2	511	August 2, 2025
Timeout Issue with DeepSpeed on Multiple GPUs	2	674	July 21, 2025
Use ReduceLROnPlateau with deepspeed	4	49	June 26, 2025
How to use different learning rates when deepspeed enabled	1	46	June 14, 2025
LoRA training with accelerate / deepspeed	3	2623	May 28, 2025
Incorrect total train batch size when using tp_size > 1 and deepspeed	1	122	May 20, 2025
Error using deepspeed for sftconfig	1	73	April 21, 2025
Deepspeed zero3 does not work with Diffusion Models. Does anyone know how to fix this?	1	2403	April 18, 2025
Corrupted deepspeed checkpoint	1	220	March 13, 2025
SFTTrainer Doubling Speed on a Single GPU with DeepSpeed: Proposal for an Update to the Official Documentation and Verification Report	1	89	March 7, 2025
Accelerator.backward freeze	1	86	February 24, 2025
Deepspeed ZeRO-3 flattens convolution that causes runtime error	0	238	February 17, 2025
Is there a way to terminate llm.generate and release the GPU memory for next prompt?	1	245	February 4, 2025
CUDA OOM on first backward pass after evaluation	0	312	November 20, 2024
Different metrics score between when training and when merge lora adapter testing	1	157	October 25, 2024
Trainer leaked memory?	1	803	October 15, 2024
DeepSpeed MII pipeline issue	1	43	September 30, 2024
Deepspeed mii library issues	1	90	September 29, 2024
Calculate tokens per second while fine-tuning llm?	0	154	September 17, 2024
Fitting huge models on multiple nodes	0	195	September 6, 2024
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)	5	3489	August 26, 2024
AutoTrain Error DeepSpeed Zero-3	1	310	August 21, 2024
DeepSpeed Zero 3 with LoRA - Merging adapters	1	807	August 16, 2024
DeepSpeed error: a leaf Variable that requires grad is being used in an in-place operation	1	89	July 26, 2024
Running model.generate() in DeepSpeed training	2	564	July 25, 2024
RuntimeError: Error building extension 'cpu_adam'	4	5319	July 23, 2024
Saving checkpoints when using DeepSpeed is taking abnormally long	0	199	July 22, 2024
GPU memory usage of optimizer's states when using LoRA	4	871	July 5, 2024