| Topic | Replies | Views | Activity |
|---|---|---|---|
| About the DeepSpeed category | 1 | 555 | October 30, 2021 |
| Hi why its not working? | 6 | 846 | August 24, 2023 |
| Resume_from_checkpoint does not configure learning rate scheduler correctly | 1 | 110 | September 14, 2023 |
| Saving checkpoint is too slow with deepspeed | 1 | 163 | September 11, 2023 |
| Learning rate with deepspeed is fixed despite lr set to auto | 2 | 173 | September 6, 2023 |
| RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select) | 2 | 422 | September 5, 2023 |
| Model connection timed out, even on simple requests | 0 | 49 | August 31, 2023 |
| RuntimeError: Error building extension 'cpu_adam' | 3 | 2154 | August 30, 2023 |
| How does from_pretrained work with ZeRO=3? | 0 | 105 | August 14, 2023 |
| ZeRO3 with int8 training | 0 | 98 | August 11, 2023 |
| Multi GPU training - Model parallelism | 0 | 160 | August 10, 2023 |
| Unable to train model (Loss is 0.000000) | 1 | 228 | August 9, 2023 |
| How to add java_home in HF space(spark + llama) | 0 | 157 | August 8, 2023 |
| Get Real and Authentic Documents online | 0 | 115 | August 7, 2023 |
| ZeRO uses more RAM than DDP? | 0 | 142 | August 7, 2023 |
| Best practice to run DeepSpeed | 1 | 578 | August 6, 2023 |
| Eval_batch_size VS per_device_eval_batch_size | 0 | 107 | August 4, 2023 |
| AttributeError: 'ORTTrainingArguments' object has no attribute 'deepspeed_plugin' | 0 | 131 | August 1, 2023 |
| NCCL timeout + corrupts checkpoint/latest | 1 | 768 | July 31, 2023 |
| RuntimeError: tensors must be contiguous when finetuning GPT-J-6B using PEFT Lora | 0 | 209 | July 29, 2023 |
| Deepspeed inference and infinity offload with bitsandbytes 4bit loaded models | 2 | 866 | July 27, 2023 |
| Parallelizing huggingface models | 0 | 75 | July 24, 2023 |
| Fine-tuning a 16B CodeGen model with 256GB RAM+2xA6000s? | 2 | 1038 | July 3, 2023 |
| Estimate training compute for 150B LLM | 0 | 109 | June 30, 2023 |
| No module named 'deepspeed.checkpoint.utils' | 6 | 499 | June 28, 2023 |
| Difference between using the Trainer class vs Accelerate library | 0 | 249 | June 27, 2023 |
| How to use Whisper from huggingface for ASR | 0 | 132 | June 21, 2023 |
| Trainer) training one batch with multiple GPUs | 0 | 119 | June 19, 2023 |
| Struggle with finetuneing flan-t5-xxl using deepspeed | 0 | 244 | May 30, 2023 |
| Multi-GPU sharded eval with Trainer and generate method during training | 1 | 309 | May 25, 2023 |