LLama3-8B - FSDP + QLORA results in OOM with 4 A40's | 1 | 278 | June 17, 2024
Multi-GPU Issue when trying Diffusers demo | 0 | 159 | June 16, 2024
How to pass `ProjectConfig` to `accelerate launch` command? | 0 | 104 | June 14, 2024
Resume training with lesser GPUs Error rng_state_6.pth | 0 | 85 | June 13, 2024
Lora finetuning 35 B model error | 0 | 111 | June 11, 2024
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! I am on a single T4 GPU | 6 | 349 | June 10, 2024
Extremely slow loading with accelerate 0.31.0? | 2 | 158 | June 10, 2024
Feature Request: Add DDP Communication Hooks | 2 | 257 | June 9, 2024
Low bf16 performance on TPU, int4 vs int8 quantization | 0 | 177 | June 1, 2024
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! | 1 | 354 | May 31, 2024
Weights & Biases sweep with multi gpu accelerate launch | 4 | 2325 | May 28, 2024
ORPO Trainer giving error when fine-tuning Llama3-8b in Multi-GPU environment | 8 | 729 | May 27, 2024
Segmentation fault core dumped (Solved) | 1 | 234 | May 27, 2024
How to do distributed Inference for large models with multiprocess? | 3 | 479 | May 26, 2024
ValueError (unknown key enable_cpu_affinity) on SageMaker for Accelerate >=0.29.0 | 3 | 566 | May 22, 2024
Getting the error: AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False` while Finetuning GPT2 model | 0 | 223 | May 21, 2024
Hugging Face Trainer class with accelerate | 2 | 242 | May 21, 2024
Feature Request: Elastic Launch Support in `notebook_launcher` | 0 | 122 | May 16, 2024
Degraded results after loading from checkpoint | 0 | 121 | May 13, 2024
How to launch multi node training using accelerate launch | 0 | 220 | May 13, 2024
Key errors when trying to load an accelerate-FSDP model checkpoint | 0 | 253 | May 8, 2024
Accelerate FSDP config prompts | 5 | 3430 | September 15, 2023
cuBLAS error 13 when running code with langchain.llms on GPU | 0 | 197 | May 6, 2024
Wandb.watch in accelerate library | 6 | 2040 | May 1, 2024
Slurm Issues running accelerate | 0 | 330 | May 1, 2024
What is my batch size..? | 2 | 586 | April 29, 2024
How to remove a model (unprepare) from the accelerator | 1 | 195 | April 29, 2024
How should I combine Accelerate and DPOTrainer for training? | 0 | 244 | April 29, 2024
How to use specific gpu in accelerate? | 10 | 3775 | April 25, 2024
While training a T5Small model using FSDP, the model does not learn | 1 | 540 | April 15, 2024