| Topic | Replies | Views | Date |
| --- | --- | --- | --- |
| Issue with LoRA Adapter Loading on Multiple GPUs during Fine-Tuning with Accelerate and SFTTrainer | 3 | 526 | September 18, 2024 |
| What is the correct way to compute metrics while training using Accelerate? | 0 | 20 | October 29, 2024 |
| Evaluation Metrics are not matching with Shuffle = False | 0 | 19 | October 19, 2024 |
| How to specify FSDP config without launching via Accelerate | 3 | 83 | October 18, 2024 |
| Loading a HF Model in Multiple GPUs and Run Inferences in those GPUs | 10 | 8754 | October 16, 2024 |
| Distributed inference: how to store results in a global variable | 2 | 24 | October 16, 2024 |
| Asymmetric Loss Function has no effect in Accelerate | 0 | 18 | October 13, 2024 |
| Restoring the state of the DataLoader using skip_first_batches() after first epoch | 0 | 22 | October 11, 2024 |
| HuggingFacePipeline Llama2 load_in_4bit from_model_id the model has been loaded with `accelerate` and therefore cannot be moved to a specific device | 2 | 6863 | October 9, 2024 |
| Which (and how) Multi GPU strategy to use to train model with longer max_length (Phi-2 fits in Single GPU but qLoRa gives OOM with 512)? | 3 | 1190 | September 20, 2024 |
| Why does Transformer (LLaMa 3.1-8B) give different logits during inference for the same sample when used with single versus multi gpu prediction? | 0 | 63 | September 20, 2024 |
| Accelerate doesn't seem to use my GPU? | 7 | 4621 | September 18, 2024 |
| Accelerator load_state for LM head with tied weights | 0 | 37 | September 16, 2024 |
| Accelerate Distributed Randomly Hangs | 0 | 37 | September 11, 2024 |
| FSDP Auto Wrap does not work using `accelerate` in Multi-GPU Setup | 0 | 151 | September 6, 2024 |
| Learning Rate Scheduler Distributed Training | 6 | 1726 | September 5, 2024 |
| Key errors when trying to load an accelerate-FSDP model checkpoint | 1 | 466 | September 2, 2024 |
| Tensor parallelism for customized model | 0 | 116 | September 2, 2024 |
| FSDP FULL_SHARD: 3GPUs works, 2GPUs hangs at 1st step | 0 | 39 | August 26, 2024 |
| Accelerate + Gemma2 + FSDP | 1 | 104 | August 25, 2024 |
| Accelerate throws CUDA: OOM | 0 | 239 | August 22, 2024 |
| torch.distributed.elastic.multiprocessing.errors.ChildFailedError: | 1 | 411 | August 15, 2024 |
| Loading a model which is saved on multiple nodes using sharded_state_dict? | 0 | 36 | August 13, 2024 |
| Transformers Trainer + Accelerate FSDP: How do I load my model from a checkpoint? | 2 | 10804 | January 17, 2024 |
| Accelerate device error when running evaluation | 0 | 41 | August 12, 2024 |
| Weird behavior when saving checkpoint in DDP | 0 | 35 | August 11, 2024 |
| Multi-GPU Training sometimes working with 2GPU, but never more than 2 | 5 | 2723 | August 8, 2024 |
| GPTBigCode gives garbled output on Nvidia A10G | 1 | 28 | August 5, 2024 |
| Accelerate.save_model() Error all of the sudden | 1 | 89 | August 4, 2024 |
| HF Accelerate uses multiple GPUs even when setting `num_processes` to 1 | 0 | 32 | August 2, 2024 |