According to this new document - Trainer:
The properties below are no longer in the new document. Is there a new method to assign these parameters, are they detected automatically, or is a separate library needed for that?
- n_gpu
- parallel_mode
Those are internal variables; they were documented by mistake before.
Does that mean the Trainer will automatically identify the number of GPUs? And which method is used with multiple GPUs: DDP, DP, or PP?
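Since `n_gpu` and `parallel_mode` are derived internally rather than user-set, the decision logic can be sketched roughly as below. This is a hedged, illustrative sketch (the function name and signature are mine, not the real API): with several visible GPUs but a plain `python train.py` launch, the Trainer falls back to `torch.nn.DataParallel` (DP); when launched with one process per GPU (e.g. `torchrun` or `accelerate launch`), it uses DistributedDataParallel (DDP).

```python
# Hedged sketch of Trainer-style parallel-mode selection. `n_gpu` and
# `parallel_mode` are derived from the environment, not assigned by the user.
# The function and its names are illustrative, not the real internal API.

def pick_parallel_mode(visible_gpus: int, launched_distributed: bool) -> str:
    """Return the data-parallel strategy a Trainer-like setup would use."""
    if launched_distributed:
        # One process per GPU (e.g. `torchrun --nproc_per_node=2 train.py`).
        return "DDP"  # DistributedDataParallel
    if visible_gpus > 1:
        # Single process sees multiple GPUs: falls back to DataParallel.
        return "DP"
    return "single"

print(pick_parallel_mode(2, False))  # plain `python train.py` on 2 GPUs -> DP
print(pick_parallel_mode(2, True))   # launched via torchrun -> DDP
```

DDP is generally the recommended mode for multi-GPU training; DP duplicates work on one process and is slower, which may matter for the 2-GPU numbers below.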
My experiment details, using the code available in the single-GPU document.
I got the best result on 2 GPUs with FP16 only, but I cannot see any memory improvement from it.
Use Default --------------------------
For 1 x A100 GPU
* Time: 168.06 <-- Less time - 5th
* Samples/second: 3.05
* GPU memory occupied: 12,852 MB.
For 2 x A100 GPU
* Time: 103.79 <-- Less time - 2nd
* Samples/second: 4.93
* GPU 1 memory occupied: 13,286 MB.
* GPU 2 memory occupied: 9,418 MB.
Use gradient_accumulation_steps ------
For 1 x A100 GPU
* Time: 192.55 <-- Less time - 6th
* Samples/second: 2.66
* GPU memory occupied: 12,876 MB.
For 2 x A100 GPU
* Time: 155.53 <-- Less time - 3rd
* Samples/second: 3.29
* GPU 1 memory occupied: 13,310 MB.
* GPU 2 memory occupied: 9,442 MB.
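The slowdown above without any memory change is expected from what gradient accumulation does: the optimizer steps less often, growing the effective batch, while per-step activation memory stays flat. A small sketch of the arithmetic (the batch numbers are illustrative, not from the runs above):

```python
# Gradient accumulation grows the *effective* batch without growing per-step
# memory: gradients are summed over several forward/backward passes before
# one optimizer step. Batch sizes here are illustrative.

def effective_batch_size(per_device_batch: int, accum_steps: int, n_gpus: int) -> int:
    return per_device_batch * accum_steps * n_gpus

print(effective_batch_size(4, 4, 1))  # -> 16
print(effective_batch_size(4, 4, 2))  # -> 32
```

In `TrainingArguments` these correspond to `per_device_train_batch_size` and `gradient_accumulation_steps`; memory savings only appear if you also shrink the per-device batch.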
Use gradient_checkpointing + gradient_accumulation_steps -----
For 1 x A100 GPU
* Time: 262.98 <-- Max time - 9th
* Samples/second: 1.95
* GPU memory occupied: 12,876 MB.
For 2 x A100 GPU
* Time: 1872.74 <-- Max time - 9th
* Samples/second: 0.27
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.
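The ~35% single-GPU slowdown with checkpointing is roughly the cost of recomputing activations (about one extra forward pass per step); the 2-GPU blow-up looks like a separate issue, possibly DP-related. Why memory barely moved here is less clear, but the intended trade-off can be sketched. Assuming the common sqrt-segmentation policy (as in `torch.utils.checkpoint.checkpoint_sequential`; the exact policy varies by model):

```python
import math

# Illustrative activation accounting for gradient checkpointing: instead of
# storing activations for all L layers, only ~sqrt(L) checkpoints are kept
# and the rest are recomputed during backward (costing extra compute).
# The sqrt policy is one common choice, assumed here for illustration.

def activation_tensors_kept(n_layers: int, checkpointing: bool) -> int:
    return math.ceil(math.sqrt(n_layers)) if checkpointing else n_layers

print(activation_tensors_kept(48, False))  # all 48 layers stored
print(activation_tensors_kept(48, True))   # only 7 checkpoints stored
```

So checkpointing trades compute for activation memory; it does not reduce weight or optimizer-state memory, which may be why the occupied-memory numbers above are unchanged.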
Use FP16 without other arguments --------------------------
For 1 x A100 GPU
* Time: 91.13 <-- Less time - 1st
* Samples/second: 5.62
* GPU memory occupied: 12,876 MB.
For 2 x A100 GPU
* Time: 54.30 <-- Less time - 1st
* Samples/second: 1.05
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.
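The "FP16 is fastest but saves no memory" observation matches how mixed precision is usually accounted for: FP32 master weights and FP32 Adam states are kept, and an FP16 working copy is added on top. A rough bytes-per-parameter sketch (one common accounting; the exact split depends on the implementation):

```python
# Back-of-the-envelope bytes per parameter for Adam training, showing why
# --fp16 speeds up compute but barely reduces reported memory. This is a
# rough accounting assumed for illustration, not an exact measurement.

def bytes_per_param(fp16: bool) -> int:
    master_weights = 4           # FP32 master copy is kept either way
    grads = 2 if fp16 else 4     # gradients computed in the working dtype
    adam_states = 8              # two FP32 moments (exp_avg, exp_avg_sq)
    half_copy = 2 if fp16 else 0 # extra FP16 working copy of the weights
    return master_weights + grads + adam_states + half_copy

print(bytes_per_param(fp16=False))  # -> 16 bytes/param
print(bytes_per_param(fp16=True))   # -> 16 bytes/param
```

Under this accounting the totals come out the same; FP16 wins on speed (Tensor Cores, halved activation/gradient traffic), not on parameter-related memory.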
Use FP16 with other arguments (gradient_checkpointing + gradient_accumulation_steps) -----
For 1 x A100 GPU
* Time: 112.80 <-- Less time - 3rd
* Samples/second: 4.54
* GPU memory occupied: 12,876 MB.
For 2 x A100 GPU
* Time: 1823.79 <-- Less time - 7th
* Samples/second: 0.28
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.
Use BF16 --------------------------
For 1 x A100 GPU
Error
For 2 x A100 GPU
Error
Use TF32 --------------------------
For 1 x A100 GPU
Error
For 2 x A100 GPU
Error
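On the BF16/TF32 errors: both formats need an Ampere-class GPU (compute capability >= 8.0) plus a recent enough PyTorch/transformers. An A100 is compute capability 8.0, so the errors above most likely come from software versions or flag names rather than the hardware. A minimal capability check, with illustrative capability numbers:

```python
# BF16 and TF32 require compute capability >= 8.0 (Ampere). A100 = 8.0,
# V100 = 7.0. This pure-Python check is illustrative; at runtime you would
# get the capability from torch.cuda.get_device_capability().

def supports_bf16_tf32(major, minor):
    return (major, minor) >= (8, 0)

print(supports_bf16_tf32(8, 0))  # A100 -> True
print(supports_bf16_tf32(7, 0))  # V100 -> False
```

Since the check passes for A100, it would be worth confirming the installed PyTorch and transformers versions and the exact argument names (`bf16=True`, `tf32=True`) used in the run.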
Use AdaFactor Optimizer without other arguments -----------
For 1 x A100 GPU
* Time: 232.71 <-- Less time - 8th
* Samples/second: 2.20
* GPU memory occupied: 12,876 MB.
For 2 x A100 GPU
* Time: 543.09 <-- Less time - 5th
* Samples/second: 0.94
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.
Use AdaFactor Optimizer with other arguments (gradient_checkpointing + gradient_accumulation_steps + FP16) -------------
For 1 x A100 GPU
* Time: 118.11 <-- Less time - 4th
* Samples/second: 4.33
* GPU memory occupied: 12,876 MB.
For 2 x A100 GPU
* Time: 1832.84 <-- Less time - 8th
* Samples/second: 0.28
* GPU 1 memory occupied: 9,554 MB.
* GPU 2 memory occupied: 4,920 MB.
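Adafactor's memory advantage is in optimizer state, which the sketch below illustrates for a single weight matrix: Adam keeps two full-size moments, while Adafactor (with momentum disabled, its default) factors the second moment into one row vector and one column vector. The matrix size is illustrative:

```python
# Optimizer-state value counts for one (rows x cols) weight matrix.
# Adam: two full moments. Adafactor (no momentum, its default): the second
# moment is factored into a row vector plus a column vector.

def optimizer_state_values(rows: int, cols: int, optimizer: str) -> int:
    n = rows * cols
    if optimizer == "adam":
        return 2 * n          # exp_avg + exp_avg_sq, both full size
    if optimizer == "adafactor":
        return rows + cols    # factored second moment only
    raise ValueError(optimizer)

print(optimizer_state_values(1024, 1024, "adam"))       # 2,097,152 values
print(optimizer_state_values(1024, 1024, "adafactor"))  # 2,048 values
```

That saving shows up mainly with large models; in these runs the slower per-step time appears to dominate the wall-clock results.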
Use adam_bnb_optim Optimizer without other arguments ----------
For 1 x A100 GPU
* Time: 224.91 <-- Less time - 7th
* Samples/second: 2.28
* GPU memory occupied: 13,256 MB.
For 2 x A100 GPU
* Time: 535.81 <-- Less time - 4th
* Samples/second: 0.96
* GPU 1 memory occupied: 10,316 MB.
* GPU 2 memory occupied: 5,140 MB.
Use adam_bnb_optim Optimizer with other arguments (gradient_checkpointing + gradient_accumulation_steps + FP16) ------------
For 1 x A100 GPU
* Time: 102.83 <-- Less time - 2nd
* Samples/second: 4.98
* GPU memory occupied: 13,266 MB.
For 2 x A100 GPU
* Time: 1817.75 <-- Less time - 6th
* Samples/second: 0.28
* GPU 1 memory occupied: 10,326 MB.
* GPU 2 memory occupied: 5,140 MB.
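The lower memory in the `adam_bnb_optim` runs is consistent with what bitsandbytes' 8-bit Adam changes: both Adam moments are stored in quantized 8-bit form instead of FP32. A rough sketch of that saving (the parameter count is illustrative, roughly GPT-2 small; block-wise quantization metadata is ignored here):

```python
# Rough Adam optimizer-state memory, FP32 vs bitsandbytes-style 8-bit
# storage of both moments. Ignores small quantization metadata overhead;
# numbers are illustrative accounting, not measurements from the runs above.

def adam_state_bytes(n_params: int, eight_bit: bool) -> int:
    bytes_per_moment = 1 if eight_bit else 4
    return 2 * bytes_per_moment * n_params  # exp_avg + exp_avg_sq

gpt2_params = 124_000_000  # ~GPT-2 small, for scale (illustrative)
print(adam_state_bytes(gpt2_params, False) / 2**20)  # ~946 MiB in FP32
print(adam_state_bytes(gpt2_params, True) / 2**20)   # ~237 MiB in 8-bit
```

That roughly 4x reduction in optimizer state lines up with the noticeably lower GPU-memory figures in the `adam_bnb_optim` runs compared with the default optimizer.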