New Trainer doc is missing some properties the old doc has (n_gpu, parallel_mode)

According to the new Trainer document:

The properties below are not in the new document. Is there a new way to set these parameters, are they detected automatically, or is a separate library needed for that?

  • n_gpu

  • parallel_mode

Old Doc - Trainer — transformers 4.7.0 documentation

Those are internal variables; they were documented by mistake before.
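Even though they are internal now, the GPU count is still easy to query yourself. A minimal sketch (the helper name is mine; it falls back to `CUDA_VISIBLE_DEVICES` when torch is not installed):

```python
import os

def visible_gpu_count():
    """Number of GPUs the Trainer would see on this machine."""
    try:
        import torch
        return torch.cuda.device_count()
    except ImportError:
        # Fall back to the CUDA_VISIBLE_DEVICES environment variable.
        env = os.environ.get("CUDA_VISIBLE_DEVICES")
        return 0 if not env else len(env.split(","))

print(visible_gpu_count())
```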

Does that mean it will identify the number of GPUs automatically?

Which method is used when training on multiple GPUs: DDP, DP, or PP?
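As far as I understand, the choice depends on how the script is launched: `local_rank` is set when you start one process per GPU (via `torchrun` or `torch.distributed.launch`), which gives DDP; a single process that sees several GPUs falls back to `nn.DataParallel` (DP); PP is not used unless you set it up yourself. A simplified sketch of that decision (not the actual Trainer source):

```python
def multi_gpu_strategy(n_gpu: int, local_rank: int) -> str:
    """Simplified sketch of how Trainer picks a parallelism strategy."""
    if local_rank != -1:
        return "DDP"  # launched with torchrun / torch.distributed.launch
    if n_gpu > 1:
        return "DP"   # one process, several visible GPUs -> nn.DataParallel
    return "single GPU / CPU"

print(multi_gpu_strategy(n_gpu=2, local_rank=-1))  # -> DP
```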

Details of my experiments, using the code available in the single-GPU performance document.

I got the best result with 2 GPUs using only FP16.

I cannot see any memory improvement with this.
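For reference, the "GPU memory occupied" numbers below come from NVML, as in the single-GPU document. A hedged sketch of the measurement (function name is mine; it returns None when pynvml or an NVIDIA driver is not available):

```python
def gpu_memory_occupied_mb(index: int = 0):
    """Memory used on one GPU in MB, or None if NVML is unavailable."""
    try:
        from pynvml import (nvmlInit, nvmlDeviceGetHandleByIndex,
                            nvmlDeviceGetMemoryInfo)
        nvmlInit()
        info = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(index))
        return info.used // 1024**2
    except Exception:  # pynvml not installed, or no NVIDIA driver
        return None

print(gpu_memory_occupied_mb())
```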

Use Default --------------------------

For 1 x A100 GPU

* Time: 168.06 <-- Less time - 5th
* Samples/second: 3.05
* GPU memory occupied: 12,852 MB.

For 2 x A100 GPU

* Time: 103.79 <-- Less time - 2nd
* Samples/second: 4.93
* GPU 1 memory occupied: 13,286 MB.
* GPU 2 memory occupied: 9,418 MB.

Use gradient_accumulation_steps ------

For 1 x A100 GPU

* Time: 192.55 <-- Less time - 6th
* Samples/second: 2.66
* GPU memory occupied: 12,876 MB.

For 2 x A100 GPU

* Time: 155.53 <-- Less time - 3rd
* Samples/second: 3.29
* GPU 1 memory occupied: 13,310 MB.
* GPU 2 memory occupied: 9,442 MB.
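Gradient accumulation keeps the per-step memory footprint of a small batch while optimizing with a larger effective batch, which is why memory barely moves in these runs. The arithmetic, with illustrative values (not my exact settings):

```python
# Effective batch size when accumulating gradients (illustrative values).
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 2

effective_batch_size = (per_device_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # -> 32
```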

Use gradient_checkpointing + gradient_accumulation_steps -----

For 1 x A100 GPU

* Time: 262.98 <-- Max time - 9th
* Samples/second: 1.95
* GPU memory occupied: 12,876 MB.

For 2 x A100 GPU

* Time: 1872.74 <-- Less time - 9th
* Samples/second: 0.27
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.
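Gradient checkpointing trades compute for memory: activations are recomputed during the backward pass instead of being kept, which explains the slower times above. A rough, illustrative sketch of the activation-memory scaling (assuming a checkpoint roughly every sqrt(n) layers; the exact constants vary by implementation):

```python
import math

def activation_cost(n_layers: int, checkpointing: bool) -> int:
    """Illustrative count of layer activations held in memory at once."""
    if not checkpointing:
        return n_layers  # keep every layer's activations for backward
    # With checkpoints roughly every sqrt(n) layers, memory is O(sqrt(n)).
    return 2 * math.isqrt(n_layers)

print(activation_cost(64, False), activation_cost(64, True))  # -> 64 16
```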

Use FP16 without other arguments  --------------------------

For 1 x A100 GPU

* Time: 91.13  <-- Less time - 1st
* Samples/second: 5.62
* GPU memory occupied: 12,876 MB.

For 2 x A100 GPU

* Time: 54.30 <-- Less time - 1st
* Samples/second: 1.05
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.

Use FP16 with other arguments (gradient_checkpointing + gradient_accumulation_steps) -----

For 1 x A100 GPU

* Time: 112.80 <-- Less time - 3rd
* Samples/second: 4.54
* GPU memory occupied: 12,876 MB.

For 2 x A100 GPU

* Time: 1823.79 <-- Less time - 7th
* Samples/second: 0.28
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.
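On why FP16 improves speed much more than the reported weight memory: mixed precision keeps an FP32 master copy of the weights, so the savings show up mainly in activations and matmul throughput. A back-of-the-envelope on the weights alone (parameter count is illustrative, roughly GPT-2 small):

```python
n_params = 124_000_000  # illustrative parameter count

fp32_weights_mb = n_params * 4 / 2**20  # 4 bytes per parameter
fp16_weights_mb = n_params * 2 / 2**20  # 2 bytes per parameter

print(round(fp32_weights_mb), round(fp16_weights_mb))
```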

Use BF16  --------------------------

For 1 x A100 GPU

Error

For 2 x A100 GPU

Error

Use TF32  --------------------------

For 1 x A100 GPU

Error

For 2 x A100 GPU

Error
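The BF16/TF32 errors are most likely a version issue rather than the hardware: the A100 is Ampere and supports both, but the `bf16`/`tf32` training arguments only exist in recent transformers releases and need a recent PyTorch (>= 1.10, as far as I know). A hedged configuration sketch, not tested on my setup; on older versions these keyword arguments raise a `TypeError`, which may be the error seen above:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,  # bfloat16 mixed precision (Ampere+, recent torch)
    tf32=True,  # TF32 matmuls on Ampere tensor cores
)
```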

Use AdaFactor Optimizer without other arguments -----------

For 1 x A100 GPU

* Time: 232.71 <-- Less time - 8th
* Samples/second: 2.20
* GPU memory occupied: 12,876 MB.

For 2 x A100 GPU

* Time: 543.09 <-- Less time - 5th
* Samples/second: 0.94
* GPU 1 memory occupied: 13,430 MB.
* GPU 2 memory occupied: 9,442 MB.

Use AdaFactor Optimizer with other arguments (gradient_checkpointing + gradient_accumulation_steps + FP16) -------------

For 1 x A100 GPU

* Time: 118.11 <-- Less time - 4th
* Samples/second: 4.33
* GPU memory occupied: 12,876 MB.

For 2 x A100 GPU

* Time: 1832.84 <-- Less time - 8th
* Samples/second: 0.28
* GPU 1 memory occupied: 9,554 MB.
* GPU 2 memory occupied: 4,920 MB.
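For completeness, Adafactor is enabled through `TrainingArguments`; in recent transformers versions via `optim="adafactor"` (older releases used `adafactor=True`). A sketch, assuming a recent release:

```python
from transformers import TrainingArguments

# Adafactor stores factored second moments, cutting optimizer-state memory
# versus Adam, but it often trains more slowly, consistent with the times above.
args = TrainingArguments(
    output_dir="out",
    optim="adafactor",
)
```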

Use adam_bnb_optim Optimizer without other arguments ----------

For 1 x A100 GPU

* Time: 224.91 <-- Less time - 7th
* Samples/second: 2.28
* GPU memory occupied: 13,256 MB.

For 2 x A100 GPU

* Time: 535.81 <-- Less time - 4th
* Samples/second: 0.96
* GPU 1 memory occupied: 10,316 MB.
* GPU 2 memory occupied: 5,140 MB.

Use adam_bnb_optim Optimizer with other arguments (gradient_checkpointing + gradient_accumulation_steps + FP16) ------------

For 1 x A100 GPU

* Time: 102.83 <-- Less time - 2nd
* Samples/second: 4.98
* GPU memory occupied: 13,266 MB.

For 2 x A100 GPU

* Time: 1817.75 <-- Less time - 6th
* Samples/second: 0.28
* GPU 1 memory occupied: 10,326 MB.
* GPU 2 memory occupied: 5,140 MB.
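For reference, the 8-bit Adam runs follow the pattern from the single-GPU document: the optimizer is built manually with bitsandbytes and handed to Trainer. A sketch under that assumption (requires bitsandbytes and a GPU; `model` and `train_dataset` are placeholders for the objects in the training script):

```python
import bitsandbytes as bnb
from transformers import Trainer, TrainingArguments

# 8-bit Adam quantizes the optimizer state, which is why the
# "GPU memory occupied" figures drop versus the default AdamW runs above.
adam_bnb_optim = bnb.optim.Adam8bit(model.parameters(), lr=2e-5)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    optimizers=(adam_bnb_optim, None),  # (optimizer, lr_scheduler)
)
```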