Question about FP16/32, LoRA and GPU Memory Usage

Hello friends.

Looking for some feedback, observations and answers here.

I am trying to wrap my head around a few things on GPU memory usage and execution time. I ran a few experiments:

A training run for sequence classification with RoBERTa base and SST-2.
I looked at all the combinations of (a) fp16 vs fp32 and (b) full finetuning vs LoRA.

Further, after reading bits and pieces of documentation, my understanding is that with fp16 the model's weights are kept in fp32 and fp16 simultaneously, so we should only expect to see the majority of the savings in the gradients, and the size of those savings depends on the length of the inputs and/or large batch sizes. Is that right? Hence I also compared (c) normal-length SST-2 inputs vs 8× length SST-2 inputs.
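For reference, here is a back-of-envelope sketch of the static (weights, gradients, optimizer states) memory under the two precisions. It assumes roughly 125M parameters for RoBERTa-base, plain Adam, and the classic mixed-precision recipe with an extra fp16 weight copy and fp16 gradients next to fp32 master weights; PyTorch-native AMP differs in the details. The point is only that the static parts barely change, so any savings have to come from the parts that scale with the inputs:

# Back-of-envelope only: ~125M parameters, Adam with two fp32 moment buffers,
# activations not included. Exact behaviour depends on the AMP implementation.
N, MIB = 125_000_000, 1024 ** 2

fp32_weights = N * 4 / MIB      # master weights (kept in fp32 either way)
fp16_weights = N * 2 / MIB      # extra half-precision copy under mixed precision
fp32_grads   = N * 4 / MIB
fp16_grads   = N * 2 / MIB
adam_states  = N * 8 / MIB      # exp_avg + exp_avg_sq, fp32

print(f"fp32 training: {fp32_weights + fp32_grads + adam_states:.0f} MiB")
print(f"fp16 training: {fp32_weights + fp16_weights + fp16_grads + adam_states:.0f} MiB")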

Question 1: But is that assumption even true? With more inputs I can see the memory use for gradients going up if the embeddings are tunable, but with LoRA the embeddings are frozen.
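One way to check the frozen-embeddings part directly is to inspect requires_grad on the PEFT-wrapped model, for example (r and alpha here are just placeholder values, and the default target modules are assumed):

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model = get_peft_model(base, LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16))

# Only the LoRA matrices and the classifier head should show up as trainable.
model.print_trainable_parameters()

# The embeddings should be frozen, i.e. no gradient buffers are allocated for them.
for name, p in model.named_parameters():
    if "embeddings" in name:
        print(name, "requires_grad =", p.requires_grad)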

Question 2: On a more technical side, if the inputs do increase the storage needed for gradients somehow, would I see this effect in the same way regardless of whether I increase the batch size or the input sequence length (as I did in (c))? Increasing the batch size with short inputs should be more efficient than a smaller batch size with longer inputs, because of the quadratic complexity of attention.
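To put rough numbers on the attention part: the attention score/probability tensors scale with batch_size × seq_len², while most other activations scale with batch_size × seq_len, so scaling the sequence length by 8 costs much more than scaling the batch by 8. A quick illustration with RoBERTa-base shapes (12 layers, 12 heads; the batch size and lengths are made up for the example):

# Memory of the attention probability tensors only (fp16, 2 bytes per value);
# all other activations are ignored, this is just a scaling illustration.
layers, heads, bytes_per = 12, 12, 2
MIB = 1024 ** 2

def attn_scores_mib(batch, seq_len):
    return layers * heads * batch * seq_len * seq_len * bytes_per / MIB

print(attn_scores_mib(batch=32, seq_len=64))    # baseline
print(attn_scores_mib(batch=256, seq_len=64))   # 8x batch  -> 8x memory
print(attn_scores_mib(batch=32, seq_len=512))   # 8x length -> 64x memory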

Question 3: Also, I am wondering whether it is correct to measure GPU memory the way I do and to just take the final value of a training run. Since, after a little back and forth at the start, the value stays at the same level for many epochs before the end, I thought taking only the last value is a valid simplification.
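Note that nvidia-smi/pynvml reports the whole process's framebuffer usage, including the CUDA context and whatever PyTorch's caching allocator is holding on to. An alternative that sidesteps the "which sample do I take" question is to ask PyTorch's allocator for the peak directly, roughly like this (this only covers memory allocated through PyTorch, not the CUDA context itself):

import torch

torch.cuda.reset_peak_memory_stats()

# ... run the training loop here ...

peak_alloc = torch.cuda.max_memory_allocated() / 1024 ** 2
peak_reserved = torch.cuda.max_memory_reserved() / 1024 ** 2
print(f"peak allocated: {peak_alloc:.0f} MiB, peak reserved: {peak_reserved:.0f} MiB")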

Besides that timing issue, here is the mechanical part that produces the measurements, of which I take the last value for comparisons:

import logging
import time
from threading import Thread

import torch

logger = logging.getLogger(__name__)


if __name__ == "__main__":

    def schedule_gpu_memory_logging():
        def log_gpu_usage():
            if not torch.cuda.is_available():
                return

            from pynvml.smi import nvidia_smi

            # Query the framebuffer memory usage of the first GPU via NVML.
            nvsmi = nvidia_smi.getInstance()
            res = nvsmi.DeviceQuery("memory.free, memory.total, memory.used")["gpu"][0][
                "fb_memory_usage"
            ]
            res["percentage"] = res["used"] / res["total"] * 100
            logger.info(
                f'GPU Usage. Used: {res["used"]:5.3f} Total: {res["total"]:5.3f} '
                f'({res["percentage"]:3.1f}% used). Free: {res["free"]:5.3f}'
            )

        def log_loop():
            # Log GPU memory usage every 30 seconds for the lifetime of the process.
            while True:
                log_gpu_usage()
                time.sleep(30)

        # Daemon thread, so the logger does not keep the process alive after training ends.
        t = Thread(target=log_loop, daemon=True)
        t.start()

    schedule_gpu_memory_logging()

Here are the results of the experiment:

input_scale with a value of 1 stands for the standard-length inputs of SST-2; 8 stands for those same inputs repeated another 7 times, i.e. longer input sequences.

All of the above use the same batch size, to make absolute comparisons easier.

Here is a Notebook with the code, more graphs and the logs.

Question 4: With LoRA, the weights of the full model are not tuned, except in the classifier. There should only be gradients of the loss with respect to the LoRA matrices (and the classifier). And the forward pass will also use memory for the full base model, including the classifier, plus the LoRA matrices.

If that is the case, then the amount of memory consumed should mostly grow with more inputs, either a higher batch size or longer inputs; I used longer inputs here.
But the results look like the exact opposite. With FP16 (left graph below), LoRA takes more memory than full finetuning with long inputs (8× SST-2), but with FP32 (right graph below) it is the other way round.
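One thing worth keeping in mind here: even though the base weights are frozen, their activations are still kept during the forward pass, because the backward pass has to flow through the frozen layers to reach the LoRA matrices. So activation memory grows with sequence length either way. A sketch to observe this directly (assuming roberta-base, peft's default LoRA targets, and an arbitrary 32 × 512 batch; absolute numbers will vary):

import torch
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

model = get_peft_model(
    AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2),
    LoraConfig(task_type=TaskType.SEQ_CLS, r=8),
).cuda()

batch = {
    "input_ids": torch.randint(0, 50_000, (32, 512), device="cuda"),
    "attention_mask": torch.ones(32, 512, dtype=torch.long, device="cuda"),
}

def peak_forward_mib(grad_enabled):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.set_grad_enabled(grad_enabled):
        model(**batch)
    return torch.cuda.max_memory_allocated() / 1024 ** 2

# With grad enabled, the activations of the frozen base model are kept alive for
# the backward pass through the LoRA matrices, so the peak is much higher.
print("forward only          :", peak_forward_mib(False))
print("forward + backward graph:", peak_forward_mib(True))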

With short inputs (input_scale = 1), LoRA consumes less memory than full finetuning with both FP16 and FP32.

Re-running it with r=1 (and peft 0.4), the LoRA memory needed is now always below the full-finetuning memory needed.
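For context on why r matters: LoRA adds r × (d_in + d_out) parameters per adapted weight matrix, so the adapter weights, their gradients, and their optimizer states all shrink linearly with r. With RoBERTa-base shapes and assuming peft's default of adapting only the query and value projections, that works out to:

# LoRA adds A (r x d_in) and B (d_out x r) per adapted linear layer.
d, layers, adapted_per_layer = 768, 12, 2   # query + value projections per layer

def lora_param_count(r):
    return layers * adapted_per_layer * (r * d + d * r)

for r in (8, 1):
    print(f"r={r}: {lora_param_count(r):,} LoRA parameters")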