I tried QLoRA and it works fine for the StarCoder model with a small context length of 1K on a single A100 40GB GPU, using int4 quantization.
But I want to fine-tune with an 8K context length, and even when I specify more GPUs I am not able to push the context length to 8K.
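For reference, this is roughly how I load the model in 4-bit for QLoRA (the model ID, LoRA hyperparameters, and target module names below are illustrative of my setup, not exact values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit (NF4) quantization config -- values are illustrative of my setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",            # model ID is illustrative
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA on the attention/projection layers of the GPTBigCode architecture;
# rank/alpha/dropout here are placeholders, not my exact values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "c_proj"],
)
model = get_peft_model(model, lora_config)
```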
I tried device_map = 'auto', but that didn't work well, so I tried a manual device map:
device_map = {
    'transformer.wte': 0,
    'transformer.wpe': 0,
    'transformer.drop': 0,
    'transformer.h.0': 0,
    'transformer.h.1': 0,
    'transformer.h.2': 1,
    'transformer.h.3': 1,
    'transformer.h.4': 1,
    'transformer.h.5': 1,
    'transformer.h.6': 1,
    'transformer.h.7': 1,
    'transformer.h.8': 1,
    'transformer.h.9': 1,
    'transformer.h.10': 2,
    'transformer.h.11': 2,
    'transformer.h.12': 2,
    'transformer.h.13': 2,
    'transformer.h.14': 2,
    'transformer.h.15': 2,
    'transformer.h.16': 2,
    'transformer.h.17': 3,
    'transformer.h.18': 3,
    'transformer.h.19': 3,
    'transformer.h.20': 3,
    'transformer.h.21': 3,
    'transformer.h.22': 3,
    'transformer.h.23': 3,
    'transformer.h.24': 3,
    'transformer.h.25': 4,
    'transformer.h.26': 4,
    'transformer.h.27': 4,
    'transformer.h.28': 4,
    'transformer.h.29': 4,
    'transformer.h.30': 4,
    'transformer.h.31': 4,
    'transformer.h.32': 4,
    'transformer.h.33': 5,
    'transformer.h.34': 5,
    'transformer.h.35': 5,
    'transformer.h.36': 5,
    'transformer.h.37': 5,
    'transformer.h.38': 5,
    'transformer.h.39': 5,
    'transformer.ln_f': 5,
    'lm_head': 0,
}
This is for 6 GPUs. With it I was able to train with a 6K context length, but not 8K.
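For reference, I pass the manual map when loading the model, roughly like this (device_map is the dict above, and bnb_config is the same 4-bit config as in the first snippet; the variable names are just for illustration):

```python
from transformers import AutoModelForCausalLM

# device_map is the per-layer dict shown above; bnb_config is the 4-bit
# BitsAndBytesConfig from the earlier snippet (names are illustrative)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map=device_map,
)
```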
Why does memory usage increase so rapidly as the context length grows?
I wanted to try CPU offloading with DeepSpeed and FSDP, but when I try, it doesn't work with quantization.
Is it possible to train a model using DeepSpeed or FSDP together with quantization, or not?
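For context, the kind of DeepSpeed CPU-offload setup I was attempting looks roughly like this; the ZeRO stage and offload settings are just what I experimented with, and I have not gotten this to run together with 4-bit quantization:

```python
from transformers import TrainingArguments

# Sketch of the DeepSpeed CPU-offload config I was experimenting with
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="./starcoder-qlora-8k",  # illustrative output path
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```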
What am I doing wrong?
Can someone help me with this? Suggestions are greatly appreciated.