What is the difference between using Flash Attention 2 via
model = AutoModelForCausalLM.from_pretrained(ckpt, attn_implementation="sdpa")
vs
model = AutoModelForCausalLM.from_pretrained(ckpt, attn_implementation="flash_attention_2")
when PyTorch's SDPA supports FA2 according to the docs?