It works at 512x512 but fails at 512x768, with this error in the vae_encoder step:
***** Compiling vae_encoder *****
...........
[GCA035] Instruction: I-5715-0 with opcode: TensorTensor couldn't be allocated in SB
Memory Location Accessed:
add.1_reload_7077_i0: 196608 Bytes per Partition and total of: 25165824 Bytes in SB
_add.1104-t7919_i0: 4 Bytes per Partition and total of: 512 Bytes in SB
add.6_i0: 2048 Bytes per Partition and total of: 262144 Bytes in SB
Total Accessed Bytes per partition by instruction: 198660
Total SB Partition Size: 196608
- Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
An error occured when trying to trace vae_encoder with the error message: neuronx-cc failed with 70.
The export is failed and vae_encoder neuron model won't be stored.
Do I need any other parameters, or is it a bug that needs fixing? I’m running it on an AWS inf2.2xlarge instance.
Somehow I managed to export an SD1.5 checkpoint at 512x768 a couple of weeks ago, but I’m unable to reproduce it now. Could this be a regression in optimum-neuron or neuronx-cc?
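For anyone reading the GCA035 message above, here is a small sketch of the arithmetic behind it. The byte counts are copied from the log; the partition count of 128 is an assumption inferred from the log's own totals (25165824 / 196608), and the variable and function names are just illustrative.

```python
# Per-partition byte counts copied from the GCA035 message.
SB_PARTITION_SIZE = 196608  # "Total SB Partition Size" from the log (192 KiB)
NUM_PARTITIONS = 128        # assumption: inferred from 25165824 / 196608

# operand name -> (bytes per partition, total bytes in SB), as logged
operands = {
    "add.1_reload_7077_i0": (196608, 25165824),
    "_add.1104-t7919_i0": (4, 512),
    "add.6_i0": (2048, 262144),
}


def summarize(ops, partition_size):
    """Return (bytes needed per partition, bytes over the partition size)."""
    needed = sum(per_partition for per_partition, _ in ops.values())
    return needed, needed - partition_size


needed, overflow = summarize(operands, SB_PARTITION_SIZE)
print(f"needed per partition: {needed} bytes")
print(f"partition size:       {SB_PARTITION_SIZE} bytes")
print(f"over by:              {overflow} bytes")

# Sanity check: each logged total equals per-partition bytes times 128.
for name, (per_partition, total) in operands.items():
    assert per_partition * NUM_PARTITIONS == total, name
```

If I read the log correctly, the first operand alone already fills a whole partition, so the instruction cannot fit no matter how small the other two operands are.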
@Jingya sorry to bother you directly, but you’re always so helpful.
Any idea why compiling the model for any resolution other than 512x512 fails?
I have one Neuron model that I was able to compile for 512x768 a few weeks ago but I no longer have the setup and don’t remember the exact command, and now it always fails.
Is it something that can be fixed? Or am I doing something wrong?
I got this when compiling the VAE encoder of an SD model with unequal height and width, using Neuron SDK 2.19.1 on an inf2.8xlarge instance.
[NLA001] Unhandled exception with message: === BIR error ===
Reason: Access pattern out of bound.
Instruction: identity_pool_1_I-5532-441602-tc
Opcode: TensorCopy
Instruction Source: (float32<128 x 1027> $5532[i2_369_0_0, i2_369_0_1, i1_370_6433, i3_369_0_6433, i3_369_1_0_6433_0_0, i3_369_1_0_6433_0_1, i3_369_1_0_6433_1, i3_369_1_1_0_6433_0, i3_369_1_1_0_6433_1, i3_369_1_1_1_6433_0, i3_369_1_1_1_6433_1, i2_370_6433]:5532)0:
Argument AP:
Access Pattern: [[2051,64],[1,1],[1,1027]]
Offset: 1028
Memory Location: {add.11_VN_191_ReloadStore111619}@SB<0,175096>(128x8204)#Internal DebugInfo: <add.11||UNDEF||[128, 2051, 1]>
- Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
I will ask the Annapurna folks. It would also be super helpful if you could share the env where you succeeded in compiling it!
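In case it helps with sharing an env, here is a minimal sketch that prints the installed versions of the packages most likely to matter for this kind of report. The package list is my assumption about what is relevant here, not an official checklist, and the helper names are made up for the example.

```python
import platform
from importlib import metadata

# Assumed list of relevant distributions; adjust to your setup.
PACKAGES = [
    "optimum-neuron",
    "neuronx-cc",
    "torch-neuronx",
    "torch",
    "diffusers",
    "transformers",
]


def collect_versions(packages):
    """Map each distribution name to its installed version, if any."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_report(versions):
    """Build a plain-text report that can be pasted into a forum post."""
    lines = [f"python: {platform.python_version()}"]
    lines += [f"{name}: {version}" for name, version in versions.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(collect_versions(PACKAGES)))
```

Running this in a working env and saving the output next to the compiled model makes it much easier to bisect a regression later.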
Hi @Jingya, thanks for confirming the issue. Unfortunately I can’t find my old virtual env with the versions that worked. I think it was on my spot instance, which is now gone.
No worries. I talked with the Annapurna team, and they are working on a fix for the compiler regression. Thanks again for letting us know. I will add a unit test for unequal width/height once the patch is out.