Hi @alkibijad! Glad to know that you are making progress
Regarding your observation about the outputs, you are right: it is normal that the results are not numerically identical to the ones from PyTorch. Even if the weights are stored in 32-bit format, that doesn't mean the model was converted with 32-bit precision, or that it runs with it.
The default conversion procedure is not guaranteed to happen in float 32, as described in this documentation. Furthermore, execution precision depends on the hardware you run your model on. Another factor that adds some confusion is the model format (the legacy "Neural Network" format vs. the newer "ML Program"). If you convert to Neural Network, inference precision will be 32-bit when the model (or portions of it) runs on CPU, but 16-bit on GPU and the Neural Engine. With ML Program, you can also run 32-bit precision on GPU (but not on the Neural Engine).
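For reference, here's a minimal coremltools sketch of converting to each format; the tiny stand-in model, the input name, and the shape are just placeholders for your own traced module:

```python
import torch
import coremltools as ct

# Tiny stand-in model; substitute your own traced module.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Legacy "Neural Network" format: FP32 on CPU, FP16 on GPU / Neural Engine.
mlmodel_nn = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input", shape=example_input.shape)],
    convert_to="neuralnetwork",
)

# Newer "ML Program" format: FP16 by default, FP32 also possible on CPU and GPU.
mlmodel_prog = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input", shape=example_input.shape)],
    convert_to="mlprogram",
)
```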
You can force conversion to happen in 32-bit mode. With the ML Program format, you can even convert most operations to 16-bit precision while preserving selected ones in 32-bit, as sketched below.
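Here's a hedged sketch of both options, continuing from the placeholder `traced_model` above. The op types kept in 32-bit are arbitrary examples, not a recommendation for your model:

```python
import coremltools as ct

# Option 1: force the whole ML Program to 32-bit precision.
mlmodel_fp32 = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input", shape=example_input.shape)],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT32,
)

# Option 2: convert to FP16, but keep selected ops in FP32 via an op selector.
def cast_to_fp16(op):
    # Return False for ops that should stay in FP32; True means casting to FP16 is OK.
    return op.op_type not in ("softmax", "reduce_mean")

mlmodel_mixed = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input", shape=example_input.shape)],
    convert_to="mlprogram",
    compute_precision=ct.transform.FP16ComputePrecision(op_selector=cast_to_fp16),
)
```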
To verify the correctness of the conversion, first run the converted model on CPU (by selecting the appropriate compute units) and measure the error with respect to the outputs of the original model, without expecting numerical equivalence. This article suggests a signal-to-noise metric to measure the difference. Then run predictions with the GPU and/or Neural Engine enabled and measure the quality again. It's usually fine to use the conversion defaults and run in 16-bit mode on all eligible devices, but depending on your project (and the model) you might need to override some settings or resort to more exotic things such as per-op precision specification using typed tensors.
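Something along these lines should work for that comparison (it reuses `model` and `mlmodel_prog` from the sketches above; the SNR formula and the tensor names are assumptions, and `predict` needs to run on a Mac):

```python
import numpy as np
import torch
import coremltools as ct

def snr_db(reference, candidate):
    """Signal-to-noise ratio in dB between reference and converted outputs (higher is better)."""
    reference = np.asarray(reference, dtype=np.float64).ravel()
    candidate = np.asarray(candidate, dtype=np.float64).ravel()
    noise = reference - candidate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-20))

# Reference output from the original PyTorch model.
x = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    reference = model(x).numpy()

# Save the converted model and load it with different compute units.
mlmodel_prog.save("model.mlpackage")
cpu_model = ct.models.MLModel("model.mlpackage", compute_units=ct.ComputeUnit.CPU_ONLY)
all_model = ct.models.MLModel("model.mlpackage", compute_units=ct.ComputeUnit.ALL)

x_np = x.numpy()
cpu_out = list(cpu_model.predict({"input": x_np}).values())[0]
all_out = list(all_model.predict({"input": x_np}).values())[0]

print("CPU-only SNR (dB): ", snr_db(reference, cpu_out))
print("All-units SNR (dB):", snr_db(reference, all_out))
```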
Let us know how it goes