Profiling all layers of a model

I want to profile every layer of a model: execution time, memory usage, and performance metrics (IPC, for instance).

From a PyTorch perspective, there are the PyTorch profiler (see the PyTorch Profiler tutorial in the PyTorch Tutorials 2.2.0+cu121 documentation) and forward/backward hooks on layers (as far as I can tell, a hook by itself won’t let me measure the layer, only track when it starts).
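
A possible workaround, sketched here rather than offered as a definitive answer, is to pair a forward pre-hook with a forward hook so that the two together bracket each layer's forward pass. The small `nn.Sequential` model and the helper name `attach_timing_hooks` are assumptions made up for illustration; they stand in for the real model.

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn


def attach_timing_hooks(model):
    """Accumulate wall-clock forward time per leaf module (hypothetical helper)."""
    timings = defaultdict(float)  # layer name -> total seconds
    starts = {}                   # layer name -> start timestamp
    handles = []

    def make_pre(name):
        def pre_hook(module, inputs):
            # Fires just before the module's forward() runs.
            starts[name] = time.perf_counter()
        return pre_hook

    def make_post(name):
        def post_hook(module, inputs, output):
            # Fires just after the module's forward() returns.
            timings[name] += time.perf_counter() - starts.pop(name)
        return post_hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            handles.append(module.register_forward_pre_hook(make_pre(name)))
            handles.append(module.register_forward_hook(make_post(name)))
    return timings, handles


# Stand-in model (assumption); a Hugging Face model would be passed the same way.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
timings, handles = attach_timing_hooks(model)

with torch.no_grad():
    model(torch.randn(8, 64))

for handle in handles:
    handle.remove()  # detach the hooks once measuring is done

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us")
```

This only covers wall-clock time of the forward pass on CPU. On a GPU, kernels are launched asynchronously, so the timestamps would not be meaningful without synchronizing (for example with `torch.cuda.synchronize()`) around each measurement. Memory and IPC are not captured by this sketch.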

The PyTorch profiler seems like a good approach, but for models from Hugging Face it fails to provide useful information about the layers.
For instance, when I use the PyTorch profiler with the GPT-J model, I get the following output, which shows no layers, only auxiliary functions:

----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
               forward        90.29%     558.000us        94.34%     583.000us     583.000us             1  
           aten::zeros         5.02%      31.000us         5.66%      35.000us      35.000us             1  
          aten::unbind         1.62%      10.000us         2.43%      15.000us      15.000us             1  
          aten::detach         0.49%       3.000us         1.29%       8.000us       8.000us             1  
          aten::select         0.65%       4.000us         0.81%       5.000us       5.000us             1  
                detach         0.81%       5.000us         0.81%       5.000us       5.000us             1  
           aten::empty         0.65%       4.000us         0.65%       4.000us       2.000us             2  
           aten::zero_         0.16%       1.000us         0.16%       1.000us       1.000us             1  
      aten::as_strided         0.16%       1.000us         0.16%       1.000us       1.000us             1  
              aten::to         0.16%       1.000us         0.16%       1.000us       1.000us             1  
    aten::resolve_conj         0.00%       0.000us         0.00%       0.000us       0.000us             1  
     aten::resolve_neg         0.00%       0.000us         0.00%       0.000us       0.000us             1  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 618.000us
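
One way to make layers show up in a table like this, given as a sketch under assumptions rather than a confirmed solution, is to open a named `torch.profiler.record_function` range around each layer using hooks. The `layer::` prefix, the helper name `add_profiler_labels`, and the small stand-in model are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function


def add_profiler_labels(model, prefix="layer::"):
    """Wrap each leaf module's forward in a named profiler range (hypothetical helper)."""
    open_ranges = {}
    handles = []

    def make_pre(name):
        def pre_hook(module, inputs):
            # Open a labelled range just before the layer runs.
            rng = record_function(prefix + name)
            rng.__enter__()
            open_ranges[name] = rng
        return pre_hook

    def make_post(name):
        def post_hook(module, inputs, output):
            # Close the range right after the layer returns.
            open_ranges.pop(name).__exit__(None, None, None)
        return post_hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            handles.append(module.register_forward_pre_hook(make_pre(name)))
            handles.append(module.register_forward_hook(make_post(name)))
    return handles


# Stand-in model (assumption); the real model would be labelled the same way.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
handles = add_profiler_labels(model)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with torch.no_grad():
        model(torch.randn(8, 64))

for handle in handles:
    handle.remove()

# Rows whose name starts with "layer::" correspond to individual layers.
labelled = [e for e in prof.key_averages() if e.key.startswith("layer::")]
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```

With `profile_memory=True` the table also gains memory columns. For a GPU run, `ProfilerActivity.CUDA` can be added to `activities`; whether this gives everything needed across platforms (IPC in particular) is not something this sketch settles.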

What would be the right approach to profile all layers? Let me add that I’m running on CPU only.
What would the approach be when running on a GPU? Is there a cross-platform mechanism?