Optimising performance of non-standard systems

I am experimenting with deploying a question answering system using Hugging Face pretrained models.
I am finding it difficult to get consistent results and performance.
The system performs inference over k items retrieved by Elasticsearch (k ≈ 50) and extracts answers from them for a given question, so for each query we run inference over 50 (question, text) pairs.

I have three versions: one that simply iterates over the 50 pairs, one that uses the HF pipeline, and one where I load all pairs into a single 3D tensor and compute over them simultaneously.
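For context, the loop version and the batched-tensor version look roughly like the sketch below; the checkpoint name, max length and padding settings are placeholders rather than my exact configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Placeholder checkpoint - my actual model differs, this is just to show the structure
MODEL_NAME = "distilbert-base-cased-distilled-squad"

device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(device).eval()


@torch.no_grad()
def answer_loop(question, contexts):
    """Version 1: iterate over the ~50 (question, context) pairs one at a time."""
    results = []
    for ctx in contexts:
        enc = tokenizer(question, ctx, truncation=True, max_length=384,
                        return_tensors="pt").to(device)
        out = model(**enc)
        results.append((out.start_logits.cpu(), out.end_logits.cpu()))
    return results


@torch.no_grad()
def answer_batched(question, contexts):
    """Version 3: tokenize all pairs into one padded batch and run a single forward pass."""
    enc = tokenizer([question] * len(contexts), contexts, truncation=True,
                    max_length=384, padding=True, return_tensors="pt").to(device)
    out = model(**enc)
    # The single forward pass is very fast; the slow part for me is getting
    # the logits back onto the CPU for post-processing.
    return out.start_logits.cpu(), out.end_logits.cpu()
```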

Some interesting findings (all on a K80 unless otherwise stated):
My expectation was that the 3D batched tensor would be fastest. The inference phase is, by a factor of ~100x (taking 0.07s), but it takes so long to move the tensors back to the CPU for post-processing that this version takes around 11s overall.
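In case it's relevant, here is roughly how I time the two phases separately. Since CUDA calls in PyTorch are asynchronous, I synchronise explicitly so that queued kernel time isn't silently attributed to whichever call blocks first (such as the `.cpu()` copy); the usage lines refer to the names in the sketch above:

```python
import time
import torch

def timed(fn):
    """Run fn() and report wall time with explicit CUDA syncs, so pending
    kernels are not attributed to whichever call happens to block first."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Hypothetical usage (model and enc as in the batched sketch above):
# out, infer_s = timed(lambda: model(**enc))
# logits, copy_s = timed(lambda: (out.start_logits.cpu(), out.end_logits.cpu()))
```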
The HF pipeline is by far the slowest solution, taking around twice as long as the batched solution (22s). Whilst the pipeline accepts batches, it has no optimisation for this and simply loops through the items. Although I haven't done in-depth profiling of the pipeline code, it seems that the additional time likely derives from converting the inputs to SQuAD features.
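For reference, I'm calling the pipeline roughly like this (the checkpoint and example data below are placeholders), passing all the pairs in a single call:

```python
from transformers import pipeline

# Placeholder checkpoint; device=0 puts the pipeline on the GPU.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad",
              device=0)

question = "Who wrote the paper?"            # illustrative only
contexts = ["Some retrieved passage."] * 50  # stand-in for the Elasticsearch hits

# One call with all ~50 (question, context) pairs; as far as I can tell the
# pipeline still processes them one at a time internally.
answers = qa([{"question": question, "context": c} for c in contexts])
```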
The surprise winner is my simple loop through the pairs, which comes in at around 7s.

A few questions:

  • Given that the model inference is so fast in batch mode, does anyone have any tips for speeding up the slow tensor transfer back to the CPU?
  • Does HF intend to provide more optimised pipelines for production use? Am I using this incorrectly somehow? My base assumption was that the HF implementation would be best (and would support proper batching).
  • Any other pointers towards a better solution design?
  • Any good tools for detailed profiling of ML/PyTorch systems? It has been a really slow, difficult process finding these bottlenecks.

Many thanks!
Justin

I am also very interested in this topic, as I am having performance issues with pipelines as well, so I'm bumping this in the hope of seeing some expert input!

If not pipelines or the other methods above, what is the best strategy for getting maximum performance when running a model continuously?

Did you find an acceptable solution? I'm also struggling to optimise inference for production use.