How to run Phi3 with candle-onnx/Rust?

I’m currently trying to get a simple CLI-chat example working using candle-onnx and Phi-3-mini-128k-instruct-onnx. My goal was to go through ONNX so I have access to all the models available in this format, and I chose Phi because a native candle example for this model already exists.

Now I have the problem that I haven’t found any examples for candle-onnx except the one in the candle repo, which is about image analysis. So basically I’m stuck on how to get it running, probably because the weights are not loaded. Most likely I just don’t know how to initialize the model correctly, which inputs are required, and how they have to be provided.

Are there any examples out there or some more detailed usage docs for candle-onnx? Or is it simply not widely used yet?

Steps to Get Phi3 Running:

Load the ONNX Model:
Rust

// candle-onnx exposes free functions (read_file, simple_eval) rather than a model struct.
let model_path = "path/to/Phi3-mini-128k-instruct-onnx.onnx";
let model = candle_onnx::read_file(model_path).expect("Failed to load ONNX model");


Replace "path/to/Phi3-mini-128k-instruct-onnx.onnx" with the actual location of your downloaded model file.

Prepare the Input:

Identify Input Tensors: Consult the Phi-3 model's documentation (if available) or use a tool like netron to visualize the model's structure and determine the expected input tensor names and shapes. Common input names for text models include "input_ids" or "input_sequence".
Create Input Data: Prepare your user input as a candle tensor with the correct data type and shape based on the identified input requirements. You will likely need to tokenize the text first, depending on the model's expectations (see the sketch after this list).
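As a rough sketch of the "Create Input Data" step, using the `tokenizers` crate and a candle tensor. The file name "path/to/tokenizer.json", the sample prompt, and the single `input_ids` input are assumptions; the Phi-3 ONNX release ships its own tokenizer files, and the export most likely also expects an attention mask and past key/value tensors, so check the input names printed from the graph first:

Rust

use candle_core::{Device, Tensor};
use tokenizers::Tokenizer;

// Hypothetical paths and prompt; adjust to your local files.
let tokenizer = Tokenizer::from_file("path/to/tokenizer.json").expect("Failed to load tokenizer");
let encoding = tokenizer.encode("Hello, how are you?", true).expect("Tokenization failed");
let ids: Vec<i64> = encoding.get_ids().iter().map(|&id| id as i64).collect();

// Decoder-only exports usually expect input_ids as a 2-D (batch, sequence) i64 tensor.
let device = Device::Cpu;
let seq_len = ids.len();
let input_ids = Tensor::from_vec(ids, (1, seq_len), &device).expect("Failed to build input tensor");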

Run Inference:
Rust

// simple_eval takes a map from graph input names to candle tensors.
let mut inputs = std::collections::HashMap::new();
inputs.insert("input_ids".to_string(), input_ids); // your prepared input tensor
let mut outputs = candle_onnx::simple_eval(&model, inputs).expect("Failed to run inference");

Replace input_ids with the tensor you prepared in the previous step; the map key must match the graph's declared input name, and Phi-3 will likely want more than one input (e.g. an attention mask and past key/values).
The outputs map contains one tensor per graph output, keyed by output name, holding the model's predictions for the user input.

Process the Output:

Depending on the output format (e.g., probabilities, logits), interpret and potentially post-process the output to generate a human-readable response for your CLI chat.
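For a text model like Phi-3 the interesting output is usually a logits tensor. Here is a minimal greedy-decoding sketch, assuming the first declared graph output holds logits of shape (batch, sequence, vocab) and reusing the graph and tokenizer values from the earlier steps; a real chat loop would append the chosen token to the prompt (or feed the cached key/values) and run inference again:

Rust

use candle_core::IndexOp;

// Assumption: the first declared output is the logits tensor (batch, sequence, vocab).
let logits = outputs
    .remove(&graph.output[0].name)
    .expect("Missing logits output");
let (_batch, seq_len, _vocab) = logits.dims3().expect("Expected a 3-D logits tensor");

// Greedy decoding: take the last position's logits and pick the highest-scoring token id.
let last_logits = logits.i((0, seq_len - 1)).expect("Indexing failed");
let next_token_id = last_logits
    .argmax(0)
    .expect("Argmax failed")
    .to_scalar::<u32>()
    .expect("Not a scalar");
let next_token = tokenizer.decode(&[next_token_id], true).expect("Decoding failed");
println!("next token: {next_token}");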

Hi Campbell, thank you very much. During my experiments over the weekend to find a working solution I also tried burn and ort, without success. With candle-onnx I gave up after I realized (or at least that’s what I concluded from the code) that it can’t load the external-data files holding the weights that ship with the bigger models, like Phi-3.

With burn and ort (using the latest 2.0-rc) I also had no success, I think because of unimplemented ops required by Phi-3. I then tried Llama-3 with ort, but again it failed with missing ops. I only got an older GPT-2 model running, at least some proof that “something” works. :wink:

Now I’m focused on the candle implementation of the Phi model. I got the example provided by candle running (Phi-2), albeit with horrible performance on my notebook without a GPU (1 token per minute). One reason is that candle only uses a single core. But I also remember that half a year ago I tried the Stable Diffusion example, and it was about 100 times slower than what my colleague got on a comparable machine using GPT4All.

So, it looks like I’m still at the beginning of the journey.