How to run Phi3 with candle-onnx/Rust?

I’m currently trying to get a simple CLI-chat example working using candle-onnx and Phi-3-mini-128k-instruct-onnx. My goal was to go through ONNX so I have access to all the models available in this format, and I chose Phi because a native candle example for this model already exists.

Now I have the problem that I haven’t found any examples for candle-onnx except the one in the candle repo, which is about image analysis. So basically I’m stuck on how to get it running, probably because the weights are not loaded. Most likely I just don’t know how to initialize the model correctly, which inputs are required, and how they have to be provided.

Are there any examples out there or some more detailed usage docs for candle-onnx? Or is it simply not widely used yet?

Steps to Get Phi3 Running:

Load the ONNX Model:
Rust

// candle-onnx exposes free functions (read_file, simple_eval) rather than a model struct.
let model_path = "path/to/Phi3-mini-128k-instruct-onnx.onnx";
let model = candle_onnx::read_file(model_path).expect("Failed to load ONNX model");


Replace "path/to/Phi3-mini-128k-instruct-onnx.onnx" with the actual location of your downloaded model file.

Prepare the Input:

Identify Input Tensors: Consult the Phi-3 model's documentation (if available) or use a tool like netron to visualize the model's structure and determine the expected input tensor names and shapes. Common input names for text models include "input_ids" or "input_sequence".
Create Input Data: Prepare your user input as a candle tensor with the correct data type and shape based on the identified input requirements. You will likely need to tokenize the text first, depending on the model's expectations (see the sketch after this list).
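As a rough sketch of the "Create Input Data" step, using the `tokenizers` crate and a candle tensor. The file name "path/to/tokenizer.json", the sample prompt, and the single `input_ids` input are assumptions; the Phi-3 ONNX release ships its own tokenizer files, and the export most likely also expects an attention mask and past key/value tensors, so check the input names printed from the graph first:

Rust

use candle_core::{Device, Tensor};
use tokenizers::Tokenizer;

// Hypothetical paths and prompt; adjust to your local files.
let tokenizer = Tokenizer::from_file("path/to/tokenizer.json").expect("Failed to load tokenizer");
let encoding = tokenizer.encode("Hello, how are you?", true).expect("Tokenization failed");
let ids: Vec<i64> = encoding.get_ids().iter().map(|&id| id as i64).collect();

// Decoder-only exports usually expect input_ids as a 2-D (batch, sequence) i64 tensor.
let device = Device::Cpu;
let seq_len = ids.len();
let input_ids = Tensor::from_vec(ids, (1, seq_len), &device).expect("Failed to build input tensor");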

Run Inference:
Rust

// simple_eval takes a map from graph input names to candle tensors.
let mut inputs = std::collections::HashMap::new();
inputs.insert("input_ids".to_string(), input_ids); // your prepared input tensor
let mut outputs = candle_onnx::simple_eval(&model, inputs).expect("Failed to run inference");

Replace input_ids with the tensor you prepared in the previous step; the map key must match the graph's declared input name, and Phi-3 will likely want more than one input (e.g. an attention mask and past key/values).
The outputs map contains one tensor per graph output, keyed by output name, holding the model's predictions for the user input.

Process the Output:

Depending on the output format (e.g., probabilities, logits), interpret and potentially post-process the output to generate a human-readable response for your CLI chat.
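For a text model like Phi-3 the interesting output is usually a logits tensor. Here is a minimal greedy-decoding sketch, assuming the first declared graph output holds logits of shape (batch, sequence, vocab) and reusing the graph and tokenizer values from the earlier steps; a real chat loop would append the chosen token to the prompt (or feed the cached key/values) and run inference again:

Rust

use candle_core::IndexOp;

// Assumption: the first declared output is the logits tensor (batch, sequence, vocab).
let logits = outputs
    .remove(&graph.output[0].name)
    .expect("Missing logits output");
let (_batch, seq_len, _vocab) = logits.dims3().expect("Expected a 3-D logits tensor");

// Greedy decoding: take the last position's logits and pick the highest-scoring token id.
let last_logits = logits.i((0, seq_len - 1)).expect("Indexing failed");
let next_token_id = last_logits
    .argmax(0)
    .expect("Argmax failed")
    .to_scalar::<u32>()
    .expect("Not a scalar");
let next_token = tokenizer.decode(&[next_token_id], true).expect("Decoding failed");
println!("next token: {next_token}");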

Hi Campbell, thank you very much. During my experiments over the weekend to find a working solution I also tried burn and ort, without success. With candle-onnx I gave up after I realized (or at least that’s what I concluded from the code) that it can’t load the external-data files holding the weights that ship with the bigger models, like Phi-3.

With burn and ort (using the latest 2.0-rc) I also had no success, I think because of unimplemented ops required by Phi-3. I then tried Llama-3 with ort, but again it failed with missing ops. I only got an older GPT-2 model running, at least some proof that “something” works. :wink:

Now I’m focused on the candle implementation of the Phi model. I got the example provided by candle running (Phi-2), albeit with horrible performance on my notebook without a GPU (1 token per minute). One reason is that candle only uses a single core. But I also remember that half a year ago I tried the Stable Diffusion example, and it was about 100 times slower than what my colleague got on a comparable machine using GPT4All.

So, it looks like I’m still at the beginning of the journey.