Turning Human Speech into Set Commands

Hello! I’m interested in training a model that takes casual human speech and translates it into a fixed set of commands for building in a 3D voxel environment.

For example, I want to define a command for the AI to use called “fill”, with the format “fill [position] [size] [color]”.

A user should be able to say “Place a blue block at 0 0 0” and the program will respond with “fill 0 0 0 1 1 1 #0000ff”. To explain this output: the first three 0s are the x, y, z coordinates, the three 1s denote the height, width, and length of the block, and the last value is the color as a hex code.
Right now we achieve this with a large prompt that we feed to ChatGPT, but for cost reasons we can’t continue using it.
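
To make the target format concrete, here is a minimal Python sketch of how the “fill” output could be parsed and checked. The names `FillCommand` and `parse_fill` are hypothetical, and it assumes integer coordinates (negatives allowed), sizes of at least 1, and six-digit hex colors; none of that is settled yet.

```python
import re
from typing import NamedTuple


class FillCommand(NamedTuple):
    """One parsed command: fill [position] [size] [color]."""
    x: int
    y: int
    z: int
    height: int
    width: int
    length: int
    color: str


# Assumed grammar: three integer coordinates, three non-negative integer
# sizes, and a #rrggbb color, all separated by single spaces.
_FILL_RE = re.compile(
    r"^fill (-?\d+) (-?\d+) (-?\d+) (\d+) (\d+) (\d+) (#[0-9a-fA-F]{6})$"
)


def parse_fill(command: str) -> FillCommand:
    """Parse a fill command string, raising ValueError if it is malformed."""
    match = _FILL_RE.match(command.strip())
    if match is None:
        raise ValueError(f"not a valid fill command: {command!r}")
    *numbers, color = match.groups()
    x, y, z, height, width, length = (int(n) for n in numbers)
    if min(height, width, length) < 1:
        raise ValueError("size must be at least 1 in every dimension")
    return FillCommand(x, y, z, height, width, length, color.lower())
```

Something like this could also be used to reject malformed model output before it reaches the voxel engine, whichever model ends up producing the commands.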

So what I’m asking is: what model, pipeline, etc. would be a good jumping-off point for this sort of task? Ideally I would like to both give the AI context about its voxel environment and feed it training data that demonstrates how it should respond to certain speech requests. I also see a lot of different functions listed under natural language processing in the model database. Which one should I be looking into?
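
To show what I mean by training data, here is a rough Python sketch of how the speech/command pairs might be written out as JSON Lines. The field names `input` and `target`, the `CONTEXT` string, and the second example pair are all made up for illustration; only the first pair comes from the example above.

```python
import json

# (speech, command) pairs. The first is the example from this post;
# the second is a made-up illustration, not real data.
EXAMPLES = [
    ("Place a blue block at 0 0 0", "fill 0 0 0 1 1 1 #0000ff"),
    ("Put a red block at 2 0 5", "fill 2 0 5 1 1 1 #ff0000"),
]

# Hypothetical context describing the available commands to the model.
CONTEXT = "Commands: fill [position] [size] [color]"


def to_jsonl(pairs, context=CONTEXT) -> str:
    """Serialize pairs as one JSON object per line (assumed record layout)."""
    lines = []
    for speech, command in pairs:
        record = {"input": f"{context}\nUser: {speech}", "target": command}
        lines.append(json.dumps(record))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```

I’m not attached to this layout; if a particular model or pipeline expects a different format, I’m happy to restructure the data.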