Training for a LangGraph agent

Hello everyone,

I want to use an LLM to create an agent in LangGraph with this kind of architecture:


The idea is similar to a ReAct agent: the model has to provide a prompt for my terminal tool and observe whether it achieves a given objective.
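
To make the loop concrete, here is a rough sketch in plain Python rather than actual LangGraph code. The names `propose_command` and `objective_reached` are hypothetical stand-ins for the two LLM calls (choosing the next terminal input and judging the result), and `run_in_terminal` is an assumed minimal terminal tool, not my real one.

```python
import subprocess
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (command sent to the terminal, observation returned)


def run_in_terminal(command: str, timeout: float = 10.0) -> str:
    """Assumed terminal tool: run a shell command, return stdout + stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def agent_loop(
    objective: str,
    propose_command: Callable[[str, List[Step]], str],   # hypothetical LLM call
    objective_reached: Callable[[str, List[Step]], bool],  # hypothetical LLM call
    terminal: Callable[[str], str] = run_in_terminal,
    max_steps: int = 5,
) -> List[Step]:
    """Act, observe, and stop once the objective is judged reached."""
    history: List[Step] = []
    for _ in range(max_steps):
        # The model picks the next terminal input from the objective and history.
        command = propose_command(objective, history)
        # The tool executes it and returns what the model will observe.
        observation = terminal(command)
        history.append((command, observation))
        # The model decides whether the objective has been achieved.
        if objective_reached(objective, history):
            break
    return history
```

The resulting `history` is the multi-step context I am referring to: each entry is one command together with the terminal output the model saw afterwards.
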
I have a question: how should I train such a model?
Could I use a DPO/ORPO procedure to align my model with multi-step context?
Or is there a smarter way to do that?
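
To make the DPO/ORPO idea concrete: one framing I could imagine (an assumption on my part, not an established recipe) is to treat every step as its own preference example, where the prompt is the objective plus the trajectory so far and the chosen/rejected completions are two candidate next commands. The `prompt`/`chosen`/`rejected` keys follow the usual preference-dataset layout; the helper names and the text template below are hypothetical.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (command sent to the terminal, observation returned)


def render_context(objective: str, history: List[Step]) -> str:
    """Flatten the objective and previous steps into one prompt string.

    The template is an illustrative assumption; a real setup would more
    likely use the model's own chat template.
    """
    lines = [f"Objective: {objective}"]
    for command, observation in history:
        lines.append(f"Command: {command}")
        lines.append(f"Observation: {observation}")
    lines.append("Command:")  # the model is asked to complete the next command
    return "\n".join(lines)


def make_preference_pair(
    objective: str,
    history: List[Step],
    chosen_command: str,
    rejected_command: str,
) -> Dict[str, str]:
    """Build one step-level preference record from a partial trajectory."""
    return {
        "prompt": render_context(objective, history),
        "chosen": chosen_command,
        "rejected": rejected_command,
    }


pair = make_preference_pair(
    objective="list files",
    history=[("pwd", "/tmp")],
    chosen_command="ls",
    rejected_command="rm -rf .",
)
```

With this framing a trajectory of N steps yields up to N records, each carrying the full context up to that step. What I don't know is whether step-level pairs like this are enough, or whether preferences should be expressed over whole trajectories.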