I’ve been building a reinforcement learning trading agent using a synthetic sine wave as the price series — basically the simplest dataset I could imagine to test whether an agent can learn to buy low and sell high. But after weeks of experimentation, I still can’t get any meaningful behavior from the model. Here’s a quick rundown of what I’ve tried and how things are set up:
The Environment:
- A simple gymnasium.Env that generates a sine wave as the price series, optionally with noise.
- The action space has 3 discrete actions: 0 = HOLD, 1 = ENTER, 2 = EXIT.
- Reward is based on PnL percentage. I’ve also experimented with big and small penalties for overtrading or unnecessary actions.
- Observations include engineered features like:
  - norm_dist: price normalized between the min/max of the wave
  - proximity_to_extremes: closeness to the min/max of the wave
  - log_return: log difference between timesteps
  - hold_duration: number of steps the position has been held
  - action_hint: I even added a straightforward hint of which action to take (see the sketch right after this list)
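I don't include the hint code below, so purely for illustration, here is a minimal sketch of how such a hint feature can be derived from the wave extremes (the thresholds and the helper name build_action_hint are illustrative, not my exact code):

import numpy as np

# Illustrative sketch of an action-hint feature:
# +1 suggests ENTER near a trough, -1 suggests EXIT near a peak, 0 otherwise.
def build_action_hint(norm_dist, low_thresh=-0.9, high_thresh=0.9):
    hint = np.zeros_like(norm_dist)
    hint[norm_dist < low_thresh] = 1.0    # near the wave minimum -> buy
    hint[norm_dist > high_thresh] = -1.0  # near the wave maximum -> sell
    return hint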
Models Tried:
- Dueling DQN (via Stable-Baselines3)
- PPO with MlpPolicy
- RecurrentPPO with MlpLstmPolicy (via SB3-contrib; construction sketch after this list)
- Tried a wide range of hyperparameters, with training runs of up to 1_500_000 timesteps
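The recurrent variant was set up roughly like this (a minimal sketch; the hyperparameters shown here are placeholders, not the exact values I used):

from sb3_contrib import RecurrentPPO

recurrent_model = RecurrentPPO(
    policy="MlpLstmPolicy",
    env=env,
    n_steps=1024,
    learning_rate=2.5e-4,
    verbose=1,
)
recurrent_model.learn(total_timesteps=1_500_000)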
Results:
- The model does learn to open and close positions — but with no intelligent timing.
- It enters trades regardless of wave direction (both on rises and falls), and closes them without regard for PnL.
- Sometimes the frequency of trades changes, but not the logic.
- I’ve tried feeding observations one step at a time, and also as rolling windows. Neither improved performance.
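For the rolling-window variant, one way to give the policy a short history with SB3's built-in wrappers looks like this (a sketch using the env and data helpers shown further down; n_stack=8 is an arbitrary choice, and my exact windowing code may differ):

from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

# Wrap the single env in a vectorized env, then stack the last 8 observations
# so the policy sees a rolling window instead of a single timestep.
prices, norm_dist, proximity, log_return = generate_price_series()
venv = DummyVecEnv([lambda: SineTradeEnv(prices, norm_dist, proximity, log_return)])
venv = VecFrameStack(venv, n_stack=8)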
Extra Debugging Steps:
- Visualized trades on the sine wave — still just random entries and exits.
- Printed out per-step actions, rewards, and PnL.
- Logged tensorboard metrics — no obvious convergence.
- Verified that the rewards are non-zero and properly aligned with good trades.
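One related check that is easy to reproduce: a hard-coded "oracle" policy that enters near troughs and exits near peaks, to confirm the environment actually pays out for good timing (a sketch; the 0.9 thresholds are illustrative):

# Scripted sanity check: if this oracle doesn't collect far more reward than
# random actions, the reward signal itself is the problem, not the algorithm.
obs, _ = env.reset()
total_reward, done = 0.0, False
while not done:
    in_position, norm_dist = obs[0] > 0.5, obs[1]
    if not in_position and norm_dist < -0.9:   # near the wave minimum -> ENTER
        action = 1
    elif in_position and norm_dist > 0.9:      # near the wave maximum -> EXIT
        action = 2
    else:
        action = 0                             # otherwise HOLD
    obs, reward, done, _, _ = env.step(action)
    total_reward += reward
print(f"oracle total reward: {total_reward:.2f}")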
Appreciate any feedback or insights
Here are some code snippets:
Environment (Simplified)
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class SineTradeEnv(gym.Env):
    def __init__(self, prices, norm_dist, proximity, log_return):
        super().__init__()
        self.prices = prices
        self.norm_dist_array = norm_dist
        self.proximity_array = proximity
        self.log_return_array = log_return
        self.max_t = len(prices)
        self.action_space = spaces.Discrete(3)  # 0 = HOLD, 1 = ENTER, 2 = EXIT
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.position = None     # entry price while a position is open, else None
        self.entry_step = None   # timestep at which the position was opened
        return self._get_obs(), {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_t - 1
        current_price = self.prices[self.t]

        # Unrealized PnL (fraction of entry price) while a position is open
        if self.position is not None:
            pnl_pct = (current_price - self.position) / self.position
        else:
            pnl_pct = 0.0

        # Reward logic
        reward = 0.0
        if self.position is None and action == 1:        # ENTER at the current price
            self.position = current_price
            self.entry_step = self.t
        elif self.position is not None and action == 2:  # EXIT: realize PnL, profits boosted 3x
            reward = pnl_pct * 3 if pnl_pct > 0 else pnl_pct
            self.position = None
            self.entry_step = None
        elif self.position is not None and action == 0:  # HOLD in a position: unrealized PnL
            reward = pnl_pct

        return self._get_obs(), reward, done, False, {}

    def _get_obs(self):
        return np.array([
            float(self.position is not None),   # in-position flag
            self.norm_dist_array[self.t],       # normalized price (norm_dist)
            self.proximity_array[self.t],       # proximity to extremes
            self.log_return_array[self.t],      # one-step log return
            float(self.t - self.entry_step) if self.entry_step is not None else 0.0,  # hold duration
        ], dtype=np.float32)
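Not shown above, but a quick way to rule out observation-space or Gymnasium API mistakes is SB3's environment checker (a minimal sketch, using the data helper shown below):

from stable_baselines3.common.env_checker import check_env

# Warns/raises if the env violates the Gymnasium API or the declared spaces.
prices, norm_dist, proximity, log_return = generate_price_series()
check_env(SineTradeEnv(prices, norm_dist, proximity, log_return), warn=True)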
Model Training
from stable_baselines3 import PPO

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=512,
    learning_rate=2.5e-4,
    ent_coef=0.01,
    verbose=1,
    tensorboard_log="./tensorboard_logs/",
)
model.learn(total_timesteps=1_500_000)
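For completeness, SB3's evaluate_policy helper gives a quick mean-episode-reward number after training (a minimal sketch):

from stable_baselines3.common.evaluation import evaluate_policy

# Mean/std of episode reward over a few deterministic rollouts.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")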
Data Generation (Synthetic Sine Wave)
def generate_price_series(max_t=500):
    x = np.linspace(0, 6 * np.pi, max_t)        # three full sine cycles
    sine = np.sin(x)
    prices = 1000 + 200 * sine                  # price oscillates between 800 and 1200
    log_return = np.append([0], np.diff(np.log(prices)))
    norm_dist = (prices - prices.min()) / (prices.max() - prices.min()) * 2 - 1  # scaled to [-1, 1]
    proximity = np.minimum(                     # distance to the nearer extreme, in [0, 0.5]
        (prices - prices.min()) / (prices.max() - prices.min()),
        (prices.max() - prices) / (prices.max() - prices.min()),
    )
    return prices, norm_dist, proximity, log_return
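The "optionally with noise" variant mentioned at the top can be produced by adding Gaussian noise to the clean prices, roughly like this (a sketch; noise_std and the seed are illustrative, and my exact noise model may differ):

def generate_noisy_price_series(max_t=500, noise_std=5.0, seed=0):
    # Same sine wave as above, plus additive Gaussian noise on the prices.
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 6 * np.pi, max_t)
    prices = 1000 + 200 * np.sin(x) + rng.normal(0.0, noise_std, size=max_t)
    log_return = np.append([0], np.diff(np.log(prices)))
    norm_dist = (prices - prices.min()) / (prices.max() - prices.min()) * 2 - 1
    proximity = np.minimum(
        (prices - prices.min()) / (prices.max() - prices.min()),
        (prices.max() - prices) / (prices.max() - prices.min()),
    )
    return prices, norm_dist, proximity, log_return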
Inference & Debug Visualization
obs, _ = env.reset()
for t in range(env.max_t):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = env.step(int(action))
    print(f"t={t} | action={action} | reward={reward:.4f}")
    if done:
        break
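The trade overlay mentioned under the debugging steps can be produced along these lines (a matplotlib sketch, not necessarily my exact plotting code):

import matplotlib.pyplot as plt

# Record where the agent enters/exits during a deterministic rollout,
# then overlay those points on the price series.
obs, _ = env.reset()
entries, exits = [], []
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = env.step(int(action))
    if int(action) == 1:
        entries.append(env.t)
    elif int(action) == 2:
        exits.append(env.t)

plt.plot(env.prices, label="price")
plt.scatter(entries, env.prices[entries], marker="^", color="g", label="ENTER")
plt.scatter(exits, env.prices[exits], marker="v", color="r", label="EXIT")
plt.legend()
plt.show()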