RL Trading Agent Can't Learn Sensible Behavior Even on a Simple Sine Wave — What Am I Doing Wrong?

I’ve been building a reinforcement learning trading agent using a synthetic sine wave as the price series — basically the simplest dataset I could imagine to test whether an agent can learn to buy low and sell high. But after weeks of experimentation, I still can’t get any meaningful behavior from the model. Here’s a quick rundown of what I’ve tried and how things are set up:

:test_tube: The Environment:

  • A simple gymnasium.Env that generates a sine wave as the price series, optionally with noise.
  • The action space has 3 discrete actions: 0 = HOLD, 1 = ENTER, 2 = EXIT.
  • Reward is based on PnL percentage. I’ve also experimented with penalties of various sizes for overtrading or unnecessary actions.
  • Observations include engineered features like:
    • norm_dist: price normalized between min/max of the wave
    • proximity_to_extremes: closeness to min/max in the wave
    • log_return: log difference between timesteps
    • hold_duration: number of steps the position has been held
    • action_hint: I even added a straightforward hint of which action to take (a sketch of how such a hint could be built follows this list)
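
The action_hint feature isn't shown in the simplified environment below, so here is a minimal sketch of how a hint like that could be derived from the wave itself; the make_action_hint helper and the 0.9 threshold are my own illustrative choices, not the exact code from my runs:

import numpy as np

def make_action_hint(norm_dist, threshold=0.9):
    # Hypothetical helper: suggest ENTER (1) near the trough and EXIT (2) near the
    # peak of the [-1, 1] normalized price, HOLD (0) everywhere else.
    hint = np.zeros_like(norm_dist, dtype=np.int64)
    hint[norm_dist <= -threshold] = 1   # near the bottom of the wave -> buy hint
    hint[norm_dist >= threshold] = 2    # near the top of the wave -> sell hint
    return hint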

:brain: Models Tried:

  • Dueling DQN (via Stable-Baselines3)
  • PPO with MlpPolicy
  • RecurrentPPO with MlpLstmPolicy (via SB3-contrib; see the sketch after this list)
  • Tried a wide range of hyperparameters and trained for up to 1_500_000 timesteps
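
For reference, the RecurrentPPO run was wired up roughly like this; a minimal sketch, with illustrative hyperparameters rather than the exact values from my sweeps:

from sb3_contrib import RecurrentPPO

# Recurrent PPO with an LSTM policy; hyperparameters below are illustrative defaults.
model = RecurrentPPO(
    policy="MlpLstmPolicy",
    env=env,
    n_steps=512,
    batch_size=256,
    learning_rate=2.5e-4,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=500_000)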

:chart_decreasing: Results:

  • The model does learn to open and close positions — but with no intelligent timing.
  • It enters trades regardless of wave direction (both on rises and falls), and closes them without regard for PnL.
  • Sometimes the frequency of trades changes, but not the logic.
  • I’ve tried feeding observations one step at a time, and also as rolling windows (the frame-stacking setup is sketched below). Neither improved performance.
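
The rolling-window variant stacks the last k single-step observations with SB3's VecFrameStack; a minimal sketch (the window length of 16 is just an example):

from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

# Wrap the env and stack the last 16 observations into one rolling-window observation.
vec_env = DummyVecEnv([lambda: SineTradeEnv(prices, norm_dist, proximity, log_return)])
vec_env = VecFrameStack(vec_env, n_stack=16)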

:light_bulb: Extra Debugging Steps:

  • Visualized trades on the sine wave — still just random entries and exits.
  • Printed out per-step actions, rewards, and PnL.
  • Logged tensorboard metrics — no obvious convergence.
  • Verified that the rewards are non-zero and properly aligned with good trades (see the scripted-policy check sketched below).
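
The reward-alignment check was essentially a scripted policy that buys near the trough, sells near the peak, and confirms the episode reward comes out clearly positive; a minimal sketch against the environment below (the 0.9 threshold is arbitrary):

obs, _ = env.reset()
total_reward, done = 0.0, False
while not done:
    in_position, norm_dist = bool(obs[0]), obs[1]
    if not in_position and norm_dist <= -0.9:
        action = 1   # ENTER near the bottom of the wave
    elif in_position and norm_dist >= 0.9:
        action = 2   # EXIT near the top of the wave
    else:
        action = 0   # HOLD otherwise
    obs, reward, done, _, _ = env.step(action)
    total_reward += reward
print(f"scripted-policy episode reward: {total_reward:.4f}")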

I’d appreciate any feedback or insights.

Here are some code snippets:

:brick: Environment (Simplified)

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class SineTradeEnv(gym.Env):
    def __init__(self, prices, norm_dist, proximity, log_return):
        super().__init__()
        self.prices = prices
        self.norm_dist_array = norm_dist
        self.proximity_array = proximity
        self.log_return_array = log_return
        self.max_t = len(prices)
        self.action_space = spaces.Discrete(3)  # 0 = HOLD, 1 = ENTER, 2 = EXIT
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.position = None      # entry price while a position is open, else None
        self.entry_step = None
        return self._get_obs(), {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_t - 1
        current_price = self.prices[self.t]

        # Unrealized PnL of the open position (0 when flat)
        if self.position is not None:
            pnl_pct = (current_price - self.position) / self.position
        else:
            pnl_pct = 0

        # Reward logic
        reward = 0
        if self.position is None and action == 1:
            # ENTER: open a position at the current price
            self.position = current_price
            self.entry_step = self.t
        elif self.position is not None and action == 2:
            # EXIT: realize PnL, with winning trades scaled 3x
            reward = pnl_pct * 3 if pnl_pct > 0 else pnl_pct
            self.position = None
            self.entry_step = None
        elif self.position is not None and action == 0:
            # HOLD while in a position: unrealized PnL is paid out every step
            reward = pnl_pct * 1

        return self._get_obs(), reward, done, False, {}

    def _get_obs(self):
        return np.array([
            float(self.position is not None),
            self.norm_dist_array[self.t],
            self.proximity_array[self.t],
            self.log_return_array[self.t],
            float(self.t - self.entry_step) if self.entry_step is not None else 0.0
        ], dtype=np.float32)

:brain: Model Training

from stable_baselines3 import PPO

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=512,
    learning_rate=2.5e-4,
    ent_coef=0.01,
    verbose=1,
    tensorboard_log="./tensorboard_logs/"
)

model.learn(total_timesteps=1_500_000)
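
For completeness, the trained policy can also be scored directly with SB3's evaluate_policy; a short sketch, assuming eval_env is built the same way as the training env:

from stable_baselines3.common.evaluation import evaluate_policy

# Evaluate the trained policy on a fresh copy of the environment.
eval_env = SineTradeEnv(prices, norm_dist, proximity, log_return)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=5, deterministic=True)
print(f"mean episode reward: {mean_reward:.3f} +/- {std_reward:.3f}")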

:bar_chart: Data Generation (Synthetic Sine Wave)

def generate_price_series(max_t=500):
    x = np.linspace(0, 6 * np.pi, max_t)
    sine = np.sin(x)
    prices = 1000 + 200 * sine
    log_return = np.append([0], np.diff(np.log(prices)))
    norm_dist = (prices - prices.min()) / (prices.max() - prices.min()) * 2 - 1
    proximity = np.minimum(
        (prices - prices.min()) / (prices.max() - prices.min()),
        (prices.max() - prices) / (prices.max() - prices.min())
    )
    return prices, norm_dist, proximity, log_return
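
For completeness, this is roughly how the pieces get glued together and sanity-checked with SB3's env checker before training:

from stable_baselines3.common.env_checker import check_env

prices, norm_dist, proximity, log_return = generate_price_series(max_t=500)
env = SineTradeEnv(prices, norm_dist, proximity, log_return)

# Catches observation-space / API mismatches before any training is run.
check_env(env)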

:detective: Inference & Debug Visualization

obs, _ = env.reset()
for t in range(env.max_t):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = env.step(int(action))
    print(f"t={t} | action={action} | reward={reward:.4f}")
    if done:
        break
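
The trade visualization mentioned above is essentially a re-run of this loop that records entries and exits and overlays them on the wave; a minimal matplotlib sketch (the recording logic here is a reconstruction, not my exact plotting code):

import numpy as np
import matplotlib.pyplot as plt

entries, exits = [], []
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    was_flat = obs[0] == 0.0
    obs, reward, done, _, _ = env.step(int(action))
    if int(action) == 1 and was_flat:
        entries.append(env.t)   # step() has already advanced env.t to the fill step
    elif int(action) == 2 and not was_flat:
        exits.append(env.t)

prices_arr = np.asarray(env.prices)
plt.plot(prices_arr, label="price")
plt.scatter(entries, prices_arr[entries], marker="^", color="g", label="enter")
plt.scatter(exits, prices_arr[exits], marker="v", color="r", label="exit")
plt.legend()
plt.show()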