I’ve been building a reinforcement learning trading agent using a synthetic sine wave as the price series — basically the simplest dataset I could imagine to test whether an agent can learn to buy low and sell high. But after weeks of experimentation, I still can’t get any meaningful behavior from the model. Here’s a quick rundown of what I’ve tried and how things are set up:
The Environment:
- A simple gymnasium.Env that generates a sine wave as the price series, optionally with noise.
- The action space has 3 discrete actions: 0 = HOLD, 1 = ENTER, 2 = EXIT.
- Reward is based on PnL percentage. I’ve also experimented with big and small penalties for overtrading or unnecessary actions.
- Observations include engineered features like:
  - norm_dist: price normalized between the min/max of the wave
  - proximity_to_extremes: closeness to the min/max of the wave
  - log_return: log difference between timesteps
  - hold_duration: number of steps the position has been held
  - action_hint: I even added a straightforward hint of which action to take (see the sketch right after this list)
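I don't include the hint code below, so purely for illustration, here is a minimal sketch of how such a hint feature can be derived from the wave extremes (the thresholds and the helper name build_action_hint are illustrative, not my exact code):

import numpy as np

# Illustrative sketch of an action-hint feature:
# +1 suggests ENTER near a trough, -1 suggests EXIT near a peak, 0 otherwise.
def build_action_hint(norm_dist, low_thresh=-0.9, high_thresh=0.9):
    hint = np.zeros_like(norm_dist)
    hint[norm_dist < low_thresh] = 1.0    # near the wave minimum -> buy
    hint[norm_dist > high_thresh] = -1.0  # near the wave maximum -> sell
    return hint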
Models Tried:
- Dueling DQN (via Stable-Baselines3)
- PPO with MlpPolicy
- RecurrentPPO with MlpLstmPolicy (via SB3-contrib; construction sketch after this list)
- Tried a wide range of hyperparameters, with training runs of up to 1_500_000 timesteps
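The recurrent variant was set up roughly like this (a minimal sketch; the hyperparameters shown here are placeholders, not the exact values I used):

from sb3_contrib import RecurrentPPO

recurrent_model = RecurrentPPO(
    policy="MlpLstmPolicy",
    env=env,
    n_steps=1024,
    learning_rate=2.5e-4,
    verbose=1,
)
recurrent_model.learn(total_timesteps=1_500_000)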
Results:
- The model does learn to open and close positions — but with no intelligent timing.
- It enters trades regardless of wave direction (both on rises and falls), and closes them without regard for PnL.
- Sometimes the frequency of trades changes, but not the logic.
- I’ve tried feeding observations one step at a time, and also as rolling windows. Neither improved performance.
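For the rolling-window variant, one way to give the policy a short history with SB3's built-in wrappers looks like this (a sketch using the env and data helpers shown further down; n_stack=8 is an arbitrary choice, and my exact windowing code may differ):

from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

# Wrap the single env in a vectorized env, then stack the last 8 observations
# so the policy sees a rolling window instead of a single timestep.
prices, norm_dist, proximity, log_return = generate_price_series()
venv = DummyVecEnv([lambda: SineTradeEnv(prices, norm_dist, proximity, log_return)])
venv = VecFrameStack(venv, n_stack=8)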
Extra Debugging Steps:
- Visualized trades on the sine wave — still just random entries and exits.
- Printed out per-step actions, rewards, and PnL.
- Logged tensorboard metrics — no obvious convergence.
- Verified that the rewards are non-zero and properly aligned with good trades.
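One related check that is easy to reproduce: a hard-coded "oracle" policy that enters near troughs and exits near peaks, to confirm the environment actually pays out for good timing (a sketch; the 0.9 thresholds are illustrative):

# Scripted sanity check: if this oracle doesn't collect far more reward than
# random actions, the reward signal itself is the problem, not the algorithm.
obs, _ = env.reset()
total_reward, done = 0.0, False
while not done:
    in_position, norm_dist = obs[0] > 0.5, obs[1]
    if not in_position and norm_dist < -0.9:   # near the wave minimum -> ENTER
        action = 1
    elif in_position and norm_dist > 0.9:      # near the wave maximum -> EXIT
        action = 2
    else:
        action = 0                             # otherwise HOLD
    obs, reward, done, _, _ = env.step(action)
    total_reward += reward
print(f"oracle total reward: {total_reward:.2f}")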
Appreciate any feedback or insights
Here are some code snippets:
Environment (Simplified)
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class SineTradeEnv(gym.Env):
    def __init__(self, prices, norm_dist, proximity, log_return):
        super().__init__()
        self.prices = prices
        self.norm_dist_array = norm_dist
        self.proximity_array = proximity
        self.log_return_array = log_return
        self.max_t = len(prices)
        self.action_space = spaces.Discrete(3)  # 0 = HOLD, 1 = ENTER, 2 = EXIT
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.position = None     # entry price while a position is open, else None
        self.entry_step = None   # timestep at which the position was opened
        return self._get_obs(), {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_t - 1
        current_price = self.prices[self.t]

        # Unrealized PnL (fraction of entry price) while a position is open
        if self.position is not None:
            pnl_pct = (current_price - self.position) / self.position
        else:
            pnl_pct = 0.0

        # Reward logic
        reward = 0.0
        if self.position is None and action == 1:        # ENTER at the current price
            self.position = current_price
            self.entry_step = self.t
        elif self.position is not None and action == 2:  # EXIT: realize PnL, profits boosted 3x
            reward = pnl_pct * 3 if pnl_pct > 0 else pnl_pct
            self.position = None
            self.entry_step = None
        elif self.position is not None and action == 0:  # HOLD in a position: unrealized PnL
            reward = pnl_pct

        return self._get_obs(), reward, done, False, {}

    def _get_obs(self):
        return np.array([
            float(self.position is not None),   # in-position flag
            self.norm_dist_array[self.t],       # normalized price (norm_dist)
            self.proximity_array[self.t],       # proximity to extremes
            self.log_return_array[self.t],      # one-step log return
            float(self.t - self.entry_step) if self.entry_step is not None else 0.0,  # hold duration
        ], dtype=np.float32)
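Not shown above, but a quick way to rule out observation-space or Gymnasium API mistakes is SB3's environment checker (a minimal sketch, using the data helper shown below):

from stable_baselines3.common.env_checker import check_env

# Warns/raises if the env violates the Gymnasium API or the declared spaces.
prices, norm_dist, proximity, log_return = generate_price_series()
check_env(SineTradeEnv(prices, norm_dist, proximity, log_return), warn=True)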
Model Training
from stable_baselines3 import PPO

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=512,
    learning_rate=2.5e-4,
    ent_coef=0.01,
    verbose=1,
    tensorboard_log="./tensorboard_logs/",
)
model.learn(total_timesteps=1_500_000)
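For completeness, SB3's evaluate_policy helper gives a quick mean-episode-reward number after training (a minimal sketch):

from stable_baselines3.common.evaluation import evaluate_policy

# Mean/std of episode reward over a few deterministic rollouts.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")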
Data Generation (Synthetic Sine Wave)
def generate_price_series(max_t=500):
    x = np.linspace(0, 6 * np.pi, max_t)        # three full sine cycles
    sine = np.sin(x)
    prices = 1000 + 200 * sine                  # price oscillates between 800 and 1200
    log_return = np.append([0], np.diff(np.log(prices)))
    norm_dist = (prices - prices.min()) / (prices.max() - prices.min()) * 2 - 1  # scaled to [-1, 1]
    proximity = np.minimum(                     # distance to the nearer extreme, in [0, 0.5]
        (prices - prices.min()) / (prices.max() - prices.min()),
        (prices.max() - prices) / (prices.max() - prices.min()),
    )
    return prices, norm_dist, proximity, log_return
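The "optionally with noise" variant mentioned at the top can be produced by adding Gaussian noise to the clean prices, roughly like this (a sketch; noise_std and the seed are illustrative, and my exact noise model may differ):

def generate_noisy_price_series(max_t=500, noise_std=5.0, seed=0):
    # Same sine wave as above, plus additive Gaussian noise on the prices.
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 6 * np.pi, max_t)
    prices = 1000 + 200 * np.sin(x) + rng.normal(0.0, noise_std, size=max_t)
    log_return = np.append([0], np.diff(np.log(prices)))
    norm_dist = (prices - prices.min()) / (prices.max() - prices.min()) * 2 - 1
    proximity = np.minimum(
        (prices - prices.min()) / (prices.max() - prices.min()),
        (prices.max() - prices) / (prices.max() - prices.min()),
    )
    return prices, norm_dist, proximity, log_return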
Inference & Debug Visualization
obs, _ = env.reset()
for t in range(env.max_t):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = env.step(int(action))
    print(f"t={t} | action={action} | reward={reward:.4f}")
    if done:
        break
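The trade overlay mentioned under the debugging steps can be produced along these lines (a matplotlib sketch, not necessarily my exact plotting code):

import matplotlib.pyplot as plt

# Record where the agent enters/exits during a deterministic rollout,
# then overlay those points on the price series.
obs, _ = env.reset()
entries, exits = [], []
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = env.step(int(action))
    if int(action) == 1:
        entries.append(env.t)
    elif int(action) == 2:
        exits.append(env.t)

plt.plot(env.prices, label="price")
plt.scatter(entries, env.prices[entries], marker="^", color="g", label="ENTER")
plt.scatter(exits, env.prices[exits], marker="v", color="r", label="EXIT")
plt.legend()
plt.show()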