Custom env: catching a ball flying by a parabolic trajectory

Hi,
I’m practicing implementing my own environment with Farama Gymnasium and PyGame;
the source code is available here.
I implemented physics and a policy gradient method from the lesson: Hands on - Hugging Face Deep RL Course

The same code is able to solve the CartPole problem,
but I’m struggling to make the agent move a racket to catch the ball.

The environment:

  • a ball is thrown from left to right with some initial velocity
  • on the right side of the game field there is a racket that can move up or down
  • the reward is assigned when the ball reaches the right side of the field: +10 if it hits the racket, -10 if it misses

I’m using a network with one hidden layer.

After training, the racket either goes to the top or to the bottom of the field and stays there.
My hope was that it would learn the concept of gravity and the parabolic curve, or at least try to follow the ball’s Y-coordinate,
but that doesn’t happen.

Do you have any hints on how to tune my parameters or reward function?

Any help is appreciated.


1. Reward Function Design

  • Current Reward Problem: The sparse reward (+10 or -10) might not provide enough guidance during training, especially early on.
  • Improvement Ideas:
    • Add intermediate rewards to encourage the racket to follow the ball’s trajectory. For example:
      • A small positive reward for reducing the distance between the racket’s Y-coordinate and the ball’s Y-coordinate.
      • Penalize the racket for moving unnecessarily.
    • Use a shaped reward (see the sketch after this list):
      • Reward = 10 - distance_to_ball when the ball reaches the racket.
      • Add a constant offset for successfully hitting the ball.
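
For instance, a minimal sketch of such a shaped reward (the function and variable names here are illustrative, not taken from your environment):

```python
# Sketch of a shaped reward: dense guidance each step plus the original terminal bonus.
# ball_y, racket_y, field_height, ball_reached_right_edge, racket_hit_ball and moved
# are illustrative names, not from the original code.

def compute_reward(ball_y, racket_y, field_height,
                   ball_reached_right_edge, racket_hit_ball,
                   moved, step_penalty=0.01):
    distance = abs(ball_y - racket_y) / field_height  # normalized to [0, 1]
    reward = 0.0

    # Dense shaping: small reward for staying close to the ball's height.
    reward += 0.1 * (1.0 - distance)

    # Optional penalty for unnecessary movement.
    if moved:
        reward -= step_penalty

    # Sparse terminal reward, kept an order of magnitude larger so the catch
    # itself remains the dominant signal.
    if ball_reached_right_edge:
        reward += 10.0 if racket_hit_ball else -10.0

    return reward
```

Keeping the shaping term much smaller than the terminal reward means it only guides exploration rather than replacing the real objective.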

2. State Space

  • Ensure your state space provides sufficient information for the agent to learn:
    • Ball position (x, y) and velocity (v_x, v_y).
    • Racket position (y_racket).
    • Distance between ball and racket (distance_y = y_ball - y_racket).
  • If gravity plays a role, ensure that the dynamics of v_y reflect it accurately.
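
For example, the observation could be packed into a Gymnasium Box space roughly like this (a sketch only; the field dimensions and speed bound are assumptions):

```python
import numpy as np
from gymnasium import spaces

# Illustrative field size and speed bound; adjust to your environment.
FIELD_W, FIELD_H, MAX_SPEED = 800.0, 600.0, 50.0

# Observation: [ball_x, ball_y, v_x, v_y, racket_y, ball_y - racket_y]
observation_space = spaces.Box(
    low=np.array([0.0, 0.0, -MAX_SPEED, -MAX_SPEED, 0.0, -FIELD_H], dtype=np.float32),
    high=np.array([FIELD_W, FIELD_H, MAX_SPEED, MAX_SPEED, FIELD_H, FIELD_H], dtype=np.float32),
    dtype=np.float32,
)
```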

3. Action Space

  • Restrict unnecessary actions:
    • Actions like moving up when the ball is below the racket, or vice versa, could confuse the model.
    • Consider reducing the action space to just three discrete actions: UP, DOWN, STAY.
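
A sketch of that reduced action space and one possible mapping to racket motion (the constants are illustrative):

```python
from gymnasium import spaces

UP, DOWN, STAY = 0, 1, 2
action_space = spaces.Discrete(3)

RACKET_SPEED = 5.0  # pixels per step, illustrative

def apply_action(racket_y, action, field_height):
    """Move the racket one step and clamp it to the field."""
    if action == UP:
        racket_y -= RACKET_SPEED
    elif action == DOWN:
        racket_y += RACKET_SPEED
    # STAY leaves racket_y unchanged.
    return min(max(racket_y, 0.0), field_height)
```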

4. Network Architecture

  • For a parabolic trajectory, the model might require more representational power:
    • Increase the number of neurons in the hidden layer.
    • Use two hidden layers instead of one to better capture non-linear dynamics.
  • Ensure the activation functions allow non-linearity (e.g., ReLU for hidden layers).
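
For example, a small two-hidden-layer policy in PyTorch, in the style of the course’s REINFORCE agent (the state size of 6 and hidden size of 64 are assumptions; the entropy is returned for the regularization discussed below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    def __init__(self, state_size=6, hidden_size=64, action_size=3):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)   # second hidden layer
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)

    def act(self, state):
        probs = self.forward(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action), dist.entropy()
```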

5. Training and Hyperparameters

  • Learning Rate:
    • Use a lower learning rate (e.g., 0.001) to avoid overshooting.
  • Exploration:
    • Add entropy regularization to encourage exploration in policy gradient methods.
  • Reward Scaling:
    • Normalize rewards or use reward clipping to stabilize learning.
  • Training Episodes:
    • Extend training duration to allow the agent to encounter more varied ball trajectories.
  • Discount Factor (γ):
    • Set an appropriate discount factor (e.g., 0.99) to balance short-term and long-term rewards.
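
Putting several of these together, here is a sketch of a policy-gradient loss with discounted, normalized returns and an entropy bonus (the coefficients are starting points, not tuned values; the log-probabilities and entropies come from the action-sampling step above):

```python
import torch

def reinforce_loss(log_probs, entropies, rewards, gamma=0.99, entropy_coef=0.01):
    """log_probs, entropies, rewards: per-step lists collected over one episode."""
    # Discounted returns, computed backwards through the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Normalizing the returns keeps the gradient scale stable across episodes.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    log_probs = torch.cat(log_probs)
    entropies = torch.cat(entropies)

    # Policy-gradient loss plus an entropy bonus to keep the policy exploring.
    return -(log_probs * returns).mean() - entropy_coef * entropies.mean()
```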

6. Debugging the Agent’s Behavior

  • Visualize the Trajectory:
    • Plot the ball’s trajectory and racket’s movements during training to see if the agent is learning anything useful.
  • Track Metrics:
    • Monitor the reward per episode to ensure the policy improves.
  • Baseline Comparisons:
    • Test simpler heuristics (e.g., follow y_ball) to set a baseline.
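
The “follow y_ball” baseline can be as small as this (action constants as in the earlier sketch):

```python
UP, DOWN, STAY = 0, 1, 2  # same illustrative constants as above

def follow_ball_policy(ball_y, racket_y, tolerance=2.0):
    """Hand-coded baseline: just move toward the ball's y-coordinate."""
    if ball_y < racket_y - tolerance:
        return UP
    if ball_y > racket_y + tolerance:
        return DOWN
    return STAY
```

If the learned policy cannot beat this, the problem is likely in the reward or training loop rather than the network size.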

7. Experiment with Variants

  • Gradually introduce randomness in the ball’s velocity and position to make the agent robust.
  • Test different gravity strengths and ball speeds.
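
One way to phase the randomness in gradually, as a sketch (all numbers are arbitrary examples):

```python
import random

def sample_launch(episode, total_episodes, base_speed=30.0, base_angle=45.0):
    """Launch every ball identically at first, then widen the random spread."""
    spread = min(1.0, episode / (0.5 * total_episodes))  # grows from 0 to 1 over the first half
    speed = base_speed + spread * random.uniform(-10.0, 10.0)
    angle = base_angle + spread * random.uniform(-15.0, 15.0)
    return speed, angle
```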

Thank you so much, @Alanturner2, for such an elaborate response!
Indeed, I launch the ball with random initial parameters every episode, which might confuse the network.
I’ll try to teach the agent to catch the ball with the same initial parameters first.


Yeah, you mentioned that your agent picks its actions with some randomness during training; that is a standard way to explore in RL. Classical on-policy methods such as SARSA work like this: you sample an action from your (partly random) policy and then adjust the policy using the reward from the environment. In deep RL you can use an actor-critic model instead: actions are still sampled stochastically, so exploration is preserved, but on every iteration you update the policy from the observed rewards, with the critic estimating how good each state is.
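
For reference, a minimal sketch of that kind of per-step actor-critic update (illustrative code, not from the thread; `optimizer` is assumed to hold the parameters of both the policy and the value network, and `log_prob` comes from sampling the action through the policy):

```python
import torch
import torch.nn.functional as F

def actor_critic_step(value_net, optimizer, log_prob, reward,
                      state, next_state, done, gamma=0.99):
    """One-step advantage actor-critic update for a single transition."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    value = value_net(state)                      # critic's estimate V(s)
    with torch.no_grad():
        next_value = torch.zeros(1) if done else value_net(next_state)
        td_target = reward + gamma * next_value   # bootstrapped target

    advantage = td_target - value
    actor_loss = -log_prob * advantage.detach()   # policy gradient weighted by advantage
    critic_loss = F.mse_loss(value, td_target)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```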
