Custom env: catching a ball flying by a parabolic trajectory

Hi,
I’m practicing implementing my own environment with Farama Gymnasium and PyGame;
the source code is available here.
I implemented physics and a policy gradient method from the lesson: Hands on - Hugging Face Deep RL Course

The same code is able to solve the CartPole problem,
but I’m struggling to make the agent move a racket to catch the ball.

The environment:

  • a ball is thrown from left to right with some initial velocity
  • on the right side of the game field there is a racket that can move up or down
  • the reward is assigned when the ball reaches the right side of the field: +10 if it hits the racket, -10 if it misses

I’m using a network with one hidden layer.

After training, the racket either goes to the top or to the bottom of the field and stays there.
My hope was that it would learn the concept of gravity and the parabolic curve, or at least try to follow the ball’s Y-coordinate,
but that doesn’t happen.

Do you have any hints on how to tune my parameters or reward function?

Any help is appreciated.


1. Reward Function Design

  • Current Reward Problem: The sparse reward (+10 or -10) might not provide enough guidance during training, especially early on.
  • Improvement Ideas:
    • Add intermediate rewards to encourage the racket to follow the ball’s trajectory. For example:
      • A small positive reward for reducing the distance between the racket’s Y-coordinate and the ball’s Y-coordinate.
      • Penalize the racket for moving unnecessarily.
    • Use a shaped reward (see the sketch after this list):
      • Reward = 10 - distance_to_ball when the ball reaches the racket.
      • Add a constant offset for successfully hitting the ball.
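
For instance, a minimal sketch of such a shaped reward (the function and variable names here are illustrative, not taken from your environment):

```python
# Sketch of a shaped reward: dense guidance each step plus the original terminal bonus.
# ball_y, racket_y, field_height, ball_reached_right_edge, racket_hit_ball and moved
# are illustrative names, not from the original code.

def compute_reward(ball_y, racket_y, field_height,
                   ball_reached_right_edge, racket_hit_ball,
                   moved, step_penalty=0.01):
    distance = abs(ball_y - racket_y) / field_height  # normalized to [0, 1]
    reward = 0.0

    # Dense shaping: small reward for staying close to the ball's height.
    reward += 0.1 * (1.0 - distance)

    # Optional penalty for unnecessary movement.
    if moved:
        reward -= step_penalty

    # Sparse terminal reward, kept an order of magnitude larger so the catch
    # itself remains the dominant signal.
    if ball_reached_right_edge:
        reward += 10.0 if racket_hit_ball else -10.0

    return reward
```

Keeping the shaping term much smaller than the terminal reward means it only guides exploration rather than replacing the real objective.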

2. State Space

  • Ensure your state space provides sufficient information for the agent to learn:
    • Ball position (x, y) and velocity (v_x, v_y).
    • Racket position (y_racket).
    • Distance between ball and racket (distance_y = y_ball - y_racket).
  • If gravity plays a role, ensure that the dynamics of v_y reflect it accurately.
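
For example, the observation could be packed into a Gymnasium Box space roughly like this (a sketch only; the field dimensions and speed bound are assumptions):

```python
import numpy as np
from gymnasium import spaces

# Illustrative field size and speed bound; adjust to your environment.
FIELD_W, FIELD_H, MAX_SPEED = 800.0, 600.0, 50.0

# Observation: [ball_x, ball_y, v_x, v_y, racket_y, ball_y - racket_y]
observation_space = spaces.Box(
    low=np.array([0.0, 0.0, -MAX_SPEED, -MAX_SPEED, 0.0, -FIELD_H], dtype=np.float32),
    high=np.array([FIELD_W, FIELD_H, MAX_SPEED, MAX_SPEED, FIELD_H, FIELD_H], dtype=np.float32),
    dtype=np.float32,
)
```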

3. Action Space

  • Restrict unnecessary actions:
    • Actions like moving up when the ball is below the racket, or vice versa, could confuse the model.
    • Consider reducing the action space to just three discrete actions: UP, DOWN, STAY.
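
A sketch of that reduced action space and one possible mapping to racket motion (the constants are illustrative):

```python
from gymnasium import spaces

UP, DOWN, STAY = 0, 1, 2
action_space = spaces.Discrete(3)

RACKET_SPEED = 5.0  # pixels per step, illustrative

def apply_action(racket_y, action, field_height):
    """Move the racket one step and clamp it to the field."""
    if action == UP:
        racket_y -= RACKET_SPEED
    elif action == DOWN:
        racket_y += RACKET_SPEED
    # STAY leaves racket_y unchanged.
    return min(max(racket_y, 0.0), field_height)
```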

4. Network Architecture

  • For a parabolic trajectory, the model might require more representational power:
    • Increase the number of neurons in the hidden layer.
    • Use two hidden layers instead of one to better capture non-linear dynamics.
  • Ensure the activation functions allow non-linearity (e.g., ReLU for hidden layers).
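
For example, a small two-hidden-layer policy in PyTorch, in the style of the course’s REINFORCE agent (the state size of 6 and hidden size of 64 are assumptions; the entropy is returned for the regularization discussed below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    def __init__(self, state_size=6, hidden_size=64, action_size=3):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)   # second hidden layer
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)

    def act(self, state):
        probs = self.forward(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action), dist.entropy()
```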

5. Training and Hyperparameters

  • Learning Rate:
    • Use a lower learning rate (e.g., 0.001) to avoid overshooting.
  • Exploration:
    • Add entropy regularization to encourage exploration in policy gradient methods.
  • Reward Scaling:
    • Normalize rewards or use reward clipping to stabilize learning.
  • Training Episodes:
    • Extend training duration to allow the agent to encounter more varied ball trajectories.
  • Discount Factor (γ):
    • Set an appropriate discount factor (e.g., 0.99) to balance short-term and long-term rewards.
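
Putting several of these together, here is a sketch of a policy-gradient loss with discounted, normalized returns and an entropy bonus (the coefficients are starting points, not tuned values; the log-probabilities and entropies come from the action-sampling step above):

```python
import torch

def reinforce_loss(log_probs, entropies, rewards, gamma=0.99, entropy_coef=0.01):
    """log_probs, entropies, rewards: per-step lists collected over one episode."""
    # Discounted returns, computed backwards through the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Normalizing the returns keeps the gradient scale stable across episodes.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    log_probs = torch.cat(log_probs)
    entropies = torch.cat(entropies)

    # Policy-gradient loss plus an entropy bonus to keep the policy exploring.
    return -(log_probs * returns).mean() - entropy_coef * entropies.mean()
```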

6. Debugging the Agent’s Behavior

  • Visualize the Trajectory:
    • Plot the ball’s trajectory and racket’s movements during training to see if the agent is learning anything useful.
  • Track Metrics:
    • Monitor the reward per episode to ensure the policy improves.
  • Baseline Comparisons:
    • Test simpler heuristics (e.g., follow y_ball) to set a baseline.
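
The “follow y_ball” baseline can be as small as this (action constants as in the earlier sketch):

```python
UP, DOWN, STAY = 0, 1, 2  # same illustrative constants as above

def follow_ball_policy(ball_y, racket_y, tolerance=2.0):
    """Hand-coded baseline: just move toward the ball's y-coordinate."""
    if ball_y < racket_y - tolerance:
        return UP
    if ball_y > racket_y + tolerance:
        return DOWN
    return STAY
```

If the learned policy cannot beat this, the problem is likely in the reward or training loop rather than the network size.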

7. Experiment with Variants

  • Gradually introduce randomness in the ball’s velocity and position to make the agent robust.
  • Test different gravity strengths and ball speeds.
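
One way to phase the randomness in gradually, as a sketch (all numbers are arbitrary examples):

```python
import random

def sample_launch(episode, total_episodes, base_speed=30.0, base_angle=45.0):
    """Launch every ball identically at first, then widen the random spread."""
    spread = min(1.0, episode / (0.5 * total_episodes))  # grows from 0 to 1 over the first half
    speed = base_speed + spread * random.uniform(-10.0, 10.0)
    angle = base_angle + spread * random.uniform(-15.0, 15.0)
    return speed, angle
```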

Thank you so much, @Alanturner2, for such an elaborate response!
Indeed, I launch the ball with random initial parameters every episode, which might confuse the network.
I’ll try to teach the agent to catch the ball with the same initial parameters first.


Yeah, you mentioned that your agent picks its actions with some randomness during training; that is a standard way to explore in RL. Classical on-policy methods such as SARSA work like this: you sample an action from your (partly random) policy and then adjust the policy using the reward from the environment. In deep RL you can use an actor-critic model instead: actions are still sampled stochastically, so exploration is preserved, but on every iteration you update the policy from the observed rewards, with the critic estimating how good each state is.
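
For reference, a minimal sketch of that kind of per-step actor-critic update (illustrative code, not from the thread; `optimizer` is assumed to hold the parameters of both the policy and the value network, and `log_prob` comes from sampling the action through the policy):

```python
import torch
import torch.nn.functional as F

def actor_critic_step(value_net, optimizer, log_prob, reward,
                      state, next_state, done, gamma=0.99):
    """One-step advantage actor-critic update for a single transition."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    value = value_net(state)                      # critic's estimate V(s)
    with torch.no_grad():
        next_value = torch.zeros(1) if done else value_net(next_state)
        td_target = reward + gamma * next_value   # bootstrapped target

    advantage = td_target - value
    actor_loss = -log_prob * advantage.detach()   # policy gradient weighted by advantage
    critic_loss = F.mse_loss(value, td_target)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```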
