Maintaining Plasticity in Reinforcement Learning: A Cost-Aware Framework for Aerial Robot Control in Non-stationary Environments

1Necmettin Erbakan University, 2University of Bristol *Corresponding author: Basaran Bahadir Kocer [b.kocer (at) bristol.ac.uk]


💡Abstract

Reinforcement learning (RL) has demonstrated the ability to maintain policy plasticity throughout short-term training in aerial robot control. However, these policies have been shown to lose plasticity when training is extended to long-term learning in non-stationary environments. For example, the standard proximal policy optimization (PPO) policy is observed to collapse in long-term training settings, leading to significant control performance degradation. To address this problem, this work proposes a cost-aware framework that uses a retrospective cost mechanism (RECOM) to balance rewards and losses in RL training in a non-stationary environment. Using a cost gradient relation between rewards and losses, our framework dynamically updates the learning rate to actively train the control policy in a disturbed wind environment. Our experimental results show that our framework learned a policy for the hovering task without policy collapse under variable wind conditions, with 11.72% fewer dormant units than L2 regularization with PPO.



Aerial robot hovering from different initial positions under variable wind conditions

❓Loss of Plasticity



During the 20M-timestep training, the wind disturbance value was changed every 2M timesteps, taking the values [3.0, 2.0, 2.5, 1.5, 2.5] meters per second


RL has demonstrated the ability to maintain plasticity throughout short-term training. However, policies have been shown to lose plasticity when training is extended to long-term learning in non-stationary environments. For example, when we changed the wind disturbance value every 2 million timesteps, the standard PPO policy collapsed in long-term training, leading to significant control performance degradation.
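The non-stationary wind schedule described above can be sketched as a simple timestep lookup. This is an illustrative sketch, not the experiment code: the function name is hypothetical, and cycling through the five listed values past the fifth 2M-step segment is an assumption, since the caption lists five speeds for the 20M-step run.

```python
# Hypothetical sketch of the non-stationary wind schedule:
# the mean wind speed changes every 2M timesteps during training.
WIND_SPEEDS = [3.0, 2.0, 2.5, 1.5, 2.5]  # m/s, from the figure caption
SEGMENT = 2_000_000                       # timesteps per wind setting

def wind_speed_at(timestep: int) -> float:
    """Return the commanded mean wind speed at a given training timestep.

    Cycling past the fifth segment is an assumption for illustration.
    """
    segment_index = timestep // SEGMENT
    return WIND_SPEEDS[segment_index % len(WIND_SPEEDS)]
```

Changing the disturbance on this schedule is what makes the environment non-stationary from the policy's point of view: the reward landscape shifts every 2M timesteps.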

✨Retrospective Cost Mechanism

To address this problem, this work proposes a cost-aware framework that uses a retrospective cost mechanism (RECOM) to balance rewards and losses in RL training in a non-stationary environment. Using a cost gradient relation between rewards and losses, our framework dynamically updates the learning rate to continually train the control policy in a disturbed wind environment.
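Since this page does not spell out RECOM's exact update rule, the following is only a minimal sketch of the general idea: form a retrospective cost from recent rewards and losses, and raise the learning rate when that cost trends upward (a sign the environment has shifted). The class name, the `alpha` trade-off, the window size, and the clamping bounds are all assumptions for illustration, not the authors' formulation.

```python
from collections import deque

class RetrospectiveCostLR:
    """Illustrative sketch (not the paper's exact RECOM formulation):
    adapt the learning rate from a retrospective cost that balances
    recent losses against recent rewards over a sliding window."""

    def __init__(self, base_lr=3e-4, window=10, alpha=0.5,
                 lr_min=1e-5, lr_max=1e-3):
        self.base_lr = base_lr
        self.alpha = alpha                 # reward/loss trade-off (assumption)
        self.costs = deque(maxlen=window)  # retrospective cost history
        self.lr_min, self.lr_max = lr_min, lr_max

    def update(self, reward: float, loss: float) -> float:
        # Retrospective cost: penalize high loss, credit high reward.
        cost = self.alpha * loss - (1.0 - self.alpha) * reward
        self.costs.append(cost)
        if len(self.costs) < 2:
            return self.base_lr
        # The cost trend over the window stands in for the cost gradient.
        trend = self.costs[-1] - self.costs[0]
        # Raise the lr when cost rises (environment likely shifted).
        lr = self.base_lr * (1.0 + max(trend, 0.0))
        return min(max(lr, self.lr_min), self.lr_max)
```

In a PPO training loop, the returned value would be written into the optimizer's learning rate before each update, keeping the policy adaptive when the wind disturbance changes.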


🔧Training and Evaluation




Training performance of different reinforcement learning agents under wind disturbance.


Change in dormant units in the policy network during training under wind disturbance.

✨Experiments



| Policy | Success Rate (%) |
|---|---|
| Standard PPO | 30 |
| L2 Regularization with PPO | 88 |
| RECOM with L2 PPO | 90 |


Simulation results: mean squared error (MSE) comparison for the three policies.


Our experimental results show that the baseline PPO policy collapses under dynamic wind changes, leading to significant control performance degradation. By integrating the proposed RECOM with the PPO algorithm, the policy remains stable throughout long-term training, and dormant units in the neural network are effectively utilized.
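As a rough illustration of how dormant units can be counted, here is a sketch following the common definition from the dormant-neuron literature: a unit is dormant when its mean absolute activation, normalized by the layer average, falls below a threshold. The function name, the `tau` value, and the normalization are assumptions for illustration, not necessarily the exact metric used in the paper.

```python
import numpy as np

def dormant_fraction(activations: np.ndarray, tau: float = 0.025) -> float:
    """Fraction of dormant units in one layer.

    activations: array of shape (batch, units) collected from rollouts.
    A unit counts as dormant when its mean |activation|, normalized by
    the layer-wide mean, is at most tau (threshold is an assumption).
    """
    mean_act = np.abs(activations).mean(axis=0)  # per-unit mean |activation|
    layer_mean = mean_act.mean()
    if layer_mean == 0.0:
        return 1.0                               # entire layer is silent
    scores = mean_act / layer_mean
    return float((scores <= tau).mean())
```

Tracking this fraction per layer over the 20M-timestep run is what produces curves like the dormant-unit plot above: a healthy policy keeps the fraction low, while a collapsing one accumulates dormant units.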


💡Key Findings:
  • 1. Policy collapse prevention during long-term training.
  • 2. Activation of dormant units in response to variable wind conditions.
  • 3. Demonstrated balance between rewards and losses through the RECOM.

📌Conclusion


In this work, we proposed a retrospective cost mechanism (RECOM) to balance rewards and losses, focusing specifically on the aerial robot hovering task in RotorPy under non-stationary environment scenarios. With the proposed RECOM with L2 PPO, policy collapse is prevented in long-term training, and the proportion of dormant units in the policy network is kept at 11.29% at 20 million timesteps. The experimental results demonstrated that rewards and losses can be balanced with a retrospective mechanism inspired by neuroscience and cognitive science.

🌻Acknowledgements


This research was partially supported by seedcorn funds from Civil, Aerospace and Design Engineering, Isambard AI, and Bristol Digital Futures Institute at the University of Bristol. Furthermore, we acknowledge Jianduo Chai for proofreading this work.

🎈BibTeX


If you find this work helpful, please cite us.

@misc{karasahin2025maintainingplasticityreinforcementlearning,
      title={Maintaining Plasticity in Reinforcement Learning: A Cost-Aware Framework for Aerial Robot Control in Non-stationary Environments}, 
      author={Ali Tahir Karasahin and Ziniu Wu and Basaran Bahadir Kocer},
      year={2025},
      eprint={2503.00282},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2503.00282}, 
}