R-AIF (Robust Active Inference): Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models

Viet Dung Nguyen, Zhizhuo Yang, Christopher L. Buckley, Alexander Ororbia


Sparse Reward Tasks: A General Machine Intelligence (MI) Challenge

Sparse-reward reinforcement learning (RL) tasks, such as the mountain car [1] and robotic pick-and-place [2], [3] control problems, are well known for their difficulty in machine intelligence research. Generally, dense reward functions are designed to reduce the difficulty of the learning problem for computational agents, such as those built from deep neural networks. However, in the real world, it is impractical to engineer a reward system for every single task, particularly in robotic control environments. When no reward signal is provided at all except at the last step, which merely signifies whether the agent has failed or succeeded, most machine intelligence and RL algorithms struggle to solve these tasks in a reasonable amount of time [1], [4].

Specifically, RL methods such as PPO [5], the SAC series [6], [7], TRPO [8], and DDPG [9] depend largely on the (designed/engineered) reward feedback signals provided by the environment in order to compute a temporal-difference/regret signal through the Bellman equation, which is then used to update the deep neural model via gradient-based optimization [1]. One reason these methods struggle to solve sparse-reward tasks is that the reward signals, which rarely occur, do not provide any meaningful gradient to the model, which can even contribute to model divergence.
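To make this failure mode concrete, here is a minimal sketch (our own illustration, not any specific baseline's code) of a one-step temporal-difference update derived from the Bellman equation. Under a sparse reward, the reward term is zero on almost every transition, so the TD error, and hence the learning signal, stays near zero until a rare success is encountered.

```python
# Illustrative sketch only: a tabular TD(0) update. With sparse rewards,
# r is 0.0 on nearly every transition, so the TD error (and the update)
# vanishes until the agent happens to stumble upon a success.
import numpy as np

def td_update(V, s, r, s_next, done, gamma=0.99, lr=0.1):
    target = r + (0.0 if done else gamma * V[s_next])  # Bellman target
    td_error = target - V[s]                           # ~0 almost everywhere under sparse rewards
    V[s] += lr * td_error
    return td_error

V = np.zeros(100)                                       # hypothetical discrete state space
delta = td_update(V, s=3, r=0.0, s_next=4, done=False)  # a typical reward-free transition
```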

While designing a dense reward function for each task might be too difficult, infeasible/impractical, or too brittle, it is often feasible to rely on an expert, e.g., an experienced human equipment operator, who can demonstrate some form of successful behavior to the robot/RL controller. Using such demonstration data to aid the construction of an effective policy is often called imitation learning. However, while more viable than designing a myriad of dense reward schemes, standard imitation learning methods also suffer from trajectory divergence, since they treat samples as i.i.d. even though the real-world trajectories induced by complex problems such as robotic control are not.

Pixel-Based Learning as Another MI Difficulty

Another challenge in real-world RL is the design of the task environment itself. Normally, reinforcement learning tasks are designed under simulation conditions where experimentalists have full control over the states (in order to set up tractable Markov decision processes), i.e., over how the environment is represented through meaningful, hand-crafted features such as the coordinates of objects with respect to the robot's sensors (its “eyes”). However, in real-world scenarios, we do not always have access to this kind of information (and must resort to raw, complex low-level information such as pixels), making the environment partially observable.

World Models and Active Inference

To overcome the need to manually design both the real-world environment states and the reward system, past work has developed methodologies that observe the world through different sensors (mostly vision) and infer the corresponding actions by estimating the environment's hidden underlying states and the associated rewards. Modeling perception in this way typically draws on methodology from the variational inference [10] literature, and the perception-action model as a whole is called a “world model” [11] or an active inference [12] module/sub-system (which balances learning from goal-orienting, or instrumental, signals with mechanisms that drive intelligent exploration, or epistemic foraging).
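As a rough picture of how these two signals are combined, the sketch below (our own simplification, with hypothetical names, shapes, and an entropy proxy for the epistemic term, not the paper's exact objective) scores a predicted latent state by an instrumental term (log-probability of the prediction under a preferred/goal state distribution) plus an epistemic term.

```python
# Simplified sketch of an active-inference-style action score: instrumental
# value (reach preferred states) plus epistemic value (explore uncertain states).
# Names, shapes, and the entropy proxy are our assumptions for illustration.
import torch
import torch.distributions as D

def action_score(pred_state: D.Normal, preferred_state: D.Normal) -> torch.Tensor:
    instrumental = preferred_state.log_prob(pred_state.rsample()).sum(-1)  # goal-seeking term
    epistemic = pred_state.entropy().sum(-1)                               # exploration term
    return instrumental + epistemic                                        # higher is better

pred = D.Normal(torch.zeros(8), torch.ones(8))       # hypothetical predicted latent state
goal = D.Normal(torch.ones(8), 0.5 * torch.ones(8))  # hypothetical preferred (goal) state
score = action_score(pred, goal)
```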

Robust Active Inference for Solving Sparse-Reward, Pixel-Based Tasks

One might ask: "why can't we just use behavior cloning (BC) to solve sparse-reward tasks if we are going to use an expert anyway?" The answer to this reasonable question is simple: a behavior cloning-like architecture typically requires a large amount of expert data in order to produce reasonable action distributions for every possible state of the environment. Acquiring enough data to build an effective cloning system is highly impractical and costly, especially since the collected expert data cannot account for every single situation that the agent will encounter now and in the future. Another problem with BC is covariate/distribution shift: BC assumes that the data from each time step is independently and identically distributed (i.i.d.), whereas an actual trajectory is not. This issue is illustrated in Sergey Levine's imitation learning lecture and in the visualization below:

Instead of learning to imitate the expert as in behavior cloning, which as discussed requires a large amount of expert data, our work leverages only a small quantity of “seed imitation data” (from an expert) and then learns an ANN that dynamically produces preferences over states at each time step, providing a dense instrumental signal (or a local goal state for each step). As a result, the agent follows a trajectory that is shaped towards its own estimate of future preferred states (using the instrumental signal) while still performing stable local exploration (using an epistemic signal). To achieve this kind of “trajectory nudging”, the agent is trained to estimate future preferred states accurately, so that these estimated states resemble the imitation/positive data distribution. This means that our imitation-seeded “neural prior” adapts over time.

We achieve this by employing a novel technique that we call the Contrastive Recurrent State Prior Preference (CRSPP) model. Similar to the recurrent state-space model (RSSM) used in approaches such as the “Dreamer” model series, CRSPP encodes the observation and, based on a dynamically computed prior preference rate (a measure of how “desirable” a state is), learns either to pull the predicted goal state towards, or to push it away from, the actual world-model (latent) state. This keeps the encoded state distribution aligned with the goal state distribution. Furthermore, to estimate the next goal state correctly, a dynamical transition network is learned by minimizing the KL divergence between the predicted goal and the desired state, which helps the temporal transition model accurately predict the next goal state posterior.
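The sketch below gives a rough sense of the two training signals just described; it is a simplification under our own assumptions (the names, the hinge-style push term, and the equal loss weighting are ours), not the paper's exact CRSPP implementation.

```python
# Rough sketch of the two CRSPP-style losses described above (our simplification):
# (i) a contrastive term, weighted by the prior preference rate, that pulls the
#     predicted goal state towards desirable world-model latent states and pushes
#     it away from undesirable ones, and
# (ii) a KL term that trains the transition network to predict the next goal state.
import torch
import torch.nn.functional as F
import torch.distributions as D

def crspp_losses(goal_state, latent_state, preference_rate,
                 predicted_goal: D.Normal, desired_goal: D.Normal):
    # squared distance between the predicted goal state and the (detached) world-model state
    sq_dist = F.mse_loss(goal_state, latent_state.detach(), reduction="none").sum(-1)
    # preference_rate in [-1, 1]: positive => pull together, negative => push apart
    contrastive = torch.where(preference_rate >= 0,
                              preference_rate * sq_dist,
                              (-preference_rate) * F.relu(1.0 - sq_dist))
    # transition network learns the next goal-state posterior by minimizing a KL term
    kl = D.kl_divergence(desired_goal, predicted_goal).sum(-1)
    return contrastive.mean() + kl.mean()
```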

Figure: The contrastive recurrent state prior preference (CRSPP) model.

Note that our scheme dynamically computes the prior preference rate by decaying a constant backward from the end of the episode, depending on the agent's success. Finally, we train the complete agent using our proposed robust active inference (R-AIF) framework. Details on the implementation and the accompanying mathematics can be found in our GitHub repository and in the paper.
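For intuition, here is a minimal sketch of such a backward-decayed preference rate (the base value, decay factor, and sign convention are assumptions for illustration, not the paper's exact scheme): the final step of the episode receives the full rate and earlier steps receive geometrically decayed values, with the sign reflecting whether the episode ended in success.

```python
# Illustrative sketch of a backward-decayed prior preference rate; the base
# value, decay factor, and sign convention are assumptions for illustration.
import numpy as np

def preference_rates(episode_len: int, success: bool, base: float = 1.0, decay: float = 0.95):
    sign = 1.0 if success else -1.0
    # the last step gets the full rate; earlier steps decay geometrically backwards
    return sign * base * decay ** np.arange(episode_len)[::-1]

print(preference_rates(episode_len=5, success=True))
# -> approximately [0.815, 0.857, 0.903, 0.95, 1.0]
```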

Experiments and Results

Empirically, we train our agent on three main environments/worlds comprising different control tasks and robotic simulations: Gymnasium mountain car, Meta-World [2], and robosuite [3], for a total of 16 different tasks. We modified the environments so that they are pixel-based POMDPs with continuous action spaces, provide only sparse reward signals, and feature varied goal conditions. Here are the quantitative results of our R-AIF agent compared against other baselines (DreamerV3 [13], DAIVPG-G [14], and Himst-G [15]).

Here are the qualitative results from our robotic control experiments with R-AIF and related baselines. Note that the green checkmark indicates that the corresponding agent has completed the task.

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018. [Online]. Available: http://incompleteideas.net/book/the-book-2nd.html
[2] T. Yu et al., “Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning,” in Conference on Robot Learning (CoRL), 2019. [Online]. Available: https://arxiv.org/abs/1910.10897
[3] Y. Zhu et al., “robosuite: A Modular Simulation Framework and Benchmark for Robot Learning,” arXiv preprint arXiv:2009.12293, 2020.
[4] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. Prentice Hall, 2010.
[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 1856–1865. [Online]. Available: http://proceedings.mlr.press/v80/haarnoja18b.html
[7] T. Haarnoja et al., “Soft Actor-Critic Algorithms and Applications,” arXiv, 2018. doi: 10.48550/ARXIV.1812.05905.
[8] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust Region Policy Optimization,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 37. PMLR, 2015, pp. 1889–1897. [Online]. Available: https://proceedings.mlr.press/v37/schulman15.html
[9] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, 2015. doi: 10.48550/ARXIV.1509.02971.
[10] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic Variational Inference,” Journal of Machine Learning Research, vol. 14, no. 40, pp. 1303–1347, 2013. [Online]. Available: http://jmlr.org/papers/v14/hoffman13a.html
[11] D. R. Ha and J. Schmidhuber, “World Models,” arXiv, vol. abs/1803.10122, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:4807711
[12] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, “Active inference: a process theory,” Neural Computation, vol. 29, no. 1, pp. 1–49, 2017.
[13] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering Diverse Domains through World Models,” arXiv, vol. abs/2301.04104, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:255569874
[14] B. Millidge, “Deep active inference as variational policy gradients,” Journal of Mathematical Psychology, vol. 96, p. 102348, 2020. doi: 10.1016/j.jmp.2020.102348.
[15] O. van der Himst and P. Lanillos, “Deep Active Inference for Partially Observable MDPs,” in Active Inference, T. Verbelen, P. Lanillos, C. L. Buckley, and C. De Boom, Eds. Cham: Springer International Publishing, 2020, pp. 61–71.