Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle on tasks that require active exploration and fail to efficiently adapt from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to RL-trained agents.
Our goal is to build agents that can actively explore their environment, gather feedback, and leverage this experience for more effective exploitation. To this end, we propose LaMer, a Meta-RL framework that enables LLM agents to explore and adapt at test time. LaMer is built on two key design principles. First, it introduces a cross-episode training scheme that treats each trial as a sequence of episodes, enabling the agent to explore in early episodes and exploit this information in later ones. The agent is trained to maximize the cross-episode cumulative reward. Second, LaMer uses self-reflection as an in-context adaptation mechanism, allowing the agent to summarize past experiences and adjust its strategy accordingly without updating the model parameters.
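To make the training scheme concrete, below is a minimal sketch of a single cross-episode trial with in-context reflection. The `agent.act`, `agent.reflect`, and Gym-style `env` interfaces are hypothetical placeholders introduced for illustration, not the paper's actual API.

```python
# Minimal sketch of a cross-episode rollout with in-context reflection.
# `agent` and `env` are hypothetical interfaces, assumed for illustration.

def run_trial(agent, env, num_episodes=3):
    """Roll out a trial as a sequence of episodes, carrying reflections across episodes."""
    context = []            # in-context memory shared across episodes (no gradient updates)
    total_reward = 0.0      # cross-episode cumulative reward, the quantity being maximized

    for episode in range(num_episodes):
        obs, done, episode_reward = env.reset(), False, 0.0
        trajectory = []
        while not done:
            action = agent.act(obs, context)          # policy conditioned on past reflections
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            episode_reward += reward

        # Self-reflection: summarize the episode and append it to the context,
        # so later episodes in the same trial can adapt their strategy in-context.
        context.append(agent.reflect(trajectory, episode_reward))
        total_reward += episode_reward

    return total_reward     # training objective: maximize this cross-episode return
```

The key design choice is that exploration in early episodes is rewarded indirectly: an informative but low-reward first episode still pays off through the cross-episode return if it enables success later in the trial.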
Across all three environments, LaMer trained with Meta-RL consistently outperforms both prompting-based baselines and RL training methods on the final pass@3 success rate. LaMer-trained agents demonstrate substantial performance gains across attempts: from pass@1 to pass@3, LaMer yields improvements of 13.5% (Sokoban), 30.3% (MineSweeper), and 20.3% (Webshop), significantly larger than those of the RL and prompting-based baselines. Overall, the results indicate that LaMer obtains better performance and stronger test-time scaling compared to prior methods.
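For reference, the sketch below shows one common way to compute pass@k in this multi-attempt setting: a task counts as solved if any of its first k attempts succeeds. This interpretation is an assumption here and may differ from the paper's exact evaluation code.

```python
import numpy as np

def pass_at_k(success_matrix, k):
    """success_matrix: (num_tasks, num_attempts) array of 0/1 outcomes per attempt."""
    solved_within_k = np.any(success_matrix[:, :k], axis=1)  # solved in at least one of first k attempts
    return solved_within_k.mean()

# Toy example: 4 tasks, 3 attempts each.
outcomes = np.array([[0, 1, 0],
                     [0, 0, 0],
                     [1, 1, 1],
                     [0, 0, 1]])
print(pass_at_k(outcomes, 1))  # 0.25
print(pass_at_k(outcomes, 3))  # 0.75
```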
We quantify exploration by estimating the entropy of the empirical distribution over unique sampled trajectories. We observe that the base model exhibits the highest entropy, while standard RL converges to deterministic behavior. In contrast, LaMer preserves significantly higher diversity than the RL baselines, allowing for more exploration at test time.
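A minimal sketch of this diversity measure is shown below: the Shannon entropy of the empirical distribution over unique sampled trajectories. The string serialization of trajectories is an assumption made for illustration.

```python
from collections import Counter
import math

def trajectory_entropy(trajectories):
    """Shannon entropy (in nats) of the empirical distribution over unique trajectories.

    trajectories: list of hashable trajectory representations (e.g. action strings).
    """
    counts = Counter(trajectories)
    n = len(trajectories)
    probs = [c / n for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# Toy example: 8 sampled trajectories, 3 unique action sequences.
samples = ["LLUR", "LLUR", "LLUR", "LLUR", "LURD", "LURD", "RDDL", "RDDL"]
print(trajectory_entropy(samples))  # ~1.04 nats; higher means more diverse behavior
```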
We evaluate the trained agents on harder variants of the Sokoban and MineSweeper tasks. The consistent gap indicates that LaMer trained with Meta-RL not only performs better on the training distribution, but also generalizes better to harder tasks.
We further evaluate out-of-distribution (OOD) generalization using the ALFWorld environment. While standard RL performs well on in-distribution (ID) tasks (>20% improvement over prompting), it struggles on OOD tasks. In contrast, LaMer consistently outperforms RL on both ID and OOD tasks.
@article{jiang2025metarl,
title={Meta-RL Induces Exploration in Language Agents},
author={Yulun Jiang and Liangze Jiang and Damien Teney and Michael Moor and Maria Brbic},
journal={arXiv preprint arXiv:2512.16848},
year={2025}
}