Meta-RL Induces Exploration in
Language Agents

1Department of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
2Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
3Idiap Research Institute, Martigny, Switzerland
*Equal Contribution, Co-Corresponding Authors
LaMer Teaser

LaMer is a Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time.

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to RL-trained agents.

Method

Our goal is to build agents that can actively explore their environment, gather feedback, and leverage this experience for more effective exploitation. To this end, we propose LaMer, a Meta-RL framework that enables LLM agents to explore and adapt at test time. LaMer builds on two key design principles. First, it introduces a cross-episode training scheme that treats each trial as a sequence of episodes, enabling the agent to explore in early episodes and exploit this information in later ones. The agent is trained to maximize the cross-episode cumulative reward. Second, LaMer uses self-reflection as an in-context adaptation mechanism, allowing the agent to summarize past experiences and adjust its strategy accordingly without updating the model parameters, as illustrated in the sketch below.
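To make the cross-episode setup concrete, here is a minimal sketch of one trial. The helpers env.reset, env.step, agent.act, and agent.reflect are hypothetical placeholders for illustration, not the authors' actual API; the key points are that reflections from earlier episodes stay in context, and that the trial-level return is what training optimizes.

# Minimal sketch of a LaMer-style cross-episode trial with reflection.
# `env` and `agent` interfaces are assumed for illustration only.
def run_trial(env, agent, num_episodes=3, max_steps=30):
    """Run one trial as a sequence of episodes; reflections carry experience
    across episodes in context, without any gradient updates."""
    reflections = []      # in-context memory shared across episodes
    total_return = 0.0    # cross-episode cumulative reward (training objective)

    for episode in range(num_episodes):
        obs = env.reset()
        trajectory = []
        for _ in range(max_steps):
            # The agent conditions on the current observation plus all
            # reflections produced in earlier episodes of the same trial.
            action = agent.act(obs, reflections)
            obs, reward, done, info = env.step(action)
            trajectory.append((action, reward))
            total_return += reward
            if done:
                break
        # Summarize the episode into a textual reflection that guides
        # the policy in the next episode (in-context adaptation).
        reflections.append(agent.reflect(trajectory))

    return total_return, reflections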

LaMer Framework

Experimental Results

Overall Performance

Across all three environments, LaMer trained with Meta-RL consistently outperforms both prompting-based baselines and RL training methods on the final pass@3 success rate. LaMer-trained agents show strong gains across attempts: from pass@1 to pass@3, LaMer yields improvements of 13.5% (Sokoban), 30.3% (MineSweeper), and 20.3% (Webshop), substantially larger than those of RL and prompting-based baselines. Overall, the results indicate that LaMer achieves better performance and stronger test-time scaling than prior methods.
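For reference, pass@k here counts a task as solved if any of the first k attempts succeeds. A simple empirical estimate, assuming per-task lists of attempt outcomes (this is an illustrative computation, not the authors' evaluation code):

# Illustrative pass@k computation from per-task attempt outcomes.
# `attempts[i]` is a list of booleans (success/failure) for task i,
# ordered by attempt index.
def pass_at_k(attempts, k):
    """Fraction of tasks solved within the first k attempts."""
    solved = sum(1 for trials in attempts if any(trials[:k]))
    return solved / len(attempts)

# Example: three tasks, three attempts each.
attempts = [[False, True, False], [False, False, False], [True, True, True]]
print(pass_at_k(attempts, 1))  # ~0.33
print(pass_at_k(attempts, 3))  # ~0.67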

Overall Results

Meta-RL Induces Exploration

We quantify exploration by estimating the entropy of the empirical distribution over unique sampled trajectories. We observe that the base model exhibits the highest entropy, while standard RL converges to near-deterministic behavior. In contrast, LaMer preserves significantly higher diversity than the RL baselines, allowing more exploration at test time.
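A small sketch of this metric, computing the Shannon entropy of the empirical distribution over unique trajectories; representing a trajectory as a tuple of actions is an assumption made here for illustration.

# Exploration metric sketch: entropy of the empirical distribution
# over unique sampled trajectories (in nats).
import math
from collections import Counter

def trajectory_entropy(trajectories):
    """Entropy of the empirical distribution over unique trajectories."""
    counts = Counter(tuple(traj) for traj in trajectories)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A policy that always samples the same trajectory has entropy 0;
# more diverse sampling yields higher entropy.
print(trajectory_entropy([("up", "left"), ("up", "left")]))     # 0.0
print(trajectory_entropy([("up", "left"), ("down", "right")]))  # ~0.693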

Exploration Entropy

Generalization to Harder Tasks

We evaluate the trained agents on harder variants of the Sokoban and MineSweeper tasks. The consistent gap indicates that LaMer trained with Meta-RL not only performs better on the training distribution, but also generalizes better to harder tasks.

Harder Tasks

Generalization to Unseen Tasks (ALFWorld)

We further evaluate out-of-distribution (OOD) generalization using the ALFWorld environment. While standard RL performs well on in-distribution (ID) tasks (>20% improvement over prompting), it struggles on OOD tasks. In contrast, LaMer consistently outperforms RL on both ID and OOD tasks.

Unseen Tasks

BibTeX

@article{jiang2025metarl,
  title={Meta-RL Induces Exploration in Language Agents},
  author={Yulun Jiang and Liangze Jiang and Damien Teney and Michael Moor and Maria Brbic},
  journal={arXiv preprint arXiv:2512.16848},
  year={2025}
}