This work demonstrates that World Value Functions (WVFs) can be learned purely from static datasets using offline RL, eliminating the need for environment interaction.
What are World Value Functions?
WVFs extend standard value functions to the multi-goal setting. Instead of learning $Q(s, a)$ for a single task, WVFs learn $Q(s, g, a)$: the value of taking action $a$ in state $s$ to reach goal $g$.
This enables an agent to solve any goal-reaching task in its environment by reusing learned knowledge.
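As a rough illustration, a WVF can be represented as a goal-conditioned Q-network that takes both the state and the goal as input. The sketch below is a minimal example under assumed dimensions and layer sizes; the `WVFNetwork` name and architecture are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class WVFNetwork(nn.Module):
    """Goal-conditioned Q-network approximating Q(s, g, a) for discrete actions."""

    def __init__(self, state_dim: int, goal_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per action for the given (state, goal)
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Condition on the goal by concatenating it to the state.
        return self.net(torch.cat([state, goal], dim=-1))
```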
The Offline RL Problem
Standard RL optimizes the expected discounted return:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]$$

Offline RL constrains this optimization to a fixed dataset of logged transitions:

$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$$
No new environment interactions allowed.
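A minimal sketch of what a fixed dataset means in practice: the learner only ever samples minibatches from a static transition buffer. The buffer layout and the `OfflineDataset` name below are assumptions for illustration:

```python
import numpy as np

class OfflineDataset:
    """Static buffer of logged transitions (s, a, r, s', done).

    The agent samples from this buffer but never appends to it:
    no new environment interaction takes place during training.
    """

    def __init__(self, states, actions, rewards, next_states, dones):
        self.states = np.asarray(states)
        self.actions = np.asarray(actions)
        self.rewards = np.asarray(rewards)
        self.next_states = np.asarray(next_states)
        self.dones = np.asarray(dones)

    def sample(self, batch_size: int, rng=None):
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(self.states), size=batch_size)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])
```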
Algorithms Tested
- Discrete domain (Boxman): Offline DQN and discrete BCQ (the BCQ action constraint is sketched after this list)
- Continuous domain (Panda-Gym): Continuous BCQ
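For context, discrete BCQ restricts the greedy action to those the behavior policy plausibly took in that state. The sketch below shows the standard thresholding rule from discrete BCQ, not the paper's exact implementation; `q_values`, `behavior_probs`, and `tau` are assumed inputs:

```python
import torch

def bcq_select_action(q_values: torch.Tensor,
                      behavior_probs: torch.Tensor,
                      tau: float = 0.3) -> torch.Tensor:
    """Discrete BCQ action selection.

    q_values:       (batch, n_actions) Q-estimates
    behavior_probs: (batch, n_actions) probabilities from a behavior-cloning head
    tau:            threshold relative to the most likely behavior action
    """
    # Keep only actions whose behavior probability is within tau of the best one.
    allowed = behavior_probs / behavior_probs.max(dim=-1, keepdim=True).values > tau
    masked_q = q_values.masked_fill(~allowed, float("-inf"))
    return masked_q.argmax(dim=-1)
```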
The Q-learning update was modified for WVFs, iterating over all known goals $g$ for each transition $(s, a, r, s')$:

$$Q(s, g, a) \leftarrow Q(s, g, a) + \alpha \Big[ \bar{r} + \gamma \max_{a'} Q(s', g, a') - Q(s, g, a) \Big]$$

where $\bar{r} = \bar{r}_{\text{MIN}}$ if $s' \neq g$ and $s'$ is terminal, and $\bar{r} = r$ otherwise.
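A minimal sketch of this modified update, assuming a tabular $Q$ indexed by (state, goal) and a set of known goals collected from the data; the function name, `r_min` value, and hyperparameters are illustrative:

```python
import numpy as np

def wvf_q_update(Q, transition, goal_space, r_min=-10.0, alpha=0.1, gamma=0.99):
    """One WVF Q-learning step: replay the transition against every known goal.

    Q:          mapping (state, goal) -> np.ndarray of per-action values,
                e.g. a collections.defaultdict returning zeros for unseen pairs
    transition: (s, a, r, s_next, done) sampled from the offline dataset
    goal_space: the set of goals discovered so far
    r_min:      large negative extended reward for terminating away from the goal
    """
    s, a, r, s_next, done = transition
    for g in goal_space:
        # Extended reward: penalise reaching a terminal state that is not goal g.
        r_bar = r_min if (done and s_next != g) else r
        # Bootstrap only on non-terminal transitions.
        target = r_bar if done else r_bar + gamma * np.max(Q[(s_next, g)])
        Q[(s, g)][a] += alpha * (target - Q[(s, g)][a])
```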
Core insight: Large, diverse datasets are critical for learning WVFs offline. Performance improves substantially with more data across both domains.
Why It Matters
Offline learning of WVFs enables:
- Sample-efficient multi-goal learning from logged data
- Deployment in domains where online exploration is costly/dangerous
- Knowledge reuse via logical composition of learned goals (sketched below)
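As an example of that last point, the WVF literature composes learned tasks with element-wise max and min over their value functions (disjunction and conjunction). The sketch below illustrates that general technique on arrays of shape (n_goals, n_actions) for a single state; it is not this paper's code:

```python
import numpy as np

def compose_or(q_a: np.ndarray, q_b: np.ndarray) -> np.ndarray:
    """Task A OR task B: element-wise maximum of the two WVFs."""
    return np.maximum(q_a, q_b)

def compose_and(q_a: np.ndarray, q_b: np.ndarray) -> np.ndarray:
    """Task A AND task B: element-wise minimum of the two WVFs."""
    return np.minimum(q_a, q_b)

def greedy_action(q_composed: np.ndarray) -> int:
    """Act greedily over goals and actions for the composed task."""
    _, action = np.unravel_index(np.argmax(q_composed), q_composed.shape)
    return int(action)
```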
