Learning World Value Functions (WVFs) without Exploration

This work demonstrates that World Value Functions (WVFs) can be learned purely from static datasets using offline RL, eliminating the need for environment interaction.

What are World Value Functions?

WVFs extend standard value functions to the multi-goal setting. Instead of learning $Q(s, a)$ for a single task, WVFs learn $\bar{Q}(s, g, a)$: the value of taking action $a$ in state $s$ to reach goal $g$.

This enables an agent to solve any goal-reaching task in its environment by reusing learned knowledge.
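
For intuition, here is a minimal sketch of how a WVF might be parameterised in practice: a single network that takes both the state and the goal as input and outputs Q-values over a discrete action set. The class name, layer sizes, and flat vector inputs are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class WVFNetwork(nn.Module):
    """Goal-conditioned Q-network: maps a (state, goal) pair to Q-values over actions."""

    def __init__(self, state_dim: int, goal_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Conditioning on the goal turns one network into Q-bar(s, g, a) for every goal.
        return self.net(torch.cat([state, goal], dim=-1))
```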

The Offline RL Problem

Standard RL optimizes:

$$\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$$

Offline RL constrains this to a fixed dataset $\mathcal{D}$:

$$\pi^* = \arg\max_\pi \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

No new environment interactions allowed.
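
Concretely, the learner only ever samples minibatches from the logged transitions; there is no `env.step()` anywhere in the training loop. A minimal sketch of such a fixed dataset (class and field names are illustrative):

```python
import numpy as np

class StaticTransitionDataset:
    """A fixed dataset D of (s, a, r, s', done) tuples; no environment interaction."""

    def __init__(self, states, actions, rewards, next_states, dones):
        self.states = np.asarray(states)
        self.actions = np.asarray(actions)
        self.rewards = np.asarray(rewards)
        self.next_states = np.asarray(next_states)
        self.dones = np.asarray(dones)

    def sample(self, batch_size: int, rng: np.random.Generator):
        # Training only ever draws minibatches from the logged transitions.
        idx = rng.integers(0, len(self.states), size=batch_size)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])
```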

Algorithms Tested

Discrete domain (Boxman): Offline DQN and discrete BCQ
Continuous domain (Panda-Gym): Continuous BCQ

The Q-learning update was modified for WVFs, iterating over all known goals $g' \in G$ for each transition:

$$\delta = \left[\bar{r} + \gamma \max_{a'} \bar{Q}(s', g', a'; \theta')\right] - \bar{Q}(s, g', a; \theta)$$

where $\bar{r} = R_{\text{MIN}}$ if $g' \neq s'$ and $s'$ is terminal, and $\bar{r} = r$ otherwise.
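
A minimal sketch of this relabelled target computation, assuming a discrete-action target network `q_target(states, goals)` (as in the network sketch above), a precomputed boolean mask `goal_reached` indicating which next states achieve which goals, and illustrative values for `GAMMA` and `R_MIN`; the terminal-state bootstrap mask is a standard Q-learning detail, not something stated explicitly above.

```python
import torch

GAMMA = 0.99     # discount factor (assumed value)
R_MIN = -10.0    # illustrative penalty for terminating in a state other than the goal

def wvf_td_targets(q_target, next_states, rewards, dones, goals, goal_reached):
    """Compute the target r_bar + gamma * max_a' Q_bar(s', g', a') for every goal g'.

    q_target:     target network, q_target(states, goals) -> Q-values over actions
    goals:        tensor [num_goals, goal_dim] of all known goals
    goal_reached: bool tensor [batch, num_goals], True where s' achieves goal g'
    dones:        float tensor [batch], 1.0 if s' is terminal
    """
    batch, num_goals = goal_reached.shape
    targets = torch.zeros(batch, num_goals)
    for j in range(num_goals):  # iterate over all known goals g' for each transition
        g = goals[j].unsqueeze(0).expand(batch, -1)
        # r_bar = R_MIN if s' is terminal and is not the goal g', else the logged reward r
        wrong_terminal = dones.bool() & ~goal_reached[:, j]
        r_bar = torch.where(wrong_terminal, torch.full_like(rewards, R_MIN), rewards)
        next_q = q_target(next_states, g).max(dim=-1).values
        # Standard detail: do not bootstrap through terminal next states.
        targets[:, j] = r_bar + GAMMA * (1.0 - dones) * next_q
    return targets
```

The result is a `[batch, num_goals]` matrix of targets that the offline algorithm regresses $\bar{Q}(s, g', a)$ toward for the logged action $a$.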

Core insight: Large, diverse datasets are critical for learning WVFs offline. Performance improves substantially with more data across both domains.

Why It Matters

Offline learning of WVFs enables: