This work demonstrates that World Value Functions (WVFs) can be learned purely from static datasets using offline RL, eliminating the need for environment interaction.
What are World Value Functions?
WVFs extend standard value functions to the multi-goal setting. Instead of learning $Q(s, a)$ for a single task, WVFs learn $Q(s, g, a)$: the value of taking action $a$ in state $s$ to reach goal $g$.
This enables an agent to solve any goal-reaching task in its environment by reusing learned knowledge.
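As a rough illustration, a WVF can be represented as a goal-conditioned Q-network that takes both the state and the goal as input. The sketch below is a minimal example under assumed dimensions and layer sizes; the `WVFNetwork` name and architecture are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class WVFNetwork(nn.Module):
    """Goal-conditioned Q-network approximating Q(s, g, a) for discrete actions."""

    def __init__(self, state_dim: int, goal_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per action for the given (state, goal)
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Condition on the goal by concatenating it to the state.
        return self.net(torch.cat([state, goal], dim=-1))
```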
The Offline RL Problem
Standard RL optimizes the expected discounted return:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]$$

Offline RL constrains this optimization to a fixed dataset of logged transitions:

$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$$
No new environment interactions allowed.
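A minimal sketch of what a fixed dataset means in practice: the learner only ever samples minibatches from a static transition buffer. The buffer layout and the `OfflineDataset` name below are assumptions for illustration:

```python
import numpy as np

class OfflineDataset:
    """Static buffer of logged transitions (s, a, r, s', done).

    The agent samples from this buffer but never appends to it:
    no new environment interaction takes place during training.
    """

    def __init__(self, states, actions, rewards, next_states, dones):
        self.states = np.asarray(states)
        self.actions = np.asarray(actions)
        self.rewards = np.asarray(rewards)
        self.next_states = np.asarray(next_states)
        self.dones = np.asarray(dones)

    def sample(self, batch_size: int, rng=None):
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(self.states), size=batch_size)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])
```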
Algorithms Tested
- Discrete domain (Boxman): Offline DQN and discrete BCQ (the BCQ action constraint is sketched after this list)
- Continuous domain (Panda-Gym): Continuous BCQ
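For context, discrete BCQ restricts the greedy action to those the behavior policy plausibly took in that state. The sketch below shows the standard thresholding rule from discrete BCQ, not the paper's exact implementation; `q_values`, `behavior_probs`, and `tau` are assumed inputs:

```python
import torch

def bcq_select_action(q_values: torch.Tensor,
                      behavior_probs: torch.Tensor,
                      tau: float = 0.3) -> torch.Tensor:
    """Discrete BCQ action selection.

    q_values:       (batch, n_actions) Q-estimates
    behavior_probs: (batch, n_actions) probabilities from a behavior-cloning head
    tau:            threshold relative to the most likely behavior action
    """
    # Keep only actions whose behavior probability is within tau of the best one.
    allowed = behavior_probs / behavior_probs.max(dim=-1, keepdim=True).values > tau
    masked_q = q_values.masked_fill(~allowed, float("-inf"))
    return masked_q.argmax(dim=-1)
```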
The Q-learning update was modified for WVFs, iterating over all known goals $g$ for each transition $(s, a, r, s')$:

$$Q(s, g, a) \leftarrow Q(s, g, a) + \alpha \Big[ \bar{r} + \gamma \max_{a'} Q(s', g, a') - Q(s, g, a) \Big]$$

where $\bar{r} = \bar{r}_{\text{MIN}}$ if $s' \neq g$ and $s'$ is terminal, and $\bar{r} = r$ otherwise.
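A minimal sketch of this modified update, assuming a tabular $Q$ indexed by (state, goal) and a set of known goals collected from the data; the function name, `r_min` value, and hyperparameters are illustrative:

```python
import numpy as np

def wvf_q_update(Q, transition, goal_space, r_min=-10.0, alpha=0.1, gamma=0.99):
    """One WVF Q-learning step: replay the transition against every known goal.

    Q:          mapping (state, goal) -> np.ndarray of per-action values,
                e.g. a collections.defaultdict returning zeros for unseen pairs
    transition: (s, a, r, s_next, done) sampled from the offline dataset
    goal_space: the set of goals discovered so far
    r_min:      large negative extended reward for terminating away from the goal
    """
    s, a, r, s_next, done = transition
    for g in goal_space:
        # Extended reward: penalise reaching a terminal state that is not goal g.
        r_bar = r_min if (done and s_next != g) else r
        # Bootstrap only on non-terminal transitions.
        target = r_bar if done else r_bar + gamma * np.max(Q[(s_next, g)])
        Q[(s, g)][a] += alpha * (target - Q[(s, g)][a])
```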
Core insight: Large, diverse datasets are critical for learning WVFs offline. Performance improves substantially with more data across both domains.
Why It Matters
Offline learning of WVFs enables:
- Sample-efficient multi-goal learning from logged data
- Deployment in domains where online exploration is costly/dangerous
- Knowledge reuse via logical composition of learned goals (sketched below)
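As an example of that last point, the WVF literature composes learned tasks with element-wise max and min over their value functions (disjunction and conjunction). The sketch below illustrates that general technique on arrays of shape (n_goals, n_actions) for a single state; it is not this paper's code:

```python
import numpy as np

def compose_or(q_a: np.ndarray, q_b: np.ndarray) -> np.ndarray:
    """Task A OR task B: element-wise maximum of the two WVFs."""
    return np.maximum(q_a, q_b)

def compose_and(q_a: np.ndarray, q_b: np.ndarray) -> np.ndarray:
    """Task A AND task B: element-wise minimum of the two WVFs."""
    return np.minimum(q_a, q_b)

def greedy_action(q_composed: np.ndarray) -> int:
    """Act greedily over goals and actions for the composed task."""
    _, action = np.unravel_index(np.argmax(q_composed), q_composed.shape)
    return int(action)
```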
