Density / Diversity Double Bind

Standard RL is Myopic. Max Entropy is Wasteful.

[Interactive demo: live counters for Return, Diversity, and Waste]

Iterative Mixture Learning

Visualizing the Frank-Wolfe Optimization Loop

1. Current Mixture: the aggregate distribution $\bar{\pi}_{k-1}$ over all policies trained so far.

2. Custom Reward: target the gaps with $r = 1 - d(s)$.

3. New Policy: train $\mu_k$ greedily on this reward, then fold it into the mixture with weight $\lambda_k$.

The Core Logic

Convex Optimization meets Reinforcement Learning

The Objective Function:
Maximize Return while minimizing Concentration (a Gini-style squared-density penalty).
$$ \mathcal{Z}(\bar{\pi}) = \underbrace{\sum_{s\in \mathcal{S}^+} d[\bar{\pi}](s)}_{\text{Return}} - \underbrace{\frac{1}{2} \sum_{s\in \mathcal{S}^+} d[\bar{\pi}](s)^2}_{\text{Concentration}} $$

By subtracting the squared density, we penalize "putting all eggs in one basket." Since this objective is concave, we can use the Frank-Wolfe algorithm to solve it iteratively.
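As a concrete illustration (not from a released codebase; the function names and toy densities are ours), the sketch below evaluates this objective on a density vector over the success states $\mathcal{S}^+$ and shows that its gradient is exactly the gap reward $1 - d(s)$ used in the loop below.

```python
# Minimal sketch: the concave coverage objective and the "gap" reward it
# induces, assuming d is a vector of visitation densities over S+.
import numpy as np

def coverage_objective(d: np.ndarray) -> float:
    """Z(pi_bar) = sum_s d(s) - 0.5 * sum_s d(s)^2 over success states."""
    return d.sum() - 0.5 * np.square(d).sum()

def gap_reward(d: np.ndarray) -> np.ndarray:
    """Gradient of Z with respect to d: r(s) = 1 - d(s), bounded in [0, 1]."""
    return 1.0 - d

# A mixture that dumps all mass on one success state scores worse than one
# that spreads the same mass evenly.
concentrated = np.array([1.0, 0.0, 0.0, 0.0])
uniform      = np.array([0.25, 0.25, 0.25, 0.25])
print(coverage_objective(concentrated))  # 0.5
print(coverage_objective(uniform))       # 0.875
```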

Algorithm 1: DDGC
Input: MDP $\mathcal{M}$
Initialize: Mixture $\bar{\pi}_0$

for k = 1 to K do
  // 1. Estimate current coverage
  $\hat{d}_k(s) \leftarrow$ EstimateDensity($\bar{\pi}_{k-1}$)

  // 2. Compute "Gap" Reward
  $r_k(s) \leftarrow 1 - \hat{d}_k(s)$

  // 3. Solve Offline RL Sub-problem
  $\mu_k \leftarrow$ FittedQIteration($r_k$)

  // 4. Update Mixture (Frank-Wolfe)
  $\bar{\pi}_k \leftarrow (1-\alpha)\bar{\pi}_{k-1} + \alpha\mu_k$
end for
return $\bar{\pi}_K$
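Below is a minimal, self-contained sketch of this loop. The offline-RL sub-problem (FittedQIteration) is replaced by an exact greedy step on a toy abstraction in which each candidate policy concentrates its visitation on a single success state; the step size $\alpha_k = 1/k$, the state count, and the iteration budget are illustrative assumptions, not the paper's settings.

```python
# Toy sketch of the Frank-Wolfe loop in Algorithm 1 (illustrative only).
import numpy as np

n_states = 5                       # |S+|: number of success (goal) states
K = 50                             # number of Frank-Wolfe iterations

d_mixture = np.zeros(n_states)     # visitation density of the mixture pi_bar
d_mixture[0] = 1.0                 # pi_bar_0: a single policy covering state 0

for k in range(1, K + 1):
    # 1. Estimate current coverage (exact here; sampled in practice).
    d_hat = d_mixture

    # 2. Compute the "gap" reward r_k(s) = 1 - d_hat(s).
    r = 1.0 - d_hat

    # 3. Solve the sub-problem: the new policy mu_k maximizes expected
    #    reward, i.e. it concentrates on the least-covered success state.
    best_state = int(np.argmax(r))
    d_new = np.zeros(n_states)
    d_new[best_state] = 1.0

    # 4. Frank-Wolfe mixture update with step size alpha_k = 1/k, so each
    #    policy in the final mixture carries weight roughly 1/K.
    alpha = 1.0 / k
    d_mixture = (1.0 - alpha) * d_mixture + alpha * d_new

print(np.round(d_mixture, 3))      # approaches the uniform density 1/|S+|
```

Because each sub-problem picks the currently least-covered success state, the mixture density drifts toward the uniform distribution over $\mathcal{S}^+$, which is exactly the maximizer of $\mathcal{Z}$.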

Theoretical Guarantees

Full Finite-Sample Convergence Bound

Theorem 3.1: With high probability $1 - (\delta + \delta_d)$, the sub-optimality gap $h(\bar{\pi}_K)$ is bounded by:
$$ \begin{aligned} h(\bar{\pi}_K) \leq \quad & \color{#4285F4}{\mathcal{O}\left(\frac{1}{K}\right)} \\ & + \color{#F4B400}{\mathcal{O}(\gamma^{H}) + \mathcal{O}(\gamma^{N_{\text{FQI}}}) + \mathcal{O}(\epsilon_{\text{approx}})} \\ & + \color{#EA4335}{\mathcal{O}\left(\sqrt{\frac{\log(K\text{dim}_\mathcal{F}/\delta) + \text{dim}_\mathcal{F}\log(HN_T)}{HN_T}}\right)} \\ & + \color{#EA4335}{\mathcal{O}\left(\sqrt{\frac{1}{N_T}\log(K/\delta_d)}\right)} \end{aligned} $$

• Optimization Rate (blue term): Frank-Wolfe convergence, the $\mathcal{O}(1/K)$ term.
• Systematic Bias (yellow terms): compute and model limits ($\gamma^{H}$, $\gamma^{N_{\text{FQI}}}$, $\epsilon_{\text{approx}}$).
• Sample Complexity (red terms): finite-data variance, shrinking as $N_T$ grows.
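For intuition on the blue $\mathcal{O}(1/K)$ term, here is the textbook Frank-Wolfe recursion (a sketch, assuming step size $\alpha_k = 2/(k+2)$ and writing $\epsilon_k$ for the error of the approximate sub-problem solution at iteration $k$, which absorbs the yellow and red terms; the paper's exact constants may differ):
$$ \begin{aligned} h(\bar{\pi}_k) &\le (1-\alpha_k)\, h(\bar{\pi}_{k-1}) + \alpha_k \epsilon_k + \tfrac{1}{2}\alpha_k^2 C_{\mathcal{Z}}, \\ \alpha_k = \tfrac{2}{k+2} &\;\Longrightarrow\; h(\bar{\pi}_K) \le \mathcal{O}\!\left(\tfrac{C_{\mathcal{Z}}}{K}\right) + \mathcal{O}\!\left(\max_k \epsilon_k\right). \end{aligned} $$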

Symbol Definitions
• $K$: Number of Policy Mixture Iterations
• $N_T$: Number of Trajectories (Sample Size)
• $\gamma$: Discount Factor (Time Horizon)
• $H$: Episode Horizon Length
• $N_{\text{FQI}}$: Fitted Q-Iteration Steps
• $\epsilon_{\text{approx}}$: Inherent Bellman Approximation Error
• $\text{dim}_\mathcal{F}$: Pseudo-dimension of Function Class $\mathcal{F}$
• $\delta, \delta_d$: High-probability confidence parameters

Frequently Asked Questions

Common questions regarding settings, baselines, and practical usage.

Why not just use Goal-Conditioned RL (GCRL)?
GCRL assumes you can query a specific goal state (e.g., "Go to coordinate X,Y").

Our setting addresses Discovery: the agent doesn't know what the valid goals are yet. It must explore to find them (using only the oracle success classifier) and then learn a policy that covers them.
Analogy: GCRL is "Drive to this specific address." DDGC is "Find all the good restaurants in the city and visit them equally."
Is the "Oracle Classifier" a realistic assumption?
Yes. This matches the emerging paradigm of RL with Verifiable Rewards (RLVR).
  • Code Gen: It's easy to check if code runs (Oracle), but hard to find all valid solutions.
  • Chemistry: It's easy to simulate stability (Oracle), but hard to enumerate all stable molecules.
Standard RL collapses to the single easiest solution (Mode Collapse). DDGC discovers the full diversity of valid solutions.
Why maximize Gini Index instead of Shannon Entropy?
Two reasons: Stability and Theory.

1. Bounded Rewards: The entropy gradient contains a $\log p$ term that diverges to $-\infty$ as the probability approaches zero, so the induced reward is unbounded. The Gini gradient ($1-p$) always stays in $[0, 1]$, which stabilizes the offline RL step.
2. Convergence: Gini has a bounded curvature constant ($C_\mathcal{Z}$), which is a requirement for our Frank-Wolfe convergence proof (Theorem 3.1).
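To make the boundedness point concrete, here is a small comparison (our own illustration; the densities are made up) of an entropy-style reward proportional to $-\log d(s)$ against the Gini gap reward $1 - d(s)$:

```python
# Sketch: the entropy-style reward blows up near zero density, while the
# Gini gap reward stays in [0, 1]. Values chosen only for illustration.
import numpy as np

densities = np.array([1e-8, 1e-4, 0.1, 0.5, 1.0])

entropy_reward = -np.log(densities)   # unbounded as d -> 0
gini_reward    = 1.0 - densities      # always in [0, 1]

print(entropy_reward)  # approx [18.4, 9.2, 2.3, 0.69, 0.0]
print(gini_reward)     # approx [1.0, 0.9999, 0.9, 0.5, 0.0]
```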
Is training a mixture expensive?
It is comparable to state-of-the-art baselines like SMM.

While we maintain $K$ policies (e.g., $K=8$), we train them sequentially. During inference or training, only one network is active at a time.
Ant Environment (V100 GPU):
• SAC: ~6h 45m
• DDGC: ~8h 31m (comparable to SMM)
How does the Goal Buffer affect the theory?
It actually improves the theoretical bounds.

The theoretical bound depends on the "Concentrability Coefficient" ($C$)—essentially, how well we cover the state space. By using a Goal Buffer and Exploratory Sampling, we ensure the behavior policy has non-zero mass over goal regions, keeping $C$ finite and minimizing the error term in the Offline RL step.
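A hypothetical sketch (not the paper's implementation) of why exploratory sampling over a goal buffer keeps the density ratio, and hence the concentrability coefficient, finite; the mixing fraction, state count, and buffer contents below are invented for illustration.

```python
# Sketch: blend a small amount of uniform sampling over discovered goal
# states into the behavior distribution, so that (roughly)
# C = max_s d_target(s) / d_behavior(s) stays finite on goal regions.
import numpy as np

n_states = 16
goal_buffer = [3, 7, 11]           # indices of discovered goal states (illustrative)
eps_explore = 0.1                  # fraction of exploratory samples (illustrative)

def behavior_density(mixture_density: np.ndarray) -> np.ndarray:
    """Blend the mixture's visitation density with uniform mass on buffered goals."""
    explore = np.zeros(n_states)
    explore[goal_buffer] = 1.0 / len(goal_buffer)
    return (1.0 - eps_explore) * mixture_density + eps_explore * explore

# Even if the current mixture ignores state 11 entirely, the behavior policy
# still visits it, so the ratio d_target / d_behavior stays finite there.
mixture = np.zeros(n_states)
mixture[3] = 1.0
print(behavior_density(mixture)[goal_buffer])  # approx [0.933, 0.033, 0.033]
```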