Density / Diversity Double Bind

Standard RL is Myopic. Max Entropy is Wasteful.

[Interactive demo: live counters for Return, Diversity, and Waste]

Iterative Mixture Learning

Visualizing the Frank-Wolfe Optimization Loop

1. Current Mixture: the aggregate distribution $\bar{\pi}_{k-1}$ over all policies trained so far.

2. Custom Reward: target the gaps with $r = 1 - d(s)$.

3. New Policy: train $\mu_k$ greedily on this reward, then fold it into the mixture with weight $\lambda_k$.

The Core Logic

Convex Optimization meets Reinforcement Learning

The Objective Function:
Maximize Return while minimizing Concentration (a Gini-style squared-density penalty).
$$ \mathcal{Z}(\bar{\pi}) = \underbrace{\sum_{s\in \mathcal{S}^+} d[\bar{\pi}](s)}_{\text{Return}} - \underbrace{\frac{1}{2} \sum_{s\in \mathcal{S}^+} d[\bar{\pi}](s)^2}_{\text{Concentration}} $$

By subtracting the squared density, we penalize "putting all eggs in one basket." Since this objective is concave, we can use the Frank-Wolfe algorithm to solve it iteratively.
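As a concrete illustration (not from a released codebase; the function names and toy densities are ours), the sketch below evaluates this objective on a density vector over the success states $\mathcal{S}^+$ and shows that its gradient is exactly the gap reward $1 - d(s)$ used in the loop below.

```python
# Minimal sketch: the concave coverage objective and the "gap" reward it
# induces, assuming d is a vector of visitation densities over S+.
import numpy as np

def coverage_objective(d: np.ndarray) -> float:
    """Z(pi_bar) = sum_s d(s) - 0.5 * sum_s d(s)^2 over success states."""
    return d.sum() - 0.5 * np.square(d).sum()

def gap_reward(d: np.ndarray) -> np.ndarray:
    """Gradient of Z with respect to d: r(s) = 1 - d(s), bounded in [0, 1]."""
    return 1.0 - d

# A mixture that dumps all mass on one success state scores worse than one
# that spreads the same mass evenly.
concentrated = np.array([1.0, 0.0, 0.0, 0.0])
uniform      = np.array([0.25, 0.25, 0.25, 0.25])
print(coverage_objective(concentrated))  # 0.5
print(coverage_objective(uniform))       # 0.875
```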

Algorithm 1: DDGC
Input: MDP $\mathcal{M}$
Initialize: Mixture $\bar{\pi}_0$

for k = 1 to K do
  // 1. Estimate current coverage
  $\hat{d}_k(s) \leftarrow$ EstimateDensity($\bar{\pi}_{k-1}$)

  // 2. Compute "Gap" Reward
  $r_k(s) \leftarrow 1 - \hat{d}_k(s)$

  // 3. Solve Offline RL Sub-problem
  $\mu_k \leftarrow$ FittedQIteration($r_k$)

  // 4. Update Mixture (Frank-Wolfe)
  $\bar{\pi}_k \leftarrow (1-\alpha)\bar{\pi}_{k-1} + \alpha\mu_k$
end for
return $\bar{\pi}_K$
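Below is a minimal, self-contained sketch of this loop. The offline-RL sub-problem (FittedQIteration) is replaced by an exact greedy step on a toy abstraction in which each candidate policy concentrates its visitation on a single success state; the step size $\alpha_k = 1/k$, the state count, and the iteration budget are illustrative assumptions, not the paper's settings.

```python
# Toy sketch of the Frank-Wolfe loop in Algorithm 1 (illustrative only).
import numpy as np

n_states = 5                       # |S+|: number of success (goal) states
K = 50                             # number of Frank-Wolfe iterations

d_mixture = np.zeros(n_states)     # visitation density of the mixture pi_bar
d_mixture[0] = 1.0                 # pi_bar_0: a single policy covering state 0

for k in range(1, K + 1):
    # 1. Estimate current coverage (exact here; sampled in practice).
    d_hat = d_mixture

    # 2. Compute the "gap" reward r_k(s) = 1 - d_hat(s).
    r = 1.0 - d_hat

    # 3. Solve the sub-problem: the new policy mu_k maximizes expected
    #    reward, i.e. it concentrates on the least-covered success state.
    best_state = int(np.argmax(r))
    d_new = np.zeros(n_states)
    d_new[best_state] = 1.0

    # 4. Frank-Wolfe mixture update with step size alpha_k = 1/k, so each
    #    policy in the final mixture carries weight roughly 1/K.
    alpha = 1.0 / k
    d_mixture = (1.0 - alpha) * d_mixture + alpha * d_new

print(np.round(d_mixture, 3))      # approaches the uniform density 1/|S+|
```

Because each sub-problem picks the currently least-covered success state, the mixture density drifts toward the uniform distribution over $\mathcal{S}^+$, which is exactly the maximizer of $\mathcal{Z}$.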

Theoretical Guarantees

Full Finite-Sample Convergence Bound

Theorem 3.1: With high probability $1 - (\delta + \delta_d)$, the sub-optimality gap $h(\bar{\pi}_K)$ is bounded by:
$$ \begin{aligned} h(\bar{\pi}_K) \leq \quad & \color{#4285F4}{\mathcal{O}\left(\frac{1}{K}\right)} \\ & + \color{#F4B400}{\mathcal{O}(\gamma^{H}) + \mathcal{O}(\gamma^{N_{\text{FQI}}}) + \mathcal{O}(\epsilon_{\text{approx}})} \\ & + \color{#EA4335}{\mathcal{O}\left(\sqrt{\frac{\log(K\text{dim}_\mathcal{F}/\delta) + \text{dim}_\mathcal{F}\log(HN_T)}{HN_T}}\right)} \\ & + \color{#EA4335}{\mathcal{O}\left(\sqrt{\frac{1}{N_T}\log(K/\delta_d)}\right)} \end{aligned} $$

• Optimization Rate (blue term): Frank-Wolfe convergence, the $\mathcal{O}(1/K)$ term.
• Systematic Bias (yellow terms): compute and model limits ($\gamma^{H}$, $\gamma^{N_{\text{FQI}}}$, $\epsilon_{\text{approx}}$).
• Sample Complexity (red terms): finite-data variance, shrinking as $N_T$ grows.
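For intuition on the blue $\mathcal{O}(1/K)$ term, here is the textbook Frank-Wolfe recursion (a sketch, assuming step size $\alpha_k = 2/(k+2)$ and writing $\epsilon_k$ for the error of the approximate sub-problem solution at iteration $k$, which absorbs the yellow and red terms; the paper's exact constants may differ):
$$ \begin{aligned} h(\bar{\pi}_k) &\le (1-\alpha_k)\, h(\bar{\pi}_{k-1}) + \alpha_k \epsilon_k + \tfrac{1}{2}\alpha_k^2 C_{\mathcal{Z}}, \\ \alpha_k = \tfrac{2}{k+2} &\;\Longrightarrow\; h(\bar{\pi}_K) \le \mathcal{O}\!\left(\tfrac{C_{\mathcal{Z}}}{K}\right) + \mathcal{O}\!\left(\max_k \epsilon_k\right). \end{aligned} $$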

Symbol Definitions
• $K$: Number of Policy Mixture Iterations
• $N_T$: Number of Trajectories (Sample Size)
• $\gamma$: Discount Factor (Time Horizon)
• $H$: Episode Horizon Length
• $N_{\text{FQI}}$: Fitted Q-Iteration Steps
• $\epsilon_{\text{approx}}$: Inherent Bellman Approximation Error
• $\text{dim}_\mathcal{F}$: Pseudo-dimension of Function Class $\mathcal{F}$
• $\delta, \delta_d$: High-probability confidence parameters

Frequently Asked Questions

Common questions regarding settings, baselines, and practical usage.

Why not just use Goal-Conditioned RL (GCRL)?
GCRL assumes you can query a specific goal state (e.g., "Go to coordinate X,Y").

Our setting addresses Discovery: the agent doesn't know what the valid goals are yet. It must explore to find them (using only the oracle success classifier) and then learn a policy that covers them.
Analogy: GCRL is "Drive to this specific address." DDGC is "Find all the good restaurants in the city and visit them equally."
Is the "Oracle Classifier" a realistic assumption?
Yes. This matches the emerging paradigm of RL with Verifiable Rewards (RLVR).
  • Code Gen: It's easy to check if code runs (Oracle), but hard to find all valid solutions.
  • Chemistry: It's easy to simulate stability (Oracle), but hard to enumerate all stable molecules.
Standard RL collapses to the single easiest solution (Mode Collapse). DDGC discovers the full diversity of valid solutions.
Why maximize Gini Index instead of Shannon Entropy?
Two reasons: Stability and Theory.

1. Bounded Rewards: The entropy gradient contains a $\log p$ term that diverges to $-\infty$ as the probability approaches zero, so the induced reward is unbounded. The Gini gradient ($1-p$) always stays in $[0, 1]$, which stabilizes the offline RL step.
2. Convergence: Gini has a bounded curvature constant ($C_\mathcal{Z}$), which is a requirement for our Frank-Wolfe convergence proof (Theorem 3.1).
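To make the boundedness point concrete, here is a small comparison (our own illustration; the densities are made up) of an entropy-style reward proportional to $-\log d(s)$ against the Gini gap reward $1 - d(s)$:

```python
# Sketch: the entropy-style reward blows up near zero density, while the
# Gini gap reward stays in [0, 1]. Values chosen only for illustration.
import numpy as np

densities = np.array([1e-8, 1e-4, 0.1, 0.5, 1.0])

entropy_reward = -np.log(densities)   # unbounded as d -> 0
gini_reward    = 1.0 - densities      # always in [0, 1]

print(entropy_reward)  # approx [18.4, 9.2, 2.3, 0.69, 0.0]
print(gini_reward)     # approx [1.0, 0.9999, 0.9, 0.5, 0.0]
```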
Is training a mixture expensive?
It is comparable to state-of-the-art baselines like SMM.

While we maintain $K$ policies (e.g., $K=8$), we train them sequentially. During inference or training, only one network is active at a time.
Ant Environment (V100 GPU):
• SAC: ~6h 45m
• DDGC: ~8h 31m (comparable to SMM)
How does the Goal Buffer affect the theory?
It actually improves the theoretical bounds.

The theoretical bound depends on the "Concentrability Coefficient" ($C$)—essentially, how well we cover the state space. By using a Goal Buffer and Exploratory Sampling, we ensure the behavior policy has non-zero mass over goal regions, keeping $C$ finite and minimizing the error term in the Offline RL step.
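A hypothetical sketch (not the paper's implementation) of why exploratory sampling over a goal buffer keeps the density ratio, and hence the concentrability coefficient, finite; the mixing fraction, state count, and buffer contents below are invented for illustration.

```python
# Sketch: blend a small amount of uniform sampling over discovered goal
# states into the behavior distribution, so that (roughly)
# C = max_s d_target(s) / d_behavior(s) stays finite on goal regions.
import numpy as np

n_states = 16
goal_buffer = [3, 7, 11]           # indices of discovered goal states (illustrative)
eps_explore = 0.1                  # fraction of exploratory samples (illustrative)

def behavior_density(mixture_density: np.ndarray) -> np.ndarray:
    """Blend the mixture's visitation density with uniform mass on buffered goals."""
    explore = np.zeros(n_states)
    explore[goal_buffer] = 1.0 / len(goal_buffer)
    return (1.0 - eps_explore) * mixture_density + eps_explore * explore

# Even if the current mixture ignores state 11 entirely, the behavior policy
# still visits it, so the ratio d_target / d_behavior stays finite there.
mixture = np.zeros(n_states)
mixture[3] = 1.0
print(behavior_density(mixture)[goal_buffer])  # approx [0.933, 0.033, 0.033]
```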