Subtracting the squared density penalizes "putting all eggs in one basket": a density concentrated in a few places scores worse than one that is spread out. Because the resulting objective is concave, we can maximize it iteratively with the Frank-Wolfe algorithm.
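
As a rough illustration of the iteration described above, here is a minimal sketch of Frank-Wolfe on a toy version of the problem. It makes several assumptions that are not taken from the text: the density `d` is a plain probability vector over a finite set, the objective is `sum(r[i] * d[i] - lam * d[i]**2)` with a hypothetical linear term `reward` and penalty weight `lam`, and the linear maximization step simply picks the best vertex of the simplex. In the full setting that step would instead be a policy-optimization call, which this sketch does not model.

```python
def objective(d, reward, lam=1.0):
    """Toy concave objective: linear reward minus a squared-density penalty."""
    return sum(r * p - lam * p * p for r, p in zip(reward, d))


def frank_wolfe_simplex(reward, lam=1.0, num_iters=200):
    """Maximize the toy objective over the probability simplex.

    `reward`, `lam`, and `num_iters` are illustrative names, not the
    document's notation (num_iters plays the role of K).
    """
    n = len(reward)
    # Start at a vertex of the simplex: all mass in one "basket".
    d = [1.0] + [0.0] * (n - 1)
    for k in range(num_iters):
        # Gradient of the objective at the current density.
        grad = [reward[i] - 2.0 * lam * d[i] for i in range(n)]
        # Linear maximization oracle: the vertex with the largest gradient entry.
        best = max(range(n), key=lambda i: grad[i])
        # Standard diminishing step size 2 / (k + 2).
        step = 2.0 / (k + 2.0)
        # Mix the current density with the chosen vertex.
        d = [(1.0 - step) * p for p in d]
        d[best] += step
    return d


if __name__ == "__main__":
    # With no reward term, the penalty alone pushes the density toward uniform.
    d = frank_wolfe_simplex([0.0, 0.0, 0.0, 0.0])
    print([round(p, 3) for p in d])
```

Each iterate is a convex combination of vertices, which mirrors the idea of building a mixture over `num_iters` steps. For this toy objective the standard Frank-Wolfe analysis gives a suboptimality gap of at most `8 * lam / (num_iters + 2)`, so the gap shrinks on the order of one over the number of iterations.
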

- Frank-Wolfe convergence.
- Compute and model limits.
- Finite data variance.

| Symbol | Definition |
|---|---|
| $K$ | Number of Policy Mixture Iterations |
| $N_T$ | Number of Trajectories (Sample Size) |
| $\gamma$ | Discount Factor (Effective Horizon $\approx 1/(1-\gamma)$) |
| $H$ | Episode Horizon Length |
| $N_{\text{FQI}}$ | Fitted Q-Iteration Steps |
| $\epsilon_{\text{approx}}$ | Inherent Bellman Approximation Error |
| $\text{dim}_\mathcal{F}$ | Pseudo-dimension of Function Class $\mathcal{F}$ |
| $\delta, \delta_d$ | High-probability confidence parameters |