Combines three breakthroughs:
- truncated importance sampling with bias correction
- stochastic dueling network architectures
- a new trust region policy optimization method
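The first item can be illustrated on a small discrete action set. The sketch below is not the paper's implementation; it only assumes the usual form of the decomposition, where the importance weight rho = pi/mu is clipped at a constant c and the clipped-off mass is recovered by a correction term taken under the target policy pi. The function name `truncated_is_terms` and all inputs are hypothetical, and every action is assumed to have nonzero probability under both policies.

```python
def truncated_is_terms(pi, mu, f, c):
    """Split E_{a~pi}[f(a)] into a truncated term and a bias-correction term.

    pi, mu: target and behaviour action probabilities (assumed all > 0).
    f:      per-action values of the quantity being estimated.
    c:      truncation threshold for the importance weight rho = pi / mu.
    """
    truncated = 0.0   # expectation under mu with weights min(c, rho)
    correction = 0.0  # expectation under pi with weights max(0, 1 - c / rho)
    for p, m, value in zip(pi, mu, f):
        rho = p / m
        truncated += m * min(c, rho) * value
        if rho > c:
            # Only actions whose weight was clipped contribute here.
            correction += p * (1.0 - c / rho) * value
    return truncated, correction


pi = [0.7, 0.2, 0.1]
mu = [0.1, 0.3, 0.6]
f = [1.0, 2.0, 3.0]
truncated, correction = truncated_is_terms(pi, mu, f, c=2.0)
on_policy = sum(p * v for p, v in zip(pi, f))
```

With exact expectations the two terms add up to the on-policy value, so the truncation bounds the weights used on behaviour-policy samples without introducing bias, provided the correction term is included.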
Builds on recent developments in:
- deep neural networks,
- variance reduction techniques,
- the off-policy Retrace algorithm (Munos et al., 2016),
- parallel training of RL agents (A3C; Mnih et al., 2016).
Theoretical result: the Retrace operator can be rewritten in terms of the proposed truncated importance sampling with bias correction technique.
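For reference, Retrace targets along a stored trajectory can be computed with a backward recursion. The sketch below assumes the common form with lambda = 1 and importance weights truncated at 1, i.e. Q_ret(x_t, a_t) = r_t + gamma * min(1, rho_{t+1}) * [Q_ret(x_{t+1}, a_{t+1}) - Q(x_{t+1}, a_{t+1})] + gamma * V(x_{t+1}). The function name `retrace_targets` and its argument layout are hypothetical and are not taken from these notes.

```python
def retrace_targets(rewards, q_values, state_values, rhos, bootstrap_value, gamma=0.99):
    """Backward recursion for Retrace targets on one trajectory segment.

    rewards[t]      : r_t
    q_values[t]     : current estimate Q(x_t, a_t) for the action taken
    state_values[t] : current estimate V(x_t)
    rhos[t]         : importance weight pi(a_t | x_t) / mu(a_t | x_t)
    bootstrap_value : V of the state after the last step (0.0 if terminal)
    """
    targets = [0.0] * len(rewards)
    q_ret = bootstrap_value
    for t in reversed(range(len(rewards))):
        q_ret = rewards[t] + gamma * q_ret
        targets[t] = q_ret
        # Truncated weight limits how much of the off-policy return is trusted;
        # the remainder falls back on the current value estimate.
        q_ret = min(1.0, rhos[t]) * (q_ret - q_values[t]) + state_values[t]
    return targets


targets = retrace_targets(
    rewards=[1.0, 1.0],
    q_values=[0.0, 0.0],
    state_values=[0.0, 0.0],
    rhos=[1.0, 1.0],
    bootstrap_value=0.0,
    gamma=0.5,
)
```

When every weight is at least 1 and the value estimates are zero, the recursion reduces to the ordinary discounted return; when a weight is 0, the return beyond that step is replaced by the value estimate.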