
Why use V(s) instead of Q(s, a) in computing TD error in Actor-Critic agent? #30


Good observation!
That's because, in the Actor-Critic agent recipe, the policy gradient is computed using the (one-step) TD error, so the Critic network only needs to learn the state-value function V(s), whose output is a single scalar (length=1). (TD Actor-Critic)
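For intuition, here's a minimal PyTorch-style sketch of how the scalar critic and the one-step TD error fit together. It is not the book's exact recipe; the `actor`, `critic`, and batch tensor shapes are assumptions (critic returns `(batch, 1)`, `actor(s)` returns a `torch.distributions` object, `r`/`done` are `(batch,)` floats):

```python
import torch
import torch.nn.functional as F

def td_actor_critic_loss(actor, critic, s, a, r, s_next, done, gamma=0.99):
    # Critic outputs a single scalar V(s) per state: shape (batch, 1) -> (batch,)
    v_s = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        # One-step TD target: r + gamma * V(s'), zeroed at terminal states
        td_target = r + gamma * (1.0 - done) * v_next
    td_error = td_target - v_s                        # delta = r + gamma*V(s') - V(s)

    critic_loss = F.mse_loss(v_s, td_target)          # critic regresses V(s) toward the TD target
    log_prob = actor(s).log_prob(a)                   # assumed: actor(s) returns a distribution over actions
    actor_loss = -(log_prob * td_error.detach()).mean()  # TD error acts as the advantage weight
    return actor_loss, critic_loss
```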

There's an equivalent policy gradient form, called Q Actor-Critic, that weights the gradient by the action-value function Q(s, a). In that case the Critic network has to learn Q(s, a), whose output is a vector (length=action_dim). Because Q(s, a) is higher-dimensional, it is harder for a neural network to approximate, and in practice training is often less stable. (Q Actor-Critic)
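A hypothetical Q Actor-Critic counterpart would look like the sketch below, where the critic head outputs a length-action_dim vector and the bootstrap target is SARSA-style. Again, the names and shapes are assumptions (`a`/`a_next` are `(batch,)` integer action indices):

```python
import torch
import torch.nn.functional as F

def q_actor_critic_loss(actor, critic, s, a, r, s_next, a_next, done, gamma=0.99):
    # Critic outputs Q(s, .) for every action: shape (batch, action_dim)
    q_all = critic(s)
    q_sa = q_all.gather(1, a.unsqueeze(-1)).squeeze(-1)      # Q(s, a)
    with torch.no_grad():
        q_next = critic(s_next).gather(1, a_next.unsqueeze(-1)).squeeze(-1)
        td_target = r + gamma * (1.0 - done) * q_next        # SARSA-style bootstrap on Q(s', a')

    critic_loss = F.mse_loss(q_sa, td_target)                # critic regresses Q(s, a) toward the target
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * q_sa.detach()).mean()          # policy gradient weighted by Q(s, a)
    return actor_loss, critic_loss
```

The only structural difference from the first sketch is the critic's output head: one scalar per state versus one value per action, which is exactly why learning Q(s, a) is the harder approximation problem.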

There's yet another equivale…
