The following points are unclear for Chapter 5:

1- Page 93-94: It is not clear what the difference is between the policy in Figure 5.6(a) ("a given $\epsilon$-greedy policy for $\epsilon = 0$") and the policy in Figure 5.7(a) ("the optimal $\epsilon$-greedy policy for $\epsilon = 0$"). How are these policies obtained?
My guess is that to obtain the policy corresponding to $\epsilon = 0$ in Figure 5.6(a) and the other $\epsilon$-greedy policies in Figure 5.6, the Bellman optimality equation is first solved to obtain the optimal action values, and then $\epsilon$-greedy policies are derived from these action values for different values of $\epsilon$. Otherwise, if Algorithm 5.3 is run with these $\epsilon$ values, there is a risk of local optima and we may not be able to reproduce the results in Figure 5.6, especially the optimal result in Figure 5.6(a).
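For concreteness, here is the construction I have in mind (my own reading and notation, not a quote from the book): first solve the Bellman optimality equation for the optimal action values,

$$
q^*(s,a) = \sum_{r} p(r \mid s,a)\,r + \gamma \sum_{s'} p(s' \mid s,a)\,\max_{a'} q^*(s',a'),
$$

and then spread probability $\epsilon$ over the actions according to

$$
\pi_\epsilon(a \mid s) =
\begin{cases}
1 - \dfrac{|\mathcal{A}(s)|-1}{|\mathcal{A}(s)|}\,\epsilon & \text{if } a = \arg\max_{a'} q^*(s,a'), \\
\dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{otherwise}.
\end{cases}
$$

With $\epsilon = 0$ this reduces to the greedy policy, and for any $\epsilon < 1$ the greedy action keeps the largest probability, which may be what the "consistent" remark in point 2 refers to.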
2- In Figure 5.6, it is said that "These $\epsilon$-greedy policies are consistent with each other in the sense that the actions with the greatest probabilities are the same". No explanation is given as to why this condition is needed.
3- We obtain the results in Figure 5.7 using Algorithm 5.3, right? However, based on the results of Figure 5.7, the book gives the impression that the optimal policy will be obtained if we set $\epsilon = 0$ in the $\epsilon$-greedy scheme of Algorithm 5.3, which may not be correct (especially with a bad initial policy), because we may get stuck at a local optimum.
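For reference, here is a minimal sketch of how I understand Algorithm 5.3 (MC $\epsilon$-greedy) to proceed; the `sample_episode` interface and all names here are my own placeholders, not the book's code:

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Epsilon-greedy distribution over the actions of a single state."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)        # each non-greedy action: eps / |A|
    probs[np.argmax(q_row)] += 1.0 - epsilon  # greedy action: 1 - eps*(|A|-1)/|A|
    return probs

def mc_epsilon_greedy(sample_episode, num_states, num_actions,
                      epsilon=0.1, gamma=0.9, num_iters=500):
    """Every-visit MC with epsilon-greedy improvement (my reading of Alg. 5.3)."""
    q = np.zeros((num_states, num_actions))
    ret_sum = np.zeros_like(q)             # running sum of returns per (s, a)
    ret_cnt = np.zeros_like(q)             # visit counts per (s, a)
    policy = np.full((num_states, num_actions), 1.0 / num_actions)
    for _ in range(num_iters):
        # sample_episode(policy) -> list of (state, action, reward); hypothetical
        episode = sample_episode(policy)
        g = 0.0
        for s, a, r in reversed(episode):  # accumulate returns backwards
            g = gamma * g + r
            ret_sum[s, a] += g             # every-visit update
            ret_cnt[s, a] += 1
            q[s, a] = ret_sum[s, a] / ret_cnt[s, a]
            policy[s] = epsilon_greedy_probs(q[s], epsilon)  # improvement step
    return q, policy
```

With $\epsilon = 0$, `epsilon_greedy_probs` returns a deterministic greedy policy, so state-action pairs never visited under the initial policy keep their initial `q` values, which is exactly why I would expect the iteration to be able to get stuck.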
In my opinion, these discussions are ambiguous and need clarification. In particular, it is not clear how the results in Figures 5.6 and 5.7 are obtained.
4- What does the leftward blue arrow in cell (1,1) of Figure 5.7 denote? Why is it there?