2021-04-15-Thu-Study
Apr 15, 2021
»
study
RL
Policy Gradient Alogirhtm
Directly model and optimize policies
A function consisting of specific parameters,
On-Policy Policy Gradient
Training samples should be collected from a policy that is very similar to the target policy that you want to optimize and obtain.
Off-Policy Policy Gradient
Since full trajectory is not required, trajectory is extracted from the previous episode in terms of sample efficiency and reused.
The behavior policy used to collect samples is "known policy", which is called B(a|s).
Objective Function is the sum of all the rewards calculated in the state distrubution defined by the behavior policy.
Value-based agent
Learning value-based agents using Neural Net
1. When the policy is fixed, learning the value function Vpi(s) of the policy using Neural Net
* To learn a neural net, you need to define a loss function that means the difference between the prediction and the correct answer.