RL

Policy Gradient Algorithm

Directly model and optimize the policy
    The policy is modeled as a function of a set of parameters, commonly written θ.
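
As a minimal sketch of what "a function of parameters" can look like, the snippet below uses a linear-softmax policy. The names softmax_policy, theta, and state, as well as the sizes (3 actions, 4 state features), are assumptions made for illustration and do not come from these notes.

```python
import numpy as np

def softmax_policy(theta, state):
    """Return action probabilities pi_theta(. | state) for a linear-softmax policy."""
    logits = theta @ state            # one logit per action
    logits = logits - logits.max()    # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Hypothetical sizes: 3 actions, 4 state features.
theta = np.zeros((3, 4))
state = np.array([1.0, 0.5, -0.2, 0.0])
probs = softmax_policy(theta, state)  # probability of each action in this state
```

Optimizing the policy then means adjusting theta so that actions leading to higher return become more probable.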

On-Policy Policy Gradient

Training samples are collected with the target policy itself, that is, the same policy you are trying to optimize.
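
One common on-policy method is REINFORCE. The sketch below is a hedged illustration of a single update, assuming a linear-softmax policy and one episode sampled with the current parameters. The function name reinforce_update and the defaults lr and gamma are hypothetical choices, not something these notes specify.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()         # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, lr=0.1, gamma=0.99):
    """One REINFORCE step from a single episode sampled with the current policy."""
    # Discounted return G_t for every time step, computed backwards.
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g

    grad = np.zeros_like(theta)
    for s, a, g_t in zip(states, actions, returns):
        probs = softmax(theta @ s)
        one_hot = np.zeros(theta.shape[0])
        one_hot[a] = 1.0
        # Gradient of log pi_theta(a|s) for a linear-softmax policy.
        grad += g_t * np.outer(one_hot - probs, s)

    return theta + lr * grad          # gradient ascent on the expected return
```

Because the gradient estimate is only valid for samples drawn from the current policy, the episode is discarded after the update and a new one is collected.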

Off-Policy Policy Gradient

Full trajectories are not required, so samples from past episodes can be reused, which improves sample efficiency.
The behavior policy used to collect samples is a known policy, denoted B(a|s).
    The objective function sums the reward over the state distribution defined by the behavior policy.
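
Because the samples come from B(a|s) rather than from the policy being optimized, each sample is usually reweighted by the importance ratio pi_theta(a|s) / B(a|s). The sketch below illustrates that idea under several assumptions: a linear-softmax policy, action-value estimates q_values supplied by the caller, and a hypothetical function name off_policy_gradient.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()         # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def off_policy_gradient(theta, states, actions, q_values, behavior_probs):
    """Importance-weighted policy gradient estimate from samples collected under B(a|s)."""
    grad = np.zeros_like(theta)
    for s, a, q, b_prob in zip(states, actions, q_values, behavior_probs):
        probs = softmax(theta @ s)
        weight = probs[a] / b_prob    # importance weight pi_theta(a|s) / B(a|s)
        one_hot = np.zeros(theta.shape[0])
        one_hot[a] = 1.0
        # Weighted score-function term: weight * Q(s, a) * grad log pi_theta(a|s).
        grad += weight * q * np.outer(one_hot - probs, s)
    return grad / len(states)         # average over the reused samples
```

The stored behavior_probs are what make the reuse possible: the behavior policy has to be known so that the ratio can be computed for every stored sample.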

Value-based agent

Learning value-based agents with a neural net
    1. With the policy held fixed, learn the value function Vpi(s) of that policy with a neural net

* To train a neural net, you need to define a loss function that measures the difference between the prediction and the correct answer.
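
A common choice for such a loss is the mean squared error between the predicted value and a target value. In the sketch below, a linear model stands in for the neural net to keep the example dependency-free, and the targets are assumed to be returns observed under the fixed policy. The names value_loss and value_gradient_step and all the numbers are hypothetical.

```python
import numpy as np

def value_loss(w, states, targets):
    """Mean squared error between predictions V(s) = states @ w and the targets."""
    preds = states @ w
    return np.mean((preds - targets) ** 2)

def value_gradient_step(w, states, targets, lr=0.1):
    """One gradient-descent step on the mean squared error."""
    preds = states @ w
    grad = 2.0 * states.T @ (preds - targets) / len(targets)
    return w - lr * grad

# Toy data: two one-hot encoded states and their assumed target values.
states = np.array([[1.0, 0.0], [0.0, 1.0]])
targets = np.array([1.0, 2.0])

w = np.zeros(2)
for _ in range(200):
    w = value_gradient_step(w, states, targets)
```

With a real neural net the structure is the same: compute the prediction, measure the squared error against the target, and let the optimizer follow the gradient of that loss.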