2021-04-16-Fri-Study

» study

RL

DQN

Stabilizes Q-function learning through experience replay and a frozen target network.
    It operates in a discrete action space.
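
A minimal sketch of the two stabilization tricks, assuming a small PyTorch setup; `q_net`, `target_net`, and the replay buffer are illustrative names, not a full DQN implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))        # online Q-network
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # frozen copy
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                                                # experience replay buffer
gamma = 0.99

def dqn_update(batch_size=32):
    # Sample a decorrelated minibatch of (s, a, r, s', done) from past experience.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # TD target computed with the frozen target network
        target = r.float() + gamma * target_net(s2).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```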

DPG

The policy is modeled to make deterministic decisions.
    Since the policy outputs a single action rather than a distribution, the gradient can be computed directly through the policy function.
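
A minimal sketch of the deterministic policy gradient step, assuming a PyTorch actor `mu` and critic `q_fn` (illustrative names): the gradient flows from Q through the single deterministic action back into the actor parameters.

```python
import torch
import torch.nn as nn

mu = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())  # deterministic actor
q_fn = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))       # critic Q(s, a)
actor_opt = torch.optim.Adam(mu.parameters(), lr=1e-3)

states = torch.randn(32, 3)                      # batch of states (placeholder data)
actions = mu(states)                             # one deterministic action per state
actor_loss = -q_fn(torch.cat([states, actions], dim=1)).mean()  # ascend Q via dQ/da * da/dtheta
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```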

DDPG

A model-free, off-policy actor-critic algorithm that combines DPG with DQN.
    While DQN originally operates only in a discrete action space, DDPG uses an actor-critic framework.
        With a deterministic policy, the approach is extended to a continuous action space.

Noise can be added for better exploration.

Soft updates are applied to the actor and critic target networks, respectively.
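
A small sketch of the exploration noise and the soft (Polyak) target updates; the networks and coefficients below are assumed placeholders, not a complete DDPG agent.

```python
import copy

import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
tau = 0.005  # soft-update coefficient

def act(state, noise_std=0.1):
    # Deterministic action plus Gaussian noise for exploration.
    with torch.no_grad():
        a = actor(state)
    return (a + noise_std * torch.randn_like(a)).clamp(-1.0, 1.0)

def soft_update(net, target_net):
    # target <- tau * online + (1 - tau) * target, applied to actor and critic targets separately.
    for p, tp in zip(net.parameters(), target_net.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)

soft_update(actor, actor_target)
soft_update(critic, critic_target)
```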

D4PG

Several improvements are applied to bring the distributional concept into DDPG.
1. Distributional Critic
2. N-step returns (a small sketch follows this list)
3. Multiple Distributed Parallel Actors
4. Prioritized Experience Replay
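
A small sketch of the N-step return used for the critic target; the reward list and bootstrap value are placeholder inputs, not D4PG's full pipeline.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Sum of discounted rewards over N steps plus a discounted bootstrap value."""
    g = 0.0
    for i, r in enumerate(rewards):          # rewards = [r_t, ..., r_{t+N-1}]
        g += (gamma ** i) * r
    return g + (gamma ** len(rewards)) * bootstrap_value  # + gamma^N * value at s_{t+N}

# Example: 3-step return with a bootstrap estimate of 5.0
print(n_step_return([1.0, 0.0, 2.0], bootstrap_value=5.0))
```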

MADDPG

Expands DDPG to an environment in which agents cooperate with each other,
    so that multiple agents can carry out tasks using only local information.
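
A sketch of the execution side only, assuming the usual MADDPG setup in which each agent keeps its own DDPG-style actor and acts from its local observation; the sizes and modules are illustrative, and training is omitted.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 3, 8, 2
actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
          for _ in range(n_agents)]

local_obs = [torch.randn(obs_dim) for _ in range(n_agents)]   # each agent's own observation
with torch.no_grad():
    actions = [actor(obs) for actor, obs in zip(actors, local_obs)]  # decentralized execution
```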

REINFORCE Algorithm

Collect data for one episode with the current policy,
    update the parameters with that data,
        use the updated policy to gather the experience of the next episode,
            and iterate, reinforcing with that data each time.

*Repeat until the parameters stop changing or the performance no longer improves.
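
A minimal REINFORCE sketch, assuming a Gym-style `env` and a small PyTorch policy: one episode is collected with the current policy, the parameters are updated, and the loop is repeated with the updated policy.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def run_one_iteration(env):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:                                  # 1) collect one episode with the current policy
        dist = torch.distributions.Categorical(
            logits=policy(torch.tensor(state, dtype=torch.float32)))
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    returns, g = [], 0.0                             # 2) discounted return G_t for each step
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()  # 3) ascend E[log pi * G]
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```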

Actor-Critic

Q Actor-Critic

An actor-critic method that learns the policy network and the value network together.
    The critic is updated in the direction of learning the value of the current policy,
    and the actor strengthens or weakens actions based on the critic's Q evaluation.
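
A sketch of one Q Actor-Critic update for a single transition (s, a, r, s', a'), assuming small PyTorch networks: the critic moves toward a TD target under the current policy, and the actor is strengthened or weakened in proportion to the critic's Q evaluation.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # actor: action logits
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))    # critic: Q(s, .)
actor_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def update(s, a, r, s2, a2):
    # Critic: learn the value of the current policy with a TD target.
    q_sa = q_net(s)[a]
    with torch.no_grad():
        td_target = r + gamma * q_net(s2)[a2]
    critic_loss = (q_sa - td_target) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: strengthen or weaken log pi(a|s) in proportion to the critic's Q.
    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    actor_loss = -log_prob * q_net(s)[a].detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```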

Advantage Actor-Critic

Compared to the Q Actor-Critic, it reduces the variance of the gradient estimate,
    which helps learning efficiency.

However, it requires three neural networks (policy, Q, and V).
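
A sketch of the advantage-based actor step with the three networks the note mentions (policy, Q, V); the shapes and networks are assumed placeholders. Subtracting the baseline V from Q is what lowers the variance of the gradient estimate.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))    # Q(s, .)
v_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))    # V(s)
actor_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def actor_step(s, a):
    with torch.no_grad():
        advantage = q_net(s)[a] - v_net(s).squeeze()   # A(s, a) = Q(s, a) - V(s)
    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    actor_loss = -log_prob * advantage                 # lower-variance policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```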

TD Actor-Critic

Needs one network fewer than the Advantage Actor-Critic (the Q network), since the TD error serves as the advantage estimate.
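
A sketch of a TD Actor-Critic update for a single transition (s, a, r, s'), assuming small PyTorch networks: the TD error computed from V alone plays the role of the advantage, so the separate Q network can be dropped.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
v_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)
gamma = 0.99

def update(s, a, r, s2):
    td_error = r + gamma * v_net(s2).detach() - v_net(s)     # delta = r + gamma*V(s') - V(s)
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    actor_loss = -log_prob * td_error.detach().squeeze()     # delta replaces the advantage
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```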