2021-04-12-Mon-Study

» study

RL

Evaluating the value function when you don't know the MDP

Temporal Difference

Monte Carlo has to wait until the episode is over before it can update
    The update needs a return, and the return is only known once the episode ends.
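
As a reminder of what that update looks like, here is a minimal sketch of the incremental MC update (my own illustration; the table V, state s, return G_t, and step size alpha are placeholder names):

def mc_update(V, s, G_t, alpha=0.001):
    # Move V(s) a little toward the observed return G_t
    V[s] = V[s] + alpha * (G_t - V[s])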

TD: let's update one guess using another guess
Target: Rt+1 + gamma*V(St+1), called the TD target


Temporal Difference Learning Algorithm

1. Initialize the table, setting the value of every state to 0
2. Let the agent act from S0 (the start state) to gather experience
Two differences from MC --> the update formula and the update timing
TD target: Rt+1 + gamma*V(St+1), used as the estimate of the return Gt
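
A minimal sketch of that one-step update (my own illustration; V is a table of state values, r the step reward, and all names are placeholders):

def td0_update(V, s, r, s_next, alpha=0.01, gamma=1.0):
    td_target = r + gamma * V[s_next]          # a guess standing in for the return Gt
    V[s] = V[s] + alpha * (td_target - V[s])   # move V(s) toward the TD target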

Monte-Carlo vs Temporal Difference

How to learn the value function for a given (arbitrary) policy without knowing the MDP

When learning happens
    MC: only after the episode ends and the return is decided do we look back over it and update
    TD: the table value is updated immediately, every time a single step elapses

    Episodic MDP: there is a terminal state, so the agent's experience is divided into episodes
    Non-episodic MDP: an MDP with no terminal state, in which one episode continues indefinitely

    -> MC applies only to episodic MDPs, while TD applies to both.

Bias

MC is unbiased: as more samples are collected, the sample average converges to the true value.
TD is biased: it updates in the direction that reduces the gap between the TD target and the current estimate,
    and there is no guarantee that this estimate approaches the true value.
    * The Bellman expectation equation uses the true value vpi(St+1), while the TD target uses the current estimate V(St+1) --> vpi is the value we actually want.

Variance

MC updates using the full return, which accumulates randomness over many steps, so its variance is large
TD updates from a single step's sample, so its variance is small
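
A rough illustration of the difference (my own example numbers, not from the text): the MC target sums every reward in the episode, while the TD target contains a single reward plus a bootstrap value, so far less randomness enters each update.

gamma = 1.0
episode_rewards = [-1, -1, -1, -1, -1]                                # example rewards observed after S_t
mc_target = sum(gamma**i * r for i, r in enumerate(episode_rewards))  # uses every sampled reward
td_target = episode_rewards[0] + gamma * (-4.0)                       # -4.0 stands in for the estimate V(S_{t+1})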

Between MC and TD

N-step TD

There is no reason to bootstrap after only one step; we can take n steps before evaluating the value
--> A3C paper: while running through the for loop, at the last step of a segment we plug in V(s), the value-function estimate,
    then repeatedly multiply by gamma and add the actual reward values collected along the way to build the target.
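
A sketch of how such an n-step target could be assembled (my own illustration of the idea above, with placeholder names):

def n_step_target(rewards, v_bootstrap, gamma=1.0):
    # rewards: [R_{t+1}, ..., R_{t+n}] actually collected over n steps
    # v_bootstrap: V(S_{t+n}), the value estimate plugged in at the last step
    target = v_bootstrap
    for r in reversed(rewards):
        target = r + gamma * target   # fold in the real rewards, discounting as we go
    return target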

Coding when MDP is unknown

import random
import numpy as np

class GridWorld():
    def __init__(self):
        self.x = 0
        self.y = 0
    
    def step(self, a):
        # a: 0 = right, 1 = left, 2 = up, 3 = down
        if a == 0:
            self.move_right()
        elif a == 1:
            self.move_left()
        elif a == 2:
            self.move_up()
        elif a == 3:
            self.move_down()

        reward = -1  # every step costs -1
        done = self.is_done()
        return (self.x, self.y), reward, done

    def move_right(self):
        self.y += 1
        if self.y > 3:  # clamp at the edge of the 4x4 grid
            self.y = 3

    def move_left(self):
        self.y -= 1
        if self.y < 0:
            self.y = 0

    def move_up(self):
        self.x -= 1
        if self.x < 0:
            self.x = 0

    def move_down(self):
        self.x += 1
        if self.x > 3:
            self.x = 3

    def is_done(self):
        if self.x == 3 and self.y == 3:
            return True
        else:
            return False

    def get_state(self):
        return (self.x, self.y)
      
    def reset(self):
        self.x = 0
        self.y = 0
        return (self.x, self.y)

class Agent():
    def __init__(self):
        pass

    def select_action(self):
        # Uniform random policy: each of the four actions with probability 0.25
        coin = random.random()
        if coin < 0.25:
            action = 0
        elif coin < 0.5:
            action = 1
        elif coin < 0.75:
            action = 2
        else:
            action = 3
        return action
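
A quick sanity check of the two classes (my own sketch, not part of the original code): let the random agent run one episode and confirm it terminates at (3, 3).

env = GridWorld()
agent = Agent()
done = False
while not done:
    action = agent.select_action()
    (x, y), reward, done = env.step(action)
print("episode ended at", (x, y))   # should be (3, 3), the terminal state
env.reset()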

MC

def main():
    env = GridWorld()
    agent = Agent()
    data = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
    gamma = 1.0
    reward = -1
    alpha = 0.001

    for k in range(50000): #50,000 episodes
        done = False
        history = []

        while not done:
            action = agent.select_action()
            (x,y), reward, done = env.step(action)
            history.append((x,y,reward))
        env.reset()
        # Update the table using data immediately after each episode
        cum_reward = 0
        for transition in history[::-1]: # Calculate return in sequence by looking at the visited states from behind
            x, y, reward = transition
            data[x][y] = data[x][y] + alpha*(cum_reward-data[x][y])
            cum_reward = reward + gamma*cum_reward
    # Print the learned value table
    for row in data:
        print(row)

if __name__ == '__main__':
    main()

TD

def main():
    env = GridWorld()
    agent = Agent()
    data = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
    gamma = 1.0
    reward = -1
    alpha = 0.01 # Larger value compared to MC

    for k in range(50000):
        done = False
        while not done:
            x, y = env.get_state()
            action = agent.select_action()
            (x_prime, y_prime), reward, done = env.step(action)  # next state comes straight from step()
            data[x][y] = data[x][y] + alpha*(reward + gamma*data[x_prime][y_prime] - data[x][y])  # update immediately after one step
        env.reset()
            
    for row in data:
        print(row)

if __name__ == '__main__':
    main()