2021-04-12-Mon-Study
Apr 12, 2021
study
RL
Evaluating the value function when the MDP is unknown
Temporal Difference
Monte Carlo has to wait until the episode is over before it can update.
You need the return to update, but the return isn't known until the episode ends.
TD's idea: update a guess with a guess (bootstrapping).
The target of the update is the TD target: Rt+1 + gamma*V(St+1).
Temporal Difference Learning Algorithm
1. Initialize the table, setting the value of every state to 0.
2. Drop the agent at S0 and let it gather experience.
Two differences from MC --> the update formula and the update timing.
Instead of the return Gt, use the TD target Rt+1 + gamma*V(St+1), so the update becomes V(St) <- V(St) + alpha*(Rt+1 + gamma*V(St+1) - V(St)).
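A minimal sketch of a single tabular TD(0) update (the full GridWorld example appears further down); td_update and the dict V are illustrative names, not code from these notes:

def td_update(V, s, r, s_prime, alpha=0.01, gamma=1.0):
    # One TD(0) step: move V(s) toward the TD target r + gamma*V(s').
    # V is assumed to be a dict mapping each state to its current value estimate.
    td_target = r + gamma * V[s_prime]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V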
Monte-Carlo vs Temporal Difference
What is learned
How to learn the value function of a given (arbitrary) policy without knowing the MDP.
When it is learned
MC: only after the episode ends and the return is known do we look back and learn.
TD: the table is updated immediately after every single step.
Episodic MDP: there is a terminal state, so the agent's experience is divided into episodes.
Non-episodic MDP: an MDP with no terminal state, where a single episode continues indefinitely.
--> MC applies only to episodic MDPs, while TD applies to both.
Bias
MC is unbiased: as more samples are collected, the average converges to the true value.
TD is biased: it updates in the direction that reduces the gap between the TD target and the current estimate,
so there is no guarantee that the estimate approaches the true value.
* The Bellman expectation equation uses the true value Vpi(St+1), whereas TD bootstraps from the current estimate V(St+1); what we actually want is Vpi, and plugging in the estimate is what introduces the bias.
Variance
MC updates using the full return, which accumulates randomness over many steps, so its variance is large.
TD can update from a single step's sample, so its variance is small.
Between MC and TD
N-step TD
There is no reason to stop and evaluate after only one step; we can take n steps before judging the value (a rough sketch follows below).
--> A3C paper: inside the for loop, at the last step of a fixed-length segment the value estimate V(s) is plugged in,
multiplied by the appropriate power of gamma, and the rewards actually collected along the way are added.
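A rough sketch of how such an n-step target can be assembled from a short segment of experience; n_step_target, rewards, and v_last are illustrative names and not taken from the A3C code:

def n_step_target(rewards, v_last, gamma=1.0):
    # rewards: the n rewards actually collected along the segment, [R_{t+1}, ..., R_{t+n}]
    # v_last:  the bootstrap value V(S_{t+n}) of the state reached at the last step
    target = v_last
    for r in reversed(rewards):      # accumulate from the back, as in an A3C-style loop
        target = r + gamma * target
    return target  # = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})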
Coding when the MDP is unknown
import random
import numpy as np

class GridWorld():
    def __init__(self):
        self.x = 0
        self.y = 0

    def step(self, a):
        # 0: right, 1: left, 2: up, 3: down
        if a == 0:
            self.move_right()
        elif a == 1:
            self.move_left()
        elif a == 2:
            self.move_up()
        elif a == 3:
            self.move_down()

        reward = -1  # every step costs -1
        done = self.is_done()
        return (self.x, self.y), reward, done

    def move_right(self):
        self.y += 1
        if self.y > 3:
            self.y = 3

    def move_left(self):
        self.y -= 1
        if self.y < 0:
            self.y = 0

    def move_up(self):
        self.x -= 1
        if self.x < 0:
            self.x = 0

    def move_down(self):
        self.x += 1
        if self.x > 3:
            self.x = 3

    def is_done(self):
        # (3, 3) is the terminal state
        if self.x == 3 and self.y == 3:
            return True
        else:
            return False

    def get_state(self):
        return (self.x, self.y)

    def reset(self):
        self.x = 0
        self.y = 0
        return (self.x, self.y)

class Agent():
    def __init__(self):
        pass

    def select_action(self):
        # Random policy: each of the four actions with probability 0.25
        coin = random.random()
        if coin < 0.25:
            action = 0
        elif coin < 0.5:
            action = 1
        elif coin < 0.75:
            action = 2
        else:
            action = 3
        return action
MC
def main():
    env = GridWorld()
    agent = Agent()
    data = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # value table for the 4x4 grid
    gamma = 1.0
    alpha = 0.001

    for k in range(50000):  # 50,000 episodes
        done = False
        history = []
        while not done:
            action = agent.select_action()
            (x, y), reward, done = env.step(action)
            history.append((x, y, reward))
        env.reset()

        # Update the table using the data immediately after each episode
        cum_reward = 0
        for transition in history[::-1]:  # walk the visited states backwards to accumulate the return
            x, y, reward = transition
            data[x][y] = data[x][y] + alpha * (cum_reward - data[x][y])
            cum_reward = reward + gamma * cum_reward

    # Print the learned value table
    for row in data:
        print(row)

if __name__ == '__main__':
    main()
TD
def main():
    env = GridWorld()
    agent = Agent()
    data = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    gamma = 1.0
    alpha = 0.01  # larger learning rate than in the MC example

    for k in range(50000):
        done = False
        while not done:
            x, y = env.get_state()
            action = agent.select_action()
            (x_prime, y_prime), reward, done = env.step(action)
            # Update the table immediately after a single step, using the TD target
            data[x][y] = data[x][y] + alpha * (reward + gamma * data[x_prime][y_prime] - data[x][y])
        env.reset()

    for row in data:
        print(row)

if __name__ == '__main__':
    main()