2021-04-14-Wen-Study
Apr 14, 2021
study
RL
Deep RL
How to solve an MDP with a large number of states
Deep RL: combining deep learning and reinforcement learning
Approximation using functions
1. Continuous state space
Infinitely many possible states
--> Impossible to build a table
2. Discrete state space
The number of possible states is simply too large
--> A table-based approach is not feasible because it would take too much time
Use a function that takes the state as input and returns its value -> actual value function ≈ approximating function
Since it is only an approximation, the least squares method is used to bring it as close to the true values as possible.
least squares method
A method for finding the a and b that minimize the error between the fitted line and the data
To express this independently of the number of data points, the squared errors are summed and averaged
That is, it minimizes the mean squared error (MSE).
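Spelled out, and assuming the fitted function is the line f(x) = a x + b implied by the parameters a and b, the quantity being minimized over the N data points (x_i, y_i) is:
\mathrm{MSE}(a, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)^2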
function
Functions are used to record state-value pairs.
What was previously stored in a table is now stored in a function.
Fit the curve of the function so that it passes close to the data.
To fit well, adjust the parameters of the function.
To obtain the parameter values, the least squares method, which minimizes the MSE, is used.
polynomial function
Fitting the function
1. Gather the data to be recorded in the function.
2. Draw the function so that it passes as close as possible to the data points.
3. Find the values of the parameters of the function f.
4. This is what it means to learn f.
A higher degree is not necessarily better for recording the data.
-> Because the data contain noise.
Fitting too exactly means the noise has been learned as well.
When the function is too flexible, f fits the noise.
Learning too exactly = overfitting
The situation where the error on the given data is large because the function f is not flexible enough to contain the actual model
= underfitting
-> Happens when there are too few free parameters.
The best function is somewhere in between, neither underfitting nor overfitting.
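A small sketch of this trade-off between flexibility and noise (not from the notes; the data, degrees, and target curve are made up for illustration, reusing the cos(1.5πx) target from the PyTorch example further below). The training error keeps shrinking as the polynomial degree grows, but the high-degree fit is chasing the noise:

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20))
y = np.cos(1.5 * np.pi * x) + rng.normal(0, 0.1, 20)   # noisy samples of a curved target

for degree in (1, 4, 10):
    coeffs = np.polyfit(x, y, degree)                   # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)     # error on the same data
    print(degree, mse)  # degree 1 underfits; degree 10 drives the error down by fitting the noise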
Advantages of functions: generalization
The existing table-based approach cannot hold the values of all states.
Functions require less storage thanks to their generalizing nature.
The table-based method does not know what value a newly visited state will have,
whereas a function can generalize and predict it.
+ If a function with an appropriate number of free parameters is used,
there is no shortage of storage space.
artificial neural network
neural network
A very flexible function
The flexibility of the function can be expressed through the number of free parameters.
A neural network is composed of hidden layers, and each hidden layer is composed of nodes.
A node applies a nonlinear activation after taking a linear combination of the values coming into it.
It linearly combines the n inputs and passes the result through a nonlinear function.
Nonlinear function = ReLU (Rectified Linear Unit)
Linear combination = the process of creating a new feature
Linear combination: making the features needed to evaluate the current state
Nonlinear function: needed when the relationship between input and output is nonlinear
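A minimal sketch of what one node computes under this description (the weights, bias, and inputs below are made-up values for illustration):

import numpy as np

def node(x, w, b):
    z = np.dot(w, x) + b        # linear combination of the n inputs = a new feature
    return np.maximum(z, 0.0)   # ReLU: the nonlinear activation

x = np.array([1.0, 2.0])        # n = 2 input values coming into the node
w = np.array([0.5, -0.1])       # weights of the linear combination
print(node(x, w, 0.1))          # output of one hidden node: max(0, 0.5*1 - 0.1*2 + 0.1) = 0.4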
Training of neural networks
Gradient Descent
The process of minimizing the objective function by computing the gradient and updating the parameters
Loss function: (actual value - predicted function value)^2
The input value is adjusted to reduce the function being minimized (for the loss, those inputs are the parameters w being learned)
input = x
output = f(x)
alpha = learning rate, or step size
Differentiating f(x) with respect to x gives the effect of x on f(x)
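In symbols, the gradient-descent update being described is the standard one:
x \leftarrow x - \alpha \frac{\partial f(x)}{\partial x}
and when the function being minimized is the loss L(w), the same rule is applied to each parameter:
w \leftarrow w - \alpha \frac{\partial L(w)}{\partial w}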
w1 and w2 start from random values
w1 = 0.5, w2 = 1.2
fitting data = (1, 2, 1)
--> we want f_w(1, 2) = 1
But the current value is -0.9
alpha = 0.01
After the update, f_w(1, 2) moves closer to 1.
With many data points and many updates, it gets close to the right answer.
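A small numeric sketch of one update step with the values above. The notes do not specify the form of f_w, so a plain linear combination f_w(x1, x2) = w1*x1 + w2*x2 is assumed here purely for illustration (with that assumption the intermediate numbers differ from the ones in the notes, but the mechanics of the update are the same):

def f_w(w1, w2, x1, x2):
    return w1 * x1 + w2 * x2   # hypothetical form of f_w; the notes do not specify it

w1, w2 = 0.5, 1.2              # initial (random) parameter values from the notes
x1, x2, y = 1.0, 2.0, 1.0      # fitting data (1, 2, 1): inputs (1, 2), target 1
alpha = 0.01                   # learning rate

pred = f_w(w1, w2, x1, x2)
# Loss L = (y - f_w)^2; its gradient w.r.t. w1 is -2*(y - pred)*x1, and likewise for w2
grad_w1 = -2 * (y - pred) * x1
grad_w2 = -2 * (y - pred) * x2

# Gradient descent: move each parameter against its gradient
w1 -= alpha * grad_w1
w2 -= alpha * grad_w2

print(pred, f_w(w1, w2, x1, x2))  # the prediction after the update is closer to the target 1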
Implementing neural-network training with PyTorch
An automatic differentiation library
The gradient of a very complex function is computed very efficiently via the backpropagation algorithm,
which caches and reuses intermediate derivatives.
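A tiny sketch of that mechanism before the full example below: a tensor created with requires_grad=True has every operation applied to it recorded, and calling .backward() fills in its .grad field.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x   # intermediate results are cached in the computation graph
y.backward()         # backpropagation computes dy/dx
print(x.grad)        # tensor(8.) since dy/dx = 2x + 2 = 8 at x = 3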
Example
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(1, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 128)
        self.fc4 = nn.Linear(128, 1, bias=False)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

def true_fun(X):
    noise = np.random.rand(X.shape[0]) * 0.4 - 0.2
    return np.cos(1.5 * np.pi * X) + X + noise

def plot_results(model):
    x = np.linspace(0, 5, 100)
    input_x = torch.from_numpy(x).float().unsqueeze(1)
    plt.plot(x, true_fun(x), label="Truth")
    plt.plot(x, model(input_x).detach().numpy(), label="Prediction")
    plt.legend(loc='lower right', fontsize=15)
    plt.xlim((0, 5))
    plt.ylim((-1, 5))
    plt.grid()
    plt.show()

def main():
    data_x = np.random.rand(10000) * 5  # Sample 10,000 numbers between 0 and 5 to use as inputs
    model = Model()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for step in range(10000):
        batch_x = np.random.choice(data_x, 32)  # Compose a mini-batch of 32 randomly picked inputs
        batch_x_tensor = torch.from_numpy(batch_x).float().unsqueeze(1)
        pred = model(batch_x_tensor)

        batch_y = true_fun(batch_x)
        truth = torch.from_numpy(batch_y).float().unsqueeze(1)
        loss = F.mse_loss(pred, truth)  # Compute the MSE loss

        optimizer.zero_grad()
        loss.mean().backward()  # Gradient computation via backpropagation
        optimizer.step()        # Actually update the parameters

    plot_results(model)

if __name__ == '__main__':
    main()
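A note on the shapes (my reading of the code, not from the original notes): unsqueeze(1) turns each length-32 batch of inputs into a 32×1 tensor so it matches the nn.Linear(1, 128) input layer, and F.mse_loss already returns the mean over the batch, so the extra .mean() before .backward() does not change the value.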
Agent
Problem setting (no simplifying constraints)
Model Free situation
Because the state space and action space are very large, the table method cannot be applied.
* Solved with a neural net
One way is to represent the value function or action-value function with a neural net,
the other is to represent the policy function itself with a neural net.
Value-Based Agent
Choosing an action based on a value function
Selects an action by looking at the action-value function
ex) SARSA, Q-learning
Policy-based agent
Selects the action directly from the policy function, without consulting the value function
Actor-Critic
Use both value and policy functions
actor: policy function
critic: value function
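A small sketch of the difference in how the two families pick an action (the state size, networks, and shapes below are made up for illustration; both stand-in networks map a 4-feature state to 2 available actions):

import torch
import torch.nn as nn

s = torch.randn(1, 4)                                         # some state with 4 features

q_net = nn.Linear(4, 2)                                       # stand-in for a Q-network over 2 actions
value_based_action = q_net(s).argmax(dim=1)                   # value-based: pick the action with the largest Q(s, a)

pi_net = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=1))    # stand-in for a policy network
policy_based_action = torch.distributions.Categorical(pi_net(s)).sample()  # policy-based: sample from pi(a|s)

print(value_based_action.item(), policy_based_action.item())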