2021-04-14-Wen-Study
Apr 14, 2021
study
RL
Deep RL
How to solve an MDP with a large number of states
Deep RL: combining deep learning and reinforcement learning
Approximation using functions
1. Continuous state space
Infinitely many possible states
--> Impossible to build a table
2. Discrete state space
The number of possible states is simply too large
--> A table-based approach is not feasible because it would take too much time
Use a function that takes the state as input and returns its value -> actual value function ≈ approximating function
Since it is only an approximation, the least squares method is used to bring it as close to the true values as possible.
least squares method
A method for finding the a and b that minimize the error between the fitted line and the data
To express this independently of the number of data points, the squared errors are summed and averaged
That is, it minimizes the mean squared error (MSE).
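Spelled out, and assuming the fitted function is the line f(x) = a x + b implied by the parameters a and b, the quantity being minimized over the N data points (x_i, y_i) is:
\mathrm{MSE}(a, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)^2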
function
Functions are used to record state-value pairs.
What was previously stored in a table is now stored in a function.
Fit the curve of the function so that it passes close to the data.
To fit well, adjust the parameters of the function.
To obtain the parameter values, the least squares method, which minimizes the MSE, is used.
polynomial function
Fitting the function
1. Gather the data to be recorded in the function.
2. Draw the function so that it passes as close as possible to the data points.
3. Find the values of the parameters of the function f.
4. This is what it means to learn f.
A higher degree is not necessarily better for recording the data.
-> Because the data contain noise.
Fitting too exactly means the noise has been learned as well.
When the function is too flexible, f fits the noise.
Learning too exactly = overfitting
The situation where the error on the given data is large because the function f is not flexible enough to contain the actual model
= underfitting
-> Happens when there are too few free parameters.
The best function is somewhere in between, neither underfitting nor overfitting.
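A small sketch of this trade-off between flexibility and noise (not from the notes; the data, degrees, and target curve are made up for illustration, reusing the cos(1.5πx) target from the PyTorch example further below). The training error keeps shrinking as the polynomial degree grows, but the high-degree fit is chasing the noise:

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20))
y = np.cos(1.5 * np.pi * x) + rng.normal(0, 0.1, 20)   # noisy samples of a curved target

for degree in (1, 4, 10):
    coeffs = np.polyfit(x, y, degree)                   # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)     # error on the same data
    print(degree, mse)  # degree 1 underfits; degree 10 drives the error down by fitting the noise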
Advantages of functions: generalization
The existing table-based approach cannot hold the values of all states.
Functions require less storage thanks to their generalizing nature.
The table-based method does not know what value a newly visited state will have,
whereas a function can generalize and predict it.
+ If a function with an appropriate number of free parameters is used,
there is no shortage of storage space.
artificial neural network
neural network
A very flexible function
The flexibility of the function can be expressed through the number of free parameters.
A neural network is composed of hidden layers, and each hidden layer is composed of nodes.
A node applies a nonlinear activation after taking a linear combination of the values coming into it.
It linearly combines the n inputs and passes the result through a nonlinear function.
Nonlinear function = ReLU (Rectified Linear Unit)
Linear combination = the process of creating a new feature
Linear combination: making the features needed to evaluate the current state
Nonlinear function: needed when the relationship between input and output is nonlinear
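A minimal sketch of what one node computes under this description (the weights, bias, and inputs below are made-up values for illustration):

import numpy as np

def node(x, w, b):
    z = np.dot(w, x) + b        # linear combination of the n inputs = a new feature
    return np.maximum(z, 0.0)   # ReLU: the nonlinear activation

x = np.array([1.0, 2.0])        # n = 2 input values coming into the node
w = np.array([0.5, -0.1])       # weights of the linear combination
print(node(x, w, 0.1))          # output of one hidden node: max(0, 0.5*1 - 0.1*2 + 0.1) = 0.4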
Training of neural networks
Gradient Descent
The process of minimizing the objective function by computing the gradient and updating the parameters
Loss function: (actual value - predicted function value)^2
The input value is adjusted to reduce the function being minimized (for the loss, those inputs are the parameters w being learned)
input = x
output = f(x)
alpha = learning rate, or step size
Differentiating f(x) with respect to x gives the effect of x on f(x)
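In symbols, the gradient-descent update being described is the standard one:
x \leftarrow x - \alpha \frac{\partial f(x)}{\partial x}
and when the function being minimized is the loss L(w), the same rule is applied to each parameter:
w \leftarrow w - \alpha \frac{\partial L(w)}{\partial w}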
w1 and w2 start from random values
w1 = 0.5, w2 = 1.2
fitting data = (1, 2, 1)
--> we want f_w(1, 2) = 1
But the current value is -0.9
alpha = 0.01
After the update, f_w(1, 2) moves closer to 1.
With many data points and many updates, it gets close to the right answer.
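A small numeric sketch of one update step with the values above. The notes do not specify the form of f_w, so a plain linear combination f_w(x1, x2) = w1*x1 + w2*x2 is assumed here purely for illustration (with that assumption the intermediate numbers differ from the ones in the notes, but the mechanics of the update are the same):

def f_w(w1, w2, x1, x2):
    return w1 * x1 + w2 * x2   # hypothetical form of f_w; the notes do not specify it

w1, w2 = 0.5, 1.2              # initial (random) parameter values from the notes
x1, x2, y = 1.0, 2.0, 1.0      # fitting data (1, 2, 1): inputs (1, 2), target 1
alpha = 0.01                   # learning rate

pred = f_w(w1, w2, x1, x2)
# Loss L = (y - f_w)^2; its gradient w.r.t. w1 is -2*(y - pred)*x1, and likewise for w2
grad_w1 = -2 * (y - pred) * x1
grad_w2 = -2 * (y - pred) * x2

# Gradient descent: move each parameter against its gradient
w1 -= alpha * grad_w1
w2 -= alpha * grad_w2

print(pred, f_w(w1, w2, x1, x2))  # the prediction after the update is closer to the target 1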
Implementing neural-network training with PyTorch
An automatic differentiation library
The gradient of a very complex function is computed very efficiently via the backpropagation algorithm,
which caches and reuses intermediate derivatives.
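A tiny sketch of that mechanism before the full example below: a tensor created with requires_grad=True has every operation applied to it recorded, and calling .backward() fills in its .grad field.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x   # intermediate results are cached in the computation graph
y.backward()         # backpropagation computes dy/dx
print(x.grad)        # tensor(8.) since dy/dx = 2x + 2 = 8 at x = 3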
Example
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(1, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 128)
        self.fc4 = nn.Linear(128, 1, bias=False)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

def true_fun(X):
    noise = np.random.rand(X.shape[0]) * 0.4 - 0.2
    return np.cos(1.5 * np.pi * X) + X + noise

def plot_results(model):
    x = np.linspace(0, 5, 100)
    input_x = torch.from_numpy(x).float().unsqueeze(1)
    plt.plot(x, true_fun(x), label="Truth")
    plt.plot(x, model(input_x).detach().numpy(), label="Prediction")
    plt.legend(loc='lower right', fontsize=15)
    plt.xlim((0, 5))
    plt.ylim((-1, 5))
    plt.grid()
    plt.show()

def main():
    data_x = np.random.rand(10000) * 5  # Sample 10,000 numbers between 0 and 5 to use as inputs
    model = Model()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for step in range(10000):
        batch_x = np.random.choice(data_x, 32)  # Compose a mini-batch of 32 randomly picked inputs
        batch_x_tensor = torch.from_numpy(batch_x).float().unsqueeze(1)
        pred = model(batch_x_tensor)

        batch_y = true_fun(batch_x)
        truth = torch.from_numpy(batch_y).float().unsqueeze(1)
        loss = F.mse_loss(pred, truth)  # Compute the MSE loss

        optimizer.zero_grad()
        loss.mean().backward()  # Gradient computation via backpropagation
        optimizer.step()        # Actually update the parameters

    plot_results(model)

if __name__ == '__main__':
    main()
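A note on the shapes (my reading of the code, not from the original notes): unsqueeze(1) turns each length-32 batch of inputs into a 32×1 tensor so it matches the nn.Linear(1, 128) input layer, and F.mse_loss already returns the mean over the batch, so the extra .mean() before .backward() does not change the value.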
Agent
Problem setting (no simplifying constraints)
Model Free situation
Because the state space and action space are very large, the table method cannot be applied.
* Solved with a neural net
One way is to represent the value function or action-value function with a neural net,
the other is to represent the policy function itself with a neural net.
Value-Based Agent
Choosing an action based on a value function
Selects an action by looking at the action-value function
ex) SARSA, Q-learning
Policy-based agent
Selects the action directly from the policy function, without consulting the value function
Actor-Critic
Use both value and policy functions
actor: policy function
critic: value function
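A small sketch of the difference in how the two families pick an action (the state size, networks, and shapes below are made up for illustration; both stand-in networks map a 4-feature state to 2 available actions):

import torch
import torch.nn as nn

s = torch.randn(1, 4)                                         # some state with 4 features

q_net = nn.Linear(4, 2)                                       # stand-in for a Q-network over 2 actions
value_based_action = q_net(s).argmax(dim=1)                   # value-based: pick the action with the largest Q(s, a)

pi_net = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=1))    # stand-in for a policy network
policy_based_action = torch.distributions.Categorical(pi_net(s)).sample()  # policy-based: sample from pi(a|s)

print(value_based_action.item(), policy_based_action.item())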