2021-03-31-Wen-Study


RL

Bellman Equation:

The Bellman Expectation Equation is used when a policy is given
    and you want to evaluate that policy
The Bellman Optimality Equation is used to find the best (optimal) value

Recursive Relationship

The Bellman equation is basically an equation for a recursive relationship.
A recursive definition expresses a quantity in terms of itself, the way a recursive function calls itself.

Bellman Expectation Equation

1

The return is the same if you first take one step and receive the immediate reward,
    and then add the (discounted) return to be received from the next state onward.
    * The expectation E is necessary because we don't know which state will come next.
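As a sketch in standard RL notation (v_π is the state value under policy π, R_{t+1} the immediate reward, γ the discount factor; these symbols are just shorthand for the quantities described above):

    v_\pi(s) = \mathbb{E}_\pi\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,\right]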

2

EQ1.
Value of S = SIGMA{probability of running A in S * value of running A in S}
    It is a method of multiplying the value of each action by the probability
        of choosing that action and adding them all up using sigma.
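Written out in the same notation (π(a|s) is the probability of running a in s, q_π(s,a) the value of running a in s):

    v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)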

EQ2.
Value of running A in S = instant reward + γ*SIGMA{probability of arriving
    at s' by executing a in s * value of s'} [s' is an element of S]
        [instant reward = reward for selecting the action in that state]
First add the reward received immediately, then multiply the value of each state
    that can be reached by its probability and add them up.
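In symbols (assuming r_s^a denotes the instant reward and P_{ss'}^a the probability of arriving at s' by executing a in s):

    q_\pi(s, a) = r_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_\pi(s')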

3

Final Formula of Bellman Expectation Equation
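Substituting EQ2 into EQ1 gives the usual final form (same notation as above):

    v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( r_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_\pi(s') \right)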

The approach to learning when information about the MDP is unknown
    (e.g., the reward function and transition probabilities are not known) is called model-free.
Conversely, if the MDP is known, the approach is called model-based, or planning.

Bellman Optimality Equation

Given an MDP, the optimal value is the value obtained by following the best policy
    (the one with the highest value) among all the policies
        that exist within that MDP.
Put simply, the sum of rewards from following the optimal policy is greater than
    (or equal to) that from following any other policy.

4

A new operator called max appears here.
In state s, the optimal value is the same if you take one step, receive the reward,
    and then add the optimal value of the next state s'.
The notable thing is that the π is gone.
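As a sketch, with v_* denoting the optimal state value:

    v_*(s) = \max_a\, \mathbb{E}\left[\, R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\, A_t = a \,\right]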

5

EQ1.
The optimal value of state s is the same as the value of the action 
    with the highest value among the actions that can be selected in s.
Because the optimal policy is followed, we don't need the action probabilities; we just take the max over Q(s,a).
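In symbols (q_* is the optimal action value):

    v_*(s) = \max_a\, q_*(s, a)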

EQ2.
Same as EQ2 of the Bellman Expectation Equation, but with a star (*) because it follows the optimal policy π*.
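In symbols (same r_s^a and P_{ss'}^a as before):

    q_*(s, a) = r_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_*(s')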

6

Final Formula of Bellman Optimality Equation
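Combining the two gives the usual final form:

    v_*(s) = \max_a \left( r_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_*(s') \right)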

C

scanf
    int scanf(const char *format-string, argument-list);
    Reads data from the standard input stream and stores it at the locations
        given in the argument list, according to the format string.
sscanf
    int sscanf(const char *buffer, const char *format, argument-list);
    Reads data from the string buffer according to the format string
        and stores it in the arguments.
fscanf
    int fscanf(FILE *stream, const char *format-string, argument-list);
    Reads data from the given file stream according to the format string.
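A minimal sketch showing the three variants side by side (the file name data.txt and the input values are made up for illustration):

    #include <stdio.h>

    int main(void)
    {
        int a;
        double b;
        char word[32];

        /* scanf: read from standard input, e.g. the user types "7 3.5" */
        if (scanf("%d %lf", &a, &b) == 2)
            printf("stdin : %d %.1f\n", a, b);

        /* sscanf: read from a string instead of a stream */
        const char *line = "hello 42";
        int n;
        if (sscanf(line, "%31s %d", word, &n) == 2)
            printf("string: %s %d\n", word, n);

        /* fscanf: read from a file stream (data.txt is a hypothetical example file) */
        FILE *fp = fopen("data.txt", "r");
        if (fp != NULL) {
            int x;
            while (fscanf(fp, "%d", &x) == 1)
                printf("file  : %d\n", x);
            fclose(fp);
        }
        return 0;
    }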

pow function
    double pow(double x, double y);
    --> returns x^y (x raised to the power y); declared in <math.h>
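A tiny usage sketch (requires <math.h>; on some systems you link with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* pow(x, y) returns x raised to the power y as a double */
        printf("%f\n", pow(2.0, 10.0));  /* 1024.000000 */
        printf("%f\n", pow(9.0, 0.5));   /* 3.000000 (square root) */
        return 0;
    }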

printf
    %f is used for both double and float (a float argument is promoted to double)
        %f   -> prints 6 digits after the decimal point by default
        %.f  -> same as %.0f (an empty precision means 0)
        %.0f -> prints no decimal digits; the value is rounded to the nearest integer
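A small sketch of the precision specifiers described above (the value is just illustrative):

    #include <stdio.h>

    int main(void)
    {
        double pi = 3.14159;

        printf("%f\n",   pi);   /* 3.141590 : 6 digits after the point by default */
        printf("%.2f\n", pi);   /* 3.14     : 2 digits after the point */
        printf("%.0f\n", pi);   /* 3        : rounded, no decimal digits */
        return 0;
    }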
    
fprintf
    writes formatted text to the output stream you specify
sprintf
    writes formatted text to an array of char, as opposed to a stream
    sprintf takes, in order, the array that will hold the resulting string,
        the format string, and the values used to build the string.
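A minimal sketch contrasting the two (out.txt is a hypothetical file name):

    #include <stdio.h>

    int main(void)
    {
        char buf[64];
        int score = 95;

        /* sprintf: write formatted text into a char array */
        sprintf(buf, "score=%d", score);

        /* fprintf: write formatted text to a stream (out.txt is just an example; stderr works too) */
        FILE *fp = fopen("out.txt", "w");
        if (fp != NULL) {
            fprintf(fp, "%s\n", buf);
            fclose(fp);
        }
        return 0;
    }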
strcat
    char* strcat(char* dest, const char* origin);
        Appends the string origin to the end of dest.
        The '\0' at the end of dest is overwritten, and a new '\0' is placed
            after the combined string.
    char* strncat(char* dest, const char* origin, size_t n);
    strncat -> appends at most n characters from origin
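A short sketch; note that dest must have enough room for the combined string plus the terminating '\0':

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char dest[32] = "Hello";

        /* strcat: overwrite dest's '\0' and append all of origin */
        strcat(dest, ", world");        /* dest is now "Hello, world" */

        /* strncat: append at most n characters of origin, then add '\0' */
        strncat(dest, "!!!???", 3);     /* dest is now "Hello, world!!!" */

        printf("%s\n", dest);
        return 0;
    }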
strcpy
    char* strcpy(char* dest, const char* origin);
    Copies the string origin (including its terminating '\0') into dest.

strcmp
    int strcmp(const char* s1, const char* s2);
    Compares the two strings and returns 0 if they are equal,
        a negative value if s1 comes before s2, and a positive value if s1 comes after s2.
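A short sketch showing strcpy and strcmp together (the exact nonzero magnitude returned by strcmp is implementation-defined; only the sign matters):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char dest[16];

        strcpy(dest, "apple");                 /* copy "apple" (with its '\0') into dest */

        if (strcmp(dest, "apple") == 0)
            printf("equal\n");
        if (strcmp("apple", "banana") < 0)     /* 'a' < 'b', so the result is negative */
            printf("apple comes before banana\n");

        return 0;
    }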