Machine Learning (Chapter 4): Reinforcement Learning

Introduction

Reinforcement Learning (RL) is a subfield of Machine Learning where an agent learns to make decisions by performing certain actions and observing the rewards or penalties resulting from those actions. Unlike supervised learning, where the model learns from a labeled dataset, RL focuses on learning from interaction with an environment. This interaction is modeled as a Markov Decision Process (MDP), where the agent's goal is to maximize cumulative rewards over time.
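To make this interaction concrete, the sketch below shows the generic agent-environment loop. The ToyEnv class and the random action choice are purely illustrative placeholders (not from any library); they exist only to show where state, action, and reward flow.

python:
import random

# Hypothetical 1-D corridor environment, used only to illustrate the loop:
# states 0..4, action 1 moves right, action 0 moves left, +1 reward at state 4.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([0, 1])          # placeholder policy: pick an action at random
    state, reward, done = env.step(action)  # environment returns next state, reward, done flag
    total_reward += reward                  # the agent's objective is cumulative reward
print("Episode return:", total_reward)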

Key Concepts in Reinforcement Learning

  • Agent: The learner or decision-maker.
  • Environment: The world through which the agent moves and interacts.
  • State (S): A representation of the current situation of the agent in the environment.
  • Action (A): The choices the agent can make in each state.
  • Reward (R): The feedback from the environment after the agent takes an action.
  • Policy (π): A strategy used by the agent to determine its actions based on the current state.
  • Value Function (V(s)): The expected cumulative reward from a given state.
  • Q-Value or Action-Value (Q(s, a)): The expected cumulative reward from taking action 'a' in state 's' and then following a policy.

Markov Decision Process (MDP)

An MDP is defined by a tuple $(S, A, P, R, \gamma)$, where:

  • $S$ is a finite set of states.
  • $A$ is a finite set of actions.
  • $P(s'|s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
  • $R(s, a)$ is the immediate reward received after taking action $a$ in state $s$.
  • $\gamma \in [0, 1]$ is the discount factor, representing the difference in importance between future and present rewards.

The goal is to find a policy $\pi$ that maximizes the expected return $G_t$ from each state $s$, defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
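As a quick numerical illustration (the reward values below are made up), the return can be computed directly from this definition:

python:
# Discounted return G_t for an example reward sequence R_{t+1}, R_{t+2}, ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]   # arbitrary example rewards

G_t = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*3.0 ≈ 4.807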

Value Functions

State-Value Function $V(s)$

The state-value function $V^\pi(s)$ under a policy $\pi$ is the expected return starting from state $s$ and then following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Action-Value Function $Q(s, a)$

The action-value function $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
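Both expectations can be approximated by sampling: run many episodes, record their discounted returns, and average them. The sketch below estimates $V^\pi(s)$ this way; `env` and `policy` are assumed to follow the toy interface shown in the introduction, not any particular library's API.

python:
# Monte Carlo estimate of V^pi(s): average the discounted returns of episodes
# that start in state s and follow policy pi. `env` is assumed to expose
# reset()/step() like the toy environment above, and `policy(state)` returns an action.
def mc_state_value(env, policy, start_state, gamma=0.95, n_episodes=1000):
    returns = []
    for _ in range(n_episodes):
        env.reset()
        env.state = start_state             # force the start state (toy environment only)
        state, done = start_state, False
        G, discount = 0.0, 1.0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)
            G += discount * reward          # accumulate gamma^k * R_{t+k+1}
            discount *= gamma
        returns.append(G)
    return sum(returns) / len(returns)      # the sample mean approximates E_pi[G_t | S_t = s]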

Bellman Equations

The Bellman equations provide a recursive decomposition of the value functions: the value of a state (or state-action pair) is expressed in terms of the immediate reward and the discounted value of its successor states.

Bellman Equation for $V(s)$

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s, a) \left[R(s, a, s') + \gamma V^\pi(s')\right]$$

Bellman Equation for $Q(s, a)$

$$Q^\pi(s, a) = \sum_{s' \in S} P(s'|s, a) \left[R(s, a, s') + \gamma \sum_{a' \in A} \pi(a'|s') Q^\pi(s', a')\right]$$
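The Bellman equation for $V^\pi$ can be turned directly into an algorithm: start from an arbitrary estimate and repeatedly apply the right-hand side as an update until the values stop changing (iterative policy evaluation). Below is a minimal sketch; the transition model `P`, reward table `R`, and policy `pi` are assumed, illustrative data structures rather than part of any library.

python:
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
    """Iteratively apply the Bellman equation for V^pi until convergence.

    Assumed data layout (illustrative only):
      P[s][a]     -> list of (prob, next_state) pairs,
      R[s][a][s'] -> immediate reward,
      pi[s][a]    -> probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a, pi_sa in enumerate(pi[s]):
                for prob, s_next in P[s][a]:
                    # Bellman backup: pi(a|s) * P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
                    V_new[s] += pi_sa * prob * (R[s][a][s_next] + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new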

Q-Learning: An Off-Policy Approach

Q-Learning is a popular off-policy reinforcement learning algorithm that learns the optimal action-value function $Q^*(s, a)$ directly, regardless of the policy used to generate behavior. Its update rule (translated into code after the parameter list below) is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Where:

  • $\alpha$ is the learning rate.
  • $R_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
  • $s_{t+1}$ is the new state after action $a_t$ is taken.
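As referenced above, the update rule translates almost line for line into code. In this sketch, `Q` is assumed to be a NumPy array indexed by (state, action):

python:
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.8, gamma=0.95):
    """Apply a single Q-Learning update to the Q-table, in place."""
    td_target = reward + gamma * np.max(Q[next_state, :])       # R_{t+1} + gamma * max_a' Q(s_{t+1}, a')
    Q[state, action] += alpha * (td_target - Q[state, action])  # move Q(s_t, a_t) toward the target
    return Q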

Python Implementation of Q-Learning

Below is a simple Python implementation of the Q-Learning algorithm on the FrozenLake environment, a small grid-world toy problem from OpenAI Gym.

python:
import numpy as np
import gym

# NOTE: this example assumes the classic Gym API (gym < 0.26), where reset()
# returns the state and step() returns (next_state, reward, done, info).

# Initialize environment
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Set learning parameters
alpha = 0.8      # Learning rate
gamma = 0.95     # Discount factor
epsilon = 0.1    # Exploration rate
n_episodes = 1000

# Q-learning algorithm
for episode in range(n_episodes):
    state = env.reset()
    done = False
    while not done:
        # Choose action (epsilon-greedy policy)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Take action and observe reward and next state
        next_state, reward, done, _ = env.step(action)

        # Update Q-value
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state, :]) - Q[state, action]
        )

        # Transition to next state
        state = next_state

# Print the learned Q-table
print("Learned Q-table:")
print(Q)

# Demonstrate the learned policy
state = env.reset()
done = False
while not done:
    action = np.argmax(Q[state, :])
    state, _, done, _ = env.step(action)
    env.render()

Conclusion

Reinforcement Learning, particularly through algorithms like Q-Learning, provides a powerful framework for training agents to make optimal decisions through trial and error. By balancing exploration and exploitation, an RL agent can learn effective strategies for interacting with its environment, even in complex and uncertain settings. The mathematical foundation, coupled with practical implementation, highlights the robustness and versatility of RL in solving real-world problems.
