Machine Learning (Chapter 4): Reinforcement Learning

Introduction

Reinforcement Learning (RL) is a subfield of Machine Learning where an agent learns to make decisions by performing certain actions and observing the rewards or penalties resulting from those actions. Unlike supervised learning, where the model learns from a labeled dataset, RL focuses on learning from interaction with an environment. This interaction is modeled as a Markov Decision Process (MDP), where the agent's goal is to maximize cumulative rewards over time.
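To make this interaction concrete, the sketch below shows the generic agent-environment loop. The ToyEnv class and the random action choice are purely illustrative placeholders (not from any library); they exist only to show where state, action, and reward flow.

python:
import random

# Hypothetical 1-D corridor environment, used only to illustrate the loop:
# states 0..4, action 1 moves right, action 0 moves left, +1 reward at state 4.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([0, 1])          # placeholder policy: pick an action at random
    state, reward, done = env.step(action)  # environment returns next state, reward, done flag
    total_reward += reward                  # the agent's objective is cumulative reward
print("Episode return:", total_reward)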

Key Concepts in Reinforcement Learning

  • Agent: The learner or decision-maker.
  • Environment: The world through which the agent moves and interacts.
  • State (S): A representation of the current situation of the agent in the environment.
  • Action (A): The choices the agent can make in each state.
  • Reward (R): The feedback from the environment after the agent takes an action.
  • Policy (π): A strategy used by the agent to determine its actions based on the current state.
  • Value Function (V(s)): The expected cumulative reward from a given state.
  • Q-Value or Action-Value (Q(s, a)): The expected cumulative reward from taking action 'a' in state 's' and then following a policy.

Markov Decision Process (MDP)

An MDP is defined by a tuple $(S, A, P, R, \gamma)$, where:

  • $S$ is a finite set of states.
  • $A$ is a finite set of actions.
  • $P(s'|s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
  • $R(s, a)$ is the immediate reward received after taking action $a$ in state $s$.
  • $\gamma \in [0, 1]$ is the discount factor, representing the difference in importance between future and present rewards.

The goal is to find a policy $\pi$ that maximizes the expected return $G_t$ from each state $s$, defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
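As a quick numerical illustration (the reward values below are made up), the return can be computed directly from this definition:

python:
# Discounted return G_t for an example reward sequence R_{t+1}, R_{t+2}, ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]   # arbitrary example rewards

G_t = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*3.0 ≈ 4.807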

Value Functions

State-Value Function $V(s)$

The state-value function $V^\pi(s)$ under a policy $\pi$ is the expected return starting from state $s$ and then following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Action-Value Function $Q(s, a)$

The action-value function $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
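Both expectations can be approximated by sampling: run many episodes, record their discounted returns, and average them. The sketch below estimates $V^\pi(s)$ this way; `env` and `policy` are assumed to follow the toy interface shown in the introduction, not any particular library's API.

python:
# Monte Carlo estimate of V^pi(s): average the discounted returns of episodes
# that start in state s and follow policy pi. `env` is assumed to expose
# reset()/step() like the toy environment above, and `policy(state)` returns an action.
def mc_state_value(env, policy, start_state, gamma=0.95, n_episodes=1000):
    returns = []
    for _ in range(n_episodes):
        env.reset()
        env.state = start_state             # force the start state (toy environment only)
        state, done = start_state, False
        G, discount = 0.0, 1.0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)
            G += discount * reward          # accumulate gamma^k * R_{t+k+1}
            discount *= gamma
        returns.append(G)
    return sum(returns) / len(returns)      # the sample mean approximates E_pi[G_t | S_t = s]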

Bellman Equations

The Bellman equations provide a recursive decomposition of the value functions: the value of a state (or state-action pair) is expressed in terms of the immediate reward and the discounted value of its successor states.

Bellman Equation for $V(s)$

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s, a) \left[R(s, a, s') + \gamma V^\pi(s')\right]$$

Bellman Equation for $Q(s, a)$

$$Q^\pi(s, a) = \sum_{s' \in S} P(s'|s, a) \left[R(s, a, s') + \gamma \sum_{a' \in A} \pi(a'|s') Q^\pi(s', a')\right]$$
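The Bellman equation for $V^\pi$ can be turned directly into an algorithm: start from an arbitrary estimate and repeatedly apply the right-hand side as an update until the values stop changing (iterative policy evaluation). Below is a minimal sketch; the transition model `P`, reward table `R`, and policy `pi` are assumed, illustrative data structures rather than part of any library.

python:
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
    """Iteratively apply the Bellman equation for V^pi until convergence.

    Assumed data layout (illustrative only):
      P[s][a]     -> list of (prob, next_state) pairs,
      R[s][a][s'] -> immediate reward,
      pi[s][a]    -> probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a, pi_sa in enumerate(pi[s]):
                for prob, s_next in P[s][a]:
                    # Bellman backup: pi(a|s) * P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
                    V_new[s] += pi_sa * prob * (R[s][a][s_next] + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new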

Q-Learning: An Off-Policy Approach

Q-Learning is a popular off-policy reinforcement learning algorithm that learns the optimal action-value function $Q^*(s, a)$ directly, regardless of the policy used to generate behavior. Its update rule (translated into code after the parameter list below) is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Where:

  • $\alpha$ is the learning rate.
  • $R_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
  • $s_{t+1}$ is the new state after action $a_t$ is taken.
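As referenced above, the update rule translates almost line for line into code. In this sketch, `Q` is assumed to be a NumPy array indexed by (state, action):

python:
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.8, gamma=0.95):
    """Apply a single Q-Learning update to the Q-table, in place."""
    td_target = reward + gamma * np.max(Q[next_state, :])       # R_{t+1} + gamma * max_a' Q(s_{t+1}, a')
    Q[state, action] += alpha * (td_target - Q[state, action])  # move Q(s_t, a_t) toward the target
    return Q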

Python Implementation of Q-Learning

Below is a simple Python implementation of the Q-Learning algorithm on the FrozenLake environment, a small grid-world toy problem from OpenAI Gym.

python:
import numpy as np
import gym

# NOTE: this example assumes the classic Gym API (gym < 0.26), where reset()
# returns the state and step() returns (next_state, reward, done, info).

# Initialize environment
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Set learning parameters
alpha = 0.8      # Learning rate
gamma = 0.95     # Discount factor
epsilon = 0.1    # Exploration rate
n_episodes = 1000

# Q-learning algorithm
for episode in range(n_episodes):
    state = env.reset()
    done = False
    while not done:
        # Choose action (epsilon-greedy policy)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Take action and observe reward and next state
        next_state, reward, done, _ = env.step(action)

        # Update Q-value
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state, :]) - Q[state, action]
        )

        # Transition to next state
        state = next_state

# Print the learned Q-table
print("Learned Q-table:")
print(Q)

# Demonstrate the learned policy
state = env.reset()
done = False
while not done:
    action = np.argmax(Q[state, :])
    state, _, done, _ = env.step(action)
    env.render()

Conclusion

Reinforcement Learning, particularly through algorithms like Q-Learning, provides a powerful framework for training agents to make optimal decisions through trial and error. By balancing exploration and exploitation, an RL agent can learn effective strategies for interacting with its environment, even in complex and uncertain settings. The mathematical foundation, coupled with practical implementation, highlights the robustness and versatility of RL in solving real-world problems.
