Machine Learning (Chapter 4): Reinforcement Learning
Introduction
Reinforcement Learning (RL) is a subfield of Machine Learning where an agent learns to make decisions by performing certain actions and observing the rewards or penalties resulting from those actions. Unlike supervised learning, where the model learns from a labeled dataset, RL focuses on learning from interaction with an environment. This interaction is modeled as a Markov Decision Process (MDP), where the agent's goal is to maximize cumulative rewards over time.
Key Concepts in Reinforcement Learning
- Agent: The learner or decision-maker.
- Environment: The world through which the agent moves and interacts.
- State (S): A representation of the current situation of the agent in the environment.
- Action (A): The choices the agent can make in each state.
- Reward (R): The feedback from the environment after the agent takes an action.
- Policy (π): A strategy used by the agent to determine its actions based on the current state.
- Value Function (V(s)): The expected cumulative reward from a given state.
- Q-Value or Action-Value (Q(s, a)): The expected cumulative reward from taking action 'a' in state 's' and then following a policy.
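To see how these pieces fit together, here is a minimal sketch of the agent-environment interaction loop, using a purely random policy on Gym's FrozenLake environment. The environment choice is just for illustration, and the snippet assumes gym >= 0.26 (or gymnasium), where step() returns five values.

import gym

env = gym.make("FrozenLake-v1", is_slippery=False)  # the environment
state, _ = env.reset()                               # initial state S
done = False
while not done:
    action = env.action_space.sample()               # a (random) policy chooses an action A
    # the environment returns a reward R and the next state S
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = next_state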
Markov Decision Process (MDP)
An MDP is defined by a tuple $(S, A, P, R, \gamma)$, where:
- $S$ is a finite set of states.
- $A$ is a finite set of actions.
- $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
- $R(s, a)$ is the immediate reward received after taking action $a$ in state $s$.
- $\gamma \in [0, 1]$ is the discount factor, representing the difference in importance between future and present rewards.
The goal is to find a policy $\pi$ that maximizes the expected return $G_t$ from each state $s$, defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
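As a quick numeric illustration (the reward values below are made up), the return is simply the discount-weighted sum of the rewards that follow time $t$:

gamma = 0.95
rewards = [1.0, 0.0, 0.0, 5.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.95**3 * 5.0 ≈ 5.29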
Value Functions
State-Value Function
The state-value function $V^\pi(s)$ under a policy $\pi$ is the expected return starting from state $s$ and then following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$
Action-Value Function
The action-value function $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]$$
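The two value functions are related by $V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$: the value of a state is the policy-weighted average of its action values. Below is a small sketch of this relationship for a hypothetical 3-state, 2-action problem; all the numbers are made up for illustration.

import numpy as np

# Hypothetical Q-values for 3 states and 2 actions (illustrative numbers only)
Q_pi = np.array([[1.0, 2.0],
                 [0.5, 0.0],
                 [3.0, 1.0]])

# A stochastic policy pi(a | s): each row sums to 1
pi = np.array([[0.8, 0.2],
               [0.5, 0.5],
               [0.1, 0.9]])

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), computed row-wise
V_pi = np.sum(pi * Q_pi, axis=1)
print(V_pi)  # [1.2, 0.25, 1.2]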
Bellman Equations
The Bellman equation provides a recursive decomposition of the value functions.
Bellman Equation for $V^\pi$

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]$$

Bellman Equation for $Q^\pi$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \right]$$
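To make the recursion concrete, here is a sketch of iterative policy evaluation, which repeatedly applies the Bellman equation for $V^\pi$ until the values stop changing. The two-state MDP below (transition probabilities, rewards, and policy) is entirely made up for illustration.

import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative)
# P[s, a, s'] = probability of moving to s' after taking action a in state s
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1
])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# pi[s, a] = probability of taking action a in state s under the policy being evaluated
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])
gamma = 0.9

# Iterative policy evaluation: apply the Bellman equation for V^pi until convergence
V = np.zeros(2)
for _ in range(1000):
    V_new = np.array([
        sum(pi[s, a] * (R[s, a] + gamma * np.dot(P[s, a], V)) for a in range(2))
        for s in range(2)
    ])
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)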
Q-Learning: An Off-Policy Approach
Q-Learning is a popular off-policy reinforcement learning algorithm where the goal is to learn the optimal action-value function $Q^*(s, a)$ directly. The update rule for Q-Learning is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$
Where:
- $\alpha$ is the learning rate.
- $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
- $s_{t+1}$ is the new state after the action is taken.
Python Implementation of Q-Learning
Below is a simple Python implementation of the Q-Learning algorithm using the FrozenLake toy environment. It assumes gym >= 0.26 (or gymnasium), where reset() returns an (observation, info) pair and step() returns five values.
import numpy as np
import gym  # requires gym >= 0.26 (or gymnasium) for the reset()/step() API used below

# Initialize environment (render_mode="ansi" lets us print the grid as text later)
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Set learning parameters
alpha = 0.8      # Learning rate
gamma = 0.95     # Discount factor
epsilon = 0.1    # Exploration rate
n_episodes = 1000

# Q-learning algorithm
for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Choose action (epsilon-greedy policy)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Take action and observe reward and next state
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update Q-value
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state, :]) - Q[state, action]
        )

        # Transition to next state
        state = next_state

# Print the learned Q-table
print("Learned Q-table:")
print(Q)

# Demonstrate the learned policy (act greedily with respect to Q)
state, _ = env.reset()
done = False
while not done:
    action = np.argmax(Q[state, :])
    state, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    print(env.render())
Conclusion
Reinforcement Learning, particularly through algorithms like Q-Learning, provides a powerful framework for training agents to make optimal decisions through trial and error. By balancing exploration and exploitation, an RL agent can learn effective strategies for interacting with its environment, even in complex and uncertain settings. The mathematical foundation, coupled with practical implementation, highlights the robustness and versatility of RL in solving real-world problems.