Understanding Reinforcement Learning in One Article: From Q-learning to PPO
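Reinforcement learning (RL) is a core branch of artificial intelligence. Its essence is to let an agent learn by trial and error through interaction with an environment until it finds a good decision policy. From classic Q-learning to PPO (Proximal Policy Optimization), now a mainstream choice in industry, RL algorithms have evolved from simple value-based updates toward stable policy optimization. This article starts from the core principles and uses hands-on code to connect Q-learning to PPO.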

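I. Reinforcement Learning Fundamentals: Starting with Q-learning
1. Core Principles of Q-learning
Q-learning is a classic temporal-difference (TD) learning algorithm. It is model-free and off-policy. Its core is to learn the state-action value function (the Q-function): Q(s,a) is the expected cumulative discounted future reward obtained by taking action a in state s and acting greedily afterwards.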
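To make "expected cumulative discounted future reward" concrete, here is a minimal sketch of the discounted return for one sampled reward sequence. Q(s,a) is the expectation of this quantity over trajectories that begin with (s,a). The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the article's code.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma^t * rewards[t] for one sampled trajectory."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical trajectory: two -1 step penalties, then the +10 goal reward.
example = discounted_return([-1, -1, 10], gamma=0.9)
print(round(example, 2))  # 6.2
```

A smaller gamma shrinks the contribution of the final +10, which is why the discount factor controls how far-sighted the agent is.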
The Q-learning update is derived from the Bellman optimality equation:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
where:
- $\alpha$ is the learning rate (controls the step size of each update)
- $\gamma$ is the discount factor (trades off immediate against future reward)
- $r$ is the immediate reward received after taking the action
- $s'$ is the new state reached after taking the action
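The update rule above can be traced by hand for a single transition. The sketch below is a standalone illustration, not the article's agent: the function name `q_update`, the `done` flag, and the numbers are assumptions chosen for the example.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update and return the new Q(s, a)."""
    # Assumption for this sketch: do not bootstrap from a terminal next state.
    bootstrap = 0.0 if done else max(next_q_values)
    td_target = reward + gamma * bootstrap   # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa              # the bracketed term in the update
    return q_sa + alpha * td_error


# Q(s,a) = 0, step reward -1, best next-state value 2.0:
# target = -1 + 0.9 * 2.0 = 0.8, new Q = 0 + 0.1 * 0.8 = 0.08
new_q = q_update(q_sa=0.0, reward=-1, next_q_values=[0.0, 2.0, 0.0, 0.0])
print(round(new_q, 4))  # 0.08
```

Each update moves Q(s,a) only a fraction `alpha` of the way toward the target, which is why many visits to the same state-action pair are needed before the values settle.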
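2. Q-learning in Practice: The Grid World Environment
We first implement Q-learning on the classic Grid World to build intuition for the algorithm.
Environment description
The environment is a 4x4 grid. The agent starts from a random non-terminal cell and must reach the goal at (3,3). Entering the obstacle at (1,1) gives a reward of -10 and ends the episode, each ordinary step gives -1 (to encourage the shortest path), and the step that reaches the goal gives +10.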
import numpy as np
import random


# Grid World environment
class GridWorld:
    def __init__(self):
        self.rows = 4
        self.cols = 4
        self.end_state = (3, 3)
        self.obstacle_state = (1, 1)
        self.current_state = (0, 0)

    # Reset the environment to a random non-terminal cell
    def reset(self):
        self.current_state = (random.randint(0, self.rows - 1),
                              random.randint(0, self.cols - 1))
        # Resample while the start cell is the goal or the obstacle
        while self.current_state in (self.end_state, self.obstacle_state):
            self.current_state = (random.randint(0, self.rows - 1),
                                  random.randint(0, self.cols - 1))
        return self.current_state

    # Take an action: 0 = up, 1 = down, 2 = left, 3 = right
    # State is (x, y) with x the row index and y the column index.
    def step(self, action):
        x, y = self.current_state
        # Move, staying inside the grid at the borders
        if action == 0:
            x = max(0, x - 1)
        elif action == 1:
            x = min(self.rows - 1, x + 1)
        elif action == 2:
            y = max(0, y - 1)
        elif action == 3:
            y = min(self.cols - 1, y + 1)
        # Update the state
        self.current_state = (x, y)
        # Reward and termination
        if self.current_state == self.obstacle_state:
            reward = -10
            done = True
        elif self.current_state == self.end_state:
            reward = 10
            done = True
        else:
            reward = -1
            done = False
        return self.current_state, reward, done
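
# Q-learning agent
class QLearningAgent:
    def __init__(self, actions, learning_rate=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions      # action space
        self.lr = learning_rate     # learning rate
        self.gamma = gamma          # discount factor
        self.epsilon = epsilon      # exploration rate
        self.q_table = {}           # Q-table: state -> list of Q-values

    # Return the Q-values of a state, creating a zero entry on first visit
    def get_q_value(self, state):
        if state not in self.q_table:
            self.q_table[state] = [0.0] * len(self.actions)
        return self.q_table[state]

    # Choose an action with the epsilon-greedy policy
    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            # Explore: pick a random action
            action = random.choice(self.actions)
        else:
            # Exploit: pick the action with the largest Q-value
            # (ties resolve to the lowest action index)
            q_values = self.get_q_value(state)
            action = int(np.argmax(q_values))
        return action

    # Update the Q-table from one transition
    def learn(self, state, action, reward, next_state):
        q_values = self.get_q_value(state)
        next_q_values = self.get_q_value(next_state)
        # TD target from the Bellman optimality equation. Terminal states are
        # never updated, so their Q-values stay 0 and add no bootstrap term.
        target = reward + self.gamma * np.max(next_q_values)
        # q_values is the list stored in the table, so this updates it in place
        q_values[action] += self.lr * (target - q_values[action])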

# Train the agent
if __name__ == "__main__":
    # Create the environment and the agent
    env = GridWorld()
    # The constructor argument is named learning_rate (not lr)
    agent = QLearningAgent(actions=[0, 1, 2, 3], learning_rate=0.1, gamma=0.9, epsilon=0.1)
    # Training settings
    episodes = 1000
    total_rewards = []
    # Training loop
    for episode in range(episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        while not done:
            # Choose an action
            action = agent.choose_action(state)
            # Apply it to the environment
            next_state, reward, done = env.step(action)
            # Learn from the transition
            agent.learn(state, action, reward, next_state)
            episode_reward += reward
            state = next_state
        total_rewards.append(episode_reward)
        # Report the running average every 100 episodes
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(total_rewards[-100:])
            print(f"Episode {episode+1}, Average Reward: {avg_reward:.2f}")
    # Print the learned Q-values for a few states near the goal
    print("\nQ-table after training (states near the goal):")
    key_states = [(2, 3), (3, 2), (1, 3), (3, 1)]
    for state in key_states:
        if state in agent.q_table:
            print(f"State {state}: Q-values = {agent.q_table[state]}")
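Code notes
- Environment class GridWorld: defines the grid size and the goal/obstacle positions, and implements state reset and action execution;
- Agent class QLearningAgent: contains the ε-greedy policy (balancing exploration and exploitation) and the Q-table update;
- Training: after 1000 episodes the agent typically learns to reach the goal along a shortest path from any start cell, and for states adjacent to the goal the Q-value of the best action is clearly higher than that of the other actions.
Results
During training the 100-episode average reward should rise from negative toward positive values, which indicates that the agent has found short routes to the goal. Exact numbers vary from run to run because start positions and exploration are random. For states adjacent to the goal, the action that enters it (down from (2,3), right from (3,2)) ends up with a much higher Q-value than the other actions.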
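One way to inspect what the agent has learned is to render the greedy action of every state as a text grid. The sketch below assumes a Q-table shaped like the one in the article (a dict mapping `(x, y)` to a list of four Q-values, with actions 0 up, 1 down, 2 left, 3 right). The helper `greedy_policy_grid` and the hand-written `demo_q` table are hypothetical and only serve the illustration.

```python
# Same action encoding as the article: 0 up, 1 down, 2 left, 3 right.
ARROWS = {0: "^", 1: "v", 2: "<", 3: ">"}


def greedy_policy_grid(q_table, rows=4, cols=4, goal=(3, 3), obstacle=(1, 1)):
    """Render the greedy action of each visited state as a text grid."""
    lines = []
    for x in range(rows):          # x is the row index
        cells = []
        for y in range(cols):      # y is the column index
            state = (x, y)
            if state == goal:
                cells.append("G")
            elif state == obstacle:
                cells.append("X")
            elif state in q_table:
                q_values = q_table[state]
                best = max(range(len(q_values)), key=lambda a: q_values[a])
                cells.append(ARROWS[best])
            else:
                cells.append("?")  # state never visited during training
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Hypothetical Q-table covering only the two states next to the goal.
demo_q = {(2, 3): [0.0, 10.0, 0.0, 0.0], (3, 2): [0.0, 0.0, 0.0, 10.0]}
print(greedy_policy_grid(demo_q))
```

Passing `agent.q_table` after training instead of `demo_q` should show arrows that route around the obstacle toward the goal.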
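II. From Q-learning to PPO: Addressing the Core Pain Points of Reinforcement Learning
Q-learning is simple and easy to understand, but it has clear limitations:
- In its tabular form it only handles small discrete state and action spaces, and the max over actions does not extend directly to continuous actions (such as robot control);
- Training can be unstable: the max operator tends to overestimate Q-values, and combining off-policy updates with function approximation can diverge;
- It does not optimize the policy directly; the policy is derived indirectly from the value function.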
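The overestimation problem mentioned in the list can be reproduced with a small simulation. In this sketch every action has a true value of 0, but the agent only sees noisy estimates, and taking the max of those estimates is biased upward. The function `mean_of_max`, the Gaussian noise model, and all parameter values are assumptions made for the demonstration.

```python
import random


def mean_of_max(num_actions=4, noise_std=1.0, trials=10000, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is 0 and estimates are noisy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Unbiased but noisy estimates of action values that are all exactly 0.
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)
    return total / trials


bias = mean_of_max()
print(f"true max Q = 0.00, average estimated max = {bias:.2f}")
```

Although each individual estimate is unbiased, the average of the max is clearly positive, and the gap grows with the number of actions. Because the Q-learning target bootstraps from this max, the bias can propagate through successive updates.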
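PPO (Proximal Policy Optimization) is one of the most widely used RL algorithms today. It is a policy-gradient method that mitigates the unstable training and poor sample reuse of vanilla policy gradients.
1. Core Principles of PPO
The core idea of PPO is to limit how far each update can move the policy, using a clipped surrogate objective, so that a single overly large step cannot collapse training:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right]$$
where:
- $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies;
- $\hat{A}_t$ is the advantage estimate (how much better the action was than average);
- $\text{clip}$ bounds the ratio to $[1-\epsilon, 1+\epsilon]$, limiting the size of the policy update and damping oscillation.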
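The effect of the clipped objective is easiest to see on single samples. The following sketch evaluates the term inside the expectation for a given ratio and advantage; the function name `clipped_surrogate` and the sample numbers are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: pushing the ratio beyond 1 + eps earns nothing extra.
print(round(clipped_surrogate(1.5, advantage=2.0), 2))   # 2.4 (not 3.0)
# Negative advantage: pushing the ratio below 1 - eps earns nothing extra.
print(round(clipped_surrogate(0.5, advantage=-2.0), 2))  # -1.6 (not -1.0)
# A move in the harmful direction is not clipped, so the full penalty remains.
print(round(clipped_surrogate(1.5, advantage=-2.0), 2))  # -3.0
```

Once the ratio leaves the interval in the direction that would improve the objective, the value becomes constant in the ratio, so its gradient is zero and the update stops pushing the policy further.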
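2. PPO in Practice: The CartPole Environment (Gymnasium)
We implement PPO in PyTorch and train an agent on the classic CartPole environment, using Gymnasium, the maintained successor to OpenAI Gym.
Prerequisites
pip install gymnasium torch numpy
Rendering the final test run with render_mode="human" additionally needs pygame, for example via pip install "gymnasium[classic-control]".
Core code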
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym


# Policy network (Actor)
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        logits = self.fc3(x)
        # Return a distribution over the discrete actions
        return torch.distributions.Categorical(logits=logits)


# Value network (Critic)
class Critic(nn.Module):
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        # Scalar state-value estimate V(s), shape (..., 1)
        return self.fc3(x)

# PPO agent
class PPOAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 clip_epsilon=0.2, update_epochs=4):
        self.actor = Actor(state_dim, action_dim)
        self.critic = Critic(state_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma = gamma
        self.clip_epsilon = clip_epsilon
        # Gradient passes over each collected trajectory. With a single pass
        # the ratio is always exactly 1 and the clip never takes effect.
        self.update_epochs = update_epochs
        # Trajectory buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []

    # Sample an action from the current policy and record it
    def choose_action(self, state):
        state = torch.tensor(state, dtype=torch.float32)
        # The stored log-probability belongs to the "old" policy, so it must
        # not carry gradients. Otherwise the gradients of the new and old
        # log-probabilities cancel and the actor never learns.
        with torch.no_grad():
            dist = self.actor(state)
            action = dist.sample()
            log_prob = dist.log_prob(action)
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        return action.item()

    # Store the reward of the latest step
    def store_reward(self, reward):
        self.rewards.append(reward)

    # Compute discounted returns and advantages.
    # Note: despite the name, this is the Monte Carlo return minus the
    # critic's value (a baseline), not full GAE.
    def compute_gae(self):
        states = torch.stack(self.states)
        # Discounted returns, accumulated backwards through the episode
        returns = []
        running_return = 0.0
        for r in reversed(self.rewards):
            running_return = r + self.gamma * running_return
            returns.insert(0, running_return)
        returns = torch.tensor(returns, dtype=torch.float32)
        # Value estimates serve as a fixed baseline: no gradient to the critic
        with torch.no_grad():
            values = self.critic(states).squeeze(-1)
        advantages = returns - values
        return returns, advantages

    # Update the policy and value networks
    def update(self):
        states = torch.stack(self.states)
        actions = torch.stack(self.actions)
        old_log_probs = torch.stack(self.log_probs)
        returns, advantages = self.compute_gae()
        for _ in range(self.update_epochs):
            # Probability ratio between the new and old policies
            dist = self.actor(states)
            new_log_probs = dist.log_prob(actions)
            ratio = torch.exp(new_log_probs - old_log_probs)
            # Clipped surrogate loss
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            # Value loss: regress V(s) onto the discounted returns
            critic_loss = nn.functional.mse_loss(self.critic(states).squeeze(-1), returns)
            # Backpropagation
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()
        # Clear the trajectory buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
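
# Train PPO
if __name__ == "__main__":
    # Create the environment
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    # Create the agent
    agent = PPOAgent(state_dim, action_dim)
    # Training settings
    episodes = 500
    max_steps = 500
    total_rewards = []
    # Training loop
    for episode in range(episodes):
        state, _ = env.reset()
        episode_reward = 0
        for step in range(max_steps):
            # Choose an action
            action = agent.choose_action(state)
            # Apply it; Gymnasium reports termination and truncation separately
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Store the reward
            agent.store_reward(reward)
            # Advance the state and accumulate the reward
            state = next_state
            episode_reward += reward
            # End of episode
            if terminated or truncated:
                break
        # Update the policy once per episode
        agent.update()
        total_rewards.append(episode_reward)
        # Report progress
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(total_rewards[-50:])
            print(f"Episode {episode+1}, Average Reward: {avg_reward:.2f}")
    env.close()

    # Test the trained agent. With render_mode="human" the window is drawn
    # automatically on every step, so no explicit env.render() call is needed.
    env = gym.make("CartPole-v1", render_mode="human")
    state, _ = env.reset()
    for _ in range(1000):
        # Sample from the actor directly so that no trajectory data is stored
        with torch.no_grad():
            dist = agent.actor(torch.tensor(state, dtype=torch.float32))
        action = dist.sample().item()
        state, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            break
    env.close()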
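Code notes
- Actor-Critic architecture: the Actor network outputs a probability distribution over actions, and the Critic network estimates the state value;
- PPO update: the clip function limits how far the policy can move in one update, which helps avoid training collapse;
- Advantage estimation: the code estimates the advantage as the discounted Monte Carlo return minus the Critic's value. This is a simple baseline-subtracted estimator rather than full GAE (Generalized Advantage Estimation), which blends multi-step TD residuals to reduce variance;
- Training performance: the maximum episode return in CartPole-v1 is 500. The average reward should climb toward that ceiling as training progresses, though the level reached after 500 episodes depends on the random seed and hyperparameters.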
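For readers who want to try actual GAE, a minimal standalone sketch follows. It is not the article's implementation: the function name `gae_advantages` and the parameters `lam` (the GAE λ) and `last_value` are assumptions introduced here, and it works on plain Python lists for one trajectory.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] and values[t] are r_t and V(s_t); last_value is V of the state
    after the final step (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum: A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
# lam = 1 recovers the Monte Carlo return minus the value baseline.
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # [2.5, 1.5, 0.5]
# lam = 0 keeps only the one-step TD residual.
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 1.0, 0.5]
```

Intermediate values of `lam` trade the low bias of Monte Carlo returns against the low variance of one-step TD estimates.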
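III. Q-learning vs PPO: Key Differences and Use Cases
| Dimension | Q-learning | PPO |
|---|---|---|
| Algorithm type | Value-based (off-policy) | Policy gradient (on-policy) |
| Action space | Discrete only | Discrete and continuous |
| Training stability | Prone to oscillation and Q-value overestimation | More stable; update size is bounded |
| Sample efficiency | Can reuse past transitions (off-policy) | Lower; needs fresh on-policy data for each update |
| Use cases | Simple discrete tasks (e.g. Grid World) | Complex tasks (games, robotics) |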
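Summary
- Q-learning is the core introductory RL algorithm. It derives a policy by learning a Q-value function and is well suited to understanding the basic logic of RL, but in its tabular form it only fits simple discrete tasks;
- PPO is a mainstream RL algorithm in industry. Built on an Actor-Critic architecture and proximal policy optimization, it mitigates the training instability of earlier policy-gradient methods and supports both discrete and continuous action spaces;
- Hands-on practice is the key to understanding RL: the Grid World Q-learning implementation teaches the core of value-based updates, and the CartPole PPO exercise shows what policy optimization means in practice.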
Author: 一只大侠的侠 (CSDN)