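Reinforcement Learning in One Article: From Q-learning to PPO

Reinforcement learning (RL) is a core branch of artificial intelligence. At its heart, an agent learns by trial and error through interaction with an environment and gradually converges on a good decision policy. From classic Q-learning to PPO (Proximal Policy Optimization), now a mainstream choice in industry, RL algorithms have evolved from simple value-based updates to stable policy optimization. This article starts from the core principles and, with hands-on code, walks through the path from Q-learning to PPO.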


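I. Reinforcement Learning Fundamentals: Starting with Q-learning

1. Core principles of Q-learning

Q-learning is a classic temporal-difference (TD) learning algorithm. It is model-free and off-policy. Its core idea is to learn the state-action value function (the Q-function): Q(s,a) is the expected cumulative (discounted) future reward obtained after taking action a in state s.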
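
To make "cumulative discounted future reward" concrete, here is a minimal sketch that computes the discounted return of a single reward sequence. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the original article.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_k gamma**k * rewards[k], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative episode: two ordinary steps (-1 each), then a goal reward of +10.
example_return = discounted_return([-1, -1, 10], gamma=0.9)
print(round(example_return, 2))  # 6.2
```

Q(s,a) is the expectation of this quantity over all trajectories that start by taking action a in state s.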

The Q-learning update rule is derived from the Bellman optimality equation:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
where:

  • $\alpha$ is the learning rate (controls the size of each update)
  • $\gamma$ is the discount factor (trades off immediate against future reward)
  • $r$ is the immediate reward received after taking the action
  • $s'$ is the new state reached after taking the action
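
As a worked instance of the update rule, the sketch below applies one update to a single Q-value. The helper name `q_update` and all numbers are made up for illustration.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the scalar Q(s, a) and return the new value."""
    td_target = reward + gamma * max(next_q_values)  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa                      # the bracketed term in the formula
    return q_sa + alpha * td_error

# Q(s,a) = 2.0, reward -1, best next-state value 5.0:
# target = -1 + 0.9 * 5.0 = 3.5, error = 1.5, new value = 2.0 + 0.1 * 1.5
new_q = q_update(q_sa=2.0, reward=-1.0, next_q_values=[0.0, 5.0, 1.0, 3.0])
print(round(new_q, 3))  # 2.15
```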

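2. Q-learning in practice: a Grid World environment

We first implement Q-learning in the classic Grid World to get an intuitive feel for the algorithm.

Environment

We build a 4x4 grid. The agent starts from a random cell (excluding the goal and the obstacle) and must reach the goal at (3,3). Stepping onto the obstacle at (1,1) gives a reward of -10 and ends the episode; reaching the goal gives +10; every other step gives -1, which encourages the shortest path.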

import numpy as np
import random

# Define the Grid World environment
class GridWorld:
    def __init__(self):
        self.rows = 4
        self.cols = 4
        self.end_state = (3, 3)
        self.obstacle_state = (1, 1)
        self.current_state = (0, 0)
    
    # Reset the environment to a random start cell
    def reset(self):
        self.current_state = (random.randint(0, 3), random.randint(0, 3))
        # Resample while the start cell is the goal or the obstacle
        while self.current_state == self.end_state or self.current_state == self.obstacle_state:
            self.current_state = (random.randint(0, 3), random.randint(0, 3))
        return self.current_state
    
    # Take an action: 0 = up, 1 = down, 2 = left, 3 = right
    def step(self, action):
        x, y = self.current_state
        
        # Apply the move, clamping at the grid boundary
        if action == 0:
            x = max(0, x - 1)
        elif action == 1:
            x = min(self.rows - 1, x + 1)
        elif action == 2:
            y = max(0, y - 1)
        elif action == 3:
            y = min(self.cols - 1, y + 1)
        
        # Update the state
        self.current_state = (x, y)
        
        # Reward and termination check
        if self.current_state == self.obstacle_state:
            reward = -10
            done = True
        elif self.current_state == self.end_state:
            reward = 10
            done = True
        else:
            reward = -1
            done = False
        
        return self.current_state, reward, done

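# Q-learning agent
class QLearningAgent:
    def __init__(self, actions, learning_rate=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions  # action space
        self.lr = learning_rate  # learning rate
        self.gamma = gamma  # discount factor
        self.epsilon = epsilon  # exploration rate
        self.q_table = {}  # Q-table: state -> list of Q-values, one per action
    
    # Get the Q-values for a state (initialized to zeros on first visit)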
    def get_q_value(self, state):
        if state not in self.q_table:
            self.q_table[state] = [0.0] * len(self.actions)
        return self.q_table[state]
    
    # Choose an action with an epsilon-greedy policy
    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            # Explore: pick a random action
            action = random.choice(self.actions)
        else:
            # Exploit: pick the action with the largest Q-value
            q_values = self.get_q_value(state)
            action = int(np.argmax(q_values))
        return action
    
    # Update the Q-table
    def learn(self, state, action, reward, next_state):
        q_values = self.get_q_value(state)
        next_q_values = self.get_q_value(next_state)
        
        # TD update; terminal states are never updated, so their Q-values
        # stay at zero and the bootstrap term vanishes at episode end
        target = reward + self.gamma * np.max(next_q_values)
        q_values[action] += self.lr * (target - q_values[action])
        self.q_table[state] = q_values

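# Train the agent
if __name__ == "__main__":
    # Create the environment and the agent
    env = GridWorld()
    # The constructor argument is named learning_rate (passing lr= raises a TypeError)
    agent = QLearningAgent(actions=[0, 1, 2, 3], learning_rate=0.1, gamma=0.9, epsilon=0.1)
    
    # Training parameters
    episodes = 1000
    total_rewards = []
    
    # Training loop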
    for episode in range(episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        
        while not done:
            # Choose an action
            action = agent.choose_action(state)
            # Take the action
            next_state, reward, done = env.step(action)
            # Learn from the transition
            agent.learn(state, action, reward, next_state)
            
            episode_reward += reward
            state = next_state
        
        total_rewards.append(episode_reward)
        
        # Print the running average every 100 episodes
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(total_rewards[-100:])
            print(f"Episode {episode+1}, Average Reward: {avg_reward:.2f}")
    
    # Print the learned Q-values for a few states near the goal
    print("\nLearned Q-table (states near the goal):")
    key_states = [(2,3), (3,2), (1,3), (3,1)]
    for state in key_states:
        if state in agent.q_table:
            print(f"State {state}: Q-values = {agent.q_table[state]}")
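Code notes
  1. Environment class GridWorld: defines the grid size and the goal/obstacle positions, and implements state reset and action execution;
  2. Agent class QLearningAgent: contains the epsilon-greedy policy (balancing exploration and exploitation) and the Q-table update;
  3. Training: after 1000 episodes the agent typically learns to reach the goal from any start cell along a shortest path that avoids the obstacle; for cells adjacent to the goal, the Q-value of the action that enters the goal is clearly higher than the others.
Results

As training proceeds, the average reward should rise from negative toward positive values, indicating that the agent has found good paths. At cells adjacent to the goal, the action that enters it (down at (2,3), right at (3,2)) ends up with a much higher Q-value than the other actions.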
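
One way to inspect what was learned is to read the greedy action out of the Q-table for each state. The sketch below assumes the same action encoding as the environment (0 = up, 1 = down, 2 = left, 3 = right); the helper `greedy_policy` and the Q-values in `demo_q_table` are hypothetical and are not output from an actual training run.

```python
import numpy as np

ACTION_NAMES = ["up", "down", "left", "right"]  # indices 0-3, as in the environment

def greedy_policy(q_table):
    """Map each state to the name of its highest-valued action."""
    return {state: ACTION_NAMES[int(np.argmax(q))] for state, q in q_table.items()}

# Hypothetical Q-values for illustration only, not results from a real run.
demo_q_table = {
    (2, 3): [-1.0, 10.0, 0.5, 2.0],
    (3, 2): [0.3, 1.5, -0.5, 10.0],
}
policy = greedy_policy(demo_q_table)
print(policy)
```

With a trained agent, the same function could be called as `greedy_policy(agent.q_table)`.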

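II. From Q-learning to PPO: Addressing the Core Pain Points of Reinforcement Learning

Q-learning is simple and easy to understand, but it has clear limitations:

  1. It only handles discrete action spaces and cannot handle continuous actions (e.g. robot control), because the update takes a max over all actions;
  2. Training can be unstable, especially once the table is replaced by a function approximator, and the max operator tends to overestimate Q-values;
  3. It does not optimize the policy directly; the policy is derived only indirectly from the value function.

PPO (Proximal Policy Optimization), one of the most widely used RL algorithms today, is a policy-gradient method. It addresses the unstable training of vanilla policy gradients and improves on their sample efficiency by reusing each batch of data for several update epochs.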
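
The overestimation of Q-values can be reproduced in a small numerical experiment. In this sketch every action has a true value of 0, but the estimates carry zero-mean noise; taking the max over the noisy estimates is biased upward. The noise scale, the number of actions, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 4, 10_000

true_q = np.zeros(n_actions)  # every action is truly worth 0
# Noisy estimates: true value plus zero-mean Gaussian noise, one row per trial.
noisy_estimates = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

max_of_estimates = noisy_estimates.max(axis=1).mean()  # what the max operator sees
mean_of_estimates = noisy_estimates.mean()             # unbiased, close to 0

print(f"true max value:        {true_q.max():.3f}")
print(f"mean of max estimates: {max_of_estimates:.3f}")
print(f"mean of all estimates: {mean_of_estimates:.3f}")
```

Although the individual estimates are unbiased, the average of their maximum is clearly positive, which is the bias that the max in the Q-learning target inherits.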

1. Core principles of PPO

The core of PPO is a clipped surrogate objective that limits how far the policy can move in a single update, which prevents overly large updates from collapsing training:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$
where:

  • $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies;
  • $\hat{A}_t$ is the advantage estimate (how much better the action was than average);
  • $\text{clip}$ bounds the ratio to $[1-\epsilon, 1+\epsilon]$, limiting the update size and avoiding oscillation.
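
To see what the min and the clip do together, the sketch below evaluates the per-sample objective for four combinations of ratio and advantage sign. It uses NumPy rather than PyTorch to stay dependency-light; the function name `clipped_surrogate` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# (ratio, advantage) pairs: (1.5, +1), (0.5, +1), (1.5, -1), (0.5, -1)
values = clipped_surrogate([1.5, 0.5, 1.5, 0.5], [1.0, 1.0, -1.0, -1.0])
print(values)  # [ 1.2  0.5 -1.5 -0.8]
```

The gain is capped when the ratio has already moved in the direction the advantage favors (first and last case), so there is no incentive to push further. When the ratio has moved the wrong way (middle two cases), the objective is left unclipped and the penalty is kept in full.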

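2. PPO in practice: the CartPole environment (Gymnasium)

We implement PPO in PyTorch and train an agent in the classic CartPole environment, using Gymnasium, the maintained successor to OpenAI Gym.

Dependencies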
pip install gymnasium torch numpy
Core code
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym

# Policy network (Actor)
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        logits = self.fc3(x)
        return torch.distributions.Categorical(logits=logits)

# Value network (Critic)
class Critic(nn.Module):
    def __init__(self, state_dim, hidden_dim=64):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        return self.fc3(x)

# PPO agent
class PPOAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, clip_epsilon=0.2):
        self.actor = Actor(state_dim, action_dim)
        self.critic = Critic(state_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma = gamma
        self.clip_epsilon = clip_epsilon
        
        # Buffers for trajectory data
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
    
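    # Choose an action
    def choose_action(self, state):
        state = torch.tensor(state, dtype=torch.float32)
        # Sample without tracking gradients: the stored log-prob belongs to
        # the old policy and must be a constant during the update
        with torch.no_grad():
            dist = self.actor(state)
            action = dist.sample()
            log_prob = dist.log_prob(action)
        
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        
        return action.item()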
    
    # Store a reward
    def store_reward(self, reward):
        self.rewards.append(reward)
    
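    # Compute returns and advantages. Despite the method name, this is the
    # Monte Carlo return minus a value baseline (GAE with lambda = 1)
    def compute_gae(self):
        states = torch.stack(self.states)
        rewards = np.array(self.rewards)
        
        # Discounted returns, accumulated backwards from the last step
        returns = []
        running_return = 0
        for r in reversed(rewards):
            running_return = r + self.gamma * running_return
            returns.insert(0, running_return)
        returns = torch.tensor(returns, dtype=torch.float32)
        
        # Value estimates, used only as a baseline (no gradient needed)
        with torch.no_grad():
            values = self.critic(states).squeeze(-1)
        # Advantage = return - baseline
        advantages = returns - values
        
        return returns, advantages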
    
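    # Update the policy and value networks
    def update(self, epochs=4):
        # `epochs` is an added parameter: several passes over the same batch
        # let the ratio move away from 1 so that clipping can take effect
        returns, advantages = self.compute_gae()
        states = torch.stack(self.states)
        actions = torch.stack(self.actions)
        # Treat old log-probs and advantages as constants; otherwise the
        # gradient of the ratio cancels to exactly zero
        old_log_probs = torch.stack(self.log_probs).detach()
        advantages = advantages.detach()
        
        for _ in range(epochs):
            # Probability ratio between the new and the old policy
            new_log_probs = self.actor(states).log_prob(actions)
            ratio = torch.exp(new_log_probs - old_log_probs)
            
            # Clipped surrogate loss
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            
            # Value loss: regress V(s) onto the discounted returns
            critic_loss = nn.MSELoss()(self.critic(states).squeeze(-1), returns)
            
            # Backpropagate and step both optimizers
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()
            
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()
        
        # Clear the trajectory buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []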

# Train PPO
if __name__ == "__main__":
    # Create the environment
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    # Create the agent
    agent = PPOAgent(state_dim, action_dim)
    
    # Training parameters
    episodes = 500
    max_steps = 500
    total_rewards = []
    
    # Training loop
    for episode in range(episodes):
        state, _ = env.reset()
        episode_reward = 0
        
        for step in range(max_steps):
            # Choose an action
            action = agent.choose_action(state)
            # Take the action
            next_state, reward, done, truncated, _ = env.step(action)
            # Store the reward
            agent.store_reward(reward)
            # Advance the state and accumulate the episode reward
            state = next_state
            episode_reward += reward
            
            # End of episode
            if done or truncated:
                break
        
        # Update the policy after each episode
        agent.update()
        total_rewards.append(episode_reward)
        
        # Print training progress
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(total_rewards[-50:])
            print(f"Episode {episode+1}, Average Reward: {avg_reward:.2f}")
    
    # Run the trained agent with on-screen rendering
    env = gym.make("CartPole-v1", render_mode="human")
    state, _ = env.reset()
    for _ in range(1000):
        action = agent.choose_action(state)
        state, _, done, truncated, _ = env.step(action)
        env.render()
        if done or truncated:
            break
    env.close()
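Code notes
  1. Actor-Critic architecture: the Actor network outputs a probability distribution over actions; the Critic network estimates the state value;
  2. Core PPO update: the clip function bounds how far the policy can move per update, which avoids training collapse. Clipping only takes effect if the old log-probabilities are treated as constants and the same batch is used for more than one gradient step;
  3. Advantage estimation: although the method is named compute_gae, it computes the discounted Monte Carlo return minus the Critic's value estimate, which corresponds to GAE (Generalized Advantage Estimation) with λ=1. Full GAE with λ<1 accepts a little bias in exchange for lower variance;
  4. Training results: the maximum episode return in CartPole-v1 is 500. A well-tuned agent can approach this ceiling, but the average reward reached after 500 episodes varies with the random seed and hyperparameters.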
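
For comparison, here is a minimal NumPy sketch of Generalized Advantage Estimation itself. The parameter `lam` (λ), the `last_value` bootstrap argument, and the default values are assumptions for illustration; they do not appear in the code above.

```python
import numpy as np

def compute_gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory.

    rewards[t] and values[t] belong to step t; last_value is V(s_T), the value
    of the state after the final step (0.0 if the episode terminated there).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # regression targets for the critic
    return advantages, returns

adv_mc, ret_mc = compute_gae([1, 1, 1], [0.5, 0.5, 0.5], lam=1.0)  # Monte Carlo limit
adv_td, _ = compute_gae([1, 1, 1], [0.5, 0.5, 0.5], lam=0.0)       # one-step TD limit
print(ret_mc, adv_td)
```

With `lam=1.0` and a terminated episode the returns equal the plain discounted returns, matching the return-minus-baseline calculation; with `lam=0.0` each advantage reduces to the one-step TD error.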

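III. Q-learning vs PPO: Key Differences and Use Cases

| Dimension | Q-learning | PPO |
|---|---|---|
| Algorithm type | Value-based (off-policy) | Policy gradient (on-policy) |
| Action space | Discrete only | Discrete and continuous |
| Training stability | Prone to oscillation and Q-value overestimation | Stable; update size is bounded |
| Sample efficiency | Off-policy, so past experience can be reused | On-policy, so each batch is discarded after a few update epochs |
| Use cases | Simple discrete tasks (e.g. GridWorld) | Complex tasks (games, robotics) |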

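Summary

  1. Q-learning is the core introductory RL algorithm. It reaches a good policy by learning the Q-value function and is well suited to understanding the basic logic of RL, but it only applies to simple discrete tasks;
  2. PPO is a mainstream RL algorithm in industry, built on an Actor-Critic architecture and a clipped (proximal) policy update. It addresses the training instability of earlier policy-gradient methods and supports both discrete and continuous action spaces;
  3. Hands-on practice is key to understanding RL: the Grid World Q-learning implementation helps you grasp the core of value-based learning, and the CartPole PPO implementation shows what policy optimization looks like in practice.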

Author homepage: 一只大侠的侠 · CSDN
