动手学强化学习第三章(马尔可夫决策过程)
马尔可夫决策过程1.公式总结2.代码实践2.1 计算序列的回报2.2 利用贝尔曼方程的矩阵形式计算解析解2.3 解析法计算MDP中每个状态价值2.4 使用蒙特卡洛方法计算MDP的状态价值2.5 估算策略的占用度量
文章目录
马尔可夫决策过程
文章转载自《动手学强化学习》https://hrl.boyuai.com/chapter/intro
1.公式总结
- MRP中的价值函数
V ( s ) = E [ G t ∣ S t = s ] = E [ R t + γ R t + 1 + γ 2 R t + 2 + … ∣ S t = s ] = E [ R t + γ ( R t + 1 + γ R t + 2 + … ) ∣ S t = s ] = E [ R t + γ G t + 1 ∣ S t = s ] = E [ R t + γ V ( S t + 1 ) ∣ S t = s ] \begin{aligned} V(s) &=\mathbb{E}\left[G_{t} \mid S_{t}=s\right] \\ &=\mathbb{E}\left[R_{t}+\gamma R_{t+1}+\gamma^{2} R_{t+2}+\ldots \mid S_{t}=s\right] \\ &=\mathbb{E}\left[R_{t}+\gamma\left(R_{t+1}+\gamma R_{t+2}+\ldots\right) \mid S_{t}=s\right] \\ &=\mathbb{E}\left[R_{t}+\gamma G_{t+1} \mid S_{t}=s\right] \\ &=\mathbb{E}\left[R_{t}+\gamma V\left(S_{t+1}\right) \mid S_{t}=s\right] \end{aligned} V(s)=E[Gt∣St=s]=E[Rt+γRt+1+γ2Rt+2+…∣St=s]=E[Rt+γ(Rt+1+γRt+2+…)∣St=s]=E[Rt+γGt+1∣St=s]=E[Rt+γV(St+1)∣St=s]
- MRP中的贝尔曼方程
V ( s ) = r ( s ) + γ ∑ s ′ ∈ S p ( s ′ ∣ s ) V ( s ′ ) V(s)=r(s)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right) V(s)=r(s)+γs′∈S∑p(s′∣s)V(s′)
- 矩阵形式的贝尔曼方程
V = R + γ P V [ V ( s 1 ) V ( s 2 ) … V ( s n ) ] = [ r ( s 1 ) r ( s 2 ) … r ( s n ) ] + γ [ P ( s 1 ∣ s 1 ) p ( s 2 ∣ s 1 ) … P ( s n ∣ s 1 ) P ( s 1 ∣ s 2 ) P ( s 2 ∣ s 2 ) … P ( s n ∣ s 2 ) … P ( s 1 ∣ s n ) P ( s 2 ∣ s n ) … P ( s n ∣ s n ) ] [ V ( s 1 ) V ( s 2 ) … V ( s n ) ] \begin{array}{c} \mathcal{V}=\mathcal{R}+\gamma \mathcal{P} \mathcal{V} \\ {\left[\begin{array}{c} V\left(s_{1}\right) \\ V\left(s_{2}\right) \\ \ldots \\ V\left(s_{n}\right) \end{array}\right]=\left[\begin{array}{c} r\left(s_{1}\right) \\ r\left(s_{2}\right) \\ \ldots \\ r\left(s_{n}\right) \end{array}\right]+\gamma\left[\begin{array}{cccc} P\left(s_{1} \mid s_{1}\right) & p\left(s_{2} \mid s_{1}\right) & \ldots & P\left(s_{n} \mid s_{1}\right) \\ P\left(s_{1} \mid s_{2}\right) & P\left(s_{2} \mid s_{2}\right) & \ldots & P\left(s_{n} \mid s_{2}\right) \\ \ldots & & & \\ P\left(s_{1} \mid s_{n}\right) & P\left(s_{2} \mid s_{n}\right) & \ldots & P\left(s_{n} \mid s_{n}\right) \end{array}\right]\left[\begin{array}{c} V\left(s_{1}\right) \\ V\left(s_{2}\right) \\ \ldots \\ V\left(s_{n}\right) \end{array}\right]} \end{array} V=R+γPV⎣⎢⎢⎡V(s1)V(s2)…V(sn)⎦⎥⎥⎤=⎣⎢⎢⎡r(s1)r(s2)…r(sn)⎦⎥⎥⎤+γ⎣⎢⎢⎡P(s1∣s1)P(s1∣s2)…P(s1∣sn)p(s2∣s1)P(s2∣s2)P(s2∣sn)………P(sn∣s1)P(sn∣s2)P(sn∣sn)⎦⎥⎥⎤⎣⎢⎢⎡V(s1)V(s2)…V(sn)⎦⎥⎥⎤
- 矩阵形式贝尔曼方程的解析解
V = R + γ P V ( I − γ P ) V = R V = ( I − γ P ) − 1 R \begin{aligned} \mathcal{V} &=\mathcal{R}+\gamma \mathcal{P} \mathcal{V} \\ (I-\gamma \mathcal{P}) \mathcal{V} &=\mathcal{R} \\ \mathcal{V} &=(I-\gamma \mathcal{P})^{-1} \mathcal{R} \end{aligned} V(I−γP)VV=R+γPV=R=(I−γP)−1R
- 状态价值函数
V π ( s ) = E π [ G t ∣ S t = s ] V^{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right] Vπ(s)=Eπ[Gt∣St=s]
- 动作价值函数
Q π ( s , a ) = E π [ G t ∣ S t = s , A t = a ] Q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s, A_{t}=a\right] Qπ(s,a)=Eπ[Gt∣St=s,At=a]
- 状态价值函数与动作价值函数的转换
V π ( s ) = ∑ a ∈ A π ( a ∣ s ) Q π ( s , a ) V^{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q^{\pi}(s, a) Vπ(s)=a∈A∑π(a∣s)Qπ(s,a)
Q π ( s , a ) = r ( s , a ) + γ ∑ s ′ ∈ S P ( s ′ ∣ s , a ) V π ( s ′ ) Q^{\pi}(s, a)=r(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^{\pi}\left(s^{\prime}\right) Qπ(s,a)=r(s,a)+γs′∈S∑P(s′∣s,a)Vπ(s′)
- 贝尔曼期望方程
V π ( s ) = E π [ R t + γ V π ( S t + 1 ) ∣ S t = s ] = ∑ a ∈ A π ( a ∣ s ) ( r ( s , a ) + γ ∑ s ′ ∈ S p ( s ′ ∣ s , a ) V π ( s ′ ) ) \begin{aligned} V^{\pi}(s) &=\mathbb{E}_{\pi}\left[R_{t}+\gamma V^{\pi}\left(S_{t+1}\right) \mid S_{t}=s\right] \\ &=\sum_{a \in A} \pi(a \mid s)\left(r(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V^{\pi}\left(s^{\prime}\right)\right) \end{aligned} Vπ(s)=Eπ[Rt+γVπ(St+1)∣St=s]=a∈A∑π(a∣s)(r(s,a)+γs′∈S∑p(s′∣s,a)Vπ(s′))
Q π ( s , a ) = E π [ R t + γ Q π ( S t + 1 , A t + 1 ) ∣ S t = s , A t = a ] = r ( s , a ) + γ ∑ s ′ ∈ S p ( s ′ ∣ s , a ) ∑ a ′ ∈ A π ( a ′ ∣ s ′ ) Q π ( s ′ , a ′ ) \begin{aligned} Q^{\pi}(s, a) &=\mathbb{E}_{\pi}\left[R_{t}+\gamma Q^{\pi}\left(S_{t+1}, A_{t+1}\right) \mid S_{t}=s, A_{t}=a\right] \\ &=r(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q^{\pi}\left(s^{\prime}, a^{\prime}\right) \end{aligned} Qπ(s,a)=Eπ[Rt+γQπ(St+1,At+1)∣St=s,At=a]=r(s,a)+γs′∈S∑p(s′∣s,a)a′∈A∑π(a′∣s′)Qπ(s′,a′)
- 最优状态价值函数
V ∗ ( s ) = max π V π ( s ) , ∀ s ∈ S V^{*}(s)=\max _{\pi} V^{\pi}(s), \quad \forall s \in \mathcal{S} V∗(s)=πmaxVπ(s),∀s∈S
- 最优动作价值函数
Q ∗ ( s , a ) = max π Q π ( s , a ) , ∀ s ∈ S , a ∈ A Q^{*}(s, a)=\max _{\pi} Q^{\pi}(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A} Q∗(s,a)=πmaxQπ(s,a),∀s∈S,a∈A
- 最优状态函数与最优动作函数之间的转换
Q ∗ ( s , a ) = r ( s , a ) + γ ∑ s ′ ∈ S P ( s ′ ∣ s , a ) V ∗ ( s ′ ) Q^{*}(s, a)=r(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^{*}\left(s^{\prime}\right) Q∗(s,a)=r(s,a)+γs′∈S∑P(s′∣s,a)V∗(s′)
V ∗ ( s ) = max a ∈ A Q ∗ ( s , a ) V^{*}(s)=\max _{a \in \mathcal{A}} Q^{*}(s, a) V∗(s)=a∈AmaxQ∗(s,a)
2.代码实践
代码参考自动手学强化学习(jupyter notebook版本):https://github.com/boyu-ai/Hands-on-RL
使用pycharm打开的请查看:https://github.com/zxs-000202/dsx-rl
方便的话给个star~
2.1 计算序列的回报
import numpy as np
np.random.seed(0)
# 定义状态转移概率矩阵P
P = [
[0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
[0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
[0.0, 0.0, 0.0, 0.0, 0.3, 0.7],
[0.0, 0.2, 0.3, 0.5, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
P = np.array(P)
rewards = [-1, -2, -2, 10, 1, 0] # 定义奖励函数
gamma = 0.5 # 定义折扣因子
# 给定一条序列,计算从某个索引(起始状态)开始到序列最后(终止状态)得到的回报
def compute_return(start_index, chain, gamma):
G = 0
for i in reversed(range(start_index, len(chain))):
G = gamma * G + rewards[chain[i] - 1]
return G
# 一个状态序列,s1-s2-s3-s6
chain = [1, 2, 3, 6]
start_index = 0
G = compute_return(start_index, chain, gamma)
print("根据本序列计算得到回报为:%s。" % G)
根据本序列计算得到回报为:-2.5。
2.2 利用贝尔曼方程的矩阵形式计算解析解
def compute(P, rewards, gamma, states_num):
''' 利用贝尔曼方程的矩阵形式计算解析解,states_num是MRP的状态数 '''
rewards = np.array(rewards).reshape((-1, 1)) #将rewards写成列向量形式
value = np.dot(np.linalg.inv(np.eye(states_num, states_num) - gamma * P),
rewards)
return value
V = compute(P, rewards, gamma, 6)
print("MRP中每个状态价值分别为\n", V)
MRP中每个状态价值分别为
[[-2.01950168]
[-2.21451846]
[ 1.16142785]
[10.53809283]
[ 3.58728554]
[ 0. ]]
补充:
np.eye()
返回的是一个二维的数组(N,M),对角线的地方为1,其余的地方为0.
np.linalg.inv()
矩阵求逆
2.3 解析法计算MDP中每个状态价值
S = ["s1", "s2", "s3", "s4", "s5"] # 状态集合
A = ["保持s1", "前往s1", "前往s2", "前往s3", "前往s4", "前往s5", "概率前往"] # 动作集合
# 状态转移函数
P = {
"s1-保持s1-s1": 1.0,
"s1-前往s2-s2": 1.0,
"s2-前往s1-s1": 1.0,
"s2-前往s3-s3": 1.0,
"s3-前往s4-s4": 1.0,
"s3-前往s5-s5": 1.0,
"s4-前往s5-s5": 1.0,
"s4-概率前往-s2": 0.2,
"s4-概率前往-s3": 0.4,
"s4-概率前往-s4": 0.4,
}
# 奖励函数
R = {
"s1-保持s1": -1,
"s1-前往s2": 0,
"s2-前往s1": -1,
"s2-前往s3": -2,
"s3-前往s4": -2,
"s3-前往s5": 0,
"s4-前往s5": 10,
"s4-概率前往": 1,
}
gamma = 0.5 # 折扣因子
MDP = (S, A, P, R, gamma)
# 策略1,随机策略
Pi_1 = {
"s1-保持s1": 0.5,
"s1-前往s2": 0.5,
"s2-前往s1": 0.5,
"s2-前往s3": 0.5,
"s3-前往s4": 0.5,
"s3-前往s5": 0.5,
"s4-前往s5": 0.5,
"s4-概率前往": 0.5,
}
# 策略2
Pi_2 = {
"s1-保持s1": 0.6,
"s1-前往s2": 0.4,
"s2-前往s1": 0.3,
"s2-前往s3": 0.7,
"s3-前往s4": 0.5,
"s3-前往s5": 0.5,
"s4-前往s5": 0.1,
"s4-概率前往": 0.9,
}
# 把输入的两个字符串通过“-”连接,便于使用上述定义的P、R变量
def join(str1, str2):
return str1 + '-' + str2
gamma = 0.5
# 转化后的MRP的状态转移矩阵
P_from_mdp_to_mrp = [
[0.5, 0.5, 0.0, 0.0, 0.0],
[0.5, 0.0, 0.5, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.5, 0.5],
[0.0, 0.1, 0.2, 0.2, 0.5],
[0.0, 0.0, 0.0, 0.0, 1.0],
]
P_from_mdp_to_mrp = np.array(P_from_mdp_to_mrp)
R_from_mdp_to_mrp = [-0.5, -1.5, -1.0, 5.5, 0]
V = compute(P_from_mdp_to_mrp, R_from_mdp_to_mrp, gamma, 5)
print("MDP中每个状态价值分别为\n", V)
MDP中每个状态价值分别为
[[-1.22555411]
[-1.67666232]
[ 0.51890482]
[ 6.0756193 ]
[ 0. ]]
2.4 使用蒙特卡洛方法计算MDP的状态价值
def sample(MDP, Pi, timestep_max, number):
''' 采样函数,策略Pi,限制最长时间步timestep_max,总共采样序列数number '''
S, A, P, R, gamma = MDP
episodes = []
for _ in range(number):
episode = []
timestep = 0
s = S[np.random.randint(4)] # 随机选择一个除s5以外的状态s作为起点
# 当前状态为终止状态或者时间步太长时,一次采样结束
while s != "s5" and timestep <= timestep_max:
timestep += 1
rand, temp = np.random.rand(), 0
# 在状态s下根据策略选择动作
for a_opt in A:
temp += Pi.get(join(s, a_opt), 0)
if temp > rand:
a = a_opt
r = R.get(join(s, a), 0)
break
rand, temp = np.random.rand(), 0
# 根据状态转移概率得到下一个状态s_next
for s_opt in S:
temp += P.get(join(join(s, a), s_opt), 0)
if temp > rand:
s_next = s_opt
break
episode.append((s, a, r, s_next)) # 把(s,a,r,s_next)元组放入序列中
s = s_next # s_next变成当前状态,开始接下来的循环
episodes.append(episode)
return episodes
# 采样5次,每个序列最长不超过1000步
episodes = sample(MDP, Pi_1, 20, 5)
print('第一条序列\n', episodes[0])
print('第二条序列\n', episodes[1])
print('第五条序列\n', episodes[4])
第一条序列
[('s1', '前往s2', 0, 's2'), ('s2', '前往s3', -2, 's3'), ('s3', '前往s5', 0, 's5')]
第二条序列
[('s4', '概率前往', 1, 's4'), ('s4', '前往s5', 10, 's5')]
第五条序列
[('s2', '前往s3', -2, 's3'), ('s3', '前往s4', -2, 's4'), ('s4', '前往s5', 10, 's5')]
# 对所有采样序列计算所有状态的价值
def MC(episodes, V, N, gamma):
for episode in episodes:
G = 0
for i in range(len(episode) - 1, -1, -1): #一个序列从后往前计算
(s, a, r, s_next) = episode[i]
G = r + gamma * G
N[s] = N[s] + 1
V[s] = V[s] + (G - V[s]) / N[s]
timestep_max = 20
# 采样1000次,可以自行修改
episodes = sample(MDP, Pi_1, timestep_max, 1000)
gamma = 0.5
V = {"s1": 0, "s2": 0, "s3": 0, "s4": 0, "s5": 0}
N = {"s1": 0, "s2": 0, "s3": 0, "s4": 0, "s5": 0}
MC(episodes, V, N, gamma)
print("使用蒙特卡洛方法计算MDP的状态价值为\n", V)
使用蒙特卡洛方法计算MDP的状态价值为
{'s1': -1.228923788722258, 's2': -1.6955696284402704, 's3': 0.4823809701532294, 's4': 5.967514743019431, 's5': 0}
补充:
dict.get()
dict.get(a,b) : a是键值key,如果存在dict存在键值a,则函数返回dict[a];否则返回b,如果没有定义b参数,则返回None
举例:
m={'a':1,'b':2,'center':3}
m.get('a',100)
1
m={'a':1,'b':2,'center':3}
m.get(2,100)
100
m={'a':1,'b':2,'center':3}
m.get(2)
None
2.5 估算策略的占用度量
def occupancy(episodes, s, a, timestep_max, gamma):
''' 计算状态动作对(s,a)出现的频率,以此来估算策略的占用度量 '''
rho = 0
total_times = np.zeros(timestep_max) # 记录每个时间步t各被经历过几次
occur_times = np.zeros(timestep_max) # 记录(s_t,a_t)=(s,a)的次数
for episode in episodes:
for i in range(len(episode)):
(s_opt, a_opt, r, s_next) = episode[i]
total_times[i] += 1
if s == s_opt and a == a_opt:
occur_times[i] += 1
for i in reversed(range(timestep_max)):
if total_times[i]:
rho += gamma**i * occur_times[i] / total_times[i]
return (1 - gamma) * rho
gamma = 0.5
timestep_max = 1000
episodes_1 = sample(MDP, Pi_1, timestep_max, 1000)
episodes_2 = sample(MDP, Pi_2, timestep_max, 1000)
rho_1 = occupancy(episodes_1, "s4", "概率前往", timestep_max, gamma)
rho_2 = occupancy(episodes_2, "s4", "概率前往", timestep_max, gamma)
print(rho_1, rho_2)
# 0.112567796310472 0.23199480615618912
0.112567796310472 0.23199480615618912
更多推荐
所有评论(0)