Value-Based Methods in RL
Deep Q-Network
Goal: Win the game (≈ maximize the total reward).
Question: If we know \(Q^{\star}(s,a)\), what is the best action?
- Obviously, the best action is \(a_{t} = \arg\max_{a} Q^{\star}(s_{t},a)\).
Challenge: We do not know \(Q^{\star}(s,a)\).
- Solution: Deep Q-Network (DQN)
- Use a neural network \(Q(s,a;\mathbf{w})\) to approximate \(Q^{\star}(s,a)\): the DQN takes the current state \(s\) as input and outputs one Q value for every action in the action space (a minimal sketch follows the figure below).
![20201215201250](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215201250.png)
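For concreteness, here is a minimal sketch of such a network in PyTorch; the framework choice, layer sizes, state dimension, and action count are illustrative assumptions, not from the source:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q(s, a; w): takes a state vector, returns one Q value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),   # hidden size 128 is an arbitrary choice
            nn.ReLU(),
            nn.Linear(128, n_actions),   # one output per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)           # shape: (..., n_actions)
```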
Apply DQN to Play a Game
![20201215201448](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215201448.png)
- Observe the environment to obtain the state \(s_{t}\), i.e. the observation.
- Feed the state \(s_{t}\) into the DQN and take the action \(a_{t}\) that maximizes the Q value.
- The environment reacts to the agent's action and samples the next state via the state-transition function, \(s_{t+1} \sim p(\cdot \mid s_{t},a_{t})\).
- The environment also returns the reward for this step (a sketch of the loop follows below).
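A sketch of this interaction loop, assuming a Gymnasium-style environment (`reset`/`step` with the five-value return) and the `DQN` network sketched above:

```python
import torch

def play_one_episode(env, q_net) -> float:
    state, _ = env.reset()                       # observe s_t
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        action = int(torch.argmax(q_values))     # a_t = argmax_a Q(s_t, a; w)
        # The environment samples s_{t+1} ~ p(.|s_t, a_t) and returns r_t.
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
    return total_reward
```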
Estimating the Value Function
Monte-Carlo Policy Evaluation
Temporal-Difference (TD) Learning -- Single-Step Updates
Intuition: a model with parameters \(\mathbf{w}\) predicts the total duration of a trip; partway through, we arrive at DC and know the time actually spent so far.
- Can I update the model before finishing the trip?
- Can I get a better \(\mathbf{w}\) as soon as I arrive at DC?
That's TD learning!
![20201215204718](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215204718.png)
TD error
![20201215204809](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215204809.png)
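As a toy illustration of the TD target and TD error in the trip analogy (all numbers below are made up; `actual_so_far` is the part that has really been observed):

```python
predicted_total     = 1000.0  # model's estimate for the whole trip (minutes)
actual_so_far       = 300.0   # time actually spent so far (observed)
predicted_remaining = 600.0   # model's estimate for the rest of the trip

td_target = actual_so_far + predicted_remaining  # 900: partly real, so more reliable
td_error  = predicted_total - td_target          # 100: used as the learning signal
# TD learning nudges the parameters w to shrink this error,
# e.g. w <- w - alpha * td_error * d(predicted_total)/dw.
```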
Apply TD learning to DQN
To apply the TD algorithm, we need an equation with a single term on the left and two terms on the right, one of which is an actual observation.
![20201215205216](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215205216.png)
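Concretely, the relation is the standard DQN form (written here with a discount factor \(\gamma\)): the left side is a pure prediction, while the right side mixes one observed quantity with one prediction:

\[
\underbrace{Q(s_t, a_t; \mathbf{w})}_{\text{prediction}}
\;\approx\;
\underbrace{r_t}_{\text{observed}}
\;+\;
\gamma \, \underbrace{\max_{a} Q(s_{t+1}, a; \mathbf{w})}_{\text{prediction}}
\]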
A brief derivation:
![20201215205708](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215205708.png)
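In outline, the standard argument is: the discounted return satisfies a recursion, and taking (approximate) expectations of both sides yields the TD relation above:

\[
U_t = r_t + \gamma\, U_{t+1}
\;\;\Longrightarrow\;\;
Q(s_t, a_t; \mathbf{w}) \approx \mathbb{E}\big[\, r_t + \gamma\, Q(s_{t+1}, a_{t+1}; \mathbf{w}) \,\big],
\]

with \(a_{t+1} = \arg\max_{a} Q(s_{t+1}, a; \mathbf{w})\) when the DQN acts greedily.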
Train DQN using TD learning
![20201215211053](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215211053.png)
The value \(Q(s_{t+1},a_{t+1};\mathbf{w}_{t})\) is in fact the maximum Q value obtained by treating \(a\) as the free variable, i.e. \(\max_{a} Q(s_{t+1},a;\mathbf{w}_{t})\).
One iteration of TD learning
![20201215211400](https://blog-1259556217.cos.ap-chengdu.myqcloud.com/image/20201215211400.png)
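A sketch of one such iteration in PyTorch, assuming a single transition \((s_t, a_t, r_t, s_{t+1})\), the `DQN` network from above, and an SGD optimizer (the discount factor and learning rate are illustrative):

```python
import torch

def td_step(q_net, optimizer, s_t, a_t, r_t, s_next, gamma=0.99):
    s_t    = torch.as_tensor(s_t, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # 1. Prediction: q_t = Q(s_t, a_t; w_t)
    q_t = q_net(s_t)[a_t]

    # 2. TD target: y_t = r_t + gamma * max_a Q(s_{t+1}, a; w_t), held fixed
    with torch.no_grad():
        y_t = r_t + gamma * q_net(s_next).max()

    # 3. Loss is the squared TD error
    loss = 0.5 * (q_t - y_t) ** 2

    # 4. Gradient descent: w_{t+1} = w_t - alpha * dLoss/dw
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Usage (hypothetical): optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
```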