Deep Q-Network

Goal: Win the game (≈ maximize the total reward).

Question: If we know $Q^\star(s, a)$, what is the best action?

  • Obviously, the best action is $a_t = \operatorname*{argmax}_a Q^\star(s_t, a)$.

Challenge: We do not know $Q^\star(s, a)$.

  • Solution: Deep Q-Network (DQN)
  • Use a neural network $Q(s, a; \mathbf{w})$ to approximate $Q^\star(s, a)$: the DQN takes the current state $s$ as input and outputs one Q value for every action in the action space (a sketch of such a network follows below).
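A minimal sketch of such a network in PyTorch, assuming a low-dimensional state vector and a discrete action space; the layer sizes and the names `state_dim` / `num_actions` are illustrative assumptions, not from the original notes:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, a; w): maps a state vector to one Q value per discrete action."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # one Q value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> Q values: (batch, num_actions)
        return self.net(state)

def select_action(q_net: QNetwork, state: torch.Tensor) -> int:
    """Greedy action selection: a_t = argmax_a Q(s_t, a; w)."""
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())
```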

Apply DQN to Play a Game

  1. Observe the environment and obtain the current state $s_t$ (the observation).
  2. Feed the state $s_t$ into the DQN and take the action $a_t$ that maximizes the predicted Q value.
  3. The environment receives the agent's action and samples the next state from the state-transition function, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
  4. The environment also returns the reward $r_t$ for this step (a sketch of this loop follows).
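A sketch of this agent–environment loop, assuming a Gymnasium-style environment; the environment name and the `QNetwork` / `select_action` helpers from the sketch above are illustrative assumptions:

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")  # illustrative environment choice
q_net = QNetwork(state_dim=env.observation_space.shape[0],
                 num_actions=env.action_space.n)

state, _ = env.reset()
done = False
while not done:
    # Steps 1-2: observe s_t and pick a_t = argmax_a Q(s_t, a; w)
    action = select_action(q_net, torch.as_tensor(state, dtype=torch.float32))
    # Steps 3-4: the environment samples s_{t+1} ~ p(.|s_t, a_t) and returns r_t
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = next_state
```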

Estimating the State-Value Function

Monte-Carlo Policy Evaluation

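Roughly: in the driving example used below, Monte-Carlo evaluation means driving the whole trip, recording the actual total time $y$, and only then updating $\mathbf{w}$ by gradient descent on $\tfrac{1}{2}\bigl(Q(\mathbf{w}) - y\bigr)^2$ (writing $Q(\mathbf{w})$ for the model's predicted total time); the update has to wait until the episode is finished.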

Temporal-Difference (TD) Method -- Single-Step Updates

  • Can I update the model before finishing the trip?
  • Can I get a better $\mathbf{w}$ as soon as I arrive in DC?

That's TD learning!

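As a concrete (hypothetical) illustration: suppose the model predicts that the whole trip takes $Q(\mathbf{w})$ minutes. After actually driving to DC, the observed elapsed time is $r$ minutes, and the model now predicts that the remaining part of the trip takes $Q'(\mathbf{w})$ minutes. Then the TD target

$$
y = r + Q'(\mathbf{w})
$$

is a more reliable estimate of the total time than the original prediction $Q(\mathbf{w})$, because the $r$ part is a real observation rather than a prediction, so $\mathbf{w}$ can be updated toward $y$ immediately, before the trip is finished.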

TD error

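Continuing the notation above, the TD error is the gap between the original estimate and the TD target,

$$
\delta = Q(\mathbf{w}) - y = Q(\mathbf{w}) - \bigl(r + Q'(\mathbf{w})\bigr),
$$

and the parameters are updated by gradient descent on $\tfrac{1}{2}\delta^2$, i.e. $\mathbf{w} \leftarrow \mathbf{w} - \alpha\,\delta\,\frac{\partial Q(\mathbf{w})}{\partial \mathbf{w}}$ for some learning rate $\alpha$, treating the TD target $y$ as a constant.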

Apply TD learning to DQN

To apply TD learning, we need an identity with a single term on the left-hand side and two terms on the right-hand side, one of which is an actually observed quantity; for DQN, the observed term is the reward.

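In its standard form, this identity reads

$$
Q(s_t, a_t; \mathbf{w}) \;\approx\; r_t + \gamma\, Q(s_{t+1}, a_{t+1}; \mathbf{w}),
$$

where the left-hand side is the network's estimate before the transition, $r_t$ is the observed reward, $\gamma$ is the discount factor, and the whole right-hand side plays the role of the TD target.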

A brief derivation:

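A sketch of the argument, in terms of the discounted return $U_t$:

$$
U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma\, U_{t+1}.
$$

Since $Q(s_t, a_t) = \mathbb{E}\,[\,U_t \mid s_t, a_t\,]$, taking expectations of both sides gives

$$
Q(s_t, a_t) = \mathbb{E}\bigl[\,R_t + \gamma\, Q(S_{t+1}, A_{t+1}) \mid s_t, a_t\,\bigr],
$$

and TD learning approximates this expectation with the single observed transition $(s_t, a_t, r_t, s_{t+1})$, which yields the identity above.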

Train DQN using TD learning


The value $Q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$ is in fact computed by treating the action $a$ as a free variable and maximizing, i.e. it equals $\max_a Q(s_{t+1}, a; \mathbf{w}_t)$, because the agent always takes the action with the largest Q value.
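So the TD target used to train the DQN is

$$
y_t = r_t + \gamma\,\max_a Q(s_{t+1}, a; \mathbf{w}_t),
$$

and for a single transition the loss is $L(\mathbf{w}) = \tfrac{1}{2}\bigl(Q(s_t, a_t; \mathbf{w}) - y_t\bigr)^2$, minimized by one gradient-descent step.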

One iteration of TD learning

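A minimal sketch of one such iteration in PyTorch, reusing the `QNetwork` assumed earlier; `gamma`, the optimizer choice, and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def td_update(q_net, optimizer, s_t, a_t, r_t, s_next, done, gamma=0.99):
    """One iteration of TD learning for a DQN on a single observed transition.

    s_t, s_next: float tensors of shape (state_dim,); a_t: int action index;
    r_t: observed reward; done: True if the episode ended at s_next.
    """
    # Prediction: Q(s_t, a_t; w)
    q_pred = q_net(s_t.unsqueeze(0))[0, a_t]

    # TD target: y_t = r_t + gamma * max_a Q(s_{t+1}, a; w); no bootstrap after the end
    with torch.no_grad():
        q_next = 0.0 if done else q_net(s_next.unsqueeze(0)).max().item()
    y_t = torch.tensor(r_t + gamma * q_next, dtype=torch.float32)

    # Squared TD error, then one gradient-descent step: w <- w - alpha * dL/dw
    loss = F.mse_loss(q_pred, y_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (hyperparameters illustrative):
# optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
# td_update(q_net, optimizer, s_t, a_t, r_t, s_next, done)
```

Each call performs one gradient-descent step on the squared TD error for a single observed transition $(s_t, a_t, r_t, s_{t+1})$.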