The policy function \(\pi(a \mid s)\) is used to guide the agent's actions: it takes a state \(s\) as input and outputs a probability for every action; the agent then samples an action \(a\) from this distribution and executes it.

Can we directly learn a policy function \(\pi(a \mid s)\)?

  • If there are only a few states and actions, then yes, we can.
  • Draw a table (matrix) and learn the entries.
  • What if there are too many (or infinite) states or actions?
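
A minimal sketch of the tabular case just described, assuming NumPy and made-up state/action counts: each row of the table is \(\pi(\cdot \mid s)\), and acting means sampling a column index from that row.

```python
import numpy as np

num_states, num_actions = 5, 3      # made-up sizes for illustration
rng = np.random.default_rng(0)

# The policy is literally a |S| x |A| table; learning would adjust its entries.
# Here every row starts as the uniform distribution over actions.
policy_table = np.full((num_states, num_actions), 1.0 / num_actions)

def act(s):
    """Sample an action a ~ pi(.|s), i.e. from row s of the table."""
    return rng.choice(num_actions, p=policy_table[s])

print(act(2))   # 0, 1, or 2
```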

Policy Network \(\pi(a \mid s ; \boldsymbol{\theta})\)

The most commonly used function approximators are linear regression and neural networks.

Policy network: Use a neural net to approximate \(\pi(a \mid s)\).

  • Use the policy network \(\pi(a \mid s ; \boldsymbol{\theta})\) to approximate the policy function \(\pi(a \mid s)\).
  • \(\boldsymbol{\theta}\): trainable parameters of the neural net

\(\sum_{a \in \mathcal{A}} \pi\left(\left.a\right|{s} ; \boldsymbol{\theta}\right)=1\)
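
A minimal sketch of such a policy network, assuming PyTorch and made-up layer sizes (the lecture only requires that the outputs form a probability distribution over actions):

```python
import torch
import torch.nn as nn

state_dim, num_actions = 4, 3          # made-up sizes for illustration

policy_net = nn.Sequential(            # pi(a|s; theta)
    nn.Linear(state_dim, 64),          # theta: the trainable weights of these layers
    nn.ReLU(),
    nn.Linear(64, num_actions),
    nn.Softmax(dim=-1),                # softmax guarantees sum_a pi(a|s; theta) = 1
)

s = torch.randn(state_dim)             # a dummy observed state
probs = policy_net(s)
print(probs, probs.sum())              # probabilities over the 3 actions; the sum is 1
```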

State-value function

\(V_{\pi}\left(s_{t}\right)=\mathbb{E}_{A}\left[Q_{\pi}\left(s_{t}, A\right)\right]=\sum_{a} \pi\left(a \mid s_{t}\right) \cdot Q_{\pi}\left(s_{t}, a\right)\)

Approximate state-value function

  • Approximate policy function \(\pi(a|s_{t})\) by policy network \(\pi(a|s_{t};\theta)\).
  • Approximate value function \(V_{\pi}\left(s_{t}\right)\) by: \(V\left(s_{t} ; \boldsymbol{\theta}\right)=\sum_{a} \pi\left(a \mid s_{t} ; \boldsymbol{\theta}\right) \cdot Q_{\pi}\left(s_{t}, a\right)\)
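
For example, with three actions and made-up numbers \(\pi(\cdot \mid s_{t} ; \boldsymbol{\theta})=(0.2,\, 0.5,\, 0.3)\) and \(Q_{\pi}(s_{t}, \cdot)=(1,\, 2,\, 0.5)\), the approximation gives \(V(s_{t} ; \boldsymbol{\theta})=0.2 \cdot 1+0.5 \cdot 2+0.3 \cdot 0.5=1.35\).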

Policy-based learning: Learn \(\boldsymbol{\theta}\) that maximizes \(J(\boldsymbol{\theta})=\mathbb{E}_{S}[V(S ; \boldsymbol{\theta})]\)

Policy gradient ascent to improve \(\boldsymbol{\theta}\):

  • Observe state s
  • Update policy by: \(\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s;\theta)}{\partial \theta}\)
  • Policy gradient: \(\frac{\partial V(s;\theta)}{\partial \theta}\)
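
A minimal sketch of one ascent step, assuming PyTorch and placeholder \(Q_{\pi}\) values; note the plus sign in the update, since the goal is to increase \(V(s ; \boldsymbol{\theta})\):

```python
import torch
import torch.nn as nn

state_dim, num_actions, beta = 4, 3, 1e-3   # made-up sizes and learning rate

policy_net = nn.Sequential(                 # pi(a|s; theta)
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, num_actions), nn.Softmax(dim=-1),
)

s = torch.randn(state_dim)                  # observe state s (dummy here)
q_values = torch.tensor([1.0, 2.0, 0.5])    # placeholder for Q_pi(s, a)

v = torch.sum(policy_net(s) * q_values)     # V(s; theta) = sum_a pi(a|s;theta) * Q_pi(s,a)
v.backward()                                # policy gradient dV(s;theta)/dtheta, stored in .grad

with torch.no_grad():                       # gradient *ascent*: theta <- theta + beta * gradient
    for p in policy_net.parameters():
        p += beta * p.grad
```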

Policy gradient

  • Form 1: \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{a} \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, a)\)
  • Form 2: \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, A)\right]\)
  • (For simplicity, \(Q_{\pi}\) is treated as independent of \(\boldsymbol{\theta}\) when differentiating; this is not fully rigorous, but the resulting formulas are correct.)
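
Form 2 follows from Form 1 by the chain-rule identity \(\frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\), which turns the sum over actions into an expectation under the policy:

\(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{a} \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, a)=\sum_{a} \pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, a)=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, A)\right]\)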

Calculate Policy Gradient for Discrete Actions

  • If the action space \(\mathcal{A}\) is discrete and small, use Form 1 directly.
  • For every action \(a \in \mathcal{A}\), compute \(f(a, \boldsymbol{\theta})=\frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, a)\).
  • The policy gradient is the sum: \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{a \in \mathcal{A}} f(a, \boldsymbol{\theta})\).
  • This becomes too expensive when \(\mathcal{A}\) is large, and infeasible when it is continuous.
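
A minimal sketch of this action-by-action computation, assuming PyTorch and placeholder \(Q_{\pi}\) values:

```python
import torch
import torch.nn as nn

state_dim, num_actions = 4, 3               # made-up sizes for illustration

policy_net = nn.Sequential(                 # pi(a|s; theta)
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, num_actions), nn.Softmax(dim=-1),
)

s = torch.randn(state_dim)                  # observed state (dummy here)
q_values = [1.0, 2.0, 0.5]                  # placeholder for Q_pi(s, a)
params = list(policy_net.parameters())

# Accumulate f(a, theta) = dpi(a|s;theta)/dtheta * Q_pi(s, a) over all actions.
policy_grad = [torch.zeros_like(p) for p in params]
for a in range(num_actions):
    prob_a = policy_net(s)[a]                        # pi(a|s; theta)
    dprob_a = torch.autograd.grad(prob_a, params)    # dpi(a|s;theta)/dtheta
    for acc, g in zip(policy_grad, dprob_a):
        acc += q_values[a] * g                       # add this action's term f(a, theta)
# policy_grad now holds dV(s;theta)/dtheta, one tensor per parameter tensor.
```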

Calculate Policy Gradient for Continuous Actions

  • If the action space \(\mathcal{A}\) is continuous, e.g., \(\mathcal{A}=[0,1]\), the sum over actions becomes an integral that cannot be computed directly.
  • Use Form 2 with Monte Carlo approximation: sample \(\hat{a} \sim \pi(\cdot \mid s ; \boldsymbol{\theta})\) and compute \(\mathbf{g}(\hat{a}, \boldsymbol{\theta})=\frac{\partial \log \pi(\hat{a} \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, \hat{a})\).
  • \(\mathbf{g}(\hat{a}, \boldsymbol{\theta})\) is an unbiased estimate of the policy gradient \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\).
  • The same Monte Carlo estimate also works for discrete actions.
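
A minimal sketch of the Monte Carlo estimate, assuming PyTorch and a Gaussian policy over a one-dimensional action (neither choice is fixed by the lecture):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 1                # made-up sizes for illustration

mean_net = nn.Sequential(                   # outputs the mean of the Gaussian policy
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim),
)
log_std = torch.zeros(action_dim, requires_grad=True)   # learned log standard deviation

s = torch.randn(state_dim)                  # observed state (dummy here)
dist = torch.distributions.Normal(mean_net(s), log_std.exp())   # pi(.|s; theta)

a_hat = dist.sample()                       # a_hat ~ pi(.|s; theta); sampling is not differentiated
q_value = 1.7                               # placeholder for Q_pi(s, a_hat)

log_prob = dist.log_prob(a_hat).sum()       # log pi(a_hat|s; theta)
g = q_value * log_prob
g.backward()                                # .grad now holds q * dlog pi(a_hat|s;theta)/dtheta
```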

Update policy network using policy gradient

In each iteration:

  1. Observe the state \(s_{t}\).
  2. Sample an action: \(a_{t} \sim \pi\left(\cdot \mid s_{t} ; \boldsymbol{\theta}_{t}\right)\).
  3. Compute \(q_{t} \approx Q_{\pi}\left(s_{t}, a_{t}\right)\) (see the two options below).
  4. Differentiate the policy network: \(\mathbf{d}_{\theta, t}=\left.\frac{\partial \log \pi\left(a_{t} \mid s_{t} ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}_{t}}\).
  5. Approximate the policy gradient: \(\mathbf{g}\left(a_{t}, \boldsymbol{\theta}_{t}\right)=q_{t} \cdot \mathbf{d}_{\theta, t}\).
  6. Update the policy network: \(\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\beta \cdot \mathbf{g}\left(a_{t}, \boldsymbol{\theta}_{t}\right)\).
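
A minimal sketch of one iteration of this loop for discrete actions, assuming PyTorch; the optimizer performs the ascent step by minimizing \(-q_{t} \cdot \log \pi\left(a_{t} \mid s_{t} ; \boldsymbol{\theta}\right)\), which is the same update as step 6:

```python
import torch
import torch.nn as nn

state_dim, num_actions, beta = 4, 3, 1e-3   # made-up sizes and learning rate

policy_net = nn.Sequential(                 # pi(a|s; theta)
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, num_actions), nn.Softmax(dim=-1),
)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=beta)

def update(s_t):
    probs = policy_net(s_t)                 # pi(.|s_t; theta_t)
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()                     # 2. sample a_t ~ pi(.|s_t; theta_t)
    q_t = 1.0                               # 3. placeholder for q_t ~ Q_pi(s_t, a_t); see the options below
    loss = -q_t * dist.log_prob(a_t)        # 4.-5. minimizing -q*log pi ascends along q * dlog pi/dtheta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # 6. theta <- theta + beta * g(a_t, theta_t)

s_t = torch.randn(state_dim)                # 1. observe the state (dummy here)
update(s_t)
```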

Two Options:

  • Option 1 (REINFORCE): play the game to the end, record the whole trajectory, and compute the discounted return \(u_{t}=\sum_{k \geq t} \gamma^{k-t} \cdot r_{k}\). Since \(Q_{\pi}\left(s_{t}, a_{t}\right)=\mathbb{E}\left[U_{t}\right]\), use \(q_{t}=u_{t}\).
  • Option 2: approximate \(Q_{\pi}\left(s_{t}, a_{t}\right)\) with another neural network; this leads to actor-critic methods.
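
A minimal sketch of the return computation used by Option 1, in plain Python with a made-up reward list and \(\gamma\):

```python
def discounted_returns(rewards, gamma=0.99):
    """u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Each u_t then stands in for Q_pi(s_t, a_t) in the policy gradient update.
print(discounted_returns([1.0, 0.0, 2.0]))   # [2.9602, 1.98, 2.0]
```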

Summary

  • The policy network \(\pi(a \mid s ; \boldsymbol{\theta})\) is a neural network that approximates the policy function \(\pi(a \mid s)\).
  • Policy-based learning seeks \(\boldsymbol{\theta}\) that maximizes \(J(\boldsymbol{\theta})=\mathbb{E}_{S}[V(S ; \boldsymbol{\theta})]\).
  • Training is by policy gradient ascent: observe a state, estimate the policy gradient \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\) (exactly for small discrete action spaces, or by Monte Carlo via \(q_{t} \cdot \frac{\partial \log \pi\left(a_{t} \mid s_{t} ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\)), and update \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\).