Policy-Based Methods in RL
The policy function π(a|s) is used to guide the agent's motion: it takes a state s as input and outputs a probability for every action, and the agent samples an action a from this distribution to execute.
Can we directly learn a policy function π(a|s)?
- If there are only a few states and actions, then yes, we can.
- Draw a table (matrix) and learn the entries (see the sketch after this list).
- What if there are too many (or infinite) states or actions?
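For the small, finite case, the policy really can be stored as a table whose rows are probability distributions over actions. A minimal sketch with made-up numbers for a hypothetical 3-state, 2-action problem:

```python
import numpy as np

# Hypothetical tabular policy pi(a | s) for 3 states and 2 actions.
# Entry [s, a] is the probability of action a in state s; each row sums to 1.
policy_table = np.array([
    [0.2, 0.8],   # state 0
    [0.5, 0.5],   # state 1
    [0.9, 0.1],   # state 2
])

state = 1
action = np.random.choice(2, p=policy_table[state])  # sample a ~ pi(. | state)
print(action)
```

Learning then means adjusting these entries directly, which stops being feasible once the table is too large to enumerate.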
Policy Network π(a|s; θ)
Commonly used function approximators include linear models and neural networks.
Policy network: Use a neural net to approximate π(a|s) (see the sketch after this list).
- Use policy network π(a|s; θ) to approximate π(a|s).
- θ: trainable parameters of the neural net
- \(\sum_{a \in \mathcal{A}} \pi(a \mid s ; \boldsymbol{\theta})=1\), since the network's output is a probability distribution over the actions.
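A minimal sketch of such a policy network for a discrete action space, assuming PyTorch and arbitrary layer sizes; the softmax output layer is what makes the action probabilities sum to 1:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi(a | s; theta): maps a state vector to a probability distribution over actions."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
            nn.Softmax(dim=-1),   # probabilities over actions, summing to 1
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

policy = PolicyNet(state_dim=4, num_actions=2)   # dimensions are made up
s = torch.randn(4)                               # a toy state
probs = policy(s)                                # pi(. | s; theta)
a = torch.multinomial(probs, 1).item()           # sample an action to execute
```

The trainable parameters θ are the weights and biases of the linear layers.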
State-value function
\(V_{\pi}\left(s_{t}\right)=\mathbb{E}_{A}\left[Q_{\pi}\left(s_{t}, A\right)\right]=\sum_{a} \pi\left(a \mid s_{t}\right) \cdot Q_{\pi}\left(s_{t}, a\right)\)
Approximate state-value function
- Approximate policy function \(\pi(a|s_{t})\) by policy network \(\pi(a|s_{t};\theta)\).
- Approximate value function \(V_{\pi}\left(s_{t}\right)\) by: \(V\left(s_{t} ; \boldsymbol{\theta}\right)=\sum_{a} \pi\left(a \mid s_{t} ; \boldsymbol{\theta}\right) \cdot Q_{\pi}\left(s_{t}, a\right)\)
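A tiny numeric illustration of this definition, with made-up action probabilities and made-up values of \(Q_{\pi}\):

```python
import numpy as np

probs = np.array([0.2, 0.8])      # pi(a | s_t; theta) from the policy network (hypothetical)
q_values = np.array([1.0, 3.0])   # Q_pi(s_t, a) for each action (hypothetical)

# V(s_t; theta) = sum_a pi(a | s_t; theta) * Q_pi(s_t, a)
v = np.sum(probs * q_values)
print(v)   # 0.2 * 1.0 + 0.8 * 3.0 = 2.6
```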
Policy-based learning: Learn θ that maximizes \(J(\boldsymbol{\theta})=\mathbb{E}_{S}[V(S ; \boldsymbol{\theta})]\)
Policy gradient ascent to improve θ:
- Observe state s
- Update policy by: \(\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s;\theta)}{\partial \theta}\)
- Policy gradient: \(\frac{\partial V(s;\theta)}{\partial \theta}\)
Policy gradient
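Differentiating the approximate state-value function \(V(s ; \boldsymbol{\theta})=\sum_{a} \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_{\pi}(s, a)\) with respect to θ, and treating \(Q_{\pi}(s, a)\) as if it did not depend on θ (a simplification used to motivate the result; the policy gradient theorem justifies the same form rigorously), gives two equivalent forms of the policy gradient:

\[
\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
= \sum_{a} \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, a)
= \mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, A)\right]
\]

The second form follows from the identity \(\frac{\partial \pi}{\partial \boldsymbol{\theta}}=\pi \cdot \frac{\partial \log \pi}{\partial \boldsymbol{\theta}}\).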


Calculate Policy Gradient for Discrete Actions
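When the action set is small and discrete, the first form can be evaluated exactly by summing over every action. A minimal sketch, assuming a toy linear-softmax policy and made-up Q values; autograd computes the sum for us if we differentiate \(\sum_{a} \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_{\pi}(s, a)\) while keeping \(Q_{\pi}\) fixed:

```python
import torch

torch.manual_seed(0)
state_dim, num_actions = 4, 2

# Toy linear-softmax policy pi(a | s; theta), just for illustration.
theta = torch.randn(state_dim, num_actions, requires_grad=True)

def pi(s: torch.Tensor) -> torch.Tensor:
    return torch.softmax(s @ theta, dim=-1)

s = torch.randn(state_dim)        # observed state
q = torch.tensor([1.0, 3.0])      # hypothetical Q_pi(s, a), one entry per action (held fixed)

v = torch.sum(pi(s) * q)          # V(s; theta) = sum_a pi(a|s;theta) * Q_pi(s, a)
v.backward()
policy_gradient = theta.grad      # dV(s; theta) / d theta, summed over all actions
print(policy_gradient)
```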

Calculate Policy Gradient for Continuous Actions
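For continuous actions the sum over actions becomes an integral, which generally has no closed form, so the expectation form is approximated by Monte Carlo: sample one action \(\hat{a} \sim \pi(\cdot \mid s ; \boldsymbol{\theta})\) and use \(\frac{\partial \log \pi(\hat{a} \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_{\pi}(s, \hat{a})\) as an unbiased estimate of the policy gradient. A minimal sketch, assuming a toy Gaussian policy with fixed standard deviation and a made-up value of \(Q_{\pi}\):

```python
import torch

torch.manual_seed(0)
state_dim = 4

# Toy Gaussian policy over a 1-D continuous action: the "network" outputs the mean.
theta = torch.randn(state_dim, 1, requires_grad=True)
std = 0.5                                         # fixed, for simplicity

def log_pi(s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    return torch.distributions.Normal(s @ theta, std).log_prob(a).sum()

s = torch.randn(state_dim)                        # observed state

with torch.no_grad():                             # sample a_hat ~ pi(. | s; theta)
    a_hat = torch.distributions.Normal(s @ theta, std).sample()

q_hat = 2.0                                       # hypothetical Q_pi(s, a_hat)

log_pi(s, a_hat).backward()                       # d log pi(a_hat | s; theta) / d theta
g = theta.grad * q_hat                            # unbiased estimate of the policy gradient
print(g)
```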

Update policy network using policy gradient
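Putting the pieces together, one stochastic policy-gradient ascent step might look like the sketch below (toy linear-softmax policy, made-up \(q_t\); β is the learning rate from the update rule above):

```python
import torch

torch.manual_seed(0)
state_dim, num_actions, beta = 4, 2, 1e-2

theta = torch.randn(state_dim, num_actions, requires_grad=True)   # toy policy parameters

def pi(s: torch.Tensor) -> torch.Tensor:
    return torch.softmax(s @ theta, dim=-1)

# 1. Observe the current state s_t.
s_t = torch.randn(state_dim)

# 2. Sample an action a_t ~ pi(. | s_t; theta).
a_t = torch.multinomial(pi(s_t).detach(), 1).item()

# 3. Obtain some estimate q_t of Q_pi(s_t, a_t) (hypothetical here; see the two options below).
q_t = 2.0

# 4. Compute d log pi(a_t | s_t; theta) / d theta.
torch.log(pi(s_t)[a_t]).backward()

# 5. Gradient ascent: theta <- theta + beta * q_t * gradient.
with torch.no_grad():
    theta += beta * q_t * theta.grad
```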

Two Options:
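Two common choices for obtaining \(q_t \approx Q_{\pi}(s_t, a_t)\) in the update above are (whether these are the intended options here is an assumption): use the observed discounted return of the episode (REINFORCE-style), or train a second "critic" network to approximate \(Q_{\pi}\) (actor-critic-style). A tiny sketch of the first choice with made-up rewards:

```python
# Option 1 (REINFORCE-style): approximate Q_pi(s_t, a_t) by the observed
# discounted return u_t from time t onward (rewards are hypothetical).
gamma = 0.99
rewards = [1.0, 0.0, 2.0]                       # r_t, r_{t+1}, r_{t+2}
u_t = sum(gamma ** k * r for k, r in enumerate(rewards))
q_t = u_t

# Option 2 (actor-critic-style): use a learned critic q(s, a; w) and set
# q_t = q(s_t, a_t; w).  (Sketch omitted.)
```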

Summary
