[RL] (Spinning Up) Proof for Using Q-Function in Policy Gradient Formula


생각많은 소심남 2019. 5. 23. 23:07

(This is a personal summary of the OpenAI Spinning Up documentation. Original:)

Extra Material — Spinning Up documentation

spinningup.openai.com

In this post, I prove the following identity for the finite-horizon undiscounted return setting.

$$ \nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}} \Big[ \sum_{t=0}^{T} \big( \nabla_{\theta} \log \pi_{\theta} (a_{t} | s_{t} ) \big) Q^{\pi_{\theta}}(s_{t}, a_{t}) \Big] $$

An analogous proof applies in the infinite-horizon discounted return setting.

The proof relies on the law of iterated expectations. First, let us rewrite the reward-to-go form of the policy gradient covered earlier, using the shorthand $\hat{R_{t}} = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) $.

$$ \begin{align} \nabla_{\theta} J(\pi_{\theta}) &= E_{\tau \sim \pi_{\theta}} \bigg[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_{t}|s_{t}) \hat{R_{t}} \bigg] \\ &= \sum_{t=0}^{T} E_{\tau \sim \pi_{\theta}} \Big[ \nabla_{\theta} \log \pi_{\theta} (a_{t}|s_{t}) \hat{R_{t}} \Big] \end{align} $$
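As a small illustration of the shorthand $\hat{R_{t}}$, the sketch below computes the undiscounted reward-to-go for every timestep of a single trajectory. The function name `reward_to_go` and the reward values are made up for this example and do not come from the Spinning Up code.

```python
import numpy as np

def reward_to_go(rewards):
    """Return R_hat[t] = sum of rewards[t'] for all t' >= t (undiscounted)."""
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, take the cumulative sum, then reverse back.
    return np.cumsum(rewards[::-1])[::-1]

rewards = [1.0, 0.0, 2.0, 3.0]   # hypothetical rewards for t = 0..3
print(reward_to_go(rewards))      # [6. 5. 5. 3.]
```

Each entry only sums rewards collected from that timestep onward, which is exactly the quantity that multiplies the grad-log-prob term at time $t$.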

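The notation gets slightly heavier here: let $ \tau_{:t} = (s_{0}, a_{0}, ..., s_{t}, a_{t}) $ denote the trajectory collected up to time $t$, and let $ \tau_{t:} $ denote the remainder of the trajectory after that. By the law of iterated expectations, the expression above can then be split as follows.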

$$ \nabla_{\theta} J(\pi_{\theta}) = \sum_{t=0}^{T} E_{\tau_{:t} \sim \pi_{\theta}} \Big[ E_{\tau_{t:} \sim \pi_{\theta}} \Big[ \nabla_{\theta} \log \pi_{\theta} (a_{t} | s_{t}) \hat{R_{t}} | \tau_{:t} \Big] \Big] $$

Inside the inner expectation, the grad-log-prob term is a constant: it depends only on $s_{t}$ and $a_{t}$, and the inner expectation is conditioned on a fixed $\tau_{:t}$, which contains $s_{t}$ and $a_{t}$. Pulling it out of the inner expectation gives:

$$ \nabla_{\theta} J(\pi_{\theta}) = \sum_{t=0}^{T} E_{\tau_{:t} \sim \pi_{\theta}} \Big[ \nabla_{\theta} \log \pi_{\theta} (a_{t} | s_{t}) E_{\tau_{t:} \sim \pi_{\theta}} \Big[ \hat{R_{t}} | \tau_{:t} \Big] \Big] $$
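The two steps above (splitting the expectation, then pulling out the factor that is fixed by the conditioning) can be checked numerically on a toy discrete distribution. In this sketch, `x` stands in for the past $\tau_{:t}$, `y` for the future $\tau_{t:}$, `f` for the grad-log-prob term, and `g` for $\hat{R_{t}}$. All sizes and values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution p(x, y): 3 "past" outcomes, 4 "future" outcomes.
p = rng.random((3, 4))
p /= p.sum()

f = rng.normal(size=3)         # depends only on the past x (like grad-log-prob)
g = rng.normal(size=(3, 4))    # may depend on both x and y (like R_hat_t)

# Left-hand side: E[f(X) g(X, Y)] under the joint distribution.
lhs = np.sum(p * f[:, None] * g)

# Right-hand side: E_X[ f(X) * E[g(X, Y) | X] ].
p_x = p.sum(axis=1)                        # marginal of X
p_y_given_x = p / p_x[:, None]             # conditional of Y given X
inner = np.sum(p_y_given_x * g, axis=1)    # E[g | X = x] for each x
rhs = np.sum(p_x * f * inner)

print(np.isclose(lhs, rhs))
```

The two sides agree up to floating-point error because `f` is constant once `x` is fixed, which is the same reason the grad-log-prob term can leave the inner expectation.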

In a Markov Decision Process, the future depends only on the most recent state and action. Consequently, the inner expectation, which is taken over the future conditioned on the entire past up to time $t$, equals the same expectation conditioned only on the most recent timestep $ (s_{t}, a_{t}) $.

$$ E_{\tau_{t:} \sim \pi_{\theta}} \Big[ \hat{R_{t}} | \tau_{:t} \Big] = E_{\tau_{t:} \sim \pi_{\theta}} \Big[ \hat{R_{t}} | s_{t}, a_{t} \Big] $$

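Recall the definition of the state-action value function $Q^{\pi_{\theta}}(s_{t}, a_{t})$ for policy $ \pi_{\theta} $: it is the expected return when starting in state $s_{t}$, taking action $a_{t}$, and then acting on-policy according to $\pi_{\theta}$ for the rest of the trajectory. That is, $ E_{\tau_{t:} \sim \pi_{\theta}} \big[ \hat{R_{t}} | s_{t}, a_{t} \big] = Q^{\pi_{\theta}}(s_{t}, a_{t}) $, and substituting this into the expression above yields the result immediately.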
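To make the identity concrete, here is a minimal numerical sketch on a tiny random MDP. It enumerates every trajectory exactly and compares the reward-to-go form of the gradient with the Q-function form, plus a finite-difference gradient of $J$ as a sanity check. Everything in it is an assumption made for illustration: the MDP sizes, the random transitions and rewards, the tabular softmax policy, and all function names.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 2, 2, 3                     # states, actions, steps t = 0..H-1 (hypothetical)
P = rng.random((S, A, S))             # P[s, a, s'] transition probabilities
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(S, A, S))        # R[s, a, s'] rewards
rho0 = np.array([0.6, 0.4])           # initial state distribution
theta = rng.normal(size=(S, A))       # tabular softmax policy parameters

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def grad_log_pi(pi, s, a):
    """Gradient of log pi(a|s) w.r.t. theta for a tabular softmax policy."""
    g = np.zeros_like(pi)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

def q_values(pi):
    """Finite-horizon Q[t, s, a], computed by backward recursion."""
    Q = np.zeros((H, S, A))
    v_next = np.zeros(S)              # no reward is collected after the last step
    for t in reversed(range(H)):
        Q[t] = np.sum(P * (R + v_next[None, None, :]), axis=2)
        v_next = np.sum(pi * Q[t], axis=1)
    return Q

def J(theta):
    """Expected undiscounted return of the softmax policy."""
    pi = policy(theta)
    return rho0 @ np.sum(pi * q_values(pi)[0], axis=1)

def exact_gradients(theta):
    """Sum over all trajectories: reward-to-go form and Q-function form."""
    pi = policy(theta)
    Q = q_values(pi)
    grad_rtg, grad_q = np.zeros((S, A)), np.zeros((S, A))
    for s0 in range(S):
        # steps = (a_0, s_1, a_1, s_2, ..., a_{H-1}, s_H)
        for steps in itertools.product(range(A), range(S), repeat=H):
            states = [s0] + list(steps[1::2])
            actions = list(steps[0::2])
            prob, rewards = rho0[s0], []
            for t in range(H):
                s, a, s_next = states[t], actions[t], states[t + 1]
                prob *= pi[s, a] * P[s, a, s_next]
                rewards.append(R[s, a, s_next])
            rtg = np.cumsum(rewards[::-1])[::-1]
            for t in range(H):
                g = grad_log_pi(pi, states[t], actions[t])
                grad_rtg += prob * g * rtg[t]
                grad_q += prob * g * Q[t, states[t], actions[t]]
    return grad_rtg, grad_q

grad_rtg, grad_q = exact_gradients(theta)

# Central finite differences on J as an independent reference.
eps, fd = 1e-5, np.zeros((S, A))
for s in range(S):
    for a in range(A):
        d = np.zeros((S, A))
        d[s, a] = eps
        fd[s, a] = (J(theta + d) - J(theta - d)) / (2 * eps)

print(np.allclose(grad_rtg, grad_q), np.allclose(grad_q, fd, atol=1e-6))
```

Note that in the finite-horizon setting the Q-function depends on the timestep, so the sketch stores it as `Q[t, s, a]`. In practice the expectation is estimated from sampled trajectories rather than enumerated, so this check only scales to very small problems.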

License: Attribution, NonCommercial, NoDerivatives
