## Introduction

This lecture talks about methods that optimise policy directly. Instead of working with value function as we consider so far, we seek experience and use the experience to update our policy in the direction that makes it better.

In the last lecture, we approximated the value or action-value function using parameters \(\theta\), \[ V_\theta(s)\approx V^\pi(s)\\ Q_\theta(s, a)\approx Q^\pi(s, a) \] A policy was generated directly from the value function using \(\epsilon\)-greedy.

In this lecture we will directly parametrise the policy \[ \pi_\theta(s, a)=\mathbb{P}[a|s, \theta] \] We will focus again on \(\color{red}{\mbox{model-free}}\) reinforcement learning.

**Table of Contents**

- Finite Difference Policy Gradient
- Monte-Carlo Policy Gradient
- Actor-Critic Policy Gradient
- Summary of Policy Gradient Algorithms

**Value-Based and Policy-Based RL**

- Value Based
- Learnt Value Function
- Implicit policy (e.g. \(\epsilon\)-greedy)

- Policy Based
- No Value Function
- Learnt Policy

- Actor-Critic
- Learnt Value Function
- Learnt Policy

**Advantages of Policy-Based RL**

Advantages:

- Better convergence properties
- Effective in high-dimensional or contimuous action spaces (
*without computing max*) - Can learn stochastic policies

Disadvantages:

- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance

Deterministic policy or taking max is not also the best. Take the rock-paper-scissors game for example.

Consider policies *iterated* rock-paper-scissors

- A deterministic policy is easily exploited
- A uniform random policy is optimal (according to Nash equilibrium)

**Aliased Gridworld Example**

The agent cannot differentiate the grey states.

Consider features of the following form (for all N, E, S, W) \[
\phi(s, a)=1(\mbox{wall to N, a = move E})
\] Compare value-based RL, using an approximate value function \[
Q_\theta(s, a)=f(\phi(s, a), \theta)
\] To policy-based RL, using a parametrised policy \[
\pi_\theta(s, a)=g(\phi(s, a), \theta)
\] Since the agent cannot differentiate the grey states given the feature, if you take a **deterministic** policy, you must pick the same action at two grey states.

Under aliasing, an optimal \(\color{red}{\mbox{deterministic}}\) policy will either

- move W in both grey states (as shown by red arrows)
- move E in both grey states

Either way, it can get stuck and never reach the money.

Value-based RL learns a near-deterministic policy, so it will traverse the corridor for a long time.

An optimal \(\color{red}{\mbox{stochastic}}\) policy will randomly move E or W in grey states: \[ \pi_\theta(\mbox{wall to N and S, move E}) = 0.5\\ \pi_\theta(\mbox{wall to N and S, move W}) = 0.5\\ \] It will reach the goal state in a few steps with high probability. Policy-based RL can learn the optimal stochastic policy.

These examples show that a stochastic policy can be better than the deterministic policy, especially in the case that the MDP is **partialy observed** or cannot fully represent the state.

**Policy Objective Functions**

Goal: given policy \(\pi_\theta(s, a)\) with parameters \(\theta\), find best \(\theta\). But how do we measure the quality of a policy \(\pi_\theta\)?

In episodic environments we can use the

**start value**\[ J_1(\theta)=V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1] \]In continuing environments we can use the

**average value**\[ J_{av}v(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s) \]Or the

**average reward per time-step** \[ J_{av}R(\theta)=\sum_s d^{\pi_\theta}(s)\sum_a\pi_\theta(s, a)\mathcal{R}^a_s \]

where \(d^{\pi_\theta}(s)\) is

**stationary distribution**of Markov chain for \(\pi_\theta\).

**Policy Optimisation**

Policy based reinforcement learning is an **optimisation** problem. Find \(\theta\) that maximises \(J(\theta)\).

Some approaches do not use gradient

- Hill climbing
- Simplex / amoeba / Nelder Mead
- Genetic algorithms

However, greater efficiency often possible using gradient

- Gradient descent
- Conjugate gradient
- Quasi-newton

We focus on gradient descent, many extensions possible. And on methods that exploit sequential structure.

## Finite Difference Policy Gradient

**Policy Gradient**

Let \(J(\theta)\) be any policy objective function. Policy gradient algorithms search for a local maximum in \(J(\theta)\) by ascending the gradient of the policy, w.r.t. parameters \(\theta\) \[ \triangle\theta = \alpha\nabla_\theta J(\theta) \] Where \(\bigtriangledown_\theta J(\theta)\) is the \(\color{red}{\mbox{policy gradient}}\), \[ \nabla_\theta J(\theta)=\begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots\\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix} \] and \(\alpha\) is a step-size parameter.

**Computing Gradients By Finite Differences (Numerical)**

To evaluate policy gradient of \(\pi_\theta(s, a)\).

- For each dimension \(k\in[1, n]\):
Estimate \(k\)th partial derivative of objective function w.r.t. \(\theta\)

By perturbing \(\theta\) by small amount \(\epsilon\) in \(k\)th dimension \[ \frac{\partial J(\theta)}{\partial \theta_k}\approx \frac{J(\theta+\epsilon u_k)-J(\theta)}{\epsilon} \] where \(u_k\) is unit vector with 1 in \(k\)th component, 0 elsewhere

- Uses \(n\) evaluations to compute policy gradient in \(n\) dimensions

This is a simple, noisy, inefficient, but sometimes effective method. It works for **arbitrary** policies, even if policy is **not** differentiable.

The algorithm is efficient when the dimension of \(\theta\) is low.

## Monte-Carlo Policy Gradient

**Score Function**

We now compute the policy gradient *analytically*.

Assume policy \(\pi_\theta\) is differentiable whenever it is non-zero and we know the gradient \(\nabla_\theta\pi_\theta(s, a)\).

\(\color{red}{\mbox{Likelihood ratios}}\) exploit the following identity \[ \begin{align} \nabla_\theta\pi_\theta(s, a) & =\pi_\theta(s, a) \frac{\nabla_\theta\pi_\theta(s, a) }{\pi_\theta(s, a) } \\ & = \pi_\theta(s, a) \nabla_\theta\log \pi_\theta(s, a) \\ \end{align} \] The \(\color{red}{\mbox{score function}}\) is $(s, a) $. Let's take two examples to see what the score function looks like.

*Softmax Policy*

We will use a softmax policy as a running example. Weight actions using linear combination of features \(\phi(s, a)^T\theta\). Probability of action is proportional to exponentiated weight: \[ \pi_\theta(s, a)\varpropto e^{\phi(s, a)^T\theta} \] The score function is \[ \nabla_\theta\log\pi_\theta(s, a)=\phi(s, a)-\mathbb{E}_{\pi_\theta}[\phi(s, \cdot)] \] (Intuition: log gradient = the feature for the action that we actually took minus the average feature for all actions.)

*Gaussian Policy*

In continuous action spaces, a Gaussian policy is natural.

- Mean is a linear combination of state features \(\mu(s) = \phi(s)^T\theta\).
- Variance may be fixed \(\sigma^2\), or can also parametrised

Policy is Gaussian, \(a\sim \mathcal{N}(\mu(s), \sigma^2)\). The score function is \[ \nabla_\theta\log\pi_\theta(s, a)=\frac{(a-\mu(s))\phi(s)}{\sigma^2} \] So far we just have a sense of what does the score function look like. Now we step into policy gradient theorem.

**One-Step MDPs**

Consider a simple class of one-step MDPs:

- Starting in state \(s\sim d(s)\)
- Terminating after one time-step with reward \(r=\mathcal{R}_{s,a}\)

Use likelihood ratios to compute the policy gradient \[ \begin{align} J(\theta) &=\mathbb{E}_{\pi_\theta}[r]\\ &=\sum_{s\in\mathcal{S}}d(s)\sum_{a\in\mathcal{A}}\pi_\theta(s, a)\mathcal{R}_{s,a} \end{align} \]

\[ \begin{align} \nabla_\theta J(\theta) &=\sum_{s\in\mathcal{S}}d(s)\sum_{a\in\mathcal{A}}\pi_\theta(s, a)\nabla_\theta\log\pi_\theta(s, a)\mathcal{R}_{s,a}\\ &=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s, a)r] \end{align} \]

The policy gradient theorem generalises the likelihood ratio approach to multi-step MDPs.

- Replaces instantaneous reward \(r\) with long-term value \(Q^\pi(s, a)\)

Policy gradient theorem applies to start state objective, average reward, and average value objective.

Theorem

For any differentiable policy \(\pi_\theta(s,a)\), for any of the policy objective functions mentioned earlier, the policy gradient is \[ \nabla_\theta J(\theta)=\color{red}{\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s, a)Q^{\pi_\theta}(s, a)]} \]

**Demonstration**

Settings: The initial state \(s_0\) is sampled from distribution \(\rho_0\). A trajectory \(\tau = (s_0, a_0, s_1, a_1, ..., s_{t+1})\) is sampled from policy \(\pi_\theta\).

The target function would be \[ J(\theta) = E_{\tau\sim\pi}[R(\tau)] \] The probability of trajectory \(\tau\) is sampled from \(\pi\) is \[ P(\tau|\theta) = \rho_0(s_0)+\prod_{t=0}^TP(s_{t+1}|s_t, a_t)\pi_\theta(a_t|s_t) \] Using the log prob trick: \[ \triangledown_\theta P(\tau|\theta) = P(\tau|\theta)\triangledown_\theta\log P(\tau|\theta) \] Expand the trajectory: \[ \begin{align} \require{cancel}\triangledown_\theta \log P(\tau|\theta) &= \cancel{\triangledown_\theta \log\rho_0(s_0)}+\sum_{t=0}^T\cancel{\triangledown_\theta \log P(s_{t+1}|s_t,a_t)}+ \triangledown_\theta\log\pi_\theta(a_t|s_t)\\ &= \sum_{t=0}^T\triangledown_\theta\log\pi_\theta(a_t|s_t) \end{align} \] The gradient of target function \[ \begin{align} \triangledown_\theta J(\theta) &= \triangledown_\theta E_{\tau\sim\pi}[R(\tau)]\\ &= \int_\tau \triangledown_\theta P(\tau|\theta)R(\tau)\\ &= \int_\tau P(\tau|\theta)\triangledown_\theta \log P(\tau|\theta)R(\tau)\\ &= E_{\tau\sim\pi}[\triangledown_\theta \log P(\tau|\theta)R(\tau)]\\ &= E_{\tau\sim\pi}[\sum_{t=0}^T\triangledown_\theta\log\pi_\theta(a_t|s_t)R(\tau)]\\ &= E_{\tau\sim\pi}[\sum_{t=0}^T \color{red}{\Phi_t}\triangledown_\theta\log\pi_\theta(a_t|s_t)] \end{align} \]

**Monte-Carlo Policy Gradient (REINFORCE)**

Update parameters by stochastic gradient ascent using policy gradient theorem. And using return \(v_t\) as an **unbiased sample** of \(Q^{\pi_\theta}(s_t,a_t)\): \[
\triangle\theta_t=\alpha\nabla_\theta\log\pi_\theta(s_t, a_t)v_t
\]

(Note: MCPG is slow.)

## Actor-Critic Policy Gradient

**Reducing Variance Using a Critic**

Monte-Carlo policy gradient still has high variance, we use a \(\color{red}{critic}\) to estimate the action-value function: \[ Q_w(s, a)\approx Q^{\pi_\theta}(s, a) \] Actor-critic algorithms maintain two sets of parameters:

- Critic: Updates action-value function parameters \(w\)
- Actor: Updates policy parameters \(\theta\), in direction suggested by critic

Actor-critic algorithms follow an *approximate* policy gradient: \[
\nabla_\theta J(\theta)\approx \mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s, a)Q_w(s, a)]\\
\triangle\theta= \alpha\nabla_\theta\log\pi_\theta(s, a)Q_w(s, a)
\] The critic is solving a familiar problem: policy evaluation. This problem was explored in previous lectures:

- Monte-Carlo policy evaluation
- Temporal-Difference learning
- TD(\(\lambda\))
- Least Squares policy evaluation

Simple actor-critic algorithm based on action-value critic using linear value function approximation. \(Q_w(s, a)=\phi(s,a)^Tw\)

- Critic: Updates \(w\) by linear TD(0)
- Actor: Updates \(\theta\) by policy gradient

**Bias in Actor-Critic Algorithms**

Approximating the policy gradient introduces bias. A biased policy gradient may not find the right solution. Luckily, if we choose value function approximation carefully, then we can avoid introducing any bias. That is we can still follow the exact policy gradient.

Compatible Function Approximation TheoremIf the following two conditions are satisdied:

Value function approximator is

compatibleto the policy \[ \nabla_w Q_w(s, a)=\nabla_\theta \log\pi_\theta(s, a) \]Value function parameters \(w\) minimise the mean-squared error \[ \epsilon=\mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s, a)-Q_w(s, a))^2] \]

Then the policy gradient is exact, \[ \nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)Q_w(s,a)] \]

**Trick: Reducing Variance Using a Baseline**

We substract a baseline function \(B(s)\) from the policy gradient. This can **reduce variance, without changing expectation**: \[
\begin{align}
\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)B(s)]&=\sum_{s\in\mathcal{S}}d^{\pi_\theta}(s)\sum_a\nabla_\theta\pi_\theta(s,a)B(s)\\
&= \sum_{s\in\mathcal{S}}d^{\pi_\theta}B(s)\nabla_\theta\sum_{a\in\mathcal{A}}\pi_\theta(s,a)\\
&= \sum_{s\in\mathcal{S}}d^{\pi_\theta}B(s)\nabla_\theta 1 \\
&=0
\end{align}
\] A good baseline is the state value function \(B(s)=V^{\pi_\theta}(s)\). So we can rewrite the policy gradient using the \(\color{red}{\mbox{advantage function}}\ A^{\pi_\theta}(s,a)\). \[
A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)\color{red}{A^{\pi_\theta}(s,a)}]
\] where \(V^{\pi_\theta}(s)\) is the state value function of \(s\).

**Intuition**: The advantage function \(A^{\pi_\theta}(s,a)\) tells us how much better than usual is it to take action \(a\).

**Estimating the Advantage Function**

How do we know the state value function \(V\)?

One way to do that is to estimate both \(V^{\pi_\theta}(s)\) and \(Q^{\pi_\theta}(s,a)\). Using two function approximators and two parameter vectors, \[ V_v(s)\approx V^{\pi_\theta}(s)\\ Q_w(s,a)\approx Q^{\pi_\theta}(s,a)\\ A(s,a)=Q_w(s,a)-V_v(s) \] And updating both value functions by e.g. TD learning.

Another way is to use the TD error to compute the policy gradient. For the true value function \(V^{\pi_\theta}(s)\), the TD error \(\delta^{\pi_\theta}\) \[ \delta^{\pi_\theta}=r+\gamma V^{\pi_\theta}(s')-V^{\pi_\theta}(s) \] is an unbiased estimate of the advantage function: \[ \begin{align} \mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}|s, a] &= \mathbb{E}_{\pi_\theta}[r+\gamma V^{\pi_\theta}(s')|s, a]-V^{\pi_\theta}(s)\\ &= Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\\ &= \color{red}{A^{\pi_\theta}(s,a)} \end{align} \] So we can use the TD error to compute the policy gradient \[ \nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)\color{red}{\delta^{\pi_\theta}}] \] In practice we can use an approximate TD error: \[ \delta_v=r+\gamma V_v(s')-V_v(s) \] This approach only requires one set of critic parameters \(v\).

**Critics and Actors at Different Time-Scales**

Critic can estimate value function \(V_\theta(s)\) from many targets at different time-scales

For MC, the target is return \(v_t\) \[ \triangle \theta=\alpha(\color{red}{v_t}-V_\theta(s))\phi(s) \]

For TD(0), the target is the TD target \(r+\gamma V(s')\) \[ \triangle \theta=\alpha(\color{red}{r+\gamma V(s')}-V_\theta(s))\phi(s) \]

For forward-view TD(\(\lambda\)), the target is the return \(_vt^\lambda\) \[ \triangle \theta=\alpha(\color{red}{v_t^\lambda}-V_\theta(s))\phi(s) \]

For backward-view TD(\(\lambda\)), we use eligibility traces \[ \begin{align} \delta_t &= r_{t+1}+\gamma V(s_{t+1})-V(s_t) \\ e_t& = \gamma\lambda e_{t-1} +\phi(s_t) \\ \triangle\theta&=\alpha\delta_te_t \end{align} \]

The policy gradient can also be estimated at many time-scales \[ \nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)\color{red}{A^{\pi_\theta}(s,a)}] \]

MC policy gradient uses error from complete return \[ \triangle\theta=\alpha(\color{red}{v_t}-V_v(s_t))\nabla_\theta\log\pi_\theta(s_t,a_t) \]

Actor-critic policy gradient uses the one-step TD error \[ \triangle\theta=\alpha(\color{red}{r+\gamma V_v(s_{t+1})}-V_v(s_t))\nabla_\theta\log\pi_\theta(s_t,a_t) \]

Just like forward-view TD(\(\lambda\)), we can mix over time-scale \[ \triangle \theta=\alpha(\color{red}{v_t^\lambda}-V_v(s_t))\nabla_\theta\log\pi_\theta(s_t,a_t) \] where \(v_t^\lambda-V_v(s_t)\) is a biased estimate of advantage function.

Like backward-view TD(\(\lambda\)), we can also use eligibility traces by substituting \(\phi(s)=\nabla_\theta\log\pi_\theta(s,a)\) \[ \begin{align} \delta_t &= r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t) \\ e_{t+1}& = \gamma\lambda e_{t} +\nabla_\theta\log\pi_\theta(s,a) \\ \triangle\theta&=\alpha\delta_te_t \end{align} \]

## Summary of Policy Gradient Algorithms

The policy gradient has many equivalent forms

Each leads a stochastic gradient ascent algorithm. Critic uses policy evaluation to estimate \(Q^\pi(s, a)\), \(A^\pi(s, a)\) or \(V^\pi(s)\).