
policy

Concept in [[decision-theory]], specifically [[markov-decision-process]]

For a history $h_t = (s_{1:t}, a_{1:t})$ that accumulates all past states $s$ and actions $a$, the policy is written as $\pi_t(h_t)$. For a stationary process, we can simply write this as $\pi(s)$, omitting the time dependence.
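As a minimal sketch of this distinction (the type names below are illustrative assumptions, not part of this note): a history-dependent policy consumes the full history, while a stationary policy needs only the current state, so any stationary policy can be lifted to a history policy that ignores everything but the most recent state.

```python
from typing import Callable, Sequence, Tuple

State, Action = int, int
# h_t = (s_{1:t}, a_{1:t}): all states and actions observed so far
History = Tuple[Sequence[State], Sequence[Action]]

# General case: pi_t(h_t) maps the whole history to an action
HistoryPolicy = Callable[[History], Action]

# Stationary case: pi(s) depends only on the current state
StationaryPolicy = Callable[[State], Action]

def lift(pi: StationaryPolicy) -> HistoryPolicy:
    """A stationary policy induces a history policy that looks
    only at the most recent state."""
    return lambda h: pi(h[0][-1])
```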

An optimal policy $\pi^*$ is defined as one that maximizes the [[utility]]:

$$\pi^*(s) = \arg\max_\pi U^\pi(s)$$
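A brute-force illustration of this definition (the toy two-state MDP below is an invented example, not from this note): enumerate every deterministic stationary policy, evaluate $U^\pi$ exactly, and keep the arg max.

```python
import itertools
import numpy as np

# Toy 2-state, 2-action MDP (all numbers are illustrative assumptions).
# P[a][s, s'] = transition probability, R[s, a] = reward, gamma = discount.
n_states, n_actions = 2, 2
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.1, 0.9],   # action 1
     [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) U = R_pi for U^pi."""
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Arg max over all deterministic stationary policies. In a finite discounted
# MDP one policy maximizes U^pi(s) for every s simultaneously, so summing
# over states is a safe scalar ordering here.
best = max(itertools.product(range(n_actions), repeat=n_states),
           key=lambda pi: evaluate(pi).sum())
print("optimal policy:", best, "utility:", evaluate(best))
```

In practice one would use value iteration or policy iteration rather than enumeration; the brute force here only mirrors the arg max in the definition.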

Deterministic policies

Stochastic policies

#needs-expanding