
policy

Concept in [[decision-theory]], specifically [[markov-decision-process]]

For a history $h_t = (s_{1:t}, a_{1:t})$ that accumulates all past states $s$ and actions $a$, the policy is written as $\pi_t(h_t)$. For a stationary process, we can simply write this as $\pi(s)$, omitting the time dependence.
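As a minimal sketch of this distinction (the type names below are illustrative assumptions, not part of this note): a history-dependent policy consumes the full history, while a stationary policy needs only the current state, so any stationary policy can be lifted to a history policy that ignores everything but the most recent state.

```python
from typing import Callable, Sequence, Tuple

State, Action = int, int
# h_t = (s_{1:t}, a_{1:t}): all states and actions observed so far
History = Tuple[Sequence[State], Sequence[Action]]

# General case: pi_t(h_t) maps the whole history to an action
HistoryPolicy = Callable[[History], Action]

# Stationary case: pi(s) depends only on the current state
StationaryPolicy = Callable[[State], Action]

def lift(pi: StationaryPolicy) -> HistoryPolicy:
    """A stationary policy induces a history policy that looks
    only at the most recent state."""
    return lambda h: pi(h[0][-1])
```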

An optimal policy $\pi^*$ is defined as one that maximizes the [[utility]]:

$$\pi^*(s) = \arg\max_\pi U^\pi(s)$$
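A brute-force illustration of this definition (the toy two-state MDP below is an invented example, not from this note): enumerate every deterministic stationary policy, evaluate $U^\pi$ exactly, and keep the arg max.

```python
import itertools
import numpy as np

# Toy 2-state, 2-action MDP (all numbers are illustrative assumptions).
# P[a][s, s'] = transition probability, R[s, a] = reward, gamma = discount.
n_states, n_actions = 2, 2
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.1, 0.9],   # action 1
     [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) U = R_pi for U^pi."""
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Arg max over all deterministic stationary policies. In a finite discounted
# MDP one policy maximizes U^pi(s) for every s simultaneously, so summing
# over states is a safe scalar ordering here.
best = max(itertools.product(range(n_actions), repeat=n_states),
           key=lambda pi: evaluate(pi).sum())
print("optimal policy:", best, "utility:", evaluate(best))
```

In practice one would use value iteration or policy iteration rather than enumeration; the brute force here only mirrors the arg max in the definition.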

Deterministic policies

Stochastic policies

#needs-expanding