ML Reviews

markov-decision-process

  • Decide to take action $a$, based on the current state (only) $s_t$, for a reward $r_t$
  • $s_t \in S$, where $S$ comprises all possible states (which may not be finite)
  • Decisions are made based on transition probabilities $T(s' \vert s, a)$, which give the probability of the next state $s'$ given the current state $s$ and the action $a$ taken

I don't fully understand $T(s' \vert s, a)$ yet. Strictly, it describes the environment's dynamics: the probability of ending up in state $s'$ after taking action $a$ in state $s$. I perceive it as connected to the idea of exploration versus exploitation, the balance between making optimal decisions and discovering new state space, though that trade-off is usually handled in the policy rather than in $T$: you can use a stochastic policy that offers some chance of making the sub-optimal choice, i.e. wandering off the trail.
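A minimal sketch of a tabular MDP, assuming made-up states, actions, transition probabilities, and rewards purely for illustration; the point is just the shape of $T(s' \vert s, a)$ and $r_t$:

```python
import random

# Hypothetical states and actions, chosen only to illustrate the structure.
states = ["sunny", "rainy"]
actions = ["walk", "drive"]

# T[(s, a)] is a distribution over next states s'; probabilities sum to 1.
T = {
    ("sunny", "walk"):  {"sunny": 0.9, "rainy": 0.1},
    ("sunny", "drive"): {"sunny": 0.7, "rainy": 0.3},
    ("rainy", "walk"):  {"sunny": 0.2, "rainy": 0.8},
    ("rainy", "drive"): {"sunny": 0.4, "rainy": 0.6},
}

# R[(s, a)] is the reward r_t received for taking action a in state s.
R = {
    ("sunny", "walk"): 2.0, ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}

def step(s, a):
    """Sample the next state s' ~ T(s' | s, a) and return (s', r_t)."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return s_next, R[(s, a)]

s_next, r = step("sunny", "walk")
```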

Finite horizons

  • The reward function comprises decomposable timesteps:
    • ...treated as components in an additively decomposed utility function. In a finite horizon problem with $n$ decisions, the [[utility]] associated with a sequence of rewards $r_{1:n}$ is given by $\sum_{t=1}^{n} r_t$

    • In other words, maximize the expected long-term cumulative reward
    • Also sometimes referred to as the [[return]]
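As a quick sketch, the finite-horizon return is just a sum over the reward sequence (the rewards below are made up for illustration):

```python
# r_1, ..., r_n for a hypothetical finite-horizon problem with n = 5 decisions.
rewards = [1.0, 0.0, 2.0, -0.5, 1.5]

# The return: sum_{t=1}^{n} r_t
finite_horizon_return = sum(rewards)
```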

Infinite horizons

For problems that do not have a finite number of decisions, $t$ can go on forever. There are two common ways to keep the return $\sum_{t=1}^{\infty} r_t$ from diverging:

  • A time-dependent discount factor: $\sum_{t=1}^{\infty}\gamma^{t-1}r_t$, with $0 \le \gamma < 1$
  • A time-averaged reward: $\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n} r_t$

Apparently there isn't much difference between the two in practice: the time-averaged reward avoids having to tune a discount rate $\gamma$, while the discounted form is computationally easier to work with.
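A small sketch comparing the two on a truncated constant reward stream; the constant reward of 1.0 and $\gamma = 0.9$ are assumptions for illustration only:

```python
gamma = 0.9
rewards = [1.0] * 1000   # stand-in for an (infinitely) long reward stream

# Discounted return: sum_{t=1}^{inf} gamma^{t-1} r_t, which for a constant
# reward of 1 converges to 1 / (1 - gamma) = 10.
discounted = sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Time-averaged reward: lim_{n->inf} (1/n) sum_{t=1}^{n} r_t, here simply 1.
averaged = sum(rewards) / len(rewards)
```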

Policy

A [[policy]] maps states to actions. In an MDP, we assume the next state depends only on the current state, so we can write the policy as $\pi(s)$, also known as a stationary policy.
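A sketch of a stationary policy as a fixed state-to-action mapping, reusing the hypothetical states and actions from the MDP sketch above; the stochastic ($\epsilon$-greedy) variant is where the exploration idea from earlier usually lives:

```python
import random

# Stationary policy pi(s): a fixed mapping from state to action.
policy = {"sunny": "walk", "rainy": "drive"}

def pi(s):
    return policy[s]

# Epsilon-greedy variant: with probability epsilon, take a random
# (possibly sub-optimal) action instead of the policy's choice.
def pi_epsilon(s, epsilon=0.1, actions=("walk", "drive")):
    if random.random() < epsilon:
        return random.choice(actions)
    return policy[s]
```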