ML Reviews

markov-decision-process

  • Decide to take action $a$, based on the current state (only) $s_t$, for a reward $r_t$
  • $s_t \in S$, where $S$ comprises all possible states (which may not be finite)
  • Decisions are made based on transition probabilities $T(s' \vert s, a)$, which give the probability of the next state $s'$ given the current state $s$ and the action $a$ taken

I don't fully understand $T(s' \vert s, a)$ yet. Strictly, it describes the environment's dynamics: the probability of ending up in state $s'$ after taking action $a$ in state $s$. I perceive it as connected to the idea of exploration versus exploitation, the balance between making optimal decisions and discovering new state space, though that trade-off is usually handled in the policy rather than in $T$: you can use a stochastic policy that offers some chance of making the sub-optimal choice, i.e. wandering off the trail.
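A minimal sketch of a tabular MDP, assuming made-up states, actions, transition probabilities, and rewards purely for illustration; the point is just the shape of $T(s' \vert s, a)$ and $r_t$:

```python
import random

# Hypothetical states and actions, chosen only to illustrate the structure.
states = ["sunny", "rainy"]
actions = ["walk", "drive"]

# T[(s, a)] is a distribution over next states s'; probabilities sum to 1.
T = {
    ("sunny", "walk"):  {"sunny": 0.9, "rainy": 0.1},
    ("sunny", "drive"): {"sunny": 0.7, "rainy": 0.3},
    ("rainy", "walk"):  {"sunny": 0.2, "rainy": 0.8},
    ("rainy", "drive"): {"sunny": 0.4, "rainy": 0.6},
}

# R[(s, a)] is the reward r_t received for taking action a in state s.
R = {
    ("sunny", "walk"): 2.0, ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}

def step(s, a):
    """Sample the next state s' ~ T(s' | s, a) and return (s', r_t)."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return s_next, R[(s, a)]

s_next, r = step("sunny", "walk")
```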

Finite horizons

  • The reward function comprises decomposable timesteps:
    • ...treated as components in an additively decomposed utility function. In a finite horizon problem with $n$ decisions, the [[utility]] associated with a sequence of rewards $r_{1:n}$ is given by $\sum_{t=1}^{n} r_t$

    • In other words, maximize the expected long-term cumulative reward
    • Also sometimes referred to as the [[return]]
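As a quick sketch, the finite-horizon return is just a sum over the reward sequence (the rewards below are made up for illustration):

```python
# r_1, ..., r_n for a hypothetical finite-horizon problem with n = 5 decisions.
rewards = [1.0, 0.0, 2.0, -0.5, 1.5]

# The return: sum_{t=1}^{n} r_t
finite_horizon_return = sum(rewards)
```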

Infinite horizons

For problems that do not have a finite number of decisions, $t$ can go on forever. There are two common ways to keep the return $\sum_{t=1}^{\infty} r_t$ from diverging:

  • A time-dependent discount factor: $\sum_{t=1}^{\infty}\gamma^{t-1}r_t$, with $0 \le \gamma < 1$
  • A time-averaged reward: $\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n} r_t$

Apparently there isn't much difference between the two in practice: the time-averaged reward avoids having to tune a discount rate $\gamma$, while the discounted form is computationally easier to work with.
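A small sketch comparing the two on a truncated constant reward stream; the constant reward of 1.0 and $\gamma = 0.9$ are assumptions for illustration only:

```python
gamma = 0.9
rewards = [1.0] * 1000   # stand-in for an (infinitely) long reward stream

# Discounted return: sum_{t=1}^{inf} gamma^{t-1} r_t, which for a constant
# reward of 1 converges to 1 / (1 - gamma) = 10.
discounted = sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# Time-averaged reward: lim_{n->inf} (1/n) sum_{t=1}^{n} r_t, here simply 1.
averaged = sum(rewards) / len(rewards)
```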

Policy

A [[policy]] maps states to actions. In an MDP, we assume the next state depends only on the current state, so we can write the policy as $\pi(s)$, also known as a stationary policy.
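A sketch of a stationary policy as a fixed state-to-action mapping, reusing the hypothetical states and actions from the MDP sketch above; the stochastic ($\epsilon$-greedy) variant is where the exploration idea from earlier usually lives:

```python
import random

# Stationary policy pi(s): a fixed mapping from state to action.
policy = {"sunny": "walk", "rainy": "drive"}

def pi(s):
    return policy[s]

# Epsilon-greedy variant: with probability epsilon, take a random
# (possibly sub-optimal) action instead of the policy's choice.
def pi_epsilon(s, epsilon=0.1, actions=("walk", "drive")):
    if random.random() < epsilon:
        return random.choice(actions)
    return policy[s]
```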