markov-decision-process
- Decide to take an action $a_t$, based on the current state $s_t$ (only), for a reward $r_t$
- The state space $\mathcal{S}$ comprises all possible states (which may not be finite)
- Decisions are made based on transition probabilities, $T(s' \mid s, a)$, which map the action taken in the current state to a distribution over the next state
There's a part of this I don't fully understand yet, but I perceive it as connected to the idea of exploration versus exploitation: the balance between making optimal decisions and discovering new state space. You can introduce some chance of making the sub-optimal choice, i.e. wandering off the trail.
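To make these pieces concrete for myself, here's a minimal Python sketch of an MDP. The states, actions, rewards, and transition table are a made-up toy example (nothing from a real problem), just to show where $\mathcal{S}$, the actions, $T(s' \mid s, a)$, and the reward fit.

```python
import random

# Toy MDP (hypothetical states, actions, and numbers, purely for illustration)
S = ["on_trail", "off_trail"]                 # state space
A = ["stay", "wander"]                        # actions
T = {                                         # T[(s, a)] -> {s': P(s' | s, a)}
    ("on_trail", "stay"):    {"on_trail": 0.9, "off_trail": 0.1},
    ("on_trail", "wander"):  {"on_trail": 0.2, "off_trail": 0.8},
    ("off_trail", "stay"):   {"on_trail": 0.1, "off_trail": 0.9},
    ("off_trail", "wander"): {"on_trail": 0.5, "off_trail": 0.5},
}
R = {                                         # R[(s, a)] -> immediate reward
    ("on_trail", "stay"): 1.0,   ("on_trail", "wander"): 0.5,
    ("off_trail", "stay"): -1.0, ("off_trail", "wander"): 0.0,
}

def step(s, a):
    """Sample the next state from T(s' | s, a) and return (next_state, reward)."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[(s, a)]

print(step("on_trail", "wander"))             # e.g. ("off_trail", 0.5)
```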
Finite horizons
- The reward function comprises decomposable timesteps: the per-step rewards are treated as components in an additively decomposed utility function. In a finite horizon problem with $n$ decisions, the [[utility]] associated with a sequence of rewards $r_1, \ldots, r_n$ is given by $\sum_{t=1}^{n} r_t$ (sketched in code after this list)
- In other words, maximize the expected long-term cumulative reward
- Also sometimes referred to as the [[return]]
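A tiny sketch of the finite-horizon utility above; the reward sequence is made up.

```python
def finite_horizon_utility(rewards):
    """U(r_1, ..., r_n) = sum over t of r_t: just add up the per-step rewards."""
    return sum(rewards)

print(finite_horizon_utility([1.0, 0.5, -1.0, 2.0]))  # 2.5
```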
Infinite horizons
For problems that do not have a finite number of decisions, the sum of rewards $\sum_t r_t$ can grow forever. There are two ways to keep it finite:
- A time-dependent discount function: $\sum_{t=1}^{\infty} \gamma^{t-1} r_t$, where $0 \le \gamma < 1$ is the discount factor
- A time-averaged reward: $\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} r_t$
Apparently there isn't much practical difference between the two: the latter doesn't need a discount rate to tune, but the former is computationally easier.
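A quick sketch comparing the two. The discount factor and the reward stream are arbitrary choices of mine, and the infinite sum/limit are truncated at a large horizon just for illustration.

```python
import itertools

GAMMA = 0.9                                      # made-up discount factor

def discounted_return(rewards, gamma=GAMMA, horizon=10_000):
    """Sum over t of gamma^(t-1) * r_t, truncated at `horizon` terms."""
    return sum(gamma ** t * r
               for t, r in enumerate(itertools.islice(rewards, horizon)))

def average_reward(rewards, horizon=10_000):
    """(1/n) * sum over t of r_t, with n = `horizon` standing in for n -> infinity."""
    rs = list(itertools.islice(rewards, horizon))
    return sum(rs) / len(rs)

print(discounted_return(itertools.repeat(1.0)))  # ~10.0 = 1 / (1 - gamma)
print(average_reward(itertools.repeat(1.0)))     # 1.0
```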
Policy
A [[policy]] maps states to actions. In an MDP, we assume the next state depends only on the current state, so we can write the action as $\pi(s)$; this is called a stationary policy.
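Sketch of a stationary policy as a plain state-to-action lookup, plus an epsilon-greedy variant as my own guess at the "wander off the trail" idea from the top of the note. The states and actions reuse the hypothetical toy MDP above.

```python
import random

policy = {"on_trail": "stay", "off_trail": "stay"}   # pi: state -> action

def act(s, epsilon=0.0, actions=("stay", "wander")):
    """pi(s): depends only on the current state; with prob. epsilon, explore instead."""
    if random.random() < epsilon:
        return random.choice(actions)                # deliberately sub-optimal choice
    return policy[s]

print(act("on_trail"))                 # always "stay" when epsilon = 0
print(act("on_trail", epsilon=0.2))    # occasionally "wander"
```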