linear bandit

A model in which each action is represented by a d-dimensional vector and the reward is the inner product of that vector with an unknown d-dimensional parameter vector, plus noise: $X_{i}(t) = θ^{⊤} a_{i} + ϵ(t)$. The basic bandit corresponds to the case where this action vector is one-hot.
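
As a rough illustration of the definition above, here is a minimal sketch in plain Python. The parameter values in `theta`, the Gaussian noise, and the helper names (`expected_reward`, `pull`, `one_hot`) are all assumptions made for the example; the note itself does not fix a noise distribution.

```python
import random

def expected_reward(theta, action):
    """Noise-free part of the reward: the inner product θ^T a."""
    return sum(t * a for t, a in zip(theta, action))

def pull(theta, action, rng, noise_std=0.1):
    """Sample X_i(t) = θ^T a_i + ε(t); ε(t) is assumed Gaussian here."""
    return expected_reward(theta, action) + rng.gauss(0.0, noise_std)

def one_hot(i, d):
    """Action vector with 1 in coordinate i and 0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(d)]

theta = [0.2, 0.5, 0.9]  # hypothetical "unknown" parameter, d = 3
rng = random.Random(0)   # fixed seed so the run is reproducible

# With one-hot actions, arm i has mean reward theta[i]: the basic bandit.
arm_means = [expected_reward(theta, one_hot(i, len(theta)))
             for i in range(len(theta))]

# A general action vector mixes several coordinates of theta.
mixed_mean = expected_reward(theta, [0.5, 0.5, 0.0])

# One noisy observation from the arm with the one-hot vector for index 2.
sample = pull(theta, one_hot(2, len(theta)), rng)
```

With one-hot actions, each arm reads off a single coordinate of `theta`, so the d arms have independent means, as in the basic bandit. With general action vectors, arms share information through `theta`.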

Reinforcement Learning

This page is auto-translated from /nishio/線形バンディット using DeepL. If something looks interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.
