• Theory and Algorithms for the Bandit Problem, p.38: Thompson sampling
  • Bayesian estimation of the expected value
  • Choose each action with the probability that it has the maximum expected value among all actions (randomized selection)
  • However, instead of explicitly computing this "probability of having the maximum expected value", use a randomized algorithm
  • Since the estimation is Bayesian, a posterior distribution over the expected value is obtained; sample from this distribution
  • Select the action whose sampled value is the largest
  • This achieves "choosing each action with the probability that it has the maximum expected value" (see the sketch after this list)
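A minimal sketch in Python of this sample-then-argmax procedure for a Bernoulli bandit, assuming Beta(1, 1) priors; the arm probabilities `true_probs` and the loop length are illustrative assumptions, not from the original note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown true reward probabilities, used only to simulate feedback (illustrative values).
true_probs = np.array([0.3, 0.5, 0.7])
n_arms = len(true_probs)

# Beta(1, 1) prior over each arm's expected reward.
alpha = np.ones(n_arms)  # 1 + number of observed successes
beta = np.ones(n_arms)   # 1 + number of observed failures

for t in range(1000):
    # Sample one candidate expected value from each arm's posterior.
    samples = rng.beta(alpha, beta)
    # Choose the arm whose sampled value is the largest.
    arm = int(np.argmax(samples))
    # Observe a Bernoulli reward and update that arm's posterior.
    reward = rng.random() < true_probs[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```

Because each arm is selected exactly when its posterior sample happens to be the largest, the selection frequency matches the posterior probability that the arm is the best one, without ever computing that probability explicitly.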

https://hagino3000.blogspot.com/2015/07/thompson-sampling.html https://hagino3000.blogspot.com/2016/12/linear-bandit.html

Reinforcement Learning


This page is auto-translated from /nishio/トンプソンサンプリング using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.