본문 바로가기

논문

Multi-armed bandit problem

레버가 여러 개 달린 슬롯머신을 말하는구나. bandit 이 뭔가 했네…

각 레버마다 다른 이득을 얻을 수 있고, 초기에는 어떤 값이 나올지 모르는데 하면 할 수록 시스템을 알아가게 되기 때문에

Exploration & Exploitation 측면으로 많이 접근한다.

위키피디아 링크는 아래와 같음.

http://en.wikipedia.org/wiki/Multi-armed_bandit

1952년에 Herbert Robbins 가 Fomulate 했다.

Common bandit strategies 는 아래와 같다는데, 이건 좀 봐야 알겠네.

Common bandit strategies

Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the three broad categories detailed below.

[edit] Semi-uniform strategies

Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All those strategies have in common a greedy behavior where the best lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.

  • Epsilon-greedy strategy: The best lever is selected for a proportion 1 − ε of the trials, and another lever is randomly selected (with uniform probability) for a proportion ε. A typical parameter value might be ε = 0.1, but this can vary widely depending on circumstances and predilections.
  • Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. For N trials in total, the exploration phase occupies εN trials and the exploitation phase (1 − ε)N trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
  • Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of ε decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
  • Adaptive epsilon-greedy strategy based on value differences (VDBE): Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress instead of manual tuning (Tokic, 2010). High changes in the value estimates lead to a high epsilon (exploration); low value changes to a low epsilon (exploitation).
[edit] Probability matching strategies

Probability matching strategies reflect the idea that the number of pulls for a given lever should match its actual probability of being the optimal lever.

[edit] Pricing strategies

Pricing strategies establish a price for each lever. The lever of highest price is always pulled.

'논문' 카테고리의 다른 글

패킷 전송 지연, 윈도우 사이즈, 손실 확률, 큐 계산  (0) 2013.02.22