Notice
Recent Posts
«   2019/09   »
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30
Tags more
Archives
Today
13
Total
107,580
관리 메뉴

## Multi-armed bandit problem 본문

논문

### Multi-armed bandit problem

모릅니다 2011.01.07 02:35

레버가 여러 개 달린 슬롯머신을 말하는구나. bandit 이 뭔가 했네…

각 레버마다 다른 이득을 얻을 수 있고, 초기에는 어떤 값이 나올지 모르는데 하면 할 수록 시스템을 알아가게 되기 때문에

Exploration & Exploitation 측면으로 많이 접근한다.

위키피디아 링크는 아래와 같음.

http://en.wikipedia.org/wiki/Multi-armed_bandit

1952년에 Herbert Robbins 가 Fomulate 했다.

Common bandit strategies 는 아래와 같다는데, 이건 좀 봐야 알겠네.

#### Common bandit strategies

Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the three broad categories detailed below.

#####  Semi-uniform strategies

Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All those strategies have in common a greedy behavior where the best lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.

• Epsilon-greedy strategy: The best lever is selected for a proportion 1 − ε of the trials, and another lever is randomly selected (with uniform probability) for a proportion ε. A typical parameter value might be ε = 0.1, but this can vary widely depending on circumstances and predilections.
• Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. For N trials in total, the exploration phase occupies εN trials and the exploitation phase (1 − ε)N trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
• Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of ε decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
• Adaptive epsilon-greedy strategy based on value differences (VDBE): Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress instead of manual tuning (Tokic, 2010). High changes in the value estimates lead to a high epsilon (exploration); low value changes to a low epsilon (exploitation).
#####  Probability matching strategies

Probability matching strategies reflect the idea that the number of pulls for a given lever should match its actual probability of being the optimal lever.

#####  Pricing strategies

Pricing strategies establish a price for each lever. The lever of highest price is always pulled.

#### '논문' 카테고리의 다른 글

 패킷 전송 지연, 윈도우 사이즈, 손실 확률, 큐 계산  (0) 2013.02.22 2011.01.07
공유하기 링크