Clustered Bandits

https://arxiv.org/abs/1206.4169

Abstract

  • Only a few types of users exist, each with a specific response to the arms

  • When a user enters the system, their type is unknown to the decision maker

    • Can take advantage of knowing that only a few types exist and cluster the users according to their responses to the arms

Introduction

  • Challenge 1: The decisions made may be context-specific --> a contextual version of the classical MAB

  • Challenge 2: The number of contexts may be quite large, and the number of observations per context may be quite small

    • I.e. many users, while each user may only interact with the system for a relatively small number of purchases

    • Want to: exploit latent low-dimensional structure in a high-dimensional problem

      • Identify a few features that capture most of the heterogeneity of contexts, or cluster the contexts into a few groups with similar characteristics

  • Approach

    • Estimate the low-dimensional structure from high-dimensional data offline (i.e. based on previously collected data about the contexts)

    • The inferred low-dimensional representation is used in solving the online contextual bandit problem

Detailed discussion

Approach 1: Exploration, clustering, exploitation

  • Assume that we know the parameter set (the expected reward vector under each type x), but not the exact type of each arriving user

  • Explore over an initial set of users, then cluster based on this data, and finally run a known-types bandit algorithm for new users

Proposed algorithm 1: Unif-Clustering-UCB-ET(delta)

  • For the first M0 users, sample the arms uniformly at random

  • After a large enough number of pilot users, run a clustering algorithm to obtain N estimated parameter vectors

  • Then, run the UCB-ET(delta) algorithm for the new (non-pilot) users

  • Explanation of the UCB-KT algorithm (upper confidence bound with known types), which UCB-ET builds on:

    • Let theta_t be the empirical reward vector up to time t

    • Given a parameter x, define B(x) as the set of parameters whose optimal arm is strictly better than the optimal arm for x, but whose reward distribution under x's optimal arm is the same as x's (roughly, B(x) = {y : y(a*(x)) = x(a*(x)) and max_a y(a) > x(a*(x))}, where a*(x) is the optimal arm for x)

    • Epsilon: the set of elite arms, which includes only the arms that are optimal for some parameter x

  • Input: the estimated parameter vectors for each type (a minimal sketch of the whole pipeline follows)
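
A minimal sketch of the three phases, assuming Bernoulli rewards and using k-means for the clustering step. The UCB1-style index restricted to elite arms is an illustrative stand-in for the paper's exact UCB-ET(delta) rule; all parameter values here are made up.

```python
# Sketch of the Unif-Clustering-UCB-ET pipeline. Assumptions not in the
# paper: Bernoulli rewards, k-means for the clustering step, and a plain
# UCB1 index restricted to elite arms instead of the exact UCB-ET(delta) rule.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

K = 5      # number of arms
N = 3      # number of user types (assumed known)
M0 = 200   # number of pilot users
T = 100    # interactions per user

# Hypothetical ground-truth parameter vectors, one per type.
true_params = rng.uniform(0.1, 0.9, size=(N, K))

def pull(user_type, arm):
    return rng.binomial(1, true_params[user_type, arm])

# Phase 1: for the first M0 (pilot) users, sample arms uniformly at random.
pilot_means = np.empty((M0, K))
for m in range(M0):
    utype = rng.integers(N)
    counts, sums = np.zeros(K), np.zeros(K)
    for _ in range(T):
        a = rng.integers(K)
        sums[a] += pull(utype, a)
        counts[a] += 1
    pilot_means[m] = sums / np.maximum(counts, 1)

# Phase 2: cluster the pilot users' empirical reward vectors to obtain
# N estimated parameter vectors, one per type.
est_params = KMeans(n_clusters=N, n_init=10).fit(pilot_means).cluster_centers_

# Elite arms: arms that are optimal for at least one estimated type.
elite = np.unique(est_params.argmax(axis=1))

# Phase 3: serve each new (non-pilot) user with UCB over the elite arms.
def serve_new_user(steps):
    utype = rng.integers(N)
    counts, sums = np.zeros(K), np.zeros(K)
    total = 0.0
    for t in range(1, steps + 1):
        ucb = np.full(K, -np.inf)          # non-elite arms are never played
        for a in elite:
            if counts[a] == 0:
                ucb[a] = np.inf            # play each elite arm once first
            else:
                ucb[a] = sums[a] / counts[a] + np.sqrt(2 * np.log(t) / counts[a])
        a = int(np.argmax(ucb))
        r = pull(utype, a)
        sums[a] += r
        counts[a] += 1
        total += r
    return total / steps

print("avg reward per step for one new user:", serve_new_user(500))
```

Restricting play to the elite arms is what makes the clustering pay off: arms that are optimal for no type need never be explored by new users.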

Approach 2: Online clustering

  • Continuously cluster as we learn.

  • Suppose first that we know the type of each arriving user (but not the parameter vector of each type); then the problem reduces to N separate bandit problems, one per type

  • In each time step, given past observations, we use a clustering technique to divide users into N groups (see the sketch after this list)
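
A minimal sketch of the online-clustering idea, again assuming Bernoulli rewards and k-means as the clustering technique. Re-clustering every step and the pooled UCB index are illustrative choices, not the paper's precise algorithm.

```python
# Sketch of online clustering: re-cluster users by their empirical reward
# vectors at each step, pool statistics within each cluster, and pick arms
# by a pooled UCB index. Assumed, not from the paper: Bernoulli rewards,
# k-means, per-step re-clustering, and this particular pooled index.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
K, N, M, T = 5, 3, 60, 300             # arms, types, users, time steps

true_params = rng.uniform(0.1, 0.9, size=(N, K))
user_types = rng.integers(N, size=M)   # latent; never observed directly

counts = np.zeros((M, K))
sums = np.zeros((M, K))

def pull(u, a):
    return rng.binomial(1, true_params[user_types[u], a])

# Warm-up: every user pulls every arm once, so empirical vectors exist.
for u in range(M):
    for a in range(K):
        sums[u, a] += pull(u, a)
        counts[u, a] += 1

for t in range(T):
    # Cluster users into N groups by their empirical reward vectors.
    means = sums / counts
    labels = KMeans(n_clusters=N, n_init=5).fit_predict(means)

    for u in range(M):
        # Pool observations across the user's current cluster and
        # choose the arm with the largest pooled UCB index.
        grp = labels == labels[u]
        c = counts[grp].sum(axis=0)
        s = sums[grp].sum(axis=0)
        ucb = s / c + np.sqrt(2 * np.log(c.sum()) / c)
        a = int(np.argmax(ucb))
        sums[u, a] += pull(u, a)
        counts[u, a] += 1
```

The pooling is the point: once the clustering is roughly correct, each user effectively learns from every observation of every user of the same type, rather than from their own short interaction history alone.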
