Clustered Bandits
https://arxiv.org/abs/1206.4169
Abstract
Few types of users, each with a specific response to different arms
User enters the system: type unknown to the decision maker
Can take advantage of knowing that only few types exist and cluster the users according to their response to the arms
Introduction
Challenge 1: The decisions made maybe context-specific --> contextual version of the classical MAB
Challenge 2: The number of contexts may be quite large, and the number of observation per context maybe quite small
I.e. many user, while each user may only be interacting with the system for relatively small number of purchases
Want to: exploit latent low-dimensional structure in high dimensional problem
identify a few features that capture most of the heterogeneity of contexts, or to cluster the contexts into a few groups with similar characteristics.
Approach
Estimate low-dimensional structure from high-dimensional data offline (i.e. based on previously collected data about the context)
The inferred low-dimensional representation is used in solving the online contextual bandit problem
Detail discussion
Approach 1: Exploration, clustering, exploitation
Assume that we know the parameter set (expected reward vector under type x), but not the exact type of each arriving user
Explore over an initial set of users, then we cluster from this data, and finally we run that algorithm for new users
Proposed algorithm 1) Unif - Clustering - UCB-ET(delta)
For the first M0 users, sample the arms uniformly at random
After a large enough number of pilot users, perform a clustering algorithm to obtain N estimated parameter vectors
Then, run the UCB-ET(delta) algorithm for the new (non-pilot) users
Explain of UCB-KT algorithm: upper confidence bound with known types
Let theta_t be the empirical reward vector up to time t
Given a parameter x, define B(x) as the set of parameters for which the optimal arms are better than the optimal arm for x, and the reward distribution under the optimal arm for x are the same
Epsilon: elite arms, which include only arms that are optimal for some parameter x
Input: estimated parameter vectors of each type
Approach 2: online clustering
Continuously cluster as we learn.
Suppose first we know the type of each arriving user (but do not know the parameter vector for each type), then the problem is reduced to N separate bandit problems, one for each type
In each time step, given past observations, we use a clustering technique to divide users into N groups
Last updated
Was this helpful?