# Clustered Bandits

https://arxiv.org/abs/1206.4169

### Abstract

Few types of users, each with a specific response to different arms

User enters the system: type unknown to the decision maker

Can take advantage of knowing that only few types exist and cluster the users according to their response to the arms

### Introduction

**Challenge 1**: The decisions made maybe context-specific --> contextual version of the classical MAB**Challenge 2**: The number of contexts may be quite large, and the number of observation per context maybe quite smallI.e. many user, while each user may only be interacting with the system for relatively small number of purchases

Want to: exploit latent low-dimensional structure in high dimensional problem

identify a few features that capture most of the heterogeneity of contexts, or to cluster the contexts into a few groups with similar characteristics.

Approach

Estimate low-dimensional structure from high-dimensional data offline (i.e. based on previously collected data about the context)

The inferred low-dimensional representation is used in solving the online contextual bandit problem

### Detail discussion

Approach 1: Exploration, clustering, exploitation

Assume that we know the parameter set (expected reward vector under type x), but not the exact type of each arriving user

Explore over an initial set of users, then we cluster from this data, and finally we run that algorithm for new users

Proposed algorithm 1) Unif - Clustering - UCB-ET(delta)

For the first M0 users, sample the arms uniformly at random

After a large enough number of pilot users, perform a clustering algorithm to obtain N estimated parameter vectors

Then, run the UCB-ET(delta) algorithm for the new (non-pilot) users

Explain of UCB-KT algorithm: upper confidence bound with known types

Let theta_t be the empirical reward vector up to time t

Given a parameter x, define B(x) as the set of parameters for which the optimal arms are better than the optimal arm for x, and the reward distribution under the optimal arm for x are the same

Epsilon: elite arms, which include only arms that are optimal for some parameter x

Input: estimated parameter vectors of each type

Approach 2: online clustering

Continuously cluster as we learn.

Suppose first we know the type of each arriving user (but do not know the parameter vector for each type), then the problem is reduced to N separate bandit problems, one for each type

In each time step, given past observations, we use a clustering technique to divide users into N groups

Last updated