Clustered Bandits

https://arxiv.org/abs/1206.4169

Abstract

  • Only a few types of users exist, each with a specific response to the arms

  • When a user enters the system, their type is unknown to the decision maker

    • Can take advantage of knowing that only a few types exist and cluster the users according to their responses to the arms

Introduction

  • Challenge 1: The decisions made may be context-specific --> a contextual version of the classical MAB

  • Challenge 2: The number of contexts may be quite large, and the number of observations per context may be quite small

    • I.e. many users, while each user may only interact with the system for a relatively small number of purchases

    • Want to: exploit latent low-dimensional structure in a high-dimensional problem

      • Identify a few features that capture most of the heterogeneity of contexts, or cluster the contexts into a few groups with similar characteristics

  • Approach

    • Estimate the low-dimensional structure from high-dimensional data offline (i.e. based on previously collected data about the contexts)

    • The inferred low-dimensional representation is used in solving the online contextual bandit problem

Detailed discussion

Approach 1: Exploration, clustering, exploitation

  • Assume that we know the parameter set (the expected reward vector under each type x), but not the exact type of each arriving user

  • Explore over an initial set of users, then cluster based on this data, and finally run a known-types bandit algorithm for new users

Proposed algorithm 1: Unif-Clustering-UCB-ET(delta)

  • For the first M0 users, sample the arms uniformly at random

  • After a large enough number of pilot users, run a clustering algorithm to obtain N estimated parameter vectors

  • Then, run the UCB-ET(delta) algorithm for the new (non-pilot) users

  • Explanation of the UCB-KT algorithm (upper confidence bound with known types), which UCB-ET builds on:

    • Let theta_t be the empirical reward vector up to time t

    • Given a parameter x, define B(x) as the set of parameters whose optimal arm is strictly better than the optimal arm for x, but whose reward distribution under x's optimal arm is the same as x's (roughly, B(x) = {y : y(a*(x)) = x(a*(x)) and max_a y(a) > x(a*(x))}, where a*(x) is the optimal arm for x)

    • Epsilon: the set of elite arms, which includes only the arms that are optimal for some parameter x

  • Input: the estimated parameter vectors for each type (a minimal sketch of the whole pipeline follows)
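
A minimal sketch of the three phases, assuming Bernoulli rewards and using k-means for the clustering step. The UCB1-style index restricted to elite arms is an illustrative stand-in for the paper's exact UCB-ET(delta) rule; all parameter values here are made up.

```python
# Sketch of the Unif-Clustering-UCB-ET pipeline. Assumptions not in the
# paper: Bernoulli rewards, k-means for the clustering step, and a plain
# UCB1 index restricted to elite arms instead of the exact UCB-ET(delta) rule.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

K = 5      # number of arms
N = 3      # number of user types (assumed known)
M0 = 200   # number of pilot users
T = 100    # interactions per user

# Hypothetical ground-truth parameter vectors, one per type.
true_params = rng.uniform(0.1, 0.9, size=(N, K))

def pull(user_type, arm):
    return rng.binomial(1, true_params[user_type, arm])

# Phase 1: for the first M0 (pilot) users, sample arms uniformly at random.
pilot_means = np.empty((M0, K))
for m in range(M0):
    utype = rng.integers(N)
    counts, sums = np.zeros(K), np.zeros(K)
    for _ in range(T):
        a = rng.integers(K)
        sums[a] += pull(utype, a)
        counts[a] += 1
    pilot_means[m] = sums / np.maximum(counts, 1)

# Phase 2: cluster the pilot users' empirical reward vectors to obtain
# N estimated parameter vectors, one per type.
est_params = KMeans(n_clusters=N, n_init=10).fit(pilot_means).cluster_centers_

# Elite arms: arms that are optimal for at least one estimated type.
elite = np.unique(est_params.argmax(axis=1))

# Phase 3: serve each new (non-pilot) user with UCB over the elite arms.
def serve_new_user(steps):
    utype = rng.integers(N)
    counts, sums = np.zeros(K), np.zeros(K)
    total = 0.0
    for t in range(1, steps + 1):
        ucb = np.full(K, -np.inf)          # non-elite arms are never played
        for a in elite:
            if counts[a] == 0:
                ucb[a] = np.inf            # play each elite arm once first
            else:
                ucb[a] = sums[a] / counts[a] + np.sqrt(2 * np.log(t) / counts[a])
        a = int(np.argmax(ucb))
        r = pull(utype, a)
        sums[a] += r
        counts[a] += 1
        total += r
    return total / steps

print("avg reward per step for one new user:", serve_new_user(500))
```

Restricting play to the elite arms is what makes the clustering pay off: arms that are optimal for no type need never be explored by new users.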

Approach 2: Online clustering

  • Continuously cluster as we learn.

  • Suppose first that we know the type of each arriving user (but not the parameter vector of each type); then the problem reduces to N separate bandit problems, one per type

  • In each time step, given past observations, we use a clustering technique to divide users into N groups (see the sketch after this list)
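
A minimal sketch of the online-clustering idea, again assuming Bernoulli rewards and k-means as the clustering technique. Re-clustering every step and the pooled UCB index are illustrative choices, not the paper's precise algorithm.

```python
# Sketch of online clustering: re-cluster users by their empirical reward
# vectors at each step, pool statistics within each cluster, and pick arms
# by a pooled UCB index. Assumed, not from the paper: Bernoulli rewards,
# k-means, per-step re-clustering, and this particular pooled index.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
K, N, M, T = 5, 3, 60, 300             # arms, types, users, time steps

true_params = rng.uniform(0.1, 0.9, size=(N, K))
user_types = rng.integers(N, size=M)   # latent; never observed directly

counts = np.zeros((M, K))
sums = np.zeros((M, K))

def pull(u, a):
    return rng.binomial(1, true_params[user_types[u], a])

# Warm-up: every user pulls every arm once, so empirical vectors exist.
for u in range(M):
    for a in range(K):
        sums[u, a] += pull(u, a)
        counts[u, a] += 1

for t in range(T):
    # Cluster users into N groups by their empirical reward vectors.
    means = sums / counts
    labels = KMeans(n_clusters=N, n_init=5).fit_predict(means)

    for u in range(M):
        # Pool observations across the user's current cluster and
        # choose the arm with the largest pooled UCB index.
        grp = labels == labels[u]
        c = counts[grp].sum(axis=0)
        s = sums[grp].sum(axis=0)
        ucb = s / c + np.sqrt(2 * np.log(c.sum()) / c)
        a = int(np.argmax(ucb))
        sums[u, a] += pull(u, a)
        counts[u, a] += 1
```

The pooling is the point: once the clustering is roughly correct, each user effectively learns from every observation of every user of the same type, rather than from their own short interaction history alone.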
