Contextual Bandits and Reinforcement Learning

  • Contextual bandits

    • Can think of it as an extension of multi-armed bandits, or as a simplified version of reinforcement learning

  • The multi-armed bandit algorithm outputs an action but doesn’t use any information about the state of the environment (context).

  • The contextual bandit extends the model by making the decision conditional on the state of the environment.

    • With such a model, you not only optimize decisions based on previous observations, but also personalize them for every situation.

    • The algorithm

      • observes a context

      • makes a decision, choosing one action from a number of alternative actions

      • observes an outcome of that decision.

      • An outcome defines a reward

    • The goal is to maximize average reward.
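
A minimal sketch of this observe-context / choose-action / observe-reward loop in Python. The linear reward model, the toy simulated environment, and all names (`choose_action`, `update`, etc.) are illustrative assumptions, not a specific library API:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4

# One weight vector per action: a simple linear model of expected reward (assumed).
weights = np.zeros((n_actions, n_features))

def choose_action(context, epsilon=0.1):
    """Observe a context and make a decision: explore with probability epsilon,
    otherwise pick the action with the highest predicted reward."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(weights @ context))

def update(action, context, reward, lr=0.1):
    """Use the observed outcome (reward) to improve the chosen action's model."""
    prediction = weights[action] @ context
    weights[action] += lr * (reward - prediction) * context

# Interaction loop: observe a context -> make a decision -> observe an outcome -> learn.
for t in range(1000):
    context = rng.normal(size=n_features)        # observe a context
    action = choose_action(context)              # choose one of the alternative actions
    # Toy environment: the reward depends on the context, so the best action varies.
    reward = float(context[action] + rng.normal(scale=0.1))
    update(action, context, reward)

print(weights)   # each row should roughly recover that action's reward pattern
```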

Compared with Reinforcement Learning?

  • Can think of RL as an extension of contextual bandits

  • You still have an agent (policy) that takes actions based on the state of the environment and observes a reward. The difference is that the agent can take multiple consecutive actions, and the reward information is sparse.

    • In chess, for example, the model uses the state of the board as context and decides which moves to make, but it gets a reward only at the very end of the game: win, loss, or draw.

    • The sparsity of the reward information makes it harder to train the model. You encounter the credit assignment problem: how to assign credit or blame to individual actions.
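
The notes above don't prescribe a fix, but one common way to handle credit assignment is to spread the single end-of-game reward backward over the moves via discounted returns; a minimal illustrative sketch (the discount factor and the toy episode are assumptions):

```python
def discounted_returns(rewards, gamma=0.99):
    """Spread sparse rewards backward: each step is credited with the
    discounted sum of everything that came after it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 5-move game where only the final move carries a reward (a win = +1).
print(discounted_returns([0, 0, 0, 0, 1.0]))
# Earlier moves receive smaller (discounted) credit for the eventual win.
```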

Multi-armed bandit problem

  • Exploration vs. exploitation

    • Limited resources allocated towards different choices

    • Gathering information about the choices vs. using that information to maximize rewards

  • Stochastic (rewards include randomness)

  • Classes of algorithms

    • Epsilon strategies: use epsilon as the exploration threshold

      • Exploration phase followed by an exploitation phase

      • Epsilon should be large enough to gather enough information to accurately reveal the winner, but small enough that we don't waste too much time exploring

      • Epsilon-Greedy (sketched in code below)

        • Epsilon is chosen before we pull any lever

        • E.g., 90% of the time pull the lever with the best estimated reward, 10% of the time pull a random lever

        • Exploration and exploitation are interleaved throughout the run, which gives the flexibility to switch between them
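
A minimal epsilon-greedy sketch of the split described above; the three hidden payout rates and epsilon = 0.1 are made-up numbers for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]          # hidden payout rates of three slot machines
n_arms = len(true_means)
estimates = np.zeros(n_arms)          # running average reward per arm
pulls = np.zeros(n_arms)

epsilon = 0.1                         # chosen before we pull any lever
for t in range(10_000):
    if rng.random() < epsilon:        # explore: pull a random lever
        arm = int(rng.integers(n_arms))
    else:                             # exploit: pull the best lever so far
        arm = int(np.argmax(estimates))
    reward = float(rng.random() < true_means[arm])              # simulated Bernoulli payout
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]    # incremental mean

print(estimates, pulls)   # estimates approach the true means; the best arm is pulled most
```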

    • UCB (Upper Confidence Bound)

      • The more time spent exploring an option, the more certain we are about it (the narrower its confidence interval)

        • E.g., a confidence interval of 0-10 vs. one of 4-6

        • 4-6: the narrower interval indicates higher confidence, because more trials have pinned the estimate down

      • Keeps track of which options would benefit more or less from exploration and exploits accordingly

      • Takes samples of each option's rewards and considers the variance of the observed samples to calculate an upper confidence bound

        • Bound: statistical potential for reward

        • Each round, the slot machine with the highest potential reward is chosen (dual benefit)

          • The option for which exploration would yield the most actionable new information

          • Such a slot machine has likely already ranked highly among our winners, if not at the top

      • Most popular form: UCB1

        • Uses the Chernoff-Hoeffding inequality to compute confidence bounds much faster, without assuming a specific reward distribution (only that rewards are bounded); a minimal sketch follows below
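
A minimal UCB1 sketch; the bound below is the standard "empirical mean + sqrt(2 ln t / n_j)" form that follows from the Hoeffding inequality for rewards in [0, 1]. The simulated payout rates are made up for illustration:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
true_means = [0.2, 0.5, 0.7]          # hidden payout rates, rewards bounded in [0, 1]
n_arms = len(true_means)
sums = np.zeros(n_arms)               # total reward per arm
pulls = np.zeros(n_arms)              # number of times each arm was played

def ucb1_choice(t):
    """Pick the arm with the highest upper confidence bound:
    empirical mean + sqrt(2 * ln t / n_j)."""
    for arm in range(n_arms):         # play every arm once so all counts are non-zero
        if pulls[arm] == 0:
            return arm
    bounds = sums / pulls + np.sqrt(2.0 * math.log(t) / pulls)
    return int(np.argmax(bounds))

for t in range(1, 10_001):
    arm = ucb1_choice(t)
    reward = float(rng.random() < true_means[arm])   # simulated Bernoulli payout
    sums[arm] += reward
    pulls[arm] += 1

print(pulls)   # the highest-mean arm ends up with by far the most pulls
```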

    • Bandit variants

      • Infinite Arms (theoretically)

      • Variable Arms

        • Standard bandits presume that the distribution of rewards from each slot machine is constant

        • In the real world, this might change over time (see the sketch after this list)

      • Contextual Bandits

        • What if the rewards were dependent on some other external variables that we could also measure?

        • Correlations between known states and reward expectations

      • Combinatorial Bandits

      • Dueling Bandits

      • Continuous Bandits

      • Adversarial Arms

      • Strategic Arms
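
For the variable-arms (non-stationary) case noted above, one standard adjustment is to replace the running average with a constant step-size update, so that recent rewards weigh more than old ones; a minimal sketch with an assumed step size of 0.1:

```python
def update_estimate(estimate, reward, alpha=0.1):
    """Constant step-size (exponential recency-weighted) update.
    Unlike a plain running average, older observations decay, so the
    estimate can track a reward distribution that drifts over time."""
    return estimate + alpha * (reward - estimate)

# Toy example: an arm's true payout jumps from 0.2 to 0.8 halfway through.
estimate = 0.0
for t in range(200):
    reward = 0.2 if t < 100 else 0.8
    estimate = update_estimate(estimate, reward)

print(round(estimate, 3))   # tracks the new level, close to 0.8
```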
