Contextual Bandits and Reinforcement Learning

  • Contextual bandits

    • Can think of it as an extension of multi-armed bandits, or as a simplified version of reinforcement learning

  • The multi-armed bandit algorithm outputs an action but doesn’t use any information about the state of the environment (context).

  • The contextual bandit extends the model by making the decision conditional on the state of the environment.

    • With such a model, you not only optimize decisions based on previous observations, but also personalize them for every situation.

    • The algorithm

      • observes a context

      • makes a decision, choosing one action from a number of alternative actions

      • observes an outcome of that decision.

      • An outcome defines a reward

    • The goal is to maximize average reward.
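
A minimal sketch of this observe-context / choose-action / observe-reward loop in Python. The linear reward model, the toy simulated environment, and all names (`choose_action`, `update`, etc.) are illustrative assumptions, not a specific library API:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4

# One weight vector per action: a simple linear model of expected reward (assumed).
weights = np.zeros((n_actions, n_features))

def choose_action(context, epsilon=0.1):
    """Observe a context and make a decision: explore with probability epsilon,
    otherwise pick the action with the highest predicted reward."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(weights @ context))

def update(action, context, reward, lr=0.1):
    """Use the observed outcome (reward) to improve the chosen action's model."""
    prediction = weights[action] @ context
    weights[action] += lr * (reward - prediction) * context

# Interaction loop: observe a context -> make a decision -> observe an outcome -> learn.
for t in range(1000):
    context = rng.normal(size=n_features)        # observe a context
    action = choose_action(context)              # choose one of the alternative actions
    # Toy environment: the reward depends on the context, so the best action varies.
    reward = float(context[action] + rng.normal(scale=0.1))
    update(action, context, reward)

print(weights)   # each row should roughly recover that action's reward pattern
```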

Compared with Reinforcement Learning?

  • Can think of RL as an extension of contextual bandits

  • You still have an agent (policy) that takes actions based on the state of the environment and observes a reward. The difference is that the agent can take multiple consecutive actions, and the reward information is sparse.

    • In chess, for example, the model uses the state of the board as context and decides which moves to make, but it gets a reward only at the very end of the game: win, loss, or draw.

    • The sparsity of the reward information makes it harder to train the model. You encounter the credit assignment problem: how to assign credit or blame to individual actions.
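
The notes above don't prescribe a fix, but one common way to handle credit assignment is to spread the single end-of-game reward backward over the moves via discounted returns; a minimal illustrative sketch (the discount factor and the toy episode are assumptions):

```python
def discounted_returns(rewards, gamma=0.99):
    """Spread sparse rewards backward: each step is credited with the
    discounted sum of everything that came after it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 5-move game where only the final move carries a reward (a win = +1).
print(discounted_returns([0, 0, 0, 0, 1.0]))
# Earlier moves receive smaller (discounted) credit for the eventual win.
```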

Multi-armed bandit problem

  • Exploration vs. exploitation

    • Limited resources allocated towards different choices

    • Gathering information about the choices vs. using that information to maximize rewards

  • Stochastic (rewards include randomness)

  • Classes of algorithms

    • Epsilon strategies: use epsilon as the exploration threshold

      • Exploration phase followed by an exploitation phase

      • Epsilon should be large enough to gather enough information to accurately reveal the winner, but small enough that we don't waste too much time exploring

      • Epsilon-Greedy (sketched in code below)

        • Epsilon is chosen before we pull any lever

        • E.g., 90% of the time pull the lever with the best estimated reward, 10% of the time pull a random lever

        • Exploration and exploitation are interleaved throughout the run, which gives the flexibility to switch between them
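
A minimal epsilon-greedy sketch of the split described above; the three hidden payout rates and epsilon = 0.1 are made-up numbers for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]          # hidden payout rates of three slot machines
n_arms = len(true_means)
estimates = np.zeros(n_arms)          # running average reward per arm
pulls = np.zeros(n_arms)

epsilon = 0.1                         # chosen before we pull any lever
for t in range(10_000):
    if rng.random() < epsilon:        # explore: pull a random lever
        arm = int(rng.integers(n_arms))
    else:                             # exploit: pull the best lever so far
        arm = int(np.argmax(estimates))
    reward = float(rng.random() < true_means[arm])              # simulated Bernoulli payout
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]    # incremental mean

print(estimates, pulls)   # estimates approach the true means; the best arm is pulled most
```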

    • UCB (Upper Confidence Bound)

      • The more time spent exploring an option, the more certain we are about it (the narrower its confidence interval)

        • E.g., a confidence interval of 0-10 vs. one of 4-6

        • 4-6: the narrower interval indicates higher confidence, because more trials have pinned the estimate down

      • Keeps track of which options would benefit more or less from exploration and exploits accordingly

      • Takes samples of each option's rewards and considers the variance of the observed samples to calculate an upper confidence bound

        • Bound: statistical potential for reward

        • Each round, the slot machine with the highest potential reward is chosen (dual benefit)

          • The option for which exploration would yield the most actionable new information

          • Such a slot machine has likely already ranked highly among our winners, if not at the top

      • Most popular form: UCB1

        • Uses the Chernoff-Hoeffding inequality to compute confidence bounds much faster, without assuming a specific reward distribution (only that rewards are bounded); a minimal sketch follows below
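
A minimal UCB1 sketch; the bound below is the standard "empirical mean + sqrt(2 ln t / n_j)" form that follows from the Hoeffding inequality for rewards in [0, 1]. The simulated payout rates are made up for illustration:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
true_means = [0.2, 0.5, 0.7]          # hidden payout rates, rewards bounded in [0, 1]
n_arms = len(true_means)
sums = np.zeros(n_arms)               # total reward per arm
pulls = np.zeros(n_arms)              # number of times each arm was played

def ucb1_choice(t):
    """Pick the arm with the highest upper confidence bound:
    empirical mean + sqrt(2 * ln t / n_j)."""
    for arm in range(n_arms):         # play every arm once so all counts are non-zero
        if pulls[arm] == 0:
            return arm
    bounds = sums / pulls + np.sqrt(2.0 * math.log(t) / pulls)
    return int(np.argmax(bounds))

for t in range(1, 10_001):
    arm = ucb1_choice(t)
    reward = float(rng.random() < true_means[arm])   # simulated Bernoulli payout
    sums[arm] += reward
    pulls[arm] += 1

print(pulls)   # the highest-mean arm ends up with by far the most pulls
```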

    • Bandit variants

      • Infinite Arms (theoretically)

      • Variable Arms

        • Standard bandits presume that the distribution of rewards from each slot machine is constant

        • In the real world, this might change over time (see the sketch after this list)

      • Contextual Bandits

        • What if the rewards were dependent on some other external variables that we could also measure?

        • Correlations between known states and reward expectations

      • Combinatorial Bandits

      • Dueling Bandits

      • Continuous Bandits

      • Adversarial Arms

      • Strategic Arms
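
For the variable-arms (non-stationary) case noted above, one standard adjustment is to replace the running average with a constant step-size update, so that recent rewards weigh more than old ones; a minimal sketch with an assumed step size of 0.1:

```python
def update_estimate(estimate, reward, alpha=0.1):
    """Constant step-size (exponential recency-weighted) update.
    Unlike a plain running average, older observations decay, so the
    estimate can track a reward distribution that drifts over time."""
    return estimate + alpha * (reward - estimate)

# Toy example: an arm's true payout jumps from 0.2 to 0.8 halfway through.
estimate = 0.0
for t in range(200):
    reward = 0.2 if t < 100 else 0.8
    estimate = update_estimate(estimate, reward)

print(round(estimate, 3))   # tracks the new level, close to 0.8
```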
