Contextual Bandits and Reinforcement Learning
Contextual bandits
Can think of it as an extension of multi-armed bandits, or as a simplified version of reinforcement learning
The multi-armed bandit algorithm outputs an action but doesn’t use any information about the state of the environment (context).
The contextual bandit extends the model by making the decision conditional on the state of the environment.
With such a model, you not only optimize decisions based on previous observations, but also personalize decisions for every situation.
The algorithm
observes a context
makes a decision, choosing one action from a number of alternative actions
observes an outcome of that decision.
An outcome defines a reward
The goal is to maximize average reward.
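A minimal sketch of that loop in Python, assuming a toy environment (the action names, `get_reward`, and `choose_action` below are hypothetical stand-ins):

```python
import random

# Hypothetical action set and environment for illustration.
actions = ["show_ad_A", "show_ad_B", "show_ad_C"]

def get_reward(context, action):
    # Placeholder: in practice the reward comes from the environment
    # (e.g., whether the user clicked).
    return random.random()

def choose_action(context, epsilon=0.1):
    # Placeholder policy: explore at random with probability epsilon,
    # otherwise exploit a learned mapping from context to action value.
    if random.random() < epsilon:
        return random.choice(actions)
    return actions[0]  # stand-in for argmax over predicted rewards

total_reward = 0.0
for step in range(1000):
    context = {"hour": step % 24}          # observe a context
    action = choose_action(context)        # choose one action
    reward = get_reward(context, action)   # observe the outcome / reward
    total_reward += reward                 # goal: maximize average reward

print("average reward:", total_reward / 1000)
```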
Can think of RL as an extension of contextual bandits
You still have an agent (policy) that takes actions based on the state of the environment and observes a reward. The difference is that the agent can take multiple consecutive actions, and the reward information is sparse.
In chess, for example, the model will use the state of the board as context and decide which moves to make, but it will get a reward only at the very end of the game: win, loss, or draw.
The sparsity of reward information makes it harder to train the model. You run into the credit assignment problem: how to assign credit or blame to individual actions.
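One illustrative way to deal with a single end-of-game reward is to spread it back over the episode as discounted returns (a Monte Carlo-style sketch; the moves, discount factor, and reward values below are made up):

```python
# Spread a sparse terminal reward back over the actions of an episode.
gamma = 0.99                                   # discount factor (assumed)
episode = ["e4", "e5", "Nf3", "Nc6", "Bb5"]    # moves taken (illustrative)
terminal_reward = 1.0                          # win = 1, draw = 0, loss = -1

# Each move receives a discounted share of the single terminal reward;
# moves closer to the end of the game get more credit.
returns = [terminal_reward * gamma ** (len(episode) - 1 - t)
           for t in range(len(episode))]

for move, g in zip(episode, returns):
    print(f"{move}: return {g:.3f}")
```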
Exploration vs. Exploitation
Limited resources allocated towards different choices
Gathering information about the choices vs. using that information to maximize rewards
Stochastic (includes randomness)
Classes of strategies
Epsilon strategies: use epsilon as the threshold
An exploration phase followed by an exploitation phase
The exploration phase should be large enough to gather enough information to reliably reveal the winner, but small enough that we don't waste too much time exploring
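A rough sketch of explore-then-exploit over K slot machines (the `pull` function and its reward distribution are invented for illustration):

```python
import random

K = 5
EXPLORE_PULLS = 100          # size of the exploration phase (a tuning choice)
TOTAL_PULLS = 1000

def pull(arm):
    # Hypothetical slot machine: reward distribution is made up.
    return random.gauss(arm, 1.0)

totals = [0.0] * K
counts = [0] * K

# Exploration phase: try every arm uniformly.
for t in range(EXPLORE_PULLS):
    arm = t % K
    totals[arm] += pull(arm)
    counts[arm] += 1

# Exploitation phase: commit to the arm with the best observed mean.
best = max(range(K), key=lambda a: totals[a] / counts[a])
for _ in range(TOTAL_PULLS - EXPLORE_PULLS):
    totals[best] += pull(best)
    counts[best] += 1
```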
Epsilon-Greedy
Before each pull, decide whether to explore or exploit
E.g., 90% of the time pull the lever with the best observed reward (exploit), 10% of the time pull a random lever (explore)
Exploration and exploitation are interleaved throughout the run, which gives flexibility to switch between them
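A minimal epsilon-greedy sketch under the same toy setup (hypothetical `pull` function, epsilon = 0.1):

```python
import random

K = 5
EPSILON = 0.1                # explore 10% of the time, exploit 90%

def pull(arm):
    # Hypothetical slot machine: reward distribution is made up.
    return random.gauss(arm, 1.0)

totals = [pull(a) for a in range(K)]   # pull each lever once to initialize
counts = [1] * K

for _ in range(1000):
    if random.random() < EPSILON:
        arm = random.randrange(K)                                 # explore
    else:
        arm = max(range(K), key=lambda a: totals[a] / counts[a])  # exploit
    totals[arm] += pull(arm)
    counts[arm] += 1
```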
UCB (Upper Confidence Bound)
The more time we spend exploring an option, the more certain we are about its reward (narrower confidence interval)
E.g., one machine's reward estimate might span 0-10 while another's spans 4-6
The 4-6 machine's estimate carries higher confidence, even in fewer trials, because its observed rewards vary less
Keeps track of which options would benefit more or less from exploration, and explores or exploits accordingly
It takes samples of each option's rewards and considers the variance of the observed samples when calculating the upper confidence bound
Bound: statistical potential for reward
Each round, the slot machine with the highest potential reward (upper bound) is chosen, which gives a dual benefit:
It is the option for which exploration would yield the most actionable new information
It is also likely that such a slot machine has already ranked highly among our winners, if not at the top
Most popular form: UCB1
Uses the Chernoff-Hoeffding inequality to achieve a much faster confidence-bound calculation for arbitrary distributions
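A sketch of UCB1, assuming rewards lie in [0, 1] and a made-up `pull` function; each round it picks the arm with the highest empirical mean plus a sqrt(2 ln t / n) exploration bonus:

```python
import math
import random

K = 5

def pull(arm):
    # Hypothetical slot machine with rewards in [0, 1].
    return random.random() * (arm + 1) / K

totals = [0.0] * K
counts = [0] * K

# Play each arm once to initialize the estimates.
for arm in range(K):
    totals[arm] += pull(arm)
    counts[arm] += 1

for t in range(K + 1, 10001):
    # Upper confidence bound = empirical mean + exploration bonus.
    ucb = [totals[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
           for a in range(K)]
    arm = max(range(K), key=lambda a: ucb[a])
    totals[arm] += pull(arm)
    counts[arm] += 1
```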
Bandit variants
Infinite Arms (theoretically)
Variable Arms
Standard bandits presume that the distribution of rewards from each slot machine is constant
In the real world, this might change over time
Contextual Bandits
What if the rewards depended on some other external variables that we could also measure?
Correlations between known states and reward expectations (see the sketch after this list)
Combinatorial Bandits
Dueling Bandits
Continuous Bandits
Adversarial Arms
Strategic Arms
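For the contextual variant above, a LinUCB-style sketch (one linear reward model per arm, plus an uncertainty bonus); the context, reward function, and constants are invented for illustration:

```python
import numpy as np

K, d, ALPHA = 3, 4, 1.0                 # arms, context dimension, exploration weight
A = [np.eye(d) for _ in range(K)]       # per-arm design matrices
b = [np.zeros(d) for _ in range(K)]     # per-arm reward vectors
rng = np.random.default_rng(0)

def true_reward(context, arm):
    # Hypothetical environment: reward correlates with the context.
    return float(context[arm % d]) + rng.normal(0.0, 0.1)

for _ in range(2000):
    x = rng.random(d)                   # observe a context
    scores = []
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]            # ridge-regression estimate for arm a
        bonus = ALPHA * np.sqrt(x @ A_inv @ x)
        scores.append(theta @ x + bonus)
    arm = int(np.argmax(scores))        # choose the arm with the highest score
    r = true_reward(x, arm)             # observe the reward
    A[arm] += np.outer(x, x)            # update that arm's statistics
    b[arm] += r * x
```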