# Contexual Bandits and Reinforcement Learning

Contextual bandits

Can think of it as an extension of multi-armed bandits, or as a simplified version of reinforcement learning

**The multi-armed bandit**algorithm outputs an action but doesn’t use any information about the state of the environment (context).**The contextual bandit**extends the model by making the decision conditional on the state of the environment.With such model, you not only optimize decision based on previous observations, but you also personalize decisions for every situation.

The algorithm

observes a context

makes a decision, choosing one action from a number of alternative actions

observes an outcome of that decision.

An outcome defines a reward

The goal is to maximize average reward.

#### Compared with Reinforcement learning?

Can think about RL as an extension of contextual bandits

You still have an agent (policy) that takes action based on the state of the environment, observes a reward. The difference is that the agent can take multiple consecutive actions, and reward information is sparse.

The model will use the state of the chess board as context, will decide on which moves to make, but it will get a reward only at the very end of the game: win, loss, or draw.

The sparsity of reward information makes it harder to train the model. You encounter a problem of credit assignment problem: how to assign credit or blame individual actions.

### Multi-armed bandit problem

Exploration v.s Exploitation

Limited resources allocated towards different choices

Gathering information about the choices v.s using that information to maximize rewards

Stochastic (include randomness)

Class

Epsilon strategies: use epsilon as the threshold

Exploration phase followed by exploitation stage

Large enough to gather enough information to accurately reveal the winner, but small enough that we don't waste too much time on exploring

Epsilon-Greedy

Make a choice before we pull any lever

90% percent of time to pull, 10% to pull random

Throughout the run, divide explore / exploit. Flexibility to switch

UCB

More time explore, more certain (confidence interval)

0-10, 4-6

4-6: higher confidence in fewer trails

Keeps track of which option would benefit more or less from exploration and exploit accordingly

Takes samples of each option's rewards, it consider the variance of the observed samples to also calculate the upper confidence bound

Bound: statistical potential for reward

Each round, the slot machine with the highest potential reward is chosen (dual benefit)

The option for which exploration would yield the most actionable new information

Likely that such a slot machine has already ranked highly among our winners, if not the top

Most popular form: UCB1

Uses Chernoff-Hoeffding Inequality to achieve much faster confidence bounds calculation for arbitrary distributions

Bandit variants

Infinite Arms (theoretically)

Variable Arms

Presume that distribution of rewards from each slot machine is constant (?)

In real world, this might change over time

Contextual Bandits

What if the rewards were dependent on some other external variables that we could also measure

Correlations between known states and reward expectations

Combinatorial Bandits

Dueling Bandits

Continuous Bandits

Adversarial Arms

Strategic Arms

Last updated