# Contextual Bandits and Reinforcement Learning

* Contextual bandits
  * Can think of it as an extension of multi-armed bandits, or as a simplified version of reinforcement learning
* **The multi-armed bandit** algorithm outputs an action but doesn’t use any information about the state of the environment (context).
* **The contextual bandit** extends the model by making the decision conditional on the state of the environment.
  * With such a model, you not only optimize decisions based on previous observations, but also personalize decisions for every situation.
  * The algorithm
    * observes a context
    * makes a decision, choosing one action from a number of alternative actions
    * observes an outcome of that decision
    * the outcome defines a reward
  * The goal is to maximize the average reward.
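The observe-context → decide → observe-reward loop above can be sketched as follows. This is a minimal illustration, not a production algorithm: the contexts, actions, and click probabilities are invented, and a simple epsilon-greedy rule (explained later in these notes) stands in for the decision policy.

```python
import random

random.seed(0)  # seeded so the simulation is reproducible

# Hypothetical contexts and actions, made up for illustration.
CONTEXTS = ["mobile", "desktop"]
ACTIONS = ["ad_a", "ad_b"]

def get_reward(context, action):
    # Unknown to the algorithm; here we simulate per-context click rates.
    click_prob = {("mobile", "ad_a"): 0.1, ("mobile", "ad_b"): 0.3,
                  ("desktop", "ad_a"): 0.4, ("desktop", "ad_b"): 0.2}
    return 1.0 if random.random() < click_prob[(context, action)] else 0.0

# Per-(context, action) statistics: this is what makes it *contextual* --
# the decision is personalized to the observed context.
counts = {(c, a): 0 for c in CONTEXTS for a in ACTIONS}
totals = {(c, a): 0.0 for c in CONTEXTS for a in ACTIONS}

def choose(context, epsilon=0.1):
    if random.random() < epsilon:            # explore: random action
        return random.choice(ACTIONS)
    # exploit: best estimated action *for this context*
    return max(ACTIONS,
               key=lambda a: totals[(context, a)] / max(counts[(context, a)], 1))

total_reward = 0.0
for step in range(10_000):
    context = random.choice(CONTEXTS)        # 1. observe a context
    action = choose(context)                 # 2. make a decision
    reward = get_reward(context, action)     # 3. observe outcome -> reward
    counts[(context, action)] += 1
    totals[(context, action)] += reward
    total_reward += reward

print(total_reward / 10_000)                 # average reward we try to maximize
```

Note that the algorithm learns a different best action per context (ad_b on mobile, ad_a on desktop), which a context-free multi-armed bandit could not do.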

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Me0NNcLi0MeIuZccJwz%2F-Me0Nz-Fk21AA359ceYx%2Fimage.png?alt=media\&token=4c1db4d2-bf1f-46fb-8e8e-d2cc84bc0991)

#### Compared with Reinforcement Learning

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Me0NNcLi0MeIuZccJwz%2F-Me0P3XtNBgJtbG0Vs7X%2Fimage.png?alt=media\&token=23185204-454e-41b1-978a-b2349b7d2fa7)

* Can think about RL as an extension of contextual bandits
* You still have an agent (policy) that takes actions based on the state of the environment and observes a reward. The difference is that the agent can take multiple consecutive actions, and the reward information is sparse.
  * For example, a chess-playing model uses the state of the board as context and decides which moves to make, but it gets a reward only at the very end of the game: win, loss, or draw.
  * The sparsity of reward information makes it harder to train the model. You encounter the credit assignment problem: how to assign credit or blame to individual actions.

### Multi-armed bandit problem

* Exploration vs. exploitation
  * Limited resources allocated across different choices
  * Gathering information about the choices vs. using that information to maximize rewards
* Stochastic (includes randomness)
* Classes of strategies
  * Epsilon strategies: use epsilon as the threshold
    * An exploration phase followed by an exploitation phase
    * Epsilon should be large enough to gather enough information to accurately reveal the winner, but small enough that we don't waste too much time on exploring
    * Epsilon-Greedy
      * Decide whether to explore or exploit before each pull
      * E.g. with epsilon = 0.1: 90% of the time pull the best-known lever, 10% of the time pull a random lever
      * Throughout the run, explore and exploit are interleaved, giving flexibility to switch between them
  * UCB (Upper Confidence Bound)
    * The more time spent exploring an option, the more certain we are about it (tighter confidence interval)
      * E.g. an interval of 0-10 vs. 4-6 around the same mean: the tighter 4-6 interval reflects higher confidence, built up over more trials
    * Keeps track of which option would benefit more or less from exploration, and exploits accordingly
    * Takes samples of each option's rewards and considers the variance of the observed samples to calculate the upper confidence bound
      * The bound represents the statistical potential for reward
      * Each round, the slot machine with the highest potential reward is chosen (a dual benefit)
        * It is the option for which exploration would yield the most actionable new information
        * It is also likely that such a slot machine has already ranked highly among our winners, if not at the top
    * Most popular form: UCB1
      * Uses the Chernoff-Hoeffding inequality to achieve much faster confidence bound calculations for arbitrary distributions
  * Bandit variants
    * Infinite Arms (theoretical)
    * Variable Arms
      * Standard bandits presume that the distribution of rewards from each slot machine is constant
      * In the real world, this might change over time (non-stationary rewards)
    * Contextual Bandits
      * What if the rewards were dependent on some other external variables that we could also measure?
      * Exploit correlations between known states and reward expectations
    * Combinatorial Bandits
    * Dueling Bandits
    * Continuous Bandits
    * Adversarial Arms
    * Strategic Arms
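
The epsilon-greedy strategy described above can be sketched in a few lines. The three arms and their win probabilities are invented for illustration; the key line is the coin flip that splits each pull into explore (random arm) or exploit (best estimated arm).

```python
import random

random.seed(1)  # seeded so the simulation is reproducible

# Hypothetical 3-armed bandit; true win probabilities unknown to the player.
TRUE_PROBS = [0.2, 0.5, 0.7]

def pull(arm):
    return 1.0 if random.random() < TRUE_PROBS[arm] else 0.0

counts = [0] * len(TRUE_PROBS)
totals = [0.0] * len(TRUE_PROBS)

def epsilon_greedy(epsilon=0.1):
    if random.random() < epsilon:            # 10%: explore a random arm
        return random.randrange(len(TRUE_PROBS))
    # 90%: exploit the arm with the highest estimated mean so far
    # (unpulled arms score infinity, so every arm gets tried at least once)
    return max(range(len(TRUE_PROBS)),
               key=lambda a: totals[a] / counts[a] if counts[a] else float("inf"))

for _ in range(5_000):
    arm = epsilon_greedy()
    counts[arm] += 1
    totals[arm] += pull(arm)

print(counts)  # the highest-probability arm accumulates most of the pulls
```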
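
UCB1 can be sketched under the same invented three-armed setup. Instead of a coin flip, each arm is scored by its empirical mean plus an exploration bonus sqrt(2 ln t / n) derived from the Chernoff-Hoeffding bound; the bonus shrinks as an arm is pulled more, so exploration focuses on uncertain arms automatically.

```python
import math
import random

random.seed(2)  # seeded so the simulation is reproducible

# Same hypothetical 3-armed bandit used for illustration.
TRUE_PROBS = [0.2, 0.5, 0.7]

def pull(arm):
    return 1.0 if random.random() < TRUE_PROBS[arm] else 0.0

n_arms = len(TRUE_PROBS)
counts = [0] * n_arms
totals = [0.0] * n_arms

for t in range(1, 5_001):
    if t <= n_arms:
        arm = t - 1                          # initialize: pull each arm once
    else:
        # UCB1 score = empirical mean + exploration bonus.
        # The arm with the highest *upper bound* is chosen each round.
        arm = max(range(n_arms),
                  key=lambda a: totals[a] / counts[a]
                                + math.sqrt(2 * math.log(t) / counts[a]))
    totals[arm] += pull(arm)
    counts[arm] += 1

print(counts)  # suboptimal arms get only logarithmically many pulls
```

Note the "dual benefit" mentioned above in action: the chosen arm either has a genuinely high mean or a wide confidence interval worth shrinking, and either way the pull is useful.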
