# Contexual Bandits and Reinforcement Learning

* Contextual bandits
  * Can think of it as an extension of multi-armed bandits, or as a simplified version of reinforcement learning&#x20;
* **The multi-armed bandit** algorithm outputs an action but doesn’t use any information about the state of the environment (context).
* **The contextual bandit** extends the model by making the decision conditional on the state of the environment.
  * With such model, you not only optimize decision based on previous observations, but you also personalize decisions for every situation.
  * The algorithm&#x20;
    * observes a context
    * makes a decision, choosing one action from a number of alternative actions
    * observes an outcome of that decision.&#x20;
    * An outcome defines a reward&#x20;
  * The goal is to maximize average reward.

![](/files/-Me0Nz-Fk21AA359ceYx)

#### Compared with Reinforcement learning?

![](/files/-Me0P3XtNBgJtbG0Vs7X)

* Can think about RL as an extension of contextual bandits&#x20;
* You still have an agent (policy) that takes action based on the state of the environment, observes a reward. The difference is that the agent can take multiple consecutive actions, and reward information is sparse.&#x20;
  * The model will use the state of the chess board as context, will decide on which moves to make, but it will get a reward only at the very end of the game: win, loss, or draw.&#x20;
  * The sparsity of reward information makes it harder to train the model. You encounter a problem of credit assignment problem: how to assign credit or blame individual actions.

### Multi-armed bandit problem

* Exploration v.s Exploitation&#x20;
  * Limited resources allocated towards different choices&#x20;
  * Gathering information about the choices v.s using that information to maximize rewards&#x20;
* Stochastic (include randomness)&#x20;
* Class&#x20;
  * Epsilon strategies: use epsilon as the threshold&#x20;
    * Exploration phase followed by exploitation stage&#x20;
    * Large enough to gather enough information to accurately reveal the winner, but small enough that we don't waste too much time on exploring&#x20;
    * Epsilon-Greedy&#x20;
      * Make a choice before we pull any lever&#x20;
      * 90% percent of time to pull, 10% to pull random&#x20;
      * Throughout the run, divide explore / exploit. Flexibility to switch
  * UCB&#x20;
    * More time explore, more certain (confidence interval)&#x20;
      * 0-10, 4-6&#x20;
      * 4-6: higher confidence in fewer trails&#x20;
    * Keeps track of which option would benefit more or less from exploration and exploit accordingly&#x20;
    * Takes samples of each option's rewards, it consider the variance of the observed samples to also calculate the upper confidence bound&#x20;
      * Bound: statistical potential for reward&#x20;
      * Each round, the slot machine with the highest potential reward is chosen (dual benefit)
        * The option for which exploration would yield the most actionable new information&#x20;
        * Likely that such a slot machine has already ranked highly among our winners, if not the top&#x20;
    * Most popular form: UCB1&#x20;
      * Uses Chernoff-Hoeffding Inequality to achieve much faster confidence bounds calculation for arbitrary distributions&#x20;
  * Bandit variants&#x20;
    * Infinite Arms (theoretically)&#x20;
    * Variable Arms&#x20;
      * Presume that distribution of rewards from each slot machine is constant (?)
      * In real world, this might change over time&#x20;
    * Contextual Bandits&#x20;
      * What if the rewards were dependent on some other external variables that we could also measure&#x20;
      * Correlations between known states and reward expectations&#x20;
    * Combinatorial Bandits&#x20;
    * Dueling Bandits&#x20;
    * Continuous Bandits&#x20;
    * Adversarial Arms&#x20;
    * Strategic Arms&#x20;


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://sliu583.gitbook.io/blog/specific-work/wisr-group/cache/index/contexual-bandits-and-reinforcement-learning.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
