{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring Data\n",
    "\n",
    "*Written by Todd Gureckis and Brenden Lake*\n",
    "\n",
    "So far in this course we've learned about using Jupyter and Python. We have also used Pandas dataframes and learned a bit about plotting data.\n",
    "\n",
    "In this homework, we are pulling those skills into practice to try to \"explore\" a data set.  At this point we are not really going to be doing much in terms of inferential statistics.  Our goals are just to be able to formulate a question and then try to take the steps necessary to compute descriptive statistics and plots that might give us a sense of the answer.  You might call it \"Answering Questions with Data.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we load the basic packages we will be using in this tutorial.  Remeber how we import the modules using an abbreviated name (`import XC as YY`).  This is to reduce the amount of text we type when we use the functions.\n",
    "\n",
    "**Note**: `%matplotlib inline` is an example of something specific to Jupyter call 'cell magic' and enables plotting *within* the notebook and not opening a separate window."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-04-12T04:17:24.542921Z",
     "start_time": "2019-04-12T04:17:23.189489Z"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline \n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import io\n",
    "from datetime import datetime\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reminders of basic pandas functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder of some of the pandas basics, let's revisit the data set of professor salaries we have considered over the last couple of classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data = pd.read_csv('salary.csv', sep = ',', header='infer')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the name of the dataframe is now called `salary_data` instead of `df`.  It could have been called anything as it is just our variable to work with.  However, you'll want to be careful with your nameing when copying commands and stuff from the past."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### peek at the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get the size of the dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In rows, columns format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Access a single column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data[['salary']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute some descriptive statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data[['salary']].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data[['salary']].count()  # how many rows are there?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### creating new column based on the values of others"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data['pubperyear'] = 0\n",
    "salary_data['pubperyear'] = salary_data['publications']/salary_data['years']\n",
    "salary_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting only certain rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sub_df=salary_data.query('salary > 90000 & salary < 100000')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sub_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sub_df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get the unique values of a columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data['departm'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many unique department values are there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data['departm'].unique().size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(salary_data['departm'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Breaking the data into subgroups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "male_df = salary_data.query('gender == 0').reset_index(drop=True)\n",
    "female_df = salary_data.query('gender == 1').reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recombining subgroups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.concat([female_df, male_df],axis = 0).reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scatter plot two columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.regplot(x=salary_data.age, y=salary_data.salary)\n",
    "ax.set_title('Salary and age')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Histogram of a column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.displot(salary_data['salary'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also combine two different histograms on the same plot to compared them more easily."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(10,3),sharey=True,sharex=True)\n",
    "sns.histplot(male_df[\"salary\"], ax=ax1,\n",
    "             bins=range(male_df[\"salary\"].min(),male_df[\"salary\"].max(), 5000),\n",
    "             kde=False,\n",
    "             color=\"b\")\n",
    "ax1.set_title(\"Salary distribution for males\")\n",
    "sns.histplot(female_df[\"salary\"], ax=ax2,\n",
    "             bins=range(female_df[\"salary\"].min(),female_df[\"salary\"].max(), 5000),\n",
    "             kde=False,\n",
    "             color=\"r\")\n",
    "ax2.set_title(\"Salary distribution for females\")\n",
    "ax1.set_ylabel(\"Number of users in age group\")\n",
    "for ax in (ax1,ax2):\n",
    "    sns.despine(ax=ax)\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Group the salary column using the department name and compute the mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data.groupby('departm')['salary'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Group the age column using the department name and compute the modal age of the faculty"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's check the age of everyone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_data.age.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, there are a few people who don't have an age so we'll need to drop them using `.dropna()` before computing the mode."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there doesn't seem to me a mode function provided by default we can write our own custom function and use it as a descriptive statistics using the `.apply()` command.  Here is an example of how that works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def my_mode(my_array):\n",
    "    counts = np.bincount(my_array)\n",
    "    mode = np.argmax(counts)\n",
    "    return mode, counts[mode]\n",
    "\n",
    "# wee need to drop the \n",
    "salary_data.dropna().groupby('departm')['age'].apply(my_mode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OkCupid Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next set of analyses concerns a large, public dataset of OKCupid profile information (from 60000 profiles). This is a little different than your traditional data from a psychological study, but it's still behavioral data and we thought it would be interesting to explore.\n",
    "\n",
    "This dataset has been [published](https://www.tandfonline.com/doi/abs/10.1080/10691898.2015.11889737) with permission in the [Journal of Statistics Education](https://www.tandfonline.com/toc/ujse20/current), Volume 23, Number 2 (2015) by Albert Y. Kim et al. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile_data = pd.read_csv('https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles_revised.csv.zip?raw=true', compression='zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 1</strong> <br>\n",
    "    Plot a historam of the distribution of ages for both men and women on OkCupid.  Provide a brief (1-2 sentence summary of what you see).\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "  Delete this text and put your answer here.  The code for your analysis should appear in the cells below.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 2</strong> <br>\n",
    "    Find the mean, median and modal age for men and women in this dataset.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "  Delete this text and put your answer here.  The code for your analysis should appear in the cells below.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 3</strong> <br>\n",
    "    Plot a histogram of the height of women and men in this dataset.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "  Delete this text and put your answer here.  The code for your analysis should appear in the cells below.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 4</strong> <br>\n",
    "    Propose a new type of analysis you'd like to see.  Work with the instructor or TA to try to accomplish it.  Be reasonable!\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "  Delete this text and put your answer here.  The code for your analysis should appear in the cells below.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Large-scale behavioral dataset of risky choice\n",
    "\n",
    "Next, we will work with a massive dataset regarding human decision making under risky choice. This comes from a recent paper by Peterson et al., which __I suggest you read__ before continuing:\n",
    "\n",
    "- Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D., & Griffiths, T. L. (2021). [Using large-scale experiments and machine learning to discover theories of human decision-making](http://cocosci.princeton.edu/jpeterson/papers/peterson2021-science.pdf). _Science(372):1209-1214_.\n",
    "\n",
    "The following notebook are adapted from this [github repository](https://github.com/jcpeterson/choices13k) which contains the Choices 13K dataset collected in the paper.\n",
    "\n",
    "### Data Loading\n",
    "\n",
    "First, let's load the dataset and do some pre-processing. Don't worry too much about what the following code is doing, but the key dataframe you need afterwards is called `df` and shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Problem</th>\n",
       "      <th>Feedback</th>\n",
       "      <th>n</th>\n",
       "      <th>Block</th>\n",
       "      <th>Ha</th>\n",
       "      <th>pHa</th>\n",
       "      <th>La</th>\n",
       "      <th>Hb</th>\n",
       "      <th>pHb</th>\n",
       "      <th>Lb</th>\n",
       "      <th>LotShapeB</th>\n",
       "      <th>LotNumB</th>\n",
       "      <th>Amb</th>\n",
       "      <th>Corr</th>\n",
       "      <th>bRate</th>\n",
       "      <th>bRate_std</th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>0.95</td>\n",
       "      <td>-1</td>\n",
       "      <td>23</td>\n",
       "      <td>0.05</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.626667</td>\n",
       "      <td>0.384460</td>\n",
       "      <td>[[0.9500000000000001, 21.0], [0.05, 23.0]]</td>\n",
       "      <td>[[0.9500000000000001, 26.0], [0.05, -1.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>14</td>\n",
       "      <td>0.60</td>\n",
       "      <td>-18</td>\n",
       "      <td>8</td>\n",
       "      <td>0.25</td>\n",
       "      <td>-5</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>-1</td>\n",
       "      <td>0.493333</td>\n",
       "      <td>0.413118</td>\n",
       "      <td>[[0.75, -5.0], [0.25, 8.0]]</td>\n",
       "      <td>[[0.6000000000000001, 14.0], [0.4, -18.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>True</td>\n",
       "      <td>17</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.611765</td>\n",
       "      <td>0.432843</td>\n",
       "      <td>[[1.0, 1.0]]</td>\n",
       "      <td>[[0.5, 2.0], [0.5, 0.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>True</td>\n",
       "      <td>18</td>\n",
       "      <td>3</td>\n",
       "      <td>37</td>\n",
       "      <td>0.05</td>\n",
       "      <td>8</td>\n",
       "      <td>87</td>\n",
       "      <td>0.25</td>\n",
       "      <td>-31</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.222222</td>\n",
       "      <td>0.387383</td>\n",
       "      <td>[[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]]</td>\n",
       "      <td>[[0.05, 37.0], [0.9500000000000001, 8.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "      <td>15</td>\n",
       "      <td>1</td>\n",
       "      <td>26</td>\n",
       "      <td>1.00</td>\n",
       "      <td>26</td>\n",
       "      <td>45</td>\n",
       "      <td>0.75</td>\n",
       "      <td>-36</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.586667</td>\n",
       "      <td>0.450185</td>\n",
       "      <td>[[0.25, -36.0], [0.375, 41.0], [0.1875, 43.0],...</td>\n",
       "      <td>[[1.0, 26.0], [0.0, 26.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14563</th>\n",
       "      <td>13002</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>3</td>\n",
       "      <td>30</td>\n",
       "      <td>1.00</td>\n",
       "      <td>30</td>\n",
       "      <td>42</td>\n",
       "      <td>0.80</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>0.367619</td>\n",
       "      <td>0.302731</td>\n",
       "      <td>[[0.199999999999999, 0.0], [0.8, 42.0]]</td>\n",
       "      <td>[[1.0, 30.0], [0.0, 30.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14564</th>\n",
       "      <td>13003</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>5</td>\n",
       "      <td>70</td>\n",
       "      <td>0.50</td>\n",
       "      <td>-42</td>\n",
       "      <td>18</td>\n",
       "      <td>0.80</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.760000</td>\n",
       "      <td>0.364104</td>\n",
       "      <td>[[0.199999999999999, 7.0], [0.8, 18.0]]</td>\n",
       "      <td>[[0.5, 70.0], [0.5, -42.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14565</th>\n",
       "      <td>13004</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>0.40</td>\n",
       "      <td>-17</td>\n",
       "      <td>31</td>\n",
       "      <td>0.40</td>\n",
       "      <td>-34</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.666667</td>\n",
       "      <td>0.367747</td>\n",
       "      <td>[[0.6000000000000001, -34.0], [0.0125, 28.5], ...</td>\n",
       "      <td>[[0.4, 8.0], [0.6000000000000001, -17.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14566</th>\n",
       "      <td>13005</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>2</td>\n",
       "      <td>89</td>\n",
       "      <td>0.50</td>\n",
       "      <td>-49</td>\n",
       "      <td>45</td>\n",
       "      <td>0.50</td>\n",
       "      <td>-12</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.386667</td>\n",
       "      <td>0.381476</td>\n",
       "      <td>[[0.5, -12.0], [0.5, 45.0]]</td>\n",
       "      <td>[[0.5, 89.0], [0.5, -49.0]]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14567</th>\n",
       "      <td>13006</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>3</td>\n",
       "      <td>25</td>\n",
       "      <td>0.80</td>\n",
       "      <td>25</td>\n",
       "      <td>70</td>\n",
       "      <td>0.50</td>\n",
       "      <td>-4</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>0.626667</td>\n",
       "      <td>0.361478</td>\n",
       "      <td>[[0.5, -4.0], [0.5, 70.0]]</td>\n",
       "      <td>[[0.8, 25.0], [0.199999999999999, 25.0]]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>14568 rows × 18 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Problem  Feedback   n  Block  Ha   pHa  La  Hb   pHb  Lb  LotShapeB  \\\n",
       "0            1      True  15      2  26  0.95  -1  23  0.05  21          0   \n",
       "1            2      True  15      4  14  0.60 -18   8  0.25  -5          0   \n",
       "2            3      True  17      4   2  0.50   0   1  1.00   1          0   \n",
       "3            4      True  18      3  37  0.05   8  87  0.25 -31          1   \n",
       "4            5     False  15      1  26  1.00  26  45  0.75 -36          2   \n",
       "...        ...       ...  ..    ...  ..   ...  ..  ..   ...  ..        ...   \n",
       "14563    13002      True  15      3  30  1.00  30  42  0.80   0          0   \n",
       "14564    13003      True  15      5  70  0.50 -42  18  0.80   7          0   \n",
       "14565    13004      True  15      5   8  0.40 -17  31  0.40 -34          1   \n",
       "14566    13005      True  15      2  89  0.50 -49  45  0.50 -12          0   \n",
       "14567    13006      True  15      3  25  0.80  25  70  0.50  -4          0   \n",
       "\n",
       "       LotNumB    Amb  Corr     bRate  bRate_std  \\\n",
       "0            1  False     0  0.626667   0.384460   \n",
       "1            1   True    -1  0.493333   0.413118   \n",
       "2            1  False     0  0.611765   0.432843   \n",
       "3            2  False     0  0.222222   0.387383   \n",
       "4            5  False     0  0.586667   0.450185   \n",
       "...        ...    ...   ...       ...        ...   \n",
       "14563        1   True     0  0.367619   0.302731   \n",
       "14564        1  False     0  0.760000   0.364104   \n",
       "14565        6  False     0  0.666667   0.367747   \n",
       "14566        1  False     0  0.386667   0.381476   \n",
       "14567        1   True     0  0.626667   0.361478   \n",
       "\n",
       "                                                       B  \\\n",
       "0             [[0.9500000000000001, 21.0], [0.05, 23.0]]   \n",
       "1                            [[0.75, -5.0], [0.25, 8.0]]   \n",
       "2                                           [[1.0, 1.0]]   \n",
       "3          [[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]]   \n",
       "4      [[0.25, -36.0], [0.375, 41.0], [0.1875, 43.0],...   \n",
       "...                                                  ...   \n",
       "14563            [[0.199999999999999, 0.0], [0.8, 42.0]]   \n",
       "14564            [[0.199999999999999, 7.0], [0.8, 18.0]]   \n",
       "14565  [[0.6000000000000001, -34.0], [0.0125, 28.5], ...   \n",
       "14566                        [[0.5, -12.0], [0.5, 45.0]]   \n",
       "14567                         [[0.5, -4.0], [0.5, 70.0]]   \n",
       "\n",
       "                                                A  \n",
       "0      [[0.9500000000000001, 26.0], [0.05, -1.0]]  \n",
       "1      [[0.6000000000000001, 14.0], [0.4, -18.0]]  \n",
       "2                        [[0.5, 2.0], [0.5, 0.0]]  \n",
       "3       [[0.05, 37.0], [0.9500000000000001, 8.0]]  \n",
       "4                      [[1.0, 26.0], [0.0, 26.0]]  \n",
       "...                                           ...  \n",
       "14563                  [[1.0, 30.0], [0.0, 30.0]]  \n",
       "14564                 [[0.5, 70.0], [0.5, -42.0]]  \n",
       "14565   [[0.4, 8.0], [0.6000000000000001, -17.0]]  \n",
       "14566                 [[0.5, 89.0], [0.5, -49.0]]  \n",
       "14567    [[0.8, 25.0], [0.199999999999999, 25.0]]  \n",
       "\n",
       "[14568 rows x 18 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c13k = pd.read_csv('https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_selections.csv')\n",
    "c13k_problems = pd.read_json(\"https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_problems.json\", orient='index')\n",
    "df = c13k.join(c13k_problems, how=\"left\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Understanding the dataframe\n",
    "\n",
    "The rows in  `df` represent many different problem types (labeled with their problem ID), which provides participants with a choice between two risky Gambles A and B. The paper ran a massive experiment that studies over different 13,000 problems... wow!\n",
    "\n",
    "Many different human participants were shown each problem type. The  proportion of participants that choose Gamble B over Gamble A is listed in the column `bRate`. The information shown in each column is described in the table below.\n",
    "\n",
    "For our purposes, the most important columns will be `Ha`, `pHa`, `La`, `Hb`, `pHb`, `Lb`, and `bRate`.\n",
    "\n",
    "<br>\n",
    "\n",
    "<center>\n",
    "\n",
    "|   Column  |    Data Type    | Description                                                                                                                                                                                                                      |\n",
    "|:---------:|:---------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
    "|  Problem  |     Integer     | A unique internal problem ID                                                                                                                                                                                                    |\n",
    "|  Feedback |     Boolean     | Whether the participants were given feedback about the reward they received and missed out on after making their<br>selection.                                                                                                       |\n",
    "|     n     |     Integer     | The number of subjects run on the current problem                                                                                                                                                                                |\n",
    "|   Block   | {1, 2, 3, 4, 5} | The \"block ID\" for the current problem. For problems with no feedback, Block is always 1. Otherwise, Block ID was<br>sampled uniformly at random from {2, 3, 4, 5}.                                                                 |\n",
    "|     Ha    |     Integer     | The first outcome of gamble A\n",
    "                                      |\n",
    "|    pHa    |      Float      | The probability of Ha in gamble A                                                                                   |\n",
    "|     La    |     Integer     | The second outcome of gamble A (occurs with probability 1-pHa).                                                     |\n",
    "|     Hb    |     Integer     | The expected value of the good outcomes in gamble B                                                                                                                                                                                         |\n",
    "|    pHb    |      Float      | The probability of outcome Hb in gamble B                                                                                                                                                                                             |\n",
    "|     Lb    |      Float      | The bad outcome for gamble B (occurs with probability 1-pHb)                                                                                                                                                                   |\n",
    "| LotShapeB |   {0, 1, 2, 3}  | The shape of the lottery distribution for gamble B. A value of 1 indicates the distribution is symmetric around its mean,<br>2 indicates right-skewed, 3 indicates left-skewed, and 0 indicates undefined (i.e., if LotNumB = 1).    |\n",
    "|  LotNumB  |     Integer     | The number of possible outcomes in the gamble B lottery                                                                                                                                                                               |\n",
    "|    Amb    |     Boolean     | Whether the decision maker was able to see the probabilities of the outcomes in Gamble B. 1 indicates the participant<br>received no information concerning the probabilities, and 0 implies complete information and no ambiguity. |\n",
    "|    Corr   |    {-1, 0, 1}   | Whether there is a correlation (negative, zero, or positive) between the payoffs of the two gambles.                                                                                                                            |\n",
    "|   bRate   |    Float        | The ratio of gamble B selections to total selections for MTurk subjects.                                                                                                                             |\n",
    "| bRate_std |    Float        | The standard deviation of the ratio of gamble B to total selections for MTurk subjects.                                                                                                       |\n",
    "| A |    list   | The different outcomes in Gamble A, [[prob1, outcome1], [prob2, outcome2],...]                                                                                                       |\n",
    "| B |    list   | The different outcomes in Gamble B, [[prob1, outcome1], [prob2, outcome2],...]  \n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example risk gambles\n",
    "\n",
    "Next, we'll show how to display one of the gambles in a more human-readable format. Let's check out the very first gamble in the list. First, we'll need to define a function `print_problem` to help us format the problem nicely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_problem(problem_df, problem_index):\n",
    "    \"\"\"\n",
    "    Print a slightly more readable representation of the gamble information \n",
    "    for the gamble associated with index `problem_index` in `problem_df`.\n",
    "    \"\"\"\n",
    "    entry = problem_df.loc[problem_index]\n",
    "    gA, gB = entry.A, entry.B\n",
    "    gambleA = pd.DataFrame(gA, columns=[\"Probability\", \"Payout\"]).sort_values(\"Probability\", ascending=False)\n",
    "    gambleB = pd.DataFrame(gB, columns=[\"Probability\", \"Payout\"]).sort_values(\"Probability\", ascending=False)\n",
    "    linesA = gambleA[[\"Payout\", \"Probability\"]].to_string().split(\"\\n\")\n",
    "    linesB = gambleB[[\"Payout\", \"Probability\"]].to_string().split(\"\\n\")\n",
    "    \n",
    "    cols = [\"Problem\", \"Feedback\", \"n\", \"Block\", \"bRate\", \"bRate_std\"]\n",
    "    print(\"{:^55}\".format(f\"Problem {entry.Problem}, Feedback = {entry.Feedback}\"))\n",
    "    print(\"{:^55}\".format(f\"n = {entry.n}, bRate = {entry.bRate:.4f}, std: {entry.bRate_std:.4f}\"))\n",
    "    print(f\"\\n{'Gamble A':^25} {'Gamble B':>20}\")\n",
    "    for i in range(max(len(linesA), len(linesB))):\n",
    "        a_str = \"\" if i >= len(linesA) else linesA[i]\n",
    "        b_str = \"\" if i >= len(linesB) else linesB[i]\n",
    "        print(f\"{a_str:<25} {b_str:>25}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's visualize the first problem in the dataframe `df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              Problem 1, Feedback = True               \n",
      "          n = 15, bRate = 0.6267, std: 0.3845          \n",
      "\n",
      "        Gamble A                      Gamble B\n",
      "   Payout  Probability          Payout  Probability\n",
      "0    26.0         0.95       0    21.0         0.95\n",
      "1    -1.0         0.05       1    23.0         0.05\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Problem</th>\n",
       "      <th>Feedback</th>\n",
       "      <th>n</th>\n",
       "      <th>Block</th>\n",
       "      <th>Ha</th>\n",
       "      <th>pHa</th>\n",
       "      <th>La</th>\n",
       "      <th>Hb</th>\n",
       "      <th>pHb</th>\n",
       "      <th>Lb</th>\n",
       "      <th>LotShapeB</th>\n",
       "      <th>LotNumB</th>\n",
       "      <th>Amb</th>\n",
       "      <th>Corr</th>\n",
       "      <th>bRate</th>\n",
       "      <th>bRate_std</th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>15</td>\n",
       "      <td>2</td>\n",
       "      <td>26</td>\n",
       "      <td>0.95</td>\n",
       "      <td>-1</td>\n",
       "      <td>23</td>\n",
       "      <td>0.05</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.626667</td>\n",
       "      <td>0.38446</td>\n",
       "      <td>[[0.9500000000000001, 21.0], [0.05, 23.0]]</td>\n",
       "      <td>[[0.9500000000000001, 26.0], [0.05, -1.0]]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Problem  Feedback   n  Block  Ha   pHa  La  Hb   pHb  Lb  LotShapeB  \\\n",
       "0        1      True  15      2  26  0.95  -1  23  0.05  21          0   \n",
       "\n",
       "   LotNumB    Amb  Corr     bRate  bRate_std  \\\n",
       "0        1  False     0  0.626667    0.38446   \n",
       "\n",
       "                                            B  \\\n",
       "0  [[0.9500000000000001, 21.0], [0.05, 23.0]]   \n",
       "\n",
       "                                            A  \n",
       "0  [[0.9500000000000001, 26.0], [0.05, -1.0]]  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print_problem(df,0)\n",
    "df.iloc[[0]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For problem 1, this means there were 15 participants that made a selection for the problem. About 62% chose Gamble B (__bRate__), and the payout structure of each gamble is shown too. People preferred Gamble B which paid out 21 dollars with probability 0.95, and 23 dollars with probability 0.05.\n",
    "\n",
    "Other problems have a more complicated structure for Gamble B, as you can see in Problem 4 below (index 3). Here, there are 3 potential payouts for problem B.\n",
    "\n",
    "In the dataset generally, Gamble A always has just 2 outcomes, where __Ha__ is the payout of the good outcome (which one gets with probability __pHa__)  and __La__ is the bad outcome (with probability 1-__pHa__). Gamble B, alternatively, can more than 2 outcomes (in the case below, it has 3). __Hb__ is the expected value (weighted average) of the good outcomes (in the case below, 87), and __Lb__ is the bad outcome (-31). The probability of getting one of the better outcomes in B is  probability __pHb__ and the bad outcome is probability 1-__pHb__."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              Problem 4, Feedback = True               \n",
      "          n = 18, bRate = 0.2222, std: 0.3874          \n",
      "\n",
      "        Gamble A                      Gamble B\n",
      "   Payout  Probability          Payout  Probability\n",
      "1     8.0         0.95       0   -31.0        0.750\n",
      "0    37.0         0.05       1    86.5        0.125\n",
      "                             2    87.5        0.125\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Problem</th>\n",
       "      <th>Feedback</th>\n",
       "      <th>n</th>\n",
       "      <th>Block</th>\n",
       "      <th>Ha</th>\n",
       "      <th>pHa</th>\n",
       "      <th>La</th>\n",
       "      <th>Hb</th>\n",
       "      <th>pHb</th>\n",
       "      <th>Lb</th>\n",
       "      <th>LotShapeB</th>\n",
       "      <th>LotNumB</th>\n",
       "      <th>Amb</th>\n",
       "      <th>Corr</th>\n",
       "      <th>bRate</th>\n",
       "      <th>bRate_std</th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>True</td>\n",
       "      <td>18</td>\n",
       "      <td>3</td>\n",
       "      <td>37</td>\n",
       "      <td>0.05</td>\n",
       "      <td>8</td>\n",
       "      <td>87</td>\n",
       "      <td>0.25</td>\n",
       "      <td>-31</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>0.222222</td>\n",
       "      <td>0.387383</td>\n",
       "      <td>[[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]]</td>\n",
       "      <td>[[0.05, 37.0], [0.9500000000000001, 8.0]]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Problem  Feedback   n  Block  Ha   pHa  La  Hb   pHb  Lb  LotShapeB  \\\n",
       "3        4      True  18      3  37  0.05   8  87  0.25 -31          1   \n",
       "\n",
       "   LotNumB    Amb  Corr     bRate  bRate_std  \\\n",
       "3        2  False     0  0.222222   0.387383   \n",
       "\n",
       "                                               B  \\\n",
       "3  [[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]]   \n",
       "\n",
       "                                           A  \n",
       "3  [[0.05, 37.0], [0.9500000000000001, 8.0]]  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print_problem(df,3)\n",
    "df.iloc[[3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Maximizing expected value\n",
    "\n",
    "A classic economic theory is that people make choices to maximize the expected payoff. The theory posits that people are rational actors, making the choices (in this case, gambles) with the highest expected value. By higher expected value, we mean the gamble will pay out more money in the long run if it was chosen over and over again.\n",
    "\n",
    "Here is the equation for the expected value of Gamble A:\n",
    "\n",
    "$$ EV_A = Ha * pHa + La * (1-pHa) $$\n",
    "\n",
    "Similarly, here is the equation for the expected value of Gamble B:\n",
    "\n",
    "$$ EV_B = Hb * pHb + Lb * (1-pHb) $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 5</strong> <br>\n",
    "    Here we are going to evaluate the theory that people make choices consistent with maximizing expected value. We will do so by comparing the theory to the real behavioral data. You will need to do the following:\n",
    " <ul>\n",
    "     <li>Add two new columns <b>EVA</b> and <b>EVB</b> to the dataframe df that compute the expected value of each Gamble A and Gamble B.</li>\n",
    "     <li>Add a new column <b>B_win</b> to df that thresholds <b>bRate</b>>0.5 to indicate whether the majority of participants chose Gamble B</li>\n",
    "  <li>For what percentage of problems was the majority choice consistent with the gamble that has highest expected value? </li>\n",
    " </ul> \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "Your code should appear below.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 6</strong> <br>\n",
    "    Given the results of Question 5, how does the classical economic theory (maximizing expected value) fare as a theory of human decisions under risky choice? Please write a paragraph or two about your thoughts.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "  Delete this text and put your answer here.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\" role=\"alert\">\n",
    "  <strong>Question 7</strong> <br>\n",
    "    <ul> \n",
    "    <li>Write code to find the problem with the most popular Gamble B (there might be ties, don't worry about that. picking any one problem is fine)</li>\n",
    "    <li>Write code to find the problem with the least popular Gamble B</li>\n",
    "    <li>Briefly analyze in a few sentences. Does it match your intuition that the Gamble B in problem 2 is a lot worse than the corresponding Gamble A? Analyze in terms of expected value.</li>\n",
    "    </ul>\n",
    "</div>\n",
    "\n",
    "_Hint_: check out pandas functions `argmin` and `argmax`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\" role=\"alert\">\n",
    "  <strong>Your Answer</strong> <br>\n",
    "  Delete this text and put your answer here.  The code for your analysis should appear in the cells below.\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernel_info": {
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  },
  "nteract": {
   "version": "0.15.0"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {
    "height": "44px",
    "width": "252px"
   },
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "299px"
   },
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}