Exploring Data

Written by Todd Gureckis and Brenden Lake

So far in this course we’ve learned about using Jupyter and Python. We have also used Pandas dataframes and learned a bit about plotting data.

In this homework, we are pulling those skills into practice to try to “explore” a data set. At this point we are not really going to be doing much in terms of inferential statistics. Our goals are just to be able to formulate a question and then try to take the steps necessary to compute descriptive statistics and plots that might give us a sense of the answer. You might call it “Answering Questions with Data.”

First, we load the basic packages we will be using in this tutorial. Remeber how we import the modules using an abbreviated name (import XC as YY). This is to reduce the amount of text we type when we use the functions.

Note: %matplotlib inline is an example of something specific to Jupyter call ‘cell magic’ and enables plotting within the notebook and not opening a separate window.

%matplotlib inline 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
from datetime import datetime
import random

Reminders of basic pandas functions

As a reminder of some of the pandas basics, let’s revisit the data set of professor salaries we have considered over the last couple of classes.

salary_data = pd.read_csv('salary.csv', sep = ',', header='infer')

Notice that the name of the dataframe is now called salary_data instead of df. It could have been called anything as it is just our variable to work with. However, you’ll want to be careful with your nameing when copying commands and stuff from the past.

peek at the data

salary_data.head()

Get the size of the dataframe

In rows, columns format

salary_data.shape

Access a single column

salary_data[['salary']]

Compute some descriptive statistics

salary_data[['salary']].describe()
salary_data[['salary']].count()  # how many rows are there?

creating new column based on the values of others

salary_data['pubperyear'] = 0
salary_data['pubperyear'] = salary_data['publications']/salary_data['years']
salary_data.head()

Selecting only certain rows

sub_df=salary_data.query('salary > 90000 & salary < 100000')
sub_df
sub_df.describe()

Get the unique values of a columns

salary_data['departm'].unique()

How many unique department values are there?

salary_data['departm'].unique().size

or

len(salary_data['departm'].unique())

Breaking the data into subgroups

male_df = salary_data.query('gender == 0').reset_index(drop=True)
female_df = salary_data.query('gender == 1').reset_index(drop=True)

Recombining subgroups

pd.concat([female_df, male_df],axis = 0).reset_index(drop=True)

Scatter plot two columns

ax = sns.regplot(x=salary_data.age, y=salary_data.salary)
ax.set_title('Salary and age')

Histogram of a column

sns.displot(salary_data['salary'])

You can also combine two different histograms on the same plot to compared them more easily.

fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(10,3),sharey=True,sharex=True)
sns.histplot(male_df["salary"], ax=ax1,
             bins=range(male_df["salary"].min(),male_df["salary"].max(), 5000),
             kde=False,
             color="b")
ax1.set_title("Salary distribution for males")
sns.histplot(female_df["salary"], ax=ax2,
             bins=range(female_df["salary"].min(),female_df["salary"].max(), 5000),
             kde=False,
             color="r")
ax2.set_title("Salary distribution for females")
ax1.set_ylabel("Number of users in age group")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

Group the salary column using the department name and compute the mean

salary_data.groupby('departm')['salary'].mean()

Group the age column using the department name and compute the modal age of the faculty

First let’s check the age of everyone.

salary_data.age.unique()

Ok, there are a few people who don’t have an age so we’ll need to drop them using .dropna() before computing the mode.

Since there doesn’t seem to me a mode function provided by default we can write our own custom function and use it as a descriptive statistics using the .apply() command. Here is an example of how that works.

def my_mode(my_array):
    counts = np.bincount(my_array)
    mode = np.argmax(counts)
    return mode, counts[mode]

# wee need to drop the 
salary_data.dropna().groupby('departm')['age'].apply(my_mode)

OkCupid Data

The next set of analyses concerns a large, public dataset of OKCupid profile information (from 60000 profiles). This is a little different than your traditional data from a psychological study, but it’s still behavioral data and we thought it would be interesting to explore.

This dataset has been published with permission in the Journal of Statistics Education, Volume 23, Number 2 (2015) by Albert Y. Kim et al.

profile_data = pd.read_csv('https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles_revised.csv.zip?raw=true', compression='zip')

Large-scale behavioral dataset of risky choice

Next, we will work with a massive dataset regarding human decision making under risky choice. This comes from a recent paper by Peterson et al., which I suggest you read before continuing:

The following notebook are adapted from this github repository which contains the Choices 13K dataset collected in the paper.

Data Loading

First, let’s load the dataset and do some pre-processing. Don’t worry too much about what the following code is doing, but the key dataframe you need afterwards is called df and shown below.

c13k = pd.read_csv('https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_selections.csv')
c13k_problems = pd.read_json("https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_problems.json", orient='index')
df = c13k.join(c13k_problems, how="left")
df
Problem Feedback n Block Ha pHa La Hb pHb Lb LotShapeB LotNumB Amb Corr bRate bRate_std B A
0 1 True 15 2 26 0.95 -1 23 0.05 21 0 1 False 0 0.626667 0.384460 [[0.9500000000000001, 21.0], [0.05, 23.0]] [[0.9500000000000001, 26.0], [0.05, -1.0]]
1 2 True 15 4 14 0.60 -18 8 0.25 -5 0 1 True -1 0.493333 0.413118 [[0.75, -5.0], [0.25, 8.0]] [[0.6000000000000001, 14.0], [0.4, -18.0]]
2 3 True 17 4 2 0.50 0 1 1.00 1 0 1 False 0 0.611765 0.432843 [[1.0, 1.0]] [[0.5, 2.0], [0.5, 0.0]]
3 4 True 18 3 37 0.05 8 87 0.25 -31 1 2 False 0 0.222222 0.387383 [[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]] [[0.05, 37.0], [0.9500000000000001, 8.0]]
4 5 False 15 1 26 1.00 26 45 0.75 -36 2 5 False 0 0.586667 0.450185 [[0.25, -36.0], [0.375, 41.0], [0.1875, 43.0],... [[1.0, 26.0], [0.0, 26.0]]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14563 13002 True 15 3 30 1.00 30 42 0.80 0 0 1 True 0 0.367619 0.302731 [[0.199999999999999, 0.0], [0.8, 42.0]] [[1.0, 30.0], [0.0, 30.0]]
14564 13003 True 15 5 70 0.50 -42 18 0.80 7 0 1 False 0 0.760000 0.364104 [[0.199999999999999, 7.0], [0.8, 18.0]] [[0.5, 70.0], [0.5, -42.0]]
14565 13004 True 15 5 8 0.40 -17 31 0.40 -34 1 6 False 0 0.666667 0.367747 [[0.6000000000000001, -34.0], [0.0125, 28.5], ... [[0.4, 8.0], [0.6000000000000001, -17.0]]
14566 13005 True 15 2 89 0.50 -49 45 0.50 -12 0 1 False 0 0.386667 0.381476 [[0.5, -12.0], [0.5, 45.0]] [[0.5, 89.0], [0.5, -49.0]]
14567 13006 True 15 3 25 0.80 25 70 0.50 -4 0 1 True 0 0.626667 0.361478 [[0.5, -4.0], [0.5, 70.0]] [[0.8, 25.0], [0.199999999999999, 25.0]]

14568 rows × 18 columns

Understanding the dataframe

The rows in df represent many different problem types (labeled with their problem ID), which provides participants with a choice between two risky Gambles A and B. The paper ran a massive experiment that studies over different 13,000 problems… wow!

Many different human participants were shown each problem type. The proportion of participants that choose Gamble B over Gamble A is listed in the column bRate. The information shown in each column is described in the table below.

For our purposes, the most important columns will be Ha, pHa, La, Hb, pHb, Lb, and bRate.


Column

Data Type

Description

Problem

Integer

A unique internal problem ID

Feedback

Boolean

Whether the participants were given feedback about the reward they received and missed out on after making their
selection.

n

Integer

The number of subjects run on the current problem

Block

{1, 2, 3, 4, 5}

The “block ID” for the current problem. For problems with no feedback, Block is always 1. Otherwise, Block ID was
sampled uniformly at random from {2, 3, 4, 5}.

Ha

Integer

The first outcome of gamble A

                                  |

| pHa | Float | The probability of Ha in gamble A | | La | Integer | The second outcome of gamble A (occurs with probability 1-pHa). | | Hb | Integer | The expected value of the good outcomes in gamble B | | pHb | Float | The probability of outcome Hb in gamble B | | Lb | Float | The bad outcome for gamble B (occurs with probability 1-pHb) | | LotShapeB | {0, 1, 2, 3} | The shape of the lottery distribution for gamble B. A value of 1 indicates the distribution is symmetric around its mean,
2 indicates right-skewed, 3 indicates left-skewed, and 0 indicates undefined (i.e., if LotNumB = 1). | | LotNumB | Integer | The number of possible outcomes in the gamble B lottery | | Amb | Boolean | Whether the decision maker was able to see the probabilities of the outcomes in Gamble B. 1 indicates the participant
received no information concerning the probabilities, and 0 implies complete information and no ambiguity. | | Corr | {-1, 0, 1} | Whether there is a correlation (negative, zero, or positive) between the payoffs of the two gambles. | | bRate | Float | The ratio of gamble B selections to total selections for MTurk subjects. | | bRate_std | Float | The standard deviation of the ratio of gamble B to total selections for MTurk subjects. | | A | list | The different outcomes in Gamble A, [[prob1, outcome1], [prob2, outcome2],…] | | B | list | The different outcomes in Gamble B, [[prob1, outcome1], [prob2, outcome2],…]

Example risk gambles

Next, we’ll show how to display one of the gambles in a more human-readable format. Let’s check out the very first gamble in the list. First, we’ll need to define a function print_problem to help us format the problem nicely.

def print_problem(problem_df, problem_index):
    """
    Print a slightly more readable representation of the gamble information 
    for the gamble associated with index `problem_index` in `problem_df`.
    """
    entry = problem_df.loc[problem_index]
    gA, gB = entry.A, entry.B
    gambleA = pd.DataFrame(gA, columns=["Probability", "Payout"]).sort_values("Probability", ascending=False)
    gambleB = pd.DataFrame(gB, columns=["Probability", "Payout"]).sort_values("Probability", ascending=False)
    linesA = gambleA[["Payout", "Probability"]].to_string().split("\n")
    linesB = gambleB[["Payout", "Probability"]].to_string().split("\n")
    
    cols = ["Problem", "Feedback", "n", "Block", "bRate", "bRate_std"]
    print("{:^55}".format(f"Problem {entry.Problem}, Feedback = {entry.Feedback}"))
    print("{:^55}".format(f"n = {entry.n}, bRate = {entry.bRate:.4f}, std: {entry.bRate_std:.4f}"))
    print(f"\n{'Gamble A':^25} {'Gamble B':>20}")
    for i in range(max(len(linesA), len(linesB))):
        a_str = "" if i >= len(linesA) else linesA[i]
        b_str = "" if i >= len(linesB) else linesB[i]
        print(f"{a_str:<25} {b_str:>25}")

Now let’s visualize the first problem in the dataframe df.

print_problem(df,0)
df.iloc[[0]]
              Problem 1, Feedback = True               
          n = 15, bRate = 0.6267, std: 0.3845          

        Gamble A                      Gamble B
   Payout  Probability          Payout  Probability
0    26.0         0.95       0    21.0         0.95
1    -1.0         0.05       1    23.0         0.05
Problem Feedback n Block Ha pHa La Hb pHb Lb LotShapeB LotNumB Amb Corr bRate bRate_std B A
0 1 True 15 2 26 0.95 -1 23 0.05 21 0 1 False 0 0.626667 0.38446 [[0.9500000000000001, 21.0], [0.05, 23.0]] [[0.9500000000000001, 26.0], [0.05, -1.0]]

For problem 1, this means there were 15 participants that made a selection for the problem. About 62% chose Gamble B (bRate), and the payout structure of each gamble is shown too. People preferred Gamble B which paid out 21 dollars with probability 0.95, and 23 dollars with probability 0.05.

Other problems have a more complicated structure for Gamble B, as you can see in Problem 4 below (index 3). Here, there are 3 potential payouts for problem B.

In the dataset generally, Gamble A always has just 2 outcomes, where Ha is the payout of the good outcome (which one gets with probability pHa) and La is the bad outcome (with probability 1-pHa). Gamble B, alternatively, can more than 2 outcomes (in the case below, it has 3). Hb is the expected value (weighted average) of the good outcomes (in the case below, 87), and Lb is the bad outcome (-31). The probability of getting one of the better outcomes in B is probability pHb and the bad outcome is probability 1-pHb.

print_problem(df,3)
df.iloc[[3]]
              Problem 4, Feedback = True               
          n = 18, bRate = 0.2222, std: 0.3874          

        Gamble A                      Gamble B
   Payout  Probability          Payout  Probability
1     8.0         0.95       0   -31.0        0.750
0    37.0         0.05       1    86.5        0.125
                             2    87.5        0.125
Problem Feedback n Block Ha pHa La Hb pHb Lb LotShapeB LotNumB Amb Corr bRate bRate_std B A
3 4 True 18 3 37 0.05 8 87 0.25 -31 1 2 False 0 0.222222 0.387383 [[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]] [[0.05, 37.0], [0.9500000000000001, 8.0]]

Maximizing expected value

A classic economic theory is that people make choices to maximize the expected payoff. The theory posits that people are rational actors, making the choices (in this case, gambles) with the highest expected value. By higher expected value, we mean the gamble will pay out more money in the long run if it was chosen over and over again.

Here is the equation for the expected value of Gamble A:

\[ EV_A = Ha * pHa + La * (1-pHa) \]

Similarly, here is the equation for the expected value of Gamble B:

\[ EV_B = Hb * pHb + Lb * (1-pHb) \]

Hint: check out pandas functions argmin and argmax.