Exploring Data¶
Written by Todd Gureckis and Brenden Lake
So far in this course we’ve learned about using Jupyter and Python. We have also used Pandas dataframes and learned a bit about plotting data.
In this homework, we are pulling those skills into practice to try to “explore” a data set. At this point we are not really going to be doing much in terms of inferential statistics. Our goals are just to be able to formulate a question and then try to take the steps necessary to compute descriptive statistics and plots that might give us a sense of the answer. You might call it “Answering Questions with Data.”
First, we load the basic packages we will be using in this tutorial. Remember how we import the modules using an abbreviated name (import XC as YY). This is to reduce the amount of text we type when we use the functions.
Note: %matplotlib inline is an example of something specific to Jupyter called ‘cell magic’; it enables plotting within the notebook rather than opening a separate window.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
from datetime import datetime
import random
Reminders of basic pandas functions¶
As a reminder of some of the pandas basics, let’s revisit the data set of professor salaries we have considered over the last couple of classes.
salary_data = pd.read_csv('salary.csv', sep = ',', header='infer')
Notice that the dataframe is now called salary_data instead of df. It could have been called anything, as it is just our variable to work with. However, you’ll want to be careful with your naming when copying commands from earlier notebooks.
Peek at the data¶
salary_data.head()
Access a single column¶
salary_data[['salary']]
Compute some descriptive statistics¶
salary_data[['salary']].describe()
salary_data[['salary']].count() # how many rows are there?
Creating a new column based on the values of others¶
salary_data['pubperyear'] = 0
salary_data['pubperyear'] = salary_data['publications']/salary_data['years']
salary_data.head()
Selecting only certain rows¶
sub_df=salary_data.query('salary > 90000 & salary < 100000')
sub_df
sub_df.describe()
Get the unique values of a column¶
salary_data['departm'].unique()
How many unique department values are there?
salary_data['departm'].unique().size
or
len(salary_data['departm'].unique())
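pandas also has a built-in shortcut for this: Series.nunique() counts the distinct values directly (excluding NaN by default). A quick sketch on a toy column standing in for salary_data['departm']:

```python
import pandas as pd

# Toy stand-in for salary_data['departm']
depts = pd.Series(['bio', 'chem', 'bio', 'geol', 'chem'])

print(depts.nunique())      # 3
print(len(depts.unique()))  # 3 (same answer when there are no missing values)
```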
Breaking the data into subgroups¶
male_df = salary_data.query('gender == 0').reset_index(drop=True)
female_df = salary_data.query('gender == 1').reset_index(drop=True)
Recombining subgroups¶
pd.concat([female_df, male_df],axis = 0).reset_index(drop=True)
Scatter plot two columns¶
ax = sns.regplot(x=salary_data.age, y=salary_data.salary)
ax.set_title('Salary and age')
Histogram of a column¶
sns.displot(salary_data['salary'])
You can also combine two different histograms on the same plot to compare them more easily.
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 3), sharey=True, sharex=True)
sns.histplot(male_df["salary"], ax=ax1,
             bins=range(male_df["salary"].min(), male_df["salary"].max(), 5000),
             kde=False,
             color="b")
ax1.set_title("Salary distribution for males")
sns.histplot(female_df["salary"], ax=ax2,
             bins=range(female_df["salary"].min(), female_df["salary"].max(), 5000),
             kde=False,
             color="r")
ax2.set_title("Salary distribution for females")
ax1.set_ylabel("Number of faculty in salary bin")
for ax in (ax1, ax2):
    sns.despine(ax=ax)
fig.tight_layout()
Group the salary column using the department name and compute the mean¶
salary_data.groupby('departm')['salary'].mean()
Group the age column using the department name and compute the modal age of the faculty¶
First let’s check the age of everyone.
salary_data.age.unique()
Ok, there are a few people who don’t have an age, so we’ll need to drop them using .dropna() before computing the mode.
Since there doesn’t seem to be a mode function provided by default, we can write our own custom function and use it as a descriptive statistic with the .apply() command. Here is an example of how that works.
def my_mode(my_array):
    # bincount requires non-negative integers, so cast the ages first
    counts = np.bincount(my_array.astype(int))
    mode = np.argmax(counts)
    return mode, counts[mode]

# we need to drop the rows with missing ages before grouping
salary_data.dropna().groupby('departm')['age'].apply(my_mode)
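As an aside, pandas does ship a Series.mode() method; unlike the custom function above it returns every tied mode rather than the count of the winner, but for a quick groupby it can be simpler. A small sketch on toy data (the column names mirror salary_data, but the values here are made up):

```python
import pandas as pd

# Toy stand-in for salary_data; column names mirror it, values are invented
toy = pd.DataFrame({
    'departm': ['bio', 'bio', 'bio', 'chem', 'chem'],
    'age':     [40.0,  40.0,  55.0,  61.0,   61.0],
})

# Series.mode() returns the most common value(s) per group, ties included
modes = toy.dropna().groupby('departm')['age'].apply(lambda s: s.mode())
print(modes)
```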
OkCupid Data¶
The next set of analyses concerns a large, public dataset of OkCupid profile information (from 60,000 profiles). This is a little different from the data you would get in a traditional psychological study, but it’s still behavioral data and we thought it would be interesting to explore.
This dataset has been published with permission in the Journal of Statistics Education, Volume 23, Number 2 (2015) by Albert Y. Kim et al.
profile_data = pd.read_csv('https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles_revised.csv.zip?raw=true', compression='zip')
Plot a histogram of the distribution of ages for both men and women on OkCupid. Provide a brief (1-2 sentence) summary of what you see.
Delete this text and put your answer here. The code for your analysis should appear in the cells below.
Find the mean, median and modal age for men and women in this dataset.
Delete this text and put your answer here. The code for your analysis should appear in the cells below.
Plot a histogram of the height of women and men in this dataset.
Delete this text and put your answer here. The code for your analysis should appear in the cells below.
Propose a new type of analysis you'd like to see. Work with the instructor or TA to try to accomplish it. Be reasonable!
Delete this text and put your answer here. The code for your analysis should appear in the cells below.
Large-scale behavioral dataset of risky choice¶
Next, we will work with a massive dataset on human decision making under risky choice. It comes from a recent paper by Peterson et al., which I suggest you read before continuing:
Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D., & Griffiths, T. L. (2021). Using large-scale experiments and machine learning to discover theories of human decision-making. Science, 372(6547), 1209-1214.
The following notebook cells are adapted from this github repository, which contains the Choices 13K dataset collected in the paper.
Data Loading¶
First, let’s load the dataset and do some preprocessing. Don’t worry too much about what the following code is doing; the key dataframe you need afterwards is called df and is shown below.
c13k = pd.read_csv('https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_selections.csv')
c13k_problems = pd.read_json("https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_problems.json", orient='index')
df = c13k.join(c13k_problems, how="left")
df
| | Problem | Feedback | n | Block | Ha | pHa | La | Hb | pHb | Lb | LotShapeB | LotNumB | Amb | Corr | bRate | bRate_std | B | A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | True | 15 | 2 | 26 | 0.95 | 1 | 23 | 0.05 | 21 | 0 | 1 | False | 0 | 0.626667 | 0.384460 | [[0.9500000000000001, 21.0], [0.05, 23.0]] | [[0.9500000000000001, 26.0], [0.05, 1.0]] |
| 1 | 2 | True | 15 | 4 | 14 | 0.60 | 18 | 8 | 0.25 | 5 | 0 | 1 | True | 1 | 0.493333 | 0.413118 | [[0.75, 5.0], [0.25, 8.0]] | [[0.6000000000000001, 14.0], [0.4, 18.0]] |
| 2 | 3 | True | 17 | 4 | 2 | 0.50 | 0 | 1 | 1.00 | 1 | 0 | 1 | False | 0 | 0.611765 | 0.432843 | [[1.0, 1.0]] | [[0.5, 2.0], [0.5, 0.0]] |
| 3 | 4 | True | 18 | 3 | 37 | 0.05 | 8 | 87 | 0.25 | 31 | 1 | 2 | False | 0 | 0.222222 | 0.387383 | [[0.75, 31.0], [0.125, 86.5], [0.125, 87.5]] | [[0.05, 37.0], [0.9500000000000001, 8.0]] |
| 4 | 5 | False | 15 | 1 | 26 | 1.00 | 26 | 45 | 0.75 | 36 | 2 | 5 | False | 0 | 0.586667 | 0.450185 | [[0.25, 36.0], [0.375, 41.0], [0.1875, 43.0], ... | [[1.0, 26.0], [0.0, 26.0]] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14563 | 13002 | True | 15 | 3 | 30 | 1.00 | 30 | 42 | 0.80 | 0 | 0 | 1 | True | 0 | 0.367619 | 0.302731 | [[0.199999999999999, 0.0], [0.8, 42.0]] | [[1.0, 30.0], [0.0, 30.0]] |
| 14564 | 13003 | True | 15 | 5 | 70 | 0.50 | 42 | 18 | 0.80 | 7 | 0 | 1 | False | 0 | 0.760000 | 0.364104 | [[0.199999999999999, 7.0], [0.8, 18.0]] | [[0.5, 70.0], [0.5, 42.0]] |
| 14565 | 13004 | True | 15 | 5 | 8 | 0.40 | 17 | 31 | 0.40 | 34 | 1 | 6 | False | 0 | 0.666667 | 0.367747 | [[0.6000000000000001, 34.0], [0.0125, 28.5], ... | [[0.4, 8.0], [0.6000000000000001, 17.0]] |
| 14566 | 13005 | True | 15 | 2 | 89 | 0.50 | 49 | 45 | 0.50 | 12 | 0 | 1 | False | 0 | 0.386667 | 0.381476 | [[0.5, 12.0], [0.5, 45.0]] | [[0.5, 89.0], [0.5, 49.0]] |
| 14567 | 13006 | True | 15 | 3 | 25 | 0.80 | 25 | 70 | 0.50 | 4 | 0 | 1 | True | 0 | 0.626667 | 0.361478 | [[0.5, 4.0], [0.5, 70.0]] | [[0.8, 25.0], [0.199999999999999, 25.0]] |

14568 rows × 18 columns
Understanding the dataframe¶
The rows in df represent many different problems (labeled with their problem ID), each of which presented participants with a choice between two risky gambles, A and B. The paper ran a massive experiment covering over 13,000 different problems… wow!
Many different human participants were shown each problem. The proportion of participants who chose Gamble B over Gamble A is listed in the column bRate. The information shown in each column is described in the table below.
For our purposes, the most important columns will be Ha, pHa, La, Hb, pHb, Lb, and bRate.
| Column | Data Type | Description |
| --- | --- | --- |
| Problem | Integer | A unique internal problem ID |
| Feedback | Boolean | Whether the participants were given feedback about the reward they received and missed out on after making their choice |
| n | Integer | The number of subjects run on the current problem |
| Block | {1, 2, 3, 4, 5} | The “block ID” for the current problem. For problems with no feedback, Block is always 1. Otherwise, Block ID was … |
| Ha | Integer | The first outcome of gamble A |
| pHa | Float | The probability of Ha in gamble A |
| La | Integer | The second outcome of gamble A (occurs with probability 1 − pHa) |
| Hb | Integer | The expected value of the good outcomes in gamble B |
| pHb | Float | The probability of outcome Hb in gamble B |
| Lb | Float | The bad outcome for gamble B (occurs with probability 1 − pHb) |
| LotShapeB | {0, 1, 2, 3} | The shape of the lottery distribution for gamble B. A value of 1 indicates the distribution is symmetric around its mean, 2 indicates right-skewed, 3 indicates left-skewed, and 0 indicates undefined (i.e., if LotNumB = 1). |
| LotNumB | Integer | The number of possible outcomes in the gamble B lottery |
| Amb | Boolean | Whether the decision maker was able to see the probabilities of the outcomes in Gamble B. 1 indicates the participant received no information concerning the probabilities; 0 implies complete information and no ambiguity. |
| Corr | {−1, 0, 1} | Whether there is a correlation (negative, zero, or positive) between the payoffs of the two gambles. |
| bRate | Float | The ratio of gamble B selections to total selections for MTurk subjects. |
| bRate_std | Float | The standard deviation of the ratio of gamble B to total selections for MTurk subjects. |
| A | List | The different outcomes in Gamble A: [[prob1, outcome1], [prob2, outcome2], …] |
| B | List | The different outcomes in Gamble B: [[prob1, outcome1], [prob2, outcome2], …] |
Example risky gambles¶
Next, we’ll show how to display one of the gambles in a more human-readable format. Let’s check out the very first gamble in the list. First, we’ll need to define a function print_problem to help us format the problem nicely.
def print_problem(problem_df, problem_index):
    """
    Print a slightly more readable representation of the gamble information
    for the gamble associated with index `problem_index` in `problem_df`.
    """
    entry = problem_df.loc[problem_index]
    gA, gB = entry.A, entry.B
    gambleA = pd.DataFrame(gA, columns=["Probability", "Payout"]).sort_values("Probability", ascending=False)
    gambleB = pd.DataFrame(gB, columns=["Probability", "Payout"]).sort_values("Probability", ascending=False)
    linesA = gambleA[["Payout", "Probability"]].to_string().split("\n")
    linesB = gambleB[["Payout", "Probability"]].to_string().split("\n")
    print("{:^55}".format(f"Problem {entry.Problem}, Feedback = {entry.Feedback}"))
    print("{:^55}".format(f"n = {entry.n}, bRate = {entry.bRate:.4f}, std: {entry.bRate_std:.4f}"))
    print(f"\n{'Gamble A':^25} {'Gamble B':>20}")
    for i in range(max(len(linesA), len(linesB))):
        a_str = "" if i >= len(linesA) else linesA[i]
        b_str = "" if i >= len(linesB) else linesB[i]
        print(f"{a_str:<25} {b_str:>25}")
Now let’s visualize the first problem in the dataframe df.
print_problem(df,0)
df.iloc[[0]]
Problem 1, Feedback = True
n = 15, bRate = 0.6267, std: 0.3845
Gamble A Gamble B
Payout Probability Payout Probability
0 26.0 0.95 0 21.0 0.95
1 1.0 0.05 1 23.0 0.05
| | Problem | Feedback | n | Block | Ha | pHa | La | Hb | pHb | Lb | LotShapeB | LotNumB | Amb | Corr | bRate | bRate_std | B | A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | True | 15 | 2 | 26 | 0.95 | 1 | 23 | 0.05 | 21 | 0 | 1 | False | 0 | 0.626667 | 0.38446 | [[0.9500000000000001, 21.0], [0.05, 23.0]] | [[0.9500000000000001, 26.0], [0.05, 1.0]] |
For problem 1, this means there were 15 participants who made a selection for the problem. About 62% chose Gamble B (bRate), and the payout structure of each gamble is shown as well. People preferred Gamble B, which paid out 21 dollars with probability 0.95 and 23 dollars with probability 0.05.
Other problems have a more complicated structure for Gamble B, as you can see in Problem 4 below (index 3). Here, there are 3 potential payouts for Gamble B.
In the dataset generally, Gamble A always has just 2 outcomes, where Ha is the payout of the good outcome (which one receives with probability pHa) and La is the bad outcome (with probability 1 − pHa). Gamble B, by contrast, can have more than 2 outcomes (in the case below, it has 3). Hb is the expected value (probability-weighted average) of the good outcomes (in the case below, 87), and Lb is the bad outcome (31). One of the better outcomes in B occurs with probability pHb, and the bad outcome with probability 1 − pHb.
print_problem(df,3)
df.iloc[[3]]
Problem 4, Feedback = True
n = 18, bRate = 0.2222, std: 0.3874
Gamble A Gamble B
Payout Probability Payout Probability
1 8.0 0.95 0 31.0 0.750
0 37.0 0.05 1 86.5 0.125
2 87.5 0.125
| | Problem | Feedback | n | Block | Ha | pHa | La | Hb | pHb | Lb | LotShapeB | LotNumB | Amb | Corr | bRate | bRate_std | B | A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 4 | True | 18 | 3 | 37 | 0.05 | 8 | 87 | 0.25 | 31 | 1 | 2 | False | 0 | 0.222222 | 0.387383 | [[0.75, 31.0], [0.125, 86.5], [0.125, 87.5]] | [[0.05, 37.0], [0.9500000000000001, 8.0]] |
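As a sanity check on this encoding (this check is our own sketch, not part of the original notebook), we can recompute Hb for problem 4 directly from Gamble B’s raw outcome list shown above, treating any outcome other than Lb = 31 as a “good” outcome:

```python
# Gamble B for problem 4, copied from the row above as [[probability, payout], ...]
gamble_b = [[0.75, 31.0], [0.125, 86.5], [0.125, 87.5]]
lb = 31.0  # the bad outcome Lb

# Hb is the probability-weighted average of the good outcomes only
good = [(p, x) for p, x in gamble_b if x != lb]
p_good = sum(p for p, _ in good)            # pHb: 0.25
hb = sum(p * x for p, x in good) / p_good   # 87.0, matching the Hb column
print(p_good, hb)
```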
Maximizing expected value¶
A classic economic theory is that people make choices to maximize their expected payoff. The theory posits that people are rational actors who choose the option (in this case, the gamble) with the highest expected value. By higher expected value, we mean that the gamble would pay out more money in the long run if it were chosen over and over again.
Here is the equation for the expected value of Gamble A:

$$EV_A = pHa \cdot Ha + (1 - pHa) \cdot La$$

Similarly, here is the equation for the expected value of Gamble B:

$$EV_B = pHb \cdot Hb + (1 - pHb) \cdot Lb$$
Here we are going to evaluate the theory that people make choices consistent with maximizing expected value. We will do so by comparing the theory to the real behavioral data. You will need to do the following:
- Add two new columns EVA and EVB to the dataframe df that compute the expected values of Gamble A and Gamble B, respectively.
- Add a new column B_win to df that thresholds bRate > 0.5 to indicate whether the majority of participants chose Gamble B.
- For what percentage of problems was the majority choice consistent with the gamble that has the highest expected value?
Your code should appear below.
Given the results of Question 5, how does the classical economic theory (maximizing expected value) fare as a theory of human decisions under risky choice? Please write a paragraph or two about your thoughts.
Delete this text and put your answer here.
- Write code to find the problem with the most popular Gamble B (there might be ties; don't worry about that, picking any one problem is fine).
- Write code to find the problem with the least popular Gamble B.
- Briefly analyze in a few sentences: does it match your intuition that the Gamble B you found in part 2 is a lot worse than the corresponding Gamble A? Analyze in terms of expected value.
Hint: check out the pandas functions argmin and argmax.
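For intuition, here is how the two behave on a toy Series (unrelated to the gamble data): argmax returns the integer position of the largest value, while the related idxmax returns its index label.

```python
import pandas as pd

s = pd.Series([0.2, 0.9, 0.4], index=['p1', 'p2', 'p3'])

print(s.argmax())  # 1  (position of the largest value)
print(s.idxmax())  # p2 (index label of the largest value)
```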
Delete this text and put your answer here. The code for your analysis should appear in the cells below.