Exploring Data¶

Written by Todd Gureckis and Brenden Lake

So far in this course we’ve learned about using Jupyter and Python. We have also used Pandas dataframes and learned a bit about plotting data.

In this homework, we are pulling those skills into practice to try to “explore” a data set. At this point we are not really going to be doing much in terms of inferential statistics. Our goals are just to be able to formulate a question and then try to take the steps necessary to compute descriptive statistics and plots that might give us a sense of the answer. You might call it “Answering Questions with Data.”

First, we load the basic packages we will be using in this tutorial. Remeber how we import the modules using an abbreviated name (import XC as YY). This is to reduce the amount of text we type when we use the functions.

Note: %matplotlib inline is an example of something specific to Jupyter call ‘cell magic’ and enables plotting within the notebook and not opening a separate window.

%matplotlib inline 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
from datetime import datetime
import random

Reminders of basic pandas functions¶

As a reminder of some of the pandas basics, let’s revisit the data set of professor salaries we have considered over the last couple of classes.

salary_data = pd.read_csv('salary.csv', sep = ',', header='infer')

Notice that the name of the dataframe is now called salary_data instead of df. It could have been called anything as it is just our variable to work with. However, you’ll want to be careful with your nameing when copying commands and stuff from the past.

peek at the data¶

salary_data.head()

Get the size of the dataframe¶

In rows, columns format

salary_data.shape

Access a single column¶

salary_data[['salary']]

Compute some descriptive statistics¶

salary_data[['salary']].describe()

salary_data[['salary']].count()  # how many rows are there?

creating new column based on the values of others¶

salary_data['pubperyear'] = 0
salary_data['pubperyear'] = salary_data['publications']/salary_data['years']
salary_data.head()

Selecting only certain rows¶

sub_df=salary_data.query('salary > 90000 & salary < 100000')

sub_df

sub_df.describe()

Get the unique values of a columns¶

salary_data['departm'].unique()

How many unique department values are there?

salary_data['departm'].unique().size

or

len(salary_data['departm'].unique())

Breaking the data into subgroups¶

male_df = salary_data.query('gender == 0').reset_index(drop=True)
female_df = salary_data.query('gender == 1').reset_index(drop=True)

Recombining subgroups¶

pd.concat([female_df, male_df],axis = 0).reset_index(drop=True)

Scatter plot two columns¶

ax = sns.regplot(x=salary_data.age, y=salary_data.salary)
ax.set_title('Salary and age')

Histogram of a column¶

sns.displot(salary_data['salary'])

You can also combine two different histograms on the same plot to compared them more easily.

fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(10,3),sharey=True,sharex=True)
sns.histplot(male_df["salary"], ax=ax1,
             bins=range(male_df["salary"].min(),male_df["salary"].max(), 5000),
             kde=False,
             color="b")
ax1.set_title("Salary distribution for males")
sns.histplot(female_df["salary"], ax=ax2,
             bins=range(female_df["salary"].min(),female_df["salary"].max(), 5000),
             kde=False,
             color="r")
ax2.set_title("Salary distribution for females")
ax1.set_ylabel("Number of users in age group")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

Group the salary column using the department name and compute the mean¶

salary_data.groupby('departm')['salary'].mean()

OkCupid Data¶

The next set of analyses concerns a large, public dataset of OKCupid profile information (from 60000 profiles). This is a little different than your traditional data from a psychological study, but it’s still behavioral data and we thought it would be interesting to explore.

This dataset has been published with permission in the Journal of Statistics Education, Volume 23, Number 2 (2015) by Albert Y. Kim et al.

profile_data = pd.read_csv('https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles_revised.csv.zip?raw=true', compression='zip')

Question 1
Plot a historam of the distribution of ages for both men and women on OkCupid. Provide a brief (1-2 sentence summary of what you see).

Your Answer
Delete this text and put your answer here. The code for your analysis should appear in the cells below.

Question 2
Find the mean, median and modal age for men and women in this dataset.

Your Answer
Delete this text and put your answer here. The code for your analysis should appear in the cells below.

Question 3
Plot a histogram of the height of women and men in this dataset.

Your Answer
Delete this text and put your answer here. The code for your analysis should appear in the cells below.

Question 4
Propose a new type of analysis you'd like to see. Work with the instructor or TA to try to accomplish it. Be reasonable!

Your Answer
Delete this text and put your answer here. The code for your analysis should appear in the cells below.

Large-scale behavioral dataset of risky choice¶

Next, we will work with a massive dataset regarding human decision making under risky choice. This comes from a recent paper by Peterson et al., which I suggest you read before continuing:

Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D., & Griffiths, T. L. (2021). Using large-scale experiments and machine learning to discover theories of human decision-making. Science(372):1209-1214.

The following notebook are adapted from this github repository which contains the Choices 13K dataset collected in the paper.

Data Loading¶

First, let’s load the dataset and do some pre-processing. Don’t worry too much about what the following code is doing, but the key dataframe you need afterwards is called df and shown below.

c13k = pd.read_csv('https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_selections.csv')
c13k_problems = pd.read_json("https://raw.githubusercontent.com/jcpeterson/choices13k/main/c13k_problems.json", orient='index')
df = c13k.join(c13k_problems, how="left")
df

	Problem	Feedback	n	Block	Ha	pHa	La	Hb	pHb	Lb	LotShapeB	LotNumB	Amb	Corr	bRate	bRate_std	B	A
0	1	True	15	2	26	0.95	-1	23	0.05	21	0	1	False	0	0.626667	0.384460	[[0.9500000000000001, 21.0], [0.05, 23.0]]	[[0.9500000000000001, 26.0], [0.05, -1.0]]
1	2	True	15	4	14	0.60	-18	8	0.25	-5	0	1	True	-1	0.493333	0.413118	[[0.75, -5.0], [0.25, 8.0]]	[[0.6000000000000001, 14.0], [0.4, -18.0]]
2	3	True	17	4	2	0.50	0	1	1.00	1	0	1	False	0	0.611765	0.432843	[[1.0, 1.0]]	[[0.5, 2.0], [0.5, 0.0]]
3	4	True	18	3	37	0.05	8	87	0.25	-31	1	2	False	0	0.222222	0.387383	[[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]]	[[0.05, 37.0], [0.9500000000000001, 8.0]]
4	5	False	15	1	26	1.00	26	45	0.75	-36	2	5	False	0	0.586667	0.450185	[[0.25, -36.0], [0.375, 41.0], [0.1875, 43.0],...	[[1.0, 26.0], [0.0, 26.0]]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
14563	13002	True	15	3	30	1.00	30	42	0.80	0	0	1	True	0	0.367619	0.302731	[[0.199999999999999, 0.0], [0.8, 42.0]]	[[1.0, 30.0], [0.0, 30.0]]
14564	13003	True	15	5	70	0.50	-42	18	0.80	7	0	1	False	0	0.760000	0.364104	[[0.199999999999999, 7.0], [0.8, 18.0]]	[[0.5, 70.0], [0.5, -42.0]]
14565	13004	True	15	5	8	0.40	-17	31	0.40	-34	1	6	False	0	0.666667	0.367747	[[0.6000000000000001, -34.0], [0.0125, 28.5], ...	[[0.4, 8.0], [0.6000000000000001, -17.0]]
14566	13005	True	15	2	89	0.50	-49	45	0.50	-12	0	1	False	0	0.386667	0.381476	[[0.5, -12.0], [0.5, 45.0]]	[[0.5, 89.0], [0.5, -49.0]]
14567	13006	True	15	3	25	0.80	25	70	0.50	-4	0	1	True	0	0.626667	0.361478	[[0.5, -4.0], [0.5, 70.0]]	[[0.8, 25.0], [0.199999999999999, 25.0]]

14568 rows × 18 columns

Understanding the dataframe¶

The rows in df represent many different problem types (labeled with their problem ID), which provides participants with a choice between two risky Gambles A and B. The paper ran a massive experiment that studies over different 13,000 problems… wow!

Many different human participants were shown each problem type. The proportion of participants that choose Gamble B over Gamble A is listed in the column bRate. The information shown in each column is described in the table below.

For our purposes, the most important columns will be Ha, pHa, La, Hb, pHb, Lb, and bRate.

Column	Data Type	Description
Problem	Integer	A unique internal problem ID
Feedback	Boolean	Whether the participants were given feedback about the reward they received and missed out on after making their selection.
n	Integer	The number of subjects run on the current problem
Block	{1, 2, 3, 4, 5}	The “block ID” for the current problem. For problems with no feedback, Block is always 1. Otherwise, Block ID was sampled uniformly at random from {2, 3, 4, 5}.
Ha	Integer	The first outcome of gamble A

| pHa | Float | The probability of Ha in gamble A | | La | Integer | The second outcome of gamble A (occurs with probability 1-pHa). | | Hb | Integer | The expected value of the good outcomes in gamble B | | pHb | Float | The probability of outcome Hb in gamble B | | Lb | Float | The bad outcome for gamble B (occurs with probability 1-pHb) | | LotShapeB | {0, 1, 2, 3} | The shape of the lottery distribution for gamble B. A value of 1 indicates the distribution is symmetric around its mean,
2 indicates right-skewed, 3 indicates left-skewed, and 0 indicates undefined (i.e., if LotNumB = 1). | | LotNumB | Integer | The number of possible outcomes in the gamble B lottery | | Amb | Boolean | Whether the decision maker was able to see the probabilities of the outcomes in Gamble B. 1 indicates the participant
received no information concerning the probabilities, and 0 implies complete information and no ambiguity. | | Corr | {-1, 0, 1} | Whether there is a correlation (negative, zero, or positive) between the payoffs of the two gambles. | | bRate | Float | The ratio of gamble B selections to total selections for MTurk subjects. | | bRate_std | Float | The standard deviation of the ratio of gamble B to total selections for MTurk subjects. | | A | list | The different outcomes in Gamble A, [[prob1, outcome1], [prob2, outcome2],…] | | B | list | The different outcomes in Gamble B, [[prob1, outcome1], [prob2, outcome2],…]

Example risk gambles¶

Next, we’ll show how to display one of the gambles in a more human-readable format. Let’s check out the very first gamble in the list. First, we’ll need to define a function print_problem to help us format the problem nicely.

def print_problem(problem_df, problem_index):
    """
    Print a slightly more readable representation of the gamble information 
    for the gamble associated with index `problem_index` in `problem_df`.
    """
    entry = problem_df.loc[problem_index]
    gA, gB = entry.A, entry.B
    gambleA = pd.DataFrame(gA, columns=["Probability", "Payout"]).sort_values("Probability", ascending=False)
    gambleB = pd.DataFrame(gB, columns=["Probability", "Payout"]).sort_values("Probability", ascending=False)
    linesA = gambleA[["Payout", "Probability"]].to_string().split("\n")
    linesB = gambleB[["Payout", "Probability"]].to_string().split("\n")
    
    cols = ["Problem", "Feedback", "n", "Block", "bRate", "bRate_std"]
    print("{:^55}".format(f"Problem {entry.Problem}, Feedback = {entry.Feedback}"))
    print("{:^55}".format(f"n = {entry.n}, bRate = {entry.bRate:.4f}, std: {entry.bRate_std:.4f}"))
    print(f"\n{'Gamble A':^25} {'Gamble B':>20}")
    for i in range(max(len(linesA), len(linesB))):
        a_str = "" if i >= len(linesA) else linesA[i]
        b_str = "" if i >= len(linesB) else linesB[i]
        print(f"{a_str:<25} {b_str:>25}")

Now let’s visualize the first problem in the dataframe df.

print_problem(df,0)
df.iloc[[0]]

              Problem 1, Feedback = True               
          n = 15, bRate = 0.6267, std: 0.3845          

        Gamble A                      Gamble B
   Payout  Probability          Payout  Probability
0    26.0         0.95       0    21.0         0.95
1    -1.0         0.05       1    23.0         0.05

	Problem	Feedback	n	Block	Ha	pHa	La	Hb	pHb	Lb	LotShapeB	LotNumB	Amb	Corr	bRate	bRate_std	B	A
0	1	True	15	2	26	0.95	-1	23	0.05	21	0	1	False	0	0.626667	0.38446	[[0.9500000000000001, 21.0], [0.05, 23.0]]	[[0.9500000000000001, 26.0], [0.05, -1.0]]

For problem 1, this means there were 15 participants that made a selection for the problem. About 62% chose Gamble B (bRate), and the payout structure of each gamble is shown too. People preferred Gamble B which paid out 21 dollars with probability 0.95, and 23 dollars with probability 0.05.

Other problems have a more complicated structure for Gamble B, as you can see in Problem 4 below (index 3). Here, there are 3 potential payouts for problem B.

In the dataset generally, Gamble A always has just 2 outcomes, where Ha is the payout of the good outcome (which one gets with probability pHa) and La is the bad outcome (with probability 1-pHa). Gamble B, alternatively, can more than 2 outcomes (in the case below, it has 3). Hb is the expected value (weighted average) of the good outcomes (in the case below, 87), and Lb is the bad outcome (-31). The probability of getting one of the better outcomes in B is probability pHb and the bad outcome is probability 1-pHb.

print_problem(df,3)
df.iloc[[3]]

              Problem 4, Feedback = True               
          n = 18, bRate = 0.2222, std: 0.3874          

        Gamble A                      Gamble B
   Payout  Probability          Payout  Probability
1     8.0         0.95       0   -31.0        0.750
0    37.0         0.05       1    86.5        0.125
                             2    87.5        0.125

	Problem	Feedback	n	Block	Ha	pHa	La	Hb	pHb	Lb	LotShapeB	LotNumB	Amb	Corr	bRate	bRate_std	B	A
3	4	True	18	3	37	0.05	8	87	0.25	-31	1	2	False	0	0.222222	0.387383	[[0.75, -31.0], [0.125, 86.5], [0.125, 87.5]]	[[0.05, 37.0], [0.9500000000000001, 8.0]]

Maximizing expected value¶

A classic economic theory is that people make choices to maximize the expected payoff. The theory posits that people are rational actors, making the choices (in this case, gambles) with the highest expected value. By higher expected value, we mean the gamble will pay out more money in the long run if it was chosen over and over again.

Here is the equation for the expected value of Gamble A:

\[ EV_A = Ha * pHa + La * (1-pHa) \]

Similarly, here is the equation for the expected value of Gamble B:

\[ EV_B = Hb * pHb + Lb * (1-pHb) \]

Question 5
Here we are going to evaluate the theory that people make choices consistent with maximizing expected value. We will do so by comparing the theory to the real behavioral data. You will need to do the following:

Add two new columns EVA and EVB to the dataframe df that compute the expected value of each Gamble A and Gamble B.
Add a new column B_win to df that thresholds bRate>0.5 to indicate whether the majority of participants chose Gamble B
For what percentage of problems was the majority choice consistent with the gamble that has highest expected value?

Your Answer
Your code should appear below.

Question 6
Given the results of Question 5, how does the classical economic theory (maximizing expected value) fare as a theory of human decisions under risky choice? Please write a paragraph or two about your thoughts.

Your Answer
Delete this text and put your answer here.

Question 7

Write code to find the problem with the most popular Gamble B (there might be ties, don't worry about that. picking any one problem is fine)
Write code to find the problem with the least popular Gamble B
Briefly analyze in a few sentences. Does it match your intuition that the Gamble B in problem 2 is a lot worse than the corresponding Gamble A? Analyze in terms of expected value.

Hint: check out pandas functions argmin and argmax.

Your Answer
Delete this text and put your answer here. The code for your analysis should appear in the cells below.

Lab in C&P (Fall 2023)

Exploring Data¶

Reminders of basic pandas functions¶

peek at the data¶

Get the size of the dataframe¶

Access a single column¶

Compute some descriptive statistics¶

creating new column based on the values of others¶

Selecting only certain rows¶

Get the unique values of a columns¶

Breaking the data into subgroups¶

Recombining subgroups¶

Scatter plot two columns¶

Histogram of a column¶

Group the salary column using the department name and compute the mean¶

Group the age column using the department name and compute the modal age of the faculty¶

OkCupid Data¶

Large-scale behavioral dataset of risky choice¶

Data Loading¶

Understanding the dataframe¶

Example risk gambles¶

Maximizing expected value¶