from IPython.core.display import HTML, Markdown, display

import numpy.random as npr
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf
import ipywidgets as widgets

import os

Lab 1B: Advanced Regression

Authored by Todd Gureckis, adapted by Brenden Lake

Aspects borrowed from General Assembly’s Data Science course which claims to have adapted materials from Chapter 3 of An Introduction to Statistical Learning

Congratulations, you made it through the first part of the lab, and now understand the basics of simple linear regression! Before we move on to the actual analyses in this lab, we need to cover a little more background on what we will call “advanced” topics in regression. However, don’t be afraid! This is not really “advanced” but more like “nuance, details, and stuff you’ll need later”. So this isn’t an exercise in math or abstraction, but a set of additional practical skills that are useful to know in order to be a savvy psychological data scientist!

(replace years in grad school with “week in lab in cognition and perception”)

Checking assumptions

As mentioned in the reading, there are a variety of “assumptions” that need to be met in order for some of the interpretations of regression fits to be valid. Mostly this has to do with interpreting things like the \(p\)-values associated with the regression line itself or with the estimated coefficients. If you are just using a fitted model to make predictions, these assumptions matter somewhat less, because it will be painfully obvious if your model is wrong: you’ll do poorly at prediction! It is the interpretation of the coefficients that always sounds kind of reasonable when you just report the p-value or 95% confidence interval, but that can be completely wrong if the assumptions are not met.

One of the more important assumptions of linear regression is that the relationship between the predictor and outcome variable is roughly linear. Seems obvious, right? But remember the examples mentioned in Chapter 4, known as the Anscombe quartet: datasets that have exactly the same correlation value (\(r\)) but are clearly showing quite different things.

import seaborn as sns
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})

In each of these examples, you can see a nice line fit to each one, but the underlying data are quite different. This illustrates a nice example of data that might “violate” the assumptions of a regression in some way.

# as a reminder to select only data for dataset I we sub-select from the original data frame:
df1 = df[df.dataset=='I']
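One informal way to check the linearity assumption is to look at the residuals from a simple regression fit. The code below is a minimal sketch (not part of the original lab) that fits y ~ x separately to Anscombe datasets I and II using the df loaded above: for dataset I the residuals scatter randomly around zero, while for dataset II they trace out an obvious curve, a sign that a straight line is the wrong model.

# sketch: residuals vs. x for Anscombe datasets I and II
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
for ax, name in zip(axes, ['I', 'II']):
    subset = df[df.dataset == name]
    fit = smf.ols(formula='y ~ x', data=subset).fit()
    ax.scatter(subset.x, fit.resid)              # residuals against the predictor
    ax.axhline(0, color='gray', linestyle='--')  # residuals should hover around zero
    ax.set(title='dataset ' + name, xlabel='x', ylabel='residual')
plt.tight_layout()
plt.show()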

Multiple regression

All of the examples we considered in the last notebook use a regression equation that looked something like this:

\[y = \beta_0 + \beta_1x\]

which is often known as simple linear regression. What makes it simple is that there is a single predictor (\(x\)), and it enters into the linear equation in a very standard way.

As described in the text, simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:

\[y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n\]

Each \(x\) represents a different feature, and each feature has its own coefficient. For example, for the advertising data we considered in the previous lab:

\[y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper\]

Let’s use Statsmodels to estimate these coefficients:

ad_data = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/advertising.csv')

# create a fitted model with all three features
lm = smf.ols(formula='sales ~ tv + radio + newspaper', data=ad_data).fit()

# print the coefficients, then display the full summary table
print(lm.params)
lm.summary()

What are a few key things we learn from this output?

  • TV and Radio have significant p-values, whereas Newspaper does not. Thus we reject the null hypothesis for TV and Radio (that there is no association between those features and Sales), and fail to reject the null hypothesis for Newspaper.

  • TV and Radio ad spending are both positively associated with Sales, whereas Newspaper ad spending is slightly negatively associated with Sales. (However, this is irrelevant since we have failed to reject the null hypothesis for Newspaper.)

  • This model has a higher R-squared (0.897) than the simpler models we considered in the last lab section, which means that this model provides a better fit to the data than a model that only includes TV. However, remember that this is a case where comparing adjusted R-squared values might be more appropriate, because our more complex model (with more predictors) is also more flexible. The adjusted R-squared helps with this by comparing the quality of the fit while controlling for the number of predictors (see the sketch just below).
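Both of these quantities are available as attributes of the fitted statsmodels results object (rsquared and rsquared_adj). Here is a quick illustrative sketch that refits the TV-only model from the last lab, assuming ad_data and lm from above are still in memory:

# sketch: compare R-squared and adjusted R-squared for the TV-only vs. full model
lm_tv = smf.ols(formula='sales ~ tv', data=ad_data).fit()

print("TV only:   R^2 = %.3f, adjusted R^2 = %.3f" % (lm_tv.rsquared, lm_tv.rsquared_adj))
print("All three: R^2 = %.3f, adjusted R^2 = %.3f" % (lm.rsquared, lm.rsquared_adj))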

Collinearity

Collinearity is a situation that arises only in multiple regression. It means that some of the information contained in one of the predictors you enter into your multiple regression model can be reconstructed as a linear combination of some of the other predictors.

Remember that the coefficients in multiple regression measure the effect of a unit increase in predictor X with all other predictors held constant. However, it is impossible to measure this effect if one of the other predictors is highly correlated with, or perhaps even identical to, X. The effect this has on regression estimates is that the coefficients have more uncertainty about them (i.e., the 95% confidence intervals are wider).
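You can see this effect directly with a small simulation (a sketch, not part of the lab data): generate an outcome from two predictors that are either unrelated or nearly identical, and compare the width of the 95% confidence interval on the same coefficient in each case. The interval should come out much wider in the collinear case, even though the data-generating coefficients are identical.

# sketch: collinearity inflates the uncertainty of coefficient estimates
npr.seed(10)
n = 200
x1 = npr.normal(size=n)
x2_indep = npr.normal(size=n)                    # unrelated second predictor
x2_collin = x1 + npr.normal(scale=0.05, size=n)  # nearly identical (collinear) second predictor

for label, x2 in [('independent', x2_indep), ('collinear', x2_collin)]:
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + npr.normal(size=n)
    sim = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})
    fit = smf.ols(formula='y ~ x1 + x2', data=sim).fit()
    ci = fit.conf_int().loc['x1']                # 95% CI for the x1 coefficient
    print('%-12s 95%% CI width for x1: %.2f' % (label, ci.iloc[1] - ci.iloc[0]))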

The following four datasets (provided in a nice blog post about collinearity by Jan Vanhove) give examples of a strong, weak, nonexistent, or nonlinear pattern of collinearity between two predictors. In each case we are interested in the multiple linear regression between the two predictors and the outcome (a sketch for loading one of these files follows the list of links below).

\[outcome = \beta_1 \times predictor1 + \beta_2 \times predictor2 + \beta_0\]

  • Strong - https://janhove.github.io/datasets/strong_collinearity.csv

  • Weak - https://janhove.github.io/datasets/weak_collinearity.csv

  • None - https://janhove.github.io/datasets/no_collinearity.csv

  • Nonlinear - https://janhove.github.io/datasets/nonlinearity.csv
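For example, here is a sketch of how you might load and fit the “strong” dataset. The column names predictor1, predictor2, and outcome are an assumption based on the regression equation above; check the printed columns after loading and adjust the formula if they differ. The same pattern works for the other three URLs.

# sketch: load one of the collinearity datasets and fit the multiple regression
collin_df = pd.read_csv('https://janhove.github.io/datasets/strong_collinearity.csv')
print(collin_df.columns)  # check the actual column names
print(collin_df.corr())   # how strongly related are the two predictors?

collin_lm = smf.ols(formula='outcome ~ predictor1 + predictor2', data=collin_df).fit()
collin_lm.summary()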