\n",
" **Question 1**

\n", " The cell below shows how to extract the data for each of these examples from the larger data frame. Using your statsmodels skills, conduct a linear regression between x and y for each of these datasets and verify that they have the same overall $R^2$ value.\n", "

"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# as a reminder to select only data for dataset I we sub-select from the original data frame:\n",
"df1 = df[df.dataset=='I']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Your Answer**

\n", " Delete this text and put your answer here. The text and/or code for your analysis should appear in the cells below.\n", "

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Question 2**

\n", " Datasets II, III, and IV have odd patterns which appear to violate one or more of the assumptions of linear regression. Referring back to the chapter on regression, provide Python code to demonstrate at least one of these violations. As a reminder, the assumptions appear in the section \"Assumptions of regression\" and include normality, linearity, homogeneity of variance, uncorrelated predictors, residuals that are independent of each other, and no \"bad\" outliers. You will find it helpful to peek at the code provided in the textbook along with the various plots.\n", "

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Your Answer**

\n", " Delete this text and put your answer here. The text and/or code for your analysis should appear in the cells below.

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the examples we considered in the last notebook used a regression equation that looked something like this:\n",
"\n",
"$$y = \\beta_0 + \\beta_1x$$\n",
"\n",
"which is often known as simple linear regression. What makes it simple is that there is a single predictor ($x$) and it enters into the linear equation in a very standard way.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As described in the text, simple linear regression can easily be extended to include multiple features. This is called **multiple linear regression**:\n",
"\n",
"$y = \\beta_0 + \\beta_1x_1 + ... + \\beta_nx_n$\n",
"\n",
"Each $x$ represents a different feature, and each feature has its own coefficient. For example, for the advertising data we considered in the previous lab:\n",
"\n",
"$y = \\beta_0 + \\beta_1 \\times TV + \\beta_2 \\times Radio + \\beta_3 \\times Newspaper$\n",
"\n",
"Let's use Statsmodels to estimate these coefficients:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ad_data = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/advertising.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a fitted model with all three features\n",
"lm = smf.ols(formula='sales ~ tv + radio + newspaper', data=ad_data).fit()\n",
"\n",
"# print the coefficients\n",
"lm.params"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lm.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are a few key things we learn from this output?\n",
"\n",
"- TV and Radio have significant **p-values**, whereas Newspaper does not. Thus we reject the null hypothesis for TV and Radio (that there is no association between those features and Sales), and fail to reject the null hypothesis for Newspaper.\n",
"- TV and Radio ad spending are both **positively associated** with Sales, whereas Newspaper ad spending is **slightly negatively associated** with Sales. (However, this is irrelevant since we have failed to reject the null hypothesis for Newspaper.)\n",
"- This model has a higher **R-squared** (0.897) than the simpler models we considered in the last lab section, which means that it provides a better fit to the data than a model that only includes TV. However, remember that this is a case where comparing **adjusted R-squared** values might be more appropriate, because our more complex model (with more predictors) is also more flexible. The adjusted R-squared helps with this by comparing the quality of the fit while controlling for the number of predictors."
]
},
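{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the adjusted R-squared correction described above, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper name `adjusted_r_squared` is hypothetical, and the row count of 200 is an assumption about the advertising data that should be checked with `len(ad_data)`.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared reported above for the model with tv, radio, and newspaper
r2_full = 0.897

# ASSUMPTION: the advertising data has 200 rows; verify with len(ad_data)
n_rows = 200

adj_full = adjusted_r_squared(r2_full, n_rows, n_predictors=3)

# The adjusted value is always a little below the raw R-squared
print(round(adj_full, 4))
```

In practice statsmodels already reports this value as `lm.rsquared_adj`, so the hand computation is only meant to show where the penalty for extra predictors comes from."
]
},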
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Question 3**

\n", " Return to the parenthood data set we considered in the last lab (**https://cims.nyu.edu/~brenden/courses/labincp/data/parenthood.csv**). Re-read this data in (you have a new notebook and kernel now) and conduct a multiple regression that predicts grumpiness using dad sleep and baby sleep. Following the text in the chapter and the example above, interpret the coefficients and p-values for this model.\n",
"

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Your Answer**

\n", " Delete this text and put your answer here. The text and/or code for your analysis should appear in the cells below.\n", "

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Colinearity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Colinearity is a situation that arises only in multiple regression. Here it means that some of the information contained in the various predictors you enter into your multiple regression model can be reconstructed as a linear combination of some of the other predictors.\n",
"\n",
"Remember that the coefficients in multiple regression measure the effect of a unit increase of predictor X with all other predictors held constant. However, it is difficult, and in the limit impossible, to measure this effect if one of the other predictors is highly correlated with, or perhaps even identical to, X. The effect that this has on regression estimates is that the coefficients have more uncertainty about them (i.e., the 95% confidence intervals are wider)."
]
},
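{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the widening of the confidence intervals concrete, here is a rough sketch for the two-predictor case. It relies on the standard result that, with the sample size, residual variance, and predictor variance held fixed, the sampling variance of a coefficient is multiplied by 1 / (1 - r^2), where r is the correlation between the two predictors. The function name `variance_inflation` and the example correlations are illustrative choices, not values taken from the datasets below.

```python
import math


def variance_inflation(r):
    # Factor by which Var(beta_hat) grows when the two predictors correlate at r
    return 1.0 / (1.0 - r ** 2)


# Illustrative correlations between the two predictors (assumed values)
for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation(r)
    # Confidence interval width scales with the standard error, i.e. sqrt(variance)
    ci_widening = math.sqrt(vif)
    print(f'r = {r:4.2f}  variance inflation = {vif:7.2f}  CI about {ci_widening:5.2f} times wider')
```

The factor stays close to 1 for weakly correlated predictors and grows without bound as r approaches 1, which is the case where one predictor is nearly a copy of the other."
]
},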
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following four datasets (provided in a nice blog post about colinearity by [Jan Vanhove](https://janhove.github.io/analysis/2019/09/11/collinearity)) give examples of strong, weak, no, or nonlinear patterns of colinearity between two predictors. In each case we are interested in the multiple linear regression between the two predictors and the outcome."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$outcome = \\beta_1 \\times predictor1 + \\beta_2 \\times predictor2 + \\beta_0$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Strong - \"https://janhove.github.io/datasets/strong_collinearity.csv\" \n",
"Weak - \"https://janhove.github.io/datasets/weak_collinearity.csv\" \n",
"None - \"https://janhove.github.io/datasets/no_collinearity.csv\" \n",
"Nonlinear - \"https://janhove.github.io/datasets/nonlinearity.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Question 4**

\n", " This exercise asks you to use the internet to solve a problem which we haven't explicitly taught you how to solve yet. It is OK to have trouble, and you learn a lot by solving things yourself. Read about the function `PairGrid()` on the seaborn website. `PairGrid()` is one method of showing the relationships between different variables in a dataframe. Plot a pairgrid for each of the four datasets above. What do you notice about the relationship between the predictors? Which datasets are likely to have a problem with colinearity when the outcome variable is predicted using the predictors, and why?\n", "

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Your Answer**

\n", " Delete this text and put your answer here. The text and/or code for your analysis should appear in the cells below.

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Question 5**

\n", " Fit a regression to each of the examples and extract both the parameters/coefficients and the 95% confidence intervals. Report them in a single table (you might find it helpful to create a new pandas dataframe to do this) comparing the estimated values and 95% confidence intervals. What is different across the datasets?\n", "

"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"

\n",
" **Your Answer**

\n", " Delete this text and put your answer here. The text and/or code for your analysis should appear in the cells below.

"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}