{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{note}\n", "This chapter, authored by [Todd M. Gureckis](http://gureckislab.org/~gureckis), is released under the [license](/LICENSE.html) for the book. \n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "hide_input" ] }, "outputs": [], "source": [ "from IPython.display import HTML, Markdown, display\n", "\n", "import numpy.random as npr\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import statsmodels.formula.api as smf\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "from myst_nb import glue # for the Jupyter book chapter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal in this chapter is to build upon the previous chapter on linear regression to introduce the concept of **_logistic regression_**. A very short summary is that while linear regression is about fitting lines, logistic regression is about fitting a particular type of \"squiggle\" to particular kinds of data. We will first talk about why logistic regression is necessary and the types of datasets it applies to, then step through some of the details of fitting a logistic regression model and interpreting it.\n", "\n", "Logistic regression is something you might not have heard about before in a psychology statistics course, but it is becoming more common. In addition, logistic regression is a very popular tool in data science and machine learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting binary outcomes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In much of the previous chapter we talked about regression in the context where we aimed to predict data that was on an interval scale (e.g., grumpiness given sleep). 
However, often we are interested in a discrete, nominal outcome: a case where something happens or it doesn't. We sometimes call these \"binary\" outcomes.\n", "\n", "Examples include:\n", "\n", "- predicting if someone will or will not graduate from high school\n", "- predicting if someone will or will not get a disease\n", "- predicting if a job applicant gets a “good” or “poor” rating on their annual review\n", "- predicting if an infant will look at a stimulus or not in a looking time study\n", "- predicting how a person will answer a true/false question\n", "\n", "In each of these cases, the thing being predicted is a dichotomous outcome with two levels or values (graduate/not, disease/no disease, good/poor, look/no look, true/false, etc.). These cases come up often enough in psychological research that it is useful to know the best way to approach the analysis of this kind of data. \n", "\n", "I think a lot of students, fresh off the chapter on linear regression, would just get to work trying to perform a normal linear regression on data where the predicted value is binary. Why not? It isn't like the statistics software (like statsmodels) will usually **stop** you from doing this. Why do we need to learn an entirely different type of regression approach to handle these types of data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do we need logistic regression?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'm about to show you the **wrong** way to analyze data with discrete, binary outcomes. The reason is that doing this analysis the wrong way helps illustrate why we need a different approach. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{caution} Don't get confused: this next section is not the right way to do things, and you will see why!\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make things concrete, imagine the following data set, which contains some hypothetical data on the high school GPA of several students and whether they were admitted to NYU (loaded from 'nyu_admission_fake.csv' in the current folder):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>student</th><th>gpa</th><th>admit</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>3.085283</td><td>1</td></tr>\n",
"<tr><th>1</th><td>1</td><td>0.083008</td><td>0</td></tr>\n",
"<tr><th>2</th><td>2</td><td>2.534593</td><td>1</td></tr>\n",
"<tr><th>3</th><td>3</td><td>2.995216</td><td>1</td></tr>\n",
"<tr><th>4</th><td>4</td><td>1.994028</td><td>0</td></tr>\n",
"<tr><th>5</th><td>5</td><td>0.899187</td><td>0</td></tr>\n",
"<tr><th>6</th><td>6</td><td>0.792251</td><td>0</td></tr>\n",
"<tr><th>7</th><td>7</td><td>3.042123</td><td>1</td></tr>\n",
"<tr><th>8</th><td>8</td><td>0.676443</td><td>0</td></tr>\n",
"<tr><th>9</th><td>9</td><td>0.353359</td><td>0</td></tr>\n",
"<tr><th>10</th><td>10</td><td>2.741439</td><td>1</td></tr>\n",
"<tr><th>11</th><td>11</td><td>3.813573</td><td>1</td></tr>\n",
"<tr><th>12</th><td>12</td><td>0.015793</td><td>0</td></tr>\n",
"<tr><th>13</th><td>13</td><td>2.048769</td><td>0</td></tr>\n",
"<tr><th>14</th><td>14</td><td>3.250484</td><td>1</td></tr>\n",
"<tr><th>15</th><td>15</td><td>2.450104</td><td>0</td></tr>\n",
"<tr><th>16</th><td>16</td><td>2.887021</td><td>0</td></tr>\n",
"<tr><th>17</th><td>17</td><td>1.167504</td><td>0</td></tr>\n",
"<tr><th>18</th><td>18</td><td>3.671096</td><td>1</td></tr>\n",
"<tr><th>19</th><td>19</td><td>2.858303</td><td>1</td></tr>\n",
"<tr><th>20</th><td>20</td><td>2.170177</td><td>1</td></tr>\n",
"<tr><th>21</th><td>21</td><td>0.568680</td><td>0</td></tr>\n",
"<tr><th>22</th><td>22</td><td>1.493363</td><td>0</td></tr>\n",
"<tr><th>23</th><td>23</td><td>2.696534</td><td>1</td></tr>\n",
"<tr><th>24</th><td>24</td><td>1.767333</td><td>0</td></tr>\n",
"<tr><th>25</th><td>25</td><td>1.736056</td><td>0</td></tr>\n",
"<tr><th>26</th><td>26</td><td>2.471068</td><td>1</td></tr>\n",
"<tr><th>27</th><td>27</td><td>2.052553</td><td>0</td></tr>\n",
"<tr><th>28</th><td>28</td><td>2.601589</td><td>1</td></tr>\n",
"<tr><th>29</th><td>29</td><td>2.404156</td><td>0</td></tr>\n",
"<tr><th>30</th><td>30</td><td>3.220893</td><td>1</td></tr>\n",
"<tr><th>31</th><td>31</td><td>2.086589</td><td>0</td></tr>\n",
"<tr><th>32</th><td>32</td><td>3.634596</td><td>1</td></tr>\n",
"<tr><th>33</th><td>33</td><td>1.276944</td><td>0</td></tr>\n",
"<tr><th>34</th><td>34</td><td>0.361837</td><td>0</td></tr>\n",
"<tr><th>35</th><td>35</td><td>1.202800</td><td>0</td></tr>\n",
"<tr><th>36</th><td>36</td><td>0.455937</td><td>0</td></tr>\n",
"<tr><th>37</th><td>37</td><td>3.314725</td><td>1</td></tr>\n",
"<tr><th>38</th><td>38</td><td>0.187585</td><td>0</td></tr>\n",
"<tr><th>39</th><td>39</td><td>2.505149</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " student gpa admit\n", "0 0 3.085283 1\n", "1 1 0.083008 0\n", "2 2 2.534593 1\n", "3 3 2.995216 1\n", "4 4 1.994028 0\n", "5 5 0.899187 0\n", "6 6 0.792251 0\n", "7 7 3.042123 1\n", "8 8 0.676443 0\n", "9 9 0.353359 0\n", "10 10 2.741439 1\n", "11 11 3.813573 1\n", "12 12 0.015793 0\n", "13 13 2.048769 0\n", "14 14 3.250484 1\n", "15 15 2.450104 0\n", "16 16 2.887021 0\n", "17 17 1.167504 0\n", "18 18 3.671096 1\n", "19 19 2.858303 1\n", "20 20 2.170177 1\n", "21 21 0.568680 0\n", "22 22 1.493363 0\n", "23 23 2.696534 1\n", "24 24 1.767333 0\n", "25 25 1.736056 0\n", "26 26 2.471068 1\n", "27 27 2.052553 0\n", "28 28 2.601589 1\n", "29 29 2.404156 0\n", "30 30 3.220893 1\n", "31 31 2.086589 0\n", "32 32 3.634596 1\n", "33 33 1.276944 0\n", "34 34 0.361837 0\n", "35 35 1.202800 0\n", "36 36 0.455937 0\n", "37 37 3.314725 1\n", "38 38 0.187585 0\n", "39 39 2.505149 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nyu_df = pd.read_csv('nyu_admission_fake.csv')\n", "nyu_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `student` column is an anonymous identifier, `gpa` is the high school grade point average on a 4.0 scale, and `admit` is a binary variable coding whether the student was admitted to NYU (1 = admitted, 0 = not admitted). Taking what we learned in the last chapter, let's conduct a typical linear regression analysis to assess how GPA influences whether a student was admitted." ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "