{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Format and Structure of Digital Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "This chapter was authored by [Todd M. Gureckis](http://gureckislab.org/~gureckis) and [Kelsey Moty](http://kelseymoty.github.io) and is released under the [license for this book](../../LICENSE).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Video Lecture\n", "\n", "This video provides a complementary overview of this chapter. Some things are mentioned in the chapter but not in the video, and vice versa. Together they give an overview of this unit, so please both read and watch.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to think about data and organize it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data is an important concept in science. The first step is to measure something. In a [previous chapter we discussed issues in measurement](../../chapters/04/00-researchdesign), including different types of scales, units, and so on. However, the main point is that data are born when a number is assigned to some observation we make. Here is a lonely single number, which perhaps measures something:\n", "\n", "
\n", " \n", "
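In Python, such a lonely observation is just a number bound to a variable. A minimal sketch (the value here is invented purely for illustration):

```python
# A single made-up measurement: data is born when we write the number down
measurement = 98.6

print(measurement)
```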
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things get more interesting when we make multiple observations and so have many data. In order to do anything with more than one number, though, we start running into the question of how to organize things. How do we keep track of which number was collected first, second, or third, for instance? If we have just a big jumble of numbers we can easily get lost. \n", "\n", "Before we get too many numbers we need to start thinking abstractly about **organizing** our measurements into some type of collection. In our previous chapter on the [basics of Python](../../chapters/03/00-python) we discussed the concept of [collections](chapters/03/00-python.html#collections). Collections are things in the native Python language space that organize multiple numbers, strings, or even other collections. These include [lists](chapters/03/00-python.html#lists) (which organize things in some order), [dictionaries](chapters/03/00-python.html#dictionaries) (which do not preserve order but allow \"looking up\" a value by a special address known as a key), and [sets](chapters/03/00-python.html#sets) (which are unordered collections of things useful for counting and for performing set operations like union). \n", "\n", "\n", "For instance, if we measured a number repeatedly over time, a list might be a useful way to organize the measurements:\n", "\n", "
\n", " \n", "
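The idea can be sketched in plain Python (the values are invented for illustration): a list keeps repeated measurements in the order they were collected.

```python
# Three made-up measurements of the same quantity, recorded in order
measurements = [98.6, 99.1, 98.9]

# Lists preserve order, so position encodes when each value was collected
first = measurements[0]    # the earliest measurement
latest = measurements[-1]  # the most recent measurement
print(first, latest)
```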
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's imagine the numbers above represent some measurement on a person on three different days (Monday, Tuesday, Wednesday). It might be their blood pressure or temperature. Learning a lot about one person is nice and fits cleanly into a list. However, things often get more complex. What if we instead have multiple people we are recording information from, so the data starts looking two-dimensional? Perhaps there are three people in the study and each has three measurements. In that case we might organize the data as a list of lists (or a matrix):\n", "\n", "
\n", " \n", "
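A rough sketch of this in plain Python (the people and values are invented): each inner list holds one person's measurements for the three days.

```python
# Made-up data: rows are people, columns are days (Mon, Tue, Wed)
data = [
    [98.6, 99.1, 98.9],  # person 0
    [97.9, 98.2, 98.4],  # person 1
    [99.0, 98.7, 98.8],  # person 2
]

row = data[1]       # all three measurements for the second person
value = data[1][2]  # that person's Wednesday measurement
print(row, value)
```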
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although a matrix is a nice way to organize multiple observations made on multiple people, it can get a little bit confusing. Are the rows or the columns the subjects in the example above? You have to remember to write that down, otherwise you could easily get confused. What is this data? What does it describe? For this reason we might look beyond standard Python collections to more complex structures. \n", "\n", "For example, you are all likely familiar with the concept of a spreadsheet (e.g., Microsoft Excel or Google Sheets). These are nicer than matrices because they have named rows and columns, so just by looking at the structure of the spreadsheet you can often learn more about the structure of the data. Column names are sometimes known as **metadata** because they are data themselves that *describe* other data.\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "This is much nicer because the column names help us remember what each measurement records and also what is contained in the rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two-dimensional structure of a spreadsheet is generally the most complex type of data that we will deal with. However, just so you are aware, as data gets really big it can make sense to take other organizational approaches. For instance, a website with millions of users reviewing movies might not want to make one long spreadsheet recording each user, the rating they gave, and the movie the rating was about. Stop and think for a moment how you could do that in a spreadsheet. Perhaps you could make each row a user and each column a movie. However, as you know, Netflix and other sites have hundreds of thousands of TV shows and movies, so the data would be really unwieldy. Also, most of the data would be \"missing\" because people only provide a rating for a small subset of the total number of movies. As a result, big websites adopt alternative ways of organizing data, including the use of what are known as **relational databases**. \n", "\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relational databases are effectively made up of multiple spreadsheets, where each one represents only a smaller piece of the data. In the figure above, the green words are columns of a smaller spreadsheet. This database is thus organized into four separate spreadsheets (or \"tables\"). For instance, the ratings table has a unique id number (i.e., row number) and then the rating a person gave, a user id for who gave it, and a movie id for which movie it was about. Separately, there is a movies table which has its own id (a unique id for each movie), the title or name of the movie, and a description/summary of the movie. The orange lines reflect links where the value in one column of one table connects with or refers to the entries of another table. This can be a much more efficient way to organize the data. \n", "\n", "The main point this example brings forward is that the way you organize your data is something you really have to think about and plan from the beginning. I've found that this topic is actually intuitively interesting to many students. The reason is that we love organizing our homes and offices. It feels great. When it comes to data, bringing that same mentality (the \"fresh\" feeling of being organized) is really key to making good scientific analyses that do not have bugs. Problems with data organization can even be deadly ([How not to kill people with spreadsheets](https://foreignpolicy.com/2020/10/08/uk-coronavirus-covid-spreadsheet-excel-error-outsourcing/)). If you get really interested in organizing data, there are [entire books](https://www.amazon.com/Discipline-Organizing-MIT-Press/dp/0262518503) {cite}`Glushko:2013` and fields of study about how best to organize data (e.g., library and information sciences). The choices you make in how to organize your data at one point in time really influence how easy it is to do certain analyses later. More on this later."
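The linking idea can be sketched in plain Python with two tiny made-up tables (all ids, names, and values here are invented): "following an orange line" is just looking an id up in another table.

```python
# Made-up movies table: unique movie id -> record
movies = {
    1: {"name": "Movie A", "description": "A made-up drama"},
    2: {"name": "Movie B", "description": "A made-up comedy"},
}

# Made-up ratings table: each rating refers to a movie by movie_id
ratings = [
    {"id": 0, "rating": 5, "user_id": 10, "movie_id": 1},
    {"id": 1, "rating": 3, "user_id": 11, "movie_id": 2},
]

# Follow the link: which movie does the first rating describe?
name = movies[ratings[0]["movie_id"]]["name"]
print(name)  # Movie A
```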
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common file formats for data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data often come in files with a particular formatted structure. This formatted structure makes it easier for computer programs to read in the data and manipulate it. In this section we will go over a couple of the common data formats you might encounter in traditional psychological research." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Excel Workbooks (.xls, .xlsx files)\n", "The XLSX file format (or XLS in older versions of Excel) is the default file format used in Excel. Under the hood, Excel Workbooks are built using a highly structured markup language called Extensible Markup Language (XML). Essentially, this means that while you are using Excel's graphical interface to edit your data, Excel is adding a bunch of XML tags behind the scenes to the XLSX file so that it knows how to properly format the data each time you open it up. These XML tags are what allow you to change font colors and sizes, add graphs and images, and use formulas. \n", "\n", "This is what you see when you use Excel: \n", "
\n", " \n", "
\n", "\n", "\n", "And this is what the exact same Workbook looks like behind the scenes:\n", "
\n", " \n", "
\n", "\n", "But this complexity can also make it difficult to use XLSX files with any other software or programming language that isn't Excel. For one, not all programs can open XLSX files, so these data files can make your data inaccessible to other people. Two, all of this special formatting can sometimes lead to problems when you read in data from XLSX files (e.g., converting variables to weird formats). To avoid issues like these, it is preferable to store your data using plain-text formats (such as CSV and TSV files). \n", "\n", "\n", "#### Comma-separated Values (.csv files)\n", "CSV, or Comma-separated value files, on the other hand, are written using a plain-text format. No special tags, just text. The only formatting in CSV files are the commas used to delimit different values for columns and new lines to indicate rows. You can see this if you open up a CSV file using Notepad (or other text editors). \n", "\n", "\n", "Here's the same dataset as before but now as a CSV file:\n", "
\n", " \n", "
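The plain-text structure can also be seen in code. Here is a hedged sketch using Python's built-in `csv` module on a tiny invented dataset:

```python
import csv
import io

# A tiny made-up CSV: commas separate columns, newlines separate rows
text = "salary,age\n86285,64\n77125,58\n"

rows = list(csv.reader(io.StringIO(text)))
print(rows)
# Note that every value comes back as a string; converting numbers
# to numeric types is up to you.
```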
\n", "\n", "This means that what you can store within a CSV file is quite limited (no images or special formatting, for example). But it also means that this kind of file is great for storing and sharing data. CSV files can be opened by any software, making it accessible to others. Given that CSV files inherently have no formatting, some of the issues that come with using XLSX files never arise. \n", "\n", "#### Tab-separated Values (.tsv files)\n", "TSV files are very similar to CSV files, except that instead of using a comma to delimit between values in a dataset, it uses tabs (a special type of spacing, the same type you get when you hit the tab key on your computer).\n", "\n", "```{seealso} \n", "It is a little bit beyond the scope of our class, but another file format you might encounter is referred to as JSON which stands for Javascript Object Notation (JSON). JSON is similar to a python dictionary and is stored in a plain text file like a CSV. However, JSON is somewhat more flexible than the two-dimensional structure of a spreadsheet. Often one analysis step is to convert JSON into a 2D spreadsheet-type structure. [Here](https://realpython.com/python-json/) is a helpful guide to JSON in python.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting from spreadsheet programs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Google Sheets \n", "\n", "To export a .csv file from Google Sheets, you click on File > Download > Comma-separated values (.csv). \n", "\n", "
\n", "\n", "If you created a Google Sheet with multiple sheets, you will have to save each individual sheet as a .csv file because .csv files do not support multiple sheets. To do this, you will have to click on each sheet and go through the save process for each sheet. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excel\n", "\n", "To create a .csv file from Excel, click on File (found on the top left of the menu bar).\n", "\n", "
\n", "\n", "A new menu will appear. From there, select \"Save as\" and then choose where on your computer you want to save the file. A pop-up window will open up, and from there, you can choose what to name the file and what kind of file type to save it as. To save it as a .csv, use the dropdown menu (labeled \"Save as type\") to select which kind of file type you would like to save the file as. \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uploading csv files to JupyterHub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of the datasets you will be working with this semester will already be available for you to use on JupyterHub. However, if you want to work with your own datasets, you will need to upload them yourselves to the cluster.\n", "\n", "To do that, go to the \"Files\" tab on JupyterHub and use the \"Upload\" button. \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Pandas library and the concept of a dataframe API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
\n", "\n", "Throughout this class there are several libraries (i.e., tools which extend the capabilities of the core Python language) that we will use a lot. One of those is [Pandas](https://pandas.pydata.org). Pandas is an open-source project for Python that provides data analysis and manipulation tools. Pandas gives you a way to interact with a dataset organized like a spreadsheet (meaning columns and rows which are possibly named or labeled with metadata) in order to perform sophisticated data analyses. Pandas is really the \"backbone\" of the Python data science and modeling environment. Pandas could be thought of as a language (or [API](https://en.wikipedia.org/wiki/API)) for unlocking tabular data. **If you want to become better at data analysis with Python there is no single package I can think of that is worth more of your time learning.**\n", "\n", "Although there is no required book for this class, \"[Learning the Pandas Library: Python Tools for Data Munging, Analysis and Visualization](https://www.amazon.com/Learning-Pandas-Library-Munging-Analysis/dp/153359824X)\" by Matt Harrison {cite}`Harrison:2016` is highly recommended, both because it is short and to the point and because it explains key Pandas concepts very clearly. In addition, the Pandas project includes a very helpful [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) which explains many of the key concepts.\n", "\n", "```{admonition} Pandas can be confusing at first!\n", ":class: tip\n", "I was aware of Pandas for several years and never really \"understood it.\" However, when it clicked it opened a universe of data analysis to me. 
It takes some patience, in part because data manipulation is a very complex topic, at least at the conceptual level.\n", "```\n", "\n", "The goal of this guide is not to give a complete description of everything in Pandas but to give you a sense of how you use Pandas (with code) to do the types of tasks you would typically do in a spreadsheet like Excel or Google Sheets. In addition, we show a few key features of Pandas that are not easy to do in Excel but are very useful for behavioral data analysis.\n", "\n", "The first step of understanding Pandas is to load it:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two code lines above are the typical way to load in the Pandas library. [Numpy](https://numpy.org) is a related library that Pandas is built on top of (it really is like a Russian doll where one project enables the next!). Traditionally, pandas is imported 'as' **pd** and numpy 'as' **np**. This means that when you are reading code online, if you see `pd.somefunction()` you can probably assume that the function is part of the pandas library because the function begins with the `pd.` syntax." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data (e.g., csvs) into Python using Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before getting further with Pandas it helps to first understand how to read some data into a dataframe object or how to create a dataframe from scratch. Usually you are reading data in from some other file (e.g., a CSV file) and \"loading\" it into Pandas. Pandas has many different ways of reading in different file types. 
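As a minimal sketch of the from-scratch route (all values here are invented), a dataframe can be built out of a plain Python dictionary whose keys become column names:

```python
import pandas as pd

# Keys become column names; lists become that column's values (made-up data)
toy_df = pd.DataFrame({
    "name": ["anna", "ben", "carla"],
    "age": [64, 58, 38],
})

print(toy_df.shape)  # (3, 2): three rows, two columns
```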
Here, we will use `pd.read_csv()` because we will mostly be working with .csv files in this class.\n", "\n", "When reading in your .csv file, there are two things you absolutely have to do:\n", "\n", "First, **you need to store the data into a variable.** Otherwise, you won't be able to work with your data. Here, we called the dataframe \"df\" but you can name it whatever you want. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# incorrect: won't store your data\n", "pd.read_csv('salary.csv')\n", "\n", "# correct: creates a dataframe called df that you can work with \n", "df = pd.read_csv('salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, **you need to tell `pd.read_csv` where the file you are trying to import is located.** The path can point to a file located in a folder in your local environment (e.g., on your computer, on JupyterHub) or to a file available online. \n", "\n", "To point to a file available online, you'll put the link to the file. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are pointing to a file on JupyterHub (or to a file on your computer, if you have Python installed on your computer), you'll need to specify the path to the file. If the .csv file and your notebook are in the same folder, you only have to put the name of the file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the file is located in a different folder than the notebook you are working with, you will have to specify which folder(s) the computer needs to look at to find it. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('folder1/folder2/salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, your .csv file might be saved in a folder that's not in the same folder as your notebook. For example, say you have a folder called \"project\". Inside that folder, there is a folder called \"code\" that contains your notebooks/Python code, as well as a folder called \"data\" that contains your data (.csv) files. To import your .csv file, you need to use `..` to tell your computer to go up one folder in your directory (out of the \"code\" folder) in order for it to find the \"data\" folder. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `pd.read_csv` assumes that the file uses commas to differentiate between different entries in the file. But you can also explicitly tell `pd.read_csv` that you want to use commas as the delimiter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pd.read_csv` also assumes by default that the first row in your .csv files lists the names for each column of data (the column headers). You can explicitly tell `pd.read_csv` to do this by writing:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\", header = 'infer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, datasets may have multiple headers (e.g., both the first and second rows of the dataset list column names). `pd.read_csv` allows you to keep both rows as headers by modifying the `header` argument with a list of integers. 
Remember that 0 means the first row, because Python counts from zero. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\", header = [0,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When creating your own datasets, it's generally best practice to give your columns headers, so that it's easier for people looking at your data (including yourself!) to know what's in the dataset. However, you may occasionally come across datasets that do not have headers. To tell `pd.read_csv` there's no header to import, set `header` to `None`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\", header = None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{warning} \n", "When creating your own dataset, refrain from using characters like space or period (.) in the column names. This will make things easier for you down the line when using Pandas for statistical modeling.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing the data in a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we have loaded a dataset into a variable called `df`. Now we might like to look at the data to check that it was properly read in and also to learn more about the structure of this dataset. Perhaps the simplest method is simply to type the name of the dataframe variable by itself in a code cell:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
.....................
72536621neuro1.031.03
73571851stat9.039.07
74522541stat2.032.09
75618851math23.060.09
76495421math3.033.05
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23\n", ".. ... ... ... ... ... ...\n", "72 53662 1 neuro 1.0 31.0 3\n", "73 57185 1 stat 9.0 39.0 7\n", "74 52254 1 stat 2.0 32.0 9\n", "75 61885 1 math 23.0 60.0 9\n", "76 49542 1 math 3.0 33.0 5\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This outputs a \"table\" view of the dataframe showing the column names, and several of the rows of the dataset. It doesn't show you **all** of the data at once because in many large files this would be too much to make sense of.\n", "\n", "You can also specifically request the first several rows of the dataframe:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or the last several rows:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
72536621neuro1.031.03
73571851stat9.039.07
74522541stat2.032.09
75618851math23.060.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "72 53662 1 neuro 1.0 31.0 3\n", "73 57185 1 stat 9.0 39.0 7\n", "74 52254 1 stat 2.0 32.0 9\n", "75 61885 1 math 23.0 60.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"head\" of the dataframe is the top. The \"tail\" of the dataframe is the bottom.\n", "\n", "\n", "We can also access individual rows and columns of a data frame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accessing individual columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a single column you can index it like a dictionary in Python using the column name as a key:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 86285\n", "1 77125\n", "2 71922\n", "3 70499\n", "4 66624\n", " ... \n", "72 53662\n", "73 57185\n", "74 52254\n", "75 61885\n", "76 49542\n", "Name: salary, Length: 77, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['salary']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 64.0\n", "1 58.0\n", "2 38.0\n", "3 46.0\n", "4 41.0\n", " ... \n", "72 31.0\n", "73 39.0\n", "74 32.0\n", "75 60.0\n", "76 33.0\n", "Name: age, Length: 77, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another allowed syntax is to use the name of the dataframe variable and a `.columnname`. For instance:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 86285\n", "1 77125\n", "2 71922\n", "3 70499\n", "4 66624\n", " ... 
\n", "72 53662\n", "73 57185\n", "74 52254\n", "75 61885\n", "76 49542\n", "Name: salary, Length: 77, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.salary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the dictionary key-lookup method is preferred because it is possible for the name of a column to be the same as a dataframe method (see below), and that causes Python to get confused." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{seealso} \n", "When we learn about [visualizing data](../../chapters/06/00-plots), [describing data](../../chapters/07/00-describingdata), and [linear regression](../../chapters/13/00-linearregression) you will see how the column structure of dataframes becomes very powerful.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accessing individual rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the Python bracket notation is used to look up columns, a special command is needed to access rows. The best way to look up a single row is to use `.iloc[]` where you pass the integer row number you want to access (zero-indexed). 
So to get the first row you type:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "salary 86285\n", "gender 0\n", "departm bio\n", "years 26\n", "age 64\n", "publications 72\n", "Name: 0, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the 5th row:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "salary 66624\n", "gender 0\n", "departm bio\n", "years 11\n", "age 41\n", "publications 23\n", "Name: 4, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or counting three backwards from the end:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "salary 52254\n", "gender 1\n", "departm stat\n", "years 2\n", "age 32\n", "publications 9\n", "Name: 74, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[-3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexes and Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also note that there are two special elements of a normal dataframe called the **column index** and the **row index** (or just index). The row index is the column on the left that has no name but seems like a counter of the rows (e.g., 72, 73, 74, ...). The row index is useful in Pandas dataframes for looking things up by row. Although you can index a row by counting (access the fifth row for instance), the index can be made on arbitrary types of data including strings, etc... 
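For instance (a hedged sketch with a made-up dataframe), an index made of strings lets you look rows up by label with `.loc[]`, while `.iloc[]` still works by position:

```python
import pandas as pd

# Made-up dataframe whose row index holds strings instead of row numbers
scores = pd.DataFrame(
    {"salary": [86285, 57185], "age": [64, 39]},
    index=["bio", "stat"],
)

by_label = scores.loc["bio"]   # look up a row by its index label
by_position = scores.iloc[0]   # the same row, found by position
print(by_label["salary"], by_position["salary"])
```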
Typically you don't need to know a ton about indexes to use Pandas, but every once in a while they come up, so it is useful to know about the special nature of the index column." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=77, step=1)" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['salary', 'gender', 'departm', 'years', 'age', 'publications'], dtype='object')" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code shows how to find the row index and column index.\n", "\n", "You can change the row index to another column:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderyearsagepublications
departm
bio86285026.064.072
bio77125028.058.043
bio71922010.038.023
bio70499016.046.064
bio66624011.041.023
..................
neuro5366211.031.03
stat5718519.039.07
stat5225412.032.09
math61885123.060.09
math4954213.033.05
\n", "

77 rows × 5 columns

\n", "
" ], "text/plain": [ " salary gender years age publications\n", "departm \n", "bio 86285 0 26.0 64.0 72\n", "bio 77125 0 28.0 58.0 43\n", "bio 71922 0 10.0 38.0 23\n", "bio 70499 0 16.0 46.0 64\n", "bio 66624 0 11.0 41.0 23\n", "... ... ... ... ... ...\n", "neuro 53662 1 1.0 31.0 3\n", "stat 57185 1 9.0 39.0 7\n", "stat 52254 1 2.0 32.0 9\n", "math 61885 1 23.0 60.0 9\n", "math 49542 1 3.0 33.0 5\n", "\n", "[77 rows x 5 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.set_index('departm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or to reset it to a counter that count the row number use `.reset_index()`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
departmsalarygenderyearsagepublications
0bio86285026.064.072
1bio77125028.058.043
2bio71922010.038.023
3bio70499016.046.064
4bio66624011.041.023
.....................
72neuro5366211.031.03
73stat5718519.039.07
74stat5225412.032.09
75math61885123.060.09
76math4954213.033.05
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " departm salary gender years age publications\n", "0 bio 86285 0 26.0 64.0 72\n", "1 bio 77125 0 28.0 58.0 43\n", "2 bio 71922 0 10.0 38.0 23\n", "3 bio 70499 0 16.0 46.0 64\n", "4 bio 66624 0 11.0 41.0 23\n", ".. ... ... ... ... ... ...\n", "72 neuro 53662 1 1.0 31.0 3\n", "73 stat 57185 1 9.0 39.0 7\n", "74 stat 52254 1 2.0 32.0 9\n", "75 math 61885 1 23.0 60.0 9\n", "76 math 49542 1 3.0 33.0 5\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2=df.set_index('departm')\n", "df2.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the syntax we've used a few times here... we referenced the `df` variable which is the variable we created to store the data from our file. Then the `.functionname()` is known as a method of the data frame which is provided by pandas. For instance the `.head()` method prints out the first five rows of the data frame. The `.tail()` method prints out the last five rows of the data frame. There are many other methods available on data frames that you access by calling it using the `.functionname()` syntax. The next parts of this chapter explain some of the most common ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding and deleting things from a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes after we read in a dataset we might want to add new rows or columns or delete rows and columns from a dataframe. One way to do this is to edit the original CSV file that we read in. However, there is an important principle I want to emphasize thoughout this class: \n", "\n", "```{admonition} **ALWAYS do everything in code**\n", ":class: tip\n", "What do I mean by do everything in code? What I mean is that if you go to your data file that you got from some place and then by hand delete some of the data in Google Sheets or Excel, noone will know that you did that. 
There is no record of it. Once you save the file the data will be deleted and no one will know you did this. Instead, if you keep your data as \"raw\" as possible and modify it using code, your code will document ALL of the steps of your analysis, including the step of deleting data. \"Excluding\" (not deleting) data is sometimes justified, but the important thing is that we want to document all our steps honestly and truthfully when doing analysis. Using code to do every single step of an analysis helps us accomplish this.\n", "``` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To delete a row you can use the `.drop()` method to drop a particular item using its index value. The `.drop()` method is not an \"in place\" operation; instead, it returns a new dataframe with the particular rows removed.\n", "\n", "\n", "```{admonition} **\"In place\" operations**\n", ":class: tip\n", "From time to time using pandas you will hear about a method being described as \"in place\" or \"not in place.\" In place means that the change you are making to the dataframe is made to the dataframe variable itself. So for instance if you have a dataframe stored in a variable named `df` and you call `df.drop([1])` it will drop the row corresponding to the index value 1. However, if you look at `df` again using, for instance, `df.head()` you will see that it has not changed. To save the changes with a \"not in place\" operation you need to store the results in a new variable. For instance `df2 = df.drop([1])` will make a copy of `df`, drop the row corresponding to index 1, and then store the result in `df2`. If drop were an \"in place\" operation it would have actually modified `df` and so you wouldn't need to store the result in a new variable.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is how to use it. 
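To make the "in place" distinction concrete, here is a minimal sketch using a toy dataframe (not the salary data); note that `.drop()` also accepts an `inplace=True` argument that makes it modify the dataframe directly:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})

# Not in place (the default): .drop() returns a modified copy
df2 = df.drop([1])
print(len(df), len(df2))  # df still has 3 rows, df2 has 2

# With inplace=True the same change is made to df itself
df.drop([1], inplace=True)
print(len(df))            # now df has 2 rows
```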
Suppose we have the salary dataframe from above:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to delete the first we can drop it using the index it has, in this case 0:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
5644510bio23.060.044
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23\n", "5 64451 0 bio 23.0 60.0 44" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2=df.drop([0])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how I \"saved\" the modified dataframe into a new variable called `df2` using the equals sign. Then if we call the `.head()` method on `df2` we can see that the first row has been removed. `df`, the original dataframe variable, remains unchanged.\n", "\n", "You can also remove multiple rows by their index value at once:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
1771250bio28.058.043
3704990bio16.046.064
5644510bio23.060.044
7593440bio5.040.011
8585600bio8.038.08
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "1 77125 0 bio 28.0 58.0 43\n", "3 70499 0 bio 16.0 46.0 64\n", "5 64451 0 bio 23.0 60.0 44\n", "7 59344 0 bio 5.0 40.0 11\n", "8 58560 0 bio 8.0 38.0 8" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop([0,2,4,6]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I dropped the rows with index 0,2,4,6, and also show an example of **chaining** dataframe methods. \n", "\n", "\n", "```{admonition} **Chaining methods**\n", ":class: tip\n", "Dataframes in pandas a what is known as an object-oriented structure. This means that most the functionality of a dataframe is tied to the variables themselves. That is why in the previous examples we called `df.method()` like `df.drop()` instead of `pd.drop()` (calling from the base pandas library). Given this, most pandas methods either return themselves or a copy of the dataframe that has been altered. Thus you can \"chain\" operations and methods together to make the code more concise. Chaining means calling multiple methods in a row on a single line of code. `df.drop([0,2,4,6]).head()` calls the `drop()` method on the dataframe and then called the `head()` method on the resulting modified data frame. This means you don't have to store every intermediate step of your data manipulation into a new variable. \n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To delete a column instead of a row you can also use the `.drop()` method, using an additional argument that refers to the \"axis\" you are dropping from." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are a couple examples of dropping one or more columns by name. Note that the case of the column name must match and also you need to specific `axis=1` to refer to dropping columns instead of rows." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmagepublications
0862850bio64.072
1771250bio58.043
2719220bio38.023
3704990bio46.064
4666240bio41.023
\n", "
" ], "text/plain": [ " salary gender departm age publications\n", "0 86285 0 bio 64.0 72\n", "1 77125 0 bio 58.0 43\n", "2 71922 0 bio 38.0 23\n", "3 70499 0 bio 46.0 64\n", "4 66624 0 bio 41.0 23" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop('years',axis=1).head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmpublications
0862850bio72
1771250bio43
2719220bio23
3704990bio64
4666240bio23
\n", "
" ], "text/plain": [ " salary gender departm publications\n", "0 86285 0 bio 72\n", "1 77125 0 bio 43\n", "2 71922 0 bio 23\n", "3 70499 0 bio 64\n", "4 66624 0 bio 23" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(['years','age'],axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using pandas it is best to resist the urge to add rows one at a time to a dataframe. For various reasons is this not the ideal way to use pandas [^columnbased]. Instead you can combine the rows of two different dataframes into one. This might be useful in psychology for instance if each participant in your experiment had their own data file and you want to read each file into a dataframe and them combine them together to make one \"uber\" dataframe with all the data from your experiment.\n", "\n", "[^columnbased]: Internally, pandas is primarily organized using columns. This just means that the computer code underneath pandas likes adding columns and hates adding new rows one at a time. It is just a thing. Thus, adding one row at a time is very inefficient. As a result it is usually best to get all your rows set and then create the dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's two simple dataframes and we can combine them:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalary
0100
12723000
245100000
32335000
42160000
06050000
17023000
25360000
356135000
4800
\n", "
" ], "text/plain": [ " age salary\n", "0 10 0\n", "1 27 23000\n", "2 45 100000\n", "3 23 35000\n", "4 21 60000\n", "0 60 50000\n", "1 70 23000\n", "2 53 60000\n", "3 56 135000\n", "4 80 0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"age\": [10,27,45,23,21], \"salary\": [0,23000,100000,35000,60000]})\n", "df2 = pd.DataFrame({\"age\": [60,70,53,56,80], \"salary\": [50000,23000,60000,135000,0]})\n", "\n", "df_combined = pd.concat([df1,df2])\n", "\n", "df_combined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This only works because they have the same columns. If they have different columns then the missing entries of either are filled in with `NaN` which is the code for \"missing values\" in pandas. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalaryheight
0100.0NaN
12723000.0NaN
245100000.0NaN
32335000.0NaN
42160000.0NaN
060NaN5.2
170NaN6.0
253NaN5.7
356NaN3.4
480NaN4.6
\n", "
" ], "text/plain": [ " age salary height\n", "0 10 0.0 NaN\n", "1 27 23000.0 NaN\n", "2 45 100000.0 NaN\n", "3 23 35000.0 NaN\n", "4 21 60000.0 NaN\n", "0 60 NaN 5.2\n", "1 70 NaN 6.0\n", "2 53 NaN 5.7\n", "3 56 NaN 3.4\n", "4 80 NaN 4.6" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"age\": [10,27,45,23,21], \"salary\": [0,23000,100000,35000,60000]})\n", "df2 = pd.DataFrame({\"age\": [60,70,53,56,80], \"height\": [5.2,6.0,5.7,3.4,4.6]})\n", "\n", "df_combined = pd.concat([df1,df2])\n", "\n", "df_combined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will talk about dealing with \"missing\" values shortly but basically missing values in pandas allows for incomplete rows: you might have have information about every single field of a row and so you can uses `NaN` (stands for Not-a-number in computer speak) to represent missing values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have considered adding rows, now let's consider adding columns. This is actually pretty easy. You just assign some values to a new columns name. First we will create a data frame with two random columns:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2
00.0164170.555351
10.3236190.868004
20.7443490.767006
30.0233760.915717
40.4315460.761252
50.6688490.522965
60.1994760.161559
70.0719680.866112
80.4522590.650265
90.4490170.797710
\n", "
" ], "text/plain": [ " col_1 col_2\n", "0 0.016417 0.555351\n", "1 0.323619 0.868004\n", "2 0.744349 0.767006\n", "3 0.023376 0.915717\n", "4 0.431546 0.761252\n", "5 0.668849 0.522965\n", "6 0.199476 0.161559\n", "7 0.071968 0.866112\n", "8 0.452259 0.650265\n", "9 0.449017 0.797710" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"col_1\": np.random.rand(10), \"col_2\": np.random.rand(10)})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we simply assign a new column `df[\\'sum\\']` and define it to be the sum of two columns." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2sum
00.0164170.5553510.571768
10.3236190.8680041.191623
20.7443490.7670061.511355
30.0233760.9157170.939093
40.4315460.7612521.192798
50.6688490.5229651.191814
60.1994760.1615590.361035
70.0719680.8661120.938080
80.4522590.6502651.102524
90.4490170.7977101.246726
\n", "
" ], "text/plain": [ " col_1 col_2 sum\n", "0 0.016417 0.555351 0.571768\n", "1 0.323619 0.868004 1.191623\n", "2 0.744349 0.767006 1.511355\n", "3 0.023376 0.915717 0.939093\n", "4 0.431546 0.761252 1.192798\n", "5 0.668849 0.522965 1.191814\n", "6 0.199476 0.161559 0.361035\n", "7 0.071968 0.866112 0.938080\n", "8 0.452259 0.650265 1.102524\n", "9 0.449017 0.797710 1.246726" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['sum'] = df['col_1']+df['col_2']\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also define new columns to be a constant value:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2sumconstant
00.1896310.0144500.2040812
10.4553790.7810081.2363862
20.8797150.4267901.3065042
30.9420810.9054981.8475782
40.1894110.6318640.8212752
50.8134990.0691080.8826072
60.8842110.4675721.3517832
70.2323540.5931100.8254642
80.7622180.1760460.9382652
90.8727110.0537650.9264762
\n", "
" ], "text/plain": [ " col_1 col_2 sum constant\n", "0 0.189631 0.014450 0.204081 2\n", "1 0.455379 0.781008 1.236386 2\n", "2 0.879715 0.426790 1.306504 2\n", "3 0.942081 0.905498 1.847578 2\n", "4 0.189411 0.631864 0.821275 2\n", "5 0.813499 0.069108 0.882607 2\n", "6 0.884211 0.467572 1.351783 2\n", "7 0.232354 0.593110 0.825464 2\n", "8 0.762218 0.176046 0.938265 2\n", "9 0.872711 0.053765 0.926476 2" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['constant'] = 2\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are of course some limitations and technicalities here but for the most part you just name a new column and define it as above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting rows with missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we are talking about adding and removing rows and columns it can also make sense to discuss removing rows with missing data. You might want to for example drop any trial from an experiment where a subject didn't give a response before a timer deadline. In this case the column coding what the response was might be \"empty\" or \"missing\" and you would want to use the `NaN` value to indicate it was missing. To totally delete rows with any missing value use the `dropna()` function:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalary
0100.0
12723000.0
245NaN
32335000.0
42160000.0
\n", "
" ], "text/plain": [ " age salary\n", "0 10 0.0\n", "1 27 23000.0\n", "2 45 NaN\n", "3 23 35000.0\n", "4 21 60000.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"age\": [10,27,45,23,21], \"salary\": [0,23000,None,35000,60000]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see the salary value is missing for row with index 2 (age 45). To drop any row with a missing value:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalary
0100.0
12723000.0
32335000.0
42160000.0
\n", "
" ], "text/plain": [ " age salary\n", "0 10 0.0\n", "1 27 23000.0\n", "3 23 35000.0\n", "4 21 60000.0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several other tricks for dropping missing data besides this. For example, you can delete rows with only a specific column value missing, etc... However for now this should work for us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting individual values and changing them" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you want to just extract a single value or row from a dataframe or change a single value. The `.iat` method lets you pull out a single value from the data frame given the position of the index and column. The `.iat` method is one of the very weird parts of Pandas which has no real good explanation. Instead of the parentheses used to call normal methods, `.iat` and a few others use square brackets. One way to think about it is that `.iat` is like looking something up in a normal Python array but it also seems like a method attached to the dataframe." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
.....................
72536621neuro1.031.03
73571851stat9.039.07
74522541stat2.032.09
75618851math23.060.09
76495421math3.033.05
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23\n", ".. ... ... ... ... ... ...\n", "72 53662 1 neuro 1.0 31.0 3\n", "73 57185 1 stat 9.0 39.0 7\n", "74 52254 1 stat 2.0 32.0 9\n", "75 61885 1 math 23.0 60.0 9\n", "76 49542 1 math 3.0 33.0 5\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gets the value at row 0 column 0 which is the salary of the first person in the data frame." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "86285" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iat[0,0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gets the age of the person in the sixth row, fourth column." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53.0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iat[6,4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A slighly more reader-friendly option is to use `.at[]` which is a method that lets you look things up using the names of columns." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40.0" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.at[10, 'age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use this to **change** the value of a particular cell." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we see that the person in the first row has age 64." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `.at[]` we set it to 100." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "df.at[0,'age']=100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then when we look at the result the age has been changed in the dataframe `df`." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.0100.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 100.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way to remember this is that methods that look things up use square brackets. And if the method begins with an i (like `.iat[]`) it will look it up by integer index (i.e., the number of the column or row). Otherwise `.at[]` looks up by row (or a named index) and column name.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking the size/dimensions of your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One common thing you need to do is to verify the number of rows and columns in your dataframe. The `.shape` property can help with that." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(77, 6)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us that the dataframe contained in variable `df` has 77 rows and 6 columns. This is often helpful when you first read in a dataset to verify it has all the data you expected to be there!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} **Methods versus properties**\n", ":class: tip\n", "The `.shape` property doesn't include a final `()` unlike other methods we have learned about like `.drop()` which required parentheses. This reflects that size is known as a property of the dataframe while `.drop()` is a method. The conceptual differences can be confusing for why one is one way or the other. 
However, it is often helpful to think about them as the distinction between nouns and verbs in language. Properties (nouns) are static descriptors of a dataset such as the size of the dataset or the column names. In contrast, methods (verbs) are things that require computation on or modification of the dataframe, like deleting things or performing computations on columns or rows.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ultimately the steps we just covered recreate much of what you do with the graphical user interface in Excel (change cell values, add/delete rows and columns, etc...). The real power of Pandas comes from more complex things you can do with dataframes. That is what we will explore in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Things you can do with dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main goal of getting your data into a dataframe is that it enables several methods for manipulating your data in powerful ways." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oftentimes it can help us understand a dataset better if we can sort the rows of the dataset according to the values in one or more columns. For instance, in the salary data set we have been considering, it is hard to know who the highest and lowest paid faculty members are. One approach would be to sort the values." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can sort this dataset in ascending order with:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
23446870chem4.034.019
22470210chem4.034.012
76495421math3.033.05
53513910stat5.035.08
74522541stat2.032.09
.....................
14976300chem34.064.043
241048280geolNaN50.044
291057610neuro9.039.030
411064120stat23.053.029
281128000neuro14.044.033
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "23 44687 0 chem 4.0 34.0 19\n", "22 47021 0 chem 4.0 34.0 12\n", "76 49542 1 math 3.0 33.0 5\n", "53 51391 0 stat 5.0 35.0 8\n", "74 52254 1 stat 2.0 32.0 9\n", ".. ... ... ... ... ... ...\n", "14 97630 0 chem 34.0 64.0 43\n", "24 104828 0 geol NaN 50.0 44\n", "29 105761 0 neuro 9.0 39.0 30\n", "41 106412 0 stat 23.0 53.0 29\n", "28 112800 0 neuro 14.0 44.0 33\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('salary')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can easily see from this output at 44,687 is the lowest salary and 112,800 is the highest. `sort_values()` is **not** an inplace operation so the original dataframe is still unsorted and we have to store the sorted result in a new dataframe variable if we want to keep working with it." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head() # still unsorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can sort the other way by adding an additional parameter:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
281128000neuro14.044.033
411064120stat23.053.029
291057610neuro9.039.030
241048280geolNaN50.044
14976300chem34.064.043
.....................
74522541stat2.032.09
53513910stat5.035.08
76495421math3.033.05
22470210chem4.034.012
23446870chem4.034.019
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "28 112800 0 neuro 14.0 44.0 33\n", "41 106412 0 stat 23.0 53.0 29\n", "29 105761 0 neuro 9.0 39.0 30\n", "24 104828 0 geol NaN 50.0 44\n", "14 97630 0 chem 34.0 64.0 43\n", ".. ... ... ... ... ... ...\n", "74 52254 1 stat 2.0 32.0 9\n", "53 51391 0 stat 5.0 35.0 8\n", "76 49542 1 math 3.0 33.0 5\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('salary',ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if you sort by two columns it will do them in order (so the first listed column is sorted first then the second):" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
23446870chem4.034.019
22470210chem4.034.012
76495421math3.033.05
53513910stat5.035.08
74522541stat2.032.09
.....................
14976300chem34.064.043
241048280geolNaN50.044
291057610neuro9.039.030
411064120stat23.053.029
281128000neuro14.044.033
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "23 44687 0 chem 4.0 34.0 19\n", "22 47021 0 chem 4.0 34.0 12\n", "76 49542 1 math 3.0 33.0 5\n", "53 51391 0 stat 5.0 35.0 8\n", "74 52254 1 stat 2.0 32.0 9\n", ".. ... ... ... ... ... ...\n", "14 97630 0 chem 34.0 64.0 43\n", "24 104828 0 geol NaN 50.0 44\n", "29 105761 0 neuro 9.0 39.0 30\n", "41 106412 0 stat 23.0 53.0 29\n", "28 112800 0 neuro 14.0 44.0 33\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['salary','age'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this data set it is mostly the same to sort by salary first then age because most people don't have the same salary so that already provides an order. However if we do it the other way, i.e., age first then salary, it will order by people age and then for the people who are the same age sort by salary." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
72536621neuro1.031.03
74522541stat2.032.09
52536560stat2.032.04
56720440physics2.032.016
76495421math3.033.05
.....................
15824440chem31.061.042
0862850bio26.064.072
14976300chem34.064.043
16762910chem29.065.033
18647620chem25.0NaN29
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "76 49542 1 math 3.0 33.0 5\n", ".. ... ... ... ... ... ...\n", "15 82444 0 chem 31.0 61.0 42\n", "0 86285 0 bio 26.0 64.0 72\n", "14 97630 0 chem 34.0 64.0 43\n", "16 76291 0 chem 29.0 65.0 33\n", "18 64762 0 chem 25.0 NaN 29\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['age','salary'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see in this shortened output there are several people who are 32 in the database and their salaries are ordered from smallest to biggest.\n", "\n", "Sorting is easy to do in Pandas but also easy to do in Excel because there is a \"sort\" tool in such programs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arithmetic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps one of the most useful features of dataframes (and spreadsheets) is the ability to create formulas that compute new values based on the rows and columns. For instance if you had a dataframe that had rows for students and each column was the grade on an assignment a common operation might be to compute the average grade as a new column. Let's take a look at a simple example of this and then discuss arithmetic operations in Pandas more generally." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentassignment1assignment2assignment3
0001908289
1002808475
2003709689
\n", "
" ], "text/plain": [ " student assignment1 assignment2 assignment3\n", "0 001 90 82 89\n", "1 002 80 84 75\n", "2 003 70 96 89" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df = pd.DataFrame({'student':['001','002','003'], 'assignment1': [90, 80, 70], 'assignment2': [82,84,96], 'assignment3': [89,75,89]})\n", "grades_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not necessarily the easier way to **enter** this data (you might prefer to use a spreadsheet for that), but you could read in a csv to load the grades for instance. Next you would want to create the average grade for each student." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentassignment1assignment2assignment3average
000190828987.000000
100280847579.666667
200370968985.000000
\n", "
" ], "text/plain": [ " student assignment1 assignment2 assignment3 average\n", "0 001 90 82 89 87.000000\n", "1 002 80 84 75 79.666667\n", "2 003 70 96 89 85.000000" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df['average']=(grades_df['assignment1']+grades_df['assignment2']+grades_df['assignment3'])/3\n", "grades_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you can see here we added up the column for assignment 1, 2, and 3 and then divided by three. Then we assigned that resulting value to a new column called average. You might wonder how did Pandas know to do this for all three students? The answer is the power of **broadcasting** a feature of many programming languages that automatically detects when you are doing arithmetic operations on collections of numbers and then does that operation for **each entry** rather than like the first one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also broadcast the addition of constant values to a column. For instance to give all the students a five point bonus we could do this\"" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentassignment1assignment2assignment3average
000190828992.000000
100280847584.666667
200370968990.000000
\n", "
" ], "text/plain": [ " student assignment1 assignment2 assignment3 average\n", "0 001 90 82 89 92.000000\n", "1 002 80 84 75 84.666667\n", "2 003 70 96 89 90.000000" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df['average']=grades_df['average']+5\n", "grades_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, here it added 5 to *each entry* of the grades column rather than just one or the first row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically any math function can be composed of the columns. You might also be interested in functions you could compute down the columns rather than across them, however we will consider those in more detail in the later chapter on [](../../chapters/07/00-describingdata)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very useful and common feature for manipulating data is slicing. Slicing refers to selecting out subsets of a dataset for further analysis. For example we might want to plot only the salaries of the women in this data set. To do this we want to \"slice\" out a subset of the data and analyze that further. We saw slicing before in a previous chapter on Python programming with strings and lists where you can \"slice\" pieces of a collection out. Pandas takes this several steps further making it more functional for data composed of multiple rows and columns." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you remember we can slice a piece of a list by specifying the starting element and the ending element and including the `:` operator to indicate that we want the values \"in between\":" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['b', 'c', 'd']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_list = ['a','b','c','d','e','f','g']\n", "a_list[1:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same thing using the `.iloc[]` method on a data frame. However since a Pandas dataframe is two dimensional we can specify two slide ranges one for the rows and one for the columns. `iloc[]` is simialr to `iat[]` we say above in that it is used via square brackets (because it is a lookup indexing operation) and the `i` part refer to that we are providing the integer location of the rows and columns (i.e., using numbers to say which row and columns we want)." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartm
2719220bio
3704990bio
\n", "
" ], "text/plain": [ " salary gender departm\n", "2 71922 0 bio\n", "3 70499 0 bio" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:4,0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That example takes row 2 and 3 and columns 0,1,2 (remember Python is zero indexed so the first row or column is numbered zero)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also slice using column names using just `loc[]`:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderdepartmyearsagepublications
10bio28.058.043
20bio10.038.023
\n", "
" ], "text/plain": [ " gender departm years age publications\n", "1 0 bio 28.0 58.0 43\n", "2 0 bio 10.0 38.0 23" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1:2,\"gender\":]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to `.iat[]` and `.at[]`, `iloc[]` uses integer lookup and `loc[]` uses index names. You might wonder why I still used numbers for the rows in the above. This is because this dataframe still has an **index** for the rows which is integer based. However, indexes are an important part of Dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Related to slicing is \"selecting\" which is grabbing subsets of a dataframe's rows based on the **values** of some of the rows. This is different than slicing which takes little chunks out of a larger dataframe using indexes or column names. Here we are interested in selecting rows that meet a particular criterion. For instance in the professor salary dataset we read in we might want to select only the rows that represent women in the dataset. Let's look at an example of how to do that and then discuss the principle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to discuss the concept of logical operations which are broadcast along the columns. For instance if we write something like" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "3 False\n", "4 False\n", " ... 
\n", "72 False\n", "73 False\n", "74 False\n", "75 True\n", "76 False\n", "Name: age, Length: 77, dtype: bool" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age']>50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get a column of `True`/`False` values (also known as Boolean values since they take on two values) which reflect a test for each row if the age value is greater than 50. If it is, then `True` is entered into the new column and if it isn't then `False` is entered in.\n", "\n", "We can write more complex logical operations as well. For instance:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "72 False\n", "73 False\n", "74 False\n", "75 True\n", "76 False\n", "Name: age, Length: 77, dtype: bool" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df['age']>50) & (df['age']<70)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This expression does a logical `and` due to the `&` symbol and will be true if the age is greater than 50 AND less than 70." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The examples so far used a single row but we can also make combination using multiple columns. For instance we could select all the rows corresponding to professors that are male and under 50." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... 
\n", "72 True\n", "73 True\n", "74 True\n", "75 False\n", "76 True\n", "Length: 77, dtype: bool" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df['age']<50) & (df['gender']==1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to make an \"or\" you use the '|' (pipe) character instead of the '&'.\n", "\n", "Now that we have this boolean column we can use it to select subsets of the original dataframe:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
52536560stat2.032.04
56720440physics2.032.016
70559491chem4.034.012
72536621neuro1.031.03
74522541stat2.032.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "70 55949 1 chem 4.0 34.0 12\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['age']<35]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous line selects all the professors under 35. Notice the syntax here as it is kind of sensible. On the outer part we have `df[]` and in the middle of the bracket we provide the logical column/series as we just discussed. You can break them into two steps if you like:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
52536560stat2.032.04
56720440physics2.032.016
70559491chem4.034.012
72536621neuro1.031.03
74522541stat2.032.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "70 55949 1 chem 4.0 34.0 12\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "under35=df['age']<35\n", "df[under35]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes clear that we define the \"rule\" for what things are true and false in this column (`under35`) and then use it to select rows from the original dataframe.\n", "\n", "You use this a lot in data analysis because often a data file from a single subject in an experiment has trials you want to skip or analyze, or you might use it to select trials from particular subjects, or trials that meet a certain requirement (E.g., if a reaction time was too long or something). Thus it is important to bookmark this concept and we will return to it several times throughout the semester." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An event simpler synatx for this is provided by the `.query()` method on Pandas data frames. For example, using this command this is how you would select all the professors under age 35 in the the salary dataset:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
52536560stat2.032.04
56720440physics2.032.016
70559491chem4.034.012
72536621neuro1.031.03
74522541stat2.032.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "70 55949 1 chem 4.0 34.0 12\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('age<35')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is nice because it does the same thing as the above without all the extra brackets which sometimes contribute to typos. Here is a slightly more complex version using the logical and operator:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
70559491chem4.034.012
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "70 55949 1 chem 4.0 34.0 12" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"age<35 & departm=='chem'\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
70559491chem4.034.012
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "70 55949 1 chem 4.0 34.0 12" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"age<35 & departm=='chem' & publications == 12\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So nice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iteration refers to stepping through either the rows of the columns of your dataframe one by one and doing some thing with each row or column. We have encountered iteration before when we learned about Python for loops. If you remember, the typical for loop we had we iterated over a list. For example this code iterates down a list of values and prints each one." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n", "5\n", "6\n", "7\n", "8\n", "9\n" ] } ], "source": [ "for i in range(10):\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can iterate over both rows and columns of a dataframe since the format is two dimensional. This prints out the titles of each column." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "salary\n", "gender\n", "departm\n", "years\n", "age\n", "publications\n" ] } ], "source": [ "for column in df:\n", " print(column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above method only gets you the individual column name. To get the data within each column, you use a special methods called `.iteritems()`. 
For example:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "salary\n", "---\n", "0 86285\n", "1 77125\n", "2 71922\n", "3 70499\n", "4 66624\n", " ... \n", "72 53662\n", "73 57185\n", "74 52254\n", "75 61885\n", "76 49542\n", "Name: salary, Length: 77, dtype: int64\n", "gender\n", "---\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "72 1\n", "73 1\n", "74 1\n", "75 1\n", "76 1\n", "Name: gender, Length: 77, dtype: int64\n", "departm\n", "---\n", "0 bio\n", "1 bio\n", "2 bio\n", "3 bio\n", "4 bio\n", " ... \n", "72 neuro\n", "73 stat\n", "74 stat\n", "75 math\n", "76 math\n", "Name: departm, Length: 77, dtype: object\n", "years\n", "---\n", "0 26.0\n", "1 28.0\n", "2 10.0\n", "3 16.0\n", "4 11.0\n", " ... \n", "72 1.0\n", "73 9.0\n", "74 2.0\n", "75 23.0\n", "76 3.0\n", "Name: years, Length: 77, dtype: float64\n", "age\n", "---\n", "0 64.0\n", "1 58.0\n", "2 38.0\n", "3 46.0\n", "4 41.0\n", " ... \n", "72 31.0\n", "73 39.0\n", "74 32.0\n", "75 60.0\n", "76 33.0\n", "Name: age, Length: 77, dtype: float64\n", "publications\n", "---\n", "0 72\n", "1 43\n", "2 23\n", "3 64\n", "4 23\n", " ..\n", "72 3\n", "73 7\n", "74 9\n", "75 9\n", "76 5\n", "Name: publications, Length: 77, dtype: int64\n" ] } ], "source": [ "for columnname, columndata in df.iteritems():\n", " print(columnname)\n", " print('---')\n", " print(columndata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if you want to step through row-by-row use the `.iterrows()` method." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, salary 86285\n", "gender 0\n", "departm bio\n", "years 26\n", "age 64\n", "publications 72\n", "Name: 0, dtype: object)\n", "(1, salary 77125\n", "gender 0\n", "departm bio\n", "years 28\n", "age 58\n", "publications 43\n", "Name: 1, dtype: object)\n", "(2, salary 71922\n", "gender 0\n", "departm bio\n", "years 10\n", "age 38\n", "publications 23\n", "Name: 2, dtype: object)\n", "(3, salary 70499\n", "gender 0\n", "departm bio\n", "years 16\n", "age 46\n", "publications 64\n", "Name: 3, dtype: object)\n", "(4, salary 66624\n", "gender 0\n", "departm bio\n", "years 11\n", "age 41\n", "publications 23\n", "Name: 4, dtype: object)\n", "(5, salary 64451\n", "gender 0\n", "departm bio\n", "years 23\n", "age 60\n", "publications 44\n", "Name: 5, dtype: object)\n", "(6, salary 64366\n", "gender 0\n", "departm bio\n", "years 23\n", "age 53\n", "publications 22\n", "Name: 6, dtype: object)\n", "(7, salary 59344\n", "gender 0\n", "departm bio\n", "years 5\n", "age 40\n", "publications 11\n", "Name: 7, dtype: object)\n", "(8, salary 58560\n", "gender 0\n", "departm bio\n", "years 8\n", "age 38\n", "publications 8\n", "Name: 8, dtype: object)\n", "(9, salary 58294\n", "gender 0\n", "departm bio\n", "years 20\n", "age 50\n", "publications 12\n", "Name: 9, dtype: object)\n", "(10, salary 56092\n", "gender 0\n", "departm bio\n", "years 2\n", "age 40\n", "publications 4\n", "Name: 10, dtype: object)\n", "(11, salary 54452\n", "gender 0\n", "departm bio\n", "years 13\n", "age 43\n", "publications 7\n", "Name: 11, dtype: object)\n", "(12, salary 54269\n", "gender 0\n", "departm bio\n", "years 26\n", "age 56\n", "publications 12\n", "Name: 12, dtype: object)\n", "(13, salary 55125\n", "gender 0\n", "departm bio\n", "years 8\n", "age 38\n", "publications 9\n", "Name: 13, dtype: object)\n", "(14, salary 97630\n", "gender 
0\n", "departm chem\n", "years 34\n", "age 64\n", "publications 43\n", "Name: 14, dtype: object)\n", "(15, salary 82444\n", "gender 0\n", "departm chem\n", "years 31\n", "age 61\n", "publications 42\n", "Name: 15, dtype: object)\n", "(16, salary 76291\n", "gender 0\n", "departm chem\n", "years 29\n", "age 65\n", "publications 33\n", "Name: 16, dtype: object)\n", "(17, salary 75382\n", "gender 0\n", "departm chem\n", "years 26\n", "age 56\n", "publications 39\n", "Name: 17, dtype: object)\n", "(18, salary 64762\n", "gender 0\n", "departm chem\n", "years 25\n", "age NaN\n", "publications 29\n", "Name: 18, dtype: object)\n", "(19, salary 62607\n", "gender 0\n", "departm chem\n", "years 20\n", "age 45\n", "publications 34\n", "Name: 19, dtype: object)\n", "(20, salary 60373\n", "gender 0\n", "departm chem\n", "years 26\n", "age 56\n", "publications 43\n", "Name: 20, dtype: object)\n", "(21, salary 58892\n", "gender 0\n", "departm chem\n", "years 18\n", "age 48\n", "publications 21\n", "Name: 21, dtype: object)\n", "(22, salary 47021\n", "gender 0\n", "departm chem\n", "years 4\n", "age 34\n", "publications 12\n", "Name: 22, dtype: object)\n", "(23, salary 44687\n", "gender 0\n", "departm chem\n", "years 4\n", "age 34\n", "publications 19\n", "Name: 23, dtype: object)\n", "(24, salary 104828\n", "gender 0\n", "departm geol\n", "years NaN\n", "age 50\n", "publications 44\n", "Name: 24, dtype: object)\n", "(25, salary 71456\n", "gender 0\n", "departm geol\n", "years 11\n", "age 41\n", "publications 32\n", "Name: 25, dtype: object)\n", "(26, salary 65144\n", "gender 0\n", "departm geol\n", "years 7\n", "age 37\n", "publications 12\n", "Name: 26, dtype: object)\n", "(27, salary 52766\n", "gender 0\n", "departm geol\n", "years 4\n", "age 38\n", "publications 32\n", "Name: 27, dtype: object)\n", "(28, salary 112800\n", "gender 0\n", "departm neuro\n", "years 14\n", "age 44\n", "publications 33\n", "Name: 28, dtype: object)\n", "(29, salary 105761\n", "gender 0\n", "departm 
neuro\n", "years 9\n", "age 39\n", "publications 30\n", "Name: 29, dtype: object)\n", "(30, salary 92951\n", "gender 0\n", "departm neuro\n", "years 11\n", "age 41\n", "publications 20\n", "Name: 30, dtype: object)\n", "(31, salary 86621\n", "gender 0\n", "departm neuro\n", "years 19\n", "age 49\n", "publications 10\n", "Name: 31, dtype: object)\n", "(32, salary 85569\n", "gender 0\n", "departm neuro\n", "years 20\n", "age 46\n", "publications 35\n", "Name: 32, dtype: object)\n", "(33, salary 83896\n", "gender 0\n", "departm neuro\n", "years 10\n", "age 40\n", "publications 22\n", "Name: 33, dtype: object)\n", "(34, salary 79735\n", "gender 0\n", "departm neuro\n", "years 11\n", "age 41\n", "publications 32\n", "Name: 34, dtype: object)\n", "(35, salary 71518\n", "gender 0\n", "departm neuro\n", "years 7\n", "age 37\n", "publications 34\n", "Name: 35, dtype: object)\n", "(36, salary 68029\n", "gender 0\n", "departm neuro\n", "years 15\n", "age 45\n", "publications 33\n", "Name: 36, dtype: object)\n", "(37, salary 66482\n", "gender 0\n", "departm neuro\n", "years 14\n", "age 44\n", "publications 42\n", "Name: 37, dtype: object)\n", "(38, salary 61680\n", "gender 0\n", "departm neuro\n", "years 18\n", "age 48\n", "publications 20\n", "Name: 38, dtype: object)\n", "(39, salary 60455\n", "gender 0\n", "departm neuro\n", "years 8\n", "age 38\n", "publications 49\n", "Name: 39, dtype: object)\n", "(40, salary 58932\n", "gender 0\n", "departm neuro\n", "years 11\n", "age 41\n", "publications 49\n", "Name: 40, dtype: object)\n", "(41, salary 106412\n", "gender 0\n", "departm stat\n", "years 23\n", "age 53\n", "publications 29\n", "Name: 41, dtype: object)\n", "(42, salary 86980\n", "gender 0\n", "departm stat\n", "years 23\n", "age 53\n", "publications 42\n", "Name: 42, dtype: object)\n", "(43, salary 78114\n", "gender 0\n", "departm stat\n", "years 8\n", "age 38\n", "publications 24\n", "Name: 43, dtype: object)\n", "(44, salary 74085\n", "gender 0\n", "departm stat\n", 
"years 11\n", "age 41\n", "publications 33\n", "Name: 44, dtype: object)\n", "(45, salary 72250\n", "gender 0\n", "departm stat\n", "years 26\n", "age 56\n", "publications 9\n", "Name: 45, dtype: object)\n", "(46, salary 69596\n", "gender 0\n", "departm stat\n", "years 20\n", "age 50\n", "publications 18\n", "Name: 46, dtype: object)\n", "(47, salary 65285\n", "gender 0\n", "departm stat\n", "years 20\n", "age 50\n", "publications 15\n", "Name: 47, dtype: object)\n", "(48, salary 62557\n", "gender 0\n", "departm stat\n", "years 28\n", "age 58\n", "publications 14\n", "Name: 48, dtype: object)\n", "(49, salary 61947\n", "gender 0\n", "departm stat\n", "years 22\n", "age 58\n", "publications 17\n", "Name: 49, dtype: object)\n", "(50, salary 58565\n", "gender 0\n", "departm stat\n", "years 29\n", "age 59\n", "publications 11\n", "Name: 50, dtype: object)\n", "(51, salary 58365\n", "gender 0\n", "departm stat\n", "years 18\n", "age 48\n", "publications 21\n", "Name: 51, dtype: object)\n", "(52, salary 53656\n", "gender 0\n", "departm stat\n", "years 2\n", "age 32\n", "publications 4\n", "Name: 52, dtype: object)\n", "(53, salary 51391\n", "gender 0\n", "departm stat\n", "years 5\n", "age 35\n", "publications 8\n", "Name: 53, dtype: object)\n", "(54, salary 96936\n", "gender 0\n", "departm physics\n", "years 15\n", "age 50\n", "publications 17\n", "Name: 54, dtype: object)\n", "(55, salary 83216\n", "gender 0\n", "departm physics\n", "years 11\n", "age 37\n", "publications 19\n", "Name: 55, dtype: object)\n", "(56, salary 72044\n", "gender 0\n", "departm physics\n", "years 2\n", "age 32\n", "publications 16\n", "Name: 56, dtype: object)\n", "(57, salary 64048\n", "gender 0\n", "departm physics\n", "years 23\n", "age 53\n", "publications 4\n", "Name: 57, dtype: object)\n", "(58, salary 58888\n", "gender 0\n", "departm physics\n", "years 26\n", "age 56\n", "publications 7\n", "Name: 58, dtype: object)\n", "(59, salary 58744\n", "gender 0\n", "departm physics\n", "years 
20\n", "age 50\n", "publications 9\n", "Name: 59, dtype: object)\n", "(60, salary 55944\n", "gender 0\n", "departm physics\n", "years 21\n", "age 51\n", "publications 8\n", "Name: 60, dtype: object)\n", "(61, salary 54076\n", "gender 0\n", "departm physics\n", "years 19\n", "age 49\n", "publications 12\n", "Name: 61, dtype: object)\n", "(62, salary 82142\n", "gender 0\n", "departm math\n", "years 9\n", "age 39\n", "publications 9\n", "Name: 62, dtype: object)\n", "(63, salary 70509\n", "gender 0\n", "departm math\n", "years 23\n", "age 53\n", "publications 7\n", "Name: 63, dtype: object)\n", "(64, salary 60320\n", "gender 0\n", "departm math\n", "years 14\n", "age 44\n", "publications 7\n", "Name: 64, dtype: object)\n", "(65, salary 55814\n", "gender 0\n", "departm math\n", "years 8\n", "age 38\n", "publications 6\n", "Name: 65, dtype: object)\n", "(66, salary 53638\n", "gender 0\n", "departm math\n", "years 4\n", "age 42\n", "publications 8\n", "Name: 66, dtype: object)\n", "(67, salary 53517\n", "gender 2\n", "departm math\n", "years 5\n", "age 35\n", "publications 5\n", "Name: 67, dtype: object)\n", "(68, salary 59139\n", "gender 1\n", "departm bio\n", "years 8\n", "age 38\n", "publications 23\n", "Name: 68, dtype: object)\n", "(69, salary 52968\n", "gender 1\n", "departm bio\n", "years 18\n", "age 48\n", "publications 32\n", "Name: 69, dtype: object)\n", "(70, salary 55949\n", "gender 1\n", "departm chem\n", "years 4\n", "age 34\n", "publications 12\n", "Name: 70, dtype: object)\n", "(71, salary 58893\n", "gender 1\n", "departm neuro\n", "years 10\n", "age 35\n", "publications 4\n", "Name: 71, dtype: object)\n", "(72, salary 53662\n", "gender 1\n", "departm neuro\n", "years 1\n", "age 31\n", "publications 3\n", "Name: 72, dtype: object)\n", "(73, salary 57185\n", "gender 1\n", "departm stat\n", "years 9\n", "age 39\n", "publications 7\n", "Name: 73, dtype: object)\n", "(74, salary 52254\n", "gender 1\n", "departm stat\n", "years 2\n", "age 32\n", "publications 
9\n", "Name: 74, dtype: object)\n", "(75, salary 61885\n", "gender 1\n", "departm math\n", "years 23\n", "age 60\n", "publications 9\n", "Name: 75, dtype: object)\n", "(76, salary 49542\n", "gender 1\n", "departm math\n", "years 3\n", "age 33\n", "publications 5\n", "Name: 76, dtype: object)\n" ] } ], "source": [ "for row in df.iterrows():\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [substantially faster version](https://medium.com/@formigone/stop-using-df-iterrows-2fbc2931b60e) of `.iterrows()` is `.itertuples()` which returns a slightly different representation of each row:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas(Index=0, salary=86285, gender=0, departm='bio', years=26.0, age=64.0, publications=72)\n", "Pandas(Index=1, salary=77125, gender=0, departm='bio', years=28.0, age=58.0, publications=43)\n", "Pandas(Index=2, salary=71922, gender=0, departm='bio', years=10.0, age=38.0, publications=23)\n", "Pandas(Index=3, salary=70499, gender=0, departm='bio', years=16.0, age=46.0, publications=64)\n", "Pandas(Index=4, salary=66624, gender=0, departm='bio', years=11.0, age=41.0, publications=23)\n", "Pandas(Index=5, salary=64451, gender=0, departm='bio', years=23.0, age=60.0, publications=44)\n", "Pandas(Index=6, salary=64366, gender=0, departm='bio', years=23.0, age=53.0, publications=22)\n", "Pandas(Index=7, salary=59344, gender=0, departm='bio', years=5.0, age=40.0, publications=11)\n", "Pandas(Index=8, salary=58560, gender=0, departm='bio', years=8.0, age=38.0, publications=8)\n", "Pandas(Index=9, salary=58294, gender=0, departm='bio', years=20.0, age=50.0, publications=12)\n", "Pandas(Index=10, salary=56092, gender=0, departm='bio', years=2.0, age=40.0, publications=4)\n", "Pandas(Index=11, salary=54452, gender=0, departm='bio', years=13.0, age=43.0, publications=7)\n", "Pandas(Index=12, salary=54269, gender=0, departm='bio', 
years=26.0, age=56.0, publications=12)\n", "Pandas(Index=13, salary=55125, gender=0, departm='bio', years=8.0, age=38.0, publications=9)\n", "Pandas(Index=14, salary=97630, gender=0, departm='chem', years=34.0, age=64.0, publications=43)\n", "Pandas(Index=15, salary=82444, gender=0, departm='chem', years=31.0, age=61.0, publications=42)\n", "Pandas(Index=16, salary=76291, gender=0, departm='chem', years=29.0, age=65.0, publications=33)\n", "Pandas(Index=17, salary=75382, gender=0, departm='chem', years=26.0, age=56.0, publications=39)\n", "Pandas(Index=18, salary=64762, gender=0, departm='chem', years=25.0, age=nan, publications=29)\n", "Pandas(Index=19, salary=62607, gender=0, departm='chem', years=20.0, age=45.0, publications=34)\n", "Pandas(Index=20, salary=60373, gender=0, departm='chem', years=26.0, age=56.0, publications=43)\n", "Pandas(Index=21, salary=58892, gender=0, departm='chem', years=18.0, age=48.0, publications=21)\n", "Pandas(Index=22, salary=47021, gender=0, departm='chem', years=4.0, age=34.0, publications=12)\n", "Pandas(Index=23, salary=44687, gender=0, departm='chem', years=4.0, age=34.0, publications=19)\n", "Pandas(Index=24, salary=104828, gender=0, departm='geol', years=nan, age=50.0, publications=44)\n", "Pandas(Index=25, salary=71456, gender=0, departm='geol', years=11.0, age=41.0, publications=32)\n", "Pandas(Index=26, salary=65144, gender=0, departm='geol', years=7.0, age=37.0, publications=12)\n", "Pandas(Index=27, salary=52766, gender=0, departm='geol', years=4.0, age=38.0, publications=32)\n", "Pandas(Index=28, salary=112800, gender=0, departm='neuro', years=14.0, age=44.0, publications=33)\n", "Pandas(Index=29, salary=105761, gender=0, departm='neuro', years=9.0, age=39.0, publications=30)\n", "Pandas(Index=30, salary=92951, gender=0, departm='neuro', years=11.0, age=41.0, publications=20)\n", "Pandas(Index=31, salary=86621, gender=0, departm='neuro', years=19.0, age=49.0, publications=10)\n", "Pandas(Index=32, salary=85569, 
gender=0, departm='neuro', years=20.0, age=46.0, publications=35)\n", "Pandas(Index=33, salary=83896, gender=0, departm='neuro', years=10.0, age=40.0, publications=22)\n", "Pandas(Index=34, salary=79735, gender=0, departm='neuro', years=11.0, age=41.0, publications=32)\n", "Pandas(Index=35, salary=71518, gender=0, departm='neuro', years=7.0, age=37.0, publications=34)\n", "Pandas(Index=36, salary=68029, gender=0, departm='neuro', years=15.0, age=45.0, publications=33)\n", "Pandas(Index=37, salary=66482, gender=0, departm='neuro', years=14.0, age=44.0, publications=42)\n", "Pandas(Index=38, salary=61680, gender=0, departm='neuro', years=18.0, age=48.0, publications=20)\n", "Pandas(Index=39, salary=60455, gender=0, departm='neuro', years=8.0, age=38.0, publications=49)\n", "Pandas(Index=40, salary=58932, gender=0, departm='neuro', years=11.0, age=41.0, publications=49)\n", "Pandas(Index=41, salary=106412, gender=0, departm='stat', years=23.0, age=53.0, publications=29)\n", "Pandas(Index=42, salary=86980, gender=0, departm='stat', years=23.0, age=53.0, publications=42)\n", "Pandas(Index=43, salary=78114, gender=0, departm='stat', years=8.0, age=38.0, publications=24)\n", "Pandas(Index=44, salary=74085, gender=0, departm='stat', years=11.0, age=41.0, publications=33)\n", "Pandas(Index=45, salary=72250, gender=0, departm='stat', years=26.0, age=56.0, publications=9)\n", "Pandas(Index=46, salary=69596, gender=0, departm='stat', years=20.0, age=50.0, publications=18)\n", "Pandas(Index=47, salary=65285, gender=0, departm='stat', years=20.0, age=50.0, publications=15)\n", "Pandas(Index=48, salary=62557, gender=0, departm='stat', years=28.0, age=58.0, publications=14)\n", "Pandas(Index=49, salary=61947, gender=0, departm='stat', years=22.0, age=58.0, publications=17)\n", "Pandas(Index=50, salary=58565, gender=0, departm='stat', years=29.0, age=59.0, publications=11)\n", "Pandas(Index=51, salary=58365, gender=0, departm='stat', years=18.0, age=48.0, publications=21)\n", 
"Pandas(Index=52, salary=53656, gender=0, departm='stat', years=2.0, age=32.0, publications=4)\n", "Pandas(Index=53, salary=51391, gender=0, departm='stat', years=5.0, age=35.0, publications=8)\n", "Pandas(Index=54, salary=96936, gender=0, departm='physics', years=15.0, age=50.0, publications=17)\n", "Pandas(Index=55, salary=83216, gender=0, departm='physics', years=11.0, age=37.0, publications=19)\n", "Pandas(Index=56, salary=72044, gender=0, departm='physics', years=2.0, age=32.0, publications=16)\n", "Pandas(Index=57, salary=64048, gender=0, departm='physics', years=23.0, age=53.0, publications=4)\n", "Pandas(Index=58, salary=58888, gender=0, departm='physics', years=26.0, age=56.0, publications=7)\n", "Pandas(Index=59, salary=58744, gender=0, departm='physics', years=20.0, age=50.0, publications=9)\n", "Pandas(Index=60, salary=55944, gender=0, departm='physics', years=21.0, age=51.0, publications=8)\n", "Pandas(Index=61, salary=54076, gender=0, departm='physics', years=19.0, age=49.0, publications=12)\n", "Pandas(Index=62, salary=82142, gender=0, departm='math', years=9.0, age=39.0, publications=9)\n", "Pandas(Index=63, salary=70509, gender=0, departm='math', years=23.0, age=53.0, publications=7)\n", "Pandas(Index=64, salary=60320, gender=0, departm='math', years=14.0, age=44.0, publications=7)\n", "Pandas(Index=65, salary=55814, gender=0, departm='math', years=8.0, age=38.0, publications=6)\n", "Pandas(Index=66, salary=53638, gender=0, departm='math', years=4.0, age=42.0, publications=8)\n", "Pandas(Index=67, salary=53517, gender=2, departm='math', years=5.0, age=35.0, publications=5)\n", "Pandas(Index=68, salary=59139, gender=1, departm='bio', years=8.0, age=38.0, publications=23)\n", "Pandas(Index=69, salary=52968, gender=1, departm='bio', years=18.0, age=48.0, publications=32)\n", "Pandas(Index=70, salary=55949, gender=1, departm='chem', years=4.0, age=34.0, publications=12)\n", "Pandas(Index=71, salary=58893, gender=1, departm='neuro', years=10.0, 
age=35.0, publications=4)\n", "Pandas(Index=72, salary=53662, gender=1, departm='neuro', years=1.0, age=31.0, publications=3)\n", "Pandas(Index=73, salary=57185, gender=1, departm='stat', years=9.0, age=39.0, publications=7)\n", "Pandas(Index=74, salary=52254, gender=1, departm='stat', years=2.0, age=32.0, publications=9)\n", "Pandas(Index=75, salary=61885, gender=1, departm='math', years=23.0, age=60.0, publications=9)\n", "Pandas(Index=76, salary=49542, gender=1, departm='math', years=3.0, age=33.0, publications=5)\n" ] } ], "source": [ "for row in df.itertuples():\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data organization - tidy and wide formats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have seen how to import your data and the basic operations of dataframes, you might think that you're now ready to start analyzing and visualizing your data. Yet there's still quite a bit of work that has to be done cleaning up your dataset before you do this. Some of this cleanup might include fixing typos in your datasets, or filtering out participants with incomplete data. But even more importantly, you'll need to make sure that your data is organized and structured in a consistent manner. One way of organizing your data, referred to as **tidy format** (a term coined by Hadley Wickham), is particularly helpful for facilitating data analysis. \n", "\n", "To clarify what tidy format is, we will also talk about another common format of data called **wide format** that isn't very tidy. While there are some reasons to use wide format over tidy format when programming (e.g., some functions in R require data to be in wide format), wide format is more commonly used when working with interface-based software (such as Excel and SPSS).\n", "\n", "The clearest way of demonstrating the differences between wide and tidy formats is simply to look at datasets of both kinds. Take the example dataset below. 
It shows quiz scores for four participants across three timepoints plus some demographic data. It's currently in wide format: " ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegendertime1time2time3
0tom26m121515
1nick23m10912
2julie18f151314
3angela21f101012
\n", "
" ], "text/plain": [ " name age gender time1 time2 time3\n", "0 tom 26 m 12 15 15\n", "1 nick 23 m 10 9 12\n", "2 julie 18 f 15 13 14\n", "3 angela 21 f 10 10 12" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "data = [['tom', 26, \"m\", 12, 15, 15], \n", " ['nick', 23, \"m\", 10, 9, 12], \n", " ['julie', 18, \"f\", 15, 13, 14], \n", " ['angela', 21, \"f\", 10, 10, 12]] \n", "df_wide = pd.DataFrame(data, columns = ['name', 'age', 'gender', 'time1', 'time2', 'time3']) \n", "df_wide " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In wide format, each *individual* has their own row, and all data pertaining to that individual is contained within that row. For example, the first row has all of the data for Tom, including his scores across each of the three timepoints. \n", "\n", "Now let's compare this to tidy format:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegendertimepointscore
0tom26mtime112
4tom26mtime215
8tom26mtime315
1nick23mtime110
5nick23mtime29
9nick23mtime312
2julie18ftime115
6julie18ftime213
10julie18ftime314
3angela21ftime110
7angela21ftime210
11angela21ftime312
\n", "
" ], "text/plain": [ " name age gender timepoint score\n", "0 tom 26 m time1 12\n", "4 tom 26 m time2 15\n", "8 tom 26 m time3 15\n", "1 nick 23 m time1 10\n", "5 nick 23 m time2 9\n", "9 nick 23 m time3 12\n", "2 julie 18 f time1 15\n", "6 julie 18 f time2 13\n", "10 julie 18 f time3 14\n", "3 angela 21 f time1 10\n", "7 angela 21 f time2 10\n", "11 angela 21 f time3 12" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy = pd.melt(df_wide, id_vars=['name', 'age', 'gender'], var_name='timepoint', value_name='score') \n", "df_tidy = df_tidy.sort_values(ascending=False, by='name')\n", "df_tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this format, now each *observation* has their own row. That is, we now have three separate rows for Tom, with each row reflecting a single quiz score from one of the three timepoints. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is tidy data, and why use it? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pulling directly from Wickham's book chapter on tidy data (his book is written for R users, but the principles of tidying data works the same across languages), there are three rules that work together to make a dataset tidy ({cite}`Wickham:2017`): \n", "\n", "1. Each _observation_ must have its own row (observations could be _each person_, _each timepoint_, etc.)\n", "2. Each _variable_ must have its own column (variables are some kind of measurement: _gender_, _age_, _score_, etc.)\n", "3. Each _value_ must have its own cell (value are the actual measurement: _female_, _23 years_, _12 points_, etc.)\n", "\n", "
\n", " \n", "
\n", "Above, a visual representation of each of the three rules for tidy data: variables are in columns, observations are in rows, and values are in cells. Image borrowed from R for Data Science.\n", "\n", "So going back to our dataset from before, we can see that it follows each of these three rules. Each person and each timepoint has their own rows (Rule 1). Every type of data measured (gender, age, etc.) has their own column (Rule 2), and each cell within the dataframe has only one value (Rule 3).\n", "\n", "Let's look at some examples where one or more of these rules are being violated. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameage_gendertime1time2time3
0tom26_m121515
1nick23_m10912
2julie18_f151314
3angela21_f101012
\n", "
" ], "text/plain": [ " name age_gender time1 time2 time3\n", "0 tom 26_m 12 15 15\n", "1 nick 23_m 10 9 12\n", "2 julie 18_f 15 13 14\n", "3 angela 21_f 10 10 12" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_untidy = [['tom', \"26_m\", 12, 15, 15], \n", " ['nick', \"23_m\", 10, 9, 12], \n", " ['julie', \"18_f\", 15, 13, 14], \n", " ['angela', \"21_f\", 10, 10, 12]] \n", "df_untidy = pd.DataFrame(data_untidy, columns = ['name', 'age_gender', 'time1', 'time2', 'time3'])\n", "df_untidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, this dataset is messy because two variables (age and gender) have been coded into the same column. Having multiple variables in the same column in this way makes it much harder to calculate the average age of participants or get a gender breakdown for the sample. \n", "\n", "But this can easily be fixed by splitting the column into two. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametime1time2time3agegender
0tom12151526m
1nick1091223m
2julie15131418f
3angela10101221f
\n", "
" ], "text/plain": [ " name time1 time2 time3 age gender\n", "0 tom 12 15 15 26 m\n", "1 nick 10 9 12 23 m\n", "2 julie 15 13 14 18 f\n", "3 angela 10 10 12 21 f" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_untidy[['age', 'gender']] = df_untidy.age_gender.str.split(\"_\", expand=True)\n", "del df_untidy['age_gender']\n", "\n", "df_untidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's another case of untidy data, similar to one we've seen before. " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameaget1_mint1_maxt2_mint2_maxt3_mint3_max
0tom26121515171820
1nick23101912141118
2julie18152314181219
3angela21101112141530
\n", "
" ], "text/plain": [ " name age t1_min t1_max t2_min t2_max t3_min t3_max\n", "0 tom 26 12 15 15 17 18 20\n", "1 nick 23 10 19 12 14 11 18\n", "2 julie 18 15 23 14 18 12 19\n", "3 angela 21 10 11 12 14 15 30" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_messy = [['tom', 26, 12, 15, 15, 17, 18, 20], \n", " ['nick', 23, 10, 19, 12, 14, 11, 18], \n", " ['julie', 18, 15, 23, 14, 18, 12, 19], \n", " ['angela', 21, 10, 11, 12, 14, 15, 30]] \n", "df_messy = pd.DataFrame(data_messy, columns = ['name', 'age', 't1_min', 't1_max', 't2_min', 't2_max', 't3_min', 't3_max'])\n", "df_messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset violates the rules of a tidy dataset because a) there's more than one observation in each row, and b) we have multiple columns reflecting the same variable (the distinction between a minimum and maximum score). \n", "\n", "But we can fix it using `pd.melt` with a combination of few other things. In this tidied version, we now see that each variable (_name_, _age_, _time_, _min_ score, and _max_ score) all have their own columns. " ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmaxnameagetimemaxmin
0angela21t11110
1angela21t21412
2angela21t33015
3julie18t12315
4julie18t21814
5julie18t31912
6nick23t11910
7nick23t21412
8nick23t31811
9tom26t11512
10tom26t21715
11tom26t32018
\n", "
" ], "text/plain": [ "minmax name age time max min\n", "0 angela 21 t1 11 10\n", "1 angela 21 t2 14 12\n", "2 angela 21 t3 30 15\n", "3 julie 18 t1 23 15\n", "4 julie 18 t2 18 14\n", "5 julie 18 t3 19 12\n", "6 nick 23 t1 19 10\n", "7 nick 23 t2 14 12\n", "8 nick 23 t3 18 11\n", "9 tom 26 t1 15 12\n", "10 tom 26 t2 17 15\n", "11 tom 26 t3 20 18" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_messy = pd.melt(df_messy, id_vars=['name', 'age'], var_name='time_minmax', value_name='score')\n", "\n", "df_messy[['time','minmax']] = df_messy.time_minmax.str.split(\"_\", expand = True)\n", "del df_messy['time_minmax']\n", "\n", "df_messy = df_messy.pivot_table(index=['name','age','time'], columns='minmax', values='score')\n", "df_messy.reset_index(drop=False, inplace=True)\n", "df_messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't worry if the actual code to make the data tidy doesn't make sense right away (you will get there). We'll talk more later about strategies you can use to wrangle data into tidy formats using Python. What you should take away from these examples is the difference in how these datasets are structured before and after we tidied them. \n", "\n", "Cleaning up and rearranging datasets is an inevitable part of doing psychological research. Lots of data you work with may come to you in untidy formats, and you will have to clean that data up before you can analyze it. The good news is that knowing these principles of tidy, organized datasets will serve you well as you start to design your experiments! **The more you build tidy principles into how your data is recorded during the experiment, the more time you save later when it comes to processing your data in Python.**\n", "\n", "Also importantly, this discussion highlights one major purpose of the Pandas library: namely, to help you clean up and reorganize data. 
Commands like `pd.melt()` and `pd.pivot_table()` perform fairly complex operations which alter your view of a dataset and make it easier to perform certain types of analysis. These reorganizations of data can be very difficult to do by hand but are made straightforward by the dataframe tools that Pandas provides." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A case example: organizing data for a psychological experiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The discussion so far might strike you as somewhat abstract. However, principles of data formatting and organization are very important for psychological research. Let's consider the case most central to this course: organizing a data file for a psychological experiment conducted on several participants and composed of multiple trials and trial types. This is perhaps the most typical data format we will encounter in this work, and it is somewhat different from survey questions and other types of data, which often have a more complex structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep things simple, let's imagine that we are doing a decision-making experiment. On each trial of the experiment the subject will be offered a choice between a gamble and a sure amount of money. For instance the subject might be presented with an option like \"Would you prefer \\$10 now or a 50\\% chance to win 20 dollars, otherwise nothing?\" If you think about it a bit, this specific pair of options *on average* gives the same outcome, because the expected value of the uncertain option is $0.5*\\$20 + 0.5*0 = \\$10.$ If people were perfectly rational they should be indifferent between these two options. However, a common phenomenon is that people show a bias called **risk aversion**: they prefer a certain reward over a risky gamble. Thus, even though on average they get the same amount, they do not want to take the gamble for the larger amount of money. 
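That expected-value arithmetic is easy to check directly (a quick sketch using the numbers from the example above):

```python
# Expected value check for the example: $10 for sure vs. a 50% chance of $20 (else $0)
certain = 10
p_win, win_amount, lose_amount = 0.5, 20, 0

# Expected value of the gamble: probability-weighted average of its outcomes
ev_gamble = p_win * win_amount + (1 - p_win) * lose_amount
print(certain, ev_gamble)  # 10 10.0 -- equal on average, so a rational agent is indifferent
```
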
There are several explanations of this, but one intuitive one is that people don't like regret, and the feeling of ending up with nothing is so bad that it makes people prefer the certain \\$10." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So imagine we do an experiment attempting to explore risk aversion. We might present people with several of these types of gambles across multiple trials. The reason is that we want more than one measurement and want to vary things like the probability of the higher reward on the risky gamble (e.g., instead of a 50% chance we might want to explore a 10% chance of winning). In addition, we might want to change the magnitudes of the rewards across different trials. Finally we will want to run multiple people so that our conclusions aren't tied to one person. How should we structure our data file for this? You might think that we should first write the code for the experiment and then consider the data format. However, I find it is usually helpful to do the opposite: start with the data format in mind right from the start." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look back at the key ideas of a \"tidy\" dataset, because we definitely want to be tidy here.\n", "\n", "1. Each _observation_ must have its own row (observations could be _each person_, _each timepoint_, etc.)\n", "2. Each _variable_ must have its own column (variables are some kind of measurement: _gender_, _age_, _score_, etc.)\n", "3. 
Each _value_ must have its own cell (values are the actual measurements: _female_, _23 years_, _12 points_, etc.)\n", "\n", "So each _observation_ would be a trial, and so it needs its own row.\n", "\n", "Each _variable_ would be information about the trial, both the details of what the subject saw on that trial (e.g., the magnitude of the rewards and gambles) and also how they responded (reaction time, what choice they made, etc...).\n", "\n", "The rule that each _value_ must have its own cell means that each of these numbers needs its own cell and should not be combined with others. So let's make a pretend dataframe from scratch to illustrate this example. \n", "\n", "So let's imagine that there are 10 stimuli in the experiment. We can denote them as the value of the certain option and then the value of the high and low reward for the gamble. This means each stimulus/trial in the experiment can be described with four numbers:\n", "\n", "- certain value\n", "- upper risky value\n", "- lower risky value\n", "- probability of upper risky value\n", "\n", "We can define a few alternatives (let's say enough for 4 trials) using a python dictionary:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "stim={\"certain\": [10, 15, 25, 50], \n", " \"upper_risk\": [20, 40, 30, 32], \n", " \"lower_risk\": [0, 23, 29, 0], \n", " \"prob\": [0.5, 0.5, 0.25, 0.75]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we can pretend to generate some responses from one subject." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "responses = {\"reaction_time\":[200, 400, 500, 600], \n", " \"choice\": [0, 1, 1, 0]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can build up a dataframe for one subject. 
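As a quick sanity check first (a small sketch, not part of the final experiment code): because these are parallel lists with one entry per trial, each trial's full description can be read back out by indexing every list at the same position:

```python
# The stimulus dictionary from above: parallel lists, one entry per trial
stim = {"certain": [10, 15, 25, 50],
        "upper_risk": [20, 40, 30, 32],
        "lower_risk": [0, 23, 29, 0],
        "prob": [0.5, 0.5, 0.25, 0.75]}

# Pull out trial 0: a certain $10 versus a 50% chance of $20 (otherwise $0)
trial0 = {key: values[0] for key, values in stim.items()}
print(trial0)  # {'certain': 10, 'upper_risk': 20, 'lower_risk': 0, 'prob': 0.5}
```

Keeping the lists aligned by position this way is exactly what lets us assemble tidy rows trial by trial.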
We will do this by first defining a python dictionary called `trials` which includes all the _variables_ associated with one measurement (trial number, the value of the certain gamble, the probability, etc...) as well as all the measurements made on that trial (response time, choice). These start out at the beginning of the cell as empty lists and then a for loop below fills them in." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'certain': [10, 15, 25, 50],\n", " 'choice': [0, 1, 1, 0],\n", " 'lower_risk': [0, 23, 29, 0],\n", " 'prob': [0.5, 0.5, 0.25, 0.75],\n", " 'reaction_time': [200, 400, 500, 600],\n", " 'subject_id': ['some_subject_id0',\n", " 'some_subject_id0',\n", " 'some_subject_id0',\n", " 'some_subject_id0'],\n", " 'trial_num': [0, 1, 2, 3],\n", " 'upper_risk': [20, 40, 30, 32]}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subject_id = \"some_subject_id0\"\n", "\n", "trials = {'subject_id':[],\n", " 'trial_num':[],\n", " 'certain':[],\n", " 'upper_risk':[],\n", " 'lower_risk':[],\n", " 'prob':[],\n", " 'reaction_time':[],\n", " 'choice':[]}\n", "\n", "for i in range(4):\n", " trials['subject_id'].append(subject_id)\n", " trials['trial_num'].append(i)\n", " trials['certain'].append(stim['certain'][i])\n", " trials['upper_risk'].append(stim['upper_risk'][i])\n", " trials['lower_risk'].append(stim['lower_risk'][i])\n", " trials['prob'].append(stim['prob'][i])\n", " trials['reaction_time'].append(responses[\"reaction_time\"][i])\n", " trials['choice'].append(responses[\"choice\"][i])\n", "\n", "trials" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a python dictionary that includes the column names and the data that goes in each column, which is perfect for initializing a Pandas dataframe:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idtrial_numcertainupper_risklower_riskprobreaction_timechoice
0some_subject_id00102000.502000
1some_subject_id011540230.504001
2some_subject_id022530290.255001
3some_subject_id03503200.756000
\n", "
" ], "text/plain": [ " subject_id trial_num certain upper_risk lower_risk prob \\\n", "0 some_subject_id0 0 10 20 0 0.50 \n", "1 some_subject_id0 1 15 40 23 0.50 \n", "2 some_subject_id0 2 25 30 29 0.25 \n", "3 some_subject_id0 3 50 32 0 0.75 \n", "\n", " reaction_time choice \n", "0 200 0 \n", "1 400 1 \n", "2 500 1 \n", "3 600 0 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.DataFrame(trials)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is great and shows a very natural way to structure the data from an experiment. Each trial is a row in the dataframe. Each column either describes what the stimulus was on that trial or the response from the participant. In addition two columns code the trial number and the subject id. This is critical for analysis as we might be interested in order effects on trials and so we need to know which trials were first or last. In addition we want to keep subjects separate. Eventually we will build up a more complex dataframe that includes multiple subjects. For example:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idtrial_numcertainupper_risklower_riskprobreaction_timechoice
0some_subject_id00102000.509931
1some_subject_id011540230.505870
2some_subject_id022530290.256261
3some_subject_id03503200.758350
4some_subject_id10102000.506790
5some_subject_id111540230.502051
6some_subject_id122530290.257980
7some_subject_id13503200.757421
8some_subject_id20102000.507291
9some_subject_id211540230.507900
10some_subject_id222530290.252070
11some_subject_id23503200.755261
\n", "
" ], "text/plain": [ " subject_id trial_num certain upper_risk lower_risk prob \\\n", "0 some_subject_id0 0 10 20 0 0.50 \n", "1 some_subject_id0 1 15 40 23 0.50 \n", "2 some_subject_id0 2 25 30 29 0.25 \n", "3 some_subject_id0 3 50 32 0 0.75 \n", "4 some_subject_id1 0 10 20 0 0.50 \n", "5 some_subject_id1 1 15 40 23 0.50 \n", "6 some_subject_id1 2 25 30 29 0.25 \n", "7 some_subject_id1 3 50 32 0 0.75 \n", "8 some_subject_id2 0 10 20 0 0.50 \n", "9 some_subject_id2 1 15 40 23 0.50 \n", "10 some_subject_id2 2 25 30 29 0.25 \n", "11 some_subject_id2 3 50 32 0 0.75 \n", "\n", " reaction_time choice \n", "0 993 1 \n", "1 587 0 \n", "2 626 1 \n", "3 835 0 \n", "4 679 0 \n", "5 205 1 \n", "6 798 0 \n", "7 742 1 \n", "8 729 1 \n", "9 790 0 \n", "10 207 0 \n", "11 526 1 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stim={\"certain\": [10, 15, 25, 50], \n", " \"upper_risk\": [20, 40, 30, 32], \n", " \"lower_risk\": [0, 23, 29, 0], \n", " \"prob\": [0.5, 0.5, 0.25, 0.75]}\n", "\n", "trials = {'subject_id':[],\n", " 'trial_num':[],\n", " 'certain':[],\n", " 'upper_risk':[],\n", " 'lower_risk':[],\n", " 'prob':[],\n", " 'reaction_time':[],\n", " 'choice':[]}\n", "\n", "def generate_subject_data(subject_id, trials):\n", " for i in range(4):\n", " trials['subject_id'].append(subject_id)\n", " trials['trial_num'].append(i)\n", " trials['certain'].append(stim['certain'][i])\n", " trials['upper_risk'].append(stim['upper_risk'][i])\n", " trials['lower_risk'].append(stim['lower_risk'][i])\n", " trials['prob'].append(stim['prob'][i])\n", " trials['reaction_time'].append(np.random.randint(200,1000))\n", " trials['choice'].append(np.random.randint(0,2))\n", " \n", "for subj in [\"some_subject_id0\",\"some_subject_id1\",\"some_subject_id2\"]:\n", " generate_subject_data(subj,trials)\n", " \n", "exp_df=pd.DataFrame(trials)\n", "exp_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, take a careful look at this output. 
This is a dataframe with three subjects' worth of data. The first column is the subject id. The second is the trial number; notice how it starts at 0, counts up to 3, and then restarts when a new subject id begins. Next are the stimulus definitions (certain, upper_risk, lower_risk, prob, etc...) and the responses. This is exactly the way people usually code a data file for a cognitive or perceptual experiment. It is \"tidy\" and includes all the information in a handy form we can use to analyze the data later. Here we **generated** the data with a simple program that I wrote in the cell above, but generally we write a more complex program to handle this, including showing the stimulus to the subject, recording their reaction time, and so forth. This simple program is a proxy for that more complex program but shows you the important principle of how to **organize** your data in preparation for analysis. The analysis itself is what we will contemplate over the next several chapters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Split-Apply-Combine Workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps one of the most common data analysis steps is known as the **split-apply-combine** workflow. Helping with this workflow is one of the more interesting and powerful features of Pandas (and other dataframe libraries).\n", "\n", "The **split-apply-combine** workflow refers to a very common sequence of data analysis steps that data analysts have to make. Usually a dataset arrives with multiple sub-units inside it. For example, we might get a .csv file that has all the trials of an experiment from several subjects. Often we want to perform analyses that aggregate across these various subject categories. For example we might want to analyze the performance of each subject. 
Or alternatively we might want to analyze the difficulty of each item on the test.\n", "\n", "Both of these analyses require us to **split** the larger dataframe up into pieces and perform an analysis on each of these pieces. The step where you perform this analysis is called **apply**. So we break the bigger dataframe up into pieces and apply an analysis to each piece. The final step is to **combine** the results of those analyses back into a new dataframe structure for further analysis.\n", "\n", "{numref}`split-apply-combine` tries to emphasize these steps:\n", "\n", "\n", "```{figure} ./images/split-apply-combine.png\n", ":width: 800px\n", ":name: split-apply-combine\n", "\n", "Illustration of the split-apply-combine workflow!\n", "```\n", "\n", "On the left we begin with our original dataframe (`df`) which has two columns. One column is categorical and takes on values 'A', 'B', or 'C' while the second column is numeric. The first step of the split-apply-combine workflow is that we might want to split the original dataframe into smaller groups based on the value of one or more of the columns. For this example it makes sense to break the original dataframe into smaller dataframes called \"groups\" based on the value of `col1` (that is to say, whether the row is in category 'A', 'B', or 'C'). The method that does this in Pandas is `.groupby` and we will discuss it in a moment. The result of the groupby operation is a partitioning of the original dataframe into smaller sets (called groups) which are the rows of the original dataframe just reorganized. The reorganization is such that all the rows that share the same value on the specified column are grouped together. For instance in {numref}`split-apply-combine` all the rows that had value 'A' in `col1` are placed into the same group, separate from all the rows that had value 'B' in `col1` and so forth. This is the **split** operation.\n", "\n", "Next there are several things we can do to each of the groups. 
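Before turning to the Pandas methods, the whole split-apply-combine cycle can be sketched in a few lines of plain Python (using the same toy values as the example dataframe below):

```python
from collections import defaultdict

# The same toy data as the example dataframe: a categorical column and a numeric column
rows = [("A", 1), ("B", 2), ("C", 3), ("C", 4), ("B", 2), ("B", 5), ("A", 3)]

# split: gather the numeric values into groups keyed by the categorical value
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# apply + combine: reduce each group with sum() and collect the results
sums = {key: sum(values) for key, values in groups.items()}
print(sums)  # {'A': 4, 'B': 9, 'C': 7}
```

Pandas' `.groupby()` handles the split step for you, and its aggregation methods do the apply and combine steps in a single call.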
We can iterate over them in a for loop for instance (more on that later), but there are also methods that will quickly compute various descriptive statistics for each group, or allow you to \"apply\" a custom function to each group. This custom function could be quite complex, and so it allows you to do a lot of additional processing of the groups. The syntax for this tends to be very compact and helps to limit bugs, so it is helpful to learn.\n", "\n", "Let's begin with the simple example in {numref}`split-apply-combine`:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the dataframe:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2
0A1
1B2
2C3
3C4
4B2
5B5
6A3
\n", "
" ], "text/plain": [ " col1 col2\n", "0 A 1\n", "1 B 2\n", "2 C 3\n", "3 C 4\n", "4 B 2\n", "5 B 5\n", "6 A 3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df=pd.DataFrame({\"col1\":['A','B','C','C','B','B','A'],\"col2\": [1,2,3,4,2,5,3]})\n", "sac_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we will group the rows by the values in `col1`:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result here doesn't print out the individual groups but instead returns a `pandas.core.groupby.generic.DataFrameGroupBy` object. However we can print out the groups using iteration:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A\n", "-----\n", " col1 col2\n", "0 A 1\n", "6 A 3\n", "\n", "\n", "B\n", "-----\n", " col1 col2\n", "1 B 2\n", "4 B 2\n", "5 B 5\n", "\n", "\n", "C\n", "-----\n", " col1 col2\n", "2 C 3\n", "3 C 4\n", "\n", "\n" ] } ], "source": [ "for name, group_df in sac_df.groupby('col1'):\n", " print(name)\n", " print('-----')\n", " print(group_df)\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the rows have been sorted into groups based on the value in `col1`. Since there are three distinct/unique values in `col1` there are three groups here but if there were more distinct values there would be more groups.\n", "\n", "Next we compute the sum() which applies the sum to each group. We can chain these operations in a sequence:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2
col1
A4
B9
C7
\n", "
" ], "text/plain": [ " col2\n", "col1 \n", "A 4\n", "B 9\n", "C 7" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result here is a new dataframe where the sum of the values of `col2` for the different subgroups 'A','B', 'C' has been computed.\n", "\n", ":::{tip}\n", "Check the results by hand to make sure you understand what has been calculated here!\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok this example is fairly simple, what if we consider a more complex dataframe with more columns:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4
0AA11.10
1BA22.30
2CA312.00
3CB416.00
4BB222.21
5BB59.00
6AB30.50
\n", "
" ], "text/plain": [ " col1 col2 col3 col4\n", "0 A A 1 1.10\n", "1 B A 2 2.30\n", "2 C A 3 12.00\n", "3 C B 4 16.00\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "6 A B 3 0.50" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2=pd.DataFrame({\"col1\":['A','B','C','C','B','B','A'],\"col2\":['A','A','A','B','B','B','B'],\"col3\": [1,2,3,4,2,5,3],\"col4\": [1.1,2.3,12.,16,22.21,9,0.5]})\n", "sac_df_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we `.groupby()` column 1 (`col1`) we get the same groups as before, but notice that the groups now include all of the columns:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A\n", "-----\n", " col1 col2 col3 col4\n", "0 A A 1 1.1\n", "6 A B 3 0.5\n", "\n", "\n", "B\n", "-----\n", " col1 col2 col3 col4\n", "1 B A 2 2.30\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "\n", "\n", "C\n", "-----\n", " col1 col2 col3 col4\n", "2 C A 3 12.0\n", "3 C B 4 16.0\n", "\n", "\n" ] } ], "source": [ "for name, group_df in sac_df_2.groupby('col1'):\n", " print(name)\n", " print('-----')\n", " print(group_df)\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now the apply-combine step computes the sum for all the numeric columns." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
col1
A41.60
B933.51
C728.00
\n", "
" ], "text/plain": [ " col3 col4\n", "col1 \n", "A 4 1.60\n", "B 9 33.51\n", "C 7 28.00" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course you don't have to apply the sum to all the columns. If you just want to analyze `col4` you can select it out before applying the sum:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "col1\n", "A 1.60\n", "B 33.51\n", "C 28.00\n", "Name: col4, dtype: float64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1')['col4'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also group using more than one column as the grouping factors. For example in `sac_df_2` there are actually two columns that have discrete values and so you might want to make groups that are all combinations of those two factors. 
If so you just pass a list of grouping columns to `.groupby()`: " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('A', 'A')\n", "-----\n", " col1 col2 col3 col4\n", "0 A A 1 1.1\n", "\n", "\n", "('A', 'B')\n", "-----\n", " col1 col2 col3 col4\n", "6 A B 3 0.5\n", "\n", "\n", "('B', 'A')\n", "-----\n", " col1 col2 col3 col4\n", "1 B A 2 2.3\n", "\n", "\n", "('B', 'B')\n", "-----\n", " col1 col2 col3 col4\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "\n", "\n", "('C', 'A')\n", "-----\n", " col1 col2 col3 col4\n", "2 C A 3 12.0\n", "\n", "\n", "('C', 'B')\n", "-----\n", " col1 col2 col3 col4\n", "3 C B 4 16.0\n", "\n", "\n" ] } ], "source": [ "for name, group_df in sac_df_2.groupby(['col1','col2']):\n", " print(name)\n", " print('-----')\n", " print(group_df)\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then the sum operation will apply to all combinations." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
col1col2
AA11.10
B30.50
BA22.30
B731.21
CA312.00
B416.00
\n", "
" ], "text/plain": [ " col3 col4\n", "col1 col2 \n", "A A 1 1.10\n", " B 3 0.50\n", "B A 2 2.30\n", " B 7 31.21\n", "C A 3 12.00\n", " B 4 16.00" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby(['col1','col2']).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output here looks a little different because the columns are organized hierarchically (the specific term is that the index is hierarchical). But we can call `.reset_index()`, a function we mentioned earlier, to turn the hierarchical index back into regular columns and flatten things back to a standard data frame." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4
0AA11.10
1AB30.50
2BA22.30
3BB731.21
4CA312.00
5CB416.00
\n", "
" ], "text/plain": [ " col1 col2 col3 col4\n", "0 A A 1 1.10\n", "1 A B 3 0.50\n", "2 B A 2 2.30\n", "3 B B 7 31.21\n", "4 C A 3 12.00\n", "5 C B 4 16.00" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby(['col1','col2']).sum().reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final result here is itself a dataframe and so can be used for further analysis and plotting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the \"apply\" step" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the examples we considered so far we summed the columns within each group. However there are actually several types of things you can do to groups in the \"apply\" step:\n", " \n", "- **Aggregation**: compute a descriptive statistic like a sum, mean, max, min, etc...\n", "- **Transformation**: perform a group-specific computation like standardizing (z-scoring) within a group\n", "- **Filtration**: discard data from small groups or filter data from a group based on the sum or mean in that group (e.g., removing outliers in a group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we computed the sum of columns within each group. However there are several other functions you can use. For example, to compute the median instead:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
col1
A2.00.8
B2.09.0
C3.514.0
\n", "
" ], "text/plain": [ " col3 col4\n", "col1 \n", "A 2.0 0.8\n", "B 2.0 9.0\n", "C 3.5 14.0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1').median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a list of common aggregation functions:\n", "\n", "\n", "\n", "```{list-table} Example aggregation functions\n", ":header-rows: 1\n", ":name: aggregation-table\n", ":width: 200px\n", "\n", "* - Function\n", " - Description\n", "* - `.mean()`\n", " - Compute mean of groups\n", "* - `sum()`\n", " - Compute sum of group values\n", "* - `size()`\n", " - Compute group sizes\n", "* - `count()`\n", " - Compute count of group\n", "* - `std()`\n", " - Standard deviation of groups\n", "* - `var()`\n", " - Compute variance of groups\n", "* - `sem()`\n", " - Standard error of the mean of groups\n", "* - `describe()`\n", " - Generates descriptive statistics\n", "* - `first()`\n", " - Compute first of group values\n", "* - `last()`\n", " - Compute last of group values\n", "* - `nth()`\n", " - Take nth value, or a subset if n is a list\n", "* - `min()`\n", " - Compute min of group values\n", "* - `max()`\n", " - Compute max of group values\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition there is a special `.agg()` (or `.aggregate()`) function that you can pass mulitple functions to from other libraries: " ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
co32col4
col1
A41.60
B933.51
C728.00
\n", "
" ], "text/plain": [ " co32 col4\n", "col1 \n", "A 4 1.60\n", "B 9 33.51\n", "C 7 28.00" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1').agg(np.sum)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous example used the numpy `np.sum()` function. You can also provide multiple functions at once:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2co32col4
0AA11.10
1BA22.30
2CA312.00
3CB416.00
4BB222.21
5BB59.00
6AB30.50
\n", "
" ], "text/plain": [ " col1 col2 co32 col4\n", "0 A A 1 1.10\n", "1 B A 2 2.30\n", "2 C A 3 12.00\n", "3 C B 4 16.00\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "6 A B 3 0.50" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2
summeanstd
col1
A42.01.414214
B93.01.732051
C73.50.707107
\n", "
" ], "text/plain": [ " col2 \n", " sum mean std\n", "col1 \n", "A 4 2.0 1.414214\n", "B 9 3.0 1.732051\n", "C 7 3.5 0.707107" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1').agg([np.sum, np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which as you can see actually computes mutiple statistics on each group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{seealso}\n", "You can read more about aggregation in the Pandos docs on [Aggregation](https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregation methods take in a grouped dataframe and compute a single number of statistics for each group. In contrast transform applies a function to a column and returns a column of the same size usually. This allows you to change the data within each sub group. A common version of this for behavioral science is to z-score data within a group to detect outliers, etc... Since that is the most common use case we'll focus on that here." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
0-0.7071070.707107
1-0.577350-0.875545
2-0.707107-0.707107
30.7071070.707107
4-0.5773501.089743
51.154701-0.214198
60.707107-0.707107
\n", "
" ], "text/plain": [ " col3 col4\n", "0 -0.707107 0.707107\n", "1 -0.577350 -0.875545\n", "2 -0.707107 -0.707107\n", "3 0.707107 0.707107\n", "4 -0.577350 1.089743\n", "5 1.154701 -0.214198\n", "6 0.707107 -0.707107" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1').transform(lambda x: (x-x.mean())/x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transform function takes as argument a `lambda` function which is a way to write short one-line functions in python. This function takes 1 argument (`x`) which is the dataframe coresponding to each column of each group. It then computes the z-score of this data within the group and column and the result is combined into a new dataframe. This new data frame has the same column names as the original but we might want to rename them:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3_transcol4_trans
0-0.7071070.707107
1-0.577350-0.875545
2-0.707107-0.707107
30.7071070.707107
4-0.5773501.089743
51.154701-0.214198
60.707107-0.707107
\n", "
" ], "text/plain": [ " col3_trans col4_trans\n", "0 -0.707107 0.707107\n", "1 -0.577350 -0.875545\n", "2 -0.707107 -0.707107\n", "3 0.707107 0.707107\n", "4 -0.577350 1.089743\n", "5 1.154701 -0.214198\n", "6 0.707107 -0.707107" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed_df=sac_df_2.groupby('col1').transform(lambda x: (x-x.mean())/x.std())\n", "transformed_df=transformed_df.rename(columns={'col3':'col3_trans', 'col4': 'col4_trans'})\n", "transformed_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add these transformed columns back into our original data frame as new column we can use the `.join()` function which is like `.concat()` which we leared about earlier but combines the columns from a dataframe rather than appending the rows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col3_transcol4_trans
0AA11.10-0.7071070.707107
1BA22.30-0.577350-0.875545
2CA312.00-0.707107-0.707107
3CB416.000.7071070.707107
4BB222.21-0.5773501.089743
5BB59.001.154701-0.214198
6AB30.500.707107-0.707107
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col3_trans col4_trans\n", "0 A A 1 1.10 -0.707107 0.707107\n", "1 B A 2 2.30 -0.577350 -0.875545\n", "2 C A 3 12.00 -0.707107 -0.707107\n", "3 C B 4 16.00 0.707107 0.707107\n", "4 B B 2 22.21 -0.577350 1.089743\n", "5 B B 5 9.00 1.154701 -0.214198\n", "6 A B 3 0.50 0.707107 -0.707107" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.join(transformed_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The end result is that we z-scored `col3` and `col4` within the groups implies by `col1`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{seealso}\n", "You can read more about aggregation in the Pandos docs on [Transformation](https://pandas.pydata.org/docs/user_guide/groupby.html#transformation).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filtration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the filtration *apply* step you provide a function that returns true or false for each group. Any group that the function returns `True` for is kept for the new dataframe and any group that returns `False` is dropped. \n", "For example this step removes rows from the dataframe that belong to groups with only one or two members:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
00a
11a
22b
33b
44b
55b
66c
77c
\n", "
" ], "text/plain": [ " A B\n", "0 0 a\n", "1 1 a\n", "2 2 b\n", "3 3 b\n", "4 4 b\n", "5 5 b\n", "6 6 c\n", "7 7 c" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})\n", "dff" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
22b
33b
44b
55b
\n", "
" ], "text/plain": [ " A B\n", "2 2 b\n", "3 3 b\n", "4 4 b\n", "5 5 b" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff.groupby('B').filter(lambda x: len(x) > 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice 'a' and 'c' entries have been removed because those groups have only two members each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{seealso}\n", "You can read more about aggregation in the Pandos docs on [Filtration](https://pandas.pydata.org/docs/user_guide/groupby.html#filtration).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Reading and Resources\n", "\n", "- 10 minute [intro to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)\n", "- A longer standalone video [on how to use pandas](https://www.youtube.com/watch?v=vmEHCJofslg)\n", "- A visual guide to [split-apply-combine](https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e) strategies in Pandas by Towards data science\n", "- [Split-Apply-combine strategy for data mining](https://medium.com/analytics-vidhya/split-apply-combine-strategy-for-data-mining-4fd6e2a0cc99) includes some additional facts and features of this approach" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "```{bibliography} ../../references.bib\n", ":filter: docname in docnames\n", "```" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of 
Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }