{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Format and Structure of Digital Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "This chapter was authored by [Todd M. Gureckis](http://gureckislab.org/~gureckis) and [Kelsey Moty](http://kelseymoty.github.io) and is released under the [license for this book](../../LICENSE).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Video Lecture\n", "\n", "This video provides a complementary overview of this chapter. Some things are mentioned in the chapter but not in the video, and vice versa. Together they give an overview of this unit, so please both read and watch.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to think about data and organize it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data is an important concept in science. The first step is to measure something. In a [previous chapter we discussed issues in measurement](../../chapters/04/00-researchdesign), including different types of scales, units, and so on. However, the main point is that data are born when a number is assigned to some observation we make. Here is a lonely single number, which perhaps measures something:\n", "\n", "
\n", " \n", "
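In Python, such a lonely observation is just a number bound to a variable. A minimal sketch (the value here is invented purely for illustration):

```python
# A single made-up measurement: data is born when we write the number down
measurement = 98.6

print(measurement)
```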
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things get more interesting when we make multiple observations and so have many data. In order to do anything with more than one number, though, we start running into the question of how to organize things. How do we keep track of which number was collected first, second, or third, for instance? If we have just a big jumble of numbers we can easily get lost. \n", "\n", "Before we get too many numbers we need to start thinking abstractly about **organizing** our measurements into some type of collection. In our previous chapter on the [basics of Python](../../chapters/03/00-python) we discussed the concept of [collections](chapters/03/00-python.html#collections). Collections are things in the native Python language space that organize multiple numbers, strings, or even other collections. These include [lists](chapters/03/00-python.html#lists) (which organize things in some order), [dictionaries](chapters/03/00-python.html#dictionaries) (which do not preserve order but allow \"looking up\" a value by a special address known as a key), and [sets](chapters/03/00-python.html#sets) (which are unordered collections of things useful for counting and for performing set operations like union). \n", "\n", "\n", "For instance, if we measured a number repeatedly over time, a list might be a useful way to organize the measurements:\n", "\n", "
\n", " \n", "
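The idea can be sketched in plain Python (the values are invented for illustration): a list keeps repeated measurements in the order they were collected.

```python
# Three made-up measurements of the same quantity, recorded in order
measurements = [98.6, 99.1, 98.9]

# Lists preserve order, so position encodes when each value was collected
first = measurements[0]    # the earliest measurement
latest = measurements[-1]  # the most recent measurement
print(first, latest)
```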
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's imagine the numbers above represent some measurement on a person on three different days (Monday, Tuesday, Wednesday). It might be their blood pressure or temperature. Learning a lot about one person is nice and fits cleanly into a list. However, things often get more complex. What if we instead have multiple people we are recording information from, so the data starts looking two-dimensional? Perhaps there are three people in the study and each has three measurements. In that case we might organize the data as a list of lists (or a matrix):\n", "\n", "
\n", " \n", "
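A rough sketch of this in plain Python (the people and values are invented): each inner list holds one person's measurements for the three days.

```python
# Made-up data: rows are people, columns are days (Mon, Tue, Wed)
data = [
    [98.6, 99.1, 98.9],  # person 0
    [97.9, 98.2, 98.4],  # person 1
    [99.0, 98.7, 98.8],  # person 2
]

row = data[1]       # all three measurements for the second person
value = data[1][2]  # that person's Wednesday measurement
print(row, value)
```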
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although a matrix is a nice way to organize multiple observations made on multiple people, it can get a little bit confusing. Are the rows or the columns the subjects in the example above? You have to remember to write that down, otherwise you could easily get confused. What is this data? What does it describe? For this reason we might look beyond standard Python collections to more complex structures. \n", "\n", "For example, you are all likely familiar with the concept of a spreadsheet (e.g., Microsoft Excel or Google Sheets). These are nicer than matrices because they have named rows and columns, so just by looking at the structure of the spreadsheet you can often learn more about the structure of the data. Column names are sometimes known as **metadata** because they are data themselves that *describe* other data.\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "This is much nicer because the column names help us remember what each measurement records and also what is contained in the rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two-dimensional structure of a spreadsheet is generally the most complex type of data that we will deal with. However, just so you are aware, as data gets really big it can make sense to take other organizational approaches. For instance, a website with millions of users reviewing movies might not want to make one long spreadsheet recording each user, the rating they gave, and the movie the rating was about. Stop and think for a moment how you could do that in a spreadsheet. Perhaps you could make each row a user and each column a movie. However, as you know, Netflix and other sites have hundreds of thousands of TV shows and movies, so the data would be really unwieldy. Also, most of the data would be \"missing\" because people only provide a rating for a small subset of the total number of movies. As a result, big websites adopt alternative ways of organizing data, including the use of what are known as **relational databases**. \n", "\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relational databases are effectively made up of multiple spreadsheets, where each one represents only a smaller piece of the data. In the figure above, the green words are columns of a smaller spreadsheet. This database is thus organized into four separate spreadsheets (or \"tables\"). For instance, the ratings table has a unique id number (i.e., row number) and then the rating a person gave, a user id for who gave it, and a movie id for which movie it was about. Separately, there is a movies table which has its own id (a unique id for each movie), the title or name of the movie, and a description/summary of the movie. The orange lines reflect links where the value in one column of one table connects with or refers to the entries of another table. This can be a much more efficient way to organize the data. \n", "\n", "The main point this example brings forward is that the way you organize your data is something you really have to think about and plan from the beginning. I've found that this topic is actually intuitively interesting to many students. The reason is that we love organizing our homes and offices. It feels great. When it comes to data, bringing that same mentality (the \"fresh\" feeling of being organized) is really key to making good scientific analyses that do not have bugs. Problems with data organization can even be deadly ([How not to kill people with spreadsheets](https://foreignpolicy.com/2020/10/08/uk-coronavirus-covid-spreadsheet-excel-error-outsourcing/)). If you get really interested in organizing data, there are [entire books](https://www.amazon.com/Discipline-Organizing-MIT-Press/dp/0262518503) {cite}`Glushko:2013` and fields of study about how best to organize data (e.g., library and information sciences). The choices you make in how to organize your data at one point in time really influence how easy it is to do certain analyses later. More on this later."
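The linking idea can be sketched in plain Python with two tiny made-up tables (all ids, names, and values here are invented): "following an orange line" is just looking an id up in another table.

```python
# Made-up movies table: unique movie id -> record
movies = {
    1: {"name": "Movie A", "description": "A made-up drama"},
    2: {"name": "Movie B", "description": "A made-up comedy"},
}

# Made-up ratings table: each rating refers to a movie by movie_id
ratings = [
    {"id": 0, "rating": 5, "user_id": 10, "movie_id": 1},
    {"id": 1, "rating": 3, "user_id": 11, "movie_id": 2},
]

# Follow the link: which movie does the first rating describe?
name = movies[ratings[0]["movie_id"]]["name"]
print(name)  # Movie A
```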
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common file formats for data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data often come in files with a particular formatted structure. This formatted structure makes it easier for computer programs to read in the data and manipulate it. In this section we will go over a couple of the common data formats you might encounter in traditional psychological research." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Excel Workbooks (.xls, .xlsx files)\n", "The XLSX file format (or XLS in older versions of Excel) is the default file format used in Excel. Under the hood, Excel Workbooks are built using a highly structured markup language called Extensible Markup Language (XML). Essentially, this means that while you are using Excel's graphical interface to edit your data, Excel is adding a bunch of XML tags behind the scenes to the XLSX file so that it knows how to properly format the data each time you open it up. These XML tags are what allow you to change font colors and sizes, add graphs and images, and use formulas. \n", "\n", "This is what you see when you use Excel: \n", "
\n", " \n", "
\n", "\n", "\n", "And this is what the exact same Workbook looks like behind the scenes:\n", "
\n", " \n", "
\n", "\n", "But this complexity can also make it difficult to use XLSX files with any other software or programming language that isn't Excel. For one, not all programs can open XLSX files, so these data files can make your data inaccessible to other people. Two, all of this special formatting can sometimes lead to problems when you read in data from XLSX files (e.g., converting variables to weird formats). To avoid issues like these, it is preferable to store your data using plain-text formats (such as CSV and TSV files). \n", "\n", "\n", "#### Comma-separated Values (.csv files)\n", "CSV, or Comma-separated value files, on the other hand, are written using a plain-text format. No special tags, just text. The only formatting in CSV files are the commas used to delimit different values for columns and new lines to indicate rows. You can see this if you open up a CSV file using Notepad (or other text editors). \n", "\n", "\n", "Here's the same dataset as before but now as a CSV file:\n", "
\n", " \n", "
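The plain-text structure can also be seen in code. Here is a hedged sketch using Python's built-in `csv` module on a tiny invented dataset:

```python
import csv
import io

# A tiny made-up CSV: commas separate columns, newlines separate rows
text = "salary,age\n86285,64\n77125,58\n"

rows = list(csv.reader(io.StringIO(text)))
print(rows)
# Note that every value comes back as a string; converting numbers
# to numeric types is up to you.
```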
\n", "\n", "This means that what you can store within a CSV file is quite limited (no images or special formatting, for example). But it also means that this kind of file is great for storing and sharing data. CSV files can be opened by any software, making it accessible to others. Given that CSV files inherently have no formatting, some of the issues that come with using XLSX files never arise. \n", "\n", "#### Tab-separated Values (.tsv files)\n", "TSV files are very similar to CSV files, except that instead of using a comma to delimit between values in a dataset, it uses tabs (a special type of spacing, the same type you get when you hit the tab key on your computer).\n", "\n", "```{seealso} \n", "It is a little bit beyond the scope of our class, but another file format you might encounter is referred to as JSON which stands for Javascript Object Notation (JSON). JSON is similar to a python dictionary and is stored in a plain text file like a CSV. However, JSON is somewhat more flexible than the two-dimensional structure of a spreadsheet. Often one analysis step is to convert JSON into a 2D spreadsheet-type structure. [Here](https://realpython.com/python-json/) is a helpful guide to JSON in python.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting from spreadsheet programs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Google Sheets \n", "\n", "To export a .csv file from Google Sheets, you click on File > Download > Comma-separated values (.csv). \n", "\n", "
\n", "\n", "If you created a Google Sheet with multiple sheets, you will have to save each individual sheet as a .csv file because .csv files do not support multiple sheets. To do this, you will have to click on each sheet and go through the save process for each sheet. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excel\n", "\n", "To create a .csv file from Excel, click on File (found on the top left of the menu bar).\n", "\n", "
\n", "\n", "A new menu will appear. From there, select \"Save as\" and then choose where on your computer you want to save the file. A pop-up window will open up, and from there, you can choose what to name the file and what kind of file type to save it as. To save it as a .csv, use the dropdown menu (labeled \"Save as type\") to select which kind of file type you would like to save the file as. \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uploading csv files to JupyterHub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of the datasets you will be working with this semester will already be available for you to use on JupyterHub. However, if you want to work with your own datasets, you will need to upload them yourselves to the cluster.\n", "\n", "To do that, go to the \"Files\" tab on JupyterHub and use the \"Upload\" button. \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Pandas library and the concept of a dataframe API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
\n", "\n", "Throughout this class there are several libraries (i.e., tools which extend the capabilities of the core Python language) that we will use a lot. One of those is [Pandas](https://pandas.pydata.org). Pandas is an open-source project for Python that provides data analysis and manipulation tools. Pandas gives you a way to interact with a dataset organized like a spreadsheet (meaning columns and rows which are possibly named or labeled with metadata) in order to perform sophisticated data analyses. Pandas is really the \"backbone\" of the Python data science and modeling environment. Pandas could be thought of as a language (or [API](https://en.wikipedia.org/wiki/API)) for unlocking tabular data. **If you want to become better at data analysis with Python there is no single package I can think of that is worth more of your time learning.**\n", "\n", "Although there is no required book for this class, \"[Learning the Pandas Library: Python Tools for Data Munging, Analysis and Visualization](https://www.amazon.com/Learning-Pandas-Library-Munging-Analysis/dp/153359824X)\" by Matt Harrison {cite}`Harrison:2016` is highly recommended, both because it is short and to the point and because it explains key Pandas concepts very clearly. In addition, the Pandas project includes a very helpful [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) which explains many of the key concepts.\n", "\n", "```{admonition} Pandas can be confusing at first!\n", ":class: tip\n", "I was aware of Pandas for several years and never really \"understood it.\" However, when it clicked it opened a universe of data analysis to me. 
It takes some patience, in part because data manipulation is a very complex topic, at least at the conceptual level.\n", "```\n", "\n", "The goal of this guide is not to give a complete description of everything in Pandas but to give you a sense of how you use Pandas (with code) to do the types of tasks you would typically do in a spreadsheet like Excel or Google Sheets. In addition, we show a few key features of Pandas that are not easy to do in Excel but are very useful for behavioral data analysis.\n", "\n", "The first step of understanding Pandas is to load it:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two code lines above are the typical way to load in the Pandas library. [Numpy](https://numpy.org) is a related library that Pandas is built on top of (it really is like a Russian doll where one project enables the next!). Traditionally, pandas is imported 'as' **pd** and numpy 'as' **np**. This means that when you are reading code online, if you see `pd.somefunction()` you can probably assume that the function is part of the pandas library because the function begins with the `pd.` syntax." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data (e.g., csvs) into Python using Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before getting further with Pandas it helps to first understand how to read some data into a dataframe object or how to create a dataframe from scratch. Usually you are reading data in from some other file (e.g., a CSV file) and \"loading\" it into Pandas. Pandas has many different ways of reading in different file types. 
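As a minimal sketch of the from-scratch route (all values here are invented), a dataframe can be built out of a plain Python dictionary whose keys become column names:

```python
import pandas as pd

# Keys become column names; lists become that column's values (made-up data)
toy_df = pd.DataFrame({
    "name": ["anna", "ben", "carla"],
    "age": [64, 58, 38],
})

print(toy_df.shape)  # (3, 2): three rows, two columns
```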
Here, we will use `pd.read_csv()` because we will mostly be working with .csv files in this class.\n", "\n", "When reading in your .csv file, there are two things you absolutely have to do:\n", "\n", "First, **you need to store the data into a variable.** Otherwise, you won't be able to work with your data. Here, we called the dataframe \"df\" but you can name it whatever you want. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# incorrect: won't store your data\n", "pd.read_csv('salary.csv')\n", "\n", "# correct: creates a dataframe called df that you can work with \n", "df = pd.read_csv('salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, **you need to tell `pd.read_csv` where the file you are trying to import is located.** The path can point to a file located in a folder in your local environment (e.g., on your computer, on JupyterHub) or to a file available online. \n", "\n", "To point to a file available online, you'll put the link to the file. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are pointing to a file on JupyterHub (or to a file on your computer, if you have Python installed on your computer), you'll need to specify the path to the file. If the .csv file and your notebook are in the same folder, you only have to put the name of the file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the file is located in a different folder than the notebook you are working with, you will have to specify which folder(s) the computer needs to look at to find it. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('folder1/folder2/salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, your .csv file might be saved in a folder that's not in the same folder as your notebook. For example, say you have a folder called \"project\". Inside that folder, there is a folder called \"code\" that contains your notebooks/Python code, as well as a folder called \"data\" that contains your data (.csv) files. To import your .csv file, you need to use `..` to tell your computer to go up one folder in your directory (out of the \"code\" folder) in order for it to find the \"data\" folder. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/salary.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `pd.read_csv` assumes that the file uses commas to differentiate between different entries in the file. But you can also explicitly tell `pd.read_csv` that you want to use commas as the delimiter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pd.read_csv` also assumes by default that the first row in your .csv files lists the names for each column of data (the column headers). You can explicitly tell `pd.read_csv` to do this by writing:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\", header = 'infer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, datasets may have multiple headers (e.g., both the first and second rows of the dataset list column names). `pd.read_csv` allows you to keep both rows as headers by modifying the `header` argument with a list of integers. 
Remember that 0 means the first row, because Python counts from zero. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\", header = [0,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When creating your own datasets, it's generally best practice to give your columns headers, so that it's easier for people looking at your data (including yourself!) to know what's in the dataset. However, you may occasionally come across datasets that do not have headers. To tell `pd.read_csv` there's no header to import, set `header` to `None`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('salary.csv', sep = \",\", header = None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{warning} \n", "When creating your own dataset, refrain from using characters like space or period (.) in the column names. This will make things easier for you down the line when using Pandas for statistical modeling.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing the data in a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we have loaded a dataset into a variable called `df`. Now we might like to look at the data to check that it was properly read in and also to learn more about the structure of this dataset. Perhaps the simplest method is simply to type the name of the dataframe variable by itself in a code cell:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
.....................
72536621neuro1.031.03
73571851stat9.039.07
74522541stat2.032.09
75618851math23.060.09
76495421math3.033.05
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23\n", ".. ... ... ... ... ... ...\n", "72 53662 1 neuro 1.0 31.0 3\n", "73 57185 1 stat 9.0 39.0 7\n", "74 52254 1 stat 2.0 32.0 9\n", "75 61885 1 math 23.0 60.0 9\n", "76 49542 1 math 3.0 33.0 5\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This outputs a \"table\" view of the dataframe showing the column names, and several of the rows of the dataset. It doesn't show you **all** of the data at once because in many large files this would be too much to make sense of.\n", "\n", "You can also specifically request the first several rows of the dataframe:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or the last several rows:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
72536621neuro1.031.03
73571851stat9.039.07
74522541stat2.032.09
75618851math23.060.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "72 53662 1 neuro 1.0 31.0 3\n", "73 57185 1 stat 9.0 39.0 7\n", "74 52254 1 stat 2.0 32.0 9\n", "75 61885 1 math 23.0 60.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"head\" of the dataframe is the top. The \"tail\" of the dataframe is the bottom.\n", "\n", "\n", "We can also access individual rows and columns of a data frame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accessing individual columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a single column you can index it like a dictionary in Python using the column name as a key:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 86285\n", "1 77125\n", "2 71922\n", "3 70499\n", "4 66624\n", " ... \n", "72 53662\n", "73 57185\n", "74 52254\n", "75 61885\n", "76 49542\n", "Name: salary, Length: 77, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['salary']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 64.0\n", "1 58.0\n", "2 38.0\n", "3 46.0\n", "4 41.0\n", " ... \n", "72 31.0\n", "73 39.0\n", "74 32.0\n", "75 60.0\n", "76 33.0\n", "Name: age, Length: 77, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another allowed syntax is to use the name of the dataframe variable and a `.columnname`. For instance:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 86285\n", "1 77125\n", "2 71922\n", "3 70499\n", "4 66624\n", " ... 
\n", "72 53662\n", "73 57185\n", "74 52254\n", "75 61885\n", "76 49542\n", "Name: salary, Length: 77, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.salary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the dictionary key-lookup method is preferred because it is possible for the name of a column to be the same as a dataframe method (see below), and that causes Python to get confused." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{seealso} \n", "When we learn about [visualizing data](../../chapters/06/00-plots), [describing data](../../chapters/07/00-describingdata), and [linear regression](../../chapters/13/00-linearregression) you will see how the column structure of dataframes becomes very powerful.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accessing individual rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the Python bracket notation is used to look up columns, a special command is needed to access rows. The best way to look up a single row is to use `.iloc[]` where you pass the integer row number you want to access (zero-indexed). 
So to get the first row you type:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "salary 86285\n", "gender 0\n", "departm bio\n", "years 26\n", "age 64\n", "publications 72\n", "Name: 0, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the 5th row:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "salary 66624\n", "gender 0\n", "departm bio\n", "years 11\n", "age 41\n", "publications 23\n", "Name: 4, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or counting three backwards from the end:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "salary 52254\n", "gender 1\n", "departm stat\n", "years 2\n", "age 32\n", "publications 9\n", "Name: 74, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[-3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexes and Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also note that there are two special elements of a normal dataframe called the **column index** and the **row index** (or just index). The row index is the column on the left that has no name but seems like a counter of the rows (e.g., 72, 73, 74, ...). The row index is useful in Pandas dataframes for looking things up by row. Although you can index a row by counting (access the fifth row for instance), the index can be made on arbitrary types of data including strings, etc... 
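For instance (a hedged sketch with a made-up dataframe), an index made of strings lets you look rows up by label with `.loc[]`, while `.iloc[]` still works by position:

```python
import pandas as pd

# Made-up dataframe whose row index holds strings instead of row numbers
scores = pd.DataFrame(
    {"salary": [86285, 57185], "age": [64, 39]},
    index=["bio", "stat"],
)

by_label = scores.loc["bio"]   # look up a row by its index label
by_position = scores.iloc[0]   # the same row, found by position
print(by_label["salary"], by_position["salary"])
```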
Typically you don't need to know a ton about indexes to use Pandas, but every once in a while they come up, so it is useful to know about the special nature of the index column." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=77, step=1)" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['salary', 'gender', 'departm', 'years', 'age', 'publications'], dtype='object')" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code shows how to find the row index and column index.\n", "\n", "You can change the row index to another column:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderyearsagepublications
departm
bio86285026.064.072
bio77125028.058.043
bio71922010.038.023
bio70499016.046.064
bio66624011.041.023
..................
neuro5366211.031.03
stat5718519.039.07
stat5225412.032.09
math61885123.060.09
math4954213.033.05
\n", "

77 rows × 5 columns

\n", "
" ], "text/plain": [ " salary gender years age publications\n", "departm \n", "bio 86285 0 26.0 64.0 72\n", "bio 77125 0 28.0 58.0 43\n", "bio 71922 0 10.0 38.0 23\n", "bio 70499 0 16.0 46.0 64\n", "bio 66624 0 11.0 41.0 23\n", "... ... ... ... ... ...\n", "neuro 53662 1 1.0 31.0 3\n", "stat 57185 1 9.0 39.0 7\n", "stat 52254 1 2.0 32.0 9\n", "math 61885 1 23.0 60.0 9\n", "math 49542 1 3.0 33.0 5\n", "\n", "[77 rows x 5 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.set_index('departm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or to reset it to a counter that count the row number use `.reset_index()`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
departmsalarygenderyearsagepublications
0bio86285026.064.072
1bio77125028.058.043
2bio71922010.038.023
3bio70499016.046.064
4bio66624011.041.023
.....................
72neuro5366211.031.03
73stat5718519.039.07
74stat5225412.032.09
75math61885123.060.09
76math4954213.033.05
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " departm salary gender years age publications\n", "0 bio 86285 0 26.0 64.0 72\n", "1 bio 77125 0 28.0 58.0 43\n", "2 bio 71922 0 10.0 38.0 23\n", "3 bio 70499 0 16.0 46.0 64\n", "4 bio 66624 0 11.0 41.0 23\n", ".. ... ... ... ... ... ...\n", "72 neuro 53662 1 1.0 31.0 3\n", "73 stat 57185 1 9.0 39.0 7\n", "74 stat 52254 1 2.0 32.0 9\n", "75 math 61885 1 23.0 60.0 9\n", "76 math 49542 1 3.0 33.0 5\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2=df.set_index('departm')\n", "df2.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the syntax we've used a few times here... we referenced the `df` variable which is the variable we created to store the data from our file. Then the `.functionname()` is known as a method of the data frame which is provided by pandas. For instance the `.head()` method prints out the first five rows of the data frame. The `.tail()` method prints out the last five rows of the data frame. There are many other methods available on data frames that you access by calling it using the `.functionname()` syntax. The next parts of this chapter explain some of the most common ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding and deleting things from a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes after we read in a dataset we might want to add new rows or columns or delete rows and columns from a dataframe. One way to do this is to edit the original CSV file that we read in. However, there is an important principle I want to emphasize thoughout this class: \n", "\n", "```{admonition} **ALWAYS do everything in code**\n", ":class: tip\n", "What do I mean by do everything in code? What I mean is that if you go to your data file that you got from some place and then by hand delete some of the data in Google Sheets or Excel, noone will know that you did that. 
There is no record of it. Once you save the file the data will be deleted and no one will know you did this. Instead, if you keep your data as \"raw\" as possible and modify it using code, your code will document ALL of the steps of your analysis, including the step of deleting data. \"Excluding\" (not deleting) data is sometimes justified, but the important thing is that we want to document all our steps honestly and truthfully when doing analysis. Using code to do every single step of an analysis helps us accomplish this.\n", "``` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To delete a row you can use the `.drop()` method to drop a particular item using its index value. The `.drop()` method is not an \"in place\" operation; instead, it returns a new dataframe with the particular rows removed.\n", "\n", "\n", "```{admonition} **\"In place\" operations**\n", ":class: tip\n", "From time to time using pandas you will hear about a method being described as \"in place\" or \"not in place.\" In place means that the change you are making to the dataframe is made to the dataframe variable itself. So for instance if you have a dataframe stored in a variable named `df` and you call `df.drop([1])` it will drop the row corresponding to the index value 1. However, if you look at `df` again using, for instance, `df.head()` you will see that it has not changed. To save the changes with a \"not in place\" operation you need to store the results in a new variable. For instance `df2 = df.drop([1])` will make a copy of `df`, drop the row corresponding to index 1, and then store the result in `df2`. If drop were an \"in place\" operation it would have actually modified `df` and so you wouldn't need to store the result in a new variable.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is how to use it. 
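To make the "in place" distinction concrete, here is a minimal sketch using a toy dataframe (not the salary data); note that `.drop()` also accepts an `inplace=True` argument that makes it modify the dataframe directly:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})

# Not in place (the default): .drop() returns a modified copy
df2 = df.drop([1])
print(len(df), len(df2))  # df still has 3 rows, df2 has 2

# With inplace=True the same change is made to df itself
df.drop([1], inplace=True)
print(len(df))            # now df has 2 rows
```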
Suppose we have the salary dataframe from above:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to delete the first we can drop it using the index it has, in this case 0:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
5644510bio23.060.044
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23\n", "5 64451 0 bio 23.0 60.0 44" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2=df.drop([0])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how I \"saved\" the modified dataframe into a new variable called `df2` using the equals sign. Then if we call the `.head()` method on `df2` we can see that the first row has been removed. `df`, the original dataframe variable, remains unchanged.\n", "\n", "You can also remove multiple rows by their index value at once:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
1771250bio28.058.043
3704990bio16.046.064
5644510bio23.060.044
7593440bio5.040.011
8585600bio8.038.08
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "1 77125 0 bio 28.0 58.0 43\n", "3 70499 0 bio 16.0 46.0 64\n", "5 64451 0 bio 23.0 60.0 44\n", "7 59344 0 bio 5.0 40.0 11\n", "8 58560 0 bio 8.0 38.0 8" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop([0,2,4,6]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I dropped the rows with index 0,2,4,6, and also show an example of **chaining** dataframe methods. \n", "\n", "\n", "```{admonition} **Chaining methods**\n", ":class: tip\n", "Dataframes in pandas a what is known as an object-oriented structure. This means that most the functionality of a dataframe is tied to the variables themselves. That is why in the previous examples we called `df.method()` like `df.drop()` instead of `pd.drop()` (calling from the base pandas library). Given this, most pandas methods either return themselves or a copy of the dataframe that has been altered. Thus you can \"chain\" operations and methods together to make the code more concise. Chaining means calling multiple methods in a row on a single line of code. `df.drop([0,2,4,6]).head()` calls the `drop()` method on the dataframe and then called the `head()` method on the resulting modified data frame. This means you don't have to store every intermediate step of your data manipulation into a new variable. \n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To delete a column instead of a row you can also use the `.drop()` method, using an additional argument that refers to the \"axis\" you are dropping from." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are a couple examples of dropping one or more columns by name. Note that the case of the column name must match and also you need to specific `axis=1` to refer to dropping columns instead of rows." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmagepublications
0862850bio64.072
1771250bio58.043
2719220bio38.023
3704990bio46.064
4666240bio41.023
\n", "
" ], "text/plain": [ " salary gender departm age publications\n", "0 86285 0 bio 64.0 72\n", "1 77125 0 bio 58.0 43\n", "2 71922 0 bio 38.0 23\n", "3 70499 0 bio 46.0 64\n", "4 66624 0 bio 41.0 23" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop('years',axis=1).head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmpublications
0862850bio72
1771250bio43
2719220bio23
3704990bio64
4666240bio23
\n", "
" ], "text/plain": [ " salary gender departm publications\n", "0 86285 0 bio 72\n", "1 77125 0 bio 43\n", "2 71922 0 bio 23\n", "3 70499 0 bio 64\n", "4 66624 0 bio 23" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(['years','age'],axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using pandas it is best to resist the urge to add rows one at a time to a dataframe. For various reasons is this not the ideal way to use pandas [^columnbased]. Instead you can combine the rows of two different dataframes into one. This might be useful in psychology for instance if each participant in your experiment had their own data file and you want to read each file into a dataframe and them combine them together to make one \"uber\" dataframe with all the data from your experiment.\n", "\n", "[^columnbased]: Internally, pandas is primarily organized using columns. This just means that the computer code underneath pandas likes adding columns and hates adding new rows one at a time. It is just a thing. Thus, adding one row at a time is very inefficient. As a result it is usually best to get all your rows set and then create the dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's two simple dataframes and we can combine them:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalary
0100
12723000
245100000
32335000
42160000
06050000
17023000
25360000
356135000
4800
\n", "
" ], "text/plain": [ " age salary\n", "0 10 0\n", "1 27 23000\n", "2 45 100000\n", "3 23 35000\n", "4 21 60000\n", "0 60 50000\n", "1 70 23000\n", "2 53 60000\n", "3 56 135000\n", "4 80 0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"age\": [10,27,45,23,21], \"salary\": [0,23000,100000,35000,60000]})\n", "df2 = pd.DataFrame({\"age\": [60,70,53,56,80], \"salary\": [50000,23000,60000,135000,0]})\n", "\n", "df_combined = pd.concat([df1,df2])\n", "\n", "df_combined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This only works because they have the same columns. If they have different columns then the missing entries of either are filled in with `NaN` which is the code for \"missing values\" in pandas. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalaryheight
0100.0NaN
12723000.0NaN
245100000.0NaN
32335000.0NaN
42160000.0NaN
060NaN5.2
170NaN6.0
253NaN5.7
356NaN3.4
480NaN4.6
\n", "
" ], "text/plain": [ " age salary height\n", "0 10 0.0 NaN\n", "1 27 23000.0 NaN\n", "2 45 100000.0 NaN\n", "3 23 35000.0 NaN\n", "4 21 60000.0 NaN\n", "0 60 NaN 5.2\n", "1 70 NaN 6.0\n", "2 53 NaN 5.7\n", "3 56 NaN 3.4\n", "4 80 NaN 4.6" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"age\": [10,27,45,23,21], \"salary\": [0,23000,100000,35000,60000]})\n", "df2 = pd.DataFrame({\"age\": [60,70,53,56,80], \"height\": [5.2,6.0,5.7,3.4,4.6]})\n", "\n", "df_combined = pd.concat([df1,df2])\n", "\n", "df_combined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will talk about dealing with \"missing\" values shortly but basically missing values in pandas allows for incomplete rows: you might have have information about every single field of a row and so you can uses `NaN` (stands for Not-a-number in computer speak) to represent missing values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have considered adding rows, now let's consider adding columns. This is actually pretty easy. You just assign some values to a new columns name. First we will create a data frame with two random columns:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2
00.0164170.555351
10.3236190.868004
20.7443490.767006
30.0233760.915717
40.4315460.761252
50.6688490.522965
60.1994760.161559
70.0719680.866112
80.4522590.650265
90.4490170.797710
\n", "
" ], "text/plain": [ " col_1 col_2\n", "0 0.016417 0.555351\n", "1 0.323619 0.868004\n", "2 0.744349 0.767006\n", "3 0.023376 0.915717\n", "4 0.431546 0.761252\n", "5 0.668849 0.522965\n", "6 0.199476 0.161559\n", "7 0.071968 0.866112\n", "8 0.452259 0.650265\n", "9 0.449017 0.797710" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"col_1\": np.random.rand(10), \"col_2\": np.random.rand(10)})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we simply assign a new column `df[\\'sum\\']` and define it to be the sum of two columns." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2sum
00.0164170.5553510.571768
10.3236190.8680041.191623
20.7443490.7670061.511355
30.0233760.9157170.939093
40.4315460.7612521.192798
50.6688490.5229651.191814
60.1994760.1615590.361035
70.0719680.8661120.938080
80.4522590.6502651.102524
90.4490170.7977101.246726
\n", "
" ], "text/plain": [ " col_1 col_2 sum\n", "0 0.016417 0.555351 0.571768\n", "1 0.323619 0.868004 1.191623\n", "2 0.744349 0.767006 1.511355\n", "3 0.023376 0.915717 0.939093\n", "4 0.431546 0.761252 1.192798\n", "5 0.668849 0.522965 1.191814\n", "6 0.199476 0.161559 0.361035\n", "7 0.071968 0.866112 0.938080\n", "8 0.452259 0.650265 1.102524\n", "9 0.449017 0.797710 1.246726" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['sum'] = df['col_1']+df['col_2']\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also define new columns to be a constant value:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2sumconstant
00.1896310.0144500.2040812
10.4553790.7810081.2363862
20.8797150.4267901.3065042
30.9420810.9054981.8475782
40.1894110.6318640.8212752
50.8134990.0691080.8826072
60.8842110.4675721.3517832
70.2323540.5931100.8254642
80.7622180.1760460.9382652
90.8727110.0537650.9264762
\n", "
" ], "text/plain": [ " col_1 col_2 sum constant\n", "0 0.189631 0.014450 0.204081 2\n", "1 0.455379 0.781008 1.236386 2\n", "2 0.879715 0.426790 1.306504 2\n", "3 0.942081 0.905498 1.847578 2\n", "4 0.189411 0.631864 0.821275 2\n", "5 0.813499 0.069108 0.882607 2\n", "6 0.884211 0.467572 1.351783 2\n", "7 0.232354 0.593110 0.825464 2\n", "8 0.762218 0.176046 0.938265 2\n", "9 0.872711 0.053765 0.926476 2" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['constant'] = 2\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are of course some limitations and technicalities here but for the most part you just name a new column and define it as above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting rows with missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we are talking about adding and removing rows and columns it can also make sense to discuss removing rows with missing data. You might want to for example drop any trial from an experiment where a subject didn't give a response before a timer deadline. In this case the column coding what the response was might be \"empty\" or \"missing\" and you would want to use the `NaN` value to indicate it was missing. To totally delete rows with any missing value use the `dropna()` function:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalary
0100.0
12723000.0
245NaN
32335000.0
42160000.0
\n", "
" ], "text/plain": [ " age salary\n", "0 10 0.0\n", "1 27 23000.0\n", "2 45 NaN\n", "3 23 35000.0\n", "4 21 60000.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"age\": [10,27,45,23,21], \"salary\": [0,23000,None,35000,60000]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see the salary value is missing for row with index 2 (age 45). To drop any row with a missing value:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalary
0100.0
12723000.0
32335000.0
42160000.0
\n", "
" ], "text/plain": [ " age salary\n", "0 10 0.0\n", "1 27 23000.0\n", "3 23 35000.0\n", "4 21 60000.0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several other tricks for dropping missing data besides this. For example, you can delete rows with only a specific column value missing, etc... However for now this should work for us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting individual values and changing them" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you want to just extract a single value or row from a dataframe or change a single value. The `.iat` method lets you pull out a single value from the data frame given the position of the index and column. The `.iat` method is one of the very weird parts of Pandas which has no real good explanation. Instead of the parentheses used to call normal methods, `.iat` and a few others use square brackets. One way to think about it is that `.iat` is like looking something up in a normal Python array but it also seems like a method attached to the dataframe." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
.....................
72536621neuro1.031.03
73571851stat9.039.07
74522541stat2.032.09
75618851math23.060.09
76495421math3.033.05
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23\n", ".. ... ... ... ... ... ...\n", "72 53662 1 neuro 1.0 31.0 3\n", "73 57185 1 stat 9.0 39.0 7\n", "74 52254 1 stat 2.0 32.0 9\n", "75 61885 1 math 23.0 60.0 9\n", "76 49542 1 math 3.0 33.0 5\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gets the value at row 0 column 0 which is the salary of the first person in the data frame." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "86285" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iat[0,0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gets the age of the person in the sixth row, fourth column." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53.0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iat[6,4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A slighly more reader-friendly option is to use `.at[]` which is a method that lets you look things up using the names of columns." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40.0" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.at[10, 'age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use this to **change** the value of a particular cell." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we see that the person in the first row has age 64." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `.at[]` we set it to 100." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "df.at[0,'age']=100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then when we look at the result the age has been changed in the dataframe `df`." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.0100.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 100.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way to remember this is that methods that look things up use square brackets. And if the method begins with an i (like `.iat[]`) it will look it up by integer index (i.e., the number of the column or row). Otherwise `.at[]` looks up by row (or a named index) and column name.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking the size/dimensions of your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One common thing you need to do is to verify the number of rows and columns in your dataframe. The `.shape` property can help with that." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(77, 6)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us that the dataframe contained in variable `df` has 77 rows and 6 columns. This is often helpful when you first read in a dataset to verify it has all the data you expected to be there!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} **Methods versus properties**\n", ":class: tip\n", "The `.shape` property doesn't include a final `()` unlike other methods we have learned about like `.drop()` which required parentheses. This reflects that size is known as a property of the dataframe while `.drop()` is a method. The conceptual differences can be confusing for why one is one way or the other. 
However, it is often helpful to think about them as the distinction between nouns and verbs in language. Properties (nouns) are static descriptors of a dataset such as the size of the dataset or the column names. In contrast, methods (verbs) are things that require computation on or modification of the dataframe, like deleting things or performing computations on columns or rows.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ultimately the steps we just covered recreate much of what you do with the graphical user interface in Excel (change cell values, add/delete rows and columns, etc...). The real power of Pandas comes from more complex things you can do with dataframes. That is what we will explore in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Things you can do with dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main goal of getting your data into a dataframe is that it enables several methods for manipulating your data in powerful ways." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oftentimes it can help us understand a dataset better if we can sort the rows of the dataset according to the values in one or more columns. For instance, in the salary data set we have been considering, it is hard to know who the highest and lowest paid faculty members are. One approach would be to sort the values." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://cims.nyu.edu/~brenden/courses/labincp/data/salary.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can sort this dataset in ascending order with:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
23446870chem4.034.019
22470210chem4.034.012
76495421math3.033.05
53513910stat5.035.08
74522541stat2.032.09
.....................
14976300chem34.064.043
241048280geolNaN50.044
291057610neuro9.039.030
411064120stat23.053.029
281128000neuro14.044.033
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "23 44687 0 chem 4.0 34.0 19\n", "22 47021 0 chem 4.0 34.0 12\n", "76 49542 1 math 3.0 33.0 5\n", "53 51391 0 stat 5.0 35.0 8\n", "74 52254 1 stat 2.0 32.0 9\n", ".. ... ... ... ... ... ...\n", "14 97630 0 chem 34.0 64.0 43\n", "24 104828 0 geol NaN 50.0 44\n", "29 105761 0 neuro 9.0 39.0 30\n", "41 106412 0 stat 23.0 53.0 29\n", "28 112800 0 neuro 14.0 44.0 33\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('salary')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can easily see from this output at 44,687 is the lowest salary and 112,800 is the highest. `sort_values()` is **not** an inplace operation so the original dataframe is still unsorted and we have to store the sorted result in a new dataframe variable if we want to keep working with it." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
0862850bio26.064.072
1771250bio28.058.043
2719220bio10.038.023
3704990bio16.046.064
4666240bio11.041.023
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "0 86285 0 bio 26.0 64.0 72\n", "1 77125 0 bio 28.0 58.0 43\n", "2 71922 0 bio 10.0 38.0 23\n", "3 70499 0 bio 16.0 46.0 64\n", "4 66624 0 bio 11.0 41.0 23" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head() # still unsorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can sort the other way by adding an additional parameter:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
281128000neuro14.044.033
411064120stat23.053.029
291057610neuro9.039.030
241048280geolNaN50.044
14976300chem34.064.043
.....................
74522541stat2.032.09
53513910stat5.035.08
76495421math3.033.05
22470210chem4.034.012
23446870chem4.034.019
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "28 112800 0 neuro 14.0 44.0 33\n", "41 106412 0 stat 23.0 53.0 29\n", "29 105761 0 neuro 9.0 39.0 30\n", "24 104828 0 geol NaN 50.0 44\n", "14 97630 0 chem 34.0 64.0 43\n", ".. ... ... ... ... ... ...\n", "74 52254 1 stat 2.0 32.0 9\n", "53 51391 0 stat 5.0 35.0 8\n", "76 49542 1 math 3.0 33.0 5\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('salary',ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if you sort by two columns it will do them in order (so the first listed column is sorted first then the second):" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
23446870chem4.034.019
22470210chem4.034.012
76495421math3.033.05
53513910stat5.035.08
74522541stat2.032.09
.....................
14976300chem34.064.043
241048280geolNaN50.044
291057610neuro9.039.030
411064120stat23.053.029
281128000neuro14.044.033
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "23 44687 0 chem 4.0 34.0 19\n", "22 47021 0 chem 4.0 34.0 12\n", "76 49542 1 math 3.0 33.0 5\n", "53 51391 0 stat 5.0 35.0 8\n", "74 52254 1 stat 2.0 32.0 9\n", ".. ... ... ... ... ... ...\n", "14 97630 0 chem 34.0 64.0 43\n", "24 104828 0 geol NaN 50.0 44\n", "29 105761 0 neuro 9.0 39.0 30\n", "41 106412 0 stat 23.0 53.0 29\n", "28 112800 0 neuro 14.0 44.0 33\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['salary','age'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this data set it is mostly the same to sort by salary first then age because most people don't have the same salary so that already provides an order. However if we do it the other way, i.e., age first then salary, it will order by people age and then for the people who are the same age sort by salary." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
72536621neuro1.031.03
74522541stat2.032.09
52536560stat2.032.04
56720440physics2.032.016
76495421math3.033.05
.....................
15824440chem31.061.042
0862850bio26.064.072
14976300chem34.064.043
16762910chem29.065.033
18647620chem25.0NaN29
\n", "

77 rows × 6 columns

\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "76 49542 1 math 3.0 33.0 5\n", ".. ... ... ... ... ... ...\n", "15 82444 0 chem 31.0 61.0 42\n", "0 86285 0 bio 26.0 64.0 72\n", "14 97630 0 chem 34.0 64.0 43\n", "16 76291 0 chem 29.0 65.0 33\n", "18 64762 0 chem 25.0 NaN 29\n", "\n", "[77 rows x 6 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['age','salary'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see in this shortened output there are several people who are 32 in the database and their salaries are ordered from smallest to biggest.\n", "\n", "Sorting is easy to do in Pandas but also easy to do in Excel because there is a \"sort\" tool in such programs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arithmetic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps one of the most useful features of dataframes (and spreadsheets) is the ability to create formulas that compute new values based on the rows and columns. For instance if you had a dataframe that had rows for students and each column was the grade on an assignment a common operation might be to compute the average grade as a new column. Let's take a look at a simple example of this and then discuss arithmetic operations in Pandas more generally." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentassignment1assignment2assignment3
0001908289
1002808475
2003709689
\n", "
" ], "text/plain": [ " student assignment1 assignment2 assignment3\n", "0 001 90 82 89\n", "1 002 80 84 75\n", "2 003 70 96 89" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df = pd.DataFrame({'student':['001','002','003'], 'assignment1': [90, 80, 70], 'assignment2': [82,84,96], 'assignment3': [89,75,89]})\n", "grades_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not necessarily the easier way to **enter** this data (you might prefer to use a spreadsheet for that), but you could read in a csv to load the grades for instance. Next you would want to create the average grade for each student." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentassignment1assignment2assignment3average
000190828987.000000
100280847579.666667
200370968985.000000
\n", "
" ], "text/plain": [ " student assignment1 assignment2 assignment3 average\n", "0 001 90 82 89 87.000000\n", "1 002 80 84 75 79.666667\n", "2 003 70 96 89 85.000000" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df['average']=(grades_df['assignment1']+grades_df['assignment2']+grades_df['assignment3'])/3\n", "grades_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you can see here we added up the column for assignment 1, 2, and 3 and then divided by three. Then we assigned that resulting value to a new column called average. You might wonder how did Pandas know to do this for all three students? The answer is the power of **broadcasting** a feature of many programming languages that automatically detects when you are doing arithmetic operations on collections of numbers and then does that operation for **each entry** rather than like the first one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also broadcast the addition of constant values to a column. For instance to give all the students a five point bonus we could do this\"" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
studentassignment1assignment2assignment3average
000190828992.000000
100280847584.666667
200370968990.000000
\n", "
" ], "text/plain": [ " student assignment1 assignment2 assignment3 average\n", "0 001 90 82 89 92.000000\n", "1 002 80 84 75 84.666667\n", "2 003 70 96 89 90.000000" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df['average']=grades_df['average']+5\n", "grades_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, here it added 5 to *each entry* of the grades column rather than just one or the first row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically any math function can be composed of the columns. You might also be interested in functions you could compute down the columns rather than across them, however we will consider those in more detail in the later chapter on [](../../chapters/07/00-describingdata)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very useful and common feature for manipulating data is slicing. Slicing refers to selecting out subsets of a dataset for further analysis. For example we might want to plot only the salaries of the women in this data set. To do this we want to \"slice\" out a subset of the data and analyze that further. We saw slicing before in a previous chapter on Python programming with strings and lists where you can \"slice\" pieces of a collection out. Pandas takes this several steps further making it more functional for data composed of multiple rows and columns." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you remember we can slice a piece of a list by specifying the starting element and the ending element and including the `:` operator to indicate that we want the values \"in between\":" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['b', 'c', 'd']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_list = ['a','b','c','d','e','f','g']\n", "a_list[1:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same thing using the `.iloc[]` method on a data frame. However since a Pandas dataframe is two dimensional we can specify two slide ranges one for the rows and one for the columns. `iloc[]` is simialr to `iat[]` we say above in that it is used via square brackets (because it is a lookup indexing operation) and the `i` part refer to that we are providing the integer location of the rows and columns (i.e., using numbers to say which row and columns we want)." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartm
2719220bio
3704990bio
\n", "
" ], "text/plain": [ " salary gender departm\n", "2 71922 0 bio\n", "3 70499 0 bio" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:4,0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That example takes row 2 and 3 and columns 0,1,2 (remember Python is zero indexed so the first row or column is numbered zero)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also slice using column names using just `loc[]`:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderdepartmyearsagepublications
10bio28.058.043
20bio10.038.023
\n", "
" ], "text/plain": [ " gender departm years age publications\n", "1 0 bio 28.0 58.0 43\n", "2 0 bio 10.0 38.0 23" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1:2,\"gender\":]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to `.iat[]` and `.at[]`, `iloc[]` uses integer lookup and `loc[]` uses index names. You might wonder why I still used numbers for the rows in the above. This is because this dataframe still has an **index** for the rows which is integer based. However, indexes are an important part of Dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Related to slicing is \"selecting\" which is grabbing subsets of a dataframe's rows based on the **values** of some of the rows. This is different than slicing which takes little chunks out of a larger dataframe using indexes or column names. Here we are interested in selecting rows that meet a particular criterion. For instance in the professor salary dataset we read in we might want to select only the rows that represent women in the dataset. Let's look at an example of how to do that and then discuss the principle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to discuss the concept of logical operations which are broadcast along the columns. For instance if we write something like" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "3 False\n", "4 False\n", " ... 
\n", "72 False\n", "73 False\n", "74 False\n", "75 True\n", "76 False\n", "Name: age, Length: 77, dtype: bool" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age']>50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get a column of `True`/`False` values (also known as Boolean values since they take on two values) which reflect a test for each row if the age value is greater than 50. If it is, then `True` is entered into the new column and if it isn't then `False` is entered in.\n", "\n", "We can write more complex logical operations as well. For instance:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "72 False\n", "73 False\n", "74 False\n", "75 True\n", "76 False\n", "Name: age, Length: 77, dtype: bool" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df['age']>50) & (df['age']<70)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This expression does a logical `and` due to the `&` symbol and will be true if the age is greater than 50 AND less than 70." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The examples so far used a single row but we can also make combination using multiple columns. For instance we could select all the rows corresponding to professors that are male and under 50." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... 
\n", "72 True\n", "73 True\n", "74 True\n", "75 False\n", "76 True\n", "Length: 77, dtype: bool" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df['age']<50) & (df['gender']==1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to make an \"or\" you use the '|' (pipe) character instead of the '&'.\n", "\n", "Now that we have this boolean column we can use it to select subsets of the original dataframe:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
52536560stat2.032.04
56720440physics2.032.016
70559491chem4.034.012
72536621neuro1.031.03
74522541stat2.032.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "70 55949 1 chem 4.0 34.0 12\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['age']<35]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous line selects all the professors under 35. Notice the syntax here as it is kind of sensible. On the outer part we have `df[]` and in the middle of the bracket we provide the logical column/series as we just discussed. You can break them into two steps if you like:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
52536560stat2.032.04
56720440physics2.032.016
70559491chem4.034.012
72536621neuro1.031.03
74522541stat2.032.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "70 55949 1 chem 4.0 34.0 12\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "under35=df['age']<35\n", "df[under35]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes clear that we define the \"rule\" for what things are true and false in this column (`under35`) and then use it to select rows from the original dataframe.\n", "\n", "You use this a lot in data analysis because often a data file from a single subject in an experiment has trials you want to skip or analyze, or you might use it to select trials from particular subjects, or trials that meet a certain requirement (E.g., if a reaction time was too long or something). Thus it is important to bookmark this concept and we will return to it several times throughout the semester." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An event simpler synatx for this is provided by the `.query()` method on Pandas data frames. For example, using this command this is how you would select all the professors under age 35 in the the salary dataset:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
52536560stat2.032.04
56720440physics2.032.016
70559491chem4.034.012
72536621neuro1.031.03
74522541stat2.032.09
76495421math3.033.05
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "52 53656 0 stat 2.0 32.0 4\n", "56 72044 0 physics 2.0 32.0 16\n", "70 55949 1 chem 4.0 34.0 12\n", "72 53662 1 neuro 1.0 31.0 3\n", "74 52254 1 stat 2.0 32.0 9\n", "76 49542 1 math 3.0 33.0 5" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('age<35')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is nice because it does the same thing as the above without all the extra brackets which sometimes contribute to typos. Here is a slightly more complex version using the logical and operator:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
23446870chem4.034.019
70559491chem4.034.012
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "23 44687 0 chem 4.0 34.0 19\n", "70 55949 1 chem 4.0 34.0 12" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"age<35 & departm=='chem'\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salarygenderdepartmyearsagepublications
22470210chem4.034.012
70559491chem4.034.012
\n", "
" ], "text/plain": [ " salary gender departm years age publications\n", "22 47021 0 chem 4.0 34.0 12\n", "70 55949 1 chem 4.0 34.0 12" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"age<35 & departm=='chem' & publications == 12\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So nice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iteration refers to stepping through either the rows of the columns of your dataframe one by one and doing some thing with each row or column. We have encountered iteration before when we learned about Python for loops. If you remember, the typical for loop we had we iterated over a list. For example this code iterates down a list of values and prints each one." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n", "5\n", "6\n", "7\n", "8\n", "9\n" ] } ], "source": [ "for i in range(10):\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can iterate over both rows and columns of a dataframe since the format is two dimensional. This prints out the titles of each column." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "salary\n", "gender\n", "departm\n", "years\n", "age\n", "publications\n" ] } ], "source": [ "for column in df:\n", " print(column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above method only gets you the individual column name. To get the data within each column, you use a special methods called `.iteritems()`. 
For example:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "salary\n", "---\n", "0 86285\n", "1 77125\n", "2 71922\n", "3 70499\n", "4 66624\n", " ... \n", "72 53662\n", "73 57185\n", "74 52254\n", "75 61885\n", "76 49542\n", "Name: salary, Length: 77, dtype: int64\n", "gender\n", "---\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "72 1\n", "73 1\n", "74 1\n", "75 1\n", "76 1\n", "Name: gender, Length: 77, dtype: int64\n", "departm\n", "---\n", "0 bio\n", "1 bio\n", "2 bio\n", "3 bio\n", "4 bio\n", " ... \n", "72 neuro\n", "73 stat\n", "74 stat\n", "75 math\n", "76 math\n", "Name: departm, Length: 77, dtype: object\n", "years\n", "---\n", "0 26.0\n", "1 28.0\n", "2 10.0\n", "3 16.0\n", "4 11.0\n", " ... \n", "72 1.0\n", "73 9.0\n", "74 2.0\n", "75 23.0\n", "76 3.0\n", "Name: years, Length: 77, dtype: float64\n", "age\n", "---\n", "0 64.0\n", "1 58.0\n", "2 38.0\n", "3 46.0\n", "4 41.0\n", " ... \n", "72 31.0\n", "73 39.0\n", "74 32.0\n", "75 60.0\n", "76 33.0\n", "Name: age, Length: 77, dtype: float64\n", "publications\n", "---\n", "0 72\n", "1 43\n", "2 23\n", "3 64\n", "4 23\n", " ..\n", "72 3\n", "73 7\n", "74 9\n", "75 9\n", "76 5\n", "Name: publications, Length: 77, dtype: int64\n" ] } ], "source": [ "for columnname, columndata in df.iteritems():\n", " print(columnname)\n", " print('---')\n", " print(columndata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if you want to step through row-by-row use the `.iterrows()` method." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, salary 86285\n", "gender 0\n", "departm bio\n", "years 26\n", "age 64\n", "publications 72\n", "Name: 0, dtype: object)\n", "(1, salary 77125\n", "gender 0\n", "departm bio\n", "years 28\n", "age 58\n", "publications 43\n", "Name: 1, dtype: object)\n", "(2, salary 71922\n", "gender 0\n", "departm bio\n", "years 10\n", "age 38\n", "publications 23\n", "Name: 2, dtype: object)\n", "(3, salary 70499\n", "gender 0\n", "departm bio\n", "years 16\n", "age 46\n", "publications 64\n", "Name: 3, dtype: object)\n", "(4, salary 66624\n", "gender 0\n", "departm bio\n", "years 11\n", "age 41\n", "publications 23\n", "Name: 4, dtype: object)\n", "(5, salary 64451\n", "gender 0\n", "departm bio\n", "years 23\n", "age 60\n", "publications 44\n", "Name: 5, dtype: object)\n", "(6, salary 64366\n", "gender 0\n", "departm bio\n", "years 23\n", "age 53\n", "publications 22\n", "Name: 6, dtype: object)\n", "(7, salary 59344\n", "gender 0\n", "departm bio\n", "years 5\n", "age 40\n", "publications 11\n", "Name: 7, dtype: object)\n", "(8, salary 58560\n", "gender 0\n", "departm bio\n", "years 8\n", "age 38\n", "publications 8\n", "Name: 8, dtype: object)\n", "(9, salary 58294\n", "gender 0\n", "departm bio\n", "years 20\n", "age 50\n", "publications 12\n", "Name: 9, dtype: object)\n", "(10, salary 56092\n", "gender 0\n", "departm bio\n", "years 2\n", "age 40\n", "publications 4\n", "Name: 10, dtype: object)\n", "(11, salary 54452\n", "gender 0\n", "departm bio\n", "years 13\n", "age 43\n", "publications 7\n", "Name: 11, dtype: object)\n", "(12, salary 54269\n", "gender 0\n", "departm bio\n", "years 26\n", "age 56\n", "publications 12\n", "Name: 12, dtype: object)\n", "(13, salary 55125\n", "gender 0\n", "departm bio\n", "years 8\n", "age 38\n", "publications 9\n", "Name: 13, dtype: object)\n", "(14, salary 97630\n", "gender 
0\n", "departm chem\n", "years 34\n", "age 64\n", "publications 43\n", "Name: 14, dtype: object)\n", "(15, salary 82444\n", "gender 0\n", "departm chem\n", "years 31\n", "age 61\n", "publications 42\n", "Name: 15, dtype: object)\n", "(16, salary 76291\n", "gender 0\n", "departm chem\n", "years 29\n", "age 65\n", "publications 33\n", "Name: 16, dtype: object)\n", "(17, salary 75382\n", "gender 0\n", "departm chem\n", "years 26\n", "age 56\n", "publications 39\n", "Name: 17, dtype: object)\n", "(18, salary 64762\n", "gender 0\n", "departm chem\n", "years 25\n", "age NaN\n", "publications 29\n", "Name: 18, dtype: object)\n", "(19, salary 62607\n", "gender 0\n", "departm chem\n", "years 20\n", "age 45\n", "publications 34\n", "Name: 19, dtype: object)\n", "(20, salary 60373\n", "gender 0\n", "departm chem\n", "years 26\n", "age 56\n", "publications 43\n", "Name: 20, dtype: object)\n", "(21, salary 58892\n", "gender 0\n", "departm chem\n", "years 18\n", "age 48\n", "publications 21\n", "Name: 21, dtype: object)\n", "(22, salary 47021\n", "gender 0\n", "departm chem\n", "years 4\n", "age 34\n", "publications 12\n", "Name: 22, dtype: object)\n", "(23, salary 44687\n", "gender 0\n", "departm chem\n", "years 4\n", "age 34\n", "publications 19\n", "Name: 23, dtype: object)\n", "(24, salary 104828\n", "gender 0\n", "departm geol\n", "years NaN\n", "age 50\n", "publications 44\n", "Name: 24, dtype: object)\n", "(25, salary 71456\n", "gender 0\n", "departm geol\n", "years 11\n", "age 41\n", "publications 32\n", "Name: 25, dtype: object)\n", "(26, salary 65144\n", "gender 0\n", "departm geol\n", "years 7\n", "age 37\n", "publications 12\n", "Name: 26, dtype: object)\n", "(27, salary 52766\n", "gender 0\n", "departm geol\n", "years 4\n", "age 38\n", "publications 32\n", "Name: 27, dtype: object)\n", "(28, salary 112800\n", "gender 0\n", "departm neuro\n", "years 14\n", "age 44\n", "publications 33\n", "Name: 28, dtype: object)\n", "(29, salary 105761\n", "gender 0\n", "departm 
neuro\n", "years 9\n", "age 39\n", "publications 30\n", "Name: 29, dtype: object)\n", "(30, salary 92951\n", "gender 0\n", "departm neuro\n", "years 11\n", "age 41\n", "publications 20\n", "Name: 30, dtype: object)\n", "(31, salary 86621\n", "gender 0\n", "departm neuro\n", "years 19\n", "age 49\n", "publications 10\n", "Name: 31, dtype: object)\n", "(32, salary 85569\n", "gender 0\n", "departm neuro\n", "years 20\n", "age 46\n", "publications 35\n", "Name: 32, dtype: object)\n", "(33, salary 83896\n", "gender 0\n", "departm neuro\n", "years 10\n", "age 40\n", "publications 22\n", "Name: 33, dtype: object)\n", "(34, salary 79735\n", "gender 0\n", "departm neuro\n", "years 11\n", "age 41\n", "publications 32\n", "Name: 34, dtype: object)\n", "(35, salary 71518\n", "gender 0\n", "departm neuro\n", "years 7\n", "age 37\n", "publications 34\n", "Name: 35, dtype: object)\n", "(36, salary 68029\n", "gender 0\n", "departm neuro\n", "years 15\n", "age 45\n", "publications 33\n", "Name: 36, dtype: object)\n", "(37, salary 66482\n", "gender 0\n", "departm neuro\n", "years 14\n", "age 44\n", "publications 42\n", "Name: 37, dtype: object)\n", "(38, salary 61680\n", "gender 0\n", "departm neuro\n", "years 18\n", "age 48\n", "publications 20\n", "Name: 38, dtype: object)\n", "(39, salary 60455\n", "gender 0\n", "departm neuro\n", "years 8\n", "age 38\n", "publications 49\n", "Name: 39, dtype: object)\n", "(40, salary 58932\n", "gender 0\n", "departm neuro\n", "years 11\n", "age 41\n", "publications 49\n", "Name: 40, dtype: object)\n", "(41, salary 106412\n", "gender 0\n", "departm stat\n", "years 23\n", "age 53\n", "publications 29\n", "Name: 41, dtype: object)\n", "(42, salary 86980\n", "gender 0\n", "departm stat\n", "years 23\n", "age 53\n", "publications 42\n", "Name: 42, dtype: object)\n", "(43, salary 78114\n", "gender 0\n", "departm stat\n", "years 8\n", "age 38\n", "publications 24\n", "Name: 43, dtype: object)\n", "(44, salary 74085\n", "gender 0\n", "departm stat\n", 
"years 11\n", "age 41\n", "publications 33\n", "Name: 44, dtype: object)\n", "(45, salary 72250\n", "gender 0\n", "departm stat\n", "years 26\n", "age 56\n", "publications 9\n", "Name: 45, dtype: object)\n", "(46, salary 69596\n", "gender 0\n", "departm stat\n", "years 20\n", "age 50\n", "publications 18\n", "Name: 46, dtype: object)\n", "(47, salary 65285\n", "gender 0\n", "departm stat\n", "years 20\n", "age 50\n", "publications 15\n", "Name: 47, dtype: object)\n", "(48, salary 62557\n", "gender 0\n", "departm stat\n", "years 28\n", "age 58\n", "publications 14\n", "Name: 48, dtype: object)\n", "(49, salary 61947\n", "gender 0\n", "departm stat\n", "years 22\n", "age 58\n", "publications 17\n", "Name: 49, dtype: object)\n", "(50, salary 58565\n", "gender 0\n", "departm stat\n", "years 29\n", "age 59\n", "publications 11\n", "Name: 50, dtype: object)\n", "(51, salary 58365\n", "gender 0\n", "departm stat\n", "years 18\n", "age 48\n", "publications 21\n", "Name: 51, dtype: object)\n", "(52, salary 53656\n", "gender 0\n", "departm stat\n", "years 2\n", "age 32\n", "publications 4\n", "Name: 52, dtype: object)\n", "(53, salary 51391\n", "gender 0\n", "departm stat\n", "years 5\n", "age 35\n", "publications 8\n", "Name: 53, dtype: object)\n", "(54, salary 96936\n", "gender 0\n", "departm physics\n", "years 15\n", "age 50\n", "publications 17\n", "Name: 54, dtype: object)\n", "(55, salary 83216\n", "gender 0\n", "departm physics\n", "years 11\n", "age 37\n", "publications 19\n", "Name: 55, dtype: object)\n", "(56, salary 72044\n", "gender 0\n", "departm physics\n", "years 2\n", "age 32\n", "publications 16\n", "Name: 56, dtype: object)\n", "(57, salary 64048\n", "gender 0\n", "departm physics\n", "years 23\n", "age 53\n", "publications 4\n", "Name: 57, dtype: object)\n", "(58, salary 58888\n", "gender 0\n", "departm physics\n", "years 26\n", "age 56\n", "publications 7\n", "Name: 58, dtype: object)\n", "(59, salary 58744\n", "gender 0\n", "departm physics\n", "years 
20\n", "age 50\n", "publications 9\n", "Name: 59, dtype: object)\n", "(60, salary 55944\n", "gender 0\n", "departm physics\n", "years 21\n", "age 51\n", "publications 8\n", "Name: 60, dtype: object)\n", "(61, salary 54076\n", "gender 0\n", "departm physics\n", "years 19\n", "age 49\n", "publications 12\n", "Name: 61, dtype: object)\n", "(62, salary 82142\n", "gender 0\n", "departm math\n", "years 9\n", "age 39\n", "publications 9\n", "Name: 62, dtype: object)\n", "(63, salary 70509\n", "gender 0\n", "departm math\n", "years 23\n", "age 53\n", "publications 7\n", "Name: 63, dtype: object)\n", "(64, salary 60320\n", "gender 0\n", "departm math\n", "years 14\n", "age 44\n", "publications 7\n", "Name: 64, dtype: object)\n", "(65, salary 55814\n", "gender 0\n", "departm math\n", "years 8\n", "age 38\n", "publications 6\n", "Name: 65, dtype: object)\n", "(66, salary 53638\n", "gender 0\n", "departm math\n", "years 4\n", "age 42\n", "publications 8\n", "Name: 66, dtype: object)\n", "(67, salary 53517\n", "gender 2\n", "departm math\n", "years 5\n", "age 35\n", "publications 5\n", "Name: 67, dtype: object)\n", "(68, salary 59139\n", "gender 1\n", "departm bio\n", "years 8\n", "age 38\n", "publications 23\n", "Name: 68, dtype: object)\n", "(69, salary 52968\n", "gender 1\n", "departm bio\n", "years 18\n", "age 48\n", "publications 32\n", "Name: 69, dtype: object)\n", "(70, salary 55949\n", "gender 1\n", "departm chem\n", "years 4\n", "age 34\n", "publications 12\n", "Name: 70, dtype: object)\n", "(71, salary 58893\n", "gender 1\n", "departm neuro\n", "years 10\n", "age 35\n", "publications 4\n", "Name: 71, dtype: object)\n", "(72, salary 53662\n", "gender 1\n", "departm neuro\n", "years 1\n", "age 31\n", "publications 3\n", "Name: 72, dtype: object)\n", "(73, salary 57185\n", "gender 1\n", "departm stat\n", "years 9\n", "age 39\n", "publications 7\n", "Name: 73, dtype: object)\n", "(74, salary 52254\n", "gender 1\n", "departm stat\n", "years 2\n", "age 32\n", "publications 
9\n", "Name: 74, dtype: object)\n", "(75, salary 61885\n", "gender 1\n", "departm math\n", "years 23\n", "age 60\n", "publications 9\n", "Name: 75, dtype: object)\n", "(76, salary 49542\n", "gender 1\n", "departm math\n", "years 3\n", "age 33\n", "publications 5\n", "Name: 76, dtype: object)\n" ] } ], "source": [ "for row in df.iterrows():\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [substantially faster version](https://medium.com/@formigone/stop-using-df-iterrows-2fbc2931b60e) of `.iterrows()` is `.itertuples()` which returns a slightly different representation of each row:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas(Index=0, salary=86285, gender=0, departm='bio', years=26.0, age=64.0, publications=72)\n", "Pandas(Index=1, salary=77125, gender=0, departm='bio', years=28.0, age=58.0, publications=43)\n", "Pandas(Index=2, salary=71922, gender=0, departm='bio', years=10.0, age=38.0, publications=23)\n", "Pandas(Index=3, salary=70499, gender=0, departm='bio', years=16.0, age=46.0, publications=64)\n", "Pandas(Index=4, salary=66624, gender=0, departm='bio', years=11.0, age=41.0, publications=23)\n", "Pandas(Index=5, salary=64451, gender=0, departm='bio', years=23.0, age=60.0, publications=44)\n", "Pandas(Index=6, salary=64366, gender=0, departm='bio', years=23.0, age=53.0, publications=22)\n", "Pandas(Index=7, salary=59344, gender=0, departm='bio', years=5.0, age=40.0, publications=11)\n", "Pandas(Index=8, salary=58560, gender=0, departm='bio', years=8.0, age=38.0, publications=8)\n", "Pandas(Index=9, salary=58294, gender=0, departm='bio', years=20.0, age=50.0, publications=12)\n", "Pandas(Index=10, salary=56092, gender=0, departm='bio', years=2.0, age=40.0, publications=4)\n", "Pandas(Index=11, salary=54452, gender=0, departm='bio', years=13.0, age=43.0, publications=7)\n", "Pandas(Index=12, salary=54269, gender=0, departm='bio', 
years=26.0, age=56.0, publications=12)\n", "Pandas(Index=13, salary=55125, gender=0, departm='bio', years=8.0, age=38.0, publications=9)\n", "Pandas(Index=14, salary=97630, gender=0, departm='chem', years=34.0, age=64.0, publications=43)\n", "Pandas(Index=15, salary=82444, gender=0, departm='chem', years=31.0, age=61.0, publications=42)\n", "Pandas(Index=16, salary=76291, gender=0, departm='chem', years=29.0, age=65.0, publications=33)\n", "Pandas(Index=17, salary=75382, gender=0, departm='chem', years=26.0, age=56.0, publications=39)\n", "Pandas(Index=18, salary=64762, gender=0, departm='chem', years=25.0, age=nan, publications=29)\n", "Pandas(Index=19, salary=62607, gender=0, departm='chem', years=20.0, age=45.0, publications=34)\n", "Pandas(Index=20, salary=60373, gender=0, departm='chem', years=26.0, age=56.0, publications=43)\n", "Pandas(Index=21, salary=58892, gender=0, departm='chem', years=18.0, age=48.0, publications=21)\n", "Pandas(Index=22, salary=47021, gender=0, departm='chem', years=4.0, age=34.0, publications=12)\n", "Pandas(Index=23, salary=44687, gender=0, departm='chem', years=4.0, age=34.0, publications=19)\n", "Pandas(Index=24, salary=104828, gender=0, departm='geol', years=nan, age=50.0, publications=44)\n", "Pandas(Index=25, salary=71456, gender=0, departm='geol', years=11.0, age=41.0, publications=32)\n", "Pandas(Index=26, salary=65144, gender=0, departm='geol', years=7.0, age=37.0, publications=12)\n", "Pandas(Index=27, salary=52766, gender=0, departm='geol', years=4.0, age=38.0, publications=32)\n", "Pandas(Index=28, salary=112800, gender=0, departm='neuro', years=14.0, age=44.0, publications=33)\n", "Pandas(Index=29, salary=105761, gender=0, departm='neuro', years=9.0, age=39.0, publications=30)\n", "Pandas(Index=30, salary=92951, gender=0, departm='neuro', years=11.0, age=41.0, publications=20)\n", "Pandas(Index=31, salary=86621, gender=0, departm='neuro', years=19.0, age=49.0, publications=10)\n", "Pandas(Index=32, salary=85569, 
gender=0, departm='neuro', years=20.0, age=46.0, publications=35)\n", "Pandas(Index=33, salary=83896, gender=0, departm='neuro', years=10.0, age=40.0, publications=22)\n", "Pandas(Index=34, salary=79735, gender=0, departm='neuro', years=11.0, age=41.0, publications=32)\n", "Pandas(Index=35, salary=71518, gender=0, departm='neuro', years=7.0, age=37.0, publications=34)\n", "Pandas(Index=36, salary=68029, gender=0, departm='neuro', years=15.0, age=45.0, publications=33)\n", "Pandas(Index=37, salary=66482, gender=0, departm='neuro', years=14.0, age=44.0, publications=42)\n", "Pandas(Index=38, salary=61680, gender=0, departm='neuro', years=18.0, age=48.0, publications=20)\n", "Pandas(Index=39, salary=60455, gender=0, departm='neuro', years=8.0, age=38.0, publications=49)\n", "Pandas(Index=40, salary=58932, gender=0, departm='neuro', years=11.0, age=41.0, publications=49)\n", "Pandas(Index=41, salary=106412, gender=0, departm='stat', years=23.0, age=53.0, publications=29)\n", "Pandas(Index=42, salary=86980, gender=0, departm='stat', years=23.0, age=53.0, publications=42)\n", "Pandas(Index=43, salary=78114, gender=0, departm='stat', years=8.0, age=38.0, publications=24)\n", "Pandas(Index=44, salary=74085, gender=0, departm='stat', years=11.0, age=41.0, publications=33)\n", "Pandas(Index=45, salary=72250, gender=0, departm='stat', years=26.0, age=56.0, publications=9)\n", "Pandas(Index=46, salary=69596, gender=0, departm='stat', years=20.0, age=50.0, publications=18)\n", "Pandas(Index=47, salary=65285, gender=0, departm='stat', years=20.0, age=50.0, publications=15)\n", "Pandas(Index=48, salary=62557, gender=0, departm='stat', years=28.0, age=58.0, publications=14)\n", "Pandas(Index=49, salary=61947, gender=0, departm='stat', years=22.0, age=58.0, publications=17)\n", "Pandas(Index=50, salary=58565, gender=0, departm='stat', years=29.0, age=59.0, publications=11)\n", "Pandas(Index=51, salary=58365, gender=0, departm='stat', years=18.0, age=48.0, publications=21)\n", 
"Pandas(Index=52, salary=53656, gender=0, departm='stat', years=2.0, age=32.0, publications=4)\n", "Pandas(Index=53, salary=51391, gender=0, departm='stat', years=5.0, age=35.0, publications=8)\n", "Pandas(Index=54, salary=96936, gender=0, departm='physics', years=15.0, age=50.0, publications=17)\n", "Pandas(Index=55, salary=83216, gender=0, departm='physics', years=11.0, age=37.0, publications=19)\n", "Pandas(Index=56, salary=72044, gender=0, departm='physics', years=2.0, age=32.0, publications=16)\n", "Pandas(Index=57, salary=64048, gender=0, departm='physics', years=23.0, age=53.0, publications=4)\n", "Pandas(Index=58, salary=58888, gender=0, departm='physics', years=26.0, age=56.0, publications=7)\n", "Pandas(Index=59, salary=58744, gender=0, departm='physics', years=20.0, age=50.0, publications=9)\n", "Pandas(Index=60, salary=55944, gender=0, departm='physics', years=21.0, age=51.0, publications=8)\n", "Pandas(Index=61, salary=54076, gender=0, departm='physics', years=19.0, age=49.0, publications=12)\n", "Pandas(Index=62, salary=82142, gender=0, departm='math', years=9.0, age=39.0, publications=9)\n", "Pandas(Index=63, salary=70509, gender=0, departm='math', years=23.0, age=53.0, publications=7)\n", "Pandas(Index=64, salary=60320, gender=0, departm='math', years=14.0, age=44.0, publications=7)\n", "Pandas(Index=65, salary=55814, gender=0, departm='math', years=8.0, age=38.0, publications=6)\n", "Pandas(Index=66, salary=53638, gender=0, departm='math', years=4.0, age=42.0, publications=8)\n", "Pandas(Index=67, salary=53517, gender=2, departm='math', years=5.0, age=35.0, publications=5)\n", "Pandas(Index=68, salary=59139, gender=1, departm='bio', years=8.0, age=38.0, publications=23)\n", "Pandas(Index=69, salary=52968, gender=1, departm='bio', years=18.0, age=48.0, publications=32)\n", "Pandas(Index=70, salary=55949, gender=1, departm='chem', years=4.0, age=34.0, publications=12)\n", "Pandas(Index=71, salary=58893, gender=1, departm='neuro', years=10.0, 
age=35.0, publications=4)\n", "Pandas(Index=72, salary=53662, gender=1, departm='neuro', years=1.0, age=31.0, publications=3)\n", "Pandas(Index=73, salary=57185, gender=1, departm='stat', years=9.0, age=39.0, publications=7)\n", "Pandas(Index=74, salary=52254, gender=1, departm='stat', years=2.0, age=32.0, publications=9)\n", "Pandas(Index=75, salary=61885, gender=1, departm='math', years=23.0, age=60.0, publications=9)\n", "Pandas(Index=76, salary=49542, gender=1, departm='math', years=3.0, age=33.0, publications=5)\n" ] } ], "source": [ "for row in df.itertuples():\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data organization - tidy and wide formats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have seen how to import your data and the basic operations of dataframes, you might think that you're now ready to start analyzing and visualizing your data. Yet there's still quite a bit of work that has to be done cleaning up your dataset before you do this. Some of this cleanup might include fixing typos in your datasets, or filtering out participants with incomplete data. But even more importantly, you'll need to make sure that your data is organized and structured in a consistent manner. One way of organizing your data, referred to as **tidy format** (a term coined by Hadley Wickham), is particularly helpful for facilitating data analysis. \n", "\n", "To clarify what tidy format is, we will also talk about another common format of data called **wide format** that isn't very tidy. While there are some reasons to use wide format over tidy format when programming (e.g., some functions in R require data to be in wide format), wide format is more commonly used when working with interface-based software (such as Excel and SPSS).\n", "\n", "The clearest way of demonstrating the differences between wide and tidy formats is simply to look at datasets of both kinds. Take the example dataset below. 
It shows quiz scores for four participants across three timepoints plus some demographic data. It's currently in wide format: " ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegendertime1time2time3
0tom26m121515
1nick23m10912
2julie18f151314
3angela21f101012
\n", "
" ], "text/plain": [ " name age gender time1 time2 time3\n", "0 tom 26 m 12 15 15\n", "1 nick 23 m 10 9 12\n", "2 julie 18 f 15 13 14\n", "3 angela 21 f 10 10 12" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "data = [['tom', 26, \"m\", 12, 15, 15], \n", " ['nick', 23, \"m\", 10, 9, 12], \n", " ['julie', 18, \"f\", 15, 13, 14], \n", " ['angela', 21, \"f\", 10, 10, 12]] \n", "df_wide = pd.DataFrame(data, columns = ['name', 'age', 'gender', 'time1', 'time2', 'time3']) \n", "df_wide " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In wide format, each *individual* has their own row, and all data pertaining to that individual is contained within that row. For example, the first row has all of the data for Tom, including his scores across each of the three timepoints. \n", "\n", "Now let's compare this to tidy format:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegendertimepointscore
0tom26mtime112
4tom26mtime215
8tom26mtime315
1nick23mtime110
5nick23mtime29
9nick23mtime312
2julie18ftime115
6julie18ftime213
10julie18ftime314
3angela21ftime110
7angela21ftime210
11angela21ftime312
\n", "
" ], "text/plain": [ " name age gender timepoint score\n", "0 tom 26 m time1 12\n", "4 tom 26 m time2 15\n", "8 tom 26 m time3 15\n", "1 nick 23 m time1 10\n", "5 nick 23 m time2 9\n", "9 nick 23 m time3 12\n", "2 julie 18 f time1 15\n", "6 julie 18 f time2 13\n", "10 julie 18 f time3 14\n", "3 angela 21 f time1 10\n", "7 angela 21 f time2 10\n", "11 angela 21 f time3 12" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy = pd.melt(df_wide, id_vars=['name', 'age', 'gender'], var_name='timepoint', value_name='score') \n", "df_tidy = df_tidy.sort_values(ascending=False, by='name')\n", "df_tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this format, now each *observation* has their own row. That is, we now have three separate rows for Tom, with each row reflecting a single quiz score from one of the three timepoints. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is tidy data, and why use it? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pulling directly from Wickham's book chapter on tidy data (his book is written for R users, but the principles of tidying data works the same across languages), there are three rules that work together to make a dataset tidy ({cite}`Wickham:2017`): \n", "\n", "1. Each _observation_ must have its own row (observations could be _each person_, _each timepoint_, etc.)\n", "2. Each _variable_ must have its own column (variables are some kind of measurement: _gender_, _age_, _score_, etc.)\n", "3. Each _value_ must have its own cell (value are the actual measurement: _female_, _23 years_, _12 points_, etc.)\n", "\n", "
\n", " \n", "
\n", "Above, a visual representation of each of the three rules for tidy data: variables are in columns, observations are in rows, and values are in cells. Image borrowed from R for Data Science.\n", "\n", "So going back to our dataset from before, we can see that it follows each of these three rules. Each person and each timepoint has their own rows (Rule 1). Every type of data measured (gender, age, etc.) has their own column (Rule 2), and each cell within the dataframe has only one value (Rule 3).\n", "\n", "Let's look at some examples where one or more of these rules are being violated. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameage_gendertime1time2time3
0tom26_m121515
1nick23_m10912
2julie18_f151314
3angela21_f101012
\n", "
" ], "text/plain": [ " name age_gender time1 time2 time3\n", "0 tom 26_m 12 15 15\n", "1 nick 23_m 10 9 12\n", "2 julie 18_f 15 13 14\n", "3 angela 21_f 10 10 12" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_untidy = [['tom', \"26_m\", 12, 15, 15], \n", " ['nick', \"23_m\", 10, 9, 12], \n", " ['julie', \"18_f\", 15, 13, 14], \n", " ['angela', \"21_f\", 10, 10, 12]] \n", "df_untidy = pd.DataFrame(data_untidy, columns = ['name', 'age_gender', 'time1', 'time2', 'time3'])\n", "df_untidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, this dataset is messy because two variables (age and gender) have been coded into the same column. Having multiple variables in the same column in this way makes it much harder to calculate the average age of participants or get a gender breakdown for the sample. \n", "\n", "But this can easily be fixed by splitting the column into two. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametime1time2time3agegender
0tom12151526m
1nick1091223m
2julie15131418f
3angela10101221f
\n", "
" ], "text/plain": [ " name time1 time2 time3 age gender\n", "0 tom 12 15 15 26 m\n", "1 nick 10 9 12 23 m\n", "2 julie 15 13 14 18 f\n", "3 angela 10 10 12 21 f" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_untidy[['age', 'gender']] = df_untidy.age_gender.str.split(\"_\", expand=True)\n", "del df_untidy['age_gender']\n", "\n", "df_untidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's another case of untidy data, similar to one we've seen before. " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameaget1_mint1_maxt2_mint2_maxt3_mint3_max
0tom26121515171820
1nick23101912141118
2julie18152314181219
3angela21101112141530
\n", "
" ], "text/plain": [ " name age t1_min t1_max t2_min t2_max t3_min t3_max\n", "0 tom 26 12 15 15 17 18 20\n", "1 nick 23 10 19 12 14 11 18\n", "2 julie 18 15 23 14 18 12 19\n", "3 angela 21 10 11 12 14 15 30" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_messy = [['tom', 26, 12, 15, 15, 17, 18, 20], \n", " ['nick', 23, 10, 19, 12, 14, 11, 18], \n", " ['julie', 18, 15, 23, 14, 18, 12, 19], \n", " ['angela', 21, 10, 11, 12, 14, 15, 30]] \n", "df_messy = pd.DataFrame(data_messy, columns = ['name', 'age', 't1_min', 't1_max', 't2_min', 't2_max', 't3_min', 't3_max'])\n", "df_messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset violates the rules of a tidy dataset because a) there's more than one observation in each row, and b) we have multiple columns reflecting the same variable (the distinction between a minimum and maximum score). \n", "\n", "But we can fix it using `pd.melt` with a combination of few other things. In this tidied version, we now see that each variable (_name_, _age_, _time_, _min_ score, and _max_ score) all have their own columns. " ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmaxnameagetimemaxmin
0angela21t11110
1angela21t21412
2angela21t33015
3julie18t12315
4julie18t21814
5julie18t31912
6nick23t11910
7nick23t21412
8nick23t31811
9tom26t11512
10tom26t21715
11tom26t32018
\n", "
" ], "text/plain": [ "minmax name age time max min\n", "0 angela 21 t1 11 10\n", "1 angela 21 t2 14 12\n", "2 angela 21 t3 30 15\n", "3 julie 18 t1 23 15\n", "4 julie 18 t2 18 14\n", "5 julie 18 t3 19 12\n", "6 nick 23 t1 19 10\n", "7 nick 23 t2 14 12\n", "8 nick 23 t3 18 11\n", "9 tom 26 t1 15 12\n", "10 tom 26 t2 17 15\n", "11 tom 26 t3 20 18" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_messy = pd.melt(df_messy, id_vars=['name', 'age'], var_name='time_minmax', value_name='score')\n", "\n", "df_messy[['time','minmax']] = df_messy.time_minmax.str.split(\"_\", expand = True)\n", "del df_messy['time_minmax']\n", "\n", "df_messy = df_messy.pivot_table(index=['name','age','time'], columns='minmax', values='score')\n", "df_messy.reset_index(drop=False, inplace=True)\n", "df_messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't worry if the actual code to make the data tidy doesn't make sense right away (you will get there). We'll talk more later about strategies you can use to wrangle data into tidy formats using Python. What you should take away from these examples is the difference in how these datasets are structured before and after we tidied them. \n", "\n", "Cleaning up and rearranging datasets is an inevitable part of doing psychological research. Lots of data you work with may come to you in untidy formats, and you will have to clean that data up before you can analyze it. The good news is that knowing these principles of tidy, organized datasets will serve you well as you start to design your experiments! **The more you build tidy principles into how your data is recorded during the experiment, the more time you save later when it comes to processing your data in Python.**\n", "\n", "Also importantly, this discussion highlights one major purpose of the Pandas library: namely, to help you clean up and reorganize data. 
Commands like `pd.melt()` and `pd.pivot_table()` perform fairly complex operations which alter your view of a dataset and make it easier to perform certain types of analysis. These reorganizations of data can be very difficult to do by hand but are made straightforward by the dataframe tools that Pandas provides." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A case example: organizing data for a psychological experiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The discussion so far might strike you as somewhat abstract. However, principles of data formatting and organization are very important for psychological research. Let's consider the case most central to this course: organizing a data file for a psychological experiment conducted on several participants and composed of multiple trials and trial types. This is perhaps the most typical data format we will encounter in this work, and it is somewhat different from survey questions and other types of data, which often have a more complex structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep things simple, let's imagine that we are doing a decision-making experiment. On each trial of the experiment the subject will be offered a choice between a gamble and a sure amount of money. For instance the subject might be presented with an option like \"Would you prefer \\$10 now or a 50\\% chance to win 20 dollars, otherwise nothing?\" If you think about it a bit, this specific pair of options *on average* gives the same outcome, because the expected value of the uncertain option is $0.5*\\$20 + 0.5*0 = \\$10.$ If people were perfectly rational they should be indifferent between these two options. However, a common phenomenon is that people show a bias called **risk aversion**: they prefer a certain reward over a risky gamble. Thus, even though on average they get the same amount, they do not want to take the gamble for the larger amount of money. 
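That expected-value arithmetic is easy to check directly (a quick sketch using the numbers from the example above):

```python
# Expected value check for the example: $10 for sure vs. a 50% chance of $20 (else $0)
certain = 10
p_win, win_amount, lose_amount = 0.5, 20, 0

# Expected value of the gamble: probability-weighted average of its outcomes
ev_gamble = p_win * win_amount + (1 - p_win) * lose_amount
print(certain, ev_gamble)  # 10 10.0 -- equal on average, so a rational agent is indifferent
```
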
There are several explanations of this, but one intuitive one is that people don't like regret, and the feeling of ending up with nothing is so bad that it makes people prefer the certain \\$10." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So imagine we do an experiment attempting to explore risk aversion. We might present people with several of these types of gambles across multiple trials. The reason is that we want more than one measurement and want to vary things like the probability of the higher reward on the risky gamble (e.g., instead of a 50% chance we might want to explore a 10% chance of winning). In addition, we might want to change the magnitudes of the rewards across different trials. Finally we will want to run multiple people so that our conclusions aren't tied to one person. How should we structure our data file for this? You might think that we should first write the code for the experiment and then consider the data format. However, I find it is usually helpful to do the opposite: start with the data format in mind right from the start." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look back at the key ideas of a \"tidy\" dataset, because we definitely want to be tidy here.\n", "\n", "1. Each _observation_ must have its own row (observations could be _each person_, _each timepoint_, etc.)\n", "2. Each _variable_ must have its own column (variables are some kind of measurement: _gender_, _age_, _score_, etc.)\n", "3. 
Each _value_ must have its own cell (values are the actual measurements: _female_, _23 years_, _12 points_, etc.)\n", "\n", "So each _observation_ would be a trial, and so it needs its own row.\n", "\n", "Each _variable_ would be information about the trial, both the details of what the subject saw on that trial (e.g., the magnitude of the rewards and gambles) and also how they responded (reaction time, what choice they made, etc...).\n", "\n", "The rule that each _value_ must have its own cell means that each of these numbers needs its own cell and should not be combined with others. So let's make a pretend dataframe from scratch to illustrate this example. \n", "\n", "So let's imagine that there are 10 stimuli in the experiment. We can denote them as the value of the certain option and then the value of the high and low reward for the gamble. This means each stimulus/trial in the experiment can be described with four numbers:\n", "\n", "- certain value\n", "- upper risky value\n", "- lower risky value\n", "- probability of upper risky value\n", "\n", "We can define a few alternatives (let's say enough for 4 trials) using a python dictionary:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "stim={\"certain\": [10, 15, 25, 50], \n", " \"upper_risk\": [20, 40, 30, 32], \n", " \"lower_risk\": [0, 23, 29, 0], \n", " \"prob\": [0.5, 0.5, 0.25, 0.75]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we can pretend to generate some responses from one subject." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "responses = {\"reaction_time\":[200, 400, 500, 600], \n", " \"choice\": [0, 1, 1, 0]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can build up a dataframe for one subject. 
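As a quick sanity check first (a small sketch, not part of the final experiment code): because these are parallel lists with one entry per trial, each trial's full description can be read back out by indexing every list at the same position:

```python
# The stimulus dictionary from above: parallel lists, one entry per trial
stim = {"certain": [10, 15, 25, 50],
        "upper_risk": [20, 40, 30, 32],
        "lower_risk": [0, 23, 29, 0],
        "prob": [0.5, 0.5, 0.25, 0.75]}

# Pull out trial 0: a certain $10 versus a 50% chance of $20 (otherwise $0)
trial0 = {key: values[0] for key, values in stim.items()}
print(trial0)  # {'certain': 10, 'upper_risk': 20, 'lower_risk': 0, 'prob': 0.5}
```

Keeping the lists aligned by position this way is exactly what lets us assemble tidy rows trial by trial.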
We will do this by first defining a python dictionary called `trials` which includes all the _variables_ associated with one measurement (trial number, the value of the certain gamble, the probability, etc...) as well as all the measurements made on that trial (response time, choice). These start out at the beginning of the cell as empty lists and then a for loop below fills them in." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'certain': [10, 15, 25, 50],\n", " 'choice': [0, 1, 1, 0],\n", " 'lower_risk': [0, 23, 29, 0],\n", " 'prob': [0.5, 0.5, 0.25, 0.75],\n", " 'reaction_time': [200, 400, 500, 600],\n", " 'subject_id': ['some_subject_id0',\n", " 'some_subject_id0',\n", " 'some_subject_id0',\n", " 'some_subject_id0'],\n", " 'trial_num': [0, 1, 2, 3],\n", " 'upper_risk': [20, 40, 30, 32]}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subject_id = \"some_subject_id0\"\n", "\n", "trials = {'subject_id':[],\n", " 'trial_num':[],\n", " 'certain':[],\n", " 'upper_risk':[],\n", " 'lower_risk':[],\n", " 'prob':[],\n", " 'reaction_time':[],\n", " 'choice':[]}\n", "\n", "for i in range(4):\n", " trials['subject_id'].append(subject_id)\n", " trials['trial_num'].append(i)\n", " trials['certain'].append(stim['certain'][i])\n", " trials['upper_risk'].append(stim['upper_risk'][i])\n", " trials['lower_risk'].append(stim['lower_risk'][i])\n", " trials['prob'].append(stim['prob'][i])\n", " trials['reaction_time'].append(responses[\"reaction_time\"][i])\n", " trials['choice'].append(responses[\"choice\"][i])\n", "\n", "trials" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a python dictionary that includes the column names and the data that goes in each column, which is perfect for initializing a Pandas dataframe:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idtrial_numcertainupper_risklower_riskprobreaction_timechoice
0some_subject_id00102000.502000
1some_subject_id011540230.504001
2some_subject_id022530290.255001
3some_subject_id03503200.756000
\n", "
" ], "text/plain": [ " subject_id trial_num certain upper_risk lower_risk prob \\\n", "0 some_subject_id0 0 10 20 0 0.50 \n", "1 some_subject_id0 1 15 40 23 0.50 \n", "2 some_subject_id0 2 25 30 29 0.25 \n", "3 some_subject_id0 3 50 32 0 0.75 \n", "\n", " reaction_time choice \n", "0 200 0 \n", "1 400 1 \n", "2 500 1 \n", "3 600 0 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.DataFrame(trials)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is great and shows a very natural way to structure the data from an experiment. Each trial is a row in the dataframe. Each column either describes what the stimulus was on that trial or the response from the participant. In addition two columns code the trial number and the subject id. This is critical for analysis as we might be interested in order effects on trials and so we need to know which trials were first or last. In addition we want to keep subjects separate. Eventually we will build up a more complex dataframe that includes multiple subjects. For example:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idtrial_numcertainupper_risklower_riskprobreaction_timechoice
0some_subject_id00102000.509931
1some_subject_id011540230.505870
2some_subject_id022530290.256261
3some_subject_id03503200.758350
4some_subject_id10102000.506790
5some_subject_id111540230.502051
6some_subject_id122530290.257980
7some_subject_id13503200.757421
8some_subject_id20102000.507291
9some_subject_id211540230.507900
10some_subject_id222530290.252070
11some_subject_id23503200.755261
\n", "
" ], "text/plain": [ " subject_id trial_num certain upper_risk lower_risk prob \\\n", "0 some_subject_id0 0 10 20 0 0.50 \n", "1 some_subject_id0 1 15 40 23 0.50 \n", "2 some_subject_id0 2 25 30 29 0.25 \n", "3 some_subject_id0 3 50 32 0 0.75 \n", "4 some_subject_id1 0 10 20 0 0.50 \n", "5 some_subject_id1 1 15 40 23 0.50 \n", "6 some_subject_id1 2 25 30 29 0.25 \n", "7 some_subject_id1 3 50 32 0 0.75 \n", "8 some_subject_id2 0 10 20 0 0.50 \n", "9 some_subject_id2 1 15 40 23 0.50 \n", "10 some_subject_id2 2 25 30 29 0.25 \n", "11 some_subject_id2 3 50 32 0 0.75 \n", "\n", " reaction_time choice \n", "0 993 1 \n", "1 587 0 \n", "2 626 1 \n", "3 835 0 \n", "4 679 0 \n", "5 205 1 \n", "6 798 0 \n", "7 742 1 \n", "8 729 1 \n", "9 790 0 \n", "10 207 0 \n", "11 526 1 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stim={\"certain\": [10, 15, 25, 50], \n", " \"upper_risk\": [20, 40, 30, 32], \n", " \"lower_risk\": [0, 23, 29, 0], \n", " \"prob\": [0.5, 0.5, 0.25, 0.75]}\n", "\n", "trials = {'subject_id':[],\n", " 'trial_num':[],\n", " 'certain':[],\n", " 'upper_risk':[],\n", " 'lower_risk':[],\n", " 'prob':[],\n", " 'reaction_time':[],\n", " 'choice':[]}\n", "\n", "def generate_subject_data(subject_id, trials):\n", " for i in range(4):\n", " trials['subject_id'].append(subject_id)\n", " trials['trial_num'].append(i)\n", " trials['certain'].append(stim['certain'][i])\n", " trials['upper_risk'].append(stim['upper_risk'][i])\n", " trials['lower_risk'].append(stim['lower_risk'][i])\n", " trials['prob'].append(stim['prob'][i])\n", " trials['reaction_time'].append(np.random.randint(200,1000))\n", " trials['choice'].append(np.random.randint(0,2))\n", " \n", "for subj in [\"some_subject_id0\",\"some_subject_id1\",\"some_subject_id2\"]:\n", " generate_subject_data(subj,trials)\n", " \n", "exp_df=pd.DataFrame(trials)\n", "exp_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, take a careful look at this output. 
This is a dataframe with three subjects' worth of data. The first column is the subject id. The second is the trial number; notice how it starts at 0, counts up to 3, and then restarts when a new subject id begins. Next are the stimulus definitions (certain, upper_risk, lower_risk, prob, etc...) and the responses. This is exactly the way people usually code a data file for a cognitive or perceptual experiment. It is \"tidy\" and includes all the information in a handy form we can use to analyze the data later. Here we **generated** the data with a simple program that I wrote in the cell above, but generally we write a more complex program to handle this, including showing the stimulus to the subject, recording their reaction time, and so forth. This simple program is a proxy for that more complex program but shows you the important principle of how to **organize** your data in preparation for analysis. The analysis itself is what we will contemplate over the next several chapters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Split-Apply-Combine Workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps one of the most common data analysis steps is known as the **split-apply-combine** workflow. Helping with this workflow is one of the more interesting and powerful features of Pandas (and other dataframe libraries).\n", "\n", "The **split-apply-combine** workflow refers to a very common sequence of data analysis steps that data analysts have to make. Usually a dataset arrives with multiple sub-units inside it. For example, we might get a .csv file that has all the trials of an experiment from several subjects. Often we want to perform analyses that aggregate across these various subject categories. For example we might want to analyze the performance of each subject. 
Or alternatively we might want to analyze the difficulty of each item on the test.\n", "\n", "Both of these analyses require us to **split** the larger dataframe up into pieces and perform an analysis on each of these pieces. The step where you perform this analysis is called **apply**. So we break the bigger dataframe up into pieces and apply an analysis to each piece. The final step is to **combine** the results of those analyses back into a new dataframe structure for further analysis.\n", "\n", "{numref}`split-apply-combine` tries to emphasize these steps:\n", "\n", "\n", "```{figure} ./images/split-apply-combine.png\n", ":width: 800px\n", ":name: split-apply-combine\n", "\n", "Illustration of the split-apply-combine workflow!\n", "```\n", "\n", "On the left we begin with our original dataframe (`df`) which has two columns. One column is categorical and takes on values 'A', 'B', or 'C' while the second column is numeric. The first step of the split-apply-combine workflow is that we might want to split the original dataframe into smaller groups based on the value of one or more of the columns. For this example it makes sense to break the original dataframe into smaller dataframes called \"groups\" based on the value of `col1` (that is to say, whether the row is in category 'A', 'B', or 'C'). The method that does this in Pandas is `.groupby` and we will discuss it in a moment. The result of the groupby operation is a partitioning of the original dataframe into smaller sets (called groups) which are the rows of the original dataframe just reorganized. The reorganization is such that all the rows that share the same value on the specified column are grouped together. For instance in {numref}`split-apply-combine` all the rows that had value 'A' in `col1` are placed into the same group, separate from all the rows that had value 'B' in `col1` and so forth. This is the **split** operation.\n", "\n", "Next there are several things we can do to each of the groups. 
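Before turning to the Pandas methods, the whole split-apply-combine cycle can be sketched in a few lines of plain Python (using the same toy values as the example dataframe below):

```python
from collections import defaultdict

# The same toy data as the example dataframe: a categorical column and a numeric column
rows = [("A", 1), ("B", 2), ("C", 3), ("C", 4), ("B", 2), ("B", 5), ("A", 3)]

# split: gather the numeric values into groups keyed by the categorical value
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# apply + combine: reduce each group with sum() and collect the results
sums = {key: sum(values) for key, values in groups.items()}
print(sums)  # {'A': 4, 'B': 9, 'C': 7}
```

Pandas' `.groupby()` handles the split step for you, and its aggregation methods do the apply and combine steps in a single call.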
We can iterate over them in a for loop for instance (more on that later), but there are also methods that will quickly compute various descriptive statistics for each group, or allow you to \"apply\" a custom function to each group. This custom function could be quite complex, and so it allows you to do a lot of additional processing of the groups. The syntax for this tends to be very compact and helps to limit bugs, so it is helpful to learn.\n", "\n", "Let's begin with the simple example in {numref}`split-apply-combine`:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the dataframe:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2
0A1
1B2
2C3
3C4
4B2
5B5
6A3
\n", "
" ], "text/plain": [ " col1 col2\n", "0 A 1\n", "1 B 2\n", "2 C 3\n", "3 C 4\n", "4 B 2\n", "5 B 5\n", "6 A 3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df=pd.DataFrame({\"col1\":['A','B','C','C','B','B','A'],\"col2\": [1,2,3,4,2,5,3]})\n", "sac_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we will group the rows by the values in `col1`:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result here doesn't print out the individual groups but instead returns a `pandas.core.groupby.generic.DataFrameGroupBy` object. However we can print out the groups using iteration:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A\n", "-----\n", " col1 col2\n", "0 A 1\n", "6 A 3\n", "\n", "\n", "B\n", "-----\n", " col1 col2\n", "1 B 2\n", "4 B 2\n", "5 B 5\n", "\n", "\n", "C\n", "-----\n", " col1 col2\n", "2 C 3\n", "3 C 4\n", "\n", "\n" ] } ], "source": [ "for name, group_df in sac_df.groupby('col1'):\n", " print(name)\n", " print('-----')\n", " print(group_df)\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the rows have been sorted into groups based on the value in `col1`. Since there are three distinct/unique values in `col1` there are three groups here but if there were more distinct values there would be more groups.\n", "\n", "Next we compute the sum() which applies the sum to each group. We can chain these operations in a sequence:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2
col1
A4
B9
C7
\n", "
" ], "text/plain": [ " col2\n", "col1 \n", "A 4\n", "B 9\n", "C 7" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result here is a new dataframe where the sum of the values of `col2` for the different subgroups 'A','B', 'C' has been computed.\n", "\n", ":::{tip}\n", "Check the results by hand to make sure you understand what has been calculated here!\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok this example is fairly simple, what if we consider a more complex dataframe with more columns:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4
0AA11.10
1BA22.30
2CA312.00
3CB416.00
4BB222.21
5BB59.00
6AB30.50
\n", "
" ], "text/plain": [ " col1 col2 col3 col4\n", "0 A A 1 1.10\n", "1 B A 2 2.30\n", "2 C A 3 12.00\n", "3 C B 4 16.00\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "6 A B 3 0.50" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2=pd.DataFrame({\"col1\":['A','B','C','C','B','B','A'],\"col2\":['A','A','A','B','B','B','B'],\"col3\": [1,2,3,4,2,5,3],\"col4\": [1.1,2.3,12.,16,22.21,9,0.5]})\n", "sac_df_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we `.groupby()` column 1 (`col1`) we get the same groups as before, but notice that the groups now include all of the columns:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A\n", "-----\n", " col1 col2 col3 col4\n", "0 A A 1 1.1\n", "6 A B 3 0.5\n", "\n", "\n", "B\n", "-----\n", " col1 col2 col3 col4\n", "1 B A 2 2.30\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "\n", "\n", "C\n", "-----\n", " col1 col2 col3 col4\n", "2 C A 3 12.0\n", "3 C B 4 16.0\n", "\n", "\n" ] } ], "source": [ "for name, group_df in sac_df_2.groupby('col1'):\n", " print(name)\n", " print('-----')\n", " print(group_df)\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now the apply-combine step computes the sum for all the numeric columns." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
col1
A41.60
B933.51
C728.00
\n", "
" ], "text/plain": [ " col3 col4\n", "col1 \n", "A 4 1.60\n", "B 9 33.51\n", "C 7 28.00" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course you don't have to apply the sum to all the columns. If you just want to analyze `col4` you can select it out before applying the sum:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "col1\n", "A 1.60\n", "B 33.51\n", "C 28.00\n", "Name: col4, dtype: float64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1')['col4'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also group using more than one column as the grouping factors. For example in `sac_df_2` there are actually two columns that have discrete values and so you might want to make groups that are all combinations of those two factors. 
If so you just pass a list of grouping columns to `.groupby()`: " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('A', 'A')\n", "-----\n", " col1 col2 col3 col4\n", "0 A A 1 1.1\n", "\n", "\n", "('A', 'B')\n", "-----\n", " col1 col2 col3 col4\n", "6 A B 3 0.5\n", "\n", "\n", "('B', 'A')\n", "-----\n", " col1 col2 col3 col4\n", "1 B A 2 2.3\n", "\n", "\n", "('B', 'B')\n", "-----\n", " col1 col2 col3 col4\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "\n", "\n", "('C', 'A')\n", "-----\n", " col1 col2 col3 col4\n", "2 C A 3 12.0\n", "\n", "\n", "('C', 'B')\n", "-----\n", " col1 col2 col3 col4\n", "3 C B 4 16.0\n", "\n", "\n" ] } ], "source": [ "for name, group_df in sac_df_2.groupby(['col1','col2']):\n", " print(name)\n", " print('-----')\n", " print(group_df)\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then the sum operation will apply to all combinations." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
col1col2
AA11.10
B30.50
BA22.30
B731.21
CA312.00
B416.00
\n", "
" ], "text/plain": [ " col3 col4\n", "col1 col2 \n", "A A 1 1.10\n", " B 3 0.50\n", "B A 2 2.30\n", " B 7 31.21\n", "C A 3 12.00\n", " B 4 16.00" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby(['col1','col2']).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output here looks a little different because the columns are organized hierarchically (the specific term is that the index is hierarchical). But we can call `.reset_index()`, a function we mentioned earlier, to turn the hierarchical index back into regular columns and flatten things back to a standard data frame." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4
0AA11.10
1AB30.50
2BA22.30
3BB731.21
4CA312.00
5CB416.00
\n", "
" ], "text/plain": [ " col1 col2 col3 col4\n", "0 A A 1 1.10\n", "1 A B 3 0.50\n", "2 B A 2 2.30\n", "3 B B 7 31.21\n", "4 C A 3 12.00\n", "5 C B 4 16.00" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby(['col1','col2']).sum().reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final result here is itself a dataframe and so can be used for further analysis and plotting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the \"apply\" step" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the examples we considered so far we summed the columns within each group. However there are actually several types of things you can do to groups in the \"apply\" step:\n", " \n", "- **Aggregation**: compute a descriptive statistic like a sum, mean, max, min, etc...\n", "- **Transformation**: perform a group-specific computation like standardizing (z-scoring) within a group\n", "- **Filtration**: discard data from small groups or filter data from a group based on the sum or mean in that group (e.g., removing outliers in a group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we computed the sum of columns within each group. However there are several other functions you can use. For example, to compute the median instead:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
col1
A2.00.8
B2.09.0
C3.514.0
\n", "
" ], "text/plain": [ " col3 col4\n", "col1 \n", "A 2.0 0.8\n", "B 2.0 9.0\n", "C 3.5 14.0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1').median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a list of common aggregation functions:\n", "\n", "\n", "\n", "```{list-table} Example aggregation functions\n", ":header-rows: 1\n", ":name: aggregation-table\n", ":width: 200px\n", "\n", "* - Function\n", " - Description\n", "* - `.mean()`\n", " - Compute mean of groups\n", "* - `sum()`\n", " - Compute sum of group values\n", "* - `size()`\n", " - Compute group sizes\n", "* - `count()`\n", " - Compute count of group\n", "* - `std()`\n", " - Standard deviation of groups\n", "* - `var()`\n", " - Compute variance of groups\n", "* - `sem()`\n", " - Standard error of the mean of groups\n", "* - `describe()`\n", " - Generates descriptive statistics\n", "* - `first()`\n", " - Compute first of group values\n", "* - `last()`\n", " - Compute last of group values\n", "* - `nth()`\n", " - Take nth value, or a subset if n is a list\n", "* - `min()`\n", " - Compute min of group values\n", "* - `max()`\n", " - Compute max of group values\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition there is a special `.agg()` (or `.aggregate()`) function that you can pass mulitple functions to from other libraries: " ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
co32col4
col1
A41.60
B933.51
C728.00
\n", "
" ], "text/plain": [ " co32 col4\n", "col1 \n", "A 4 1.60\n", "B 9 33.51\n", "C 7 28.00" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1').agg(np.sum)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous example used the numpy `np.sum()` function. You can also provide multiple functions at once:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2co32col4
0AA11.10
1BA22.30
2CA312.00
3CB416.00
4BB222.21
5BB59.00
6AB30.50
\n", "
" ], "text/plain": [ " col1 col2 co32 col4\n", "0 A A 1 1.10\n", "1 B A 2 2.30\n", "2 C A 3 12.00\n", "3 C B 4 16.00\n", "4 B B 2 22.21\n", "5 B B 5 9.00\n", "6 A B 3 0.50" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2
summeanstd
col1
A42.01.414214
B93.01.732051
C73.50.707107
\n", "
" ], "text/plain": [ " col2 \n", " sum mean std\n", "col1 \n", "A 4 2.0 1.414214\n", "B 9 3.0 1.732051\n", "C 7 3.5 0.707107" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df.groupby('col1').agg([np.sum, np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which as you can see actually computes mutiple statistics on each group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{seealso}\n", "You can read more about aggregation in the Pandos docs on [Aggregation](https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregation methods take in a grouped dataframe and compute a single number of statistics for each group. In contrast transform applies a function to a column and returns a column of the same size usually. This allows you to change the data within each sub group. A common version of this for behavioral science is to z-score data within a group to detect outliers, etc... Since that is the most common use case we'll focus on that here." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3col4
0-0.7071070.707107
1-0.577350-0.875545
2-0.707107-0.707107
30.7071070.707107
4-0.5773501.089743
51.154701-0.214198
60.707107-0.707107
\n", "
" ], "text/plain": [ " col3 col4\n", "0 -0.707107 0.707107\n", "1 -0.577350 -0.875545\n", "2 -0.707107 -0.707107\n", "3 0.707107 0.707107\n", "4 -0.577350 1.089743\n", "5 1.154701 -0.214198\n", "6 0.707107 -0.707107" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.groupby('col1').transform(lambda x: (x-x.mean())/x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transform function takes as argument a `lambda` function which is a way to write short one-line functions in python. This function takes 1 argument (`x`) which is the dataframe coresponding to each column of each group. It then computes the z-score of this data within the group and column and the result is combined into a new dataframe. This new data frame has the same column names as the original but we might want to rename them:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col3_transcol4_trans
0-0.7071070.707107
1-0.577350-0.875545
2-0.707107-0.707107
30.7071070.707107
4-0.5773501.089743
51.154701-0.214198
60.707107-0.707107
\n", "
" ], "text/plain": [ " col3_trans col4_trans\n", "0 -0.707107 0.707107\n", "1 -0.577350 -0.875545\n", "2 -0.707107 -0.707107\n", "3 0.707107 0.707107\n", "4 -0.577350 1.089743\n", "5 1.154701 -0.214198\n", "6 0.707107 -0.707107" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed_df=sac_df_2.groupby('col1').transform(lambda x: (x-x.mean())/x.std())\n", "transformed_df=transformed_df.rename(columns={'col3':'col3_trans', 'col4': 'col4_trans'})\n", "transformed_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add these transformed columns back into our original data frame as new column we can use the `.join()` function which is like `.concat()` which we leared about earlier but combines the columns from a dataframe rather than appending the rows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col3_transcol4_trans
0AA11.10-0.7071070.707107
1BA22.30-0.577350-0.875545
2CA312.00-0.707107-0.707107
3CB416.000.7071070.707107
4BB222.21-0.5773501.089743
5BB59.001.154701-0.214198
6AB30.500.707107-0.707107
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col3_trans col4_trans\n", "0 A A 1 1.10 -0.707107 0.707107\n", "1 B A 2 2.30 -0.577350 -0.875545\n", "2 C A 3 12.00 -0.707107 -0.707107\n", "3 C B 4 16.00 0.707107 0.707107\n", "4 B B 2 22.21 -0.577350 1.089743\n", "5 B B 5 9.00 1.154701 -0.214198\n", "6 A B 3 0.50 0.707107 -0.707107" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sac_df_2.join(transformed_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The end result is that we z-scored `col3` and `col4` within the groups implies by `col1`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{seealso}\n", "You can read more about aggregation in the Pandos docs on [Transformation](https://pandas.pydata.org/docs/user_guide/groupby.html#transformation).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filtration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the filtration *apply* step you provide a function that returns true or false for each group. Any group that the function returns `True` for is kept for the new dataframe and any group that returns `False` is dropped. \n", "For example this step removes rows from the dataframe that belong to groups with only one or two members:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
00a
11a
22b
33b
44b
55b
66c
77c
\n", "
" ], "text/plain": [ " A B\n", "0 0 a\n", "1 1 a\n", "2 2 b\n", "3 3 b\n", "4 4 b\n", "5 5 b\n", "6 6 c\n", "7 7 c" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})\n", "dff" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
22b
33b
44b
55b
\n", "
" ], "text/plain": [ " A B\n", "2 2 b\n", "3 3 b\n", "4 4 b\n", "5 5 b" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff.groupby('B').filter(lambda x: len(x) > 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice 'a' and 'c' entries have been removed because those groups have only two members each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{seealso}\n", "You can read more about aggregation in the Pandos docs on [Filtration](https://pandas.pydata.org/docs/user_guide/groupby.html#filtration).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Reading and Resources\n", "\n", "- 10 minute [intro to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)\n", "- A longer standalone video [on how to use pandas](https://www.youtube.com/watch?v=vmEHCJofslg)\n", "- A visual guide to [split-apply-combine](https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e) strategies in Pandas by Towards data science\n", "- [Split-Apply-combine strategy for data mining](https://medium.com/analytics-vidhya/split-apply-combine-strategy-for-data-mining-4fd6e2a0cc99) includes some additional facts and features of this approach" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "```{bibliography} ../../references.bib\n", ":filter: docname in docnames\n", "```" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of 
Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }