Homework 2 - Python for data analysis

By Todd Gureckis and Brenden Lake. Code shared under the CC BY-SA 4.0 license. Credit given where inspiration was obtained from others!

The goal of Homework 2 is to make sure you understand enough python for the types of data analysis that we will be doing the rest of the semester. At this point you should be pretty comfortable with the Jupyter notebook environment. If you need a refresher on python there are a number of excellent resources you can explore on your own:

  • Google for Education’s Beginner Python course. Jump to section “Python Intro”

  • Corey Schafer’s Python Programming Beginner Tutorials on YouTube. These are great if you prefer to watch videos. You can skip the first video on installing python because you are using Jupyter and there is nothing to install!

In the homework I will give you some data, give examples of how to use a key programming construct in the analysis of this data, and then give you a chance to try to perform an analysis using what you have learned.


Importing libraries

As you have seen a few times now, before we start we often need to import some additional python libraries. This is one of the key features of python because there are really amazing libraries that will let you do almost anything! Also when you get a bit more advanced you can write your own libraries.

There are several common ways of importing. Let’s say we want to import a package foo that defines a function widget:

  • import foo will import the foo package; any reference to modules/classes/functions will need to be prefixed with foo.; e.g. foo.widget()

  • import foo as bar will import the foo package with the alias bar; any reference to modules/classes/functions will need to be prefixed with bar.; e.g. bar.widget(). This is the preferred method, especially when the name of the package is long.

  • from foo import widget can be used to import a specific module/class/function from foo and it will be available as widget()

  • from foo import * will import every item in foo into the current namespace; this is bad practice, don’t do it.
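To make the import styles above concrete, here is a small sketch that uses the standard-library statistics module in place of the made-up foo package. The choice of statistics and its mean function is only an assumption for illustration; any installed package works the same way.

```python
# `statistics` is a standard-library module standing in for the hypothetical `foo`.
import statistics
import statistics as st
from statistics import mean

values = [2, 4, 6]

a = statistics.mean(values)  # prefixed with the full package name
b = st.mean(values)          # prefixed with the alias
c = mean(values)             # no prefix needed

print(a, b, c)
```

All three calls run exactly the same function; the only difference is the name you type to reach it.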

As you have seen from past notebooks we often want to import pandas, numpy, and seaborn:

%matplotlib inline 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# put your code here

There are many great libraries for python. How do you learn about them? Check out pypi.org which lets you search for different libraries. Of course not all the libraries are installed on our class JupyterHub but all of the standard ones are. One example of a standard library is the math library. This library implements a bunch of simple math functions (full list). For example, math.sqrt(x) will take the square root of x. Try to execute the following cell:

math.sqrt(2)
# put your code here

Getting Data, Dealing with Lists and Dictionaries

As we explored in the labs so far, the most common way to read in data for analysis is from a CSV (comma-separated values) file. This is like a simple version of an Excel spreadsheet. Indeed you can export files from Excel as a CSV and also from Google Sheets.

df = pd.read_csv("http://myurl.edu/mydata.csv")
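Since the URL above is only a placeholder, here is a minimal sketch of what read_csv does, using a tiny made-up CSV held in a string instead of a real file. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# A tiny made-up CSV: the first line holds the column names, each later line is one row.
csv_text = """item,price,quantity
pasta,12.5,2
pizza,15,1
salad,9,3
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
orders = pd.read_csv(io.StringIO(csv_text))

print(orders.shape)
print(orders.columns.tolist())
```

Each comma marks the boundary between two columns, which is where the format gets its name.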

However there are many ways to get data into Python. For example, another approach is to use what is known as a data API (or application programming interface). Very simply, APIs are ways of getting access to data using code. Many websites including Twitter, Yelp, Strava, etc. provide APIs that let you access and process at least parts of the data on their websites. Often companies release python libraries to make this easier.

In the rest of the homework we are going to explore one example of this using the Petfinder.com website. Petfinder is a website that helps people locate pets for adoption. It maintains a searchable database of pets from many different shelters. There is also a cool python library called petpy which helps you interact with the Petfinder website.

To get started let’s import the petpy library. Here I am not using the import X as Y syntax because petpy is pretty easy to type. I am also importing a few other libraries for displaying images in the Jupyter notebook and downloading files over the internet.

!pip install petpy
import petpy
from IPython.display import Image
import requests

To access Petfinder you have to sign up on their website. I have already done this for you so you can use my credentials for class:

MYKEY = 'nygvq6mFXsSNN6g3BLHOE702FNfEmyQIUsHc7brGQTo1bg5wwV'
MYSECRET = 'aEYDUYD9JnCksb2uoIUtkGu9pviy4tJdruZ3v4jx'
pf = petpy.Petfinder(key=MYKEY, secret=MYSECRET)

The variable pf now contains an API interface to the Petfinder website.

animal_types = pf.animal_types()

The new variable we created must contain something but honestly I have no idea what it is at first. So I will put it in a cell by itself so I can see what it contains.

animal_types

Ah cool, OK so animal_types looks like a python dictionary, which you learned about in a prior lab session. How do I know this? Well I notice that the entire output of the cell above is wrapped in { } which is how you denote a dictionary in python. For example:

my_dictionary = {"pasta": "a type of italian food", "pizza": "another type of italian food"}

In the cell above, I just created a new variable called my_dictionary which has two keys (‘pasta’ and ‘pizza’) and two corresponding values/descriptions. I can look up the description of each key like this:

print(my_dictionary['pasta'])
print(my_dictionary['pizza'])
print(my_dictionary['salad'])

Oops, that was expected. ‘salad’ wasn’t defined in our original dictionary so it can’t be looked up (we got an exception called ‘KeyError’ because the key ‘salad’ could not be found in the dictionary).

The idea with dictionaries is that you can store stuff and easily look it up by name.
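If you are not sure whether a key exists, you can avoid the KeyError. The sketch below shows two standard options, the in operator and the dictionary's .get() method; the fallback text "not on the menu" is just a made-up default.

```python
# A small dictionary like the one above.
my_dictionary = {
    "pasta": "a type of italian food",
    "pizza": "another type of italian food",
}

# Check for a key before looking it up.
has_salad = "salad" in my_dictionary

# .get() returns a fallback value instead of raising KeyError.
salad = my_dictionary.get("salad", "not on the menu")
pasta = my_dictionary.get("pasta", "not on the menu")

print(has_salad)
print(salad)
print(pasta)
```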

Getting back to these pets! What are the keys of this dictionary?

animal_types.keys()

Ok so there is one key, types. This must just mean that the API wraps everything up so let’s look inside that key.

animal_types['types']

Ok this is weird but this contains a python list. How do I know that? Because it is wrapped in [ ]. How many things are in the list?

len(animal_types['types'])

Ok, what is in the first slot or position of the list?

animal_types['types'][0]

Ah! Another dictionary (indicated by the { }). But now I can kind of understand what is going on. What are the keys of this dictionary?

animal_types['types'][0].keys()

Ok, so the keys are things that are describing a type of pet! We have a field called name and one called coats which sounds fascinating, and colors and genders. Cool!
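The pattern we just walked through (a dictionary that holds a list that holds more dictionaries) is very common in data from APIs. Here is a small sketch of the same steps on made-up data; the menu structure and its keys are hypothetical and have nothing to do with Petfinder.

```python
# Hypothetical data with the same shape as an API response:
# a dictionary holding a list of dictionaries.
menu = {
    "dishes": [
        {"name": "pasta", "toppings": ["pesto", "marinara"]},
        {"name": "pizza", "toppings": ["mushroom", "olive", "basil"]},
    ]
}

dishes = menu["dishes"]          # step 1: the list stored under the key
second = dishes[1]               # step 2: one entry of the list (a dictionary)
toppings = second["toppings"]    # step 3: a value inside that dictionary

# The same lookup written as one chained expression.
same = menu["dishes"][1]["toppings"]

print(type(dishes).__name__, type(second).__name__, len(toppings))
```

Reading a chained expression from left to right, each pair of brackets takes you one level deeper into the structure.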

So what is the first type of animal in this list?

animal_types['types'][0]['name']

Ok, so it is good-boy pups. What are the possible colors?

animal_types['types'][0]['colors']

Really cool stuff. I think a Merle color dog might really suit me! Although I don’t know what that is actually. Let me consult Wikipedia really quick!

import wikipedia
wikipedia.summary("merle long coat")

Isn’t python so cool? You can grab Wikipedia entries with one line just by typing import wikipedia.

Ok, so let’s challenge our understanding of working with lists and dictionaries. What are the possible colors for cats according to the petfinder website?

# Try to access the different colors of cats here

Did you get stuck? Give it your best shot but if you really get frustrated you can run the cell below to reveal my solution.

# Uncomment and run next line for a sample solution
# %load hw2-hint1.py
# put your code here

For loops and other ways to repeat things

So far we have kind of been playing with dictionaries and lists by accessing individual elements. However sometimes we want to know more general stuff, like what are the names of all the types of animals? To do this we kind of want to use the methods we developed above to look inside this data structure, but repeat them for each element within the structure. This brings us to the dreaded for loop. Many students are afraid of for loops because they can get a bit complex. However, they are also a super critical tool for working with data effectively and open up so many more interesting types of analyses to you.

The most common use for a for loop in data analysis is to repeat a step for each record of some data. For instance you might want to print out the value of each item in a list:

my_list = ['a','b','c','d']
for item in my_list:
    print(item)

You can use the slicing features of lists to reverse the printout:

my_list = ['a','b','c','d']
for item in my_list[::-1]:
    print(item)
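The [::-1] in the loop above is slice notation. As a quick sketch of how it works, a slice is written list[start:stop:step], and leaving out start and stop means "the whole list":

```python
my_list = ['a', 'b', 'c', 'd']

backwards = my_list[::-1]      # step of -1 walks from the end to the beginning
every_other = my_list[::2]     # step of 2 takes every second item
first_two = my_list[:2]        # stop at index 2 (not included)

print(backwards)
print(every_other)
print(first_two)
print(my_list)  # slicing makes a new list, so the original is unchanged
```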

or each item in a dictionary:

weapon_strengths = {
    'tomahawk': 20,
    'katana': 60,
    'hand of god': 100,
}

for weapon in weapon_strengths.keys():
    print(f"I have a {weapon}!")

print("\n---\n")

for weapon, strength in weapon_strengths.items():
    print(f"My {weapon} has a strength of {strength}!")

This second example shows how you can repeat the printing command for each key in the dictionary (weapon_strengths.keys()) or for each key and value pair (weapon_strengths.items()).

Let’s try it out for the animals! What are the types of animals in the Petfinder system?

for animal in animal_types['types']:
    print(animal['name'])

Here I am repeating the same code for each entry in the animal_types['types'] list. Each entry in that list is first assigned to the variable animal and then I look inside that dictionary and access the name key.
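Often you want to keep the values instead of just printing them. One common pattern, sketched below on made-up records (the names and prices are hypothetical), is to start with an empty list and append to it inside the loop.

```python
# Hypothetical records shaped like the entries of a list of dictionaries.
records = [
    {"name": "pasta", "price": 12},
    {"name": "pizza", "price": 15},
    {"name": "salad", "price": 9},
]

# Start with an empty list and append inside the loop.
names = []
for record in records:
    names.append(record["name"])

# A list comprehension does the same thing in one line.
names_again = [record["name"] for record in records]

print(names)
```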

# put your code here

So far we have been exploring repeating loops for each entry in a dataset that we obtained via an API. However, sometimes we just want to repeat things a few times in order to do other stuff. For instance, as you will see you can use python and for loops to draw simple pictures!

In order to use for loops in this way we have to create data. One of the easiest ways to create data is the range() command, which creates a sequence of numbers (wrapping it in list() lets us see them all at once).

list(range(5))
list(range(10))
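range() can also take a starting value and a step size. A short sketch of the three forms:

```python
# range(stop) counts from 0 up to, but not including, stop.
counts = list(range(5))

# range(start, stop) begins at start instead of 0.
from_two = list(range(2, 6))

# range(start, stop, step) moves in jumps of step; a negative step counts down.
by_threes = list(range(0, 10, 3))
countdown = list(range(5, 0, -1))

print(counts)
print(from_two)
print(by_threes)
print(countdown)
```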

So if we wanted to repeat the same thing 5 times we could do this:

for x in range(5):
    print("hi")

You can also place a for loop inside another for loop if you indent it appropriately. This lets you repeat things in more complex ways. For instance you can print out a square like this:

for i in range(10):
    for j in range(10):
        print("*", end = ' ')
    print("")
# your code here

for loops are really interesting when combined with if statements. This means that while you are repeating the loop you can skip things or do something different.

for i in range(10):
    for j in range(10):
        if i < 2 or i > 7:
            print("*", end = ' ')
        else:
            print(" ", end = ' ')
    print("")

For example the code above prints out a star for each square only if the row (i) is less than 2 or greater than 7 (otherwise it prints an empty space instead due to the else). Think about this one for a bit:

for i in range(10):
    for j in range(10):
        if (i < 3 or i > 6) or (j < 3 or j > 6):
            print("*", end = ' ')
        else:
            print(" ", end = ' ')
    print("")
# your code here           
# your code here

Enough of this programming! As a reward for completing these tough new problems, we’ll look at some cute pets using for loops. The Petfinder API has a function which will return the set of pets currently available for adoption. This takes a number of options but by default will return 20 pet listings.

animals = pf.animals(return_df=True)

This returns a pandas dataframe that contains lots of information about the pet including if it is declawed, house trained, etc…

animals.columns
animals.head()
from skimage import io

def show_animal(animals, number, ax):
    record=animals.iloc[number]
    if record.photos:
        if record.photos[0]:
            if 'medium' in record.photos[0]:
                photo = record.photos[0]['medium']
                img=io.imread(photo)
                ax[number].set_axis_off()
                ax[number].imshow(img)
                

fig, axes = plt.subplots(4, 5, figsize=(12, 12))
ax = axes.ravel()

for i in range(20):  # check out my for loop!
    show_animal(animals,i, ax)

plt.show()

The previous cell requires a bit of matplotlib magic to create a grid of images but the key is that there is a for loop that steps through and displays each image. Try to understand this code and compare it to the loops described above.

Random numbers

In many aspects of analysis you need to generate random numbers. You need this for a few reasons

  1. to create data from a distribution to check your intuitions about randomness (simulation)

  2. to create cognitive models that have “noise” in them so that their behavior is not deterministic

  3. to test your code by providing “fake” data that looks realistic

One of the useful libraries for this is numpy’s random submodule. The full list of available functions in this library is here and reading through can give you a good sense of what is possible.

import numpy.random as npr

randn() gives you a random number from a standard normal distribution (one with mean equal to 0 and standard deviation equal to 1.0):

npr.randn()

You can get more than one like this:

npr.randn(10)

Let’s get even more (1000) and plot them:

sns.displot(npr.randn(1000))

If you want the numbers to be uniform you can use rand() which draws random numbers from 0 to 1 (but not including exactly 1.0).

npr.rand()
npr.rand(1000).max()

As you can see, the maximum value of even 1000 random rand() numbers is less than 1.0, while for randn() it can be much higher:

npr.randn(1000).max()

randint() draws random integers from the given range, including the lower bound but excluding the upper bound:

npr.randint(0,10,10)
npr.randint(0,100,10)
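A quick way to convince yourself about the range is to draw a lot of integers and look at the smallest and largest. The sketch below also fixes the random seed so the draws are repeatable; the seed value 0 and the 1000 draws are arbitrary choices.

```python
import numpy as np
import numpy.random as npr

npr.seed(0)  # fixing the seed (0 is an arbitrary choice) makes the draws repeatable

draws = npr.randint(0, 10, 1000)  # 1000 integers with low=0 and high=10

print(draws.min(), draws.max())   # the largest possible value is high - 1
print(np.unique(draws))           # the distinct values that were drawn

# Resetting the seed reproduces exactly the same draws.
npr.seed(0)
again = npr.randint(0, 10, 1000)
```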

You can also choose a random element of an array or list so that only certain values can be chosen:

npr.choice(np.array([3,6,1,4]))

Notice that even in 100 draws you never see 0 or 9 because they are not in the given array.

npr.choice(np.array([3,6,1,4]),100)

You can also use random to randomly “shuffle” a list (meaning changing the order of things like you would a deck of playing cards).

cards = np.arange(10)
print(cards)
npr.shuffle(cards)
print(cards)

The second list is the same as the first just randomly shuffled.

There are lots of distributions you can draw random numbers from:

Chi-square

sns.displot(npr.chisquare(2,10000))

lognormal

sns.displot(npr.lognormal(0,1,10000))

F (you’ll see this again when we get to ANOVA!)

sns.displot(npr.f(2,16,10000))

The cumsum() command computes the cumulative sum of a list, meaning each entry of the result is the sum of the corresponding entry in the list and all the entries that come before it.

np.array([1,2,3,4]).cumsum()

So the result is array([1, 2+1, 3+2+1, 4+3+2+1]). Now we can start adding up random numbers. For example,
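If it helps, here is a sketch of the same calculation written out by hand with a for loop, next to the cumsum() version. The variable names are just for illustration.

```python
import numpy as np

values = [1, 2, 3, 4]

# Build the cumulative sum by hand with a for loop.
running_total = 0
by_hand = []
for value in values:
    running_total = running_total + value
    by_hand.append(running_total)

# NumPy's cumsum() produces the same numbers in one call.
with_numpy = np.array(values).cumsum()

print(by_hand)
print(with_numpy)
```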

npr.randn(20).cumsum()

This is sometimes called a “random walk” because you are taking a small step either up or down each time you add a number (the value from npr.randn()), and where you end up depends on the current sum. You can get a sense of this by plotting the sum as it evolves:

plt.plot(npr.randn(200).cumsum())

Try running the above cell a few times to get a sense of the range of random walks that are possible.

You can also plot a few random walks on the same plot together:

for i in range(5):
    plt.plot(npr.randn(200).cumsum(),alpha=0.2)

Let’s change the code to plot even more of these… maybe like 500!

for i in range(500):
    plt.plot(npr.randn(200).cumsum(),alpha=0.2)

That’s interesting… the sums start near zero (makes sense) and then the paths spread out as time goes on. Not all paths move far of course, but some randomly add up a bunch of positive or negative numbers and tend to drift further and further from the starting point.

Now instead of computing the sum, let’s compute the average at each step. So for each step we need to divide by the number of things we have added together so far.

npr.randn(50).cumsum()/(np.arange(1,50+1))

Notice we needed to use np.arange(1,50+1) because we want to start with the value 1 for the first number (np.arange(50) would count from 0 to 49 and we would then be dividing by zero on the first sum).
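To see the running average at work on numbers you can check in your head, here is a small sketch with four fixed values instead of random ones (the values are made up).

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])

counts = np.arange(1, len(values) + 1)   # 1, 2, 3, 4: how many numbers have been added so far
running_mean = values.cumsum() / counts  # cumulative sum divided by the count at each step

print(counts)
print(running_mean)
```

The first entry is just the first value, the second is the average of the first two, and so on.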

Let’s plot a bunch of these random averages:

for i in range(50):
    plt.plot(npr.randn(50).cumsum()/(np.arange(50)+1), alpha=0.2)

That’s interesting, you can see that early on the average is quite variable but as you add more and more numbers the averages seem to converge around a particular value. As it turns out you are looking at the sampling distribution of the mean for different values of N. If you took a slice out of the plot at say 10 sums and made a histogram of the averages you would get a histogram of the sampling distribution of the mean.

sns.displot(npr.randn(10,50).sum(axis=0)/10)

Interestingly the “envelope” of the sampling distribution of the mean seems to be shrinking according to a particular mathematical relationship which is \(1/\sqrt N\) where \(N\) is the number of numbers you have added to the sum.

In fact we can plot 1/sqrt(N) on the same plot to see the relationship:

for i in range(50):
    plt.plot(npr.randn(50).cumsum()/(np.arange(1,50+1)), alpha=0.2)
plt.plot(1.0/np.sqrt(np.arange(1,50+1)),'k-')

Here, the thick black line is \(1/\sqrt N\), which is how the standard deviation of the sampling distribution shrinks as the sample size increases! (Remember, the formula for the standard error of the mean is the standard deviation of the sample divided by the square root of N.) Now you can see why!
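You can also check this relationship with numbers instead of a plot. The sketch below simulates many sample means for a few sample sizes and compares their spread to \(1/\sqrt N\). It uses NumPy's newer default_rng generator rather than the npr functions used elsewhere in this notebook, and the seed, the sample sizes, and the number of repeats are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed 0 is an arbitrary choice

n_repeats = 20000  # how many sample means to simulate for each N (an assumed value)
results = {}
for n in [1, 4, 16, 64]:
    samples = rng.standard_normal((n_repeats, n))  # each row is one sample of size n
    means = samples.mean(axis=1)                   # one mean per sample
    results[n] = means.std()                       # spread of the sampling distribution

for n, spread in results.items():
    print(f"N={n:2d}  simulated={spread:.3f}  1/sqrt(N)={1 / np.sqrt(n):.3f}")
```

The two columns should agree closely, because the numbers being averaged come from a distribution with standard deviation 1.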

# your code here

Pandas and Data Frames

A DataFrame is like a dictionary where the keys are column names and the values are Series that share the same index and hold the column values. The first “column” is actually the shared Series index (there are some exceptions to this where the index can be multi-level and span more than one column but in most cases it is flat).

names = pd.Series(['Alice', 'Bob', 'Carol'])
phones = pd.Series(['555-123-4567', '555-987-6543', '555-245-6789'])
dept = pd.Series(['Marketing', 'Accounts', 'HR'])

staff = pd.DataFrame({'Name': names, 'Phone': phones, 'Department': dept})  # 'Name', 'Phone', 'Department' are the column names
staff

Note above that the first column with values 0, 1, 2 is actually the shared index, and there are three series keyed off the three names “Department”, “Name” and “Phone”.

Like Series, DataFrame has an index for rows:

staff.index

DataFrame also has an index for columns:

staff.columns
staff.values

The index operator actually selects a column in the DataFrame, while the .iloc and .loc attributes still select rows (actually, we will see in the next section that they can select a subset of the DataFrame with a row selector and column selector, but the row selector comes first so if you supply a single argument to .loc or .iloc you will select rows):

staff['Name']  # Acts similar to dictionary; returns a column's Series
staff.loc[2]
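The difference between .loc and .iloc is hard to see when the index is just 0, 1, 2. Here is a sketch on a small made-up table (the names, scores, and index labels are hypothetical) where the index labels do not match the row positions.

```python
import pandas as pd

# A small hypothetical table whose index labels are not 0, 1, 2.
scores = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "score": [90, 85, 77]},
    index=[10, 20, 30],
)

by_label = scores.loc[20]       # the row whose index label is 20
by_position = scores.iloc[1]    # the row in position 1 (the second row)
column = scores["score"]        # the index operator picks a column

print(by_label["name"], by_position["name"])
print(column.tolist())
```

Here .loc[20] and .iloc[1] pick out the same row, one by its label and the other by its position.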

You can get a transpose of the DataFrame with the .T attribute:

staff.T

You can also access columns like this, with dot-notation. Occasionally this breaks if the column name conflicts with an existing DataFrame method or attribute, like ‘count’:

staff.Name

You can add new columns. Later we’ll see how to do this as a function of existing columns:

staff['Fulltime'] = True
staff.head()

Use .describe() to get summary statistics:

staff.describe()

Use .quantile() to get quantiles:

df = pd.DataFrame([2, 3, 1, 4, 3, 5, 2, 6, 3])
df.quantile(q=[0.25, 0.75])

Use .drop() to remove rows. This will return a copy with the modifications and leave the original untouched unless you include the argument inplace=True.

staff.drop([1])
# Note that because we didn't say inplace=True,
# the original is unchanged
staff

There are many ways to construct a DataFrame. For example, from a Series or dictionary of Series, from a list of Python dicts, or from a 2-D NumPy array. There are also utility functions to read data from disk into a DataFrame, e.g. from a .csv file or an Excel spreadsheet. We’ll cover some of these later.

Many DataFrame operations take an axis argument which defaults to zero. This specifies whether we want to apply the operation by rows (axis=0) or by columns (axis=1).

You can drop columns if you specify axis=1:

staff.drop(["Fulltime"], axis=1)
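The axis argument takes some getting used to. As a sketch on a small made-up numeric table (the column names a and b are hypothetical), here is what axis=0 and axis=1 do for sum() and drop().

```python
import pandas as pd

# Hypothetical numeric table for illustration.
table = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

column_totals = table.sum(axis=0)   # collapses the rows: one total per column
row_totals = table.sum(axis=1)      # collapses the columns: one total per row

without_row = table.drop([0], axis=0)        # removes the row labeled 0
without_column = table.drop(["b"], axis=1)   # removes the column named "b"

print(column_totals.to_dict())
print(row_totals.tolist())
print(without_row.shape, without_column.columns.tolist())
```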

Another way to remove a column in-place is to use del:

del staff["Department"]
staff

You can change the index to be some other column. If you want to save the existing index, then first add it as a new column:

staff['Number'] = staff.index
staff
# Now we can set the new index. This is a destructive
# operation that discards the old index, which is
# why we saved it as a new column first.
staff = staff.set_index('Name')
staff

Alternatively you can promote the index to a column and go back to a numeric index with reset_index():

staff = staff.reset_index()
staff
example_data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
           'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
           'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
           'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
# Put your code to create the DataFrame here
# Generate a summary of the data
# Calculate the sum of all the visits (the total number of visits)

Turning in homeworks

When you are finished with this notebook, save your work in order to turn it in. To do this select File->Download As…->PDF.

Homeworks will be submitted according to the instructions provided in class.