{ "cells": [ { "cell_type": "markdown", "id": "6073eb74", "metadata": {}, "source": [ "# Answers -In Class Activity - More Pandas and data processing" ] }, { "cell_type": "markdown", "id": "1f1a34fa", "metadata": {}, "source": [ "First, let's import our packages." ] }, { "cell_type": "code", "execution_count": 1, "id": "27fec022", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import numpy.random as npr\n", "import math" ] }, { "cell_type": "markdown", "id": "6d5c7ba4", "metadata": {}, "source": [ "Second, we want to read in a csv file in the dataframe `df`. The file has a list of various faculty members and their phone numbers (don't worry, the phone numbers are randomly generated. So best not try calling them). We also want to tell `df` that the phone numbers are actually strings rather than traditional numbers, which we do with the `astype` method." ] }, { "cell_type": "code", "execution_count": 2, "id": "193b2fdb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Last nameFirst namePhone
0AdolphKaren9285162643
1AmodioDavid3803269359
2BalcetisEmily2204342277
3Van BavelJay3252450008
4CarrascoMarisa2092652228
5CimpianAndrei6966417505
6CraigMaureen7022020569
7CurtisClay7768322051
8DillonMoira6209318809
9FreemanJon4746939756
10GollwitzerPeter3195372989
\n", "
" ], "text/plain": [ " Last name First name Phone\n", "0 Adolph Karen 9285162643\n", "1 Amodio David 3803269359\n", "2 Balcetis Emily 2204342277\n", "3 Van Bavel Jay 3252450008\n", "4 Carrasco Marisa 2092652228\n", "5 Cimpian Andrei 6966417505\n", "6 Craig Maureen 7022020569\n", "7 Curtis Clay 7768322051\n", "8 Dillon Moira 6209318809\n", "9 Freeman Jon 4746939756\n", "10 Gollwitzer Peter 3195372989" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('faculty.csv')\n", "df['Phone'] = df['Phone'].astype(str)\n", "df" ] }, { "cell_type": "markdown", "id": "bd755d44", "metadata": {}, "source": [ "### Problem 0: Changing the format of a phone number" ] }, { "cell_type": "markdown", "id": "67e11322", "metadata": {}, "source": [ "For this problem, we want to convert the phone numbers to a more readable format [ 9285162643 $\\rightarrow$ (928)516-2643 ]. You may have experience doing this sort of thing by hand with Excel, which can be very cumbersome and error-prone. Let's see how to do this with pandas instead.\n", "\n", "Please write code to make the transformation to (ABC)DEF-HIJK format for each phone number in `df`.\n", "\n", "_Hint_: Write a function `convert_phone` that converts the format of a single phone number (also, remind yourself about Python list slicing). Then, you can apply that function using the `transform` operation described in book section 6.12.1.2." ] }, { "cell_type": "code", "execution_count": null, "id": "9a0496aa", "metadata": {}, "outputs": [], "source": [ "# Your answer goes here\n", "\n", "def convert_phone(x):\n", " assert(len(x)==10)\n", " return '(' + x[:3] + ')' + x[3:6] + '-' + x[6:]\n", "\n", "df['Phone'] = df['Phone'].transform(convert_phone)" ] }, { "cell_type": "markdown", "id": "cec6b441", "metadata": {}, "source": [ "### Problem 1: Making a new column" ] }, { "cell_type": "markdown", "id": "2dcb7d61", "metadata": {}, "source": [ "Using the same dataframe `df`, make a new column that lists the complete name of each professor. For instance, the new column should be called 'Complete name' and the first entry should be the string 'Karen Adolph'.\n", "\n", "_Hint_: You could make a new function and use the same logic as above. Alternatively, you can also try directly summing the relevant columns." ] }, { "cell_type": "code", "execution_count": 4, "id": "d9b3d082", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Last nameFirst namePhoneComplete name
0AdolphKaren9285162643Karen Adolph
1AmodioDavid3803269359David Amodio
2BalcetisEmily2204342277Emily Balcetis
3Van BavelJay3252450008Jay Van Bavel
4CarrascoMarisa2092652228Marisa Carrasco
5CimpianAndrei6966417505Andrei Cimpian
6CraigMaureen7022020569Maureen Craig
7CurtisClay7768322051Clay Curtis
8DillonMoira6209318809Moira Dillon
9FreemanJon4746939756Jon Freeman
10GollwitzerPeter3195372989Peter Gollwitzer
\n", "
" ], "text/plain": [ " Last name First name Phone Complete name\n", "0 Adolph Karen 9285162643 Karen Adolph\n", "1 Amodio David 3803269359 David Amodio\n", "2 Balcetis Emily 2204342277 Emily Balcetis\n", "3 Van Bavel Jay 3252450008 Jay Van Bavel\n", "4 Carrasco Marisa 2092652228 Marisa Carrasco\n", "5 Cimpian Andrei 6966417505 Andrei Cimpian\n", "6 Craig Maureen 7022020569 Maureen Craig\n", "7 Curtis Clay 7768322051 Clay Curtis\n", "8 Dillon Moira 6209318809 Moira Dillon\n", "9 Freeman Jon 4746939756 Jon Freeman\n", "10 Gollwitzer Peter 3195372989 Peter Gollwitzer" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Your answer here\n", "df['Complete name'] = df['First name'] + ' ' + df['Last name']\n", "df" ] }, { "cell_type": "markdown", "id": "a0c2970a", "metadata": {}, "source": [ "### Problem 2: Computing the area and creating a new column" ] }, { "cell_type": "markdown", "id": "5638e223", "metadata": {}, "source": [ "Let's create the rectangle and circle dataframe from last week and call it `df_shapes`." ] }, { "cell_type": "code", "execution_count": 5, "id": "c7e592c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typewidthheight
0rectangle0.1445975.902562
1circle0.944694NaN
2rectangle9.3504863.828103
3rectangle7.2513380.301459
4circle1.120220NaN
5rectangle8.7689208.915431
6circle0.582969NaN
7rectangle2.2392930.090153
8circle7.015915NaN
9circle6.086628NaN
\n", "
" ], "text/plain": [ " type width height\n", "0 rectangle 0.144597 5.902562\n", "1 circle 0.944694 NaN\n", "2 rectangle 9.350486 3.828103\n", "3 rectangle 7.251338 0.301459\n", "4 circle 1.120220 NaN\n", "5 rectangle 8.768920 8.915431\n", "6 circle 0.582969 NaN\n", "7 rectangle 2.239293 0.090153\n", "8 circle 7.015915 NaN\n", "9 circle 6.086628 NaN" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mytype = np.array(['rectangle','circle','rectangle','rectangle','circle','rectangle','circle','rectangle','circle','circle'])\n", "width = npr.rand(len(mytype))*10.\n", "height = npr.rand(len(mytype))*10.\n", "height[mytype=='circle']=np.nan \n", "df_shapes = pd.DataFrame({\"type\":mytype, \"width\":width, \"height\":height})\n", "df_shapes" ] }, { "cell_type": "markdown", "id": "7b90d30d", "metadata": {}, "source": [ "Next, you should compute the mean 'width' separately for the rectangles and circles.\n", "\n", "_Hint_: you can use `groupby` and `.mean()` from chapter 6.12." ] }, { "cell_type": "code", "execution_count": null, "id": "a7bd9e91", "metadata": {}, "outputs": [], "source": [ "# Your answer here\n", "df_shapes.groupby('type')['width'].mean()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 5 }