{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Answers - In Class Activity - Sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "This exercise was authored by Todd Gureckis and Brenden Lake, and is released under the [license](/LICENSE.html).\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the chapter, we discussed some basic issues in sampling. In this notebook, you will explore some handy Python methods for sampling and consider the implications of sampling for what you can conclude about some target group (i.e., what you can generalize to)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing and using existing functions" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import numpy.random as npr\n", "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 0: Seeding a random number generator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we use the computer to play with random numbers (or random samples), we aren't actually using random numbers. Generally speaking, your computer is a deterministic machine, so it is unable to make truly random numbers. Instead, the numbers your computer gives you are known as pseudo-random because they have many of the properties we want from random numbers but are not exactly and entirely random.\n", "\n", "Anytime we use random numbers in a script, simulation, or analysis, it is important to \"seed\" the random number generator. This initializes the random number generator to a particular \"state\", which makes the numbers in the script random but repeatable.\n", "\n", "Let's experiment with this. First, try running the following cell and seeing what the output is. Try running it multiple times and seeing how the numbers change." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5, 7, 6, 8, 4, 3, 6, 9, 9, 8])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "npr.randint(0,10,10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now run this cell:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[9 4 0 1 9 0 1 8 9 0]\n", "[8 6 4 3 0 4 6 8 1 8]\n" ] } ], "source": [ "npr.seed(10)\n", "print(npr.randint(0,10,10))\n", "print(npr.randint(0,10,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, try repeating the cell execution over and over. What do you observe?\n", "\n", "Try restarting the kernel and running the cell again. What do you notice? Compare your output with that of other people in your group. Also change the argument to `npr.seed()` and see what happens." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 0 here:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[5 6 8 6 1 6 4 8 1 8]\n", "[5 1 0 8 8 8 2 6 8 1]\n" ] } ], "source": [ "## Enter solution here\n", "npr.seed(9)\n", "print(npr.randint(0,10,10))\n", "print(npr.randint(0,10,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bottom line: Always seed the random number generator at the start of any script that uses random numbers so your code is more repeatable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 1: Sampling from a finite population\n", "\n", "Imagine I create an array of 100 randomly determined values, as below. Using the web, research the numpy random `choice()` function. Use it to generate a random sample of size 10 from this data. Do it twice, once with replacement and once without replacement." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "my_data = np.array([75, 25, 59, 63, 48, 29, 3, 17, 68, 39, 9, 62, 61, 52, 64, 45, 90,\n", " 87, 0, 42, 26, 52, 22, 25, 20, 22, 81, 25, 48, 79, 37, 6, 33, 30,\n", " 81, 5, 37, 85, 65, 0, 27, 40, 96, 67, 77, 29, 32, 25, 4, 53, 46,\n", " 7, 51, 65, 46, 91, 60, 52, 93, 26, 2, 42, 18, 19, 97, 45, 78, 33,\n", " 25, 30, 97, 96, 99, 32, 86, 43, 81, 83, 51, 81, 36, 29, 2, 33, 95,\n", " 39, 79, 1, 80, 17, 50, 38, 1, 98, 30, 89, 93, 27, 43, 30])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 1 here:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[60 30 62 85 42 39 36 60 53 19]\n", "[52 25 43 45 30 1 68 7 81 81]\n" ] } ], "source": [ "## Enter solution here\n", "print(npr.choice(my_data,size=10,replace=True))\n", "print(npr.choice(my_data,size=10,replace=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 2: Sampling from a data frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we want to sample from a pandas dataframe rather than a list or numpy array. Why might we want to sample from a dataset? One reason is to randomly select a subset of the data for a training vs. test split when doing machine learning projects on the data (we'll talk about this later). Another is that there may be too many records to analyze, so it makes sense to randomly select a subset and analyze only those. A third is to repeatedly resample from a dataset as part of a statistical method called \"bootstrapping\" (https://en.wikipedia.org/wiki/Bootstrapping_(statistics))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code loads an example pandas dataset of different penguins. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "penguins_df = sns.load_dataset('penguins')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td><td>Male</td></tr>\n",
"    <tr><th>1</th><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186.0</td><td>3800.0</td><td>Female</td></tr>\n",
"    <tr><th>2</th><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td><td>Female</td></tr>\n",
"    <tr><th>3</th><td>Adelie</td><td>Torgersen</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193.0</td><td>3450.0</td><td>Female</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "3 Adelie Torgersen NaN NaN NaN \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "\n", " body_mass_g sex \n", "0 3750.0 Male \n", "1 3800.0 Female \n", "2 3250.0 Female \n", "3 NaN NaN \n", "4 3450.0 Female " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins_df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 344 entries, 0 to 343\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 species 344 non-null object \n", " 1 island 344 non-null object \n", " 2 bill_length_mm 342 non-null float64\n", " 3 bill_depth_mm 342 non-null float64\n", " 4 flipper_length_mm 342 non-null float64\n", " 5 body_mass_g 342 non-null float64\n", " 6 sex 333 non-null object \n", "dtypes: float64(4), object(3)\n", "memory usage: 18.9+ KB\n" ] } ], "source": [ "penguins_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Research the pandas `sample()` method and randomly sample 20 penguins from the dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 2a here:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>329</th><td>Gentoo</td><td>Biscoe</td><td>48.1</td><td>15.1</td><td>209.0</td><td>5500.0</td><td>Male</td></tr>\n",
"    <tr><th>190</th><td>Chinstrap</td><td>Dream</td><td>46.9</td><td>16.6</td><td>192.0</td><td>2700.0</td><td>Female</td></tr>\n",
"    <tr><th>207</th><td>Chinstrap</td><td>Dream</td><td>52.2</td><td>18.8</td><td>197.0</td><td>3450.0</td><td>Male</td></tr>\n",
"    <tr><th>255</th><td>Gentoo</td><td>Biscoe</td><td>48.4</td><td>16.3</td><td>220.0</td><td>5400.0</td><td>Male</td></tr>\n",
"    <tr><th>268</th><td>Gentoo</td><td>Biscoe</td><td>44.9</td><td>13.3</td><td>213.0</td><td>5100.0</td><td>Female</td></tr>\n",
"    <tr><th>224</th><td>Gentoo</td><td>Biscoe</td><td>47.6</td><td>14.5</td><td>215.0</td><td>5400.0</td><td>Male</td></tr>\n",
"    <tr><th>138</th><td>Adelie</td><td>Dream</td><td>37.0</td><td>16.5</td><td>185.0</td><td>3400.0</td><td>Female</td></tr>\n",
"    <tr><th>2</th><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td><td>Female</td></tr>\n",
"    <tr><th>271</th><td>Gentoo</td><td>Biscoe</td><td>48.5</td><td>14.1</td><td>220.0</td><td>5300.0</td><td>Male</td></tr>\n",
"    <tr><th>342</th><td>Gentoo</td><td>Biscoe</td><td>45.2</td><td>14.8</td><td>212.0</td><td>5200.0</td><td>Female</td></tr>\n",
"    <tr><th>60</th><td>Adelie</td><td>Biscoe</td><td>35.7</td><td>16.9</td><td>185.0</td><td>3150.0</td><td>Female</td></tr>\n",
"    <tr><th>322</th><td>Gentoo</td><td>Biscoe</td><td>47.2</td><td>15.5</td><td>215.0</td><td>4975.0</td><td>Female</td></tr>\n",
"    <tr><th>185</th><td>Chinstrap</td><td>Dream</td><td>51.0</td><td>18.8</td><td>203.0</td><td>4100.0</td><td>Male</td></tr>\n",
"    <tr><th>122</th><td>Adelie</td><td>Torgersen</td><td>40.2</td><td>17.0</td><td>176.0</td><td>3450.0</td><td>Female</td></tr>\n",
"    <tr><th>279</th><td>Gentoo</td><td>Biscoe</td><td>50.4</td><td>15.3</td><td>224.0</td><td>5550.0</td><td>Male</td></tr>\n",
"    <tr><th>267</th><td>Gentoo</td><td>Biscoe</td><td>50.5</td><td>15.9</td><td>225.0</td><td>5400.0</td><td>Male</td></tr>\n",
"    <tr><th>334</th><td>Gentoo</td><td>Biscoe</td><td>46.2</td><td>14.1</td><td>217.0</td><td>4375.0</td><td>Female</td></tr>\n",
"    <tr><th>57</th><td>Adelie</td><td>Biscoe</td><td>40.6</td><td>18.8</td><td>193.0</td><td>3800.0</td><td>Male</td></tr>\n",
"    <tr><th>106</th><td>Adelie</td><td>Biscoe</td><td>38.6</td><td>17.2</td><td>199.0</td><td>3750.0</td><td>Female</td></tr>\n",
"    <tr><th>311</th><td>Gentoo</td><td>Biscoe</td><td>52.2</td><td>17.1</td><td>228.0</td><td>5400.0</td><td>Male</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "329 Gentoo Biscoe 48.1 15.1 209.0 \n", "190 Chinstrap Dream 46.9 16.6 192.0 \n", "207 Chinstrap Dream 52.2 18.8 197.0 \n", "255 Gentoo Biscoe 48.4 16.3 220.0 \n", "268 Gentoo Biscoe 44.9 13.3 213.0 \n", "224 Gentoo Biscoe 47.6 14.5 215.0 \n", "138 Adelie Dream 37.0 16.5 185.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "271 Gentoo Biscoe 48.5 14.1 220.0 \n", "342 Gentoo Biscoe 45.2 14.8 212.0 \n", "60 Adelie Biscoe 35.7 16.9 185.0 \n", "322 Gentoo Biscoe 47.2 15.5 215.0 \n", "185 Chinstrap Dream 51.0 18.8 203.0 \n", "122 Adelie Torgersen 40.2 17.0 176.0 \n", "279 Gentoo Biscoe 50.4 15.3 224.0 \n", "267 Gentoo Biscoe 50.5 15.9 225.0 \n", "334 Gentoo Biscoe 46.2 14.1 217.0 \n", "57 Adelie Biscoe 40.6 18.8 193.0 \n", "106 Adelie Biscoe 38.6 17.2 199.0 \n", "311 Gentoo Biscoe 52.2 17.1 228.0 \n", "\n", " body_mass_g sex \n", "329 5500.0 Male \n", "190 2700.0 Female \n", "207 3450.0 Male \n", "255 5400.0 Male \n", "268 5100.0 Female \n", "224 5400.0 Male \n", "138 3400.0 Female \n", "2 3250.0 Female \n", "271 5300.0 Male \n", "342 5200.0 Female \n", "60 3150.0 Female \n", "322 4975.0 Female \n", "185 4100.0 Male \n", "122 3450.0 Female \n", "279 5550.0 Male \n", "267 5400.0 Male \n", "334 4375.0 Female \n", "57 3800.0 Male \n", "106 3750.0 Female \n", "311 5400.0 Male " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Enter solution here\n", "penguins_df.sample(n=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, for part b of this question, use a for loop to draw 100 random samples from the dataframe, computing the mean body mass of the penguins in each sample. Append each mean to a list, then plot a histogram of these values (using `sns.displot`). Compare the histogram to the mean body mass of the full dataset containing all the penguins." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 2b here:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overall mean is 4201.754385964912\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAFgCAYAAACFYaNMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAQ4klEQVR4nO3dfaxkdX3H8fcHFsUWHyAsZLvsBpuikZoU05UqtAmWaLfUijYqJa0lkcpqxYhaG9SktTVN8KFqn2JZhYAWFawQH6ooUsQaFFwoIBQN1qIsEHapMdI01Sz77R/30L1sd+FC95zvzL3vVzK5M7+Zuef3g73vPffMzNlUFZKk6e3XPQFJWqkMsCQ1McCS1MQAS1ITAyxJTVZ1T2ApNm7cWJdffnn3NCTpscqeBudiD/i+++7rnoIk7XNzEWBJWo4MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHAktRktAAnWZfkqiS3Jbk1yeuH8bcnuSvJjcPlpLHmIEmzbMyzoe0A3lRVNyR5InB9kiuG+95XVe8ZcduSNPNGC3BV3QPcM1y/P8ltwNqxtidJ82aSY8BJjgSeBVw7DJ2Z5OYk5yc5eC/POSPJliRbtm/fPsU0NUPWrltPkkkua9et716uVqiM/c/SJzkIuBr486q6NMnhwH1AAe8A1lTVKx/ue2zYsKG2bNky6jw1W5JwyrnXTLKtizcdx9g/B1rxpj8he5IDgE8CF1XVpQBVdW9VPVBVO4EPAseOOQdJmlVjvgsiwHnAbVX13kXjaxY97CXALWPNQZJm2ZjvgjgeeAXwzSQ3DmNvBU5NcgwLhyDuADaNOAdJmlljvgviq+z5uMfnxtqmJM0TPwknSU0MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1KT0QKcZF2Sq5LcluTWJK8fxg9JckWS24evB481B0maZWPuAe8A3lRVzwCeA7w2ydHA2cCVVXUUcOVwW5JWnNECXFX3VNUNw/X7gduAtcDJwIXDwy4EXjzWHCRplk1yDDjJkcCzgGuBw6vqHliINHDYXp5zRpItSbZs3759imlK0qRGD3CSg4BPAmdV1Y+W+ryq2lxVG6pqw+rVq8eboCQ1GTXASQ5gIb4XVdWlw/C9SdYM968Bto05B0maVWO+CyLAecBtVfXeRXd9GjhtuH4a8Kmx5iBJs2zViN/7eOAVwDeT3DiMvRU4B7gkyenA94GXjTgHSZpZowW4qr4KZC93nzjWdiVpXvhJOElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQHWkqxdt54kk12klWDMD2JoGbl7652ccu41k23v4k3HTbYtqYt7wJLUxABLUhMDLElNDLAkNTHAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHAktTEAEtSE
wMsSU0MsCQ1McCS1MQAS1ITAyxJTQywtN8qkkx2WbtuffeKNSNWdU9AardzB6ece81km7t403GTbUuzzT1gSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJqMFOMn5SbYluWXR2NuT3JXkxuFy0ljbl6RZN+Ye8AXAxj2Mv6+qjhkunxtx+5I000YLcFV9BfjBWN9fkuZdxzHgM5PcPByiOHhvD0pyRpItSbZs3759yvnNhbXr1pNksoukfW/VxNv7APAOoIavfwG8ck8PrKrNwGaADRs21FQTnBd3b72TU869ZrLtXbzpuMm2Ja0Uk+4BV9W9VfVAVe0EPggcO+X2JWmWTBrgJGsW3XwJcMveHitJy91ohyCSfAw4ATg0yVbgT4ATkhzDwiGIO4BNY21fkmbdaAGuqlP3MHzeWNuTpHnjJ+EkqYkBlqQmBliSmhhgSWpigCWpyZICnOT4pYxJkpZuqXvAf73EMUnSEj3s+4CTPBc4Dlid5I2L7noSsP+YE5Ok5e6RPojxOOCg4XFPXDT+I+ClY01KklaChw1wVV0NXJ3kgqr63kRzkqQVYakfRX58ks3AkYufU1W/OsakJGklWGqAPwH8HfAh4IHxpiNJK8dSA7yjqj4w6kwkaYVZ6tvQPpPkD5KsSXLIg5dRZyZJy9xS94BPG76+edFYAT+7b6cjSSvHkgJcVU8deyKStNIsKcBJfm9P41X14X07HUlaOZZ6COLZi64fCJwI3AAYYEl6jJZ6COJ1i28neTLwkVFmJEkrxGM9HeV/AUfty4lI0kqz1GPAn2HhXQ+wcBKeZwCXjDUpSVoJlnoM+D2Lru8AvldVW0eYjyStGEs6BDGclOdbLJwR7WDgJ2NOSpJWgqX+ixgvB64DXga8HLg2iaejlKT/h6Uegngb8Oyq2gaQZDXwJeAfxpqYJC13S30XxH4PxnfwH4/iuZKkPVjqHvDlSb4AfGy4fQrwuXGmJEkrwyP9m3A/BxxeVW9O8lvALwMBvgZcNMH8JGnZeqTDCO8H7geoqkur6o1V9QYW9n7fP+7UJGl5e6QAH1lVN+8+WFVbWPjniSRJj9EjBfjAh7nvCftyIpK00jxSgL+R5FW7DyY5Hbh+nClJy9x+q0gy2WXtuvXdK9ZePNK7IM4CLkvyO+wK7gbgccBLRpyXtHzt3MEp514z2eYu3nTcZNvSo/OwAa6qe4HjkjwPeOYw/I9V9U+jz0ySlrmlng/4KuCqkeciSSuKn2aTpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJqMFOMn5SbYluWXR2CFJrkhy+/D14LG2L0mzbsw94AuAjbuNnQ1cWVVHAVcOtyVpRRotwFX1FeAHuw2fDFw4XL8QePFY25ekWTf1MeDDq+oegOHrYXt7YJIzkmxJsmX79u2TTVCSpjKzL8JV1eaq2lBVG1avXt09HUna56YO8L1J1gAMX7dNvH1JmhlTB/jTwGnD9dOAT028fUmaGWO+De1jwNeApyfZmuR04Bzg+UluB54/3JakFWnVWN+4qk7dy10njrVNSZonM/sinCQtdwZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqcmqjo0muQO4H3gA2FFVG
zrmIUmdWgI8eF5V3de4fUlq5SEISWrSFeACvpjk+iRnNM1Bklp1HYI4vqruTnIYcEWSb1XVVxY/YAjzGQDr16/vmKMkjaplD7iq7h6+bgMuA47dw2M2V9WGqtqwevXqqacoSaObPMBJfjrJEx+8DrwAuGXqeUhSt45DEIcDlyV5cPsfrarLG+YhSa0mD3BVfRf4ham3K0mzxrehSVITAyxJTQywJDUxwJLUxABLUhMDLElNDLAkNTHA0nK33yqSTHJZu87ztjwanecDljSFnTs45dxrJtnUxZuOm2Q7y4V7wJLUxABLUhMDLElNDLAkNTHAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1KTZR3gtevW+xl4STNrWZ8L4u6td/oZeEkza1nvAUvSLDPAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1ITAyxJTQywJDUxwJLUZFl/FHlS+60iSfcspF4T/xz8zBHruOvO70+2vX3NAO8rO3dMdt4J8NwTmlH+HDwqHoKQpCYGWJKaGGBJamKAJamJAZakJgZYkpoYYElqYoAlqYkBlqQmBliSmhhgSWpigCXNr+HkP1Nd1q5bv0+n78l4JM2vOT/5j3vAktTEAEtSEwMsSU0MsCQ1McCS1MQAS1ITAyxJTVoCnGRjkm8n+U6SszvmIEndJg9wkv2BvwV+HTgaODXJ0VPPQ5K6dewBHwt8p6q+W1U/AT4OnNwwD0lqlaqadoPJS4GNVfX7w+1XAL9UVWfu9rgzgDOGm08Hvj3pRHc5FLivadtjcD2zzfXMtse6nvuqauPugx3ngsgexv7P3wJVtRnYPP50Hl6SLVW1oXse+4rrmW2uZ7bt6/V0HILYCqxbdPsI4O6GeUhSq44AfwM4KslTkzwO+G3g0w3zkKRWkx+CqKodSc4EvgDsD5xfVbdOPY9Hof0wyD7memab65lt+3Q9k78IJ0la4CfhJKmJAZakJisuwEkOTHJdkpuS3JrkT4fxY5J8PcmNSbYkOXbRc94yfGz620l+bdH4Lyb55nDfXyXZ01vsJpFk/yT/kuSzw+1DklyR5Pbh68GLHjuP63l3km8luTnJZUmesuixc7eeReN/mKSSHLpobC7Xk+R1w5xvTfKuReNzt57JelBVK+rCwvuQDxquHwBcCzwH+CLw68P4ScCXh+tHAzcBjweeCvwbsP9w33XAc4fv+fkHn9+0rjcCHwU+O9x+F3D2cP1s4J1zvp4XAKuG6++c9/UMY+tYeDH6e8Ch87we4HnAl4DHD7cPm/P1TNKDFbcHXAv+c7h5wHCp4fKkYfzJ7Hpv8snAx6vqx1X178B3gGOTrAGeVFVfq4X/+h8GXjzRMh4iyRHAbwAfWjR8MnDhcP1Cds1tLtdTVV+sqh3Dza+z8P5xmNP1DN4H/BEP/SDSvK7nNcA5VfVjgKraNozP63om6cGKCzD8768bNwLbgCuq6lrgLODdSe4E3gO8ZXj4WuDORU/fOoytHa7vPt7h/Sz8IO9cNHZ4Vd0DMHw9bBif1/Us9koW9jBgTteT5EXAXVV1026Pncv1AE8DfiXJtUmuTvLsYXxe13MWE/RgRQa4qh6oqmNY2Is6NskzWfgb/A1VtQ54A3De8PC9fXR6SR+pHluSFwLbqur6pT5lD2Nzs54kbwN2ABc9OLSHh830epL8FPA24I/39JQ9jM30egargINZOJz3ZuCS4RjovK5nkh50nAtiZlTVD5N8GdgInAa8frjrE+z6dWRvH53eyq5fgxePT+144EVJTgIOBJ6U5O+Be5Osqap7hl+PHvyVcC7XU1W/m+Q04IXAicOveTCH6wE+wsLxw5uG12mOAG4YXuiZu/UMf962ApcO/1+uS7KThRPXzOt6fpMpetB10LvrAqwGnjJcfwLwzyz8UN8GnDCMnwhcP1z/eR560P277Dro/g0W/sZ/8KD7Sc1rO4FdLyK8m4e+CPeuOV/PRuBfgdW7PWYu17Pb+B3sehFuLtcDvBr4s
+H601j4NT1zvJ5JerAS94DXABdm4cTw+wGXVNVnk/wQ+Mskq4D/ZjgVZlXdmuQSFn74dwCvraoHhu/1GuACFkL+eXYdl5wF57Dwa+DpwPeBl8Fcr+dvWPhDf8Ww1/j1qnr1HK9nj+Z4PecD5ye5BfgJcFotVGle1/MqJuiBH0WWpCYr8kU4SZoFBliSmhhgSWpigCWpiQGWpCYGWJKaGGBJavI/yNpxXsq8y4cAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## Enter solution here\n", "overall_mean = penguins_df['body_mass_g'].mean()\n", "\n", "list_sample_mean = []\n", "for i in range(100):\n", "    sample_mean = penguins_df.sample(n=20)['body_mass_g'].mean()\n", "    list_sample_mean.append(sample_mean)\n", "\n", "sns.displot(list_sample_mean)\n", "print(\"Overall mean is \", overall_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 3: Stratified sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One problem with the simple random samples we made of the penguins is that any given sample might exclude some important groups in the data. For example, if we sampled only 10 penguins, all of them might be male. If we wanted to be more even-handed and make sure our samples were _representative_ of both sexes, we could sample separately from each subpopulation. This is called \"stratified sampling\".\n", "\n", "Please read this example webpage: https://www.statology.org/stratified-sampling-pandas/\n", "on stratified sampling and adapt the code to generate a random sample of 10 penguins that is stratified so that there are 5 male and 5 female examples in the sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer 3 here:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>184</th><td>Chinstrap</td><td>Dream</td><td>42.5</td><td>16.7</td><td>187.0</td><td>3350.0</td><td>Female</td></tr>\n",
"    <tr><th>16</th><td>Adelie</td><td>Torgersen</td><td>38.7</td><td>19.0</td><td>195.0</td><td>3450.0</td><td>Female</td></tr>\n",
"    <tr><th>104</th><td>Adelie</td><td>Biscoe</td><td>37.9</td><td>18.6</td><td>193.0</td><td>2925.0</td><td>Female</td></tr>\n",
"    <tr><th>64</th><td>Adelie</td><td>Biscoe</td><td>36.4</td><td>17.1</td><td>184.0</td><td>2850.0</td><td>Female</td></tr>\n",
"    <tr><th>27</th><td>Adelie</td><td>Biscoe</td><td>40.5</td><td>17.9</td><td>187.0</td><td>3200.0</td><td>Female</td></tr>\n",
"    <tr><th>168</th><td>Chinstrap</td><td>Dream</td><td>50.3</td><td>20.0</td><td>197.0</td><td>3300.0</td><td>Male</td></tr>\n",
"    <tr><th>179</th><td>Chinstrap</td><td>Dream</td><td>49.5</td><td>19.0</td><td>200.0</td><td>3800.0</td><td>Male</td></tr>\n",
"    <tr><th>123</th><td>Adelie</td><td>Torgersen</td><td>41.4</td><td>18.5</td><td>202.0</td><td>3875.0</td><td>Male</td></tr>\n",
"    <tr><th>101</th><td>Adelie</td><td>Biscoe</td><td>41.0</td><td>20.0</td><td>203.0</td><td>4725.0</td><td>Male</td></tr>\n",
"    <tr><th>194</th><td>Chinstrap</td><td>Dream</td><td>50.9</td><td>19.1</td><td>196.0</td><td>3550.0</td><td>Male</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "184 Chinstrap Dream 42.5 16.7 187.0 \n", "16 Adelie Torgersen 38.7 19.0 195.0 \n", "104 Adelie Biscoe 37.9 18.6 193.0 \n", "64 Adelie Biscoe 36.4 17.1 184.0 \n", "27 Adelie Biscoe 40.5 17.9 187.0 \n", "168 Chinstrap Dream 50.3 20.0 197.0 \n", "179 Chinstrap Dream 49.5 19.0 200.0 \n", "123 Adelie Torgersen 41.4 18.5 202.0 \n", "101 Adelie Biscoe 41.0 20.0 203.0 \n", "194 Chinstrap Dream 50.9 19.1 196.0 \n", "\n", " body_mass_g sex \n", "184 3350.0 Female \n", "16 3450.0 Female \n", "104 2925.0 Female \n", "64 2850.0 Female \n", "27 3200.0 Female \n", "168 3300.0 Male \n", "179 3800.0 Male \n", "123 3875.0 Male \n", "101 4725.0 Male \n", "194 3550.0 Male " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Enter solution here\n", "penguins_df.groupby('sex').sample(5)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }