name: title class: middle, center, dark # Describing and plotting data (Part 2) --- class: light, center, middle, clear # We already know what lots of numbers look like --- class: light # Lots of Numbers look like this Like this <div class=rtable> <table> <tbody> <tr> <td style="text-align:right;"> -57 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> -16 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> -33 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> -72 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> -35 </td> </tr> <tr> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> 96 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> -41 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> -57 </td> <td style="text-align:right;"> -76 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> -13 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> -75 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> -63 </td> <td style="text-align:right;"> -84 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> -51 </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> -73 </td> <td 
style="text-align:right;"> -50 </td> <td style="text-align:right;"> -80 </td> <td style="text-align:right;"> -19 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 26 </td> </tr> <tr> <td style="text-align:right;"> -47 </td> <td style="text-align:right;"> -94 </td> <td style="text-align:right;"> -4 </td> <td style="text-align:right;"> -60 </td> <td style="text-align:right;"> -69 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:right;"> -7 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> -46 </td> </tr> <tr> <td style="text-align:right;"> -53 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> -41 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> -65 </td> <td style="text-align:right;"> -93 </td> <td style="text-align:right;"> -26 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 97 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> -59 </td> <td style="text-align:right;"> -88 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> -84 </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> -22 </td> <td style="text-align:right;"> -51 
</td> <td style="text-align:right;"> 99 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> -37 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> -68 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> -82 </td> <td style="text-align:right;"> -20 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -29 </td> <td style="text-align:right;"> -62 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> -68 </td> <td style="text-align:right;"> -12 </td> <td style="text-align:right;"> 51 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 73 </td> <td style="text-align:right;"> -29 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -66 </td> <td style="text-align:right;"> 82 </td> <td style="text-align:right;"> -72 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> -52 </td> <td style="text-align:right;"> -98 </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 48 </td> <td style="text-align:right;"> 34 </td> </tr> <tr> <td style="text-align:right;"> -35 </td> <td style="text-align:right;"> -95 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -80 </td> <td style="text-align:right;"> -39 </td> <td style="text-align:right;"> 61 </td> <td style="text-align:right;"> 96 </td> <td style="text-align:right;"> -56 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> -95 </td> <td style="text-align:right;"> -30 </td> <td 
style="text-align:right;"> 75 </td> <td style="text-align:right;"> -12 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> -46 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> -92 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> -88 </td> <td style="text-align:right;"> -35 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> -40 </td> </tr> </tbody> </table> </div>

---
class: light

# Summary numbers

We want to reduce the big set of numbers down to a few numbers that we can look at and make sense of.

**Sameness (Central Tendency)**

- What are all the numbers close to? (topic from last class)

**Differentness (Variance)**

- How different are the numbers? (topic for this class)

---
class: light, center, middle, clear

# Graph the numbers to get a better look at the differences

---
class: light

# Histogram

We can see the spread in the data: the numbers are not all the same.

<img src="2b_variation_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
class: light, center, middle, clear

# Range

---
class: light

# The Range

<img src="2b_variation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: light

# The Range

The range is two numbers.
- minimum value: the smallest number in the data
- maximum value: the largest number in the data

`1 4 3 6 5 7 6 8 7 6 9`

Range is (1,9)

- smallest number is 1
- largest number is 9

---
class: light

# min()

Use NumPy's `np.min()` function to find the smallest value in a variable in Python

```python
import numpy as np
# 12 random integers from 1 to 10 (inclusive)
x = np.random.randint(1, 10 + 1, 12)
np.min(x)
```

---
class: light

# max()

Use NumPy's `np.max()` function to find the largest value in a variable in Python

```python
import numpy as np
# 12 random integers from 1 to 10 (inclusive)
x = np.random.randint(1, 10 + 1, 12)
np.max(x)
```

---
class: light

# Thinking about the range

- Pros: Great way to find out the largest possible difference
- Cons: The biggest possible difference is probably not representative of all the differences in the numbers

---
class: light

# Some numbers

Here are two sets of numbers. What is the range? Does it do a good job showing the average differences?

`1 5 6 5 4 5 6 5 4 5 6 5 4 100`

`1 2 1 2 1 1 1 1 2 2 2 2 1 2`

---
class: light

# Average differences

- It would be nice if we could find a way to measure the average amount of difference.
- This average could be a **representative** value that summarizes the differences between the numbers

---
class: light

# Average differences

What should the average difference for these numbers be?

` 1 2 1 2 1 1 1 1 2 2 2 2 1 2`

- All of the numbers are 1s or 2s.
- The difference between 1 and 2 is 1
- It seems the average difference (ignoring the + or - sign) should be something like 1, or 0.5 if you also compare identical numbers

---
class: light

# Differences between numbers

Consider these 10 numbers:

` 1 3 4 5 5 6 7 8 9 24`

- We can see there are some differences: the numbers are not all the same.
- We can measure the differences by finding the difference between each score and every other score
- e.g., 1-3 = -2, 1-4 = -3, etc.

---
class: light

# Difference scores

<img src="figs-crump/2bdifferences.png" width="650" />

---
class: light

# Problem: The sum = 0

<img src="figs-crump/2bdifferences.png" width="650" />

---
class: light

# Summarizing the difference scores

1. 
Even though we can see the difference scores have different values, we can't summarize them in the typical fashion (taking a mean of the "differences")

--

2. The sum adds up to 0, because every difference (e.g., 1-3 = -2) is cancelled by its mirror image (3-1 = 2)

--

3. How can we solve the problem?

---
class: light, center, middle, clear

# Difference scores from the mean

---
class: light

# Difference scores from the mean

Consider these numbers:

` 1 6 4 2 6 8`

1. We can compute the mean to describe the central tendency of the numbers
2. How far off is the mean for each number? This is the amount of error
3. The difference scores from the mean show how far off (different) each score is from the mean

difference score = `\(x_{i}-\bar{X}\)`

---
class: light

# Difference scores from the mean

<img src="figs-crump/2bdiffmean.png" width="650" />

---
class: light

# sum = 0, Same problem...

<img src="figs-crump/2bdiffmean.png" width="650" />

---
class: light

# Mean is the balancing point

1. The mean is the balancing point in the data

--

2. The differences above the mean exactly cancel the differences below it (the distances balance, not the number of scores on each side)

--

3. Difference scores from the mean will always sum to 0

---
class: light, center, middle, clear

# Squared Deviations

---
class: light

# Squared deviations

<img src="figs-crump/2bsquared.png" width="650" />

---
class: light

# Squared deviations

Why square the deviations (differences between the mean and each score)?

- Squaring converts all the negative numbers to positive numbers
- This allows us to sum them all up, and not get 0!

---
class: light, center, middle, clear

# SS (Sum of squared deviations)

---
class: light

# SS (sum of squared deviations)

<img src="figs-crump/2bsquaredSS.png" width="650" />

---
class: light

# SS (sum of squared deviations)

The formula for the sum of squared deviations (SS, also called sum of squares) is:

`\(SS = \sum_{i=1}^{i=N} (x_{i}-\bar{X})^2\)`

---
class: light

# What next?

1. We've found a way to sum up the differences (SS)

--

2. We used the squared differences from the mean, and added them all up

--

3. 
How can we find the average? Remember, we want a single number that does a good job of representing the differences...

---
class: light, center, middle, clear

# Variance

---
class: light

# Variance = SS/N

<img src="figs-crump/2bsquaredV.png" width="650" />

---
class: light

# Jargon

1. Learning statistics can be confusing because there are many new terms, and some of them refer to normal everyday concepts

Everyday words:

- Variability & Variance: The things aren't all the same, they have some variability

**Statistical Variance**: The average of the squared difference scores from the mean

---
class: light

# Variance

The average of the squared difference scores from the mean

`\(Variance = SS / N\)`

`\(Variance = \frac{\sum_{i=1}^{i=N}(x_{i}-\bar{X})^2}{N}\)`

Usefulness

Pros: The variance provides us with one summary number about the average differences

Cons: We squared the differences, so the variance is in squared units and doesn't directly relate to the size of the original differences

---
class: light

# The variance is too big

<img src="figs-crump/2bsquaredV2.png" width="650" />

---
class: light

# What to do?

1. We are searching for a summary number to represent the differences in our data.

--

2. The variance is too big because of squaring

--

3. What can we do to solve the problem, and make our summary number in the range of the actual differences?

---
class: light

# Square root the variance

1. Squaring numbers larger than 1 makes them bigger (e.g., `\(2^2 =4\)`)
2. Square rooting numbers brings them back down to their unsquared size (e.g., `\(\sqrt{2^2}=2\)`)
3. Let's square root the variance

---
class: light

# Square root the variance

<img src="figs-crump/2bsquaredV3.png" width="650" />

---
class: light, center, middle, clear

# Standard Deviation

---
class: light

# Standard deviation = sqrt(variance)

When we took the square root of the variance, we computed a statistic with its own name: the **standard deviation**.
`\(\text{standard deviation} = \sqrt{\text{variance}}\)`

`\(\text{standard deviation} = \sqrt{\frac{SS}{N}}\)`

`\(\text{standard deviation} = \sqrt{ \frac{\sum_{i=1}^{i=N}(x_{i}-\bar{X})^2}{N}}\)`

The standard deviation is a summary of the variability in the data that is on the same scale as the original differences

---
class: light

# Standard Deviation

<img src="figs-crump/2bsquaredV3.png" width="650" />

---
class: light

# Populations vs samples

There are different formulas for the variance and standard deviation, depending on whether your data represent an entire population of scores, or just a sample (a subset of the population).

**Population**: Divide by N

**Sample**: Divide by N-1 (this is what you will do when you are working with samples later in this class)

---
class: light, center, middle, clear

# Data sense with Descriptives

---
class: light

# What if?

- Someone told you they had some numbers with:

--

- Mean = 100, Standard Deviation = 25

--

- What would most of the numbers be like?
- What would be a good summary of the average differences in the data?
- What kind of numbers would you expect to see or not see?

---
class: light

# Animating the standard deviation

<img src="figs-crump/2banimSD.gif" style="display: block; margin: auto;" />

---
class: light, center, middle, clear

# Python tips

---
class: light

# Warning

- Python has functions for variance and standard deviation...
- **But**, be careful about whether they divide by N or N-1

---
class: light

# Python: Mean difference scores

```python
import numpy as np
x = np.array([8,2,6,4,6,2,4,4])
# difference of each score from the mean
diff = x - np.mean(x)
```

```python
np.mean(diff)
```

```
## 0.0
```

---
class: light

# Python: Squared deviations

```python
x = np.array([8,2,6,4,6,2,4,4])
# square each difference from the mean
(x - np.mean(x))**2
```

```
## array([12.25, 6.25, 2.25, 0.25, 2.25, 6.25, 0.25, 0.25])
```

---
class: light

# Python: SS, sum of squares

```python
x = np.array([8,2,6,4,6,2,4,4])
# add up the squared differences
np.sum((x - np.mean(x))**2)
```

```
## 30.0
```

More explicit code:

```python
x = np.array([8,2,6,4,6,2,4,4])
squared_diff = (x - np.mean(x))**2
SS = np.sum(squared_diff)
SS
```

```
## 30.0
```

---

# Python: Variance

```python
x = np.array([8,2,6,4,6,2,4,4])
squared_diff = (x - np.mean(x))**2
SS = np.sum(squared_diff)
# population variance: divide SS by N
myvar = SS / len(x)
myvar
```

```
## 3.75
```

```python
x = np.array([8,2,6,4,6,2,4,4])
# np.var() divides by N by default
np.var(x)
```

```
## 3.75
```

---
class: light

Thanks to Todd Gureckis and Matt Crump for slides.
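
---
class: light

# Python: Standard deviation, N vs N-1

The warning about N versus N-1 can be made concrete with a short sketch. This is not part of the original slides: it assumes NumPy is installed, reuses the same `x` as the variance slide, and the variable names (`pop_var`, `samp_sd`, and so on) are made up for illustration. In NumPy, `np.var()` and `np.std()` divide by N unless you pass `ddof=1`.

```python
import numpy as np

x = np.array([8, 2, 6, 4, 6, 2, 4, 4])

# Population formulas: divide by N (NumPy's default, ddof=0)
pop_var = np.var(x)
pop_sd = np.std(x)

# Sample formulas: divide by N-1 (ask for it with ddof=1)
samp_var = np.var(x, ddof=1)
samp_sd = np.std(x, ddof=1)

# The same quantities computed by hand from SS, for comparison
SS = np.sum((x - np.mean(x))**2)
hand_pop_var = SS / len(x)
hand_samp_var = SS / (len(x) - 1)
hand_pop_sd = np.sqrt(hand_pop_var)
hand_samp_sd = np.sqrt(hand_samp_var)

print(pop_var, hand_pop_var)
print(samp_var, hand_samp_var)
print(pop_sd, hand_pop_sd)
print(samp_sd, hand_samp_sd)
```

Each printed pair should agree. Other tools make a different default choice (for example, pandas divides by N-1 unless told otherwise), so it is worth checking the documentation before trusting a number.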