52 | -23 | 57 | 91 | -42 | 34 | -59 | -50 | -80 | -71 | 35 | 6 | 2 | 57 | 49 | -14 |
100 | -48 | -49 | -25 | -75 | 81 | -69 | -5 | 79 | -85 | -5 | -69 | 98 | -11 | 89 | -24 |
-55 | -14 | -51 | 49 | 74 | -71 | 91 | 77 | 68 | 29 | 13 | -81 | 21 | 86 | -32 | 15 |
65 | -22 | 85 | -57 | 1 | 54 | 100 | 76 | -11 | 83 | -60 | 74 | -61 | 30 | 93 | -53 |
90 | 90 | -68 | 51 | 85 | -58 | -56 | 38 | -34 | 10 | 66 | -52 | 14 | -10 | -34 | -42 |
99 | 24 | -30 | -1 | 6 | 46 | -11 | 15 | 6 | 69 | 67 | -17 | -48 | 36 | -62 | -86 |
-24 | -28 | -9 | -13 | 19 | -3 | 5 | 90 | -63 | -28 | -18 | 29 | 92 | 28 | -94 | -25 |
26 | 93 | 21 | 39 | -90 | 62 | -19 | 36 | 14 | -27 | -67 | 3 | -19 | -46 | 69 | 48 |
-45 | 98 | -56 | -48 | 69 | 98 | 31 | -32 | 69 | 68 | -2 | -99 | 31 | 66 | 65 | -80 |
6 | 2 | 57 | -49 | 92 | 65 | -54 | -95 | -73 | -61 | -71 | -61 | 70 | 52 | -1 | 8 |
We can see they aren't all the same. Not much else really. Looking at a bunch of numbers is hard work.
It would be nice to reduce the big set of numbers down to a few numbers that we can look at and make sense of.
Sameness (Central Tendency)
Differentness (Variance)
Give us summaries of big sets of numbers
Useful single numbers to look at
They tell us about patterns of sameness and differentness
Graphing the numbers gives a quick and dirty sense of what they are like. Here are 200 numbers presented as dots
Sorting the numbers from smallest to largest
Histograms count up the numbers inside specific ranges
Bars show you which bins have more or fewer numbers in their range
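
As a rough sketch of what a histogram does under the hood, the snippet below counts how many numbers fall in each range and prints a text bar per bin. The data, the seed, the number of bins, and the -100 to 100 range are all assumptions made for illustration; they are not the numbers plotted on the slides.

```python
import numpy as np

# Made-up data for illustration: 200 whole numbers between -100 and 100
rng = np.random.default_rng(0)  # arbitrary seed so the run is repeatable
numbers = rng.integers(-100, 101, size=200)

# Count how many numbers land in each of 10 equal-width bins
counts, edges = np.histogram(numbers, bins=10, range=(-100, 100))

# Print one text bar per bin; longer bars mean more numbers in that range
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.0f} to {right:6.0f}: {'#' * int(count)}")
```

Every number lands in exactly one bin, so the counts add up to 200.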
What single number would you say best describes most of these numbers?
Is the red or blue value a better summary of all the numbers?
Central tendency should describe what most of the data is like
We want our summary number to be most like the other numbers. We want it to be a representative value
There are multiple measures of central tendency with different properties
Some work better than others depending on the data
The mode is the single most frequently occurring number
1 1 2 2 3 4 5 6 7 7 7 7 7
The mode is 7 because 7 happens the most
Find the mode by counting the occurrences of each number. The mode is the most frequently occurring number
If there is a tie, then you have two or three or more modes (depends on how many different numbers tie)
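We make 25 numbers. How do we get Python to find the mode?
import numpy as np
# 25 random whole numbers from 1 to 10
a = np.random.randint(1, 10 + 1, 25)
# counts[k] is how many times the number k appears
counts = np.bincount(a)
# position of the biggest count = the most frequent number
mode = np.argmax(counts)
mode, counts[mode]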
You can always write your own function for the mode. This one is called my_mode
def my_mode(array):
    # count occurrences of each number in the array passed in
    counts = np.bincount(array)
    mode = np.argmax(counts)
    return mode, counts[mode]

a = np.random.randint(1, 10 + 1, 25)
my_mode(a)
When should we use the mode? Appropriate for many datasets; for nominal (or ordinal) data, it may be one of the few reasonable descriptors
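
Two cases from the mode slides are worth a sketch: ties, and nominal data. `np.argmax` reports only the first position of the largest count, and `np.bincount` accepts only non-negative whole numbers, so the version below uses `collections.Counter` from the standard library instead. The function name `all_modes` and the example lists are hypothetical, chosen only for illustration.

```python
from collections import Counter

def all_modes(values):
    """Return every value tied for the highest count, plus that count."""
    counts = Counter(values)          # value -> number of occurrences
    top = max(counts.values())        # the highest count (needs non-empty input)
    modes = sorted(v for v, c in counts.items() if c == top)
    return modes, top

print(all_modes([1, 1, 2, 2, 3]))          # a tie: both 1 and 2 are modes
print(all_modes(["red", "blue", "red"]))   # nominal labels work too
```

Returning a list keeps the single-mode and multi-mode cases in the same shape, at the cost of always having to unpack a list.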
The median is the middle number
1 1 2 2 3 4 5 6 7 7 7 7 7
The median is 5 because it is the middle number
Find the median by ordering the numbers from smallest to largest, then take the number in the middle
If there is an even number of numbers, find the two in the middle and take their mean
1 2 3 4 5 6 7 8
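
The sort-then-take-the-middle recipe can be sketched by hand as below, using the two example lists from the slides. The function name `my_median` is an assumption made for illustration; `np.median` gives the same answers.

```python
import numpy as np

def my_median(values):
    """Median by hand: sort, then take the middle (or average the two middles)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd count: exactly one middle number
        return ordered[middle]
    # even count: average the two numbers in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

odd = [1, 1, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7]
even = [1, 2, 3, 4, 5, 6, 7, 8]
print(my_median(odd), np.median(odd))     # both give 5
print(my_median(even), np.median(even))   # both give 4.5
```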
Put some numbers in a variable.
a = np.random.randint(1, 10 + 1, 12)
np.median(a)
When would the median be a good thing to know?
Suitable for many datasets, and makes sense for ordinal data. More robust to outliers than the mean
The Mean (also called average) is the sum of the numbers, divided by the number of numbers
$$\text{Mean} = \frac{\text{sum of numbers}}{\text{number of numbers}}$$
1 1 2 2 3 4 5 6 7 7 7 7 7
$$\text{Mean} = \bar{X} = \frac{\sum_{i=1}^{N} x_i}{N}$$
$\bar{X}$: the bar symbolizes the mean
$\sum_{i=1}^{N} x_i$: summation notation
$x$ = all the numbers (1, 2, 3, 4, ...)
$i$ = an index value, representing the first to last and all the numbers in between of $x$
$N$ = the number of numbers
$\sum$ = instruction to add up numbers
$x = [4, 7, 9]$
$$\sum_{i=1}^{N} x_i = x_{i=1} + x_{i=2} + x_{i=3} = 4 + 7 + 9 = 20$$
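
The summation notation reads naturally as a loop. The sketch below walks the index over x = [4, 7, 9] from the slide and adds up each value; the variable names `total` and `mean` are illustrative. Note that Python counts positions from 0 to N-1, whereas the formula counts from 1 to N.

```python
x = [4, 7, 9]
N = len(x)          # the number of numbers

total = 0
for i in range(N):  # i runs over 0, 1, 2 (the formula's i = 1, 2, 3)
    total += x[i]   # add the i-th number to the running sum

mean = total / N    # sum of numbers divided by number of numbers
print(total, mean)
```

The loop gives the same total as the built-in `sum(x)`, which is what you would normally write.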
index | x |
---|---|
1 | 4 |
2 | 7 |
3 | 2 |
4 | 9 |
5 | 8 |
Sum | 30 |
N | 5 |
Mean | 6 |
index | x | equal_parts |
---|---|---|
1 | 4 | 6 |
2 | 7 | 6 |
3 | 2 | 6 |
4 | 9 | 6 |
5 | 8 | 6 |
Sum | 30 | 30 |
N | 5 | 5 |
Mean | 6 | 6 |
index | x | deviations |
---|---|---|
1 | 4 | -2 |
2 | 7 | 1 |
3 | 2 | -4 |
4 | 9 | 3 |
5 | 8 | 2 |
Sum | 30 | 0 |
N | 5 | 5 |
Mean | 6 | 0 |
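
The three tables above can be reproduced in a few lines, as a sketch of two properties of the mean: splitting the sum into equal parts gives the mean for every item, and the deviations from the mean add up to zero. The variable names are illustrative; the numbers are the ones in the tables.

```python
import numpy as np

x = np.array([4, 7, 2, 9, 8])

mean = x.sum() / x.size                # 30 / 5
equal_parts = np.full(x.size, mean)    # every item replaced by the mean
deviations = x - mean                  # how far each number is from the mean

print(mean)                # the mean of x
print(equal_parts.sum())   # same total as x.sum()
print(deviations)          # matches the deviations column
print(deviations.sum())    # deviations above and below the mean cancel
```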
Use the mean() function
# make some numbers
a = np.random.randint(1, 10 + 1, 12)
np.mean(a)
sum() sums up the numbers
.size counts up the number of numbers in the variable
a = np.random.randint(1, 10 + 1, 12)
np.sum(a)
a.size
a = np.random.randint(1, 10 + 1, 12)
np.sum(a) / a.size
When would the mean be a good thing to know?
Most appropriate for interval and ratio data. But sensitive to outliers.
Mean (Red), Median (Green), Mode (Blue)
Outliers are really big or really small values that are unusual compared to the rest of the data
The mean is influenced by outliers; the median is much less affected.
Mean (Red), Median (Green)
The big number (2000) makes the mean really big, because it is included in the sum.
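
To see the effect in numbers, here is a minimal sketch with made-up data: five small scores, then the same scores with one outlier of 2000 added. The array names and the small scores are assumptions for illustration, not the data plotted on the slide.

```python
import numpy as np

scores = np.array([1, 2, 3, 4, 5])        # made-up small numbers
with_outlier = np.append(scores, 2000)    # same numbers plus one big outlier

# Without the outlier, mean and median agree
print(np.mean(scores), np.median(scores))

# With the outlier, the mean jumps because 2000 is part of the sum,
# while the median only moves to the next middle position
print(np.mean(with_outlier), np.median(with_outlier))
```

The mean goes from 3 to over 300, while the median moves only from 3 to 3.5.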
Descriptive statistics help us reduce a large pile of numbers to a few numbers that "describe the data"
Mode, median, and mean are descriptives for central tendency in the data (meant to represent what most of the numbers are like)
Measures of central tendency can be "off" by quite a bit depending on the shape of the data, so we need to look at the data to see whether they are appropriate
Thanks to Todd Gureckis and Matt Crump for the slides.