Lecture 4 [lhc/t.m. gureckis]

# Describing and plotting data (Part 1)

---

# Here are a lot of numbers

---

# What can we say about them?

We can see they aren't all the same. Not much else really. Looking at a bunch of numbers is hard work.

---

# Summary numbers

It would be nice to reduce the big set of numbers down to a few numbers that we can look at and make sense of.

**Sameness (Central Tendency)**

- What are all the numbers close to?

**Differentness (Variance)**

- How different are the numbers?

---

# Descriptive Statistics

- Give us summaries of big sets of numbers

- Useful single numbers to look at

- They tell us about patterns of sameness and differentness

---

# Graph the numbers to get a better look

---

# Dot plot (unordered)

Graphing the numbers gives a quick and dirty sense of what they are like. Here are 200 numbers presented as dots.

---

# Dot plot (ordered)

Sorting the numbers from smallest to largest

---

# Histograms

Histograms count up the numbers inside specific ranges

---

# Histograms

Bars show you which bins have more or fewer numbers in their range
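
The counting behind a histogram can be sketched with `np.histogram`. The seed, the 200 simulated numbers, and the bin edges below are illustrative assumptions, not the data plotted on these slides.

```python
import numpy as np

# Illustrative data (an assumption): 200 whole numbers from 1 to 100
rng = np.random.default_rng(seed=1)
numbers = rng.integers(1, 101, size=200)

# Ten equal-width bins with edges at 0, 10, 20, ..., 100
bin_edges = np.arange(0, 101, 10)
counts, edges = np.histogram(numbers, bins=bin_edges)

# Each count is the height of one bar in the histogram
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. Every number lands in exactly one bin, so the counts add up to 200.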

---

# So what are these numbers like?

What single number would you say best describes most of these numbers?

---

# Question

Is the red or blue value a better summary of all the numbers?

---

# Measures of Central Tendency

---

# Central Tendency

1. **Central tendency** should describe what most of the data is like

2. We want our summary number to be most like the other numbers. We want it to be a **representative value**

3. There are **multiple measures** of central tendency with different properties

4. Some work better than others depending on the data

---

# Mode

---

# Mode

The mode is the single most frequently occurring number

> 1 1 2 2 3 4 5 6 7 7 7 7 7

- The mode is 7 because 7 happens the most

- Find the mode by counting the occurrences of each number; the mode is the number that occurs most often

- If there is a tie, then you have two or three or more modes (depends on how many different numbers tie)
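
The counting rule above, including the tie case, can be sketched with `np.unique`. The helper name `all_modes` is hypothetical, and the second data set is made up to show a tie.

```python
import numpy as np

def all_modes(values):
    """Return every value that ties for the highest count (hypothetical helper)."""
    uniques, counts = np.unique(values, return_counts=True)
    # Keep each unique value whose count equals the largest count
    return uniques[counts == counts.max()].tolist()

print(all_modes([1, 1, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7]))  # [7]
print(all_modes([1, 1, 2, 2, 3]))                          # [1, 2]
```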

---

# Finding the Mode in Python

We make 25 numbers. How do we get Python to find the mode?

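```python
import numpy as np

# 25 random whole numbers from 1 to 10 (randint excludes the upper bound)
a = np.random.randint(1, 10 + 1, 25)
counts = np.bincount(a)     # counts[k] = how many times k appears in a
mode = np.argmax(counts)    # value with the highest count (the first one if tied)
mode, counts[mode]
```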

---

# Custom function for the mode in Python

You can always write your own function for the mode. This one is called `my_mode`.

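```python
import numpy as np

def my_mode(array):
    counts = np.bincount(array)   # counts[k] = how many times k appears in array
    mode = np.argmax(counts)      # value with the highest count (the first one if tied)
    return mode, counts[mode]

a = np.random.randint(1, 10 + 1, 25)
my_mode(a)
```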

---

# Thinking about the mode

When should we use the mode? It is appropriate for many datasets, and for nominal (or ordinal) data it may be one of the few reasonable descriptors.
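
For nominal data the values are labels, so `np.bincount` (which needs non-negative whole numbers) does not apply. A minimal sketch with the standard library's `Counter`, using made-up category labels:

```python
from collections import Counter

# Made-up nominal data: category labels with no numeric order
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colors)
# most_common(1) returns a list holding the single (label, count) pair with the top count
mode_label, mode_count = counts.most_common(1)[0]
print(mode_label, mode_count)  # brown 3
```

A mean or median of these labels would be meaningless, but the mode is still well defined. Note that `most_common(1)` reports only one label even when several tie.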

---

# Median

---

# Median

The median is the middle number

> 1 1 2 2 3 4 **5** 6 7 7 7 7 7

- The median is 5 because it is the middle number

- Find the median by ordering the numbers from smallest to largest, then take the number in the middle

---

# Median (even number of numbers)

If there is an even number of numbers, find the two in the middle and take the value halfway between them

> 1 2 3 **4** **5** 6 7 8

- The median is 4.5 because 4.5 is halfway between the two middle numbers, 4 and 5
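
Both cases (odd and even counts) can be sketched by hand in a few lines. The function name `my_median` is hypothetical; `np.median` gives the same answers.

```python
def my_median(values):
    """Median by sorting and taking the middle (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle number
        return ordered[middle]
    # Even count: halfway between the two middle numbers
    return (ordered[middle - 1] + ordered[middle]) / 2

print(my_median([1, 1, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7]))  # 5
print(my_median([1, 2, 3, 4, 5, 6, 7, 8]))                 # 4.5
```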

---

# Finding the Median in Python

Put some numbers in a variable, then pass it to `np.median()`.

```python
import numpy as np

a = np.random.randint(1, 10 + 1, 12)
np.median(a)
```

---

# Thinking about the median

When would the median be a good thing to know?

Suitable for many datasets, and makes sense for ordinal data. More robust to outliers than the mean.

---

# Mean

---

# Mean

The mean (also called the average) is the sum of the numbers divided by the number of numbers

`\(\text{Mean} = \frac{\text{sum of numbers}}{\text{number of numbers}}\)`

> 1 1 2 2 3 4 5 6 7 7 7 7 7

- Sum = 1+1+2+2+3+4+5+6+7+7+7+7+7 = 59
- Number of numbers = 13
- Mean = 59/13 = 4.538462
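
The same arithmetic can be checked in plain Python, using the 13 numbers from this slide:

```python
numbers = [1, 1, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7]

total = sum(numbers)   # sum of the numbers
n = len(numbers)       # number of numbers
mean = total / n

print(total, n, round(mean, 6))  # 59 13 4.538462
```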

---

# Mean

`\(\text{Mean} = \bar{X} = \frac{\sum_{i=1}^{i=N}{x_i}}{N}\)`

- `\(\bar{X}\)` (read "X bar") symbolizes the mean

- `\(\sum_{i=1}^{i=N}{x_i}\)` is summation notation

- `\(x\)` = all the numbers (1,2,3,4...)
- `\(i\)` = an index that runs from the first number in x to the last, one number at a time
- `\(N\)` = the number of numbers
- `\(\sum\)` = an instruction to add up the numbers

---

# Summation example

`\(x = [4,7,9]\)`

`\(\sum_{i=1}^{i=N}{x_i} = x_1 + x_2 + x_3 = 4 + 7 + 9 = 20\)`
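
Summation notation reads like a loop. A minimal sketch of the example above, keeping the formula's 1-based index `i` (Python lists start at 0, so the formula's x_i is `x[i - 1]` in code):

```python
x = [4, 7, 9]
N = len(x)

total = 0
# i runs from 1 to N, exactly as written under and over the sigma
for i in range(1, N + 1):
    total += x[i - 1]

print(total)  # 20
```

In practice `sum(x)` or `np.sum(x)` does the same job in one call.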

---

# Mean in a table

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> index </th>
   <th style="text-align:left;"> x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:left;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:left;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:left;"> 30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> N </td>
   <td style="text-align:left;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Mean </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
</tbody>
</table>

---

# The mean equally divides the sum

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> index </th>
   <th style="text-align:left;"> x </th>
   <th style="text-align:left;"> equal_parts </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> 7 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:left;"> 8 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:left;"> 30 </td>
   <td style="text-align:left;"> 30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> N </td>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:left;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Mean </td>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
</tbody>
</table>

---
class: light

# The mean is the balancing point

- deviation = score minus mean
- sum of deviations will always equal zero

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> index </th>
   <th style="text-align:left;"> x </th>
   <th style="text-align:left;"> deviations </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> -2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> 7 </td>
   <td style="text-align:left;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> -4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:left;"> 8 </td>
   <td style="text-align:left;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:left;"> 30 </td>
   <td style="text-align:left;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> N </td>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:left;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Mean </td>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:left;"> 0 </td>
  </tr>
</tbody>
</table>
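
The balancing-point property in the table can be checked with a few lines of NumPy. This is a minimal sketch using the five numbers from the table; `np.isclose` is used because with non-integer data the sum of deviations can differ from zero by a tiny rounding error.

```python
import numpy as np

x = np.array([4, 7, 2, 9, 8])   # the numbers from the table
mean = x.mean()                 # 30 / 5
deviations = x - mean           # deviation = score minus mean

print(deviations.tolist())              # [-2.0, 1.0, -4.0, 3.0, 2.0]
print(np.isclose(deviations.sum(), 0))  # True
```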
---

# Finding the Mean in Python

Use the `np.mean()` function

```python
import numpy as np

# make some numbers
a = np.random.randint(1, 10 + 1, 12)
np.mean(a)
```

---

# sum() and .size

- `np.sum()` sums up the numbers
- `.size` counts up the number of numbers in the variable

```python
import numpy as np

a = np.random.randint(1, 10 + 1, 12)
np.sum(a)
```

```python
a.size
```

---

# Mean = sum() / .size

```python
import numpy as np

a = np.random.randint(1, 10 + 1, 12)
np.sum(a) / a.size
```

---

# Thinking about the Mean

When would the mean be a good thing to know?

Most appropriate for interval and ratio data, but sensitive to outliers.

---

# Do descriptive statistics for central tendency actually describe the data?

## It depends on the data

---

# Histogram shape: Bell-Shaped

Mean (Red), Median (Green), Mode (Blue)

---

# Right-skewed

Mean (Red), Median (Green), Mode (Blue)
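
A right-skewed shape can be simulated to see where the mean and median land. This is a sketch under assumptions: the exponential distribution, the seed, and the sample size are illustrative choices, not the data in the figure.

```python
import numpy as np

# Simulated right-skewed data (an assumption): most values are small, a few are large
rng = np.random.default_rng(seed=0)
skewed = rng.exponential(scale=1.0, size=10_000)

mean = np.mean(skewed)
median = np.median(skewed)
print(mean, median)

# The long right tail pulls the mean above the median
print(mean > median)
```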

---

# Outliers

Outliers are really big or really small values that are unusual compared to the rest of the data

---
class: light

# Mean, Median, and outliers

The mean is strongly influenced by outliers; the median is much less so.

Mean (Red), Median (Green)
<img src="2-Descriptives_files/figure-html/unnamed-chunk-27-1.png" width="400px" style="display: block; margin: auto;" />

---
class: light
# Zooming in

The big number (2000) makes the mean really big, because it is included in the sum.
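
The effect is easy to reproduce. A minimal sketch with made-up scores: adding a single value of 2000 drags the mean far from the rest of the data, while the median barely moves.

```python
import numpy as np

# Made-up scores, then the same scores plus one outlier of 2000
scores = np.array([4, 7, 2, 9, 8])
with_outlier = np.append(scores, 2000)

# Without the outlier: mean 6.0, median 7.0
print(np.mean(scores), np.median(scores))

# With the outlier: mean about 338.3, median 7.5
print(np.mean(with_outlier), np.median(with_outlier))
```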

---

# Always plot your data

---
class: light
# Big ideas

1. Descriptive statistics help us reduce a large pile of numbers to a few numbers that "describe the data"

2. The mode, median, and mean are descriptives of central tendency in the data (meant to represent what most of the numbers are like)

3. Measures of central tendency can be "off" by quite a bit depending on the shape of the data, so we need to look at the data to see whether they are appropriate

---

Thanks to Todd Gureckis and Matt Crump for the slides.