You have a list of numbers and you want to calculate its variance. Here are the methods to do that in Python.

### Method 1 - `statistics.variance`

Python comes with a built-in `statistics` module, containing the `variance` function.

```
import statistics
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
var = statistics.variance(a)
print(var)
```

`>> 0.28016716666666663`

### Method 2 - `numpy.var`

If you're doing any sort of data science or analytical computing, you're probably already using `numpy`.

Numpy comes with the `var` function - simply pass it an array and it returns the variance.

```
import numpy as np
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
var = np.var(a)
print(var)
```

`>> 0.25215044999999997`

Note that the result differs from the first method - `0.28` vs `0.25`. Why is that? We'll find out in the next section.

### Method 3 - Using the formulas

You've seen in the first two examples that the two methods return different values for the variance of the same list.

That's because of the difference between *population* and *sample* variance, and the way the `statistics.variance` and `numpy.var` methods work by default.

If you're dealing with a fairly large sample, these values should be very close and you shouldn't care much about the difference. That means you can safely skip the math ahead.

If you're still curious, it has to do with *Bessel's correction*: the formula for *sample variance* is not exactly the same as the formula for *population variance*.

To understand what this means and why it is the case, let's start with the definition of the *variance of a random variable*:

$$Var(X)=E\left[\left(X-\mu_{X}\right)^{2}\right]$$

where $\mu_X=E\left[X\right]$ is the expected value of $X$.

In case we're dealing with a finite population, i.e. our array of numbers is *not a sample* but the *entire population*, we're interested in the *population variance*. In this case we're not *estimating* the variance, but simply *calculating* it.

For a finite population $x_1, x_2, ..., x_n$ the population variance is given by:

$$
\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu_{X}\right)^{2}
$$

Where $\mu_X$ is the population mean, given by the familiar formula for the arithmetic mean:

$$
\mu_{X}=\frac{1}{n}\sum_{i=1}^{n}x_{i}
$$
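The two formulas above translate directly into a few lines of Python. Here's a quick sketch, reusing the list `a` from the earlier examples, that computes the population mean and variance by hand and should agree with the `numpy.var` result above:

```python
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]

# Population mean: the arithmetic average of all values
mu = sum(a) / len(a)

# Population variance: the average squared deviation from the mean
pop_var = sum((x - mu) ** 2 for x in a) / len(a)

print(pop_var)  # matches the numpy.var(a) result above (~0.2522)
```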

If our array of observations, $x_1, x_2, ..., x_n$, is a *sample* from a large or infinite population, we want to *estimate* the population variance from our sample data. To do that we start with the same formula:

$$
\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu_{X}\right)^{2}
$$

However, $\mu_X$ is the true population mean, which we don't know so we need to estimate it as well. We'll again use the familiar formula:

$$
\hat{\mu}_{X}=\bar{X}=\frac{1}{n}\sum_{i=1}^{n}x_{i}
$$

Our estimate for the sample variance, $\hat{\sigma}^{2}$, is then:

$$
\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}
$$

No surprises so far, but now comes the fun part.

Formulas like the ones above, for the sample mean and sample variance, are called *statistical estimators*. Whenever we're dealing with statistical estimators, we ideally want them to be *consistent* and *unbiased*.

A *consistent estimator* has the property that as the number of data points used (i.e. the sample size) increases indefinitely, the resulting estimates converge to the true value of the estimated parameter. Intuitively, for a small sample size $n$, our estimates for the mean or variance will vary a lot across different samples of that size. The larger the sample size gets, the more consistent our estimates become.

Both our estimators above, for $\hat{\mu}_X$ and $\hat{\sigma}^{2}$, are indeed consistent estimators.

An *unbiased estimator* is such that across a large number of samples from the same population, the expected value of the estimator $\hat{\theta}$ is equal to the actual value of the parameter $\theta$, i.e. $E[\hat{\theta}]=\theta$.

Our estimator for the mean, $\hat{\mu}_X$, is in fact unbiased, that is:

$$
E\left[\frac{1}{n}\sum_{i=1}^{n}x_{i}\right]=E[X]=\mu_X
$$

However, the estimator of variance is biased. It can be shown that:

$$
E[\hat{\sigma}^2]=E\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}\right]=\frac{n-1}{n}\cdot\sigma^2
$$
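A quick simulation makes the bias concrete. The sketch below (a toy experiment, not part of the derivation) repeatedly draws small samples from a standard normal distribution, whose true variance is $1$, and averages the $\frac{1}{n}$ estimator over all of them. With $n=5$, the average should land near $\frac{n-1}{n}\cdot\sigma^2 = 0.8$ rather than $1$:

```python
import random

random.seed(42)

n = 5             # small sample size, so the bias is easy to see
trials = 100_000  # number of independent samples to average over

total = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    # The biased 1/n estimator from the formula above
    total += sum((x - mean) ** 2 for x in sample) / n

print(total / trials)  # close to (n-1)/n * 1 = 0.8, not 1
```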

Multiplying both sides by $\frac{n}{n-1}$, we get an unbiased estimator of the sample variance, $\hat{\sigma_S}^2$:

$$
E[\hat{\sigma_S}^2]=E\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}\right]=\sigma^2
$$

Using the $n-1$ denominator instead of $n$ when estimating the variance from a sample is known as *Bessel's correction*. It is used by default in `statistics.variance`, but not in `numpy.var`.
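In code, the difference comes down to numpy's `ddof` ("delta degrees of freedom") parameter, which is subtracted from `n` in the denominator. A quick sketch, reusing the list from the earlier examples:

```python
import statistics
import numpy as np

a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]

print(statistics.variance(a))  # n-1 denominator (Bessel's correction): ~0.2802
print(np.var(a, ddof=1))       # ddof=1 gives numpy the same n-1 denominator
print(np.var(a))               # default ddof=0: the n denominator, ~0.2522
```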

We'll conclude with an overview of which method to use, depending on whether you want the *sample* or the *population* variance:

| | Formula | Python |
| --- | --- | --- |
| Sample variance | $$\hat{\sigma_S}^2=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}$$ | `numpy.var(x, ddof=1)` or `statistics.variance(x)` |
| Population variance | $$\sigma^2=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu_X\right)^{2}$$ | `numpy.var(x)` or `statistics.pvariance(x)` |