Calculating Variance in Python - Explained
You have a list of numbers and you want to calculate its variance. Here are the methods to do that in Python.
Method 1 - statistics.variance
Python comes with a built-in statistics module, which includes the variance function.
import statistics
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
var = statistics.variance(a)
print(var)
>> 0.28016716666666663
Method 2 - numpy.var
If you're doing any sort of data science or analytical computing, you're probably already using numpy.
Numpy comes with the var function - simply pass it an array and it returns the variance.
import numpy as np
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
var = np.var(a)
print(var)
>> 0.25215044999999997
Note that the result differs from the first method - 0.28 vs 0.25. Why is that? We'll find out in the next section.
Method 3 - Using the formulas
You've seen in the first two examples that the values for variance of a list differ.
That's because of the difference between population and sample variance, and the way the statistics.variance and numpy.var functions behave by default.
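A quick check (a minimal sketch, using the list from the examples above): numpy.var accepts a ddof ("delta degrees of freedom") parameter, and with ddof=1 it uses the n - 1 denominator and reproduces the statistics.variance result.
import numpy as np
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
# ddof=1 switches the denominator from n to n - 1
print(np.var(a, ddof=1))  # ≈ 0.2801671666..., matching statistics.variance(a)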
If you're dealing with a fairly large sample, these values should be very close and you shouldn't care much about the difference. That means you can safely skip the math ahead.
If you're still curious, it has to do with Bessel's correction: the formula for sample variance is not exactly the same as the formula for population variance.
To understand what this means and why it is the case, let's start with the definition of the variance of a random variable:
$$Var(X)=E\left[\left(X-\mu_{X}\right)^{2}\right]$$
where $\mu_X=E\left[X\right]$ is the expected value of $X$.
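As a quick concrete example (not in the original article): if $X$ is the roll of a fair six-sided die, then $\mu_X = 3.5$ and
$$
Var(X)=\frac{1}{6}\sum_{i=1}^{6}\left(i-3.5\right)^{2}=\frac{17.5}{6}=\frac{35}{12}\approx 2.92
$$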
If we're dealing with a finite population, i.e. our array of numbers is not a sample but the entire population, we're interested in the population variance. In that case we're not estimating the variance, but simply calculating it.
For a finite population $x_1, x_2, ..., x_n$ the population variance is given by:
$$
\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu_{X}\right)^{2}
$$
Where $\mu_X$ is the population mean, given by the familiar formula for the arithmetic mean:
$$
\mu_{X}=\frac{1}{n}\sum_{i=1}^{n}x_{i}
$$
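As a sanity check, here's a minimal pure-Python sketch of these two formulas, applied to the list from the earlier examples; it should reproduce the numpy.var result from Method 2 (up to floating-point rounding).
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
n = len(a)
mu = sum(a) / n  # population mean
pop_var = sum((x - mu) ** 2 for x in a) / n  # divide by n, not n - 1
print(pop_var)  # ≈ 0.25215045, matching np.var(a)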
If our array of observations, $x_1, x_2, ..., x_n$, is a sample from a large or infinite population, we want to estimate the population variance from our sample data. To do that, we start with the same formula:
$$
\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu_{X}\right)^{2}
$$
However, $\mu_X$ is the true population mean, which we don't know, so we need to estimate it as well. We'll again use the familiar formula:
$$
\hat{\mu}_{X}=\bar{X}=\frac{1}{n}\sum_{i=1}^{n}x_{i}
$$
Our estimate for the sample variance, $\hat{\sigma}^{2}$, is then:
$$
\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}
$$
No surprises so far, but now comes the fun part.
Formulas like the ones above, for the sample mean and sample variance, are called statistical estimators. Whenever we're dealing with statistical estimators, we ideally want them to be consistent and unbiased.
A consistent estimator has the property that as the number of data points used (i.e. the sample size) increases indefinitely, the resulting estimates converge to the true value of the estimated parameter (see here for a more rigorous definition). Intuitively, for a small sample size $n$, our estimates for the mean or variance will vary a lot across different samples of that size. The larger the sample size gets, the more consistent our estimates become.
Both our estimators above, for $\hat{\mu}_X$ and $\hat{\sigma}^{2}$, are indeed consistent estimators.
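To see consistency in action, here's a small illustrative simulation (not from the original article): we draw repeated samples of increasing size from a standard normal population and watch the spread of the mean estimates shrink.
import random
random.seed(0)
for n in (10, 100, 1000):
    # 1,000 independent estimates of the mean, each from a sample of size n
    estimates = [sum(random.gauss(0, 1) for _ in range(n)) / n for _ in range(1000)]
    print(n, max(estimates) - min(estimates))  # spread shrinks roughly like 1/sqrt(n)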
An unbiased estimator is such that, across a large number of samples from the same population, the expected value of the estimator $\hat{\theta}$ is equal to the actual value of the parameter $\theta$, i.e. $E[\hat{\theta}]=\theta$.
Our estimator for the mean, $\hat{\mu}_X$, is in fact unbiased, that is:
$$
E\left[\frac{1}{n}\sum_{i=1}^{n}x_{i}\right]=E[X]=\mu_X
$$
However, the estimator of variance is biased. It can be shown that:
$$
E[\hat{\sigma}^2]=E\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}\right]=\frac{n-1}{n}\cdot\sigma^2
$$
Multiplying both sides by $\frac{n}{n-1}$, we get an unbiased estimator for the sample variance, $\hat{\sigma}_S^{2}$:
$$
E[\hat{\sigma}_S^{2}]=E\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}\right]=\sigma^2
$$
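Here's a small simulation sketch (illustrative, not from the original) that makes the bias visible: with samples of size 5 from a standard normal population (true variance 1), the n-denominator estimator averages out near $(n-1)/n = 0.8$, while the corrected one averages near 1.
import random
random.seed(0)
n, trials = 5, 100_000
biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    biased_total += ss / n          # divides by n
    unbiased_total += ss / (n - 1)  # Bessel's correction
print(biased_total / trials)    # ≈ 0.8, i.e. (n-1)/n * sigma^2
print(unbiased_total / trials)  # ≈ 1.0, the true variance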
Using the $n-1$ denominator instead of $n$ when estimating the sample variance is known as Bessel's correction. It is used by default in statistics.variance, but not in numpy.var.
We'll conclude with an overview of which methods to use if you care whether you're calculating sample or population variance:
| Variance | Formula | Python |
| --- | --- | --- |
| Sample variance | $$\hat{\sigma}_S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}_X\right)^{2}$$ | numpy.var(x, ddof=1) or statistics.variance(x) |
| Population variance | $$\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu_{X}\right)^{2}$$ | numpy.var(x) or statistics.pvariance(x) |
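The statistics module also offers pvariance for the population variance, the counterpart of numpy's default. A final check with the list from the examples (outputs may differ in the last decimal places due to floating-point rounding):
import statistics
import numpy as np
a = [-0.372, 0.073, -0.441, -0.577, 0.463, 0.569, -0.559, 0.300, -0.903, 0.442]
print(statistics.pvariance(a))  # ≈ 0.25215045
print(np.var(a))                # ≈ 0.25215044999999997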