If you ever have to process a stream of data of indeterminate length, it's useful to be able to compute some basic statistics as you go. The key phrase there is *as you go* - we could easily compute the mean of an arbitrary number of samples $N$

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i $$

but that means we have to keep all $N$ examples in memory! That's $O(N)$ space complexity, which for a very large $N$ is unacceptable.

###### Recursive mean

Fortunately, we can compute the mean recursively given only

- The previous mean $\bar{x}_{i-1}$
- The most recent example $x_i$
- The number of examples observed so far $i$

Then all we need to do is as follows. (The $i-1$ factor simply recovers the previous running sum from the previous mean.)

$$ \bar{x}_i = \frac{(i-1) \times \bar{x}_{i-1} + x_i}{i} $$

Starting with $\bar{x}_0 = 0$. Note that the index starts at $i = 1$, so we never divide by zero, and the initial value $\bar{x}_0$ is multiplied by zero and drops out. We can't calculate anything until we've seen at least one observation anyway. A simple Python example:

```
import random

# Draw ten random samples and update the mean one example at a time
mean_prev = 0
i = 0
for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    mean_prev = mean
    print("[{:d}] x={:.2f}, mean={:.2f}".format(i, x, mean))
```

```
[1] x=-0.55, mean=-0.55
[2] x=-4.28, mean=-2.41
[3] x=9.66, mean=1.61
[4] x=7.60, mean=3.11
[5] x=-4.35, mean=1.62
[6] x=8.14, mean=2.70
[7] x=-8.69, mean=1.08
[8] x=-6.55, mean=0.12
[9] x=7.41, mean=0.93
[10] x=6.21, mean=1.46
```

###### Recursive variance and standard deviation

To calculate the variance, we also need the running sum of the squares of the data points, $S_i = \sum_{j=1}^{i} x_j^2$, which updates recursively as $S_i = S_{i-1} + x_i^2$. The (population) variance is then

$$ \sigma^2_i = \frac{S_i}{i} - \bar{x}^2_i $$

Then the standard deviation is simply the square root of the variance.

$$ \sigma_i = \sqrt{\sigma^2_i} $$

It is simple to add this to our Python sample:

```
import random
import math

mean_prev = 0
sum_sq = 0  # running sum of squares, S_i
i = 0
for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    sum_sq += x ** 2
    var = sum_sq / i - mean ** 2
    std = math.sqrt(var)
    mean_prev = mean
    print("[{:d}] x={:.2f}, mean={:.2f}, std={:.2f}".format(i, x, mean, std))
```

```
[1] x=-9.89, mean=-9.89, std=0.00
[2] x=4.05, mean=-2.92, std=6.97
[3] x=1.69, mean=-1.38, std=6.09
[4] x=-2.57, mean=-1.68, std=5.30
[5] x=1.94, mean=-0.96, std=4.96
[6] x=6.77, mean=0.33, std=5.36
[7] x=9.95, mean=1.71, std=6.00
[8] x=-7.99, mean=0.49, std=6.46
[9] x=-5.04, mean=-0.12, std=6.34
[10] x=-6.69, mean=-0.78, std=6.33
```

And that's all there is to calculating the most common stream statistics.
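If you need these statistics in more than one place, the updates above are easy to wrap in a small helper class. This is just a sketch - the `StreamStats` name and its interface are my own, not from any library:

```
import math

class StreamStats:
    """Tracks the running mean, variance and standard deviation of a stream."""

    def __init__(self):
        self.i = 0         # number of examples seen so far
        self.mean = 0.0    # recursive mean
        self.sum_sq = 0.0  # running sum of squares, S_i

    def add(self, x):
        self.i += 1
        self.mean = ((self.i - 1) * self.mean + x) / self.i
        self.sum_sq += x ** 2

    @property
    def var(self):
        # Population variance: S_i / i - mean^2
        return self.sum_sq / self.i - self.mean ** 2

    @property
    def std(self):
        # Guard against tiny negative values from floating-point rounding
        return math.sqrt(max(self.var, 0.0))
```

Usage is just `stats.add(x)` for each new example, reading `stats.mean` and `stats.std` whenever they're needed.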

###### Addendum: Recursive estimate of the covariance

Something I've also had to deal with in multivariate streams is a recursive estimate of the covariance matrix. Fortunately, we can do this using the means we have already calculated. Note that the recursion divides by $i$, so it tracks the *population* covariance; if you kept all the points and calculated the Bessel-corrected sample covariance classically, you would get a slightly different matrix. But if you could fit all the points in memory, what would be the point of doing this?

$$ C_i(x,y) = \frac{(i-1) \, C_{i-1}(x,y) + (x_i - \bar{x}_i)(y_i - \bar{y}_{i-1})}{i} $$

Here is an example Java implementation from a stream clusterer I am writing:

```
private double[] totals;
private double[] mean;
private double[][] cov;
private int weight;

... (Rest of class redacted) ...

/**
 * Adds the example to the cluster by updating the cluster statistics.
 * @param example The next stream example to be added to the cluster.
 */
public void add(double[] example) {
    weight++;
    double[] prevMean = Arrays.copyOf(mean, mean.length);
    for (int i = 0; i < example.length; i++) {
        totals[i] += example[i];
        mean[i] = totals[i] / weight;
        for (int j = 0; j < example.length; j++) {
            // Pairs the updated mean of dimension i with the
            // previous mean of dimension j, as in the formula above
            cov[i][j] = (cov[i][j] * (weight - 1)
                    + (example[i] - mean[i]) * (example[j] - prevMean[j])) / weight;
        }
    }
}
```
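For comparison, here is the same update sketched in Python for a single pair of variables $(x, y)$ (the variable names are mine, not from the clusterer above). As in the Java code, the *new* mean is used for $x$ and the *previous* mean for $y$:

```
# Recursive (population) covariance of a stream of (x, y) pairs
mean_x = mean_y = cov = 0.0
i = 0
for x, y in [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]:
    i += 1
    mean_y_prev = mean_y                 # previous mean of y
    mean_x = ((i - 1) * mean_x + x) / i  # updated mean of x
    mean_y = ((i - 1) * mean_y + y) / i  # updated mean of y
    cov = ((i - 1) * cov
           + (x - mean_x) * (y - mean_y_prev)) / i
```

For these three toy pairs the final value is $1.0$, which matches the population covariance computed classically over all three points at once.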