# PCA Feature Extraction for Change Detection in Multivariate Streaming Data

I get quite a lot of enquiries asking for a code example of the titular paper:

Kuncheva, Ludmila I., and William J. Faithfull. "PCA feature extraction for change detection in multidimensional unlabeled data." IEEE transactions on neural networks and learning systems 25.1 (2013): 69-80.

So, without further ado, here is a jupyter notebook hosted on colab which demonstrates the concept. The core of the python script (sans SPLL implementation, which can be found here) is included below in case Google decides to kill colab at some point in the future.

from sklearn.decomposition import PCA
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt

# Make a stream with a change halfway along
def make_stream(stream_length, features):
a = rnd.randn(stream_length // 2, 10)

b = np.hstack((rnd.randn(stream_length // 2,5), rnd.randn(stream_length // 2, 5) * 50))
stream = np.vstack((a, b))

# just adding a bit of artificial variance to the data to avoid empty clusters
stream *= rnd.randn(features)
return stream

stream = make_stream(500, 10)

window_size = 50
length = len(stream) - (2 * window_size)

statistic_raw = []
statistic_pca = []

for i in range((window_size * 2), length):

W1 = stream[i:i+window_size, :]
W2 = stream[1+i+window_size:1+i+(2*window_size), :]

# This would be much faster with incremental PCA, but this is just PoC...
pca = PCA()
pca.fit(W1)

# Keep the least variant features
feature_indices = np.where(pca.explained_variance_ratio_ < 0.05)

if np.size(feature_indices) < 2:
# Guard against empty clusters...
print("WARN: ({:.2f}) Skipping SPLL as it would fail with empty cluster...".format(i / length))
st_raw = 0
st_pca = 0
else:
# Transform with W1's coefficients, only keep the least variant features
W1_PCA = pca.transform(W1)[:, feature_indices.tolist()]
W2_PCA = pca.transform(W2)[:, feature_indices.tolist()]

change_raw, _, st_raw = SPLL(W1, W2)
change_pca, _, st_pca = SPLL(W1_PCA, W2_PCA)

if i % 100 == 0:
print(".", end="")

statistic_raw.append(st_raw)
statistic_pca.append(st_pca)

# We see that the feature extraction is beneficial: PCA change score is higher
# before the change and lower after.
plt.plot(np.array(statistic_pca) - np.array(statistic_raw))