Are your kids home for the holidays? Here's how you can teach them least squares, the Kalman filter, and even -as we will see - a hierarchical Bayesian model. I doodled on this topic in an old blog and was inspired to port the notes here by a nice article called Least Squares as Springs written by Professor Joshua Loftus recently. I also recommend his note. He remarks:

I think there's a lot of potential here for statistics education, especially for younger students... I was surprised by how difficult it was to find any references for this elementary idea that provides physics intuition for such an important method as least squares.

I could not agree more and would go further, suggesting that the analog is quite helpful to practitioners and researchers - it helps us think - and communicate. On Professor Loftus' prompting, I'm tempted to adopt Hadley Wickham's beautiful spring plotting example next time I produce a line of best fit for any business purpose!

How does that not convey least squares? Professor Loftus makes the suggestion that educational labs could be set up with analog examples of least squares problems.

The Kalman filter is a special case of least squares (indeed it is a special case of regression, as can be seen by the Duncan and Horn formulation of the Kalman filter). So it must be possible to construct a system of rods and springs that replicate its function. Least squares is simply energy minimization in a Hookean universe - one where forces are proportional to distance.

You'll recall the Kalman filtering setup. We periodically observe Brownian motion subject to gaussian measurement error and attempt to infer the ground truth from those measurements. But rather than dive into a bunch of conditional expectations for gaussian linear systems, we'll look for some physical intuition. We'll pretend that we've been robbed of our computers and are forced to create analog varieties just as in the good old days.

We make an observation that isn't always stressed up front in the statistical literature (Harrison and West for example) or control systems perspective (such as you will find at the at wikipedia entry for "Kalman filter", for example). Then we pursue the analogy between statistics and physics a little further, and show how the updating of a location estimate of a gaussian distribution amounts to a combination of center of mass and reduced mass calculations.

The idea is that we start with masses that represent observations and a prior, and then try to simplify the physical system. Because they contain fixed and fixed and free masses, resolving these physical systems in this fashion requires that we consolidate masses but also change reference frame. This leads to two different types of calculation, both familiar.

Suppose the prior estimate of location for a particle is \(m\) and the prior covariance is \(P\). Suppose we make an observation \(y\) with error variance \(R\). Our posterior belief is gaussian with location \(m'\) say and variance \(P'\). The update is usually written: \begin{eqnarray} m' & = & m + K ( y - m ) \\ P' & = & P(1-K), \ \ {\rm where} \\ K & = & \frac{P}{P+R} \end{eqnarray} However, it is in many ways more natural to use the inverses of covariances instead. If we write \(\varphi = 1/R\), \(p = 1/P\) and \( p' = 1/P'\) and multiply through by \( \frac{P+R}{PR} \) we notice that the Kalman filter update is merely a center of mass calculation: \begin{eqnarray} m' & = & \frac{m/P + y/R} { 1/R + 1/P } = \frac{ pm + \varphi y }{ \varphi + p } \\ p' & = & \frac{1}{P'} = \frac{P+ R}{PR} = \frac{1}{P} + \frac{1}{R} = \varphi + p \end{eqnarray} The analogy works if we treat precision as mass. And in what follows we'll be equally interested in the analogy between force and the derivative of the negative log likelihood function.

This table suggests that in a gaussian world, force is linear in distance. And it is true that we can construct an analogue Kalman smoother with rods and springs as follows:

An "analogue" gaussian smoother using perfect Hookean springs |

Minimizing energy corresponds to maximizing log-likelihood. And setting the derivative of log-likelihood to zero corresponds to finding the equilibrium where forces cancel out.

Furthermore, the fact that combining two pieces of evidence for one latent variable can sometimes be as simple as merging the two observations at their "center of precision" corresponds to a nice accident when forces grow linearly with distance: the impact of two masses on a third is unchanged if they coalesce at their center of mass.

But there is more to the story...

To demonstrate a richer physical analogy, we consider next a Gaussian distribution whose location is assumed unknown, but also gaussian.

Figure 1. Hierarchical model where location of a gaussian distribution is itself gaussian |

Suppose our prior is: \begin{eqnarray} P( x | \mu ) & \propto & e^{-\rho(x-\mu)^2} \\ P(\mu) & \propto & e^{-p(\mu-m)^2} \end{eqnarray} where this time \(m\) represents our guess as to the location of the center of the distribution. Symbolically, we might represent the prior with the following diagram.

Figure 2. Spring diagram representing prior knowledge of the location of a gaussian distribution

Figure 3. Spring diagram representing noisy evidence |

Figure 4. Prior location belief plus a noisy measurement |

We ought to recall here the rules for combining springs in series, or to be more direct, the "reduced mass" trick for replacing a three body problem with a two body problem. In either situation, physics reminds us that the combined action of the rightmost two masses can be simplified:

Figure 5. Prior belief plus a noisy measurement simplified using reduced mass |

We replace the mass \(\phi\) with a reduced mass \(\frac{\phi}{\phi+\rho}\) because the intermediating unit mass reduces the pull. Since it is well covered elsewhere, I will not derive the reduced mass expression but notice why the reduced mass makes sense in the limits. If \(\phi \rightarrow 0\) the relative size of the yellow unit mass is huge and so the mass at \(\mu\) hardly feels the pull from the green mass at \(y\) at all. In the other extreme case, when \(\phi \rightarrow \infty\), the unit mass is sucked into the green mass and is, for all intents and purposes, stationary. Thus it acts like a fixed unit mass pulling the mass at \(\mu\) rather than a floating one.

We proceed to the final simplification of the diagram. This is pretty easy as the two green masses are inertial. Their impact on the yellow mass is equivalent to a single inertial mass at their center of mass. Thus:

Figure 6. Simplification of Figure 4 by reduced mass and center of mass calculation. |

This closes the loop and demonstrates how updating can be performed for the hierarchical model in Figure 1.

As a parting note, we see that the limit \(\rho \rightarrow \infty\) leads to update equations \begin{eqnarray} p' & = & \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} = \frac{p+p/\rho + \varphi}{\varphi +\rho} \rightarrow \varphi + p \\ m' & = & \frac{ m \frac{p}{\rho} + y \frac{\varphi}{\varphi+\rho} } { \frac{p}{\rho} + \frac{\varphi}{\varphi + \rho} } \rightarrow \frac{ pm + \varphi y }{ \varphi + p } \end{eqnarray} which is the Kalman update as before. This is to be expected, since in the limit \(\rho \rightarrow \infty\) the problem of locating a distribution with unknown location (noisily observed) shrinks down to the problem of locating a point mass with unknown location (noisily observed).

And there you have it. An implementation of time series filtering, and the Kalman filter as a special case, that is capable of withstanding a power outage.

If you are interested in the relationship between Physics and Statistics, I suggest going (a lot) further than my noodling. I recently discovered a nice paper by Gabor Szekeley and Maria Rizzo's titled *Energy Statistics: A Class of Statistics Based on Distances *(pdf). It may help you carry over your physics intuition to statistics. It is also very practical. You can use the scipy.stats.energydistance for example (or R energy package), for a rotation invariant distance measure for multivariate distributions. The properties of this metric are non-trivial, as you'll see in the paper. It is possible that you have already encountered energy distance in another guise. Maximum mean discrepancy (MMD) is the term used in Machine Learning and a paper by Sejdinovic et al demonstrates the equivalence. Indeed they argue that MMD provides a generalization and thus more powerful tests.

On the other hand if you are interested in analog computing there's plenty out there. I'm sure you're aware of Babbage but here are a few lessor known examples:

- The use of analog tidal calculators (Kelvin) arguably turned the course of history (link)
- The Julius Totalizator of 1913, discussed in the first sixty seconds of this talk, is celebrated in the totalizator history site maintained by Brian Colon.
- A modern day analog computer on a chip (article)

Send me your favorites or join the discussion here.

## Comments