An Introduction to Robust Statistics

Better MAD than mean.

Ah, the joys of the school maths classroom. I’m not sure how old I was when I first learned the “three Ms” – mean, median, mode – but I’m certain I was in primary school. Given some numbers, finding the mode was easy – just pick the number that appears the most times. Finding the median was easy – just line them up in order and pick the middle one. But the mean? Oh dear. The mean was mean. To find it you had to do a lot of Complicated Adult Maths like adding everything up and dividing and making sure you typed everything into your calculator just right: one badly-placed decimal point could spell your doom and make you do it all again. Even then, the answer you got was usually a gnarly fraction, and since it wasn’t actually one of the data points, could it really be said to be an “average” of them? Isn’t it kind of artificial somehow?

A millionaire walks into a bar. The ‘average’ wealth of all bar patrons has just tripled. What wonderful sharing of prosperity.

Nevertheless, when we think of an ‘average’ in our adult lives, we usually think of the mean. It’s the grown-up average. It “uses all of the information”, which… must be inherently good! After all, it couldn’t possibly be that some of the information we have is complete and utter junk, could it?

Spoiler: in the real world, far more often than you would think, some of the information is indeed complete and utter junk. We don’t live in a world of spherical cows in a vacuum and pretending we do doesn’t help us do good science.

“a system which has spherical symmetry, and whose state is changing because of chemical reactions and diffusion … cannot result in an organism such as a horse, which is not spherically symmetrical.” ~Alan Turing, The Chemical Basis of Morphogenesis

Trouble is, if you’ve got a lot of information, it’s pretty hard to work out which of it is good information and which of it is bad information. Sorting out the good from the bad like this is the central premise of my research in anomaly detection. Lots of people do it in lots of different ways, and it’s a bit of a complicated mess sometimes. But let’s put most of that aside for now, and go back to the childhood maths classroom.

Avoiding a Breakdown

Let’s say that the class is trying to count the number of daisies on the school field. To do this, each member of the class is given a hoop with an area of 1 square metre and told to throw it out to a random spot on the field and count how many daisies lie inside it when it lands. Find the average of the children’s counts, multiply by the area of the field, and there’s your estimate!

After some delighted hoop-throwing and flower-counting, you obtain the following data points from your study.

$$(31, 17, 14, 22, 185, 27, 236)$$

Rosie and Johnny both swear up and down that their hoops just landed that way in a big daisy clump and they definitely didn’t intentionally set them down there in a way that would bias the results, oh no. You’re slightly skeptical, but who are you to question the wisdom of six-year-olds? You run the calculations.

$$(31 + 17 + 14 + 22 + 185 + 27 + 236)/7 = 532/7 = 76$$

This upsets some of the other six-year-olds, who grumble that Rosie and Johnny ruined the experiment. Not only did they possibly set their hoops down unfairly, they also probably didn’t even count all of those daisies and just guessed! And since they’re guessing really big numbers, guessing them even a little bit wrongly can completely wipe out all the careful counting the rest of the children did.

I am reliably informed that the correct biological term for this is ‘quadrat sampling’ and it is usually done with squares. However, I do not care, because I am a mathematician and not a biologist, and my explanatory examples can be whatever I want them to be.

Those kids are absolutely correct to be upset. Anomalous values aren’t just bad because they’re anomalous, they’re bad because they disproportionately influence the dataset and overwhelm the sensitive non-anomalous information it contains.

In the field of robust statistics, there is a concept of a ‘breakdown point’ to determine how robust an estimator is. The breakdown point of an estimator is the proportion of arbitrarily incorrect observations that estimator can handle before giving an arbitrarily incorrect result. That is, how many junk data points (and if a data point is junk, it can be as junk as you want – say Rosie had reported 1000 daisies, or 10000, or a million and three) can be in your data before your estimator becomes junk (not just a little bit biased, but really junk).

The breakdown point of the mean is a big fat 0%. This is because one single junk data point can ruin the whole thing. However, the breakdown point of the median is, in this case, 3/7 (approximately 43%), because any three junk data points – no matter if they’re small junk or big junk or even negative numbers junk – can’t influence the median enough to make it one of the junk points. For really big samples of data, the breakdown point of the median will approach 50%.

To refresh your primary school mathematics, we calculate the median by ordering the values and taking the middle as follows:

$$(14, 17, 22, \textbf{27}, 31, 185, 236)$$

The kids are overall much happier about this method (Connie, in particular, is ecstatic about how her value of 27 was ‘chosen’).
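To see the two breakdown points in action, here is a minimal Python sketch using only the standard library. The helper `replace_largest` is a made-up name for illustration, and treating the largest count as the junk one is an assumption; it simply swaps that count for an arbitrarily bad value.

```python
from statistics import mean, median

counts = [31, 17, 14, 22, 185, 27, 236]


def replace_largest(data, junk):
    """Return a copy of data with its largest value swapped for junk (hypothetical helper)."""
    worst = max(data)
    return [junk if x == worst else x for x in data]


print(mean(counts), median(counts))

for junk in (1_000, 10_000, 1_000_003):
    corrupted = replace_largest(counts, junk)
    # The mean chases the junk value; the median stays where it was.
    print(junk, round(mean(corrupted), 1), median(corrupted))
```

However large the junk value gets, the median should stay at Connie’s 27, while the mean wanders off after it.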

Measures of Spread

After a measure of centrality like the mean (or the median), the most valuable one-number-summary to know about a dataset is a measure of spread. How far away are the data points from each other? How well does your centrality measure actually describe what an ‘average’ (randomly chosen) data point looks like?

In our non-robust world, the standard deviation is the go-to measure of spread. It’s the root mean square of the distances between all points and the mean. Obviously, since the mean is involved in the calculations, the standard deviation has a breakdown point of 0%. This is bad.

Drawing on our previous experience with the median fixing our problems, let’s examine the inter-quartile range (IQR) and see if it helps us. To recap, the IQR is the difference between the first and third quartiles: if the median can be thought of as 50% of the way ‘up’ the dataset, then the IQR is the difference between 25% of the way up and 75% of the way up.

For discrete datasets, how to pick actual numbers for the quartiles is a bit disputed, but for the sake of this post I’m using the first method that appears on Wikipedia.

$$(14, \textbf{17}, 22, 27, 31, \textbf{185}, 236)$$

$$185 - 17 = 168$$

Oh no. We have a problem!

The breakdown point of the IQR for this dataset is only 1/7 (or in the case of a larger dataset, approaching 25%). This is because, even though you could tolerate 50% of the data being anomalous provided the data anomalies were spread evenly in both directions, we are concerned with the worst case – all of the anomalies off to the same side.
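As a rough sketch of that worst case, the Python below implements one reading of the quartile rule used above: split the ordered data at the median, leave the median out of both halves when the count is odd, and take the median of each half. The function names `quartiles` and `iqr` are my own, and the inflated counts are invented purely to show the effect.

```python
from statistics import median


def quartiles(data):
    """First and third quartiles: medians of the lower and upper halves.

    For an odd number of points the overall median is left out of both halves.
    """
    ordered = sorted(data)
    if len(ordered) < 2:
        raise ValueError("need at least two data points")
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def iqr(data):
    q1, q3 = quartiles(data)
    return q3 - q1


counts = [31, 17, 14, 22, 185, 27, 236]
print(quartiles(counts), iqr(counts))

# Two junk points on the same side: the third quartile is now one of them.
inflated = [31, 17, 14, 22, 1850, 27, 2360]
print(quartiles(inflated), iqr(inflated))
```

With two anomalies sitting at the top of seven points, the third quartile lands on one of them, so the IQR goes wherever the junk goes.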

Can we do any better? Robust statistics tells us that yes, we can.

The Median Absolute Deviation (MAD) is what you get when you apply median-thinking to the way of deriving the standard deviation. It’s best explained by a concrete example.

$$(14, 17, 22, \textbf{27}, 31, 185, 236)$$

Find the distance from each point to the median.

$$(13, 10, 5, 0, 4, 158, 209)$$

Reorder, and find the median of those distances.

$$(0, 4, 5, \textbf{10}, 13, 158, 209)$$

The MAD has a breakdown point of 50%, twice that of the IQR. This is because it doesn’t matter what direction the anomalies occur in – they’ll all be lumped up at the top end of the reordered data. It’s allowed us to calculate a measure of spread that isn’t massively affected by the anomalies in the dataset. Since we live in a world where many people only ‘get’ the standard deviation as a measure of spread, we rescale by a constant factor to make the MAD a (robust) estimate for the standard deviation that is consistent when the data is normally distributed. Turns out that constant is 1.4826. (And my careful choosing of examples to avoid having to deal with decimals was going so nicely, too…)

$$\text{MAD} = 10 \times 1.4826 = 14.826$$
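The same steps can be sketched in a few lines of Python with the standard library. The function name `mad` and its `scale` parameter are assumptions for illustration, not any particular library’s API, and the inflated counts are invented to show what happens when the junk gets junkier.

```python
from statistics import median, pstdev

NORMAL_SCALE = 1.4826  # rescaling constant quoted in the text


def mad(data, scale=1.0):
    """Median of the absolute distances from the median, times an optional scale."""
    values = list(data)
    centre = median(values)
    return scale * median(abs(x - centre) for x in values)


counts = [31, 17, 14, 22, 185, 27, 236]
print(mad(counts))                      # raw MAD
print(mad(counts, scale=NORMAL_SCALE))  # rescaled to estimate the standard deviation
print(pstdev(counts))                   # non-robust standard deviation, for comparison

# Make the two big counts ten times worse: the raw MAD should not move.
inflated = [31, 17, 14, 22, 1850, 27, 2360]
print(mad(inflated), pstdev(inflated))
```

The standard deviation is dragged around by the two large counts, whereas the MAD only ever looks at the middle of the sorted distances.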

Can we ever do better than a breakdown point of 50%? Intuitively, no: if more than half of the data can be whatever we want it to be and not unduly influence our estimator, then we can flip our thinking as to which points are ‘anomalous’ and say that less than half the data does unduly influence our estimator. Contradiction.

50% is pretty good going, though. We can get useful results with almost half our dataset being junk. Many of the other robust estimators for other statistical quantities can only wish they had a breakdown point this high.

Breakdown points are important. Don’t get mean, get MAD.

For more reading on Robust Statistics, as well as how it applies to anomaly detection as a field specifically, check out this paper from 2018 by Rousseeuw and Hubert.