With multiple vaccine Phase III trials coming back with promising interim results, I’d like to take a moment to talk about a bias in statistics called “survivorship bias”, and how it can skew our thinking.

Wikipedia’s definition of survivorship bias is as follows:

“Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to false conclusions in several different ways. It is a form of selection bias.”

Usually, survivorship bias causes us to be unreasonably optimistic. For instance, a meta-analysis of scientific studies in a field will often produce more evidence supporting a hypothesis than is actually true, because the studies that didn’t find any evidence for that hypothesis never got published (shout-out to the Journal of Articles in Support of the Null Hypothesis for doing what it can to address this problem).

Sometimes survivorship bias can lead us to damaging conclusions. For example, look at this picture of what parts of planes got shot on WW2 air raids, and think about where you would put armour on the planes if you could edit their design.

The obvious (and wrong) answer is to put armour on the parts of the plane with the most holes so it will stop the most bullets. The correct answer, of course, is to put armour on the bits of the plane without bullet holes, because planes that got shot in those places *didn’t make it back to be counted*.

Here, the “people or things” we are interested in are infectious diseases. And the “selection process” is not getting relegated to the history books by the development of a successful vaccine. The battleground of public health is littered with these kinds of corpses. Mumps, rubella, tetanus, polio, whooping cough, smallpox… and yet there are also survivors, like HIV and influenza and measles.

Because HIV and influenza and measles are survivors, they’re the ones we hear about in the papers. And because of that, they’re the ones we unconsciously consider in the back of our minds when weighing up probabilities, including probabilities of success for vaccine development and deployment. COVID has been compared a lot to the flu, for example, but rarely, if at all, to polio (despite the fact that the 2020 pandemic has revived iron lungs as ventilator substitutes in countries that don’t have enough of them).

But each of HIV, influenza, and measles has a reason *why* it hasn’t been consigned to the history books. They are, in summary (and with apologies for the oversimplifications):

- HIV infects by getting eaten by the immune system and then slowly destroying it from the inside, so the standard vaccine idea of “make the immune system eat it quickly before it causes disease” is a non-starter, and most vaccine trials for HIV fail because of this (some even make things worse).
- Influenza mutates really, really fast. It doesn’t have the “check your work for mistakes” proteins many other viruses do, having sacrificed them and the advantages they bring in favour of an infection strategy specifically based around mutating enough to infect the same people over and over. It’s so far been impossible to develop a vaccine that the flu can’t mutate its way around, though we have a good crack at it every year.
- Measles is super duper infectious. Remember how everyone’s been talking about the R rate for COVID which is about 3 in a normal population (without social distancing and the like)? The R rate for measles is about 15, which is far higher than any of the other diseases we have vaccines for. This means that even though we have a vaccine, we need to have 95% coverage to stop measles causing outbreaks, and the presence of just a few people who can’t or won’t get vaccinated stops us from getting rid of it for good.

Enter COVID, stage left, November 2019. COVID is an infectious and dangerous disease, so in estimating how it will behave, we unconsciously compare it to the other infectious and dangerous diseases most prominent in the back of our minds. This leads us to an overly *pessimistic* outlook because of survivorship bias, which is a nice change from the usual.

Truth is, the answer to the question of “why haven’t we eradicated covid yet?” is probably “we haven’t had the time”, rather than something more substantial preventing successful development and deployment of a vaccine. This is borne out by the interim trial results, which report that the standard way of vaccine making works far better for COVID than it does for, for instance, HIV. COVID has “check your work for mistakes” proteins, so it mutates much more slowly than the flu, and we’re not seeing anywhere near the amount of reinfection that would cause us to be seriously worried on this front. And with an R rate of 3, we only need about 75% vaccine coverage to wipe COVID out from society, so the antivax movement isn’t really big enough (yet) to present a COVID-related health problem.
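The coverage figures come from the standard herd-immunity back-of-the-envelope calculation: if a fraction 1 - 1/R of the population is immune, each case infects fewer than one other person on average and outbreaks die out. A minimal sketch (the quoted thresholds are a little higher than this formula gives, partly because real vaccines aren’t perfectly effective):

```python
def herd_immunity_threshold(r):
    """Fraction of the population that must be immune for outbreaks to
    die out, using the textbook approximation 1 - 1/R."""
    return 1 - 1 / r

# COVID (R around 3) vs measles (R around 15)
print(round(herd_immunity_threshold(3), 2))   # 0.67
print(round(herd_immunity_threshold(15), 2))  # 0.93
```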

Hopefully, as this pandemic gets mopped up, we’ll see less comparing COVID to the flu, and more comparing COVID to polio, in the graves of the history books where it belongs.

Ah, the joys of the school maths classroom. I’m not sure how old I was when I first learned the “three Ms” – mean, median, mode – but I’m certain I was in primary school. Given some numbers, finding the mode was easy – just pick the number that appears the most times. Finding the median was easy – just line them up in order and pick the middle one. But the mean? Oh dear. The mean was *mean*. To find it you had to do a lot of Complicated Adult Maths like adding everything up and dividing and making sure you typed everything into your calculator just right: one badly-placed decimal point could spell your doom and make you do it all again. Even then, the answer you got was usually a gnarly fraction, and since it wasn’t *actually* one of the data points, could it really be said to be an “average” of them? Isn’t it kind of artificial somehow?

Nevertheless, when we think of an ‘average’ in our adult lives, we usually think of the mean. It’s the grown-up average. It “uses all of the information”, which… must be inherently good! After all, it couldn’t possibly be that some of the information we have is complete and utter junk, could it?

Spoiler: in the real world, far more often than you would think, some of the information is indeed complete and utter junk. We don’t live in a world of spherical cows in a vacuum and pretending we do doesn’t help us do good science.

Trouble is, if you’ve got a lot of information, it’s pretty hard to work out which of it is good information and which of it is bad information. Sorting out the good from the bad like this is the central premise of my research in anomaly detection. Lots of people do it in lots of different ways, and it’s a bit of a complicated mess sometimes. But let’s put most of that aside for now, and go back to the childhood maths classroom.

Let’s say that the class is trying to count the number of daisies on the school field. To do this, each member of the class is given a hoop with an area of 1 square metre and told to throw it out to a random spot on the field and count how many daisies lie inside it when it lands. Find the average of the children’s counts, multiply by the area of the field, and there’s your estimate!

After some delighted hoop-throwing and flower-counting, you obtain the following data points from your study.

$$(31, 17, 14, 22, 185, 27, 236)$$

Rosie and Johnny both swear up and down that their hoops just *landed* that way in a big daisy clump and they definitely didn’t intentionally set them down there in a way that would bias the results, oh no. You’re slightly skeptical, but who are you to question the wisdom of six-year-olds? You run the calculations.

$$(31 + 17 + 14 + 22 + 185 + 27 + 236)/7 = 532/7 = 76$$

This upsets some of the other six-year-olds, who grumble that Rosie and Johnny ruined the experiment. Not only did they possibly set their hoops down unfairly, they also probably didn’t even count all of those daisies and just guessed! And since they’re guessing really big numbers, guessing them even a little bit wrongly can completely wipe out all the careful counting the rest of the children did.

Those kids are absolutely correct to be upset. Anomalous values aren’t just bad because they’re anomalous, they’re bad because they disproportionately influence the dataset and overwhelm the sensitive non-anomalous information it contains.

In the field of robust statistics, there is a concept of a ‘breakdown point’ to determine how robust an estimator is. The breakdown point of an estimator is the proportion of arbitrarily incorrect observations that estimator can handle before giving an arbitrarily incorrect result. That is, how many junk data points (and if a data point is junk, it can be *as junk as you want* – say Rosie had reported 1000 daisies, or 10000, or a million and three) can be in your data before your estimator becomes junk (not just a little bit biased, but really junk).

The breakdown point of the mean is a big fat 0%. This is because one single junk data point can ruin the whole thing. However, the breakdown point of the *median* is, in this case, 3/7 (approximately 43%), because any three junk data points – no matter if they’re small junk or big junk or even negative numbers junk – can’t influence the median enough to make it one of the junk points. For really big samples of data, the breakdown point of the median will approach 50%.

To refresh your primary school mathematics, we calculate the median by ordering the values and taking the middle as follows:

$$(14, 17, 22, \textbf{27}, 31, 185, 236)$$

The kids are overall much happier about this method (Connie, in particular, is ecstatic about how her value of 27 was ‘chosen’).
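If you want to check the breakdown behaviour yourself, a couple of lines of Python (using the daisy counts above) will do it:

```python
import statistics

daisies = [31, 17, 14, 22, 185, 27, 236]
print(sum(daisies) / len(daisies))   # 76.0
print(statistics.median(daisies))    # 27

# Make one junk point *as junk as you want* -- the mean explodes,
# but the median doesn't budge
daisies[-1] = 1_000_003
print(sum(daisies) / len(daisies))
print(statistics.median(daisies))    # still 27
```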

After a measure of centrality like the mean (or the median), the most valuable one-number-summary to know about a dataset is a measure of spread. How far away are the data points from each other? How well does your centrality measure actually describe what an ‘average’ (randomly chosen) data point looks like?

In our non-robust world, the standard deviation is the go-to measure of spread. It’s the root mean square of the distances between all points and the mean. Obviously, since the mean is involved in the calculations, the standard deviation has a breakdown point of 0%. This is bad.

Drawing on our previous experience with the median fixing our problems, let’s examine the inter-quartile range (IQR) and see if it helps us. To recap, the IQR is the difference between the first and third quartiles: if the median can be thought of as 50% of the way ‘up’ the dataset, then the IQR is the difference between 25% of the way up and 75% of the way up.

For discrete datasets, how to pick actual numbers for the quartiles is a bit disputed, but for the sake of this post I’m using the first method that appears on Wikipedia.

$$(14, \textbf{17}, 22, 27, 31, \textbf{185}, 236)$$

$$185 - 17 = 168$$

Oh no. We have a problem!

The breakdown point of the IQR for this dataset is only 1/7 (or in the case of a larger dataset, approaching 25%). This is because, even though you could tolerate 50% of the data being anomalous provided the data anomalies were spread evenly in both directions, we are concerned with the worst case – all of the anomalies off to the same side.
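In code, that quartile recipe (Wikipedia’s ‘Method 1’: drop the median for odd-length data and take the median of each half) looks something like this:

```python
import statistics

def iqr_method1(data):
    """Inter-quartile range via 'Method 1': split the sorted data at
    the median (excluding it when the length is odd), then subtract
    the median of the lower half from the median of the upper half."""
    s = sorted(data)
    n = len(s)
    lower, upper = s[: n // 2], s[(n + 1) // 2 :]
    return statistics.median(upper) - statistics.median(lower)

print(iqr_method1([31, 17, 14, 22, 185, 27, 236]))  # 168
```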

Can we do any better? Robust statistics tells us that yes, we can.

The Median Absolute Deviation (MAD) is what you get when you apply median-thinking to the way of deriving the standard deviation. It’s best explained by a concrete example.

$$(14, 17, 22, \textbf{27}, 31, 185, 236)$$

Find the distance from each point to the median.

$$(13, 10, 5, 0, 4, 158, 209)$$

Reorder, and find the median of those distances.

$$(0, 4, 5, \textbf{10}, 13, 158, 209)$$

The MAD has a breakdown point of 50%, twice that of the IQR. This is because it doesn’t matter what direction the anomalies occur in – they’ll all be lumped up at the top end of the reordered data. It’s allowed us to calculate a measure of spread that isn’t massively affected by the anomalies in the dataset. Since we live in a world where many people only ‘get’ the standard deviation as a measure of spread, we rescale by a constant factor to make the MAD a (robust) estimate for the standard deviation that is consistent when the data is normally distributed. Turns out that constant is 1.4826. (And my careful choosing of examples to avoid having to deal with decimals was going so nicely, too…)

$$\text{MAD} = 10 \times 1.4826 = 14.826$$
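The whole calculation fits in a few lines, using Python’s `statistics` module:

```python
import statistics

def mad(data, scale=1.4826):
    """Median absolute deviation, rescaled so it consistently estimates
    the standard deviation when the data is normally distributed."""
    centre = statistics.median(data)
    return scale * statistics.median(abs(x - centre) for x in data)

print(mad([31, 17, 14, 22, 185, 27, 236]))  # approximately 14.826
```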

Can we ever do better than a breakdown point of 50%? Intuitively, no: if more than half of the data can be whatever we want it to be and not unduly influence our estimator, then we can flip our thinking as to which points are ‘anomalous’ and say that less than half the data *does* unduly influence our estimator. Contradiction.

50% is pretty good going, though. We can get useful results with almost half our dataset being junk. Many of the other robust estimators for other statistical quantities can only wish they had a breakdown point this high.

For more reading on Robust Statistics, as well as how it applies to anomaly detection as a field specifically, check out this paper from 2018 by Rousseeuw and Hubert.

I mention this because I learned a lot from it – not least, how to use proper grammar and sentence structure – but especially because the world of creative writing was the first place I ever encountered the emotional ‘stretch’ associated with large-scale self-motivated projects. At first I was pretty terrible at things like keeping myself on track and maintaining a coherent plan and getting words on paper even when I wanted to curl up and die of anxiety and shame at how terrible my work was. But as I bounced from one story to another, the projects I undertook getting more coherent and complex each time, I knew I was on an upward trajectory heading… somewhere, I guess?

Then I got into university and the Cambridge Mathematical Tripos sacrificed my creative heart on the altar of problem sheets and examinations. But that’s a story for another time.

Now I stand on the brink of a new phase of my life, with an awful lot more maths know-how and relevant industrial experience and a respectable research topic direction, but many of the emotional struggles of my daily life are very familiar from my teenage fanfiction days. I recently attended a workshop organised for the STOR-i MRes cohort entitled “Building Resilience and Tackling your Inner Critic” and delivered by the amazing Tracy Stead. Most of what follows is my notes from that workshop, tidied up and interspersed with my own thoughts on the subject.

Tracy introduced a metaphor that I particularly like: the idea of motivation and confidence as a bucket with holes in. Though the bucket is always draining, certain things drain the bucket faster, and certain things fill it up. It’s nearly impossible to do productive work when the bucket is close to empty. Therefore, we should make sure to keep the bucket topped up and avoid poking additional holes in it – that way, we’ll be more productive overall.

According to the 2019-20 STOR-i MRes cohort, these things drain the bucket:

- Conflicts with other people
- Code doesn’t work, I don’t know why or how to fix it
- When I forget what I’ve read or how to do something ‘basic’
- Working alone without people who I can talk to that understand what I’m doing
- Seeing a really good piece of academic work and thinking “I couldn’t possibly measure up to this, so why am I even trying?”

These things, however, top it up:

- Talking to someone supportive
- Getting good results, or quick wins
- Taking a break and doing something else for a while
- Breaking new ground, starting a new project

I note that the only thing that fills the bucket that is both under conscious control (we can’t have research breakthroughs on command) and isn’t ‘do something else’ involves talking to someone. Maybe STOR-i is onto something with its philosophy of “A PhD is not a solo activity”.

Tracy’s “official” answer to the question of things that fill the bucket (or build resilience) is organised into a five-point plan (and I love myself a good five-point plan):

- Relationships – cohorts (“part of a team”), networks (“people find me valuable”), supervisors (“I can ask for help”), getting outside perspective on my problems so they don’t seem so bad.
- Optimism – focus on the future not the past, goals and milestones and visions, figuring out: who do I want to be? Continuously striving for improvement
- Coping skills – build confidence, lead a healthy lifestyle, make your rituals and habits productive ones, minimise unwanted stressors in your life.
- Competence – invest in skills and problem solving, experience, knowledge.
- Emotional intelligence – If I notice, name, choose, and communicate my emotions, that will help me think more rationally about things.

I am a naturally anxious person, and especially so around other people. I absolutely dread meetings with my supervisors. (Sorry Idris & Paul if you’re reading this – it’s not you, I swear, you’re both lovely). My internal monologue in the lead-up to a meeting can get incredibly distorted and critical. This is Tracy’s titular *inner critic*, and is a problem definitely shared in various degrees by many of the MRes cohort. Here are some of the things our inner critics say about meetings with supervisors:

- I might say something stupid.
- I haven’t done enough this week and my supervisors will think I’m lazy.
- The supervisors might think I’m a mistake and I’m not good enough for this project.
- I’m not prepared for this meeting.
- Why haven’t I come up with any good ideas?
- I should know the answer to this.
- What I’ve done didn’t work, so it’s useless and there’s no point sharing it.

Tracy talked about re-framing each of these distorted thoughts into positives to build our motivation (fill that bucket!) and help us move forward.

- *What I’ve done didn’t work, so it’s useless and there’s no point sharing it.* What hasn’t worked is a stepping stone towards what will work, so it’s good to share it.
- *I should know the answer to this.* The meeting is an opportunity to learn from an expert. The point of being here is to learn.
- *Why haven’t I come up with any good ideas?* It’s going to be great to talk about this with someone who understands the field. It’ll probably help me come up with good ideas.
- *I’m not prepared for this meeting.* This meeting is not a viva and my supervisors are not grading me on the quality of my preparation or performance.
- *The supervisors might think I’m a mistake and I’m not good enough for this project.* This stuff is actually really hard and the only reason my supervisors find it ‘easy’ is because they’ve been here a lot longer than me. They know that and they’re not expecting perfection from me.
- *I haven’t done enough this week and my supervisors will think I’m lazy.* I like my supervisors and it’s good to talk with them, especially during those times when I’m unmotivated and unproductive.
- *I might say something stupid.* I am definitely going to say something stupid. My supervisors will hopefully correct me. This is a good thing.

I don’t just write stories. I also write and perform poetry and songs. (If you meet me in person and want to put me in a good mood, give me a ukulele and ask me to sing you the song about hiding the dead bodies – it’s one of my favourites).

I have recently started becoming acquainted with the literature in the field of anomaly detection. This has inspired the following song, to the tune of *99 bottles of beer*:

99 papers on my to-read list

99 papers to read

Get through one, skim the references

127 papers on my to-read list.

“Projects” are constrained and time-bound. They have deadlines and well-defined stages and a (mostly) linear sense of progression. If you’re off-track or off-schedule, something is wrong. Contrast with “research” which is inherently expansive and messy, where going off-track is pretty much the norm and progression happens in jolts and lightbulbs interspersed with periods of what can at the time seem like aimless drifting through an ever-expanding mass of awful. (At least, that’s what the second-year PhDs tell me – they call it the “valley of shit”).

To get through this valley with sanity intact, Tracy asks us to pay attention to the environment around us and design a ‘diet plan’ of daily activities to top up our motivation and creativity. Here’s what I came up with:

- Exercise: in the morning and during breaks
- Talk to people doing similar things, to enhance and refine your own perspective.
- Spend time being bored: your mind wanders and you have new ideas.
- Randomness and lack of routine. Breaking up habits.
- Meditation, meditative activities (colouring books, cleaning)
- Hold yourself accountable to self-imposed deadlines by telling other people about them.

Tracy covered a lot more than I’ve written about: impostor syndrome, task prioritisation, how to reflect and learn from experiences in a healthy way rather than dwelling on the negatives, etc. Here are two thoughts I had about writing that cropped up during the rest of the workshop that don’t relate much to anything I’ve said before:

- Software. The brain was not *actually* designed to write documents in a linear fashion on a computer. We’re hampered so much by Microsoft Word and everything that looks like it (yes, this includes LaTeX) in ways that we don’t realise until we try something different. In my teenage years, I used Scrivener to organise my writing and thoughts in a heavily non-linear fashion. I’ve heard it’s pretty bad at rendering maths, though, so I’m on the lookout for better options for the PhD life.
- Cycles of inspiration and criticism, drafting and re-drafting. You can’t edit an empty page. First drafts don’t need to be perfect. Creativity and perfectionism are antagonists, mood-wise. You need both, but you should separate them so they don’t screw each other up.

Head over to https://nextstrain.org/, an open-source project dedicated to genomic sequencing for real-time tracking of pathogen evolution, and teach yourself to interpret phylogenies!

These algorithms usually have starting values, and picking the wrong starting value can affect how long the algorithm needs to run to get a “good enough” answer. But what if you didn’t need to pick a starting value or an amount of time to run the algorithm, and it just gave you the exact right answer? This report examines one way to do this, called Coupling From The Past, or CFTP for short.

Instead of running our algorithm into the future (where no matter how long we run it we’ll always be a tiny bit away from the true answer), CFTP asks us to pretend that our algorithm has been running for an infinite time in the past up until now. It then finds out where we would be if that was the case. Since the algorithm has been running for an infinitely long time, it turns out we must be exactly at the right answer. CFTP works by considering every single possible starting value, and then tries to make them “couple up” to end up at the same place as quickly as possible by fiddling about with the random numbers the algorithm uses. Then it doesn’t matter where the algorithm started from infinitely far in the past, because all the roads lead here!

CFTP is quite a tricky method. It only works for some algorithms and needs to have random numbers, and running something “from the past” is a lot harder to get your head around than running it the normal way. In particular, you need to be very careful where you’re getting those random numbers from, or the end answer won’t be right.

Luckily, we can tell the computer to give us the same random numbers we asked it for a while ago by setting a random “seed”. Without this, CFTP wouldn’t be possible.
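To make this concrete, here is a toy CFTP sketch (my own illustration, not taken from the original paper) for a two-state Markov chain. The crucial trick is reusing the *same* random numbers every time we extend further into the past:

```python
import random

def cftp_two_state(a=0.3, b=0.2, seed=1):
    """Coupling From The Past for a toy chain on {0, 1}: from state 0
    we jump to 1 with probability a; from 1 we drop to 0 with
    probability b.  The shared update below is monotone when
    a + b <= 1, so all starting values eventually couple."""
    rng = random.Random(seed)
    randoms = []                      # u_{-1}, u_{-2}, ... (reused!)
    T = 1
    while True:
        while len(randoms) < T:       # extend further into the past
            randoms.append(rng.random())
        final = set()
        for start in (0, 1):          # every possible starting value
            x = start
            for t in range(T - 1, -1, -1):   # apply u_{-T}, ..., u_{-1}
                u = randoms[t]
                threshold = a if x == 0 else 1 - b
                x = 1 if u < threshold else 0
            final.add(x)
        if len(final) == 1:           # all roads lead here
            return final.pop()        # an exact stationary draw
        T *= 2                        # not coupled yet: go further back
```

Averaged over many independent seeds, the returned samples follow the stationary distribution exactly: state 1 appears with probability a/(a+b), which is 0.6 for the defaults above.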

See the creator of CFTP’s website http://www.dbwilson.com/exact/ to get more into the mathematics of this algorithm.

The central concept was this: any AI will have a big long explanation as to why it did what it did (maybe in the form of a polynomial or large matrix of weights) that is in most cases too long for a human to understand. You need the explanation to be shorter (while still being an explanation). The way you do this is to impose some kind of “penalty” related to the size of the explanation, and hope that you can get a tradeoff between an AI that classifies well and an AI with short explanations. Example penalties include sparsity constraints in generalised linear models (such as Lasso regression) and restrictions on the allowable depth of decision trees.

However, these are internal methods. They are things you consider when you are designing your AI. Suppose you weren’t allowed to do that – maybe you absolutely *must* get the best classification performance possible and weren’t allowed to factor in things like explainability, or maybe the internal structure of your classifier is just not suited to making its explanations shorter via any method you’ve found so far. What do you do then?

A trained classifier will take an input, for example an image, and output one of a preset number of categories to which it belongs, for instance either “dog” or “cat”. In actuality, classifiers give a score: a number between 0 and 1 saying “how doggy vs catty is this picture”. All pictures above a threshold are classified as dogs and those below as cats – setting that threshold is the job of the person who trains the model, and is based on a tradeoff between the costs of misclassification in either direction.

However, the score here is what’s important. The trained classifier can be thought of as a function f:{Space of all possible images of correct resolution} -> [0, 1], without caring about its internal workings at all. It’s even (usually) a continuous function for some kind of notion of “continuity”. And what do we do in OR when we find an interesting continuous function?

We optimise it, of course!

Now, the good thing here is we’re not really looking for global optima. We’re not particularly interested in constructing the “doggiest dog that ever did dog”[1], and thank goodness we aren’t, because that image space is extremely high-dimensional (number of dimensions = number of pixels). Instead, we can look at what our optimiser does when we give it something fun from our test set as a starting point.

“Here is a dog – please make it more of a cat”. “Here is a cat – please make it EVEN MORE of a cat”. “Here is a cat that you misclassified as a dog – please make it more cat”[2]. By looking at what the optimiser does in response to this, you can learn what features the classifier is considering “important” in relation to that particular image. Does it make the ears rounder when asked to go more doglike, for instance?

Turns out you don’t even need an optimiser – just by using a high-dimensional gradient estimator, you can create a map of what pixels the classifier considers “locally important” in an image. There’s an old urban myth about a classifier for American vs Soviet tanks that was actually classifying the background light levels as cloudy vs sunny due to the days on which the training set photos were taken, and therefore failed spectacularly in live use[3]. Such a classifier problem is easy to spot if you can tell that the classifier is mainly looking at the backgrounds of an image rather than the features it is supposed to be identifying.
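As a sketch of that black-box idea (a crude finite-difference version I’ve written for illustration, not any particular published method), here is a “local importance” map for an arbitrary score function:

```python
import numpy as np

def local_importance(f, image, eps=1e-3):
    """Finite-difference sensitivity map for a black-box scorer
    f: image -> [0, 1].  Large absolute values mark the pixels the
    classifier is locally paying attention to -- suspicious if they
    all sit in the background rather than on the tank."""
    image = np.asarray(image, dtype=float)
    base = f(image)
    sens = np.zeros(image.shape)
    for idx in np.ndindex(*image.shape):   # one pixel at a time
        bumped = image.copy()
        bumped[idx] += eps
        sens[idx] = (f(bumped) - base) / eps
    return sens

# Toy scorer that only ever looks at the top-left 'background' pixel
score = lambda img: float(img[0, 0])
print(local_importance(score, np.zeros((2, 2))))
```

Pixel-by-pixel perturbation is far too slow for real images, but the principle is the same one fast gradient-based saliency methods exploit.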

By using an external method such as this to provide explainability to every decision an AI makes, we avoid the tradeoffs internal methods force us to make. External methods add rather than compromise.[4]

[1] Although I might be interested in the “cattiest cat that ever did cat”.

[2] Did I mention I like cats?

[3] Almost certainly a false myth; see https://www.gwern.net/Tanks for the story. It’s probably stuck around because it’s both such a useful explanatory device for elementary machine learning and it lets us poke fun at the incompetence of military science & technology.

[4] Ribeiro et al. (2016) “Why Should I Trust You?”: Explaining the Predictions of Any Classifier https://arxiv.org/abs/1602.04938 is a brilliant resource for diving into this further.

Of course not. Some houses don’t have anyone there during the hours a field officer might be visiting them, and those houses will quickly become overrepresented in the sample of houses left on your list. This is an example of “diminishing returns” in action – the more effort you put into something, the less efficient each extra bit of effort gets – and is a feature of many real-world systems that one might want to simulate.

Now imagine that you’re a virtual Census officer knocking on virtual doors. The person coding up the simulation only knows that on average 40% of the doors knocked on will answer the door, but doesn’t have any data beyond this about what happens for each day. The naive method would be to, after every knock, independently have the door be answered with probability 0.4 – however as shown above, this won’t capture the right real-world behaviour.

Why does this matter? Well, consider the two uses for the simulation – to make decisions far in advance of the live operation, and as a benchmark to track progress against during the live operation.

- If we don’t account for diminishing returns by day, we end up looking like we’re doing a lot better than we are halfway through the live operation, as our simulation predicts that we still have a 0.4 probability of people answering the door even though we’ve already visited all of the easiest addresses. This could cause false confidence and lead to eventual undercount because we don’t take appropriate countermeasures for things going wrong.
- If we don’t account for diminishing returns by category (e.g. time of day a Census officer is out and about) then our upfront decisions about scheduling could be thrown off by trying to optimise a simulation that doesn’t reflect the real world.

To expand on that last point: assume we had data on answering the door split by time of day – we know that on average 30% of people answer their doors in the daytime as opposed to 50% in the evenings after work. If we naively plug this into our simulation and then use it to find the optimal schedule, it will tell us to always visit in the evenings. However, in reality some houses are better approached in the daytime, and some mix of daytime and evening will be a better approach.

Agent-Based Modelling (ABM) is a type of simulation where you simulate each individual agent (in this case, Census officer and house) rather than just having numbers for “doors knocked on” and “doors answered”.

This method opens the door (heh) to a solution by having properties associated with an agent (such as a house’s “probability of answering the door in the daytime” and “probability of answering the door in the evening”). Values for these properties can vary among agents. Now, when the virtual officer knocks on the door, instead of a universal 0.3 or 0.5 we instead call on that individual agent’s probability. Easy-to-contact houses are thus removed from the pool of remaining agents and the system starts demonstrating the emergent behaviour of diminishing returns present in the real-world system.
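A minimal sketch of the idea (all numbers invented for illustration): give each house its own answer probability, remove the houses that answer, and watch the daily contact rate fall.

```python
import random

def daily_contact_rates(n_houses=1000, days=10, seed=0):
    """Heterogeneous agents: each house keeps its own probability of
    answering the door.  Easy-to-contact houses leave the pool early,
    so the observed daily rate drops -- diminishing returns emerges
    for free, with no explicit schedule of declining rates."""
    rng = random.Random(seed)
    remaining = [rng.uniform(0.05, 0.8) for _ in range(n_houses)]
    rates = []
    for _ in range(days):
        if not remaining:
            break
        still_unanswered, answered = [], 0
        for p in remaining:
            if rng.random() < p:      # this house answers today
                answered += 1
            else:
                still_unanswered.append(p)
        rates.append(answered / len(remaining))
        remaining = still_unanswered
    return rates

print(daily_contact_rates())  # e.g. starts around 0.4, sags well below it
```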

Challenges still remain with this approach. From what distribution and why do you assign those properties to the agents, and how do you get the data needed to inform your choice? Daytime and evening contact rates of a house are probably correlated, but how correlated? Assuming you need a really well calibrated model and have the data to throw at it, is this really a better approach than just estimating a lookup table of contact rates by day? Nevertheless, it’s a powerful tool for the ABM kit.

This post was made drawing on my own experiences working on the collection operation for Census 2021.

A simple example of mathematical fairness is known as the “Principle of the Divided Cloth”, and it goes all the way back to the Talmud. If there is a roll of cloth that person A claims to own all of and person B claims to own half of, how should it be divided? Well, assuming the claims are both reasonable, one would be tempted to split the cloth in proportion to the amount that was claimed by each side, in a 2:1 ratio. However, the Principle of the Divided Cloth instead proposes a 3:1 split, following the logic that only half of the cloth is in dispute – nobody is saying that A doesn’t own the first half – so the disputed part should be split equally.
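One way to write the two-claimant rule down in code (a sketch of the principle as described above, assuming the two claims together cover at least the whole cloth):

```python
def divided_cloth(claim_a, claim_b, total=1.0):
    """Divided Cloth rule: only the overlap of the two claims is in
    dispute. The undisputed part goes to its sole claimant; the
    disputed part is split equally. Assumes claim_a + claim_b >= total."""
    disputed = claim_a + claim_b - total          # the contested overlap
    a_share = (claim_a - disputed) + disputed / 2
    b_share = (claim_b - disputed) + disputed / 2
    return a_share, b_share

# A claims the whole cloth, B claims half: the 3:1 split from the text.
print(divided_cloth(1.0, 0.5))  # (0.75, 0.25)

# If both claim everything, the whole cloth is disputed: split 50/50.
print(divided_cloth(1.0, 1.0))  # (0.5, 0.5)
```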

This principle can be applied to a wide variety of problems where two (or more) parties can come together to create something more valuable than they could working apart, and must decide how to divide that extra value.

In an optimisation problem, you’re trying to locate the “best” solution. “Best” how and for whom, you ask? Good question. The usual answer is “best for whoever is paying me, however they define ‘best'”. However, in applying optimisation methods to complex problems where multiple parties have stakes in the outcome, you can find yourself unwillingly appointed to the position of arbitrator.

The example we’ve looked at in STOR-i relates to the OR-MASTER project on scheduling flight capacity on airport runways. The airport itself would like as much capacity as possible to be used (making it extra money), without going over capacity and causing delays as planes are held up for lack of runways. Each airline wants to schedule its flights to take off and land whenever it likes, which may include “peak periods” when the runways are too busy to accommodate everyone’s requests. And the airlines as a collective want a solution whose mechanics they understand, so they don’t have to play convoluted guessing games to get the flight slots they want.

There are two conflicts at play here:

- Each airline is competing with every other airline that uses that airport’s runways. We need to ensure the optimisation result is “fair” to every airline.
- The airlines as a collective are in competition with the airport when it comes to what they want from this optimisation. We need to balance the airlines’ need for “fairness” against the airport’s need for “maximum capacity in use”.

The first problem is solved by constructing a mathematical definition of “fairness”, which in this case is approximately “you get out of life what you put into it”. Airlines requesting slots in peak periods are likely to have their slot requests moved around, and the more peak slots they request, the more movement they will have to put up with. We can construct a “coefficient of fairness” for each airline that quantifies how fairly it has been treated by our optimiser, and then try to keep everyone’s fairness coefficient similar.
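As an illustration only (this is not the actual OR-MASTER metric, and the airline names and numbers are invented), one such coefficient could be the displacement an airline suffers per peak slot it requested:

```python
# Hypothetical fairness coefficient, purely for illustration:
# schedule displacement suffered per peak-period slot requested.
# "You get out of life what you put into it" - more peak requests
# justify putting up with more movement.

def fairness_coefficient(displacements_min, peak_requests):
    """Total displacement (minutes) per peak-period slot requested."""
    return sum(displacements_min) / max(peak_requests, 1)

airlines = {  # made-up airlines and numbers
    "AirA": {"displacements_min": [10, 0, 5], "peak_requests": 3},
    "AirB": {"displacements_min": [30], "peak_requests": 1},
}
coeffs = {name: fairness_coefficient(**a) for name, a in airlines.items()}
# AirA: 15 min over 3 peak requests -> 5.0
# AirB: 30 min over 1 peak request  -> 30.0

# The optimiser would then try to keep these similar, e.g. by
# penalising their spread:
unfairness = max(coeffs.values()) - min(coeffs.values())
print(coeffs, unfairness)
```

Under this (made-up) metric, AirB has been treated much less fairly than AirA despite asking for less, so the optimiser would be pushed towards schedules that shrink the gap.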

The second problem is harder to put in maths, because airlines and airports are not the same thing, so it’s hard to say what “equal treatment” means. Also, there’s nothing to say we should be treating them equally – that decision is better made by experts in the field of aviation than by us.

We solve this by constructing what’s called a “Pareto frontier”: a set of solutions that trade one objective (fairness) off against another (capacity). Each point on the frontier is “optimal” in the sense that, for the same capacity, there is no fairer solution, and for the same fairness, there is no solution with more capacity.
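Extracting such a frontier from a pool of candidate schedules can be sketched as follows (the candidate scores are invented, and both objectives are scaled so that higher is better):

```python
# Sketch of filtering candidate schedules down to a Pareto frontier.
# Each candidate is scored on (capacity, fairness), higher = better.
# Scores are made up for illustration; assumes no duplicate scores.

def pareto_frontier(solutions):
    """Keep solutions not weakly dominated on both objectives."""
    frontier = []
    for s in solutions:
        dominated = any(
            t["capacity"] >= s["capacity"]
            and t["fairness"] >= s["fairness"]
            and t != s
            for t in solutions
        )
        if not dominated:
            frontier.append(s)
    return frontier

candidates = [
    {"capacity": 100, "fairness": 0.2},
    {"capacity": 95,  "fairness": 0.8},  # an "easy win" vs the one above
    {"capacity": 90,  "fairness": 0.7},  # dominated: worse on both counts
    {"capacity": 80,  "fairness": 0.9},
]
frontier = pareto_frontier(candidates)
print(frontier)  # the dominated (90, 0.7) candidate is filtered out
```

In this toy pool, dropping from 100 to 95 capacity buys a large jump in fairness – exactly the “flat near the top” shape of frontier discussed below.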

Then, we can give this solution set over to the representatives of the airlines and airports to negotiate over, washing our hands of the problem.

It turns out that in practice we can get a lot of “easy wins”: by sacrificing a small amount of “optimal capacity”, we can make the solution a lot fairer. This is visible in the diagram above as the slope of the frontier being flatter near the top. So by implementing any fairness metric at all in our optimisation, we can keep multiple parties happier.

Here is a link to a post I wrote last year on the Government Statistical Service website, about a talk on Ethics in Mathematics that I organised for them: https://gss.civilservice.gov.uk/blog/surely-theres-no-ethics-in-mathematics/

The talk itself can be found as a recording on the Cambridge University Ethics in Mathematics Project website, here: https://ethics.maths.cam.ac.uk/assets/videos/ons.mp4
