What are we talking about when we talk about algorithmic transparency?

The term “algorithmic transparency”, in its many variants and variations, has become more and more common in the conversations I have with decision makers and policy wonks. It remains somewhat unclear what it actually means, however. As a student of philosophy I find that there is often a lot of value in examining a concept closely in order to understand it, and in the following I want to open up a coarse-grained view of this concept in order to understand it further.

At first glance it is not hard to understand what is meant by algorithmic transparency. Imagine that you have a simple piece of code that manipulates numbers, and that when you enter a series it produces another series as output. Say you enter 1, 2, 3, 4 and the output generated is 1, 4, 9, 16. You have no access to the code, but you can infer that the code probably takes the input and squares it. You can test this hypothesis – you decide to see if entering 5 gives you 25 in response. If it does, you are fairly certain that the code is something like “take input and print input times input” for the length of the series.
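To make the guessing game concrete, here is a minimal sketch in Python. The names black_box and hypothesis are mine, and the “hidden” code is hidden only by convention – the point is simply that we probe it from the outside:

```python
# A toy version of the guessing game above. black_box stands in for the code
# we cannot read; hypothesis is the rule we have inferred from the observed
# inputs and outputs. Both names are illustrative, not from any real system.

def black_box(x):
    # The hidden code: pretend we cannot see this.
    return x * x

def hypothesis(x):
    # Our inferred rule: "take input and print input times input".
    return x * x

# We can only corroborate the hypothesis by probing with new inputs.
for x in [1, 2, 3, 4, 5]:
    observed = black_box(x)
    predicted = hypothesis(x)
    print(x, observed, predicted, observed == predicted)
```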

Now, you don’t _know_ that this is the case. You merely believe so, and every new number you enter that seems to confirm the hypothesis may slightly corroborate your belief (depending on which theory of science you subscribe to). If you want to know, really know, you need to have a peek at the code. So you want algorithmic transparency – you want to see and verify the code with your own eyes. Let’s clean this up a bit and we have a first definition:

(i) Algorithmic transparency means having access to the code a computer is running, so that a human can verify what it is doing.

So far, so good. What is hard about this, then, you may ask? In principle we should be able to do this with any system: check the code and verify that it does what it is supposed to do, right? Well, this is where the challenges start coming in.

*

The first challenge is one of complexity. Let’s assume that the system you are studying has a billion lines of code, and that to understand what the system does you need to review all of them. Assume, further, that the lines of code refer to each other in different ways, that there are interdependencies and different instantiations and so forth – you will then end up with a situation where access to the code is essentially meaningless, because access does not guarantee verifiability or transparency in any meaningful sense.

This is easily realized by simply calculating the time needed to review a billion-line piece of software (note that we are assuming here that software is composed of lines of code – not an obvious assumption, as we will see later). Say you need one minute to review a line of code – that makes for a billion minutes, which is roughly 1,900 years. A billion seconds is 31.69 years, so even if you assume that you can verify a line a second the time needed is extraordinary. And remember that we are assuming that _linear verification_ will be exhaustive – a very questionable assumption.
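If you want to check the arithmetic, it is a couple of lines of Python:

```python
# Back-of-the-envelope arithmetic for the review-time claim above.
lines = 1_000_000_000              # a billion lines of code

minutes_per_year = 60 * 24 * 365.25
seconds_per_year = 60 * minutes_per_year

print(lines / minutes_per_year)    # ~1,900 years at one line per minute
print(lines / seconds_per_year)    # ~31.7 years at one line per second
```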
So we seem to have found a first interesting limitation here that we should think about.

L1: Complexity limits human verifiability.

This is hardly controversial, but it is important. So we need to amend our definition, and perhaps think about computer-assisted verification. We end up with something like:

(ii) Algorithmic transparency is achieved by access to the code that allows another system to verify the way the system is designed.

There is an obvious problem with this that should not be glossed over. As soon as we start using code to verify code we enter an infinite regress. Using code to verify code means we need to trust the verifying code over the verified. There are ways in which we can be comfortable with that, but it is worth understanding that our verification is now conditional on the verifying code working as intended. This qualifies as another limit.

L2: Computer-assisted verification relies on blind trust at some point.

So we are back to blind trust, but the choice we have is what system we have blind trust in. We may trust a system that we have used before, or that we believe we know more about the origins of, but we still need to trust that system, right?
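To see how thin this kind of computer-assisted verification really is, consider a toy sketch. The names and the “specification” below are invented for illustration; the punchline is simply that the trust has moved into the verify function, which is itself unverified code:

```python
# A minimal sketch of computer-assisted verification, assuming "verification"
# here just means checking a system against a specification on some inputs.
# The names (system_under_test, spec, verify) are illustrative only.

def system_under_test(x):
    return x * x

def spec(x, y):
    # The property we want to hold: the output is the square of the input.
    return y == x * x

def verify(system, spec, inputs):
    # We now have to trust that *this* function is correct -- the regress the
    # text describes: the verifier itself is code we have not verified.
    return all(spec(x, system(x)) for x in inputs)

print(verify(system_under_test, spec, range(1000)))  # True -- if verify() is right
```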

*

So, our notion of algorithmic transparency is turning out to be quite complicated. Now let’s add another complication. In our proto-example of the series, the input and output were quite simple. Now assume that the input consists of trillions of documents. Let’s remain in our starkly simplified model: how do you know that the system – complex as it now is – is doing the right thing given the data?

This highlights another problem. What exactly is it that we are verifying? There needs to be a criterion here that allows us to state whether we have achieved algorithmic transparency or not. In our naive example above this seems obvious, since what we are asking about is how the system works – we are simply guessing at the manipulation of the series in order to arrive at a rule that will allow us to predict what a certain input will yield in terms of an output. Transparency reveals whether our inferred rule is the right one, and we can then debate whether that is the way the rule should look. The value of such algorithmic transparency lies in figuring out if the system is cheating in any way.

Say that we have a game: I show you the series 1, 2, 3, 4 and the output 1, 4, 9, 16, and I ask you to bet on what the output will be when I enter 5. You guess 25, I enter 5, and the output is 26. I win the bet. You demand to see the code, and the code says: “For every input print input times input, except if input is 5 – then print input times input _plus one_”.
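Written out as code – my rendering of the quoted rule, not anyone’s actual system – the trap is almost insultingly small:

```python
# The "trap" from the betting game, spelled out.

def rigged(x):
    # For every input print input times input -- except if input is 5,
    # then print input times input plus one.
    if x == 5:
        return x * x + 1
    return x * x

for x in [1, 2, 3, 4, 5]:
    print(x, rigged(x))   # 1, 4, 9, 16, 26 -- the trap only shows itself on 5
```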

This would be cheating. I wrote the code. I knew it would do that. I put a trap in the code, and you want algorithmic transparency to be able to see that I have not rigged the code to my advantage. You are verifying two things: that the rule you have inferred is the right one AND that the rule is applied consistently. So it is the working of the system as well as its consistency, or its lack of bias in any way.

Bias or consistency is easy to check when you are looking at a simple mathematical series, but how do you determine consistency in a system that contains a trillion data points and runs on, say, over a billion lines of code? What does consistency even mean there? Here is another limitation, then.

L3: Algorithmic transparency needs to define criteria for verification such that they are possible to determine with access to the code and data sets.

I suspect this limitation is not trivial.

*

Now, let’s complicate things further. Let’s assume that the code we use generates a network of weights that are applied to decisions in different ways, and that this network is trained by repeated exposure to data and its own simulations. The end result of this process is a weighted network with certain values across it, and perhaps they are even arrived at probabilistically. (This is a very simplified model, extremely so).
Here, by design, I know that the network will look different every time I “train” it. That is just a function of its probabilistic nature. If we now want to verify this, what we are really looking for is a way to determine a range of possible outcomes that seem reasonable. Determining that will be terribly difficult, naturally, but perhaps it is doable. But at this point we start suspecting that maybe we are engaging with the issue at the wrong level. Maybe we are asking a question that is not meaningful.
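Before moving on, here is a toy illustration of that variability: an invented one-parameter model and made-up data, nothing like a real network, but enough to show that the same training procedure run twice does not end up with exactly the same weights:

```python
# Two runs of the same stochastic training loop on the same data end with
# slightly different weights, purely because of random initialization and
# random sample order. Model and data are toy stand-ins.

import random

def train(data, seed, steps=2000, lr=0.001):
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)            # random initialization
    for _ in range(steps):
        x, y = rng.choice(data)           # random sample order
        w -= lr * (w * x - y) * x         # one gradient step on squared error
    return w

data_rng = random.Random(0)
data = [(x, 2.0 * x + data_rng.gauss(0, 0.1)) for x in range(1, 11)]

print(train(data, seed=1))   # close to 2.0 ...
print(train(data, seed=2))   # ... but not bit-for-bit the same
```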

*

We need to think about what it is that we want to accomplish here. We want to be able to determine how something works in order to understand if it is rigged in some way. We want to be able to explain what a system does, and ensure that what it does is fair, by some notion of fairness.

Our suspicion has been that what we need to do is verify the code behind the system, but that is turning out to be increasingly difficult. Why is that? Does it mean that we can never explain what these systems do?
Quite the contrary, but we have to choose an explanatory stance – to draw on a notion introduced by D.C. Dennett. Dennett, loosely, notes that systems can be described in different ways, from different stances. If my car does not start in the morning I can describe this problem in a number of different ways.

I can explain it by saying that it dislikes me and is grumpy – adopting an _intentional_ stance, treating the system as if it had intentions.
I can explain it by saying I forgot to fill up on gasoline yesterday, and so the tank is empty – this is a _functional_ or mechanical explanation.
I can explain it by saying that the wave functions associated with the car are not collapsing in such a way as to… or use some other _physical_ explanation of the car as a system of atoms or a quantum-physical system.

All explanations are possible, but Dennett and others note that we would do well to think about how we choose between the different levels. One possibility is to look at how economical and how predictive an explanation is. While the intentional explanation is the shortest, it gives me no way to predict what will allow me to change the system. The mechanical or functional explanation does – and the physical one would take pages upon pages to spell out in any detail, and so is clearly uneconomical.
Let me suggest something perhaps controversial: the ask for algorithmic transparency is not unlike an attempt at explaining the car’s malfunctioning from a quantum physical stance.
But that just leaves us with the question of how we achieve what arguably is a valuable objective: to ensure that our systems are not cheating in any way.

*

The answer here is not easy, but one way is to focus on function and outcomes. If we can detect strange outcome patterns, we can assume that something is wrong. Let’s take an easy example. Say that an image search for physicist on a search engine leads to a results page that mostly contains white, middle-aged men. We know that there are certainly physicists who are neither male nor white, so the outcome is weird. We then need to understand where that weirdness is located. A quick analysis gives us the hypothesis that maybe there is a deep bias in the input data set, where we, as a civilization, have actually assumed that a physicist is a white, middle-aged man. By only looking at outcomes we are able to understand if there is bias or not, and then form hypotheses about where that bias is introduced. These hypotheses can then be confirmed or disproven by looking at separate data sources, like searching in a stock photo database or using another search engine. Nowhere do we need to, or would we indeed benefit from, looking at the code. Here is another potential limitation, then.

L4: Algorithmic transparency is far inferior to outcome analysis in all sufficiently complex cases.

Outcome analysis also has the advantage of being openly available to anyone. The outcomes are necessarily transparent and accessible, and we know this from a fair number of previous cases – just by looking at the outcomes we can form a view on whether a system is inherently biased or not, and on whether that bias is pernicious or not (remember that we want systems biased against certain categories of content, to take a simple example).
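To make the search example a little more concrete, here is a sketch of what such outcome analysis could look like. The group labels and numbers are entirely invented; the point is only that we compare the distribution of some attribute in the results with a reference distribution from an independent source:

```python
# Outcome analysis sketch: compare an attribute distribution in search results
# with a reference distribution from an independent source. All labels and
# counts below are hypothetical.

from collections import Counter

def distribution(items):
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Attribute labels for the top results of an image search (hypothetical).
search_results = ["group_a"] * 90 + ["group_b"] * 10

# Attribute labels drawn from an independent reference source (hypothetical).
reference = ["group_a"] * 60 + ["group_b"] * 40

print(distribution(search_results))   # {'group_a': 0.9, 'group_b': 0.1}
print(distribution(reference))        # {'group_a': 0.6, 'group_b': 0.4}
# A large gap between the two is the "weirdness" we then need to explain --
# without ever looking at the code.
```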

*

So, summing up. As we continue to explore the notion of algorithmic transparency, we need to focus on what it is that we want to achieve. There is probably a set of interesting use cases for algorithmic transparency, and more than anything I imagine that the idea of algorithmic transparency is actually an interesting design tool to use when discussing how we want systems to be biased. Debating, in meta code of some kind, just how bias _should be_ introduced in, say, college admission algorithms would allow us to understand which designs can accomplish that best. So maybe algorithmic transparency is better suited to the design of bias than to its detection?

Data is not like oil – it is much more interesting than that

So, this may seem to be a nitpicking little note, but it is not intended to belittle anyone, or even to deny the importance of having a robust and rigorous discussion about data, artificial intelligence and the future. Quite the contrary – this may be one of the most important discussions we need to engage in over the coming ten years or so. But when we do so, our metaphors matter. The images that we convey matter.

Philosopher Ludwig Wittgenstein notes in his works that we are often held hostage by our images, that they govern the way we think. There is nothing strange or surprising about this: we are biological creatures brought up in three-dimensional space, and our cognition did not come from the inside – it came from the world around us. Our figures of thought are inspired by the world, and they carry a lot of unspoken assumptions and conclusions.

There is a simple and classical example here. Imagine that you are discussing the meaning of life, and that you picture the meaning of something as hidden, like a portrait behind a curtain – so that discovering the meaning naturally means revealing what is behind that curtain and understanding it. Now, the person you are discussing with instead pictures meaning as a bucket you need to fill with wonderful things, so that a meaningful life means having a full bucket. You can learn a lot from each other’s images here. But they represent two very different _models_ of reality. And models matter.

That is why we need to talk about the meme that “data is like oil” or any other scarce resource, like the spice in Dune (with the accompanying cry “he who controls the data…!”). This image is not worthless. It tells us there is value to data, and that data can be extracted from the world around us – so far the image is actually quite balanced. There is value in oil and it is extracted from the world around us.

But the key thing about oil is that there is not a growing amount of it. That is why we discuss “peak oil” and that is why the control over oil/gold/Dune spice is such a key thing for an analysis of power. Oil is scarce, data is not – at least not in the same way (we will come back to this).

Still not sure? Let’s do a little exercise. In the time it has taken you to read to this place in the text, how many new dinosaurs have died and decomposed and been turned into oil? Absolutely, unequivocally zero dinosaurs. Now, ask yourself: was any new data produced in the same time? Yes, tons. And at an accelerating rate as well! Not only is data not scarce, it is not-scarce in an accelerating way.

Ok, so I would say that, wouldn’t I? Working for Google, I want to make data seem innocent and unimportant while we secretly amass a lot of it. Right? Nope. I do not deny that there is power involved in being able to organize data, and neither do I deny the importance of understanding data as a key element of the economy. But I would like for us to try to really understand it and then draw our conclusions.

Here are a few things that I do not know the answers to, and that I think are important components in understanding the role data plays.

When we classify something as data, it needs to be unambiguous, and so needs to be related to some kind of information structure. In the old analysis we worked with a model where we had data, information, knowledge and wisdom – and essentially thought of that model as hierarchically organized. That makes absolutely no sense when you start looking at the heterarchical nature of how data, information and knowledge interact (I am leaving wisdom aside, since I am not sure it is the right unit of analysis). So something is data in virtue of actually having a relationship with something else. Data may well not be an _atomic_ concept, but rather a relational concept. Perhaps the basic form of data is the conjunction? The logical analysis of data is still fuzzy to me, and seems to be important when we live in a noise society – since the very first step we need to undertake is to mine data from the increasing noise around us, and here we may discover another insight. Data may become increasingly scarce since it needs to be filtered from noise, and the cost of that may be growing. That scarcity is quite different from the one where there is only a limited amount of something – and the key to value here is the ability to filter.

Much of the value of data lies in its predictive qualities – it can be used to predict and analyze in different ways – but that value is clearly not stable over time. So if we think about the value of data, should we think in terms of a kind of decomposing value that disappears over time? In other words: do data rot? One of the assumptions we frequently make is that more data means better models, but that also seems to be blatantly wrong. As Taleb and others have shown, when the number of variables in a data set grows linearly, the number of correlations grows much faster, and an increasing percentage of those correlations are spurious and worthless. That seems to mean that if big data is good, vast data is useless and needs to be reduced to big data again in order to be valuable at all. Are there breaking points here? Certainly there should be from a cost perspective: when the cost of reducing a vast data set to a big data set is greater than the expected benefits of the big data set, the insights available are simply not worth the noise filtering required. And what of time? What if the time it takes to reduce a vast data set to a big data set is necessarily such that the data have decomposed and the value is gone? Our assumption that things get better with more data seems to be open to questioning – and this is not great. We had hoped that data would help us solve the problem.
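The point about spurious correlations can be illustrated with nothing but random noise. The sample sizes and the threshold below are arbitrary choices of mine; the pattern is not:

```python
# With purely random data, the number of variable pairs grows roughly
# quadratically with the number of variables, and so does the count of pairs
# that clear an arbitrary correlation threshold by chance alone.

import random

def count_spurious(n_vars, n_obs=50, threshold=0.3, seed=0):
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

    def corr(xs, ys):
        mx, my = sum(xs) / n_obs, sum(ys) / n_obs
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
    return sum(1 for i, j in pairs if abs(corr(data[i], data[j])) > threshold)

for n in [10, 20, 40, 80]:
    print(n, count_spurious(n))   # the count of "significant" pairs explodes
```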

AlphaGo Zero seems to manage without human game data as a seed, at least. What is the class of tasks that do not actually benefit from seed data? If that class is large, what else can we say about it? Are key, crucial tasks in that set? What characterizes these tasks? And are “data agnostic” tasks evidence that we have vastly overestimated the nature and value of data for artificial intelligence? The standard narrative now is this: “the actor that controls the data will have an advantage in artificial intelligence and will then be able to collect more data in a self-reinforcing network effect”. This seems to be nonsense when we look at the data agnostic tasks – so how do we understand this?

One image that we could use is to say that models eat data. Humor me. Metabolism as a model is more interesting than we usually allow for. If that is the case we can see another way in which data could be valuable: it may be more or less nutritious – i.e. it may strengthen a model more or less if the data we look at becomes part of its diet. That allows us to ask complicated questions like this: if we compare an ecology in which models get to eat all kinds of data (i.e. an unregulated market) with ecologies in which the diet is restricted (a regulated market), and then let both these evolved models compete in a diet-restricted ecology – does the model that grew up on an unrestricted diet have an insurmountable evolutionary advantage? Why would anyone be interested in that, you may ask. Well, we are living through this very example right now – with Europe an often, and often soundly, regulated market and key alternative markets completely unregulated – and with the very likely outcome that we will see models that grew up on unregulated markets compete, in Europe, with those that grew up in Europe. How will that play out? It is not inconceivable that the diet-restricted ones will win, by the way. That is an empirical question.

So, finally – a plea. Let’s recognize that we need to move beyond the idea that data is like oil. It limits our necessary and important public debate. It hampers us and does not help us understand this new complex system. And this is a wide open field, where we have more questions than answers right now – and we should not let faulty answers distract us. And yes, I recognize that this may be a fool’s plea – the image of data as oil is so strong and alluring – but I would not be the optimist I am if I did not think we could get to a better understanding of the issues here.

A note on complementarity and substitution

One of the things I hear most in the many conversations I have on tech and society today is that computers will take our jobs, or that man will be replaced by machine. It is a reasonable and interesting question, but I think the conclusion is ultimately wrong. I have tried to collect a few thoughts about that in a small essay here, for reference. The question interests me for several reasons – not least because I think it is partly a design question rather than something driven by technological determinism. This is in itself a belief that could be challenged on a number of fronts, but I think there is a robust defense for it. The idea that technology has to develop in the direction of substitution is simply not true if we look at existing systems. Granted: when we can automate not just a task but cognition generally this will be challenged, but strong reasons remain to believe that we will not automate fully. So, more on this later. (Image: Robin Zebrowski)