Julieta is blogging Archive Pages Categories Tags

Better Than Tit For That

03 September 2014

Outperforming tit for that in the Prisoner’s dilemma

In the winter of 2013 I took a graduate course on Empirical Optimization and Algorithm configuration. In the course we explored general Optimization techniques, and each student was assigned with the task of presenting them to the rest of the class. I was assigned Genetic Algorithms and Genetic Programming. We were responsible for choosing the assigned readings, and for leading the discussion on them with the rest of the class.

Searching for Genetic Programming papers, I stumbled upon a Nature 1993 paper by Martin Nowak and Karl Sigmund, A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner’s Dilemma game. Since I had 2 weeks to prepare this presentation, I was able to go deep into the details of the paper, and I also wrote a small piece of code that reproduces the main figure. The paper is absolutely amazing, and I decided to write this post to share the insights that I got from it.

The Prisoner’s Dilemma

If you are familiar with the Prisoner’s Dilemma, you might want to skip to the next section.

The Prisoner’s Dilemma is a classical problem in Sociology, Game Theory and Evolutionary Biology. The setting is about 2 players, \( A \) and \( B\), who are caught commiting a crime. In separate rooms, each of them is interrogated by a separate police officer, but they both get offered the same deal:

The seeting could not be more interesting. Of course both players will do better if none confesses, but both also have the chance of walking out free! The fact that defecting pays off, and the knowledge that others defecting will get them ahead of you is often called the tragedy of the commons, and is the essence of the prisoner’s dilemma.

The Dilemma in everyday life

Hopefully, the thieves scenario is unlikely for most of us. However, it is easy to extend this scenario to everyday situations. For example, this is a way of accounting for arms races: sure we would all be better off with less weapons in this world, but no country wants to unarm itself first! Imagine also a drug deal in a back alley. Ideally the buyer will bring the money and the provider will bring the drug, but then the buyer thinks “what if I get ripped off? I better bring a gun!”. The seller has a similar though-process and decides to bring a gun as well. Both would be better of without guns involved, but the Prisoner’s Dilemma makes them both bring guns to the deal. Finally, another classical example is a divorce: both partners would be better off without lawyers invovled in the process – I’ve been told that lawyers charge a lot of money. But then being affraid of their ex-partners getting lawyers involved makes both parties end up hiring expensive lawyers to litigate in their favour.

In a more general version, we can set up a pay matrix, with pays \(R, S, T\) and \(P\), such that betraying pays off (\(T > R > P > S\)), but globally it is better to cooperate ( \(2R > T + S \) ). The following matrix should make it easier to visualize the pays, and also offers my interpretation of the payoffs variables names’ choice

Pay matrix for the prisoners dilemma Of course here we are talking about rewards, not years in jail! We also refer to actions as “Cooperation” (C) and “Defection” (D).

The Iterated Prisoner’s Dilemma

In the setting that I mentioned before, the dilemma only happens once. \(A\) and \(B\) only make one choice, and then their fates are decided. In real life, however, daily interactions often involve the same people over and over again. We can think of this case as an extension of the Prisoner’s Dilemma, where the dilemma is solved many times. Now things are a little different; we have more incentive to cooperate, because we know that in the future this will lead to larger gains.

I am not kidding when I say that this problem is of interest to many areas, and a favorite for first-year classes. From a Psychology point of view, the Prisoner’s Dilemma is interesting to explain moral choices. It is introduced in the first few sessions of the Coursera class on Morality by Paul Bloom. From a Game-Theory point of view, the Prisoner’s Dilemma offers a simple test-bed for decision theory. In fact, it is introduced in the very first class of this Open Yale Course on Game Theory. For an Evolutionary Biology point of view, these models often help explain animal behavior. You might want to check this lecture on Genomic Conflict. It is because of the broad interest of this problem that Nowak and Sigmund’s result is so remarkable.

Axelrod’s Tournaments and tit-for-tat

Seeing the potential insight that could be won from having computers play the Iterated Prisoner’s Dilemma, in the 1980’s Robert Axelrod organized a tournament were people from around the world were asked to submit their programs, and made them play the Prisoner’s Dilemma over and over again. I do not know much about the details of the tournament, but the takeaway point is that a very simple strategy, called tit-for-tat, made the most points – and therefore won the tournament. The behavior of this strategy is very simple:

Tit-for-tat as a model for human behavior

It can be argued that this is, broadly, the way humans behave. If we feel that someone is unfair to us, we tend to reply equally. We are also usually nice to our friends, people who we consider are usually nice to us. Also, in first encounters with strangers, people also tend to be nice and polite to each other. This achievement recieved wide notoriety – a “conspicuous success” as Nowak and Sigmund call it – and is often the point up to where most first-year courses talk about the Prisoner’s Dilemma.

What is wrong with Axelrod’s Tournament?

Nowak and Sigmund argue that the simulation in Axelrod’s Tournament was too simple. Particularly, the fact that everything is deterministic is quite unreal. We do not live in a “deterministic cyber-world”. In real life, even humans make mistakes. It is hard to be nice to everyone when we are angry, and it is also very easy to be nice to everyone when we are happy. This means that we don’t play tit-for-tat in a perfect way. Rather, it makes sense to model choices as probabilities instead of deterministic rules.

Second, and this is mostly of interest to Evolutionary Biology, if a strategy in a population is very successful, we can expect others to copy it (or in biology, for the gene that causes such behavior to have more offspring). As time goes on, a very successful advantage will have to fare against others of its own kind, which might not prove successful in the long run. How to evolve behaviors that cooperate in the long run when locally it seems to make sense to defect is a very interesting question in Evolutionary Biology.

Evolving Cooperation: The Experiment

Setting out to evolve strategies is an alternative to Axelrod’s Tournament. It removes humans from the loop by expecting good strategies to emerge, rather than to have someone program them. It also takes into account time dynamics, instead of just populations. Simulating strategies imposes 2 technical challenges: 1) how can we simulate mutations within the population? and 2) how can we simulate the Prisoner’s Dilemma in an efficient way?

Simulating mutations

First of all, remember that strategies are defined by probabilities. Each player is represented by a vector \( \{r, s, t, p\} \), where the numbers represent the probabilities that the player will cooperate depending on the reward they received on the previous iteration (resp. the same variable names in capital letters). If we plan to implement Darwinian evolution dynamics, we must make sure that good strategies replicate more, but we must also account for mutations. Since all we need to spawn new players is to define the vector \( \{r, s, t, p\} \), and this is defined by probabilities, we can sample from a probability distribution that represents mutations. Nowak and Sigmund propose using the U-shaped distribution \( [\pi x ( 1-x)]^{-1/2} \) which is apparently familiar to geneticists. The 4 numbers can be drawn independently, mimicking mutations as ‘mistakes’ in the genome copying process.

Note that this distributions encourages exploring the extremes of cooperation and defection, but is also limited to the range 0.001 - 0.999, so that there is never a deterministic rule.

Simulating the Iterated Prisoner’s Dilemma

We now have to figure out how to simulate the Prisoner’s Dilemma in an efficient way. Since players are represented by probabilities, one may think that a simulation with biased coins should do the job. There is, however, a much more efficient way of doing it. Notice, for example, that having only probabilities that respond to the previous iteration, leaves open the question: what should the player do in the first iteration? As it turns out, what players do in the first iteration does not really matter. This is because no matter what the first play is, the overall payoff for an infinite number of iterations will always be the same. To see how this works, we need to make a transition matrix \(P\) that simulates the next choice given the previous reward. Since this depends on the actions of 2 players, we have to model the possible combinations of actions.

Take, for example, the case when in the previous example player 1 cooperated, but player 2 defected. This takes us to the second row. What was the payoff that player 1 got? \(S\) (as in Sucker), and what did player 2 get? \(T\) (as in Traitor). Therefore, the chances of player 1 cooperating again are \(s1\), and of defecting \( (1-s1) \). For player 2, the chances of cooperating are \(t2\), and the chances of defecting are now \( (1-t2) \). We cam similarly fill out the rest of the rows of the matrix.

Transition matrix for the Iterated Prisoner's Dilemma

Given an initial 1-by-4 vector \( k_1 \), e.g. \( k_1=[0,0,1,0] \), we can simulate one iteration of the dilemma by multiplying \(k_2 = k_1P\). Similarly, we can get 2 iterations by multiplying \(k_3 = k_1PP \). 3 Iterations are given by \(k_4 = k_1PPP\) and, in general, \(n\) iterations are given by \(k_{n+1} = k_iP^n\). It is very easy to verify numerically that, for a sufficiently large \(n\), we will always end up with the same distribution! This known as the stationary distribution. It can also be obtained by solving a system of linear equations. This is an awesome result! By introducing probabilities instead of deterministic rules we have also saved ourselves lots of computation!

The main result: Pavlov is no Simpleton

So now we know how to simulate mutations, and how to efficiently simulate multiple iterations of the dilemma. We have everything we need to reproduce the experiment of Nowak and Sigmund!

According to the paper, adding stochasticity and evolution to the simulation should show that tit-for-tat has a hard time being the dominant strategy in the population. Instead, if we run the simulation for a fair amount of time, we will see that a strategy that tends to continue with the same strategy if the reward was good (\(T\) or \(R\)), and switch if the reward was bad (\(S\) or \(P\)) will dominate. Since this strategy seems to be very reactive (win-stay, lose-shift), it had previously been called simpleton, but Nowak and Sigmund call it Pavlov, due to the famous experiment by Ivan Pavlov on classical conditioning. The main finding is that, in spite of its simplistic nature, evolutionarily Pavlov tends to fare better than tit-for-tat in the long run.

The paper does not give details on what the population size was, so to get quick results I ran it with 20 individuals for 1,000,000 iterations. The figure that I got looks like this

Prisoner's Dilemma with population size of 20

While it is hard to see a strategy dominating the game for a long time, around iterations 200,000 to 300,000 we see that Pavlov \(\{1, 0, 0, 1\}\) enjoys the longest streak. However, 20 individuals make for a very small population. I ran the code with 100 individuals, keeping the ratio of mutants – the amount of random strategies tried every now and then – to 0.05 (i.e. 5 individuals) Running for about a day and a half, I got the following figure

Prisoner's Dilemma with population size of 100

Interestingly enough, we see Pavlov dominating early on, being overtaken by other less stable strategies until iterations 200,000 to 300,000, when a strategy that looks a lot like tit-for-tat \(\{1, 0, 1, 0\}\) takes over. Tit-for-tat loses dominance until iterations 550,000 to 700,000, when a Pavlov-like strategy again takes over. I imagine that running the code with a population of 1,000 will get results closer to those of Nowak and Sigmund.

As Nowak and Sigmund say, this suggests that the conditioning behavior that we see in animals – and in humans – is well founded in Evolutionary Dynamics. As they boldy put it, Pavlov is no simpleton.

The code

I wrote a small piece of matlab code that reproduces the results over here. Feel free to play with it! Just change the parameters at the top of runner.m, and run the same file to see the figure build up in real time.

Future work

My code is fairly slow right now. It could surely benefit a lot from parallelism, or from being implemented in a lower-level language. Also, the population dynamics that I implemented – that is, how the fitness of each individual is measured – is now done in a round-robin style, which means that everyone plays everyone. This is not very realistic because, in populations, individuals do not interact all with all; rather, a small groups tend to interact a lot, and some individuals never meet. The other downside is that the implementation is \(\mathcal{O}(n^2)\), where \(n\) is the population size. This will not scale well to larger populations.

blog comments powered by Disqus i
Fork me on GitHub