Have you ever stopped to look at a situation and thought, ‘What were the chances of that happening?’ Or, if you’re an overthinker, you definitely know what it’s like to imagine different scenarios and their possible outcomes. Even in business, there are strategies you can employ to improve your chances of success. All of these are inherently based on cause and effect.

In these scenarios, when you need a quantifiable answer, you can use maths and statistics. But if you are going down this route, you need to dive in and start separating causation from correlation, as this will greatly affect your results. Pretty straightforward, right? Well, that just depends on how familiar you are with mathematics, statistics and their terminology.

In this book summary, readers will discover:

- Causation
- The right way to interpret data
- The Ladder of Causation
- Confounders and Mediators
- Determining the relationship between correlation and causation

## Key lesson one: Causation

Causation and correlation are often confused. Causation refers to the factors that actually influence a result, whereas correlation describes factors that move together with a result without necessarily influencing it. Adding to the confusion, causation was largely disregarded in the early days of statistics because it could not be expressed mathematically. Well, not until around 1920, that is, when geneticist Sewall Wright showed that it could.

Wright did this while studying the markings found on guinea pigs. He wanted to find out to what extent the markings were hereditary, so the markings were the effect and inheritance was the candidate cause. He began with a diagram that connected causes to outcomes and then used data to answer his question. This was the path diagram, in which an arrow from one factor to another signifies that the first has an effect on the second. Using his collected data, he was able to turn these diagrams into algebraic equations.

Wright’s methods did not sit well with the scientific community at first, and his approach to determining causation from correlations was ignored. It is only in recent decades that his work has been reconsidered and causation welcomed as a scientific principle.

## Key lesson two: The right way to interpret data

No matter what you are working on, you need to collect data in order to analyze it. But unless you know what you are doing, data can be completely misunderstood. This is exactly what happened when the smallpox vaccine was introduced in the 18th century. The data seemed to indicate that the vaccine caused more deaths than the disease itself!

How did this happen? Well, the data was misinterpreted. People simply compared the number of deaths among children who had been vaccinated with the number of deaths among children who had not. Because far more children were vaccinated than not, even a rare fatal reaction to the vaccine added up to more deaths than smallpox caused in the much smaller unvaccinated group. To analyze the data properly, however, we first have to consider how many children would have died if no one had been vaccinated at all. That is the number that should be compared with the number of deaths that occurred after vaccination.
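To make the comparison concrete, here is a minimal sketch of the arithmetic. Every number in it is an illustrative assumption rather than historical data; the point is only to show how the misleading comparison and the proper one differ.

```python
# Illustrative assumptions only, not historical figures.
population = 1_000_000
vaccinated_share = 0.99          # assume almost every child is vaccinated
p_reaction = 0.01                # chance a vaccinated child has a reaction
p_death_given_reaction = 0.01    # chance that a reaction is fatal
p_smallpox_unvaccinated = 0.02   # chance an unvaccinated child gets smallpox
p_death_given_smallpox = 0.20    # chance that smallpox is fatal

vaccinated = round(population * vaccinated_share)
unvaccinated = population - vaccinated

# The misleading comparison: deaths in each group as observed.
deaths_from_vaccine = round(vaccinated * p_reaction * p_death_given_reaction)
deaths_from_smallpox = round(
    unvaccinated * p_smallpox_unvaccinated * p_death_given_smallpox
)

# The proper comparison: what if nobody had been vaccinated at all?
deaths_if_nobody_vaccinated = round(
    population * p_smallpox_unvaccinated * p_death_given_smallpox
)
deaths_with_vaccination = deaths_from_vaccine + deaths_from_smallpox

print("Deaths from the vaccine:", deaths_from_vaccine)
print("Deaths from smallpox (unvaccinated):", deaths_from_smallpox)
print("Deaths if nobody were vaccinated:", deaths_if_nobody_vaccinated)
print("Deaths with the vaccination programme:", deaths_with_vaccination)
```

Under these assumptions the vaccine appears to kill more children than smallpox does, yet the world without vaccination has far more deaths overall.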

So, it’s easy for data to be misunderstood if it is not correctly analyzed. But you also have to be cognizant that connections can be found everywhere in data. For example, would you think that a kid’s shoe size has a relationship to their reading ability? It might seem crazy, but the data will show one: older kids have bigger shoe sizes and are better readers than younger kids with smaller shoe sizes. Age drives both, so bigger shoes do not make anyone a better reader.

It is for reasons like this that we need to look further than the initial observation of data.
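The shoe size example can be sketched in a few lines. The children below are entirely made up, and the data is deliberately built so that shoe size and reading score are unrelated among children of the same age. It is a toy illustration, not an analysis of real measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical children as (age, shoe_size, reading_score).
# Within each age, every shoe size is paired with every reading score.
children = []
for age, shoe_base, reading_base in [(6, 30, 40), (8, 33, 60), (10, 36, 80)]:
    for shoe_offset in (-1, 1):
        for reading_offset in (-5, 5):
            children.append(
                (age, shoe_base + shoe_offset, reading_base + reading_offset)
            )

# Looking at all children together, shoe size and reading move together.
overall = pearson([c[1] for c in children], [c[2] for c in children])

# Looking at one age at a time, the relationship disappears.
within_age = {}
for age in (6, 8, 10):
    group = [c for c in children if c[0] == age]
    within_age[age] = pearson([c[1] for c in group], [c[2] for c in group])

print("Correlation across all children:", round(overall, 2))
print("Correlation within each age:", within_age)
```

The strong overall correlation comes entirely from age, which is why comparing children of the same age makes it vanish.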

## Key lesson three: The Ladder of Causation

The Ladder of Causation was developed to ensure that we look beyond the initial observation of data. It has three rungs. The first rung is based on our natural tendency to look for connections between the things we see around us. Stuck on this rung are animals and, interestingly enough, Artificial Intelligence programs. In terms of animals, consider something like a bird of prey. It will track its prey, carefully watching the way it moves, and from this it attempts to predict where the prey will move next. It does not need to know why the prey is moving. Self-driving cars face a similar limitation because they too are programmed to react to observations, so different scenarios have to be programmed into the car for it to be able to react accordingly. Data collection itself also sits on this first rung, because what is data collection but the documentation of passive observations?

The second rung of the ladder is about actively influencing outcomes. The best way to describe it is to consider what would happen if we actively did something. For example, if we changed how much people paid for toothpaste, would the sale of dental floss be affected? Computers cannot be programmed to ask these types of questions, which is why they remain on the first rung of the ladder. In order to test the effect of something, it is best to carry out a controlled experiment. This is common in scientific studies, where one group is exposed to a variable and another group is not, so that the effect of the variable can be seen clearly.

The third rung of the ladder is dedicated to humans and our ability to imagine how different actions can lead to different results. This relies on counterfactual models. Our ability to picture what would have happened if another action had been taken is something that computers cannot do. Humans are capable of ignoring some causal relationships because we can assess what is normal and what is not, and we know which variable has been introduced. Computers will assess all causes as equal.
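One common way to make a counterfactual question concrete is a three-step recipe: use what was observed to work out what is particular to the individual, change the action in the model, then recompute the outcome. The sketch below is a toy version of that recipe. The linear model, the numbers, and the names are all hypothetical assumptions chosen for illustration.

```python
# Hypothetical model (an assumption for illustration):
#   recovery_days = BASELINE_DAYS - TREATMENT_EFFECT * treated + individual_noise
BASELINE_DAYS = 10.0
TREATMENT_EFFECT = 3.0


def recovery_days(treated, individual_noise):
    """Predicted recovery time for one person under the assumed model."""
    return BASELINE_DAYS - TREATMENT_EFFECT * treated + individual_noise


# What actually happened: this patient was treated and recovered in 9 days.
observed_treated = 1
observed_days = 9.0

# Step 1: infer what is particular to this patient from the observation.
individual_noise = observed_days - recovery_days(observed_treated, 0.0)

# Step 2: imagine the other action, i.e. the patient was not treated.
imagined_treated = 0

# Step 3: recompute the outcome for the same patient under that action.
counterfactual_days = recovery_days(imagined_treated, individual_noise)

print("Observed recovery (treated):", observed_days)
print("Imagined recovery (untreated):", counterfactual_days)
```

The answer is about this particular patient, not the average patient, which is what separates a counterfactual from the experiment-style questions on the second rung.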

The three rungs of the Ladder of Causation are important. If you understand the three rungs, you can understand causal questions.

## Key lesson four: Confounders and Mediators

There are complicating factors that need to be identified on the rungs of the Ladder of Causation. These factors are called confounders. A confounder affects both which participants end up in each group of an experiment and the results, so it can masquerade as the effect you are trying to measure. Confounders are generally dealt with on the second rung of the ladder, as action is required to adjust an experiment once they are found. A simple example is age, which would be a confounder if your experimental group was much older than your control group.

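However, confounders are not always so easy to identify, which is why randomization in an experiment is important. Assigning participants to groups at random spreads confounders, even unknown ones, evenly across the groups on average. Randomization is not always possible given the area of study, but where it is, it is best practice to build it into the experimental design.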
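A small simulation can show why randomization helps with a confounder like age. This is a sketch with simulated participants; the age range, the group sizes, and the rule that older people choose the treatment are assumptions made up for the example.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated participants: ages drawn uniformly between 20 and 80.
ages = [rng.randint(20, 80) for _ in range(10_000)]

# Without randomization (hypothetical rule): older people opt into treatment.
self_selected_treated = [age for age in ages if age >= 50]
self_selected_control = [age for age in ages if age < 50]

# With randomization: shuffle, then split down the middle.
shuffled = ages[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
random_treated, random_control = shuffled[:half], shuffled[half:]

# Difference in average age between the treated and control groups.
gap_self_selected = mean(self_selected_treated) - mean(self_selected_control)
gap_randomized = mean(random_treated) - mean(random_control)

print("Age gap when people choose their group:", round(gap_self_selected, 1))
print("Age gap when groups are randomized:", round(gap_randomized, 1))
```

When people choose their own group, the treated group is decades older, so any difference in results could simply be due to age. After random assignment the two groups have nearly the same average age.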

In addition to confounders, another type of variable that contributes to analysis is the mediator. Mediators are variables that tell us why a cause leads to a given result. A simple example is a house with a fire alarm. Fire produces smoke, and it is the smoke that sets off the alarm, so smoke is the mediator between the fire and the alarm. Mediators exist on the third rung of the ladder as they are associated with counterfactuals. They are useful but can also be identified incorrectly. The most famous example in history is that of scurvy. We now know that Vitamin C can prevent scurvy, but when doctors of the time noticed that sailors who ate citrus fruits were getting better, they attributed it to the fruit’s acidity. This was the incorrect mediator, and it resulted in the deaths of many sailors who thought they were safe with the acidity provided by lime juice.

## Key lesson five: Determining the relationship between correlation and causation

How do we actually work out if correlation suggests causation? Firstly, you need to draw a causal diagram, much like Wright did when considering his guinea pigs. The causal diagram uses arrows to show which factors directly affect each other, which makes confounders and mediators easy to identify. Once you have your causal diagram, you can develop the mathematical formula for calculating a causal effect from the correlations in your data.
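A causal diagram is easy to represent in code, which hints at how confounders and mediators can be read off mechanically. The sketch below stores each diagram as a dictionary of arrows and uses the fire alarm and shoe size examples from earlier in this summary. The function names are hypothetical, and treating every common cause as a candidate confounder is a simplification.

```python
def descendants(graph, node):
    """All nodes reachable from `node` by following arrows."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def mediators(graph, cause, effect):
    """Nodes that sit on an arrow path leading from cause to effect."""
    return {
        node
        for node in descendants(graph, cause)
        if effect in descendants(graph, node)
    }


def common_causes(graph, first, second):
    """Nodes with arrow paths to both variables (candidate confounders)."""
    nodes = set(graph) | {child for kids in graph.values() for child in kids}
    return {
        node
        for node in nodes - {first, second}
        if first in descendants(graph, node)
        and second in descendants(graph, node)
    }


# Each key points to the factors it directly affects.
fire_diagram = {"fire": ["smoke"], "smoke": ["alarm"]}
shoe_diagram = {"age": ["shoe_size", "reading_ability"]}

print(mediators(fire_diagram, "fire", "alarm"))
print(common_causes(shoe_diagram, "shoe_size", "reading_ability"))
```

In the first diagram smoke is found between fire and alarm, and in the second age is found pointing at both shoe size and reading ability.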

To put this in a working example, consider testing the effectiveness of a drug that lowers blood pressure. The causal diagram will have arrows that link the drug to blood pressure, blood pressure to lifespan, and the drug to lifespan. Age in this situation can affect both lifespan and blood pressure, which makes it a confounder. The diagram now gives you the formula for calculating the probability of a given lifespan for an individual who has taken the drug.
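As a rough illustration of the kind of formula a diagram like this yields, the sketch below applies what is usually called the adjustment formula: take the outcome rate among drug takers within each age group, then average those rates weighted by how common each age group is. The counts, the two age groups, and the function names are all hypothetical. The sketch also assumes that age influences who takes the drug and that age is the only confounder.

```python
# Hypothetical counts: (age_group, took_drug) -> (people, people_who_lived_long)
counts = {
    ("young", True): (100, 90),
    ("young", False): (400, 320),
    ("old", True): (400, 240),
    ("old", False): (100, 50),
}
AGE_GROUPS = ("young", "old")
total_people = sum(people for people, _ in counts.values())


def naive_rate(took_drug):
    """Share of long-lived people among everyone with this drug status."""
    people = sum(counts[(age, took_drug)][0] for age in AGE_GROUPS)
    long_lived = sum(counts[(age, took_drug)][1] for age in AGE_GROUPS)
    return long_lived / people


def adjusted_rate(took_drug):
    """Adjustment formula: sum over age of P(long life | drug, age) * P(age)."""
    total = 0.0
    for age in AGE_GROUPS:
        people_in_age_group = counts[(age, True)][0] + counts[(age, False)][0]
        p_age = people_in_age_group / total_people
        people, long_lived = counts[(age, took_drug)]
        total += (long_lived / people) * p_age
    return total


naive_effect = naive_rate(True) - naive_rate(False)
adjusted_effect = adjusted_rate(True) - adjusted_rate(False)

print("Naive difference:", round(naive_effect, 2))
print("Age-adjusted difference:", round(adjusted_effect, 2))
```

In this made-up data the drug looks harmful in the raw comparison because older people take it more often, but it looks helpful once age is accounted for.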

The way this formula is determined follows a logical, step-by-step process, which gives it the potential to be implemented by computers. Imagine being able to programme the cause and effect process into a computer! All we would have to do is enter the data and assumptions before asking a question. The computer would then process the data and determine whether the question can be answered. If it can, the computer would create the mathematical formula and use it to provide an answer, along with the statistical uncertainty of that answer. If this were possible, the benefits would be life-changing for everyone! Hopefully, it’s just a matter of time before we can ask computers causal questions.

**The key takeaway from The Book of Why is:**

Causation is an important concept in research. Failure to understand causation, correlation and how to correctly analyze data can lead to major mistakes. It is possible to determine if correlation implies causation if you follow the logical process outlined in this summary. Causal diagrams, along with the identification of confounders and mediators, are key to this process. For now, it is up to us to understand this process but, in the future, we may be able to hand over this processing to computers as they learn to ask why themselves.

**How can I implement the lessons learned in The Book of Why?**

Experimental design is a crucial step in any study. Ensure that you identify your variable or cause and that you have a control group that will support your results. Randomization is key in obtaining unbiased and accurate results. It will also counteract any effects that confounders may have on your experiment.