Statistics are used so often these days that it’s hard to open any form of media without seeing a statistic screaming at you. In the past, statistics were used only by researchers who had access to data and the technical know-how. Nowadays, however, statistical software has made it possible for almost anyone to be a statistician. This, coupled with the fact that data can be easily accessed, means that statistics can be used for more than just research purposes.
Marketing, political campaigns and even news media utilise statistics to persuade the public. Worse yet, statistics are sometimes misconstrued to make a point that is not actually true. Even at the best of times, mistakes can be made in statistics, and without carefully assessing the data and results, we can easily be misled. So, ready to learn more about the art of statistics?
In this book summary, readers will discover:
- Statistics, data and bias
- Statistics in the media – accuracy versus storytelling
- Some common mistakes when interpreting statistics
Key lesson one: Statistics, data and bias
The field of statistics deals with data in all its forms. The process of working with data can be divided into five stages from beginning to end: problem, plan, data, analysis and conclusion, or PPDAC. Statisticians work through these stages in order. They identify the problem, design a strategy to deal with it, gather the relevant information and analyze it to reach an appropriate conclusion.
This order of events is applied to a multitude of situations. One such case in which the author was involved was the investigation into serial killer Harold Shipman. Shipman was a doctor in the United Kingdom who was arrested in 1998. A later public inquiry concluded that he had murdered at least 215 of his patients and was possibly involved in the deaths of another 45. A task force was set up to figure out whether his murders could have been picked up earlier. This was the problem they were faced with. The plan they came up with was to collect information in order to compare the deaths of Shipman’s patients with other deaths in the area and identify any discrepancies. With a plan in place, they began collecting the relevant data from 1977 onwards. The data was then analyzed and depicted as graphs. The graphs revealed that Shipman’s practice recorded a much higher number of deaths than others in the area and that the deaths occurred mainly in the afternoon, between 1 pm and 5 pm. The conclusion they drew from this study was that if the data had been monitored, Shipman’s murders could have been revealed at least 15 years before he was actually arrested.
This example shows how statistics can be used. The same process is followed most of the time, no matter what you are researching. One thing you can be sure of, though, is that human judgement is involved at every stage, and this means that data can be subjective. For example, if we had to determine how many boulders there are on the surface of the planet, we would first need a definition of a boulder. What is a simple rock to some could be a boulder to others. Data can also get skewed when the way something is measured changes at some point during the study. For example, if we counted the number of phone calls made to a suicide hotline in the last five years, we might find a drastic increase compared to the previous five years. However, the reason may not be that more people are contemplating suicide but rather that there have been campaigns to increase awareness of the hotline in recent years. Thus, we should never take it for granted that any data is a true representation of what is really happening, and the way it is interpreted can skew the picture further.
It is because of this that question design is one of the hardest parts of statistics. Not only do you have to be mindful of the data you want, but the tone and language used when asking the question are also crucial. Numerous studies have shown that language influences how people feel about a question, so their answers may not reflect their true feelings. Questions can also skew data through the answers that people have to choose from. The options may be too limited and thus not a true reflection of what respondents think. These examples show just how much statisticians have to deal with. Before they can even begin analysing the data, there is a chance that it is already skewed and thus misleading.
Furthermore, bias can also arise after analysis, when the data is presented. Graphs have become the norm when presenting data as they can be easily visualized and understood. However, for them to be accurate, they have to be designed carefully. Statisticians even consult with psychologists to determine how the results will be perceived! Take for example a report on the mortality rates due to heart disease at various hospitals in your city. A graph would be used to show you the clear differences between the hospitals. However, would you order them from highest mortality to lowest or lowest to highest? It is a small consideration, but the order may be wrongly perceived as a ranking. The hospital with the highest mortality could actually be the best one, which is why it receives the most patients.
Lastly, when considering scientific literature, there is a large amount of positive bias. This simply means that scientists are far more likely to publish data that supports their hypothesis than data that does not. The problem with this is that studies can produce false positives, and scientists may report on them in the belief that the study proved them right. It is because of this that you should not assume that research is conclusive just because it is published in a scientific journal.
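To see why published positives deserve caution, here is a minimal sketch of the arithmetic. Every number in it is an illustrative assumption rather than a figure from the book: 1,000 hypotheses tested, 10 per cent of them true, an 80 per cent chance of detecting a real effect, and a 5 per cent chance of a false positive when there is no effect.

```python
# All four inputs are hypothetical, chosen only to illustrate the reasoning.
n_hypotheses = 1000   # hypotheses tested across many studies
share_true = 0.10     # assumed share of hypotheses that are actually true
power = 0.80          # assumed chance a study detects a real effect
alpha = 0.05          # assumed chance of a false positive when no effect exists

true_effects = n_hypotheses * share_true
no_effects = n_hypotheses - true_effects

true_positives = true_effects * power    # real effects correctly detected
false_positives = no_effects * alpha     # "findings" that are not real

# If only positive results get published, this is the share of
# published findings that are false.
false_share = false_positives / (true_positives + false_positives)

print(f"Positive results: {true_positives + false_positives:.0f}")
print(f"Share of positive results that are false: {false_share:.0%}")
```

Under these assumptions, roughly a third of the positive results would be false, even though every individual study followed the rules. Different assumptions give different shares, so the figure is only a rough guide.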
Key lesson two: Statistics in the media – accuracy versus storytelling
Once the results of a study have been published, they are at the mercy of the media. Although work is being done to ensure that journalists understand statistics and how data should be interpreted, there is always a risk that the media will sacrifice accuracy for a better story. It’s easy to take data and sensationalize it to provoke emotions from the public, and this is exactly what some organizations do to increase views and reactions.
We have all seen it before. Headlines splashed across newspapers or websites state that “studies have shown” or “this many per cent of”, but they have no real basis for the claims. Even the author fell victim to such sensationalism when a comment he made in jest was reported by various media outlets shortly thereafter. Then there is the exaggeration of statistical results to provoke a strong emotional reaction from the public.
Key lesson three: Some common mistakes when interpreting statistics
Exaggerated risks can strike fear into people, especially when the report is about health. Take for example a report released by the World Health Organization that said that eating processed meat was associated with an 18 per cent increased risk of developing bowel cancer. This was widely reported by the media at the time. What the reports failed to state was that this number was relative to the 6 per cent risk for people who did not eat processed meat. This means that the risk for people who do eat processed meat rises by 18 per cent of 6 per cent, which is 1.08 percentage points. The resulting risk of 7.08 per cent does not look so scary, as it is only about a 1 percentage point increase in absolute terms.
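The arithmetic above can be checked with a short sketch. The 6 per cent baseline and the 18 per cent relative increase come from the example in the text, while the function name is made up for this illustration.

```python
def risk_after_relative_increase(baseline: float, relative_increase: float) -> float:
    """Return the new absolute risk after applying a relative increase."""
    return baseline * (1 + relative_increase)


baseline = 0.06            # risk for people who do not eat processed meat
relative_increase = 0.18   # the headline figure: 18 per cent higher risk

new_risk = risk_after_relative_increase(baseline, relative_increase)
absolute_increase = new_risk - baseline

print(f"New absolute risk: {new_risk:.2%}")
print(f"Absolute increase: {absolute_increase * 100:.2f} percentage points")
```

The same headline figure of 18 per cent would mean something very different with a baseline of 40 per cent, which is why a relative risk should always be read alongside the baseline it applies to.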
Another common mistake is the misuse of averages. Unless the type of average is specified, you cannot trust it. In fact, statisticians even make jokes about averages because they are misused so badly. There are three types of averages. The mean is calculated by adding up all the numbers in a data set and dividing the total by how many numbers are in the data set. The median is the number that lies in the middle of a data set when it is arranged in ascending order, and the mode is the number that appears most often in a data set. These three averages are appropriate in different situations, and unless it is clearly stated which one is used, an average should not be taken as the whole truth.
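A small sketch shows how far apart the three averages can be. The data set is invented for this illustration (think of seven incomes, one of them very large), and the calculations use Python's standard `statistics` module.

```python
import statistics

# Hypothetical data: six modest values and one extreme value.
incomes = [20, 22, 22, 25, 30, 35, 200]

mean_income = statistics.mean(incomes)      # sum divided by count
median_income = statistics.median(incomes)  # middle value when sorted
mode_income = statistics.mode(incomes)      # most frequent value

print(f"Mean:   {mean_income:.2f}")
print(f"Median: {median_income}")
print(f"Mode:   {mode_income}")
```

Here the single extreme value pulls the mean to about 50.57, while the median stays at 25 and the mode at 22. A report that quotes "the average" of this data could honestly give any of the three numbers.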
The next mistake is assuming that correlation implies causation. This is a very common mistake, and if results are not properly reported, they can be greatly misconstrued. Take for example the headline “If you live in a retirement village, your life expectancy is reduced.” Does this imply that a retirement village is bad for one’s health? Or simply that older people live in retirement villages? Just because two things are correlated, we should not be quick to assume that one causes the other. You have to carefully assess the data.
Lastly, probability is a concept that almost everyone struggles to understand. It is counterintuitive and leaves many people scratching their heads. This alone tells you that reporting the probability of something occurring without further explanation is likely to lead to a misinterpretation of the results.
The key takeaway from The Art of Statistics is:
The use of statistics has spread far and wide because data is easy to obtain and software is easy to use. However, this also means that people other than statisticians can analyze data and misinterpret it. When done correctly, statistics can be a powerful tool, but when misused and misunderstood, the results can be disastrous. It is therefore important that we understand how data works, what can go wrong with research and the questions we need to ask about results. Only when we are aware of these pitfalls can we begin to understand statistics better.
How can I implement the lessons learned in The Art of Statistics:
Don’t believe every statistic you hear! Nothing should be taken for granted. Ask questions, look for the source of the data used, and check whether the report states which average it is using and whether a risk is absolute or relative. A bit of scrutiny is needed.