Data (don’t) lie – Ciência Fundamental

by Edgard Pimentel

Correlations, causalities and wrong conclusions

A good strategy for learning about the world and preparing for it is to observe. We check the weather and grab an umbrella; we look around before crossing the street. We gather data, analyze it, and make decisions. The process sounds simple, but it can be quite complicated. The data can be voluminous and subject to inaccuracies, the methods of analysis are not always the best and, most importantly, our question could be the wrong one. After all, can anyone argue against data?

Thales of Miletus, who founded a cosmology independent of myths, is also known as an avid observer: according to legend, he even fell into a well while wandering and observing the stars. But armed with data, he is said to have predicted a solar eclipse and determined the date of the solstice. And according to Aristotle, he predicted favorable harvests and even concluded that the earth is round.

Thales was not alone. Hipparchus, Eratosthenes, and Ptolemy are just a few of those who collected observations and data to answer basic questions about the world. The accuracy of the Ptolemaic model is impressive, even though it rests on the geocentric hypothesis. The fall of the geocentric paradigm, the Copernican revolution, and Kepler’s laws all benefited from the data that the Dane Tycho Brahe collected at his complex on the island of Ven.

In these cases, a series of observations led to predictions. But the relationship between the data and the predicted phenomena is not clear. Was it causality? Did wintry weather conditions bring about good harvests in the following seasons? Or was there just a strong association between these facts?

Causality is subtle and related to the idea of implication. It occurs when one fact leads to another: one billiard ball collides with another, causing it to move; steam in a kettle triggers a mechanism. In the universe of data, the idea is the same. Suppose that an increase in government spending leads to an increase in aggregate demand and thus in employment. Whenever the data suggest that the first has occurred, we can expect the second. We can also use the first to bring about the second. Causality comes very close to the idea of a rule or a model.

Correlation is different. It can be the result of causality or of mere chance, and it can be misleading! In the book “Spurious Correlations”, Tyler Vigen brings together amusing examples. The number of civil engineering doctorates awarded in the US correlates strongly with the consumption of mozzarella cheese. The number of doctoral students in computer science correlates strongly with the sales of comics. A favorite: the number of students enrolled in American universities correlates almost perfectly with the number of home accidents caused by falling TVs.

So what? Very high correlations can also occur between unrelated facts. And they can be useful: if we know that there will be a lot of engineering doctorates next year, is it worth investing in mozzarella? And if the number of first-year university students increases, shouldn’t we keep a closer eye on the television at home? Not that there is any rule dictating the relationship between these facts. Still, a look at the data can show us a way.
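
To see how two unrelated quantities can end up strongly correlated, here is a minimal Python sketch. It is only an illustration: the two series are synthetic, generated independently of each other, and the names, starting values, and growth rates are made up for the example rather than taken from any real dataset. It shows one common way a spurious correlation can arise: both series simply grow over time. Comparing year-to-year changes instead of raw levels is one simple way to check how much of the correlation comes from the shared trend.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def trending_series(rng, start, yearly_growth, noise, years):
    """Synthetic yearly series: steady growth plus independent random noise."""
    return [start + yearly_growth * t + rng.gauss(0, noise) for t in range(years)]


def differences(xs):
    """Year-to-year changes of a series."""
    return [b - a for a, b in zip(xs, xs[1:])]


# Hypothetical, made-up series: neither one influences the other.
rng = random.Random(0)  # fixed seed so the run is reproducible
doctorates = trending_series(rng, start=500, yearly_growth=30, noise=10, years=20)
cheese = trending_series(rng, start=9.0, yearly_growth=0.2, noise=0.05, years=20)

# Correlation of the raw levels, dominated by the shared upward trend.
r_levels = pearson(doctorates, cheese)

# Correlation of the year-to-year changes, with the trend removed.
r_changes = pearson(differences(doctorates), differences(cheese))

print(f"correlation of levels:  {r_levels:.3f}")
print(f"correlation of changes: {r_changes:.3f}")
```

The correlation of the levels comes out close to 1 even though the series were built independently, while the correlation of the changes reflects only the random noise. A high correlation alone, in other words, says nothing about a rule linking the two quantities.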

So far the discussion has been, shall we say, platonic: we have assumed that the data are correct and describe exactly what we expect. In reality, that’s not how things work. Consider the 1991 and 2000 IBGE censuses. The data from each questionnaire (microdata) contain very valuable information. In particular, they allow us to compare different dimensions of economic and social life in the country at two points in time. But there are some catches.

The national currency was not the same in 1991 and 2000, nor was the number of municipalities in the country. In other words, even when the data are corrected and examined by very experienced analysts, there are subtleties that can lead to inaccuracies if those involved in the process do not coordinate. This was the case with the recent reports of allegedly expired vaccines, where a multidimensional information effort led practitioners to review the data, the conclusions, and how they were obtained. From the point of view of data analysis, the lessons and refinements that come out of these processes become social goods and improve people’s lives.

Whether through causality, through unimaginable connections, or even through the weirdness of the conclusions, the die is cast. We just have to ask.

Edgard Pimentel is a mathematician and professor at PUC-Rio.

Subscribe to Serrapilheira’s newsletter for more news from the Institute and the Ciência Fundamental blog.
