by Nina GidelDissler
Introduction
Understanding the world around us has always been an objective of humankind. One effective method has been to formulate a theory, support it with mathematical demonstrations, and then compare it against real-world observations. This has especially been the case in physics and chemistry over the last centuries: in his periodic classification of the elements from 1869, Mendeleïev not only classified the known elements, but also predicted the existence of elements that were only observed later.
However, in many fields like medicine, scientists must often go the other way around and first observe a phenomenon long enough before being able to claim that it follows certain rules. They must gather a large set of measurements and use statistics to help them confirm if there truly is an underlying phenomenon explaining their observations.
Statistical analyses are therefore pillars of scientific discoveries. But manipulated improperly, they can be extremely misleading! A famous example of this is Simpson’s paradox. Whether you are an embryologist, a statistician, or simply intrigued by mathematical puzzles, this paradox is bound to catch your attention. It’s a perfect example of how data can trick us into drawing the wrong conclusions, even when we think we’re being careful.
What is Simpson’s Paradox?
Simpson’s paradox is a statistical phenomenon where trends that appear in the overall dataset vanish or reverse when the data is divided into subgroups.
This makes Simpson’s paradox the nemesis of clinical evaluations: it can mislead scientists into thinking that there is no underlying phenomenon occurring in their observations or, even worse, lead them to draw conclusions that are the opposite of what is actually happening!
Hard to imagine, right? Well, here is an applied example!
Simpson’s paradox applied: the relationship between tobacco and fitness
Let’s imagine that you are a doctor in the early 20th century. Smoking is a widespread and socially accepted habit, but your daily consultations give you the feeling that tobacco might not be great for the human body.
To check if your intuition is correct, you gather a group of participants of 150 smokers and 150 nonsmokers and submit them to a fitness test. The results are the following:
This tells you that smoking improves physical abilities! You laugh at your original skepticism until you take a closer look at your measurements:
So tobacco seems to increase physical abilities overall, yet decreases them when men and women are examined separately? The per-gender results suggest that tobacco does negatively impact fitness. However, since most smokers in the sample are men, and men score higher on the test on average, smokers as a whole appear to have better physical abilities than nonsmokers. In this example, gender is called a “confounding variable”: it is a parameter that affects the “result” (fitness score) independently of the “explanatory variable” (smoking status), while also being unevenly distributed between smokers and nonsmokers.
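To make the reversal concrete, here is a minimal sketch in Python. The scores and the gender split are invented for illustration (they are not the measurements from the example above); the only things taken from the text are the 150 smokers, the 150 nonsmokers, and the fact that most smokers are men.

```python
from statistics import mean

# Hypothetical records (fitness score, is_smoker, gender), invented for illustration.
# Men score higher than women on this made-up test, and most smokers are men.
records = (
    [(70, True, "man")] * 120 + [(50, True, "woman")] * 30
    + [(75, False, "man")] * 30 + [(55, False, "woman")] * 120
)

def mean_score(smoker, gender=None):
    """Average fitness score for a smoking status, optionally within one gender."""
    return mean(
        score for score, is_smoker, g in records
        if is_smoker == smoker and (gender is None or g == gender)
    )

# Difference "smokers minus nonsmokers", overall and within each gender
overall_gap = mean_score(True) - mean_score(False)
gap_men = mean_score(True, "man") - mean_score(False, "man")
gap_women = mean_score(True, "woman") - mean_score(False, "woman")

print(f"overall: {overall_gap:+.1f}")
print(f"men:     {gap_men:+.1f}")
print(f"women:   {gap_women:+.1f}")
```

With these made-up numbers the overall gap is positive while both per-gender gaps are negative: the overall comparison mostly reflects the different mix of men and women in the two groups.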
Simpson’s Paradox applied in IVF
We can also observe the same paradox in IVF when looking at the association between sperm concentration and fertilization rate.
Imagine you are looking at the complete patient population of an IVF center, where both conventional IVF and ICSI are performed. You might observe that patients with a low sperm concentration have higher fertilization rates: this is because patients with a low concentration are often treated with ICSI and end up with similar or superior fertilization rates! Not accounting for the fertilization method as a confounding variable would lead to the incorrect conclusion that a low sperm concentration improves fertilization.
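The same arithmetic can be sketched with rates instead of averages. The counts below are hypothetical and are not clinical data; they only assume, as described above, that low-concentration patients are mostly treated with ICSI.

```python
# Hypothetical counts, invented for illustration:
# (sperm concentration, method) -> (fertilized oocytes, inseminated oocytes)
counts = {
    ("low", "ICSI"): (560, 800),
    ("low", "IVF"): (80, 200),
    ("normal", "ICSI"): (150, 200),
    ("normal", "IVF"): (480, 800),
}

def rate(concentration, method=None):
    """Fertilization rate for a concentration group, optionally for one method."""
    rows = [
        value for (c, m), value in counts.items()
        if c == concentration and (method is None or m == method)
    ]
    fertilized = sum(f for f, _ in rows)
    inseminated = sum(n for _, n in rows)
    return fertilized / inseminated

print(f"all methods: low={rate('low'):.2f}  normal={rate('normal'):.2f}")
for method in ("ICSI", "IVF"):
    print(f"{method:>11}: low={rate('low', method):.2f}  normal={rate('normal', method):.2f}")
```

In this sketch the low-concentration group has the higher rate over the whole population, yet the lower rate within each fertilization method, because its results are dominated by ICSI cycles.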
You might believe that the presented examples are far-fetched and that no clinician in their right mind would forget to consider gender in an evaluation of fitness, or the fertilization method in an evaluation of fertilization rate! You are correct, as those were meant to be easy-to-understand cases. But confounding variables are confounding for a reason: if we take the field of IVF as an example, scientists themselves know there are many things yet to discover (for example, we still can’t say with certainty what a ‘good’ kinetic development of an embryo is). This is why it is crucial to account for as many candidate confounding variables as possible, even those whose impact is uncertain.
Solutions
How do we account for confounding variables and avoid falling into the trap of Simpson’s paradox? Below, we outline three main options that we also implement at ImVitro: subgroup analysis, multivariate regressions and tree-based models.
Subgroup analysis
The most common approach is performing subgroup analyses. This method is straightforward in its implementation and interpretation: you check if the overall study conclusion remains valid over subsets of data based on suspected confounding factors.
💡 Figures 1 and 2 illustrate the results of subgroup analysis for 1) the fitness and 2) the fertilization problem: when grouping the data per gender and per fertilization method, the “true” conclusions appear.
Figure 1: Impact of tobacco on fitness score over the complete population (left) and per gender (right)
Figure 2: Impact of sperm concentration on fertilization rate over the complete population (left) and per fertilization method (right)
Yet, subgroup analyses face two major limitations depending on the problem at hand. First, dividing the population into smaller groups inevitably reduces the sample size of each group, which can weaken the results if the original sample is not that big. Second, in the case of quantitative confounding variables (e.g. woman’s age), one must subjectively choose a threshold on which the subgroups are based (e.g. <35 y.o. vs >35 y.o.), which might lead you to miss important associations.
Multivariate regressions
Multivariate regressions are mathematical models that, provided with a set of measurements, learn the association between explanatory variables and a result. Explanatory variables are considered together rather than separately to understand the phenomenon. This means any factor suspected of being a confounder can be included in the regression analysis to account for its impact. Not only do multivariate regressions preserve the sample size of the data (as opposed to subgroup analyses), but they can also handle any type of variable (binary, categorical, continuous, etc.), which makes them more objective than subgroup analyses, where the clinician must subjectively define a threshold.
💡 Figure 3 illustrates the result of a multivariate regression between

x: a student’s ability (the explanatory variable, quantitative)

y: the effort required by the student to perform a task (the result, quantitative)

while accounting for z: the teacher overseeing the student (the confounding factor, categorical)
Figure 3 – Multivariate regression of the combined impact of student ability and teacher on effort required for a task
We can see in this example that for teacher 1 the effort increases with ability, whereas for teacher -1 the effort decreases with ability: Simpson’s Paradox! One explanation can be that teachers have different pedagogical methods:

Teacher 1 asks students to work in groups of 2, one senior and one junior. The senior student must compensate for the junior, which increases the effort of the senior student and reduces the effort of the junior student.

Teacher 0 asks students to work individually but will focus on junior students to help them catch up. In the end, most students end up providing the same effort, with more or less help from the teacher, to complete the task.

Teacher -1 asks students to work individually and autonomously: junior students have a hard time completing the task whereas senior students provide low effort.
If a subgroup analysis had been performed on the data, one group per teacher would have been necessary, which would have drastically reduced the sample size!
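As a rough sketch of what "including the confounder in the regression" does, the Python snippet below compares the slope of y on x when a categorical variable z is ignored and when it is accounted for. The data are hypothetical, and the model is deliberately simplified: it assumes one intercept per level of z and a single shared slope, which is simpler than the per-teacher slopes of Figure 3. It relies on a standard property of least squares: centering x and y within each level of z and fitting a simple slope gives the same coefficient as a regression with one intercept per level.

```python
from collections import defaultdict
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def adjusted_slope(xs, ys, zs):
    """Slope of y on x in the additive model y = a_z + b * x.

    x and y are centered within each level of z before fitting,
    which removes the level-specific intercepts a_z.
    """
    groups = defaultdict(list)
    for x, y, z in zip(xs, ys, zs):
        groups[z].append((x, y))
    cx, cy = [], []
    for pairs in groups.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        cx += [x - mx for x, _ in pairs]
        cy += [y - my for _, y in pairs]
    return slope(cx, cy)

# Hypothetical data: within each group y falls as x rises,
# but group "B" has both higher x and higher y than group "A".
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [9, 8, 7, 6, 5, 14, 13, 12, 11, 10]
zs = ["A"] * 5 + ["B"] * 5

print(f"slope ignoring z:      {slope(xs, ys):+.2f}")
print(f"slope adjusting for z: {adjusted_slope(xs, ys, zs):+.2f}")
```

All ten observations contribute to the adjusted estimate, which is the sample-size advantage over fitting each subgroup separately.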
However, multivariate regressions often fail to capture nonlinear or complicated relationships between variables and the result. In other words, they can understand a variable where “higher is better” (e.g. high AMH D3 positively impacts fertility) or where “lower is better” (e.g. low woman’s age positively impacts fertility), but cannot handle more complex trends such as “lower is better, but too low is bad” (e.g. woman’s BMI: being obese or underweight can both negatively impact fertility).
Tree-based models
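A small sketch illustrates this limitation. The data are hypothetical: an outcome that is best in the middle of the range of x and worse at both extremes, with x and y standing for any such variable and result.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical data: the outcome peaks in the middle of the range of x
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [9 - x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"fitted slope: {slope:.2f}")
for x, y in zip(xs, ys):
    print(f"x={x:+d}  observed={y}  linear prediction={intercept + slope * x:.1f}")
```

The fitted line is flat even though the outcome depends strongly on x: a purely linear term has no way to represent "too low and too high are both bad".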
Similarly to multivariate regressions, tree-based models learn from the data they are provided with to understand the association between explanatory variables (including confounding ones) and the result. Contrary to multivariate regressions, tree-based models are not limited to linear relationships between variables and the result and can thus uncover more complex phenomena.
💡 Figure 4 below illustrates the interpretation by a tree-based model of the relationship between choice of travel mode (walk vs car) and multiple explanatory variables such as the distance of the travel, the number of cars owned, the density of traffic, the age of the traveler and the number of bicycles owned.
Each variable is reported on the x-axis and its impact (SHAP value) on the prediction is reported on the y-axis: a positive impact means the variable pushed the prediction toward “car”, whereas a negative impact pushed the prediction toward “walk”.
Figure 4 – Interpretation of the impact of multiple variables on travel mode choice according to a tree-based model
We can see that some variables present a relationship close to linear: the higher the traffic density or number of bicycles, the more likely the person is to take the bike (which would have been properly captured by a regression). Conversely, distance seems to have a relationship closer to logarithmic on the result, as it does not matter beyond 20 km.
Finally, the impact of the number of cars or age is not even monotonic. The people most likely to take a car are between 25 and 50 y.o., but also those above 70 y.o. And people with several cars are likely to drive, but those with more than 5 cars are not as likely as those with 2 or 3! These more complex relationships would not have been captured properly by multivariate regressions.
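To show how a tree can represent a non-monotonic relationship, here is a deliberately tiny regression tree written from scratch. It is a sketch under simplifying assumptions: a single explanatory variable, hypothetical data that peak in the middle of the range, and a maximum depth of 2. Practical tree-based models typically combine many such trees, and the function names `build_tree` and `predict` are illustrative, not taken from any library.

```python
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def build_tree(xs, ys, depth):
    """Fit a one-variable regression tree with greedy threshold splits.

    Returns either a leaf (the mean of ys) or a tuple (threshold, left, right).
    """
    best = None
    if depth > 0:
        for t in sorted(set(xs))[1:]:  # candidate split: x < t versus x >= t
            left = [y for x, y in zip(xs, ys) if x < t]
            right = [y for x, y in zip(xs, ys) if x >= t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, t)
    if best is None or best[0] >= sse(ys):
        return mean(ys)  # no useful split: make a leaf
    t = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < t]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (
        t,
        build_tree([x for x, _ in lo], [y for _, y in lo], depth - 1),
        build_tree([x for x, _ in hi], [y for _, y in hi], depth - 1),
    )

def predict(tree, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree

# Hypothetical data: the outcome is highest in the middle of the range of x
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [9 - x ** 2 for x in xs]

tree = build_tree(xs, ys, depth=2)
for x in xs:
    print(f"x={x:+d}  observed={ys[xs.index(x)]}  tree prediction={predict(tree, x)}")
```

Even with only two splits, the tree predicts a low outcome at both extremes and a higher one in between, a shape that a single linear coefficient cannot express.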
Application of the three methods in IVF: the impact of female BMI on fertility
It is commonly known that a woman’s BMI plays an important role in the ability to conceive and should thus be considered as a potential confounding variable in any study on fertility.
💡 As previously presented, the first approach could be to check the subgroups of normal vs overweight vs obese women using generic definitions. The results would show that categories with higher BMI have a reduced ability to conceive (Figure 5).
In this case, the main limitation to a subgroup approach is that the weight categories are not customized for IVF, and might therefore not be the most consistent splitting for fertility. Conversely, a multivariate regression would likely also show that an increased BMI is associated with lower ability to conceive, without being biased into focusing on generic BMI categories.
Lastly, Figure 6 shows that a too-low BMI can also be bad for fertility. This subtle impact of the confounding variable could only have been caught using a tree-based approach, as it goes beyond the “easy” negative impact of high female BMI on fertility.
Note: for our most attentive readers, you might notice that the results of Figure 6 consider etiology as a potential confounding variable on live birth by reporting the results on subgroups of etiology.
Figure 6 – Detailed analysis of the impact of female BMI on fertility
Comparison of the three methods
One limitation shared by subgroup, multivariate and tree-based approaches is that all confounding variables must first be identified: the methods cannot “guess” that a factor is biasing the results if they are not explicitly told it exists. However, as long as all confounding variables are identified and considered, each method has its pros and cons, and it is ultimately up to the scientist to select the one best suited to their study.
Conclusion
Simpson’s paradox happens when confounding variables are not considered in the statistical analysis of a phenomenon, leading to incorrect or even reversed conclusions. It is particularly important to be aware of it in IVF which is a heavily multifactorial treatment, with not just one but two or three patients involved.
Knowing its existence is the first step away from misinterpretations, and several methods can be used to handle it, each with strengths and limitations depending on the problem at hand. Most importantly, one must take a step back from their study to think about which parameters could potentially be confounding their conclusions, which is not a trivial question. Even in fields like embryology, many associations have been studied and understood but embryologists are still discovering the impact of many factors.
Ultimately, statistics are a tremendously powerful tool for discoveries in medicine. As everyone knows, with great power comes great responsibility, and statistics should be handled with care: they never lie… but they might not be answering the question you think!