Sounds of Unsound ‘P’! Tuning the Data or Striking the Wrong Note?…

Editor’s note: For a long time, many scientists’ careers have been built on the pursuit of a single statistical value: p < .05. In many disciplines, that is the cut-off beyond which results can be declared “statistically significant,” which is commonly taken to mean that the results were not a fluke. That, however, is not what it actually means.

In this article, we try to highlight the hypothesis myopia that afflicts researchers and analysts who gather shreds of evidence in support of a theory while ignoring explanations against it or its rationale. In the end, is statistical significance equivalent to clinically meaningful data? Or is it the outcome of torturing the data long enough that the data will confess at some point?

Recently, my wife asked me whether she needed to fast before a thyroid function blood test that her doctor had prescribed, and I was not sure. So I googled for an answer. The first result was a page describing the thyroid test procedures followed by the Cleveland Clinic.

It covered three tests, viz. Thyroid Stimulating Hormone (TSH), Thyroxine (T4) and Microsomal Thyroid Antibodies (TPO), and stated that none of them required fasting and that they could be done at any time of day. A moment later, I opened another page from my search results. It was a research paper, published in a medical journal, on the effect of fasting versus not fasting on the interpretation of thyroid function tests.

Interestingly, the study concluded that TSH levels showed a statistically significant decline postprandially (i.e., after a meal, without fasting) in comparison to fasting values. So I had come across an evidence-based viewpoint that did not favor the testing procedures followed in most diagnostic centers (such as those practiced at the Cleveland Clinic).

I was curious about this particular study nonetheless, for it used statistical techniques to back a divergent idea. After reading the paper, I noted that its conclusion effectively undermined the objective the study had set for itself. At the outset, the paper clearly declared the question it addressed: whether a fasting or non-fasting sample would make a clinically significant difference in the interpretation of thyroid function tests. Yet it did not do justice to this aim, because it did not sufficiently examine how the observed difference between fasting and non-fasting samples, which was somewhat expected in any case, should matter clinically.

This matters because statistical significance is merely a necessary condition, not a necessary and sufficient one, so the conclusion actually led nowhere. Only if statistical significance implied, and was implied by, clinical significance would it have been ‘necessary and sufficient’ to make an impact on diagnostic practices. The concluding statement was therefore self-limiting: it did not analyze the clinical implication of the difference for which the significance test was conducted. Many clinical studies that look into the effect of a particular treatment or factor suffer from this sufficiency deficit and end up trapped in a quandary over statistical significance vis-a-vis clinical significance.
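The gap between the two kinds of significance can be illustrated with a minimal sketch. It is not taken from the study: the difference of 0.1 units, the standard deviation of 1.0 and the group sizes are all hypothetical, and a simple two-sample z-test with a known common standard deviation is assumed purely for convenience. The point is only that a fixed, small difference moves from "not significant" to "highly significant" as the sample grows, while its clinical meaning stays exactly the same.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation (an assumption)."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


# Hypothetical values, chosen only for illustration.
MEAN_DIFF = 0.1   # observed difference between the two groups
SD = 1.0          # common standard deviation of the measurement

p_by_n = {n: two_sample_z_pvalue(MEAN_DIFF, SD, n) for n in (50, 500, 5000)}

for n, p in p_by_n.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:5d}  p = {p:.6f}  -> {verdict}")
```

Under these assumptions the p-value is roughly 0.6 with 50 subjects per group and far below 0.05 with 5,000 per group, although the difference itself never changes. Whether a difference of that size matters to a patient is a separate question that the p-value cannot answer.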

Certain other studies have also shown that early-morning blood samples taken after overnight fasting give higher TSH levels than those taken later in the day without fasting. Mary Shomon, the author of the New York Times best-seller “The Thyroid Diet Revolution: Manage Your Metabolism for Lasting Weight Loss,” while discussing various factors that can potentially influence the TSH level, such as medication and pregnancy, considered the fasting/non-fasting variation to be especially problematic in the clinical diagnosis of thyroid malfunction.

So knowing that the fasting/non-fasting difference was statistically significant added hardly any knowledge. In any case, a firm answer to my wife’s question remained elusive, as medical science was still not prepared to recognize the fasting/non-fasting variation in TSH and free T4 levels as clinically relevant (i.e., the ‘sufficiency’ condition is not automatically implied).

Nevertheless, it was quite evident to me that researchers and analysts in many cases suffer from hypothesis myopia: they collect evidence in support of a hypothesis while ignoring explanations against it or its rationale. There is a saying among statisticians: torture the data, and the data will confess. So they do not stop wrenching the data until it shows statistical significance, and the moment they get it, they go no further. This tendency is not new, and it has had perilous consequences in the past, with many a hyped discovery failing to hold on to its claimed statistical significance. Incidentally, I caught some glimpses on the web of the various abuses of statistical significance. They speak volumes about the degree of concern among scientists.

Clinical significance necessarily means statistical significance, but the converse is not always true.
Illustration by Meghna Chakrabarti

What bothered the scientists from a statistical point of view, of course, was researchers’ recourse to purposeful dredging and tweaking of data (p-hacking) until the elusive statistical significance is reached to invalidate a hypothesized proposition. The hypothesis (null hypothesis) is often plain guesswork about a phenomenon, framed without sufficient effort to describe or analyze the practical significance of the theory, or the risk that the conclusion will later prove irreproducible or inadequate in effect size.
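One common form of data dredging, testing many outcomes and reporting whichever one crosses the threshold, can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not a reconstruction of any real study: every outcome is pure noise (the null hypothesis is true for all of them), each is checked with a one-sample z-test with known standard deviation 1, and the counts of 20 outcomes, 25 observations and 1,000 simulated studies are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

ALPHA = 0.05          # significance threshold
N_OUTCOMES = 20       # outcomes examined per study (hypothetical)
N_PER_OUTCOME = 25    # observations per outcome (hypothetical)
N_STUDIES = 1000      # number of simulated studies


def z_test_pvalue(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def share_of_studies_with_a_hit(seed=0):
    """Fraction of studies in which at least one noise-only outcome
    reaches p < ALPHA."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(N_STUDIES):
        p_values = [
            z_test_pvalue([rng.gauss(0.0, 1.0) for _ in range(N_PER_OUTCOME)])
            for _ in range(N_OUTCOMES)
        ]
        if min(p_values) < ALPHA:
            hits += 1
    return hits / N_STUDIES


simulated = share_of_studies_with_a_hit()
expected = 1.0 - (1.0 - ALPHA) ** N_OUTCOMES  # for independent tests

print(f"simulated share with at least one 'significant' outcome: {simulated:.2f}")
print(f"theoretical share for independent tests: {expected:.2f}")
```

For 20 independent tests at the 5% level, the theoretical chance of at least one false "discovery" is 1 - 0.95^20, or about 64%, even though nothing real is there to be found. A researcher who keeps looking until something turns up significant is therefore more likely than not to succeed.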

Over the decades, the eagerness of researchers to publish papers on the strength of statistical significance, many of which were later shown to be false positives, assumed so alarming a proportion that scientists even started looking for an estimate of the percentage of published results subsequently proved wrong. Giving an idea of this quest in certain fields of science, the video here makes an impressive effort to show how such malpractice with manipulated statistical evidence does science no good.

So it is no wonder that a commotion has been created of late by more than 800 scientists who called for abandoning the use of statistical significance in scientific inference. In the sections that follow, I shall focus on just two factors that I consider central to the conceptual habits of researchers that give rise to data manipulation.

The problem may be partially rooted in dichotomania!

Statistical hypothesis testing based on samples of numeric observations of a quantitative characteristic (generally a continuous variable) is a widely used technique in every field of science; it uses a test statistic to decide whether or not to reject a null hypothesis about the characteristic. Essentially, it is a tool with which researchers conclude probabilistically whether, given one set of results (observations taken on a variable), a particular null hypothesis about the nature of variation of the variable remains plausible against an alternative hypothesis. The technique thus lets you dichotomize the range of possibilities into acceptable (say, white) and not acceptable (say, black). In doing so, you tend to ignore the bigger picture, a holistic view of the whole spectrum of colors. In the case I cited at the beginning, the study focused merely on two distinguishable shades of variability (i.e., variation due to fasting and not fasting) in TSH and free T4 levels among patient and non-patient categories, one purportedly more acceptable than the other, although the two are not necessarily in conflict. With so much overlap between the ranges of variation across the categories of sampled individuals, the ‘p’ is not indicative enough.

Why would you focus your lens on only two shades when the whole picture beyond your view is so colorful?
Illustration by Meghna Chakrabarti

A measurable quantity that can be represented by a continuous variable does not necessarily have a dichotomous character (e.g., black and white). So its range of variation should not be needlessly split into two parts by a cut-off line. That is to say, it may not always be essential to find out what puts a value of a continuous variable on one side of a cut-off rather than the other, when the cut-off itself is artificial to the variable.

Rather, in applying statistics to science we intend to study the natural variation of a phenomenon (not a man-made difference), or the effect of a man-made variation on the natural variation of an event. This rests on the fundamental premise that it is natural for a quantity representing a phenomenon to vary over a range of values (in which every conceivable value of the variable may occur). Most often the variation does not depend on other factors or variables so exactly that it can be described by a mathematical function, even though some unexplained relationship may exist. Unlike the laws of physics, which define exact relationships between physical quantities, laws of mathematical exactitude do not always exist for theories and phenomena in certain other fields of science (e.g., psychology and the life sciences).

Methods of statistical inference are therefore applied most extensively in these areas to show the effect of, or association with, other factors (variables) that may influence the variability of the variable under experiment. In doing so, researchers have to introduce assumptions about the nature of variation of the observed numbers. According to Sander Greenland, researchers’ cognitive biases (dichotomania being one of them), together with untenable assumptions, play a poorly understood role in null-hypothesis significance testing (NHST) and often result in false-positive significance, principally because biased intuitive reasoning tends to override logical consistency in inferential arguments. An editorial in The American Statistician (Vol. 73, Supplement 1, 2019) puts the essence of it as follows.

A label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical non-significance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics.
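A small worked example may make the cost of this dichotomization tangible. The two "studies" below are entirely hypothetical: both report the same effect estimate of 2.0, with standard errors of 1.0 and 1.1, and a normal approximation is assumed for the test and the interval. The sketch computes the two-sided p-value and the 95% confidence interval for each.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def summarize(estimate, standard_error, level=0.95):
    """Return (p_value, (ci_low, ci_high)) under a normal approximation."""
    z = estimate / standard_error
    p_value = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    half_width = STD_NORMAL.inv_cdf(0.5 + level / 2.0) * standard_error
    return p_value, (estimate - half_width, estimate + half_width)


# Hypothetical studies: identical estimates, slightly different precision.
study_a = summarize(estimate=2.0, standard_error=1.0)
study_b = summarize(estimate=2.0, standard_error=1.1)

for name, (p, (low, high)) in (("A", study_a), ("B", study_b)):
    label = "significant" if p < 0.05 else "not significant"
    print(f"study {name}: p = {p:.3f} ({label}), 95% CI = ({low:.2f}, {high:.2f})")
```

Study A comes out at p of about 0.046 and study B at about 0.069, so a strict 0.05 cut-off labels one "significant" and the other "not significant". Yet the estimates are identical and the intervals overlap almost completely, so reading the two results as contradictory would be a product of the cut-off, not of the data.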

Applying a standard theoretical model to observed data may be untenable

It is widely acknowledged in the science community that there is a considerable chance that a variable representing a phenomenon does not follow the particular probability distribution (theoretical model) that a hypothesis test requires in order to be valid. A prior investigation of the pattern of variability, using a large number of observations to learn how close to or far from the presumed model the pattern is, is therefore absolutely critical. Otherwise, blindly adopting a probability distribution, and so reifying the variability as conforming to the standard model, may be a far-fetched imposition.

In statistical hypothesis testing, a result is statistically significant when it would be very unlikely to occur if the null hypothesis were true. In other words, a result is statistically significant, by the standards of the assumed model, when p < α. Here α, the predefined significance level (generally small, 0.05 or less), is the probability of committing a type-I error, i.e., rejecting the null hypothesis when it is actually true; and p (the p-value of the study) is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. Taking α = 0.05, say, is intended to keep the chance of a type-I error small: only a 5% chance of wrongly rejecting a true null hypothesis under the model distribution assumed for the population at large. But if the real distribution does not conform to the assumed model, the probability of that error may well not be so low.
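How far the real type-I error rate can drift from the nominal α when the assumed model is wrong can be checked by simulation. The sketch below is one illustrative setup, not the only way to show it: a one-sample t-test (which assumes normally distributed data) is applied to samples of 10 observations, first from a normal distribution and then from a strongly skewed log-normal distribution. In both cases the null hypothesis is true, because the tested mean equals the true mean. The sample size, the number of trials and the choice of log-normal are assumptions made for the example; 2.262 is the standard two-sided 5% critical value of Student's t with 9 degrees of freedom.

```python
import math
import random

N = 10                 # observations per sample (hypothetical)
TRIALS = 5000          # simulated samples per scenario
T_CRIT_DF9 = 2.262     # two-sided 5% critical value, t with 9 d.f.


def rejection_rate(draw, true_mean, seed=0):
    """Share of samples in which a true null (mean == true_mean) is
    rejected by a one-sample t-test at the nominal 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(TRIALS):
        sample = [draw(rng) for _ in range(N)]
        mean = sum(sample) / N
        variance = sum((x - mean) ** 2 for x in sample) / (N - 1)
        t_stat = (mean - true_mean) / math.sqrt(variance / N)
        if abs(t_stat) > T_CRIT_DF9:
            rejections += 1
    return rejections / TRIALS


# Scenario 1: data really are normal, as the test assumes.
normal_rate = rejection_rate(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)

# Scenario 2: data are log-normal (skewed); the true mean is exp(0.5).
skewed_rate = rejection_rate(
    lambda rng: rng.lognormvariate(0.0, 1.0), true_mean=math.exp(0.5)
)

print(f"type-I error rate, normal data:     {normal_rate:.3f}")
print(f"type-I error rate, log-normal data: {skewed_rate:.3f}")
```

With normal data the rejection rate should stay close to the intended 5%. With the skewed data the same test rejects a true null hypothesis noticeably more often, so the analyst who believes the risk of a false positive is 5% is in fact running a larger one.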

The chance of committing a Type-I error may not actually be as small as you wanted if the probability distribution you assumed for the variable is unrealistic.
Illustration by Meghna Chakrabarti

A few years back, the American Statistical Association (ASA) released a policy statement aiming to halt the misuse of p-values. This was the first time that the 177-year-old ASA had made explicit recommendations on such a foundational matter in statistics. Explaining the significance of the ASA recommendation, Nature emphasized the need to weigh the evidence instead of blindly accepting a p-value of 0.05 or less to mean that a finding is statistically significant:

A p-value of 0.05 does not mean that there is a 95% chance that a given hypothesis is correct. Instead, it signifies that if the null hypothesis is true, and all other assumptions made are valid, there is a 5% chance of obtaining a result at least as extreme as the one observed. And a p-value cannot indicate the importance of a finding; for instance, a drug can have a statistically significant effect on patients’ blood glucose levels without having a therapeutic effect.

Recognizing how seriously a large section of scientists and statisticians feel constrained to publish their results selectively on the basis of a single magic number, a recent special issue of The American Statistician has set out what not to do with p-values and significance testing. The advisory comes with 43 innovative and thought-provoking papers that also guide researchers on what to do when facing research questions. In essence, the recommendations envisage a new world order for hypothesis testing, in which studies with “p < 0.05” and studies with “p > 0.05” are not automatically in conflict; researchers will therefore see their results more easily replicated and, even when they are not, will better understand why.

This blog originated in a dinner-table conversation at the Chakrabarti household, where Satyabrata Chakrabarti (the dad, and former Deputy Director General at the Central Statistics Office, Government of India) tried to convince his two daughters that statistical significance may or may not be equivalent to clinically meaningful data.

Cover image by Meghna Chakrabarti (L), Author Satyabrata Chakrabarti (M) and content research/editing/blog design by Rituparna Chakrabarti (R)

We publish using the Creative Commons Attribution (CC-BY) license so that users can read, download and reuse text and data for free – provided the authors, illustrators, and the primary sources are given appropriate credit.