From the Front Row: Using biostatistics and P-value in public health research

Published on March 20, 2023

Joe Cavanaugh, professor and head of the University of Iowa Department of Biostatistics, is this week’s guest. He chats with Amy and Anya about the central role that biostatistics plays in public health and medical research and explains the concept of P-value and its use in biostatistics.

Find our previous episodes on Spotify, Apple Podcasts, and SoundCloud.

Amy Wu:

Hello, everyone, and welcome back to From the Front Row. Behind a lot of public health evidence are numbers, and biostatisticians are the ones behind the scenes, interpreting those numbers every day.

One staple of many biostatistical tests is a number called the p-value. Most of us are taught that if the p-value is smaller than 0.05, you found something statistically significant. And if it’s larger than 0.05, your numbers were probably due to random chance. Because the p-value is used so widely in statistics, this concept has a huge impact on evidence-based decisions and research. But it turns out there is a lot of controversy behind the p-value. We have Dr. Cavanaugh, one of our biostatistics professors, on the show today to talk with us about just that.

Dr. Cavanaugh has published more than 160 peer-reviewed papers, and his research contributions span a wide range of fields, from cardiology to health services utilization to sports medicine to infectious disease, just to name a few. Outside of that, he’s an elected fellow of the American Statistical Association, an elected member of the International Statistical Institute, and has received several awards for teaching and mentoring. We’re glad to have him at the college and we’re very excited to have him on the show today to break down p-values for our audience.

I’m Amy Wu, joined today by Anya Morozov. If it’s your first time with us, welcome. We’re a student-run podcast that talks about major issues in public health and how they are relevant to anyone, both in and outside the field of public health. Welcome to the show, Dr. Cavanaugh.

Joe Cavanaugh:

It’s great to be here. Thank you for inviting me.

Amy Wu:

Of course. Before we get into the topic of today’s episode, can you tell us a little bit about your background and what folks can do with a biostatistics degree?

Joe Cavanaugh:

Yeah, absolutely. Most biostatisticians, they followed a rather non-linear path to the discipline. I went to a small STEM college as an undergraduate, Montana Tech in my hometown of Butte, Montana, and I received bachelor’s degrees in computer science and mathematics. And I found that I enjoyed computing, but I really wanted to program my own ideas.

I had a couple of programming jobs, one at a utility company and one at a national lab, but I was programming the ideas of engineers and physicists. And to me, the creative aspect of computing is the development of the algorithms. And as far as my mathematics degree goes, I really liked my math courses, but I realized I liked the more applied side of math.

So one of my undergraduate mentors suggested that I consider work in statistics, and I followed his advice and received my PhD in statistics from the University of California Davis. I spent the first 10 years of my academic career in a department of statistics at the University of Missouri. And then in 2003, I slightly changed course from stat to biostat, and moved here to the University of Iowa. So this is my 20th year as a Hawkeye.

What I liked about biostatistics is that it allows you to use your quantitative skills to help solve important practical problems. And the field has always been one where demand greatly exceeds supply, so the job market is excellent. And as far as the types of jobs that our graduates pursue, it’s pretty wide-ranging. We have some that work at pharmaceutical companies, biomedical research facilities, such as the Mayo Clinic, Fred Hutchinson Cancer Center, Memorial Sloan Kettering Cancer Center, high tech companies like Facebook and Google, and government agencies like the FDA, the NIH, and the CDC.

Amy Wu:

Awesome. Thanks for sharing. As a biostatistics student, I’m very excited to hear that there are a lot of job prospects out there.

Joe Cavanaugh:

Absolutely.

Amy Wu:

And I do enjoy the breadth of work that I have the potential of going into.

Anya Morozov:

Yeah, so we mentioned a little bit in the introduction about p-values. If they’re smaller than 0.05, generally you’ve found something statistically significant, and if they’re larger than 0.05, your findings might be due to random chance. But could you explain a little more and remind us what a p-value is for non-statisticians?

Joe Cavanaugh:

Yeah, absolutely. So in many statistical applications, you’re going to build a statistical model to investigate the possible association between an explanatory variable and an outcome of interest. So an example might be to investigate the association between influenza and flu vaccinations to determine the extent to which your risk is reduced if you’re vaccinated.

So to test this association, you formulate two hypotheses. There’s a null hypothesis, which assumes that the association or the so-called effect doesn’t exist. And then there’s an alternative hypothesis, which assumes that the association or effect does exist and is potentially important. So the p-value, it’s computed by assuming hypothetically that the null hypothesis is true, and then finding the probability of obtaining data similar to the data observed in your study under that assumption. So if this probability then is small, this might cause you to doubt the veracity of the null hypothesis and to view the alternative hypothesis as more credible.

So in more succinct terms, you can view the p-value perhaps as a type of conditional probability. It’s trying to address the question just what is the probability of the data, given that the null hypothesis is true. And to some extent, the p-value is founded on the notion of a proof by contradiction, because you’re assuming that the null hypothesis is true, and then you’re trying to determine whether or not the data discredits that assumption by getting a low probability.
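
To make the conditional-probability reading concrete, here is a minimal Python sketch. The numbers are hypothetical (9 successes in 10 trials, with a null hypothesis that success and failure are equally likely), and the function name `binomial_tail_p_value` is invented for illustration. "Data similar to the data observed" is formalized in the usual way as "data at least as extreme as the data observed."

```python
from math import comb


def binomial_tail_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, null_prob).

    The calculation assumes the null hypothesis is true and asks how likely
    it is to see a result at least as extreme as the one observed.
    """
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 9 successes out of 10 trials.
p = binomial_tail_p_value(successes=9, trials=10)
print(f"p-value = {p:.4f}")  # prints "p-value = 0.0107" (exactly 11/1024)
```

A probability this small says the observed data would be unusual if the null hypothesis were true. It does not give the probability that the null hypothesis itself is true.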

Amy Wu:

Yeah, so just also for the non-statisticians out there, a lower p-value would basically make you favor the alternative hypothesis over the null?

Joe Cavanaugh:

That’s correct. Yeah. And as you alluded to earlier, Amy, a common practice, which I’ll comment about in a bit, is to compare the p-value to a level of significance that is set at 0.05. 0.05 probability. So if the p-value is less than that, often you declare the result as being statistically significant. And if it’s greater than that, then you say that you don’t have the burden of proof met in order to reject the null in favor of the alternative. But that practice is problematic, so I think we’re going to get to that issue in just a bit.

Amy Wu:

Yeah, of course. So on that subject, could you give us a brief history on the controversy surrounding p-values? For example, last semester in your distinguished faculty lecture, you mentioned that the Basic and Applied Social Psychology journal decided to ban all p-values in 2015. So for non-statisticians, can you explain why they would do this, what are some of the pitfalls of p-values, et cetera?

Joe Cavanaugh:

Yeah, certainly. There’s a few different questions there that I’ll try to address, Amy. To begin with a bit of history about the p-value, it’s been around for about a century. It’s often credited to Ronald Fisher, who introduced the concept in 1925. It was introduced in the context of hypothesis testing, which is a paradigm designed for very specific types of studies, namely randomized experiments. So in biostatistics, clinical trials would be an example of a randomized experiment.

But as it turns out, since then, p-values have become much more pervasively used and often, I would argue, in contexts for which they were not designed. And they’re often misapplied and misinterpreted and used to justify conclusions that really are not warranted.

So over the last decade or two, scientists have become more concerned with reproducibility. There’s been a lot of backlash against p-values, and some scientists have suggested that the best way of dealing with the problems caused by the p-value is to just banish it altogether. But from my perspective, that is neither a practical nor an ideal solution.

Having said that, there are many problems with the p-value. One of the most significant issues is based on the practice that you alluded to earlier, comparing the p-value to the 0.05 level of significance in deciding whether to reject the null hypothesis and declare the existence of an effect if the p-value is less than 0.05. Now, the reason this is a problem is that the p-value can assume any value between zero and one. So it’s a continuous measure that should be evaluated on a spectrum of evidence.

To illustrate that idea a bit further, there’s very little practical difference between a p-value of 0.04 and 0.06. So making completely different decisions based on these two p-values is not rational. To say that if you have a p-value of 0.04, that the effect is significant, that you should pay attention to it. But if you have a p-value of 0.06, you haven’t met the burden of proof, and therefore, you shouldn’t doubt the null hypothesis.

Now, one of the problems that has resulted from this binary decision-making, it’s a practice known as p-hacking. That’s the practice of repeatedly analyzing data using different analytic techniques to obtain a p-value that is less than 0.05. So you might come up with a variety of different models, and the first time you get a p-value for the effect of interest of 0.14, you’re not happy with that. You reformulate the model, you get a p-value of 0.08, and you’re still not happy. Reformulate it again, you get a p-value of 0.04, and then you say, “Okay, I finally achieved the burden of proof, that level of significance.” P-hacking is one of the reasons that many studies are not reproducible.

So another problem with the p-value is that it allows you to assess statistical significance, but not clinical or practical significance. To explain the difference between those two ideas, suppose that we have a treatment for hypertension that’s designed to reduce systolic blood pressure, so you conduct a clinical trial to try to assess the efficacy of the treatment. Now, the result is statistically significant if you can establish that the mean change in blood pressure is non-zero, but the result is clinically significant if the mean change is substantial enough to impact a person’s health. You could argue that that’s a higher bar to attain than statistical significance.

So as it turns out, one can obtain a small p-value that leads to statistical significance if a small change is estimated with a high degree of accuracy. In that setting, you’re quite sure that the change is non-zero, but you’re also quite sure that the change is minor. And a small p-value can arise when an effect is accurately estimated, but it’s estimated to be small. Small enough that it probably is not clinically important or practically meaningful.

So how can you get around that problem? Well, confidence intervals or Bayesian credible intervals, they’re more informative because they provide a range of plausible values for the effect of interest. And the center of the interval, it represents what you could think of as the most likely value for the effect. It’s often the point estimate, the so-called point estimate of the effect. And the width of the interval reflects the accuracy of the effect estimate. And both the point estimate and its measure of accuracy are very important in coming up with an overall assessment of the effect of interest.

So the problem with the p-value is it’s taking these two important pieces of information, the point estimate and the measure of accuracy, often called the standard error, and conflating these two quantities by combining them into one number. And once you’ve collapsed those two quantities into one number, there’s no way of separating them out and determining what the two quantities are individually.
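
The point about collapsing two quantities into one can be sketched in a few lines of Python. The example assumes a normal approximation (a Wald-type test and interval). The two trials, their estimates, and the function name `p_value_and_interval` are hypothetical, chosen so that both share the same ratio of point estimate to standard error.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_value_and_interval(estimate, std_error, level=0.95):
    """Return a two-sided p-value (null: effect = 0) and a Wald confidence interval."""
    z = estimate / std_error
    p_value = 2 * STANDARD_NORMAL.cdf(-abs(z))
    critical = STANDARD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    half_width = critical * std_error
    return p_value, (estimate - half_width, estimate + half_width)


# Two hypothetical trials: very different effects, same estimate / std_error ratio.
small = p_value_and_interval(estimate=-1.0, std_error=0.4)
large = p_value_and_interval(estimate=-20.0, std_error=8.0)

for label, (p, (low, high)) in (("small effect", small), ("large effect", large)):
    print(f"{label}: p = {p:.4f}, 95% CI = ({low:.1f}, {high:.1f}) mmHg")
# small effect: p = 0.0124, 95% CI = (-1.8, -0.2) mmHg
# large effect: p = 0.0124, 95% CI = (-35.7, -4.3) mmHg
```

Both trials report the same p-value, so the p-value alone cannot tell them apart. The intervals show immediately that one effect is small and precisely estimated while the other is large and imprecisely estimated.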

Amy Wu:

Yeah, so I had one follow-up question. If you could briefly explain what reproducibility is in the context of p-values and p-hacking.

Joe Cavanaugh:

Yeah, absolutely. A study is reproducible if you can conduct a very similar study with the same outcome of interest, the same explanatory factor of interest, and get a similar result. And because all studies are inherently flawed to various degrees, reproducibility is very important because it’s the aggregation of evidence over a variety of different studies that starts giving us a definitive understanding of a particular phenomenon.

So for instance, we now widely accept the fact that there are a lot of bad health conditions that are a result of smoking cigarettes. But 50 years ago, that was not widely known. And all cigarette smoking studies are observational. You can’t do a randomized experiment where you break subjects into two groups and say, “You’re going to smoke, and you’re not going to smoke.” But because we have found over time that the ill effects of smoking are reproducible in different observational studies, that preponderance of evidence over a variety of different studies has led us to the conclusion that you shouldn’t smoke.

So the problem that has resulted with reproducibility in some studies is you’ll have a paper, say, that’s published, and it declares an effect as being statistically significant, and then the authors will say, “This is a conclusive result.” And other authors may say, “Well, I think that this phenomenon is important enough to investigate in a separate study with a different database, different population of interest,” and they may find no evidence at all of the same effect. And you could imagine that one of the reasons why that could happen is if you have authors that are repeatedly analyzing the data in order to get a p-value that’s less than 0.05 and then they publish the result once they find the right analysis that will give them that result, then it’s going to lead to a study that is not reproducible. So that has become a major issue recently in science where you have a study that is investigating a very important phenomenon. Other investigators want to see if they can replicate the result, and they’re unable to do so.
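
A rough Python sketch shows why repeated reanalysis undermines reproducibility. It assumes there is truly no effect, that each analysis yields a valid p-value (uniform between zero and one under the null), and that the analyses are independent. That last assumption is a simplification: reanalyses of the same data are correlated, so the real inflation is smaller than shown here, though still above 0.05. The function names are invented for illustration.

```python
import random


def chance_of_false_positive(num_analyses, alpha=0.05):
    """P(at least one p-value < alpha) across independent analyses of a null effect."""
    return 1 - (1 - alpha) ** num_analyses


def simulate_p_hacking(num_analyses, num_studies=100_000, alpha=0.05, seed=0):
    """Fraction of simulated null studies whose smallest p-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        # Under a true null hypothesis, a valid p-value is uniform on [0, 1].
        smallest = min(rng.random() for _ in range(num_analyses))
        if smallest < alpha:
            hits += 1
    return hits / num_studies


# Compare the formula with the simulation for 1, 3, and 10 analyses per study.
for k in (1, 3, 10):
    print(k, round(chance_of_false_positive(k), 3), round(simulate_p_hacking(k), 3))
```

Under these assumptions, three tries at the same null effect raise the chance of a "significant" finding from 5% to about 14%, and ten tries raise it to about 40%. A result obtained that way is unlikely to show up again in an independent study.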

Amy Wu:

Yeah. Well, thanks for extending on reproducibility. I also just wanted to retouch on clinical significance versus statistical significance. So what I’m hearing is that results can be statistically significant but not clinically relevant or important, say, to medical professionals, for example, or they could be both. So you’re kind of saying that p-values kind of can only speak to the former, where they’re only statistically significant and not-

Joe Cavanaugh:

That’s exactly correct, Amy. So you could imagine a situation where you have, say, a small effect where any physician would take a look at the effect and say, “Not worth taking that drug, because if it’s going to have such a minor impact on your health, that it’s just not worth it.” And you could imagine another study where you have a large effect, say in the hypertension example, where you have a drug that could potentially reduce your systolic blood pressure by as much as 20 to 30 points, which would be a major game changer.

Now, in either of those settings, if you have a highly accurate estimate, you’ll get a very small p-value, and so you’ll be able to establish statistical significance. In both of those settings, you can say fairly conclusively that the effect is non-zero. But in one case, it’s non-zero, but it’s very small and very close to zero. And in the other case, it’s non-zero and it’s substantial in magnitude. In the latter setting, you would have clinical or practical significance, whereas in the former setting, you would not. But in both of those settings, you would have statistical significance.
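
The two settings can be put side by side in a short Python sketch. Everything numeric here is an assumption made for illustration: the standard deviation of 15 mmHg, the sample sizes, the normal approximation with a known standard deviation, and especially the 10 mmHg clinical threshold, which is a placeholder and not clinical guidance.

```python
from math import sqrt
from statistics import NormalDist


def assess(mean_change, std_dev, n, clinical_threshold, alpha=0.05):
    """Classify a mean change as statistically and/or clinically significant."""
    std_error = std_dev / sqrt(n)  # a larger sample gives a smaller standard error
    z = mean_change / std_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return {
        "p_value": p_value,
        "statistically_significant": p_value < alpha,
        "clinically_significant": abs(mean_change) >= clinical_threshold,
    }


# Hypothetical trials: a tiny change measured very precisely, and a large change.
minor = assess(mean_change=-1.0, std_dev=15.0, n=4000, clinical_threshold=10.0)
major = assess(mean_change=-25.0, std_dev=15.0, n=20, clinical_threshold=10.0)
print("minor:", minor)
print("major:", major)
```

Both hypothetical trials clear the 0.05 bar, but only the second clears the clinical threshold. The first is statistically significant only because the large sample makes a 1 mmHg change estimable with high accuracy.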

Anya Morozov:

So it’s almost like the p-value is creating this incentive for folks in research to really strive for statistical significance, and the clinical or public health significance can become secondary to that if you’re focused so much on whether or not you’re hitting that 0.05 or whatever p-value you’ve set as statistically significant.

Joe Cavanaugh:

Yeah, that’s exactly right, Anya. And I will say that part of the problem is you’re incentivized, through the publication process, to have statistically significant results. So another problem that results in a lack of reproducibility is called publication bias. So there’s the feeling among editors of journals and reviewers of articles that if you don’t establish an effect that is statistically significant, then we haven’t learned anything. And yet a null result, a null finding can often be as informative or even more informative than a result that is statistically significant.

But if the journals are going to practice this tendency where they’re going to favor results that report statistically significant findings and ignore results that don’t report such findings, then basically you’re getting a very biased representation of a particular phenomenon. So you might have, for instance, a subtle effect. And in some studies, it’s showing up as statistically significant. In other studies, it’s not. And yet the only studies that are being published are those where you have statistical significance. So if you search the literature, you think, “Well, this effect is showing up consistently in a large variety of different studies,” because you haven’t seen all of the studies where it hasn’t shown up as being significant, due to the fact that those studies are not published.
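
The filtering described here can be simulated. The Python sketch below assumes a subtle true effect of 0.2 (arbitrary units), a standard error of 0.15 in every study, normally distributed estimates, and a rule that only results with a two-sided p-value below 0.05 are published. All of these values, and the function name `simulate_literature`, are hypothetical.

```python
import random
from statistics import mean


def simulate_literature(true_effect=0.2, std_error=0.15, num_studies=20_000, seed=1):
    """Simulate study estimates and keep only the 'statistically significant' ones."""
    rng = random.Random(seed)
    critical_z = 1.96  # two-sided 0.05 cutoff for a normal test statistic
    all_estimates = [rng.gauss(true_effect, std_error) for _ in range(num_studies)]
    published = [e for e in all_estimates if abs(e / std_error) > critical_z]
    return all_estimates, published


TRUE_EFFECT = 0.2  # hypothetical subtle effect
all_estimates, published = simulate_literature(true_effect=TRUE_EFFECT)
print(f"share of studies published: {len(published) / len(all_estimates):.2f}")
print(f"mean estimate, all studies: {mean(all_estimates):.2f}")
print(f"mean estimate, published:   {mean(published):.2f}")
```

In this setup only a minority of studies reach significance, and those that do must have estimates well above the true effect. A reader who sees only the published studies therefore finds an effect that looks both consistent and larger than it really is.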

Amy Wu:

So we just talked a little bit about the potential pitfalls of p-values. In your aforementioned lecture, you mentioned that the p-value is not going away. So are p-values still relevant?

Joe Cavanaugh:

Yeah, I definitely think that they are, Amy. They still have relevance in the settings for which they were designed, hypothesis testing in the context of randomized experiments such as a clinical trial that’s performed to assess the efficacy of a new drug by comparison to a placebo. But there’s a saying that you might have heard that goes along the following lines, when all you have is a hammer, every problem looks like a nail. And unfortunately, researchers often treat the p-value like a hammer and use it as a tool rather indiscriminately for problems where it is contextually inappropriate.

So from my perspective, p-values still have a place in statistics, a very prominent place, but they should probably be used much less pervasively. And in settings where a p-value is inappropriate, there are other inferential tools that are available. They’re not perhaps as widely known, but as statisticians, we should promote the use of those tools rather than always producing p-values because we think that is what is expected.

Anya Morozov:

Yeah, so along those lines, p-values can be useful in some settings, but not all. You have done some work on one alternative to the p-value, called the discrepancy comparison probability, or DCP. Can you tell us a little bit about that alternative?

Joe Cavanaugh:

Yeah, yeah, I’d be happy to. To provide a little bit of background, the p-value is often used to test for the existence of an effect in the context of a statistical model. That’s how we typically see p-values used in research, especially observational studies.

Now, the model under the alternative hypothesis, it contains the effect of interest, and the model under the null hypothesis does not. And the model often contains other variables of interest as well. So as an example, think of a prognostic model that is formulated to predict the onset of heart disease for middle-aged individuals. Now, the effect of interest might be a measure of physical activity, because we know that if you’re physically active, that that should reduce your risk of future heart disease. But you’ll probably want to include other variables in the model that could impact this relationship, such as age, BMI, cholesterol level, blood pressure, sex, ethnicity.

Now, if the p-value is small, we reject the null model in favor of the alternative model, and we claim that there is an effect. But the problem in this context is that the p-value can only be defined and interpreted under the assumption that one or the other model represents truth, because that’s the hypothesis testing paradigm. You have a null hypothesis, in this case, a null model, an alternative hypothesis, in this case, an alternative model, and both represent incompatible states of nature. One represents the truth, and one does not. That’s the hypothesis testing paradigm, and it’s up to you to try to use the data in order to try to decide which of those two competing states of nature is the most credible.

So where do you run into problems when you’re comparing two models in a hypothesis testing setting? Well, models, they’re only approximations to reality. They don’t represent reality. So the entire paradigm of hypothesis testing and p-values is really misaligned with statistical modeling. There’s a quote that is a favorite among statisticians that is attributed to George Box, who was a very famous statistician. It goes like this, “All models are wrong, some are useful.” So to unpack that quote, all models are wrong because all models are approximations to reality. They don’t represent reality. Some are useful because some are sufficiently accurate approximations for the inferential purpose at hand.

So the discrepancy comparison probability, or the DCP, it represents the probability that the null model is closer to the truth, to reality, than the alternative model, or that the null model is less discrepant from the truth or reality than the alternative model. And importantly, the DCP, it doesn’t assume that either model represents the truth. So it’s basically assessing the probability that the null model is a better approximation to the truth than the alternative model.

Now, like the p-value, the estimated DCP is going to be close to zero if the alternative model is markedly better than the null model. But unlike the p-value, the estimated DCP will be close to one if the null model is markedly better than the alternative model. So it tells you something if it’s small, if it’s close to zero. And it tells you something if it’s close to one, if it’s large.

So that actually points out another flaw with the p-value, and that is that a small p-value, it represents evidence against the null hypothesis, in favor of the alternative hypothesis. But as it turns out, a large p-value really doesn’t tell you anything, in that it represents an absence of evidence rather than evidence in favor of the null hypothesis. And you’ve heard that adage, absence of evidence is not evidence of absence. But often when researchers will come up with a large p-value, 0.5, they’ll say, “Well, this provides evidence that the null hypothesis is credible,” and that’s not the case.

But based on the way that the DCP is set up, it will lean towards one if the null model is a better approximation to reality than the alternative model, and it will lean towards zero if the alternative is a better approximation to reality. So it does provide evidence in support of either model, but again, only thinking of how well the models approximate the truth, not by trying to think that either model represents the truth.
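
The idea of asking which model lies closer to the truth, without treating either as true, can be illustrated with a deliberately simple stand-in. The Python sketch below is not the DCP estimator from Dr. Cavanaugh’s work, whose details are not given in this conversation. The discrepancy (squared prediction error on left-out observations), the bootstrap scheme, the simulated data, and every function name are assumptions chosen for illustration.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope; a flat line if x has no spread."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return y_bar, 0.0
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def toy_discrepancy_comparison(xs, ys, num_resamples=500, seed=0):
    """Share of bootstrap resamples in which the null (mean-only) model predicts
    the left-out observations at least as well as the alternative (linear) model."""
    rng = random.Random(seed)
    n = len(xs)
    null_wins = usable = 0
    for _ in range(num_resamples):
        picked = [rng.randrange(n) for _ in range(n)]
        left_out = set(range(n)) - set(picked)
        if not left_out:
            continue  # no left-out observations to evaluate on
        train_x = [xs[i] for i in picked]
        train_y = [ys[i] for i in picked]
        null_mean = mean(train_y)
        intercept, slope = fit_line(train_x, train_y)
        null_error = sum((ys[i] - null_mean) ** 2 for i in left_out)
        alt_error = sum((ys[i] - (intercept + slope * xs[i])) ** 2 for i in left_out)
        usable += 1
        if null_error <= alt_error:
            null_wins += 1
    return null_wins / usable if usable else float("nan")


# Simulated data with a strong linear effect (hypothetical).
data_rng = random.Random(42)
xs = [i / 4 for i in range(40)]
strong_ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]
print(toy_discrepancy_comparison(xs, strong_ys))
```

With a strong effect built into the data, the null model should almost never predict better, so the share lands at or near zero. The quantity is a comparison of approximations: neither the mean-only model nor the straight line is assumed to be the true data-generating mechanism.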

Amy Wu:

So along the lines of the holistic philosophy behind the DCP, with these non-binary or interpretations along the spectrum or continuum, could you talk about the role of biostatisticians in interpreting these complex and nuanced medical or public health type problems?

Joe Cavanaugh:

Yeah, absolutely. One point that I’d like to make is that statistical methods require advanced training. And part of the problem with the misuse of p-values is that sophisticated analyses are often conducted by researchers without the appropriate training. So Amy, you’re working on a graduate degree in biostatistics, and Anya, you’re working on a graduate degree in epidemiology, where you have to learn a lot of biostatistics.

So epidemiologists who are well trained in modern statistical methods, and biostatisticians, they’re more aware of what a p-value can tell you and what it cannot tell you. Also, they’re likely more aware of alternative measures of statistical evidence that have been introduced during the past few decades. So if you’re working on a particular study and you’re convinced that a p-value is not the best measure of statistical evidence, then you might be aware, if you have training in more advanced methods, of some of these alternatives.

Now, having said this, scientific paradigms are hard to change. And because p-values are so predominant in biomedical and public health research, it will take time to change the culture so that p-values are used more sparingly and in contexts where they’re more appropriate. But from my perspective, biostatisticians and statisticians, they really need to be willing to push the envelope and use some of these more modern and sophisticated inferential tools, such as, say, Bayesian posterior probabilities, Bayes factors, likelihood ratio statistics, and information criteria, such as the Akaike information criterion and the Bayesian information criterion.

Now, these phrases probably sound unfamiliar to students who’ve had an introductory course in statistics but haven’t gone beyond that course. And in that introductory course, they’re likely to remember two constructs, the p-value and the confidence interval. But if you’ve had more advanced training in biostatistics, you’re likely more aware of some of these tools, and there may be settings that arise in your research where you feel like you should advocate in favor of using something other than the p-value in order to address the inferential question of interest. And if we, as statisticians, always default to the p-value because we think that’s what editors of journals and referees of articles are going to expect, then the culture is never going to change.

Anya Morozov:

Well, I’m glad you’re here at the college and just generally kind of advocating for more nuance in how we interpret results of studies. I also think it kind of shows the importance of communication in any field, even biostatistics. I feel like that’s one where traditionally, I think maybe communication isn’t as important to being able to do biostatistics, but you have to be able to communicate. If you are proposing that change and not using the p-value to a lab full of people who maybe aren’t biostatisticians, you have to be able to talk about these other methods and why they may be a better fit.

Joe Cavanaugh:

That’s very well said, Anya. In fact, I will often say to our students that there is this perception when you’re in graduate school that being really good at math or being really good at computing, that those are the most important skills for a biostatistician. And they are, without question, important skills, and those are the types of skills that will often allow you to get good grades in your coursework.

But the most important skill, from my perspective, is to have very good oral and written communication skills for exactly the reason that you mentioned, because everything that we do is collaborative, and you don’t want to talk to your collaborators as though they have the same background that you do. A physician is not going to talk to a patient the same way that they would talk to another physician with expertise in that area.

So it’s a real art for an epidemiologist or a biostatistician to be able to distill the essence of an inferential result and communicate what it can tell you and what it can’t tell you in such a way that your collaborators understand what you’ve done with their data and what conclusions are warranted, what conclusions are not warranted. And it’s not easy to do. It takes a lifetime of practice. So I completely agree with you. And I think perhaps if we all communicated better as biostatisticians, then perhaps that would be a step in the right direction as far as using more appropriate inferential tools. Because when a setting does arise where you feel that a p-value isn’t appropriate, you can articulate the reasons why.

Anya Morozov:

Very well said. Now we’ll move on to our last question on the show. This is one that we ask to all of our guests. It can be related to p-values, biostatistics, or just everyday life. But what is one thing you thought you knew but were later wrong about?

Joe Cavanaugh:

Yeah. Well, this was a fun question to think about, Anya, and so I wanted to provide an academic answer and a completely non-academic answer. I’ll start with the academic answer. And this does tie in with our discussion about p-values. I think that learning about the politics and the messiness of publishing scientific research was a real eye-opener for me. When I was young, I believed that good research was published, and bad research was not. Now I realize that most research is imperfect, and that the evaluation of research is highly subjective. Also, to publish research, you often need to sacrifice idealism for pragmatism. But I would claim that once you understand the rules of the game, and that includes the problems that are endemic to the publication process, you realize that you can still conduct and publish good work and do so with ethics and integrity, but you’ll often need to battle to defend your principles.

So that’s one thing that is academic, related to research, that I thought I knew, had sort of an idealistic oversimplified view, and then later found out that I was misguided.

So here’s my non-academic answer, and this just occurred to me this morning. I’m a big fan of football, both college and the NFL. My favorite college team is, of course, the Hawkeyes, and my favorite NFL team is the Buffalo Bills. The Buffalo Bills were really bad for a long time, and in 2018, they drafted a new quarterback, Josh Allen, and I thought they’d made a horrible mistake, that they’d wasted this high first round draft pick on someone who would be a complete bust. And now, as it turns out, Josh Allen is one of the best quarterbacks in the NFL, and the Bills are actually a good team. So I’ve never been so happy to be so wrong. That’s my fun answer.

Amy Wu:

All right. Well, thanks, Dr. Cavanaugh, for joining us for this episode. It was very helpful to hear you explain p-values, their history, their pitfalls, their future, and also novel biostatistical methods, like the DCP, which you have worked on. And we’re very lucky that you’ve been able to explain it from a biostatistician’s perspective to our non-statistician audience. So yeah, thank you.

Joe Cavanaugh:

Thank you for having me, Amy and Anya. It’s been a pleasure.

Speaker 4:

That’s it for our episode this week. Big thanks to Dr. Joe Cavanaugh for joining us today. This episode was hosted and written by Amy Wu and Anya Morozov, and edited and produced by Anya Morozov. You can learn more about the University of Iowa College of Public Health on Facebook. And our podcast is available on Spotify, Apple Podcasts, and SoundCloud.

If you enjoyed this episode and would like to help support the podcast, please share it with your colleagues, friends, or anyone interested in public health.

Have a suggestion for our team? You can reach us at cph-gradambassador@uiowa.edu.

This episode was brought to you by the University of Iowa College of Public Health. Until next week, stay healthy, stay curious, and take care.