Layperson’s Guide to the Science Reproducibility Crisis

Science is largely responsible for the tremendous progress civilization has made over the last few centuries. It has also shaped how we see ourselves and our place in the universe. However, like any other human endeavor, science is not perfect. Recent years have seen a growing debate about the so-called “reproducibility crisis,” which refers to the inability to reproduce the results of many scientific studies. This is a serious issue that affects not only the scientific community but also society as a whole. In this post, I provide an accessible overview of the issue.

This post is based on an invited article that I wrote for a Uruguayan news outlet in February 2019. The original article is in Spanish and can be found here.

Reproducibility crisis and science culture

Science advances through a mechanism in which different hypotheses are generated and then tested experimentally, gradually refining our understanding of the world and leading us closer to near-absolute truths (e.g., the universe is 13.8 billion years old, Earth is warming due to human activity, etc.). However, this process is not entirely objective, because it is carried out by people within a particular cultural framework—specifically, a scientific culture—that determines which methods of doing and communicating science are acceptable, which evidence is considered valid, and which topics are interesting or controversial.

Like any other culture, scientific culture is strongly influenced by the personal interests of its members (career and prestige), by institutions that may function poorly (such as universities, funding agencies, and scientific societies), by inertia that perpetuates unhelpful customs (“we’ve always done it this way”), and by historical contingencies. This matters because, although the scientific method has proven effective in the long run (over decades and centuries), a scientific culture that promotes good practices makes knowledge advance more quickly, uses public resources more efficiently, and gives the public more reliable scientific information.

This scientific culture is at the center of the current debate on what is called the “reproducibility crisis.” To understand the debate, it’s important to clarify what reproducibility means. In general, scientific publications present the results of experiments used to argue for or against a hypothesis. Science is an ongoing discussion among existing hypotheses, advancing on the basis of accumulated experimental evidence. But for this process to work, published results must be reproducible, meaning that repeating the experiments yields similar results. This makes sense: if you do an experiment to study the effect of a low-sugar diet on heart health, you would expect similar outcomes if you repeated it; otherwise, the original results would be of limited value. Although complete reproducibility of all scientific publications is not feasible, a growing number of scientists argue that the proportion of non-reproducible studies (those that yield different results when repeated) is much higher than acceptable, and that flaws in scientific culture are to blame. (It’s worth noting that some fields are more affected than others: an estimated 50% of psychology studies cannot be reproduced; biology is next in line for concern, while physics is mentioned far less often.)

Why does this matter?

With the problem defined, a natural question arises: why does this matter for society? Its importance is evident in three main areas. The most visible effect appears in popular science communication (although the media also share some of the blame). A key recent example of the reproducibility crisis is the work of Dr. Brian Wansink, who ran a lab at the prestigious Cornell University in the United States studying people’s eating behaviors. Far from being a scientist cloistered in his lab, Wansink was a major figure in popularizing ideas about the psychology of eating: he published several successful popular science books and frequently appeared on TV shows, in documentaries, and in the media. His work generated widely shared recommendations like using smaller plates to serve food or avoiding eating while watching TV. However, Wansink’s scientific career came to an abrupt end in 2018 when it was discovered that his publications were riddled with errors and negligence. As a result, the conclusions he had drawn (and which had been widely publicized) were not in fact supported by solid scientific evidence. Although Wansink’s case is extreme in how far he took these poor scientific practices, many of those same practices—albeit in less extreme forms—are common in various research areas.

Wansink’s story also illustrates another societal impact of reproducibility problems: wasted resources. Over its lifetime, his lab spent millions of dollars in taxpayer funding and relied on the work of many bright people to run experiments (which could have been a solid investment if done properly). These effects then multiply: other labs invest resources building on Wansink’s results, only to find them irreproducible, which increases confusion in the scientific literature. Many young researchers trained in his lab also missed out on learning how to conduct solid, reproducible research, perpetuating issues into the next generation of scientists. Finally, Wansink’s work was the foundation of large government programs (costing millions of dollars) that aimed to implement his findings to improve eating habits in U.S. schools. It’s likely these interventions will not work, and that the effort and resources could have made a positive impact if based on more robust evidence. Although few individual researchers reach Wansink’s level of social influence, the cumulative effect of many “gray-area” practices—far more common and culturally accepted—is exponentially greater.

A third effect of reproducibility problems is their impact on advances in medicine and technology. For example, it has been reported that a large percentage of findings in cancer biology cannot be replicated by other scientists or by companies developing drugs for the disease. This creates greater uncertainty in drug development, requiring more time and money (billions of dollars) to test potential drugs, instead of being able to focus on fewer, but more reliable, candidates. A more recent debate concerns a potential reproducibility problem in machine learning (a branch of artificial intelligence), an area experiencing explosive growth in both academia and industry. These examples raise serious questions about the human and economic costs of a lack of robustness in science.

Causes of the reproducibility crisis

If the reproducibility problem is bad for society and for science, why does it happen? The causes are numerous and complex, but can be grouped into three main categories:

1) Gathering and interpreting data from complex systems (like living beings) is hard. There are many variables that can cause two repetitions of the same experiment to yield different results, either due to poor experimental design or random chance. Although there are many tools for experimental design and statistical analysis, in most scientists’ education these fields play a secondary or even tertiary role.

2) Scientists are often evaluated based on criteria that do not reward reproducible science but rather the sheer number of publications. Securing a publication that meets a journal’s minimal methodological requirements (which are not very high) is easier and faster than publishing a robust, reproducible study (which, for instance, might require larger sample sizes or multiple repetitions of an experiment). Consequently, it can be better for a researcher’s career to publish many less reliable papers than fewer but more trustworthy ones.

3) Another key factor in a scientist’s career—also assessed by funding agencies and universities—is how interesting the “story” behind the results appears to be (think of the appealing, intuitive stories Wansink told). Naturally, this can lead to a focus on narrative over empirical strength, such as downplaying (or burying) contrary evidence and overstating supportive data. Over time, these practices can make a published story look rock-solid when it’s really a product of selective evidence and exaggeration.
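The first cause above, random variability in small experiments, can be illustrated with a short simulation. This is a minimal sketch in Python; the effect size, noise level, and sample sizes are illustrative assumptions, not numbers from any real study:

```python
import random
import statistics

random.seed(42)

TRUE_EFFECT = 0.2  # assumed small real effect of the "treatment"

def run_experiment(n):
    """Simulate one experiment: compare n treated vs. n control measurements."""
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(n)]
    return statistics.mean(treated) - statistics.mean(control)

# Two repetitions of the same small experiment can disagree badly...
small = [run_experiment(20) for _ in range(2)]
# ...while repetitions of a large experiment converge on the true effect.
large = [run_experiment(20_000) for _ in range(2)]

print("estimated effect, n=20:   ", [round(e, 2) for e in small])
print("estimated effect, n=20000:", [round(e, 2) for e in large])
```

With only 20 participants per group, two honest repetitions of the “same” experiment routinely produce estimates that differ by more than the true effect itself; with 20,000 per group, the repetitions agree closely. This is one reason underpowered studies are so hard to reproduce, even when no one does anything wrong.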

What can be done?

Fortunately, there is reason for optimism. The international scientific community is increasingly aware of these issues, and there are new initiatives underway to address the three main causes mentioned above. Although it is a difficult task—like any cultural change—some fields are already showing important shifts. Examples include introducing new standards for statistical analysis and experimental design, revising methods for evaluating researchers, and requiring greater transparency (e.g., making all generated data publicly accessible), which allows for better error detection. Universities, funding agencies, societies, and scientific journals must support these initiatives against the inertia in academia to improve science’s overall social impact. It remains to be seen how long it will take for these efforts to be integrated into the culture, and whether they will succeed in boosting reproducibility in the scientific literature.

Finally, it helps to end with a note on how science fundamentally works. Reaching a scientific truth is typically a continuous process over many years. At any given time, multiple hypotheses compete, and individual studies provide evidence for or against them, but rarely is a single study decisive. As evidence accumulates, the scientific community reaches a consensus on certain aspects of the world and then builds on those conclusions. This is important because it means that science does not depend on each individual study being infallible; rather, it depends on weighing each study according to its value and robustness. The reproducibility problem makes it more difficult to use published results to advance knowledge, but it does not stop science from functioning as it always has. What does this mean for people who want to stay informed about scientific developments? It means that no single study can tell us whether eating chocolate or having a daily glass of wine is good or bad for our health, whether a certain educational method is best for our children, or whether the key to happiness lies in getting an hour of sun every day or forcing ourselves to smile. Collections of studies—such as science-based books—are generally a better source of reliable information, though as in the case of Wansink’s books, there can still be pitfalls. We simply need to maintain a healthy level of skepticism about what we read, recognize the difference between scientific consensus and individual findings claiming to have found definitive answers, and choose trustworthy sources that prioritize the robustness of the evidence over “telling a good story.”