Better Methods for the Reduction Part of the Three Rs

G. Scott Lett, Ph.D.
CEO, The BioAnalytics Group LLC

In past articles, I’ve highlighted research efforts that study the impact of environmental enrichment (EE) on research outcomes. There is mounting evidence that improved environmental conditions result in more relevant research results. But what challenges face researchers who want to make a change for the better? Published literature, when available, isn’t always clear and consistent about what constitutes best practice for a particular animal species, nor is it always clear what the impact is on research endpoints. Not all researchers can afford to take time out from their regular research to undertake a rigorously designed study to answer these questions. How do we make the transition to EE and make sense of the growing mountain of literature? In this article, I write about meta-analysis, an approach to analyzing published experimental results from multiple prior studies.

The Opportunity and the Challenge
Environmental Enrichment Fights Cancer and Improves Research Results—What Now for the Biomedical Researcher?

In the October 2010 issue of The Enrichment Record, Emily Patterson-Kane and I reported on the research of Cao et al, published in the July 9 issue of Cell. The researchers spent 5 years, used 1500 mice and painstakingly demonstrated significant effects of environmental enrichment (EE) on cancer outcomes. Based upon Cao et al and other studies, the evidence now suggests that EE is not just a more humane option for research animals, but is necessary to develop better animal models of human diseases. However, we also pointed out some major challenges in making the transition to EE. For the researcher, it is important to understand how the change to EE will affect her results. It is also important to know “best practice” for the care of research animals, yet standards have not been set for every species.

Not many researchers can afford to undertake a 5-year 1500-mouse study to determine best practices and measure the effects. In another study, Hanno Würbel (2007) used 432 mice and experiments run in replicate in multiple laboratories to support the conclusion that EE does not disrupt standardization of experiments. Undertaking such studies in every laboratory will produce valuable data, but seems to sacrifice one of the three Rs of animal testing (Reduction) in favor of another (Refinement). Small studies can help make the transition more affordable, but may miss significant effects, due to small sample size. Published studies may give inconclusive or conflicting results, causing us to wonder which results to believe.

Sample Size and Statistical Power

Before we discuss meta-analysis, let’s look at the relationship between sample size and the reliability of research results. It is well understood that there is a great deal of variability in biomedical research. There are both biological sources of variability and technical sources. It is no surprise that similar mice don’t all respond identically to the same treatment. A researcher cannot measure the response of all mice, so we use data from a small group of mice (the sample) to predict the behavior of all similar mice (the population). For example, a researcher may want to measure the startle response time of a group of mice. The distribution of startle response times of normal mice might look something like the traditional “bell curve” as seen in figure 1.

Figure 1: Hypothetical Distribution of Response Times in Mice. Most are near 7.5 milliseconds.

In this hypothetical sample, the average response time is 7.5 milliseconds. We can see that some mice have response times as high as 8.5 ms and more, but most tend to cluster around 7.5 ms. A researcher would like to take a small sample of mice and measure their response times in order to predict the response times of all similar mice. How many mice are required to get a good estimate of the responses? Suppose three researchers each measure the response times of 3 mice.

Here’s what their results might look like:


Response time (ms)    Steve    Amy    Ted
Mouse 1               7.8      7.5    7.7
Mouse 2               7.6      7.7    7.3
Mouse 3               7.5      7.2    7.4
Average               7.6      7.5    7.5


Figure 2: Hypothetical Response Time Measurements: Researchers get different results from similar mice.

We see that Amy and Ted measured average response times of about 7.5 ms, both very close to the “true” mean of 7.5 ms. Not bad! Steve, on the other hand, measured an average response of about 7.6 ms. Does this mean he made a mistake in his measurements? No, it is just the natural biological variability of this type of mouse. The expected variation of the average (its standard error) for a sample of 3 of these mice is 0.14 ms, so all three researchers were within the expected error range.

The expected error goes down as the sample size goes up. If Steve had used 9 mice instead of 3, his expected error would drop from 0.14 ms to 0.08 ms, and with 100 mice it would drop to about 0.02 ms.
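The shrinking error can be reproduced with the usual standard-error formula, SE = SD / sqrt(n). The sketch below is illustrative only. The population standard deviation of 0.25 ms is an assumption, chosen because it roughly reproduces the figures quoted above; the article itself does not state it.

```python
from math import sqrt

# ASSUMPTION: population standard deviation of startle response times (ms).
# The article does not give this value; 0.25 ms is picked because it roughly
# reproduces the expected errors quoted in the text.
ASSUMED_SD_MS = 0.25


def standard_error(sd, n):
    """Standard error of the mean for a sample of n independent measurements."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sd / sqrt(n)


for n in (3, 9, 100):
    se = standard_error(ASSUMED_SD_MS, n)
    print(f"n = {n:3d}: expected error of the average ~ {se:.3f} ms")
```

Under this assumption the three values come out at roughly 0.14, 0.08 and 0.025 ms. Because the error shrinks with the square root of the sample size, halving it takes four times as many mice.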

How many mice does Steve need? This is an important question of study design. Suppose we have two groups of mice: one group housed in standard cages and environment, and the other housed in EE conditions. The two populations might have slight but important differences in startle response times, but the difference is difficult to see in small studies because their “bell curves” overlap, as seen in the hypothetical example in figure 3.

The average response for the EE group is 7.7 milliseconds, compared to 7.5 milliseconds for the standard group, but if Steve uses only 3 mice in each group, there’s a 75 percent chance he won’t detect a significant difference. In fact, there is a 17 percent chance the EE group will appear to have a SHORTER response time than the standard mice! If Steve wants an 80 percent chance of detecting a significant difference, he must use at least 20 mice in each group. The probability of correctly detecting a true effect is called the statistical power of the study. Good study design attempts to balance the power of the study with the desire to conserve precious resources, like animals, money and time.

Figure 3: Hypothetical Responses for Standard and EE Mice. Overlapping distributions make it more difficult to detect a difference.
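A rough power calculation for this example can be sketched as follows. The article does not say which test or which standard deviation lies behind its percentages, so everything here is an assumption: a one-sided z-test at the 5 percent level, a known standard deviation of 0.25 ms in both groups, and equal group sizes. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def se_of_difference(sd, n_per_group):
    """Standard error of the difference between two group means."""
    return sd * sqrt(2.0 / n_per_group)


def power_one_sided(delta, sd, n_per_group, alpha=0.05):
    """Chance that a one-sided z-test detects a true difference `delta`.

    ASSUMPTION: sd is known and identical in both groups.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    shift = delta / se_of_difference(sd, n_per_group)
    return 1.0 - STD_NORMAL.cdf(z_crit - shift)


def prob_wrong_direction(delta, sd, n_per_group):
    """Chance that the observed difference has the opposite sign to `delta`."""
    return STD_NORMAL.cdf(-delta / se_of_difference(sd, n_per_group))


# True difference of 0.2 ms (7.7 vs 7.5); sd of 0.25 ms is an assumed value.
for n in (3, 20):
    print(f"n = {n:2d} per group: power ~ {power_one_sided(0.2, 0.25, n):.2f}")
print(f"n =  3 per group: P(wrong direction) ~ {prob_wrong_direction(0.2, 0.25, 3):.2f}")
```

Under these assumptions the power is roughly 25 percent with 3 mice per group and roughly 80 percent with 20, and the chance of a reversed difference with 3 mice is about 16 percent, in line with the figures above. A t-test, which estimates the standard deviation from the data, would generally have lower power for samples this small.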


In statistics, a meta-analysis combines the results of several published studies into a larger “meta-study.” In its simplest form, a meta-analysis identifies a common measure of effect size across all the studies, in order to get a better estimate of the true effect size than any single study, carried out under a single set of assumptions and conditions, can provide. Another aim is to identify small but important differences in effect sizes that might be missed in a single study.

Finally, meta-analysis can help to identify hidden biases in published studies. The idea is really quite simple: by combining the results of published studies, we might get a better picture of best practices and effects of EE than could be seen in any particular published paper.
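One common way to combine studies is fixed-effect, inverse-variance pooling, in which each study's effect estimate is weighted by the reciprocal of its squared standard error. The sketch below shows that calculation. It is one of several pooling methods, not necessarily the one used in the studies cited in this article, and the four "studies" are invented numbers for illustration only.

```python
from math import sqrt


def pool_fixed_effect(effects, std_errors):
    """Fixed-effect, inverse-variance pooled estimate and its standard error."""
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect, and at least one study")
    weights = [1.0 / se ** 2 for se in std_errors]  # precise studies count more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se


# HYPOTHETICAL data: difference in response time (EE minus standard, in ms)
# from four small studies, each with a standard error of 0.20 ms.
effects = [0.30, 0.10, 0.25, 0.15]
std_errors = [0.20, 0.20, 0.20, 0.20]

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
print(f"pooled effect ~ {pooled:.2f} ms, standard error ~ {pooled_se:.2f} ms")
```

With these made-up inputs, none of the four studies is convincing on its own, but the pooled standard error is half that of any single study. That is the sense in which combining small studies recovers statistical power.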

Karl Pearson is credited with the first published meta-analysis in 1904, studying the effects of inoculation against enteric fever. Combining studies with small sample sizes, he attempted to overcome the problem of reduced statistical power caused by the small samples. Gene V. Glass is credited with first using the term “meta-analysis” and is widely recognized as the modern founder of the method.

Meta-analysis has been successfully used to study environmental enrichment. For example, Averos et al (Applied Animal Behaviour Science, 2010) studied the effects of enrichment on the performance of pigs, and Janssen et al (An enriched environment improves sensorimotor function post-ischemic stroke, Neurorehabil Neural Repair, 2010) were able to sort through conflicting reports and show efficacy of EE using meta-analysis.

The File Drawer Problem—Biased Published Results

One potential weakness of meta-analysis is the dependence on published studies, which may create exaggerated outcomes. It is very hard to publish studies that show no significant results. For any given research area, one cannot know how many studies have been conducted but never reported, with the results filed away. Remember Steve’s study design with 20 sample mice in each group? 20 percent of the time he won’t detect a difference between standard and EE mice, and he may not be able to publish the results. If all the results were published, we would expect to see a bell curve distribution of differences between EE and standard mice. However, if the non-significant results are never published, we see a distribution that looks more like figure 4.

Figure 4: Hypothetical Distribution of Results, showing only the “significant results.” The 20 percent of results that are not statistically significant go unpublished and remain in the researchers’ file drawers.

This file drawer problem results in distributions that are biased, skewed or completely cut off, and the significance of the published studies can be overestimated. Savvy meta-analysts use techniques to detect these biases and correct for them, but it would be much better to retain these results.
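The size of this bias can be illustrated with a small simulation. The sketch below assumes many laboratories repeat Steve's 20-mice-per-group design and that only results passing a one-sided z-test at the 5 percent level are published. The standard deviation of 0.25 ms, the publication rule and the function names are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_study(rng, n_per_group, true_diff, sd, baseline=7.5):
    """Observed difference in mean response time (EE minus standard) for one study."""
    standard = [rng.gauss(baseline, sd) for _ in range(n_per_group)]
    enriched = [rng.gauss(baseline + true_diff, sd) for _ in range(n_per_group)]
    return mean(enriched) - mean(standard)


def file_drawer_demo(n_studies=5000, n_per_group=20, true_diff=0.2,
                     sd=0.25, alpha=0.05, seed=1):
    """Return (mean of all results, mean of published results, fraction published)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    se_diff = sd * sqrt(2.0 / n_per_group)
    # ASSUMED publication rule: one-sided z-test with known sd must be significant.
    cutoff = NormalDist().inv_cdf(1.0 - alpha) * se_diff
    all_diffs = [simulate_study(rng, n_per_group, true_diff, sd)
                 for _ in range(n_studies)]
    published = [d for d in all_diffs if d > cutoff]
    return mean(all_diffs), mean(published), len(published) / n_studies


all_mean, published_mean, fraction = file_drawer_demo()
print(f"mean effect, all studies:       {all_mean:.3f} ms")
print(f"mean effect, published studies: {published_mean:.3f} ms")
print(f"fraction published:             {fraction:.2f}")
```

Because the smallest observed differences are the ones left in the file drawer, the average of the published results sits above the true difference of 0.2 ms, even though every individual study was run honestly.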


Good study design can optimize precious resources, including animals, money and time. Since researchers cannot afford to size their samples to detect an effect 100 percent of the time, meta-analysis can help sort through existing data and develop best practices for EE. Meta-analysis can be an effective tool for moving to better animal care while practicing the “Reduction” of the 3 Rs. The “file drawer problem” can limit our ability to re-use data in meta-analysis. We call for public repositories, where both unpublished and published data can be made available to the research community, providing better information for future meta-analyses.

Enrichment Record October 2011

Volume 9, October 2011
