Measuring inconsistency in meta-analyses.

Julian P T Higgins, statistician,¹ Simon G Thompson, director,¹ Jonathan J Deeks, senior medical statistician,² and Douglas G Altman, professor of statistics in medicine²

Short abstract

Cochrane Reviews have recently started including the quantity I² to help readers assess the consistency of the results of studies in meta-analyses. What does this new quantity mean, and why is assessment of heterogeneity so important to clinical practice?

Systematic reviews and meta-analyses can provide convincing and reliable evidence relevant to many aspects of medicine and health care.¹ Their value is especially clear when the results of the studies they include show clinically important effects of similar magnitude. However, the conclusions are less clear when the included studies have differing results. In an attempt to establish whether studies are consistent, reports of meta-analyses commonly present a statistical test of heterogeneity. The test seeks to determine whether there are genuine differences underlying the results of the studies (heterogeneity), or whether the variation in findings is compatible with chance alone (homogeneity). However, the power of the test depends on the number of trials included in the meta-analysis. We have developed a new quantity, I², which we believe gives a better measure of the consistency between trials in a meta-analysis.

Need for consistency

Assessment of the consistency of effects across studies is an essential part of meta-analysis. Unless we know how consistent the results of studies are, we cannot determine the generalisability of the findings of the meta-analysis. Indeed, several hierarchical systems for grading evidence state that the results of studies must be consistent or homogeneous to obtain the highest grading.²⁻⁴

Tests for heterogeneity are commonly used to decide on methods for combining studies and for concluding consistency or inconsistency of findings.⁵,⁶ But what does the test achieve in practice, and how should the resulting P values be interpreted?

Testing for heterogeneity

A test for heterogeneity examines the null hypothesis that all studies are evaluating the same effect. The usual test statistic (Cochran's Q) is computed by summing the squared deviations of each study's estimate from the overall meta-analytic estimate, weighting each study's contribution in the same manner as in the meta-analysis.⁷ P values are obtained by comparing the statistic with a χ² distribution with k-1 degrees of freedom (where k is the number of studies).
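The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes inverse-variance weights (one common choice; the text says only that the weights match those used in the meta-analysis), the function name `cochran_q` is hypothetical, and the three studies are made-up numbers.

```python
def cochran_q(estimates, variances):
    """Return (Q, df) for study effect estimates and their variances.

    Assumption: inverse-variance weights, w_i = 1 / variance_i.
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    # Weighted average of the study estimates (the overall meta-analytic estimate).
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    return q, len(estimates) - 1


# Hypothetical log odds ratios and variances for three studies (not real data).
log_odds_ratios = [-0.2, -0.9, -1.6]
variances = [0.10, 0.20, 0.25]
q, df = cochran_q(log_odds_ratios, variances)
print(f"Q = {q:.2f} on {df} degrees of freedom")
```

When every study reports the same estimate, Q is zero; the more the estimates scatter relative to their precision, the larger Q becomes.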

The test is known to be poor at detecting true heterogeneity among studies as significant. Meta-analyses often include small numbers of studies,⁶,⁸ and the power of the test in such circumstances is low.⁹,¹⁰ For example, consider the meta-analysis of randomised controlled trials of amantadine for preventing influenza (fig 1).¹¹ The treatment effects in the eight trials seem inconsistent: the reductions in odds vary from 16% to 93%, with some of the confidence intervals not overlapping. But the test of heterogeneity yields a P value of 0.09, conventionally interpreted as being non-significant. Because the test is poor at detecting true heterogeneity, a non-significant result cannot be taken as evidence of homogeneity. Using a cut-off of 10% for significance¹² ameliorates this problem but increases the risk of drawing a false positive conclusion (type I error).¹⁰

Fig 1

Eight trials of amantadine for prevention of influenza.¹¹ Outcome is cases of influenza. Summary odds ratios calculated with random effects method

Conversely, the test arguably has excessive power when there are many studies, especially when those studies are large. One of the largest meta-analyses in the Cochrane Database of Systematic Reviews is of clinical trials of tricyclic antidepressants and selective serotonin reuptake inhibitors for treatment of depression.¹³ Over 15 000 participants from 135 trials are included in the assessment of comparative drop-out rates, and the test for heterogeneity is significant (P = 0.005). However, this P value does not reasonably describe the extent of heterogeneity in the results of the trials. As we show later, a little inconsistency exists among these trials but it does not affect the conclusion of the review (that serotonin reuptake inhibitors have lower discontinuation rates than tricyclic antidepressants).
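The two P values quoted so far can be checked from the Q and df values reported in table 1. The sketch below is an illustration only: the helper `chi2_sf` is a hypothetical name, and it relies on the closed-form sums for the upper tail of a χ² distribution that hold when the degrees of freedom are a whole number. It uses only the Python standard library and is intended for moderate values of Q and df.

```python
import math


def chi2_sf(q, df):
    """Upper-tail probability P(X > q) for a chi-squared variable with integer df."""
    if df < 1 or q < 0:
        raise ValueError("df must be >= 1 and q must be >= 0")
    y = q / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i=0}^{df/2 - 1} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{i=1}^{(df-1)/2} y^(i - 1/2) / Gamma(i + 1/2)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += y ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total


# Q and df as reported in table 1.
for label, q, df in [("amantadine", 12.44, 7), ("antidepressants", 179.9, 134)]:
    print(f"{label}: P = {chi2_sf(q, df):.3f}")
```

The amantadine trials have the larger ratio of Q to df (about 1.8 against about 1.3) yet the larger P value, which is the dependence on the number of studies that the text describes.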

Since systematic reviews bring together studies that are diverse both clinically and methodologically, heterogeneity in their results is to be expected.⁶ For example, heterogeneity is likely to arise through diversity in doses, lengths of follow up, study quality, and inclusion criteria for participants. So there seems little point in simply testing for heterogeneity when what matters is the extent to which it affects the conclusions of the meta-analysis.

Quantifying heterogeneity: a better approach

We developed an alternative approach that quantifies the effect of heterogeneity, providing a measure of the degree of inconsistency in the studies' results.¹⁴ The quantity, which we call I², describes the percentage of total variation across studies that is due to heterogeneity rather than chance. I² can be readily calculated from basic results obtained from a typical meta-analysis as I² = 100%×(Q - df)/Q, where Q is Cochran's heterogeneity statistic and df the degrees of freedom. Negative values of I² are put equal to zero so that I² lies between 0% and 100%. A value of 0% indicates no observed heterogeneity, and larger values show increasing heterogeneity.
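As a minimal sketch of this formula (the function name `i_squared` is ours, not part of any package), the calculation and the truncation at zero look like this in Python. The two calls use Q and df values from tables 1 and 2.

```python
def i_squared(q, df):
    """I² as a percentage: 100 * (Q - df) / Q, with negative values set to zero."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


print(round(i_squared(12.44, 7)))  # amantadine, table 1: prints 44
print(round(i_squared(15.3, 23)))  # albumin risk ratios, table 2: Q < df, prints 0
```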

Examples of values of I²

The principal advantage of I² is that it can be calculated and compared across meta-analyses of different sizes, of different types of study, and using different types of outcome data. Table 1 gives I² values for six published meta-analyses along with 95% uncertainty intervals. The upper limits of these intervals show that conclusions of homogeneity in meta-analyses of small numbers of studies are often unjustified.¹¹,¹³,¹⁵⁻¹⁹

Table 1

Heterogeneity statistics for examples of meta-analyses from the literature. Meta-analyses were conducted using either meta or metan in STATA¹⁵

				Heterogeneity test
Topic	Outcome/analysis	Effect measure	No of studies	Q	df	P	I² (95% uncertainty interval)*
Tamoxifen for breast cancer¹⁶	Mortality	Peto odds ratio	55	55.9	54	0.40	3 (0 to 28)
Streptokinase after myocardial infarction¹⁷	Mortality	Odds ratio	33	39.5	32	0.17	19 (0 to 48)
Selective serotonin reuptake inhibitors for depression¹³	Drop-out	Odds ratio	135	179.9	134	0.005	26 (7 to 40)
Magnesium for acute myocardial infarction¹⁸	Death	Odds ratio	16	40.2	15	0.0004	63 (30 to 78)
Magnetic fields and leukaemia¹⁹	All studies	Odds ratio	6	15.9	5	0.007	69 (26 to 87)
Amantadine¹¹	Prevention of influenza	Odds ratio	8	12.44	7	0.09	44 (0 to 75)

df=degrees of freedom.

*Values of I² are percentages. 95% uncertainty intervals are calculated as proposed by Higgins and Thompson.¹⁴
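The interval calculation itself is not given in this article. The sketch below assumes the test-based interval for H = √(Q/(k − 1)) on the log scale, which is our reading of the method in reference 14, converted to the I² scale and truncated at zero. The function name `i_squared_interval` is hypothetical. It is offered as an illustration under those assumptions rather than as the authors' implementation.

```python
import math


def i_squared_interval(q, k, z=1.96):
    """Approximate 95% uncertainty interval for I² (in %) from Q and k studies.

    Assumption: a normal interval for ln(H), with H = sqrt(Q / (k - 1)),
    converted to I² = 100 * (H² - 1) / H² and truncated at zero.
    """
    if k < 3 or q <= 0:
        raise ValueError("this sketch requires k >= 3 studies and Q > 0")
    df = k - 1
    log_h = 0.5 * math.log(q / df)
    if q > k:
        se = 0.5 * (math.log(q) - math.log(df)) / (math.sqrt(2 * q) - math.sqrt(2 * k - 3))
    else:
        se = math.sqrt((1 / (2 * (k - 2))) * (1 - 1 / (3 * (k - 2) ** 2)))

    def to_i2(log_h_value):
        h_squared = math.exp(2 * log_h_value)
        return max(0.0, 100.0 * (h_squared - 1) / h_squared)

    return to_i2(log_h - z * se), to_i2(log_h + z * se)


# Amantadine row of table 1: Q = 12.44 from k = 8 trials.
low, high = i_squared_interval(12.44, 8)
print(f"{low:.0f} to {high:.0f}")
```

For the amantadine row this gives an interval from 0% to about 75%, in line with the table. Because of rounding in the published Q values, and possibly differences in method, not every row need be reproduced exactly.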

The tamoxifen and streptokinase meta-analyses, in which all the included studies found similar effects,¹⁶,¹⁷ have I² values of 3% and 19% respectively. These indicate little variability between studies that cannot be explained by chance. For the review comparing drop-outs on selective serotonin reuptake inhibitors with tricyclic antidepressants, I² is 26%, indicating that although the heterogeneity is highly significant, it is a small effect.

The reviews of trials of magnesium after myocardial infarction (I² = 63%) and case-control studies investigating the effects of electromagnetic radiation on leukaemia (69%) both included studies with diverse results. The high I² values show that most of the variability across studies is due to heterogeneity rather than chance. Although no significant heterogeneity was detected in the review of amantadine,¹¹ the inconsistency was moderately large (I² = 44%).

Figure 2 shows the observed values of I² from 509 meta-analyses in the Cochrane Database of Systematic Reviews. Almost half of these meta-analyses (250) had no inconsistency (I² = 0%). Among meta-analyses with some heterogeneity, the distribution of I² is roughly flat.

Fig 2

Distribution of observed values of I² based on odds ratios from 509 meta-analyses of dichotomous outcomes in the Cochrane Database of Systematic Reviews. Data are from the first subgroup (if any) in the first meta-analysis (if any) in each review, if it involved a dichotomous outcome and at least two trials with events. Meta-analyses conducted with metan in STATA¹⁵

Further applications of I²

I² can also be helpful in investigating the causes and type of heterogeneity, as in the three examples below.

Methodological subgroups

Figure 3 shows the six case-control studies of magnetic fields and leukaemia broken down into two subgroups based on assessment of their quality.¹⁹ If heterogeneity is identified in a meta-analysis, a common option is to subgroup the studies. Because of loss of power, non-significant heterogeneity within a subgroup may be due not to homogeneity but to the smaller number of studies. Here, the P values for the heterogeneity test are higher for the two subgroups (P = 0.3 and P = 0.009) than for the complete data (P = 0.007), which suggests greater consistency within the subgroups. However, the values of I² show that the three low quality studies are more inconsistent (I² = 79%) than all six (I² = 69%) (table 2). Substantially less inconsistency exists among the high quality studies (I² = 17%), although uncertainty intervals for all of the I² values are wide.

Fig 3

Meta-analyses of six case-control studies relating residential exposure to electromagnetic fields to childhood leukaemia.¹⁹ Summary odds ratio calculated by random effects method

Table 2

More advanced applications of I² for assessing heterogeneity in three published meta-analyses. Meta-analyses were conducted with either meta or metan in STATA¹⁵

				Heterogeneity test
Topic	Outcome/analysis	Effect measure	No of studies	Q	df	P	I² (95% uncertainty interval)*
Magnetic fields and leukaemia¹⁹	All studies	Odds ratio	6	15.9	5	0.007	69 (26 to 87)
	High quality	Odds ratio	3	2.4	2	0.31	17 (0 to 91)
	Low quality	Odds ratio	3	9.4	2	0.009	79 (32 to 94)
Human albumin for critically ill²⁰	Death	Risk ratio	24†	15.3	23	0.88	0 (0 to 17)
	Death	Risk difference	30	36.7	29	0.15	21 (0 to 50)
Tamoxifen to prevent recurrence of breast cancer¹⁶	All studies	Peto odds ratio	55	108.2	54	0.00002	50 (32 to 63)
	Total within groups‡	Peto odds ratio	-	59.9	52	0.21	13 (0 to 39)
	Between groups‡	Peto odds ratio	3 groups	48.3	2	<0.00001	96 (91 to 98)

df=degrees of freedom.

*Values of I² are percentages. 95% uncertainty intervals are calculated as proposed by Higgins and Thompson.¹⁴

†Studies with no events in either treatment group do not contribute to this analysis.

‡Subgroup defined by duration of tamoxifen treatment.

Heterogeneity related to choice of effect measure

A systematic review of clinical trials of human albumin administration in critically ill patients concluded that albumin may increase mortality.²⁰ These studies had no inconsistency in risk ratio estimates (I² = 0%) and a narrow uncertainty interval. Table 2 shows the heterogeneity statistics for risk differences as well as for risk ratios. Six trials with no deaths in either treatment group do not contribute information on risk ratios, but they all provide estimates of risk differences. Using P values to decide which scale is more consistent with the data²¹ is inappropriate because of the differing numbers of studies. I² values may validly be compared and show that the risk differences are less homogeneous, as is often the case.²²

Clinically important subgroups

I² can also be used to describe heterogeneity among subgroups. Table 2 includes results for the outcome of recurrence in the meta-analysis of trials of tamoxifen for women with early breast cancer. There was highly significant (P = 0.00002) and important heterogeneity (I² = 50%) among the trials.¹⁶ However, a potentially important source of heterogeneity is the duration of treatment. The authors divided the trials into three duration categories and presented an overall heterogeneity test, a test comparing the three subgroups, and a test for heterogeneity within the subgroups. I² values corresponding to each test show that 96% of the variability observed among the three subgroups cannot be explained by chance. This is not clear from the P values alone. The extreme inconsistency among all 55 trials in the odds ratios for recurrence (I² = 50%) is substantially reduced (I² = 13%) once differences in treatment duration are accounted for.
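The three I² values in this example follow directly from the Q and df entries in table 2, and the within-group and between-group Q statistics add up to the overall Q. The short Python sketch below illustrates that arithmetic; the function name and the dictionary labels are ours.

```python
def i_squared(q, df):
    """I² as a percentage: 100 * (Q - df) / Q, with negative values set to zero."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


# Q and df for the tamoxifen recurrence analysis, taken from table 2.
components = {
    "all 55 trials": (108.2, 54),
    "within duration groups": (59.9, 52),
    "between duration groups": (48.3, 2),
}

for label, (q, df) in components.items():
    print(f"{label}: Q = {q}, df = {df}, I² = {i_squared(q, df):.0f}%")

# The overall statistic partitions into within-group and between-group parts.
q_within, df_within = components["within duration groups"]
q_between, df_between = components["between duration groups"]
print(round(q_within + q_between, 1), df_within + df_between)  # prints 108.2 54
```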

How much is too much heterogeneity?

A naive categorisation of values for I² would not be appropriate for all circumstances, although we would tentatively assign adjectives of low, moderate, and high to I² values of 25%, 50%, and 75%. Figure 2 shows that about a quarter of meta-analyses have I² values over 50%. Quantification of heterogeneity is only one component of a wider investigation of variability across studies, the most important being diversity in clinical and methodological aspects. Meta-analysts must also consider the clinical implications of the observed degree of inconsistency across studies. For example, interpretation of a given degree of heterogeneity across several studies will differ according to whether the estimates show the same direction of effect.

Advantages of I²

Focuses attention on the effect of any heterogeneity on the meta-analysis
Interpretation is intuitive—the percentage of total variation across studies due to heterogeneity
Can be accompanied by an uncertainty interval
Simple to calculate and can usually be derived from published meta-analyses
Does not inherently depend on the number of studies in the meta-analysis
May be interpreted similarly irrespective of the type of outcome data (eg dichotomous, quantitative, or time to event) and choice of effect measure (eg odds ratio or hazard ratio)
Wide range of applications

Summary points

Inconsistency of studies' results in a meta-analysis reduces the confidence of recommendations about treatment

Inconsistency is usually assessed with a test for heterogeneity, but problems of power can give misleading results

A new quantity I², ranging from 0-100%, is described that measures the degree of inconsistency across studies in a meta-analysis

I² can be directly compared between meta-analyses with different numbers of studies and different types of outcome data

I² is preferable to a test for heterogeneity in judging consistency of evidence

An alternative quantification of heterogeneity in a meta-analysis is the among-study variance (often called τ²), calculated as part of a random effects meta-analysis. This is more useful for comparisons of heterogeneity among subgroups, but values depend on the treatment effect scale. We believe that I² offers advantages over existing approaches to the assessment of heterogeneity (box). Focusing on the effect of heterogeneity also avoids the temptation to perform so-called two stage analyses, in which the meta-analysis strategy (fixed or random effects method) is determined by the result of a statistical test. Such strategies have been found to be problematic.²³,²⁴ We therefore believe that I² is preferable to the test of heterogeneity when assessing inconsistency across studies.
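The scale dependence of τ² can be illustrated with a small Python sketch. It assumes inverse-variance weights and the DerSimonian-Laird moment estimator of τ², a widely used choice that this article does not name. The function name and the two-study data are hypothetical. The second call expresses the same data in units ten times smaller, so the estimates are multiplied by 10 and the variances by 100.

```python
def heterogeneity_summary(estimates, variances):
    """Return (tau_squared, i_squared_percent) from estimates and variances.

    Assumptions: inverse-variance weights and the DerSimonian-Laird
    moment estimator for the among-study variance tau².
    """
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau_sq = max(0.0, (q - df) / c)
    i_sq = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    return tau_sq, i_sq


# Hypothetical two-study example, then the same data on a rescaled outcome.
original = heterogeneity_summary([0.0, 2.0], [0.5, 0.5])
rescaled = heterogeneity_summary([0.0, 20.0], [50.0, 50.0])
print(original)
print(rescaled)
```

In this example τ² changes by a factor of 100 while I² stays at 75%, which is one simple way to see why I² is easier to compare across analyses reported on different scales.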

Notes

We thank Keith O'Rourke and Ian White for useful comments.

Contributors: The authors all work as statisticians and have extensive experience in methodological, empirical and applied research in meta-analysis. JH, JD, and DA are coconvenors of the Cochrane Statistical Methods Group. The views expressed in the paper are those of the authors. All authors contributed to the development of the methods described. JH and ST worked more closely on the development of I². JH is guarantor.

Funding: This work was funded in part by MRC Project Grant G9815466.

Competing interests: None declared.

References

1. Egger M, Davey Smith G. Meta-analysis: potentials and promise. BMJ 1997;315: 1371-4.

2. Liberati A, Buzzetti R, Grilli R, Magrini N, Minozzi S. Which guidelines can we trust? West J Med 2001;174: 262-5.

3. Harbour R, Miller J for the Scottish Intercollegiate Guidelines Network Grading Review Group. A new system for grading recommendations in evidence based guidelines. BMJ 2001;323: 334-6.

4. Guyatt G, Sinclair J, Cook D, Jaeschke R, Schünemann H, Pauker S. Moving from evidence to action. In: Guyatt G, Rennie D, eds. Users' guides to the medical literature: a manual for evidence-based clinical practice. Chicago: American Medical Association, 2002: 599-608.

5. Petitti DB. Approaches to heterogeneity in meta-analysis. Stat Med 2001;20: 3625-33.

6. Higgins J, Thompson S, Deeks J, Altman D. Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. J Health Serv Res Policy 2002;7: 51-61.

7. Cochran WG. The combination of estimates from different experiments. Biometrics 1954;10: 101-29.

8. Sterne JAC, Egger M. Funnel plots for detecting bias in meta-analysis: guidelines on choice of axis. J Clin Epidemiol 2001;54: 1046-55.

9. Paul SR, Donner A. Small sample performance of tests of homogeneity of odds ratios in k 2×2 tables. Stat Med 1992;11: 159-65.

10. Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Stat Med 1998;17: 841-56.

11. Jefferson TO, Demicheli V, Deeks JJ, Rivetti D. Amantadine and rimantadine for preventing and treating influenza A in adults. Cochrane Database Syst Rev 2002;(4): CD001169.

12. Dickersin K, Berlin JA. Meta-analysis: state-of-the-science. Epidemiol Rev 1992;14: 154-76.

13. Barbui C, Hotopf M, Freemantle N, Boynton J, Churchill R, Eccles MP, Geddes JR, et al. Treatment discontinuation with selective serotonin reuptake inhibitors (SSRIs) versus tricyclic antidepressants (TCAs). Cochrane Database Syst Rev 2003;(3): CD002791.

14. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002;21: 1539-58.

15. Sterne JAC, Bradburn MJ, Egger M. Meta-analysis in STATA. In: Egger M, Davey Smith G, Altman DG, eds. Systematic reviews in health care: meta-analysis in context. 2nd ed. London: BMJ Publications, 2001: 347-69.

16. Early Breast Cancer Trialists' Collaborative Group. Tamoxifen for early breast cancer: an overview of the randomised trials. Lancet 1998;351: 1451-67.

17. Lau J, Antman EM, Jimenez-Silva J, Kupelink B, Mosteller SF, Chalmers TC. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med 1992;327: 248-54.

18. Egger M, Davey Smith G. Misleading meta-analysis. BMJ 1995;310: 752-4.

19. Angelillo IF, Villari P. Residential exposure to electromagnetic fields and childhood leukaemia: a meta-analysis. Bull World Health Organ 1999;77: 906-15.

20. Cochrane Injuries Group Albumin Reviewers. Human albumin administration in critically ill patients: systematic review of randomised controlled trials. BMJ 1998;317: 235-40.

21. Engels EA, Schmid CH, Terrin N, Olkin I, Lau J. Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. Stat Med 2000;19: 1707-28.

22. Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Stat Med 2002;21: 1575-1600.

23. Freeman PR. The performance of the two-stage analysis of two-treatment, two-period crossover trials. Stat Med 1989;8: 1421-32.

24. Steyerberg EW, Eijkemans MJ, Habbema JD. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol 1999;52: 935-42.

Articles from The BMJ are provided here courtesy of BMJ Publishing Group

Full text links

Read article at publisher's site: https://doi.org/10.1136/bmj.327.7414.557

Read article for free, from open access legal sources, via Unpaywall: https://europepmc.org/articles/pmc192859?pdf=render
