Chapter 8: Domains Decreasing Certainty in the Evidence

About

  • This ACIP GRADE handbook provides guidance to the ACIP workgroups on how to use the GRADE approach for assessing the certainty of evidence.

8.1 Risk of bias (study limitations)

Study limitations may bias the estimates of the effect of an intervention on health outcomes.1 The factors considered for evaluating study limitations or risk of bias (also referred to as internal validity) will depend on the study design. The number of studies does not determine risk of bias, as a single well-conducted study may result in high confidence in the estimated effect of vaccination on health outcomes. Risk of bias can differ among outcomes within an individual study; therefore, limitations for each outcome of interest in a study should be assessed separately.

Randomized Controlled Trials

For randomized controlled trials, Cochrane's revised risk of bias (RoB 2) tool can be used to assess study limitations.23 The tool considers bias that may arise from the randomization process, deviations from the intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result. Signaling questions are used to highlight concerns in each RoB domain. Judgements are expressed as "High", "Low", or "Some concerns" for risk of bias. Details on how to use the tool and the various assessment questions can be found on the Risk of bias website.2

Studies in which participants are allocated to intervention or control groups through quasi-randomization techniques (e.g., allocation by odd or even date of birth, date or day of admission, case record number, alternation/rotation) will automatically be at risk of selection bias due to inadequate generation of a randomized sequence, in addition to the ability of participants, or investigators enrolling participants, to foresee allocation.4 Blinding outcome assessors is less important for the assessment of objective outcomes such as all-cause mortality, but is crucial for subjective outcomes such as quality of life. Risk of bias can differ across outcomes (e.g., higher risk of bias for subjective outcomes compared to objective outcomes when outcome assessors are not blinded; different subsets of studies for safety vs. efficacy studies). For adverse events or non-inferiority studies, intention-to-treat analyses may not be appropriate.

If any information for assessing risk of bias is not reported in a publication, study investigators may be contacted. It may also be possible to assess risk of bias from other reported information. For example, if information on allocation sequence concealment is not reported, data showing that the intervention and control groups are balanced at baseline may assuage concern regarding risk of bias. When assessing the risk of bias due to missing outcome data, both the reasons for the missing data and the quantity of missing data should be taken into consideration. Table 5 provides a summary of the domains used in the RoB 2 assessment.

Table 5. Domains of RoB 2 tool

Study | Risk of bias arising from the randomization process (High/Low/Some Concerns) | Risk of bias due to deviations from the intended interventions (High/Low/Some Concerns) | Risk of bias due to missing outcome data (High/Low/Some Concerns) | Risk of bias in measurement of the outcome (High/Low/Some Concerns) | Risk of bias in selection of the reported result (High/Low/Some Concerns)

The Cochrane group has also developed risk of bias assessment tools to use for cluster-randomized trials and crossover trials.2
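
The per-outcome judgements recorded in Table 5 can be kept in a small data structure. The sketch below is illustrative only: the names `Judgement`, `ROB2_DOMAINS`, and `overall_rob2` are hypothetical, and the "worst domain" rule used to derive a starting overall judgement is an assumption made for this example, not a rule stated in this handbook. Reviewers may still judge a result to be at high risk of bias when several domains raise some concerns.

```python
from enum import Enum


class Judgement(Enum):
    LOW = "Low"
    SOME_CONCERNS = "Some concerns"
    HIGH = "High"


# The five RoB 2 domains listed in Table 5.
ROB2_DOMAINS = (
    "randomization process",
    "deviations from the intended interventions",
    "missing outcome data",
    "measurement of the outcome",
    "selection of the reported result",
)


def overall_rob2(domain_judgements):
    """Return a starting overall judgement for ONE outcome in ONE study.

    Assumed rule (illustrative): any High -> High; otherwise any
    Some concerns -> Some concerns; otherwise Low.
    """
    missing = [d for d in ROB2_DOMAINS if d not in domain_judgements]
    if missing:
        raise ValueError(f"No judgement recorded for: {missing}")
    values = [domain_judgements[d] for d in ROB2_DOMAINS]
    if Judgement.HIGH in values:
        return Judgement.HIGH
    if Judgement.SOME_CONCERNS in values:
        return Judgement.SOME_CONCERNS
    return Judgement.LOW


# Hypothetical unblinded trial assessed for a subjective outcome.
example = {d: Judgement.LOW for d in ROB2_DOMAINS}
example["measurement of the outcome"] = Judgement.HIGH
print(overall_rob2(example).value)
```

Because risk of bias is assessed per outcome, one such record would be kept for each outcome of interest in each study.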

Non-randomized Studies

The criteria for assessing non-randomized studies, such as cohort studies, case-control studies, controlled before-after studies, interrupted time series, and case series, differ from risk of bias assessments for randomized trials.1 The Cochrane group recommends using the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool to assess the risk of bias for non-randomized studies, specifically for comparative cohort studies.5 Similar to the RoB 2 tool recommended for RCTs, ROBINS-I assessments are done for specific results; each reported outcome in a study should be considered separately rather than judging the study as a whole. Confounding and co-interventions are major concerns that could lead to bias in non-randomized studies. Other domains such as selection bias, information bias, and reporting bias are also evaluated using the ROBINS-I tool; details on the signaling questions and domains used in the tool can be found on the Risk of bias website.

Table 6 provides an overview of the domains used in the ROBINS-I tool. Each domain is judged to have "Low", "Moderate", or "Critical" risk of bias. "No information (NI)" is used when there is insufficient information to make a judgment on a domain. When using this tool, non-randomized studies (NRS) start off with high certainty and can be graded down for study limitations after the ROBINS-I tool is used and concerns with risk of bias are identified.27 The ROBINS-I tool uses an absolute metric rather than comparing non-randomized studies to a standard ideal NRS, thus making it easier to compare RCTs and non-randomized studies, as both are assessed using a similar metric for risk of bias.

Table 6. Domains of the ROBINS-I tool for NRS

Study | Bias due to confounding (Low/Moderate/Critical/NI) | Bias in selection of participants into the study (Low/Moderate/Critical/NI) | Bias in classifications of interventions (Low/Moderate/Critical/NI) | Bias due to deviations from intended interventions (Low/Moderate/Critical/NI) | Bias due to missing data (Low/Moderate/Critical/NI) | Bias in measurement of outcomes (Low/Moderate/Critical/NI) | Bias in selection of the reported result (Low/Moderate/Critical/NI)

The Newcastle-Ottawa Scale (NOS) is another tool that has been developed to assess the risk of bias of nonrandomized studies.6

After using a tool to assess the risk of bias for each outcome in an individual study, the extent of study limitations for the body of evidence is categorized into one of the following groups:1

  • No serious limitations (do not downgrade evidence type): most of the studies comprising the body of evidence have low risk of bias for all key criteria for evaluating study limitations.
  • Serious limitations (downgrade one level): most of the studies have crucial limitations for one criterion or some limitations for multiple criteria that lower confidence in the estimated effect of vaccination on the outcome of interest.
  • Very serious limitations (downgrade two levels): most of the studies have crucial limitations for one or more criteria that substantially lower confidence in the estimated effect.
  • Extremely serious limitations (downgrade three levels):7 most of the studies have crucial limitations for multiple criteria that substantially lower confidence in the estimated effect. This option exists only for studies that are evaluated using the ROBINS-I tool, because the use of ROBINS-I starts the evidence at high certainty.

When considering a body of evidence in which some studies have no serious limitations, some have serious limitations, and some have very serious limitations, it is not appropriate to automatically assign an average rating of serious limitations for the group of studies. When the risk of bias varies across studies, principles for determining whether to downgrade the evidence type for a group of studies include:1

  • Consider the extent to which each study contributes to the overall or pooled estimate of effect. Larger studies with many outcome events will contribute more.
  • Assess whether the results differ for studies with low risk of bias and those with high risk of bias. Consider focusing on studies with lower risk of bias if the results differ by risk of bias.
  • Downgrade when there is substantial risk of bias across most of the studies.
  • Consider limitations pertaining to the other GRADE criteria (if there are close calls regarding risk of bias and another GRADE criterion, consider downgrading the evidence level for at least one of the two GRADE criteria).
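
The first two principles above can be sketched numerically. The example below is a minimal illustration under stated assumptions: it uses a fixed-effect inverse-variance model on the log risk ratio (the handbook does not prescribe a model), the three studies and their values are hypothetical, and the helper names `pool_fixed` and `pooled_rr_by_rob` are invented for this sketch.

```python
import math


def pool_fixed(log_rrs, ses):
    """Fixed-effect inverse-variance pooling of log risk ratios.

    Returns (pooled log RR, its standard error, each study's share of weight).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_rrs)) / total
    shares = [w / total for w in weights]
    return pooled, math.sqrt(1.0 / total), shares


# Hypothetical body of evidence; "rob" is the overall risk of bias judgement.
studies = [
    {"name": "Study A", "log_rr": math.log(0.60), "se": 0.10, "rob": "low"},
    {"name": "Study B", "log_rr": math.log(0.65), "se": 0.15, "rob": "low"},
    {"name": "Study C", "log_rr": math.log(0.30), "se": 0.40, "rob": "high"},
]


def pooled_rr_by_rob(studies):
    """Pool separately within each risk of bias level to see if results differ."""
    result = {}
    for level in sorted({s["rob"] for s in studies}):
        subset = [s for s in studies if s["rob"] == level]
        pooled, _, _ = pool_fixed(
            [s["log_rr"] for s in subset], [s["se"] for s in subset]
        )
        result[level] = math.exp(pooled)
    return result


_, _, shares = pool_fixed(
    [s["log_rr"] for s in studies], [s["se"] for s in studies]
)
for s, share in zip(studies, shares):
    print(f"{s['name']}: {share:.0%} of the weight ({s['rob']} risk of bias)")
print(pooled_rr_by_rob(studies))
```

In this hypothetical case the high risk of bias study carries little weight, so its divergent result has limited influence on the pooled estimate; that is the kind of consideration the principles ask reviewers to make explicit.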

When close-call situations occur, this should be made explicit, and the reason for the ultimate classification should be stated. Table 7a provides an example of when results from NRS may not have serious concerns with risk of bias, while the body of evidence consisting of randomized trials has concerns with study limitations. Since the trials used subjective reporting of the outcome and lacked blinding, the body of evidence was downgraded due to serious concerns with risk of bias.

Table 7b presents a situation in which the certainty of the evidence from RCTs for the outcomes of serious adverse events and myo-/pericarditis was judged as very low; therefore, the work group considered the evidence from NRS. For both of these outcomes, the RCTs had concerns due to the small number of events and total patients. The NRS provided complementary evidence with a larger number of participants and results consistent with those from RCTs.

Table 7a. Evidence profile for outcome of incidence of arthritis (5–56 days)

References in this table:8

No. of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No. of patients: rVSV-vaccine | No. of patients: No rVSV-vaccine | Effect: Relative (95% CI) | Effect: Absolute (95% CI) | Certainty | Importance
4 | Randomized trials | Serious [a] | Not serious | Not serious | Serious [b] | None | 39/1776 (2.2%) | 16/868 (1.8%) | RR 1.80 [d] (0.21 to 15.13) | 23 more per 1,000 (from 22 fewer to 400 more) | Low | Critical
2 | Non-randomized studies | Not serious | Not serious | Not serious | Very serious [b,c] | None | 43/520 (8.3%) | 3/107 (2.8%) | RR 2.06 [d] (0.0001 to 7739.16) | 33 more per 1,000 (from 28 fewer to 1000 more) | Very Low | Critical

Note: Non-randomized studies without comparators are not included in evidence table, but would be considered to offer very low certainty (evidence type 4)

Explanations

a. Studies used variable definitions and methods for diagnosing and reporting arthritis. In addition, participants, healthcare personnel, and outcome assessors were not blinded in Huttner 2015 or Samai 2018 potentially influencing events reported for this subjective outcome.

b. The 95% CI includes the potential for possible harms, as well as possible benefit.

c. Few events reported do not meet optimal information size and suggest fragility in the estimate.

d. RR calculated using the standard continuity correction of 0.5 and the overall effect uses a random effects model.
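
Explanation d refers to a continuity correction of 0.5. As a minimal sketch of what that means for a single study, the function below adds 0.5 to every cell of the 2×2 table when any cell is zero and then computes the risk ratio with a 95% CI on the log scale. The function name `risk_ratio` and the choice to return `None` for studies with no events in either arm are assumptions made for this illustration; the random effects pooling mentioned in the footnote is not shown.

```python
import math


def risk_ratio(events_t, n_t, events_c, n_c, correction=0.5, z=1.96):
    """Risk ratio and 95% CI for one study, with a continuity correction.

    If any cell of the 2x2 table is zero, `correction` is added to all four
    cells. Studies with no events in either arm return None (assumption:
    such studies do not contribute to the relative effect).
    """
    if events_t == 0 and events_c == 0:
        return None
    a, b = float(events_t), float(n_t - events_t)  # vaccinated: events, non-events
    c, d = float(events_c), float(n_c - events_c)  # comparison: events, non-events
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    n1, n2 = a + b, c + d
    rr = (a / n1) / (c / n2)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical study: 3/50 events with vaccine, 0/50 without.
print(risk_ratio(3, 50, 0, 50))
```

The very wide interval produced for sparse data of this kind illustrates why few events lead to concerns about imprecision.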

Table 7b. Evidence profile for Use of JYNNEOS (orthopoxvirus) vaccine heterologous for those who received ACAM2000 primary series

References in this table:91011121314151617

No. of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No. of patients: a booster dose of JYNNEOS | No. of patients: a booster dose of ACAM2000 | Effect: Relative (95% CI) | Effect: Absolute (95% CI) | Certainty | Importance

Prevention of disease (assessed with: seroconversion rate)
3 (refs 1,2,3,4,5,6,7) | observational studies | serious [a] | not serious | serious [b] | serious [c] | none | No comparison data available. Intervention data from the systematic review: 272/333 (81.68%) participants from 3 studies seroconverted 14 days after booster with MVA. | VERY LOW | CRITICAL

Severity of disease (assessed with: take maximum lesion area)
1 (ref 8) | observational studies | serious [a,d] | not serious | not serious | very serious [e] | none | No comparison data available. Intervention data from the systematic review: 20/20 (100%) of vaccinia experienced participants developed an attenuated take lesion after Dryvax challenge following booster with MVA vaccine. | VERY LOW | IMPORTANT

Serious adverse events (assessed with: vaccine related serious adverse event rate)
1 (ref 8) | randomized trials | serious [f] | not serious | not serious | very serious [g] | none | 0/22 (0.0%) | 0/28 (0.0%) | not estimable | | VERY LOW | CRITICAL
4 (refs 1,2,3,4,5,6,7,9) | observational studies [h] | not serious | not serious | serious [i] | very serious [g] | none | 0/367 (0.0%) [j] | 3/1371 (0.2%) [k] | RR 0.53 (0.03 to 10.32) | 1 fewer per 1,000 (from 2 fewer to 22 more) | VERY LOW | CRITICAL

Myo-/pericarditis (assessed with: myo-/pericarditis event rate)
1 (ref 8) | randomized trials | very serious [l] | not serious | not serious | very serious [m] | none | 0/22 (0.0%) | 0/28 (0.0%) | not estimable | | VERY LOW | IMPORTANT
3 (refs 1,2,3,4,5,6,7) | observational studies | not serious | not serious | serious [i] | very serious [m] | none | 0/349 (0.0%) [n] | 0/1371 (0.0%) [o] | not estimable | | VERY LOW | IMPORTANT

RR: risk ratio; CI: confidence interval

Explanations

a. Risk of bias due to lack of comparison data.

b. Seroconversion rate is an indirect measure of prevention.

c. Small sample size, no comparison.

d. Attrition rate was variable across study groups. One group lost 17% of participants.

e. Small sample size, fragility of estimate.

f. In the protocol it is unclear how serious adverse events were assessed.

g. Sample size is small, too small to detect rare adverse events.

h. Observational data was included in the evidence profile for this outcome because the effect estimate for the randomized trials was not estimable.

i. Single-arm studies contribute data to the intervention, but no available data for the comparison from the systematic review. Downgraded for indirectness because historical data was used for comparison.

j. Intervention data was drawn from 3 observational studies included in the systematic review. 0/349 (0.00 %) participants from 3 studies developed vaccine related serious adverse events.

k. Comparison data was drawn from historical data. In a phase III clinical trial for ACAM2000 enrolling participants with previous smallpox vaccination 3/1371 (0.22%) developed vaccine related serious adverse events after ACAM2000 administration. No smallpox vaccine-specific serious adverse event was recorded.

l. Assessment of myo-/pericarditis was initiated late in the study at the request of FDA. Very few subjects could be evaluated at that point. It was unclear how many subjects were evaluated.

m. Sample size is small, too small to detect rare events of myopericarditis after JYNNEOS®.

n. Intervention data was drawn from 3 observational studies included in the systematic review. 0/349 (0.00 %) participants developed myo-/pericarditis.

o. Comparison data was drawn from historical data. In a phase III clinical trial for ACAM2000 enrolling participants with previous smallpox vaccination, 0/1371 (0.00%) developed myo-/pericarditis after ACAM2000 administration.

8.2 Inconsistency

Inconsistency refers to an unexplained heterogeneity in the effect estimates across studies contributing to a summary estimate (e.g., relative risk or odds ratio for binary outcomes; mean difference for continuous outcomes).18 Inconsistency can be assessed by examining the following indicators of heterogeneity: 1) visual examination of the forest plot (point estimates and confidence intervals); 2) calculating a statistical test of heterogeneity, the Chi-squared (Chi² or χ²) statistic; 3) calculating the I-squared (I²) statistic; and 4) contextualizing the findings with the target for our certainty rating.

Heterogeneity occurs when there is large variability between the studies pooled in a meta-analysis. Visual inspection can show effects that differ from the rest and should include an examination of the point estimates and overlap of confidence intervals.19 A forest plot suggesting heterogeneity would show confidence intervals from individual studies that have limited or no overlap with the summary estimate. The studies contributing to the summary estimate may have point estimates that widely differ. However, differences may not always be detected by visualization; therefore, complementing this with numerical estimates of heterogeneity may be helpful. The I² statistic describes the percentage of variation across studies that is due to heterogeneity rather than chance. The higher the I² statistic, the more likely the variability seen is due to more than just chance (I² <30% is low, ~50% is moderate, and >75% is substantial and requires further exploration). The Chi² test evaluates the null hypothesis that the included studies are not different (homogeneous); however, its results are unreliable when the included studies have small samples or when there are few studies in the meta-analysis. If the Chi² is small and the p-value large (>0.10 or >0.05; i.e., not significant), heterogeneity may not be suspected. Lastly, if the pooled point estimate visually falls within the 95% CIs of the studies included in the analysis, heterogeneity is less of a concern.
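
The numerical indicators described above can be sketched as follows. This is a minimal illustration, assuming inverse-variance weights on a common effect scale (for example, log risk ratios) and the usual definition I² = max(0, (Q − df)/Q) × 100%, where Q is the Chi² heterogeneity statistic and df is the number of studies minus one. The function name `heterogeneity` and the input values are hypothetical; the p-value for Q would additionally require a chi-squared distribution function, which is omitted here.

```python
def heterogeneity(effects, ses):
    """Cochran's Q (the Chi-squared heterogeneity statistic), df, and I-squared (%).

    effects: study effect estimates on a common scale (e.g., log risk ratios)
    ses:     their standard errors
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical log risk ratios and standard errors from three studies.
q, df, i2 = heterogeneity([0.10, 0.45, 0.90], [0.20, 0.25, 0.30])
print(f"Q = {q:.2f} on {df} df, I-squared = {i2:.0f}%")
```

These statistics complement, rather than replace, visual inspection of the forest plot.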

When making decisions about the extent to which heterogeneity contributes to our certainty rating (i.e., should we rate down for inconsistency and by how much), the target (threshold or range) of our certainty rating must be identified.20 This could be the null, a minimally important difference, or a range of magnitudes (trivial, small, moderate, or large). Inconsistency is a concern when the study results fall on different sides of a threshold that carries meaning. Inconsistency may not be a concern when all of the point estimates (and CIs) of included studies lie above a given threshold even if they are disparate (e.g., visually confidence intervals don't overlap or I² is high, etc.).
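
A minimal sketch of the threshold check described above, assuming each study is summarized by its 95% CI on the same scale as the threshold. The function name `all_on_one_side` and the example intervals are hypothetical.

```python
def all_on_one_side(study_cis, threshold):
    """True when every study's CI lies entirely above, or entirely below, the threshold.

    study_cis: list of (lower, upper) confidence limits, one pair per study
    """
    all_above = all(lower > threshold for lower, _ in study_cis)
    all_below = all(upper < threshold for _, upper in study_cis)
    return all_above or all_below


# Disparate, non-overlapping CIs that all lie above the null (RR = 1):
print(all_on_one_side([(1.2, 1.8), (2.5, 4.0)], threshold=1.0))  # True
# One CI crosses the null, so the studies disagree relative to the threshold:
print(all_on_one_side([(0.8, 1.5), (2.5, 4.0)], threshold=1.0))  # False
```

In the first case the studies differ in magnitude but agree relative to the chosen threshold, which is the situation in which inconsistency may not be a concern.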

In addition to noting the presence of inconsistency, it is desirable to determine potential reasons for the inconsistency. Differences in the following may result in inconsistency:

  • Populations (e.g., vaccines may have different relative effects in sicker populations);
  • Interventions (e.g., different effects with different number of doses or comparators);
  • Outcomes (e.g., duration of follow-up);
  • Study methods (e.g., studies with higher and lower risk of bias).

When heterogeneity is large and a plausible explanation cannot be identified, the evidence level should be downgraded by one or two levels, depending on heterogeneity in the magnitude of effect. There are no specific guidelines for this; see "GRADE guidelines: 7. Rating the quality of evidence—inconsistency" for examples of downgrading.18 If inconsistency can be explained, estimates of effect should be presented separately for the stratification that explains the observed heterogeneity. If results differ by study methods, preference may be given to results of studies with a lower risk of bias. If results differ by population groups, different recommendations may be made for different groups. If only one study is available, there are by default no concerns with inconsistency (i.e., select "Not serious" when grading).

Inconsistency is assessed more strictly in binary/dichotomous outcomes (relative values) than continuous outcomes (absolute values). For binary outcomes, inconsistency should be assessed using the risk ratio or odds ratio, which are measures of relative effect, where a value of 1 indicates the estimated effect is similar for both the intervention and comparison group.21 Conversely, the risk difference is a measure of absolute effect that represents the difference in the observed risk and should not be used to assess inconsistency because it is very sensitive to the baseline risk (i.e., risk in the control group), and baseline risk can differ substantially between studies.18

The forest plot below (Figure 6) shows four studies included in the analysis for the binary outcome of severe (grade 3) arthralgia. Here, two studies contribute to the effect estimate (risk ratio), as they contain events. Visually, the pooled estimate (6.40) falls within the 95% CIs of the included studies; the Chi² is small (0.08) and the p-value is large (i.e., not significant at 0.10), and the I² = 0%.8 Based on all three steps, heterogeneity is not serious for this outcome.
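
The sensitivity of the risk difference to baseline risk can be shown with a short worked example. The numbers are hypothetical: three studies in which vaccination halves the risk, but with different risks in the control group. The helper name `relative_and_absolute` is invented for this sketch.

```python
def relative_and_absolute(risk_vaccinated, risk_control):
    """Return (risk ratio, risk difference) for one study."""
    return risk_vaccinated / risk_control, risk_vaccinated - risk_control


# Same relative effect (risk halved) at three hypothetical baseline risks.
for baseline in (0.02, 0.20, 0.40):
    rr, rd = relative_and_absolute(0.5 * baseline, baseline)
    print(f"baseline {baseline:.0%}: RR = {rr:.2f}, RD = {rd:+.2f}")
```

The risk ratio is identical across the three studies, while the risk difference ranges from one to twenty percentage points. Judged on the risk difference alone, these perfectly consistent relative effects would look heterogeneous.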

To recap, any of the following factors may result in rating down for inconsistency:

  1. I² is large (I² <30% is low, ~50% is moderate, and >75% is substantial and requires further exploration).
  2. Statistical test for heterogeneity (Chi²) shows a low p-value (i.e., <0.05).
  3. Confidence intervals of the point estimates of included studies do not overlap or show minimal overlap.

Figure 6. Estimates of effect for RCTs included in analysis for outcome of incidence of severe (grade 3) arthralgia (0-42 days)

References in this figure: 8


Effect estimates from continuous outcomes can be presented in a number of ways. If the primary studies included have assessed an outcome using the same scale, then it can be presented as a Mean Difference (MD). However, when pooling studies which measure the same continuous outcome using different instruments or varying scales, researchers might choose to present this as a Standardized Mean Difference (SMD). The MD can be easily interpreted and assessed for heterogeneity and inconsistency. However, SMD might pose more of a difficulty and reviewers might need to use a different approach to further present and interpret the effect estimate.22 Tables 8 and 9 present the options available to reviewers dealing with studies with these challenges.
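
As a minimal sketch of why the SMD allows pooling across instruments, the example below computes the mean difference and a standardized mean difference for two hypothetical studies that measure pain on different scales. It assumes the simple pooled-standard-deviation form of the SMD (Cohen's d) without a small-sample correction; the function names and all numbers are illustrative.

```python
import math


def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled standard deviation of the intervention and control groups."""
    return math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )


def smd(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference: mean difference in pooled-SD units."""
    return (mean_t - mean_c) / pooled_sd(sd_t, n_t, sd_c, n_c)


# Study 1 (hypothetical): pain on a 0-100 scale.
md_1 = 40 - 50
smd_1 = smd(40, 10, 30, 50, 10, 30)
# Study 2 (hypothetical): pain on a 0-10 scale.
md_2 = 4.0 - 5.0
smd_2 = smd(4.0, 1.0, 30, 5.0, 1.0, 30)

print(md_1, md_2)    # mean differences are not comparable across scales
print(smd_1, smd_2)  # SMDs are on a common, unitless scale
```

The two mean differences cannot be pooled directly, whereas the SMDs can. The cost is interpretability, because an SMD is expressed in standard deviation units rather than in units of any instrument.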

Table 8: Five approaches to presenting results of continuous variables when primary studies have used different instruments to measure the same construct

References in this table:22

Approach | Advantages | Disadvantages | Recommendation
SD units (standardized mean difference; effect size) | Widely used | Interpretation challenging. Can be misleading depending on whether population is very homogenous or heterogeneous | Do not use as the only approach
Present as natural units | May be viewed as closer to primary data | Few instruments sufficiently used in clinical practice to make units easily interpretable | Approaches to conversion to natural units include those based on SD units and rescaling approaches. We suggest the latter. In rare situations when instrument very familiar to frontline clinicians, seriously consider this presentation
Relative and absolute effects | Very familiar to clinical audiences and thus facilitate understanding. Can apply GRADE guidance for large and very large effects | Involve assumptions that may be questionable (particularly methods based on SD units) | If the MID is known, use this strategy in preference to relying on SD units. Always seriously consider this option
Ratio of means | May be easily interpretable to clinical audiences. Involves fewer questionable assumptions than some other approaches. Can apply GRADE guidance for large and very large effects | Cannot be applied when measure is change and therefore negative values possible. Interpretation requires knowledge and interpretation of control group mean | Consider as complementing other approaches, particularly the presentation of relative and absolute effects
MID units | May be easily interpretable to audiences. Not vulnerable to population heterogeneity | Only applicable when MID is known. To the extent that MID is uncertain, this approach will be less attractive | Consider as complementing other approaches, particularly the presentation of relative and absolute effects

Abbreviations: SD, standard deviation; MID, minimally important difference.

Table 9: Application of approaches to dexamethasone for pain after laparoscopic cholecystectomy example

References in this table:22

Outcomes | Estimated risk or estimated score/value | Absolute reduction in risk or reduction in score/value with dexamethasone | Relative effect (95% CI) | Number of participants (studies) | Confidence in effect estimate | Comments
(A) Postoperative pain, SD units: investigators measured pain using different instruments. Lower scores mean less pain | | The pain score in the dexamethasone groups was on average 0.79 SDs (1.41–0.17) lower than in the placebo groups | - | 539 (5) | Low evidence [a,b] | As a rule of thumb, 0.2 SD represents a small difference, 0.5 a moderate, and 0.8 a large
(B) Postoperative pain, natural units: measured on a scale from 0 (no pain) to 100 (worst pain imaginable) | The mean postoperative pain scores with placebo ranged from 43 to 54 | The mean pain scores in the intervention groups was on average 8.1 (1.8–14.5) lower | - | 539 (5) | Low evidence | Scores estimated based on an SMD of 0.79 (95% CI: 1.41, 0.17). The minimally important difference on the 0–100 pain scale is approximately 10
(C) Substantial postoperative pain: investigators measured pain using different instruments | 20 per 100 [c] | More patients in dexamethasone group achieved important improvement in pain score | 0.15 (95% CI: 0.19, 0.04) | 539 (5) | Low evidence | Scores estimated based on an SMD of 0.79 (95% CI: 1.41, 0.17). Method assumes that distributions in intervention and control groups are normally distributed and variances are similar
(D) Postoperative pain: investigators measured pain using different instruments. Lower scores mean less pain | 28.1 [d] | 3.7 lower pain score (6.1 lower to 0.6 lower) | | 539 (5) | Low evidence | Weighted average of the mean pain score in dexamethasone group divided by mean pain score in placebo
(E) Postoperative pain: investigators measured pain using different instruments | | The pain score in the dexamethasone groups was on average 0.40 (95% CI: 0.74, 0.07) minimally important difference units less than in the control group | - | 539 (5) | Low evidence | An effect less than half the minimally important difference suggests a small or very small effect

Abbreviations: CI, confidence interval; SD, standard deviation; SMD, standardized mean difference.

a. Evidence limited by heterogeneity between studies

b. Evidence limited by imprecise data

c. The 20% comes from the proportion in the control group requiring rescue analgesia

d. Crude (arithmetic) means of the postoperative pain mean responses across all five trials when transformed to a 100-point scale

Table 10 provides an example of how inconsistency is explained in an evidence profile. The footnotes highlight the large I2 value and, while some of the heterogeneity may be explained by study limitations, there is enough concern to warrant downgrading the body of evidence. As a result, the table shows serious concerns with inconsistency.

Table 10. Evidence profile for outcome of incidence of arthralgia (0–42 days)

References in this table: 8

Certainty assessment No. of patients Effect Certainty Importance
No. of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No. of patients: rVSV-vaccine | No. of patients: No rVSV-vaccine | Effect: Relative (95% CI) | Effect: Absolute (95% CI) | Certainty | Importance
6 | Randomized trials | Serious [a] | Serious [b] | Not serious | Serious [c] | None | 316/1874 (16.9%) | 42/891 (4.7%) | RR 2.55 [d] (0.94 to 6.91) | 73 more per 1,000 (from 3 fewer to 279 more) | Very Low | Critical
2 | Non-randomized studies | Not serious | Not serious | Not serious | Serious [d] | None | 75/469 (16.0%) | 8/99 (8.1%) | RR 1.63 [e] (0.0001 to 7739.16) | 51 more per 1,000 (from 81 fewer to 1000 more) | Very Low | Critical

Note: Non-randomized studies without comparators are not included in evidence table, but would be considered of very low certainty (evidence type 4); CI: Confidence interval; RR: Relative risk

Explanations

a. Participants, healthcare personnel, and outcome assessors were not blinded in Huttner 2015 or Samai 2018, potentially influencing events reported for this subjective outcome. Concern for possible underreporting in Kennedy because arthralgia was only solicited at one week and at one month for most participants; Huttner only solicited arthralgia for low dose participants.

b. Rated down once due to concerns with heterogeneity (I2=70%). Some may be explained by concerns with risk of bias (poor randomization or outcome definition)

c. The 95% confidence interval of the mean pooled estimate includes potential for possible harms as well as benefits

d. Few events reported do not meet optimal information size and suggest fragility in the estimate

e. RR calculated using the standard continuity correction of 0.5 and uses a random effects model.
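
The absolute effects in the randomized trials row of Table 10 can be reproduced from the relative effect and the risk in the comparison group. The sketch below assumes the common approach of applying the risk ratio and its confidence limits to the observed control group risk (42/891); the function name `absolute_effect_per_1000` is hypothetical.

```python
def absolute_effect_per_1000(rr, rr_lower, rr_upper, control_events, control_total):
    """Risk difference per 1,000 implied by a risk ratio and the control group risk.

    Returns (point estimate, lower limit, upper limit), rounded to whole
    persons per 1,000. Positive values mean more events with the intervention.
    """
    baseline = control_events / control_total * 1000.0

    def difference(ratio):
        return round(baseline * ratio - baseline)

    return difference(rr), difference(rr_lower), difference(rr_upper)


# Randomized trials row of Table 10: RR 2.55 (0.94 to 6.91), control risk 42/891.
print(absolute_effect_per_1000(2.55, 0.94, 6.91, 42, 891))  # (73, -3, 279)
```

The result corresponds to "73 more per 1,000 (from 3 fewer to 279 more)". The sketch does not cap risks at 1,000 per 1,000, so it is not suitable as written for extremely wide intervals such as the one in the non-randomized studies row.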

8.3 Indirectness

Research that answers the PICO question most appropriately is considered direct evidence; therefore, studies that address the target population, compare the interventions specified in the question and measure the outcomes of interest can be classified as direct evidence.23 Indirectness can be introduced when any of the four situations below occur:

  • The population that participated in studies may differ from the population of interest;
  • The intervention that was evaluated may differ from the intervention of interest;
  • The primary interest is head-to-head comparisons of vaccine A to vaccine B, but A was compared with C and B was compared with C (i.e., the comparator is different from the comparator of interest);
  • The outcome that was assessed may differ from that of primary interest. This may occur when there is either an intermediate outcome or a surrogate outcome used to inform the outcome of interest. For example, a panel may decide that vaccine efficacy is a critical outcome; however, the underlying evidence does not report directly on the measure of efficacy. This may occur when there is a low baseline risk of developing the outcome of interest. When assessing the evidence for vaccines, immunogenicity may serve as an appropriate surrogate for vaccine efficacy if vaccine efficacy data are not available; however, unless there is an established immune correlate of protection, this should result in downgrading for indirectness.
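
For the comparator situation above (A versus C and B versus C, with no head-to-head data), one commonly used way to obtain an A versus B estimate is an adjusted indirect comparison through the common comparator. This method is not described in this handbook and is shown only to illustrate why indirect comparisons carry extra uncertainty; the function name `indirect_rr` and the input values are hypothetical.

```python
import math


def indirect_rr(rr_ac, se_log_ac, rr_bc, se_log_bc, z=1.96):
    """Indirect risk ratio of A vs. B through a common comparator C.

    The log risk ratios are subtracted and their variances are added, so the
    indirect estimate is always less precise than either direct comparison.
    se_log_ac and se_log_bc are standard errors of the log risk ratios.
    """
    log_ab = math.log(rr_ac) - math.log(rr_bc)
    se_ab = math.sqrt(se_log_ac ** 2 + se_log_bc ** 2)
    ci = (math.exp(log_ab - z * se_ab), math.exp(log_ab + z * se_ab))
    return math.exp(log_ab), se_ab, ci


# Hypothetical inputs: vaccine A vs. placebo RR 0.50, vaccine B vs. placebo RR 0.80.
rr_ab, se_ab, ci_ab = indirect_rr(0.50, 0.20, 0.80, 0.20)
print(f"RR A vs. B = {rr_ab:.3f}, 95% CI {ci_ab[0]:.2f} to {ci_ab[1]:.2f}")
```

Beyond the wider interval, such a comparison also assumes that the trials of A and B are similar enough in population and methods to share a comparator, which is part of the reason for considering a rating down for indirectness.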

Table 11. Examples of indirect evidence

Indirectness | Question of interest | Source of indirectness
Population | Efficacy of vaccine in preventing disease. | 1. Studies are available for healthy persons, but not for the population of interest (e.g., older adults with chronic health conditions). 2. Studies are available for the correct population; however, the baseline risk of infection is not representative of the recruited target population in the trial. For example, RSV vaccine trial participants are recruited during a year with unrepresentatively low RSV rates.
Intervention | Efficacy of a new formulation of a vaccine in preventing disease. | Studies of previous formulations of the vaccine provide indirect evidence bearing on the new vaccine.
Comparator | Efficacy of vaccine A compared to vaccine B in preventing disease. | Studies compared vaccine A to placebo and vaccine B to placebo, but studies comparing A to B are unavailable.
Outcome | Prevention of disease. | Increases in antibody titers following vaccination are reported, but there are no well-established standard correlates of protection.
Intervention vs. Comparator | Efficacy of vaccine A compared to no vaccine in preventing disease. | Studies only compare vaccine A to the current standard of care, vaccine B; therefore, the relationship between the intervention and the comparator is indirect.

Both systematic reviews and guidelines may require the use of evidence that is indirect with respect to the comparator and outcomes of interest. Guidelines also commonly deal with evidence that is indirectly related to the population and intervention specified in the PICO question; these are sometimes described as concerns with applicability. When limited evidence is available, it is often necessary to turn to indirect evidence to help inform judgements. For the purpose of guidelines, it is important to consider all four potential causes of indirectness when rating down the domain; when there are multiple concerns with indirectness, it may be appropriate to rate down twice for indirectness. The use of surrogate outcomes typically results in rating down unless evidence of a strong association between the surrogate and the long- or short-term outcome of interest is established. The rating down process is not always additive, thus it is important to consider the evidence from all angles.

When developing recommendations, guidelines may need to use surrogate outcomes and/or indirect evidence. Although direct evidence is ideal, recommendations may be supported by indirect evidence as long as the indirectness is acknowledged in the certainty assessment.

To decide whether JYNNEOS® (orthopoxvirus) vaccine primary series or ACAM2000 vaccine primary series should be recommended for persons who are at risk for occupational exposure to orthopoxviruses, the guideline panel prioritized the outcome of "Prevention of Disease". However, cases of orthopoxvirus were not reported by the trials. Instead, the surrogate measures of geometric mean titer (GMT) and seroconversion rate were used to inform the outcome of "Prevention of Disease". The work group decided to rate down for indirectness for both of these measures, as there was some uncertainty in how directly findings about the GMT or seroconversion rate would predict prevention of disease. Table 12a presents a truncated GRADE Evidence Profile showing the use of a surrogate outcome to inform the critical outcome of Prevention of Disease. The second outcome presented, Severity of Disease, was informed by one trial reporting on the proportion of study participants with an attenuated take lesion. The ideal measure of disease severity is the take maximum lesion area. However, the work group recognized that the clinical difference between the categorical measurement (proportion of participants with attenuated take) and the continuous measurement (take maximum lesion area) was minimal and therefore did not rate down for indirectness for the outcome of Severity of Disease.

In a second example, the ACIP recently provided recommendations for the following policy question: Should pre-exposure vaccination with the rVSVΔG-ZEBOV-GP vaccine be recommended for adults 18 years of age or older in the U.S. population who are at potential occupational risk of exposure to Ebola virus (species Zaire ebolavirus) for prevention of Ebola virus infection?24 Due to the limited literature available for certain outcomes, such as the development of Ebola-related symptomatic illness, the evidence profile used a cluster randomized study that focused on contacts of recently confirmed Ebola cases in Guinea, West Africa.25 Since the PICO question was specific to the U.S. population, the evidence was downgraded for indirectness but was still used to support the guideline recommendations. As a result, in Table 12b, the cluster study is downgraded, and an explanation is provided in the footnotes regarding why there are serious concerns for indirectness.

Table 12a. GRADE Evidence Profile for Use of JYNNEOS (orthopoxvirus) vaccine primary series for research, clinical laboratory, response team, and healthcare personnel

References in this table: 918101112131415

| Outcome | № of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | № of patients: JYNNEOS OPXV vaccine primary series | № of patients: ACAM2000 OPXV vaccine primary series | Effect: relative (95% CI) | Effect: absolute (95% CI) | Certainty | Importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A. Prevention of disease (assessed with: geometric mean titer) | 2 [1,2,3,4,5,6] | randomized trials | not serious | not serious | serious (a,b) | not serious | none | 213 | 199 | - | MD 1.62 titer units higher (1.32 higher to 1.99 higher) (c) | Moderate | CRITICAL |
| A. Prevention of disease (assessed with: seroconversion rate) | 2 [1,2,3,4,5,6] | randomized trials | not serious | not serious | serious (b,d) | serious (e) | none | 213/213 (100.0%) | 192/199 (96.5%) | RR 1.02 (0.99 to 1.05) | 19 more per 1,000 (from 10 fewer to 48 more) | Low | CRITICAL |
| B. Severity of disease (assessed with: maximum lesion area) | 1 [7] | randomized trials | serious (f) | not serious | not serious (g) | very serious (e,h) | none | 15/15 (100.0%) (i) | 8/8 (100.0%) | RR 1.00 (0.83 to 1.20) | 0 fewer per 1,000 (from 170 fewer to 200 more) | Very low | IMPORTANT |

Letters in parentheses refer to the explanations below; bracketed numbers are the study citation markers from the original table.

Explanations

a. Geometric mean titer is an indirect measure of efficacy.

b. Frey study used Dryvax in the comparison group. For the immunogenicity outcomes we do not feel there would be a significant difference between the two live vaccines.

c. In order to calculate a mean difference and 95% CI, geometric mean data were transformed to arithmetic mean. The effect estimate was then transformed to geometric mean difference, which you see here.

d. Seroconversion rate is an indirect measure of efficacy.

e. 95% CI includes the potential for both meaningful benefit as well as meaningful harm.

f. Concerns for risk of bias due to attrition. The two groups that contributed data to the intervention and comparison for this outcome lost between 11 and 21% of participants at the time this outcome was assessed.

g. The ideal measure of disease severity is the take maximum lesion area. This study reports the proportion of participants with an attenuated take lesion. Clinical difference between categorical (proportion of participants with attenuated take) vs. continuous measurement (take maximum lesion area) is minimal. We feel this won't affect indirectness. See Parrino et al. 2007 for a description of lesion attenuation criteria.
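The absolute effects reported in Table 12a can be related to the relative effect and the comparator-group risk. The following is a minimal sketch, assuming the risk ratio and its confidence limits are simply applied to the risk observed in the comparator group; the function name `absolute_effect_per_1000` is hypothetical and is not part of any GRADE software.

```python
def absolute_effect_per_1000(rr, baseline_events, baseline_total):
    """Risk difference per 1,000 people implied by a risk ratio.

    Assumption for this illustration: the comparator-group risk observed
    in the studies is used as the baseline risk.
    """
    baseline_risk = baseline_events / baseline_total
    return round(1000 * baseline_risk * (rr - 1))


# Seroconversion row of Table 12a: RR 1.02 (0.99 to 1.05), comparator 192/199.
point = absolute_effect_per_1000(1.02, 192, 199)
lower = absolute_effect_per_1000(0.99, 192, 199)
upper = absolute_effect_per_1000(1.05, 192, 199)
print(point, lower, upper)  # 19 -10 48
```

Under these assumptions the output corresponds to "19 more per 1,000 (from 10 fewer to 48 more)", with negative values read as "fewer".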

Table 12b. Evidence profile for outcome of development of Ebola-related symptomatic illness

References in this table:8

| No. of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No. of patients: rVSV-vaccine | No. of patients: no rVSV-vaccine | Effect: relative (95% CI) | Effect: absolute (95% CI) | Certainty | Importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Randomized (a) (clusters) | Not serious | Not serious | Serious (b) | Serious (c) | None | 0/51 (0.0%) | 7/47 (14.9%) | RR 0.06 (d) (0 to 1.05) | 140 fewer per 1,000 (from 149 fewer to 7 more) | Low Evidence | Critical |
| 1 | Non-randomized (participants) | Not serious | Not serious | Serious (b) | Serious (c) | Strong association | 0/2108 (f) (0.0%) | 16/3075 (0.5%) | RR 0.04 (e) (0 to 0.74) | 5 fewer per 1,000 (from 5 fewer to 1 fewer) | Moderate Evidence | Critical |

Letters in parentheses are the footnote markers from the original table.

Note: Outcome assessed with laboratory confirmed case of EVD

Explanations

a. Henao-Restrepo 2017 was a cluster randomized trial (i.e., units of randomization were clusters); cluster-level data presented here.

b. Concern for indirectness to U.S. population: population consists of contacts and contacts of contacts of EVD case, ring vaccination strategy which may include post-exposure vaccination.

c. Because this study was done at a time when the 2014-2015 West Africa outbreak was waning in Guinea and there are few events reported, it does not meet optimal information size and suggests fragility in the estimate; 95% CI contains the potential for desirable as well as undesirable effects.

d. Henao-Restrepo 2017 was a cluster randomized trial (i.e., units of randomization were clusters); participant-level data presented here.

e. The concerns with indirectness pose no inflationary effect; therefore, the evidence was rated up based on a very large magnitude of effect from the 96% reduction in risk and overall certainty was upgraded two levels.

f. Denominator represents participants from the clusters randomized to receive immediate vaccination.

g. RR calculated using the standard continuity correction of 0.5.
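When one arm has zero events, a risk ratio cannot be estimated directly, which is why a continuity correction is mentioned in the explanations. The sketch below shows one common form of the correction, adding 0.5 to every cell of the 2x2 table. That this is the exact variant used for Table 12b is an assumption, and the function name `rr_with_continuity_correction` is hypothetical.

```python
def rr_with_continuity_correction(events_tx, total_tx, events_ctrl, total_ctrl, cc=0.5):
    """Risk ratio after adding `cc` to every cell of the 2x2 table.

    Adding `cc` to both the event and non-event cells of a group raises the
    event count by `cc` and the group total by 2 * `cc`.
    """
    risk_tx = (events_tx + cc) / (total_tx + 2 * cc)
    risk_ctrl = (events_ctrl + cc) / (total_ctrl + 2 * cc)
    return risk_tx / risk_ctrl


# Counts taken from the two rows of Table 12b.
rr_cluster = rr_with_continuity_correction(0, 51, 7, 47)
rr_participant = rr_with_continuity_correction(0, 2108, 16, 3075)
print(round(rr_cluster, 2), round(rr_participant, 2))  # 0.06 0.04
```

With these inputs the sketch gives values that round to the point estimates shown in the table; the confidence limits are not reproduced here.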

8.4 Imprecision

Imprecision refers to the risk of random error in the evidence. It is rated as either not serious, serious or very serious, similar to the other GRADE domains discussed above.26 The estimated effect is considered imprecise when studies have a wide confidence interval (CI). This usually occurs when few events and few patients are included in studies. Concerns with imprecision can lead to uncertainty in the results presented in the evidence. For systematic reviews, the following indicate imprecision for an outcome:

  • Total sample size across all studies for an outcome is lower than the calculated sample size for a single adequately powered study (online calculators are available for sample size calculations); or
  • The 95% confidence interval (CI) of the pooled or best estimate of effect size includes both no effect AND appreciable benefit or appreciable harm (even if sample size is adequate). When an outcome is rare, 95% CIs of relative effects may be very wide, but 95% CIs of absolute effects may be narrow; in such situations, the evidence level may not be downgraded. For continuous outcomes, the threshold for appreciable benefit or appreciable harm refers to the difference in score in the outcome that is perceived as important.
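The confidence-interval criterion above can be expressed as a simple check. This is only a sketch for a risk ratio: the thresholds for appreciable benefit (0.75) and appreciable harm (1.25) are illustrative assumptions, not values set by this handbook, and the function name `ci_suggests_imprecision` is hypothetical. It assumes an undesirable outcome, so that a risk ratio below 1 is a benefit.

```python
def ci_suggests_imprecision(ci_low, ci_high, null=1.0, benefit=0.75, harm=1.25):
    """True when the 95% CI includes no effect AND appreciable benefit or harm.

    `benefit` and `harm` are assumed thresholds on the risk-ratio scale;
    a review team would choose its own values for each outcome.
    """
    includes_null = ci_low <= null <= ci_high
    includes_benefit = ci_low <= benefit
    includes_harm = ci_high >= harm
    return includes_null and (includes_benefit or includes_harm)


# A wide interval such as 0.21 to 15.3 spans no effect, benefit and harm.
print(ci_suggests_imprecision(0.21, 15.3))   # True
# A hypothetical narrow interval that excludes no effect.
print(ci_suggests_imprecision(0.60, 0.70))   # False
```

The check operates on relative effects only. As noted above, for rare outcomes the absolute effect may be precise even when the relative effect is not, so the output is a prompt for judgment rather than a rating.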

For guidelines, additional considerations like clinical decision thresholds for optimal sample size and the event rate must be accounted for.27 The evidence level may be downgraded because of imprecision in the following situations:

  • When the recommendation is for an intervention, and
    • The 95% CI includes both no effect AND an effect that represents a benefit that would outweigh potential harms.
    • The 95% CI excludes no effect, but the lower confidence limit crosses a threshold below which, given potential harms, one would not recommend the intervention.
  • When the recommendation is against an intervention, and
    • The 95% CI includes no effect AND an effect that represents a harm that, despite the benefits, would still be unacceptable.
    • The 95% CI excludes no effect, but the upper confidence limit crosses a threshold above which, given the benefits, one would recommend the intervention.

When assessing the risk of rare events (e.g., GBS, myocarditis) caused by a vaccine, the number of participants studied may not be large enough to detect such events. The suspected rate of such events should be assessed in relation to the number of subjects tested to determine if the evidence should be downgraded for concerns about fragility with imprecision. An alternative approach would be to calculate the optimal information size (OIS) based on the total population instead of relying on the number of events that typically inform a judgment for imprecision. The OIS has been defined as the minimum amount of cumulative information required for reliable conclusions about an intervention, i.e., a calculation similar to calculating the sample size for an individual trial, the difference being that the OIS considers the potential for heterogeneity between studies.28 Therefore, if the number of participants in the meta-analysis is less than what is generated from a conventional sample-size calculation, there may be serious or very serious concerns about imprecision.
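Both ideas in this paragraph can be illustrated numerically. The sketch below uses the conventional two-proportion sample-size formula as a stand-in for the OIS and does not adjust for heterogeneity between studies. It also shows the probability of observing at least one rare event in a sample of a given size. The function names, the default alpha and power, and the example rates are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_intervention, alpha=0.05, power=0.80):
    """Conventional sample size per group for comparing two proportions.

    Uses the normal-approximation formula
    n = (z_(1-alpha/2) + z_power)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_intervention * (1 - p_intervention)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_intervention) ** 2
    return ceil(n)


def prob_at_least_one_event(event_rate, n_participants):
    """Chance of seeing one or more events, assuming independent participants."""
    return 1 - (1 - event_rate) ** n_participants


# Hypothetical example: 10% control risk, 25% relative risk reduction.
n_per_group = sample_size_per_group(0.10, 0.075)
print("participants needed in total:", 2 * n_per_group)

# Hypothetical rare adverse event: 1 per 100,000, with 3,000 vaccinated participants.
print("chance of observing any event:", round(prob_at_least_one_event(1e-5, 3000), 3))
```

If the total number of participants in a meta-analysis falls well short of the first figure, or the second figure is small, the body of evidence is unlikely to be informative about that outcome.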

Table 13 provides an example of how imprecision assessments are justified. The results from the randomized controlled trials are informed by a large sample size; however, the confidence interval is wide and cannot exclude the potential for both harm and benefit. Thus, concerns with imprecision are serious. In contrast, the results from the non-randomized studies (NRS) have a wide confidence interval that cannot exclude the potential for harm and benefit, and they are also informed by few events that do not meet the optimal information size. Therefore, the concerns with imprecision are classified as "very serious" rather than "serious".

More information on assessing imprecision is available in "GRADE guidelines 6. Rating the quality of evidence--imprecision" (2011).26,29

Table 13. Evidence profile for outcome of incidence of arthritis (5-56 days)

References in this table:8

| No. of studies | Study design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No. of patients: rVSV-vaccine | No. of patients: no rVSV-vaccine | Effect: relative (95% CI) | Effect: absolute (95% CI) | Certainty | Importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Randomized trials | Serious (a) | Not serious | Not serious | Serious (b) | None | 39/1776 (2.2%) | 16/868 (1.8%) | RR 1.80 (d) (0.21 to 15.3) | 23 fewer per 1,000 (from 22 fewer to 400 more) | Low Evidence | Critical |
| 2 | Non-randomized studies | Not serious | Not serious | Not serious | Very serious (b,d) | None | 43/520 (8.3%) | 3/107 (2.8%) | RR 2.06 (d) (0.0001 to 7739.16) | 33 more per 1,000 (from 28 fewer to 1000 more) | Very low Evidence | Critical |

Letters in parentheses are the footnote markers from the original table.

Note: Non-randomized studies without comparators are not included in evidence table, but would be considered of very low certainty (evidence type 4)

Explanations

a. Studies used variable definitions and methods for diagnosing and reporting arthritis. In addition, participants, healthcare personnel, and outcome assessors were not blinded in Huttner 2015 or Samai 2018 potentially influencing events reported for this subjective outcome.

b. The 95% CI includes the potential for possible harms, as well as possible benefit.

c. Few events reported do not meet optimal information size and suggest fragility in the estimate.

d. RR calculated using the standard continuity correction of 0.5 and the overall effect uses a random effects model.

8.5 Publication bias

Publication bias is a type of reporting bias that leads to a systematic underestimation or overestimation of the underlying effect (beneficial or harmful) due to the selective publication of studies.30 Publication bias arises when investigators fail to publish studies, typically those that show no effect. Publication bias might be suspected if the available studies are uniformly small and funded by industry; a thorough review of clinical trial registries should be performed to identify whether any trials were registered but not published. A funnel plot of studies with the magnitude of the effect size (e.g., relative risk or odds ratio for a binary outcome) on the X-axis and variance (a proxy for sample size) on the Y-axis can help assess publication bias. A funnel plot with an asymmetrical distribution suggests publication bias. For meta-analyses with fewer than 10 studies, a funnel plot may be misleading; therefore, it is recommended to construct one only when 10 or more studies are available. In situations with fewer than 10 studies, authors can consider additional factors when assessing publication bias: the size and direction of identified studies, records of unpublished trials, and the availability of the intervention under investigation (i.e., proprietary or specialty vaccines may be more regulated or documented, which increases confidence that all available studies have been identified).

Because publication bias is difficult to determine, it is described as either "undetected" or "strongly suspected" in an evidence profile. Figure 7 provides an example of a funnel plot that has a symmetrical distribution, so there is no strong suspicion of publication bias. Conversely, Figure 8 presents an example in which the funnel plot is asymmetrical and therefore suggests there may be concerns with publication bias, requiring further investigation.
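As a rough illustration of what a funnel plot displays, the sketch below computes one point per study: the log risk ratio for the X-axis and its standard error (the square root of the variance) for the Y-axis. The study counts are hypothetical, the function name `funnel_point` is an assumption, and the formula assumes no study has zero events in either group.

```python
from math import log, sqrt


def funnel_point(events_tx, total_tx, events_ctrl, total_ctrl):
    """Return (log risk ratio, standard error) for one study.

    Standard error of the log risk ratio:
    sqrt(1/a - 1/n1 + 1/c - 1/n2), valid only when a > 0 and c > 0.
    """
    log_rr = log((events_tx / total_tx) / (events_ctrl / total_ctrl))
    se = sqrt(1 / events_tx - 1 / total_tx + 1 / events_ctrl - 1 / total_ctrl)
    return log_rr, se


# Hypothetical studies: (events, total) in the intervention and control groups.
studies = [(10, 100, 20, 100), (45, 500, 60, 500), (3, 40, 6, 40)]
points = [funnel_point(*study) for study in studies]

for log_rr, se in points:
    print(f"log RR = {log_rr:+.3f}, SE = {se:.3f}")
```

Larger studies have smaller standard errors and sit near the top of the funnel when the Y-axis is inverted. Asymmetry is then judged by whether small studies scatter evenly on both sides of the pooled estimate.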

Figure 7. Example of funnel plot with no strong suspicion of publication bias

References in this figure:31


Figure 8. Example of a funnel plot with suspicion of publication bias

References in this figure:30

  1. Guyatt GH, Oxman AD, Vist G, et al. GRADE guidelines: 4. Rating the quality of evidence--study limitations (risk of bias). J Clin Epidemiol. 2011;64(4):407-415. doi:10.1016/j.jclinepi.2010.07.017
  2. Risk of bias tools - RoB 2 tool.
  3. Sterne JA, Savović J, Page MJ, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366
  4. Higgins J, Savović J, Page M, Elbers R, Sterne J. Chapter 8: Assessing risk of bias in a randomized trial. In: Higgins J, Thomas J, Chandler J, et al, eds. Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022). Cochrane; 2022. www.training.cochrane.org/handbook.
  5. Sterne J, Hernán M, McAleenan A, Reeves B, Higgins J. Chapter 25: Assessing risk of bias in a non-randomized study. In: Higgins J, Thomas J, Chandler J, et al, eds. Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022). Cochrane; 2022. www.training.cochrane.org/handbook.
  6. GA Wells BS, D O'Connell, J Peterson, V Welch, M Losos, P Tugwell. The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses. Ottawa Hospital Research Institute. https://www.ohri.ca/programs/clinical_epidemiology/oxford.asp
  7. Thomas Piggott RLM, Carlos A Cuello-Garcia, Nancy Santesso, Reem A Mustafa, Joerg J Meerpohl, Holger J Schünemann; GRADE Working Group. Grading of Recommendations Assessment, Development, and Evaluations (GRADE) notes: extremely serious, GRADE's terminology for rating down by three levels. J Clin Epidemiol. 2020;120:116-120. doi:10.1016/j.jclinepi.2019.11.019
  8. Choi MJ, Cossaboom CM, Whitesell AN, et al. Use of ebola vaccine: recommendations of the Advisory Committee on Immunization Practices, United States, 2020. MMWR Recommendations and Reports. 2021;70(1):1.
  9. (ACIP) ACoIP. Grading of Recommendations, Assessment, Development, and Evaluation (GRADE): Use of JYNNEOS® (orthopoxvirus) vaccine heterologous for those who received ACAM2000 primary series. Centers for Disease Control and Prevention. https://www.cdc.gov/vaccines/acip/recs/grade/JYNNEOS-orthopoxvirus-heterologous.html
  10. Ahmed F, Temte JL, Campos-Outcalt D, Schünemann HJ, Group AEBRW. Methods for developing evidence-based recommendations by the Advisory Committee on Immunization Practices (ACIP) of the US Centers for Disease Control and Prevention (CDC). Vaccine. 2011;29(49):9171-9176.
  11. Committee on Standards for Developing Trustworthy Clinical Practice Guidelines BoHCS, Institute of Medicine. Clinical Practice Guidelines We Can Trust. National Academies Press; 2011.
  12. Schünemann HJ, Wiercioch W, Etxeandia I, et al. Guidelines 2.0: systematic development of a comprehensive checklist for a successful guideline enterprise. CMAJ. 2014;186(3):E123-E142. doi:10.1503/cmaj.131237
  13. World Health Organization. WHO handbook for guideline development. World Health Organization; 2014:167.
  14. Thomas J, Kneale D, McKenzie J, Brennan S, Bhaumik S. Chapter 2: Determining the scope of the review and the questions it will address. In: Higgins J, Thomas J, Chandler J, et al, eds. Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022). Cochrane; 2022. www.training.cochrane.org/handbook.
  15. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 2. Framing the question and deciding on important outcomes. J Clin Epidemiol. 2011;64(4):395-400. doi:10.1016/j.jclinepi.2010.09.012
  16. Fitch K, Bernstein SJ, Aguilar MD, et al. The RAND/UCLA Appropriateness Method User's Manual. 2001. Accessed March 6, 2022. https://www.rand.org/pubs/monograph_reports/MR1269.html
  17. (ACIP) ACoIP. GRADE: Use of Smallpox Vaccine in Laboratory and Health-Care Personnel at Risk for Occupational Exposure to Orthopoxviruses. Centers for Disease Control and Prevention.
  18. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 7. Rating the quality of evidence--inconsistency. J Clin Epidemiol. 2011;64(12):1294-1302. doi:10.1016/j.jclinepi.2011.03.017
  19. Cynthia P Cordero ALD. Key concepts in clinical epidemiology: detecting and dealing with heterogeneity in meta-analyses. J Clin Epidemiol. 2021;130:149-151. doi:10.1016/j.jclinepi.2020.09.045
  20. Gordon Guyatt YZ, Martin Mayer, Matthias Briel, Reem Mustafa, Ariel Izcovich, Monica Hultcrantz, Alfonso Iorio, Ana Carolina Alba, Farid Foroutan, Xin Sun, Holger Schunemann, Hans DeBeer, Elie A Akl, Robin Christensen, Stefan Schandelmaier. GRADE guidance 36: updates to GRADE's approach to addressing inconsistency. J Clin Epidemiol. 2023;158:70-83. doi:10.1016/j.jclinepi.2023.03.003
  21. Higgins J, Li T, Deeks J. Chapter 6: Choosing effect measures and computing estimates of effect. In: Higgins J, Thomas J, Chandler J, et al, eds. Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022). Cochrane; 2022. www.training.cochrane.org/handbook.
  22. Guyatt GH, Thorlund K, Oxman AD, et al. GRADE guidelines: 13. Preparing summary of findings tables and evidence profiles-continuous outcomes. J Clin Epidemiol. Feb 2013;66(2):173-83. doi:10.1016/j.jclinepi.2012.08.001
  23. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 8. Rating the quality of evidence--indirectness. J Clin Epidemiol. 2011;64(12):1303-1310.
  24. ACIP Grading for Ebola Vaccine. CDC. 2021.
  25. Henao-Restrepo AM, Camacho A, Longini IM, et al. Efficacy and effectiveness of an rVSV vectored vaccine in preventing Ebola virus disease: final results from the Guinea ring vaccination, open-label, cluster-randomised trial (Ebola Ça Suffit!). The Lancet. 2017;389(10068):505-518. doi:10.1016/S0140-6736(16)32621-6
  26. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines 6. Rating the quality of evidence--imprecision. J Clin Epidemiol. 2011;64(12):1283-1293. doi:10.1016/j.jclinepi.2011.01.012
  27. Zeng L, Brignardello-Petersen R, Hultcrantz M, et al. GRADE guidelines 32: GRADE offers guidance on choosing targets of GRADE certainty of evidence ratings. J Clin Epidemiol. Sep 2021;137:163-175. doi:10.1016/j.jclinepi.2021.03.026
  28. Pogue JM, Yusuf S. Cumulating evidence from randomized trials: utilizing sequential monitoring boundaries for cumulative meta-analysis. Controlled Clinical Trials. 1997;18(6):580-593.
  29. Gordon H Guyatt ADO, Regina Kunz, Jan Brozek, Pablo Alonso-Coello, David Rind, P J Devereaux, Victor M Montori, Bo Freyschuss, Gunn Vist, Roman Jaeschke, John W Williams Jr, Mohammad Hassan Murad, David Sinclair, Yngve Falck-Ytter, Joerg Meerpohl, Craig Whittington, Kristian Thorlund, Jeff Andrews, Holger J Schünemann. GRADE guidelines 6. Rating the quality of evidence--imprecision. J Clin Epidemiol. 2011;64(12):1283-93. doi:10.1016/j.jclinepi.2011.01.012
  30. Guyatt GH, Oxman AD, Montori V, et al. GRADE guidelines: 5. Rating the quality of evidence--publication bias. J Clin Epidemiol. 2011;64(12):1277-1282. doi:10.1016/j.jclinepi.2011.01.011
  31. Yong PJ, Matwani S, Brace C, et al. Endometriosis and Ectopic Pregnancy: A Meta-analysis. J Minim Invasive Gynecol. 2020;27(2):352-361.e2.