Understanding SRTR's Outcome Assessment

Summarizing transplant program performance using a 5-tier system

SRTR assigns each transplant program an outcome assessment tier (1 to 5) based on how many patients remain alive with a functioning transplanted organ 1 year after transplant. The outcome assessment is displayed in the program search results and in the more detailed results shown for each program. This guide is meant to help you understand why SRTR assigns the outcome assessment, how SRTR calculates it, and how to interpret it.


A Guide for Patients

Outcome assessment
The outcome assessment tells you whether the program's 1-year survival after transplant is better, worse, or about the same as what is expected for that program.

Why does the assessment show how outcomes compare to what is “expected”?

Many things affect whether a transplant recipient has a good outcome after transplant. Some patients are sicker than others and some organs come from healthier donors. Some programs may perform more transplants in sicker patients, and these programs would “expect” to have more complications. The comparison to “expected” tells you how a program’s outcomes compare to national predicted outcomes for similar patients transplanted with similar donors.


Why is the outcome assessment important to me?
Patients want to undergo transplant with the lowest risk of complications. An assessment value of 3 or higher indicates that a program has recently demonstrated good outcomes. 

What do the numbers mean?

Based on the health of the transplant recipients and donors:

  • 5 means that a program's outcomes are better than expected.
  • 4 means that a program's outcomes are somewhat better than expected.
  • 3 means that a program's outcomes are about the same as expected.
  • 2 means that a program's outcomes are somewhat worse than expected.
  • 1 means that a program's outcomes are worse than expected.

What should I think about when interpreting the outcome assessment?

One program's assessment might be higher than another's, but the assessment may not be the most important factor in your decision. You may also want to consider:

  • Waiting times.
  • Distance from home.
  • Insurance coverage.

In some cases, the risk of a complication after surgery is much lower than the risk of becoming too sick to undergo transplant while on the waiting list. Your care team can discuss the risks and how they may affect your decisions.

How is the outcome assessment calculated?

The outcome assessment is a statistical calculation; read on to learn more.

Why does SRTR assign an outcome assessment?

While considering guidance from the Agency for Healthcare Research and Quality (AHRQ), we developed the 5-tiered outcome assessment system to make it easier for the general public to understand and compare the outcomes of different transplant programs. This is in alignment with the reporting requirements of the OPTN Final Rule, which states that OPTN and SRTR, as appropriate, shall "Make available to the public timely and accurate program-specific information on the performance of transplant programs. This shall include free dissemination over the Internet, and shall be presented, explained, and organized as necessary to understand, interpret, and use the information accurately and efficiently" (OPTN Final Rule 121.11(b)(iv)). Further, in fulfillment of the Final Rule, the SRTR contractor must identify transplant programs and organ procurement organizations with better or worse outcomes (SRTR Task 3.9.1).

To fulfill its contractual obligation, SRTR currently evaluates transplant outcomes at three time points: 1 month, 1 year, and 3 years after transplant. In addition, SRTR evaluates two different outcomes: 1) survival with a functioning transplanted organ, and 2) survival regardless of whether the organ continues to function.

SRTR uses complex statistical methods to perform these evaluations. These methods attempt to adjust for the case mix at the transplant program so programs that perform transplants in sicker patients, or accept higher-risk donors than other programs, are not penalized in their evaluations. These evaluations employ a Bayesian statistical methodology that results in an estimated "hazard ratio" for every program, for each outcome evaluated. The hazard ratio tells us how each program's outcomes compare with what we expected to happen given the types of patients and the types of donors the program accepts. A hazard ratio of 1.0 indicates that the program's results were exactly as expected, whereas a value of 2.0 indicates failure rates, i.e., death or failure of the transplanted organ, that are twice as high as expected, and, conversely, a value of 0.5 indicates failure rates that are half what would be expected.

In addition, a level of certainty is associated with each estimated hazard ratio. Larger programs have more data available, and therefore we are generally more certain about their hazard ratios; we have less certainty about smaller programs because they have less data. Therefore, even if two programs have the same hazard ratio, we may be more certain about one than the other. These are difficult statistical concepts. SRTR uses the 5-tiered outcome assessment to translate program outcomes for people without statistical training.
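To make the hazard ratio concrete, here is a minimal sketch in Python. The counts are hypothetical, and dividing observed by expected failures is only a rough intuition for the estimate, not SRTR's actual Bayesian calculation.

```python
# Illustration only: hypothetical counts, not SRTR's actual method.
observed_failures = 14    # graft failures seen at the program
expected_failures = 10.2  # failures predicted nationally for the same case mix

hazard_ratio = observed_failures / expected_failures  # rough intuition only
print(f"Hazard ratio ~ {hazard_ratio:.2f}")
# 1.0 = exactly as expected; 2.0 = twice the expected failure rate;
# 0.5 = half the expected failure rate.
```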

Recommendations from the Agency for Healthcare Research and Quality (AHRQ)

In June 2010, the Agency for Healthcare Research and Quality (AHRQ) published Best Practices in Public Reporting (Hibbard J, Sofaer S, AHRQ Publication No. 10-0082-EF, June 2010). This three-part publication makes many recommendations about the effective reporting of complex healthcare quality data. SRTR, along with its Visiting Committee, carefully considered these recommendations. Key recommendations that influenced development of the 5-tiered system included:

  • "Make it easy for consumers to understand and use the comparative information. Reduce the cognitive burden by summarizing, interpreting, highlighting meaning, and narrowing options." 
  • "Rank ordering by performance as opposed to alphabetical ordering."
  • "Using symbols instead of numbers."
  • "Providing an overall summary measure"
  • "Including fewer reporting categories (5 vs. 9)"

SRTR and the SRTR Visiting Committee considered these recommendations in development of a simplified summary metric to convey difficult concepts of program quality to the general public.

How is the outcome assessment calculated?

Step 1. Estimate the Program's Hazard Ratio for First-Year Graft Failure

Figure 1. Visual depiction of the posterior probability density of the program's hazard ratio for first-year graft failure.

The outcome assessment is calculated from SRTR's evaluation of the program's first-year graft failure rate. In other words, we start with our assessment of how often patients at the program die or lose function of their transplanted organ (the graft) within the first year after transplant. This evaluation results in an estimate of the program's hazard ratio. The hazard ratio is a measure of how many patients did not make it through the first year with a functioning graft relative to how many we expected not to make it. So, a hazard ratio of 1.0 means that we observed exactly the number of graft failures we expected at the program (after taking into account the types of patients and the donors the program accepts). A hazard ratio of 0.5 means that the program experienced half the failures we expected, and a hazard ratio of 2.0 means that the program experienced double the number of failures that we expected.

Estimating a program's performance always involves some degree of uncertainty. Therefore, we calculate a bell-shaped curve like the one shown in Figure 1 that describes the likely location of the program's hazard ratio. A narrower bell curve indicates that we have more certainty about the estimate, and a wider bell curve indicates less certainty. Technical information on the estimation of a program's hazard ratio was published in the American Journal of Transplantation.
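For readers who want to see the idea in code, here is a minimal sketch of a posterior density for a program's hazard ratio. It assumes a Poisson likelihood for event counts and a gamma prior centered at 1.0; this mirrors the spirit of the published methodology, but the prior parameters and counts below are illustrative assumptions, not SRTR's production values.

```python
# Sketch: a bell-shaped posterior for a program's hazard ratio, assuming a
# Poisson likelihood for failures and a Gamma prior centered at HR = 1.0.
# All numbers are hypothetical.
import numpy as np
from scipy import stats

observed = 14    # observed first-year graft failures
expected = 10.2  # risk-adjusted expected failures

a_prior, b_prior = 2.0, 2.0  # Gamma(shape, rate) prior with mean 1.0
posterior = stats.gamma(a=a_prior + observed, scale=1.0 / (b_prior + expected))

hr_grid = np.linspace(0.01, 3.0, 500)
density = posterior.pdf(hr_grid)  # the bell-shaped curve depicted in Figure 1

print(f"Posterior mean HR: {posterior.mean():.2f}")
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
```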

Step 2. Assign a Score to Each Hazard Ratio Value

Once we have an estimate of the program's hazard ratio (as shown in Figure 1), we assign a score to the program based on the location of the bell-shaped curve and on how certain we are of the assessment. The location of the curve, i.e., how far to the left or to the right it is, tells us whether the program's graft failure rates are likely better than expected (curve shifted to the left of 1.0) or worse than expected (curve shifted to the right of 1.0). The spread of the bell-shaped curve, i.e., how wide or narrow it is, tells us how certain we are of the estimate. To assign a score to the program's assessment, we apply a score function (Figure 2). This function assigns more value to hazard ratio estimates that are less than 1.0 (indicating better-than-expected performance) and less value to estimates that are more than 1.0. This is shown visually in Figure 2 as the curve gradually declines moving from left to right.

Figure 2. The scoring function used to assign a score to each program's performance.
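As an illustration, a declining curve of the kind shown in Figure 2 can be sketched as follows. The functional form and the steepness value are assumptions chosen to match the qualitative shape described in this guide (symmetric around 1.0 on the log scale, nearly plateauing by roughly 0.67 and 1.5); they are not SRTR's published parameters.

```python
# Sketch of a logistic score function over hazard ratios: high value for
# HR < 1 (better than expected), low value for HR > 1 (worse than expected).
def score_function(hr, steepness=7.0):
    # Equivalent to a logistic function of log(hr); steepness is illustrative.
    return 1.0 / (1.0 + hr ** steepness)

for hr in (0.5, 0.67, 1.0, 1.5, 2.0):
    print(f"HR {hr:4.2f} -> weight {score_function(hr):.3f}")
# HR 1.00 maps to 0.5; the weights for 0.5 and 2.0 sum to 1.0 (symmetry).
```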

Step 3. Apply the Score Function

Figure 3. Applying the scoring function to the program's posterior density of the hazard ratio to arrive at a score.

Using the score function described in Step 2, we arrive at a new curve by applying the score function across the full range of the original bell-shaped curve for the program's hazard ratio. This is done by multiplying the value of the score function by the value of the bell-shaped curve at each point along the curve (Figure 3).

Step 4. Arrive at a Program Score

After applying the score function to the bell-shaped curve for the program's hazard ratio, we arrive at a new curve: the product of the score function and the original bell-shaped curve. The area under this new curve is the program's score (Figure 4).

Figure 4. Calculating the program's final score.
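Continuing the hypothetical numbers from the earlier sketches, the score can be computed numerically as the area under the product curve, i.e., the expected value of the score function under the posterior. Everything here rests on the same invented counts and assumed functional forms as above.

```python
# Sketch: multiply the score function by the posterior density across the
# range of hazard ratios, then take the area under the resulting curve.
import numpy as np
from scipy import stats

hr_grid = np.linspace(0.001, 5.0, 2000)
posterior = stats.gamma(a=2.0 + 14, scale=1.0 / (2.0 + 10.2))  # Step 1 sketch
weights = 1.0 / (1.0 + hr_grid ** 7.0)                         # Step 2 sketch

product_curve = weights * posterior.pdf(hr_grid)                  # Figure 3
score = float(np.sum(product_curve) * (hr_grid[1] - hr_grid[0]))  # area, Figure 4
print(f"Program score: {score:.3f}")  # always between 0 and 1
```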

Step 5. Assign the Program's Score to an Assessment Tier (1-5)

Figure 5. Assigning the program's score to 1 of 5 assessment tiers.

The final step in the process is to assign the score to 1 of 5 possible assessment tiers. Each program's score will be between 0 and 1, with higher scores indicating likely better performance. Final tiers for the outcome assessment are assigned as shown in Figure 5 and the following table:

Score Range        Assessment Tier
0 - <0.125         1 (worse than expected)
0.125 - <0.375     2 (somewhat worse than expected)
0.375 - <0.625     3 (as expected)
0.625 - <0.875     4 (somewhat better than expected)
0.875 - 1.0        5 (better than expected)
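A minimal sketch of this final mapping, using the cut-points from the table above:

```python
# Map a program's score (between 0 and 1) to 1 of the 5 assessment tiers.
def assign_tier(score):
    if score < 0.125:
        return 1  # worse than expected
    elif score < 0.375:
        return 2  # somewhat worse than expected
    elif score < 0.625:
        return 3  # as expected
    elif score < 0.875:
        return 4  # somewhat better than expected
    return 5      # better than expected

print(assign_tier(0.18))  # -> tier 2
```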

Frequently Asked Questions

Why did SRTR change the 3-tier transplant program outcome assessment system to a 5-tier system?

SRTR is charged with reporting the proportion of transplanted organs and recipients that survive after transplantation at each program in the United States. SRTR attempts to provide this information over the internet in ways that are easy to understand, interpret, and use, and updates this information every 6 months. In 2002, SRTR began publishing transplant program-specific reports on its public website. SRTR presented a 3-tier assessment of how many transplants survived the first year after transplantation at each program: “as expected,” “better than expected,” or “worse than expected.”

From the early 2000s through late 2014, these assessments were based on a statistical hypothesis test and required extremely strong statistical evidence to place a program outside of the “as expected” category. Specifically, programs were placed in the “as expected” category unless there was a 97.5% chance that their outcomes were better or worse than national norms. As a result of these stringent requirements, almost all programs were “as expected” regardless of observed performance. Therefore, the “as expected” category failed to identify important differences in program outcomes. For example, the worst program in the “as expected” category typically had a transplant failure rate over four times higher than the best program in the “as expected” category. Given these facts, SRTR decided to explore alternative systems that would better convey differences in program outcomes to the general public.

In 2012, SRTR, OPTN, and HRSA hosted a Consensus Conference on Transplant Program Quality and Surveillance. One of the recommendations from this conference was to explore ways to make the transplant outcome assessments more understandable for patients and the general public. In 2013, SRTR began working with its oversight committee, now called the SRTR Visiting Committee, on a different program outcome assessment system. SRTR considered many alternatives and best practices in public reporting as recommended by the Agency for Healthcare Research and Quality (AHRQ). After exploring several different options, SRTR and its Visiting Committee recommended a 5-tier assessment system. This system was presented and discussed at OPTN committees and national conferences over a period of 3 years before it was released in December 2016.

The 5-tier assessment system was designed to better inform patients on program performance by identifying differences in posttransplant outcomes.  Specifically, the 5-tier assessment places only 30% of programs in the “as expected” tier (Table 1), while 96% of programs are “as expected” in the 3-tier assessment (Table 2).  By more evenly distributing programs across each category, the 5-tier assessment better identifies programs with similar posttransplant outcomes (Figure 1). Therefore, the 5-tier assessment will better inform the general public of transplant program performance.

After the new 5-tier system was released, a number of transplant programs and transplant surgeons expressed to HRSA that they had not had enough time to examine and understand the new system. As a result, HRSA directed SRTR to temporarily return to reporting program outcomes with the 3-tier system, while at the same time placing the new 5-tier system on a separate publicly accessible website so that everyone can compare the 3-tier system with the 5-tier system and have adequate time to understand the differences in the two systems.

Table 1. Numbers of adult transplant programs in each of the 5-tier assessment system categories.

Tier 1 = worse than expected; Tier 2 = somewhat worse than expected; Tier 3 = good, as expected; Tier 4 = somewhat better than expected; Tier 5 = better than expected.

Transplant Type    Tier 1    Tier 2    Tier 3    Tier 4    Tier 5
Heart                   8        16        44        47         8
Kidney                 12        52        78        61        30
Liver                   5        32        40        37        10
Lung                    3        17        22        20         5

Table 2. Numbers of adult transplant programs in each of the 3-tier assessment system categories.

Transplant Type    Worse than Expected    As Expected    Better than Expected
Heart                                1            121                       1
Kidney                               7            218                       8
Liver                                0            121                       3
Lung                                 1             65                       1

Figure 1. The range of program outcomes within each of the tiers under the 3-tier and 5-tier systems.  A tier with wider dashed lines (or whiskers) indicates less similar program performance within the tier.  For example, the “as expected” category in the 3-tier system fails to capture important differences in program outcomes, with the worst program possessing a transplant failure rate over four times higher than the best program (2.0/0.5 = 4).
Why is SRTR using a 5-tiered assessment system?

One of SRTR's functions is to provide information to the public on the performance of transplant programs as mandated in the OPTN Final Rule. The metrics SRTR developed to assess program performance are necessarily complex, involving statistics and mathematics that are often difficult to explain. Trying to convey transplant program performance metrics that are based on risk-adjusted Cox proportional hazards models that produce expected event counts which are transformed into Bayesian posterior distributions of the program's hazard ratio is quite challenging! Therefore, SRTR has developed this system in consultation with both HRSA and the SRTR Visiting Committee as a way to summarize and communicate complex concepts to the general public. In doing so, SRTR has tried to adhere to the recommendations of AHRQ for best practices in communicating healthcare quality information to the public, as described above.

Doesn't a 5-tiered system lose important information?

Whenever we transform something that is continuous in nature, such as a program's hazard ratio, which can range from just above 0 to positive infinity, into a 5-tiered system, we necessarily lose information. For example, we lose information about whether the program's score was close to a tier boundary (a high 3 or a low 4, for example). Furthermore, we lose the sense of how certain we are of the overall evaluation, something that is conveyed by the size and shape of the program's bell-shaped estimate of the hazard ratio. While AHRQ recommends summarizing and condensing these complex statistical measurements, which the 5-tier system does, SRTR continues to make the full evaluation of each program available to interested readers.

Does this evaluation system cause programs to avoid performing higher risk transplants?

SRTR’s evaluations are “risk adjusted” so that programs taking on higher-risk patients and/or using higher-risk donor organs are not penalized for having lower survival rates. For each assessment, SRTR takes into account many recipient and donor characteristics in an effort to make a fairer comparison of program outcomes. The factors considered in each evaluation are available for review, and a publication detailing the process of arriving at the risk adjustment models is also available. By performing these adjustments for risk, SRTR is comparing the outcomes achieved at a program with outcomes achieved nationally for similar levels of risk. SRTR has also recently shown that there is no relationship between the percentage of high-risk transplants performed at a program and its likelihood of having a poor outcome assessment. SRTR continues to work with the OPTN to advocate for improved data collection to enhance future versions of the risk adjustment models.
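A rough sketch of how this kind of risk adjustment can work in practice follows, assuming the lifelines library and a hypothetical dataset: fit a national Cox proportional hazards model, convert each patient's model-predicted cumulative hazard into an expected event count, and compare observed with expected failures by program. The column names, the file transplants.csv, and the covariates are all invented for illustration; this shows the general technique, not SRTR's production pipeline.

```python
# Sketch of risk adjustment via a national Cox model (illustrative only).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# One row per transplant, follow-up capped at 1 year.
# Hypothetical columns: time, event, recipient_age, donor_age, program_id.
df = pd.read_csv("transplants.csv")

cph = CoxPHFitter()
cph.fit(df.drop(columns=["program_id"]), duration_col="time", event_col="event")

# Expected events per patient ~ baseline cumulative hazard at that patient's
# follow-up time, scaled by the patient's relative risk exp(X*beta).
baseline = cph.baseline_cumulative_hazard_["baseline cumulative hazard"]
h0_at_t = np.interp(df["time"], baseline.index, baseline.values)
df["expected"] = h0_at_t * cph.predict_partial_hazard(df).values

by_program = df.groupby("program_id").agg(
    observed=("event", "sum"), expected=("expected", "sum"))
by_program["o_over_e"] = by_program["observed"] / by_program["expected"]
print(by_program.head())  # O/E near 1.0 means outcomes match the case mix
```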

Are the models good enough to support this type of evaluation?

SRTR follows a published process for building and maintaining the risk-adjustment models used to account for recipient and donor characteristics when evaluating programs (Snyder JJ, Salkowski N, Kim SJ, Zaun D, Xiong H, Israni AK, Kasiske BL. Developing statistical models to assess transplant outcomes using national registries: The process in the United States. Transplantation. 2016;100:288-294). In addition, we are continually working in partnership with the OPTN community to discuss ways to collect better data to allow for better prediction of outcomes. 

How did you arrive at the score function used in Steps 2 and 3?

SRTR worked with the Visiting Committee over a period of approximately 4 years to develop and refine the process. Various score functions were considered, each a continuously declining function (no step functions) that starts near 1.0 and declines to 0 as the hazard ratio increases. This type of function (referred to as a logistic function) allows for assignment of value to each possible value of the hazard ratio. The function assigns more value to estimates of the hazard ratio that are good, and less value to estimates that are bad. The logistic function can be formulated in different ways, some of which result in steeper or flatter curves. We decided on this function because:

  • It is symmetric around a hazard ratio of 1.0; i.e., having a hazard ratio of 2.0 (twice the failure rate as expected) is scored the opposite of having a hazard ratio of 0.5 (half the failure rate as expected).
  • It is continuous such that estimates of the hazard ratio that are close to each other, e.g., 1.20 and 1.21, receive about the same score.
  • It (nearly) plateaus near 0.67 on the good end and 1.5 on the bad end, which makes the value judgment that estimates of the hazard ratio that are more than 33% better than expected or more than 50% worse than expected are nearly equivalently good or bad, respectively.

While these were the primary reasons for choosing the score function, other versions of the score function yield similar final tier assessments because altering the score function requires altering the function that assigns scores to tiers, as discussed in the FAQ on the tier assignment function.
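These properties can be checked directly with the illustrative logistic function from the Step 2 sketch (again, the functional form and steepness value are assumptions, not SRTR's published parameters):

```python
# Verify the three properties listed above for an illustrative score function.
def score_function(hr, steepness=7.0):
    return 1.0 / (1.0 + hr ** steepness)

# 1) Symmetry around HR = 1.0: scores for 0.5 and 2.0 are mirror images.
assert abs(score_function(0.5) + score_function(2.0) - 1.0) < 1e-12

# 2) Continuity: nearby hazard ratios receive about the same score.
print(score_function(1.20), score_function(1.21))  # nearly identical

# 3) Near-plateaus beyond ~0.67 (good end) and ~1.5 (bad end).
print(score_function(0.67), score_function(0.5))   # both close to 1
print(score_function(1.5), score_function(2.0))    # both close to 0
```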

How did you arrive at the function that assigns program scores to tiers in Step 5?

In step 5, we must assign each program's score to 1 of 5 tiers. The cut-points were chosen in consultation with the SRTR Visiting Committee such that most programs would fall within tier 3, with fewer programs (in roughly equal numbers) in tiers 2 and 4, and even fewer programs in tiers 1 and 5. Having said that, it is not necessary for any programs to fall into the extreme tiers. If all programs in the nation had very similar outcomes, it would be possible for all of them to be placed within tier 3. We are not forcing a certain number or percentage of programs to be assigned to each tier.

What is the actual difference in outcomes across the 5 tiers?

Because different programs accept different levels of risk, i.e., some programs transplant sicker patients and/or use lower-quality donor organs, it is difficult to draw conclusions from actual failure rates across programs within the tiers. However, we can look at predicted failure rates for programs in each of the tiers, as if they all transplanted a patient of “average” risk. The differences in the predicted failure rates across the tiers are shown in the following table:

Predicted First-Year Transplant Failure Rate

Tier                                  Heart     Kidney    Liver     Lung
5 (Better than expected)              5.5%      2.6%      6.4%      7.5%
4 (Somewhat better than expected)     8.2%      4.1%      9.0%      9.9%
3 (Good, as expected)                 10.4%     5.0%      10.4%     13.4%
2 (Somewhat worse than expected)      13.6%     6.6%      13.1%     17.4%
1 (Worse than expected)               18.5%     8.8%      18.6%     24.4%

The above table shows that there is approximately a 3-fold difference in first-year failure rates across the range of tiers (for example, in heart transplant, 18.5% in tier 1 versus 5.5% in tier 5, a ratio of about 3.4).

 

In addition, we can show the median hazard ratio for programs within each tier:

Median Hazard Ratio for First-Year Transplant Failure

Tier                                  Heart     Kidney    Liver     Lung
5 (Better than expected)              0.55      0.52      0.60      0.55
4 (Somewhat better than expected)     0.83      0.82      0.85      0.73
3 (Good, as expected)                 1.07      1.00      1.00      1.02
2 (Somewhat worse than expected)      1.42      1.34      1.27      1.35
1 (Worse than expected)               1.99      1.81      1.86      1.97

The above table shows that programs falling within tier 5 have a failure rate that is about half what would be expected based on national performance, while programs falling within tier 1 have a failure rate that is about double what would be expected.

How certain are you that a program in a higher tier is really better than a program in a lower tier?

SRTR performed simulation studies to ascertain how well this system performs at ordering programs correctly into the tiers. In the simulations, we know exactly which programs have better outcomes and which programs have worse outcomes and we assess the system's ability to order programs correctly into tiers. The results of those simulations are shown in the following table:

The probability that a program in the column tier has truly better outcomes than a program in the row tier:

           Tier 5    Tier 4    Tier 3    Tier 2
Tier 4     0.72
Tier 3     0.81      0.63
Tier 2     0.91      0.78      0.66
Tier 1     0.98      0.92      0.86      0.75

For example, these simulations show that the probability that a tier 5 program truly has better outcomes than a tier 4 program is 0.72 (72%). The probability goes up with further separation in the tiers; e.g., we are 91% certain that a tier 5 program has better outcomes than a tier 2 program.
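The following is a minimal sketch of this kind of simulation under simplified assumptions: true hazard ratios drawn from a lognormal distribution, Poisson event counts, and the illustrative gamma-posterior scoring from the earlier sketches. The distributions and parameters are invented for illustration and will not reproduce the published numbers exactly.

```python
# Simulation sketch: how often does a higher-tier program truly outperform
# a lower-tier one? All distributions and parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
true_hr = rng.lognormal(mean=0.0, sigma=0.3, size=n)  # "true" program effects
expected = rng.uniform(5, 60, size=n)                  # expected failure counts
observed = rng.poisson(true_hr * expected)             # simulated failures

# Score each program as the expected score under its gamma posterior (Steps 1-4).
grid = np.linspace(0.001, 5.0, 1000)
w = 1.0 / (1.0 + grid ** 7.0)
step = grid[1] - grid[0]
scores = np.array([
    np.sum(w * stats.gamma(a=2.0 + o, scale=1.0 / (2.0 + e)).pdf(grid)) * step
    for o, e in zip(observed, expected)
])

tiers = np.searchsorted([0.125, 0.375, 0.625, 0.875], scores) + 1  # Step 5

# Probability that a tier-5 program has a truly lower (better) hazard ratio
# than a tier-4 program, estimated over all cross-tier pairs.
t5, t4 = true_hr[tiers == 5], true_hr[tiers == 4]
print(f"P(tier 5 truly better than tier 4) ~ {(t5[:, None] < t4[None, :]).mean():.2f}")
```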

How does this system handle small vs. large programs?

Smaller programs have less precise estimates of their hazard ratios because less data are available. Therefore, smaller programs have wider bell-shaped curves, and curves that are more likely to be centered near 1.0 due to the Bayesian evaluation system. This makes it more likely that small programs land somewhere in tiers 2-4; only in extreme cases is there enough evidence to place them in tier 1 or 5. Extremely small programs with low apparent actual survival can be ‘as expected’ due, in part, to large variability in actual survival.
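The shrinkage effect can be seen in a small sketch that reuses the gamma-posterior assumption from the Step 1 sketch: two programs with the same observed-to-expected ratio but different volumes end up with very different certainty, and the smaller one is pulled closer to 1.0.

```python
# Same O/E ratio (1.5), different program sizes: the small program's posterior
# is wider and shrunk further toward HR = 1.0 by the prior. Illustrative only.
from scipy import stats

for label, observed, expected in [("small program", 3, 2.0),
                                  ("large program", 60, 40.0)]:
    post = stats.gamma(a=2.0 + observed, scale=1.0 / (2.0 + expected))
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{label}: posterior mean HR {post.mean():.2f}, "
          f"95% interval ({lo:.2f}, {hi:.2f})")
```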

Why not show actual survival achieved at each program instead of the 5-tier assessment?

The actual survival percentage achieved at a program is not a useful measure of the program's outcomes compared with other programs, because some programs take on more risk than others. For example, Hospital A may have a lower survival percentage than Hospital B simply because Hospital A accepts riskier patients than Hospital B. It is also possible that Hospital A performs very well with risky transplants and could receive a tier 5 assessment even though it has lower absolute survival than Hospital B. Therefore, it would be misleading to consumers to present actual patient survival as the basis for a quality metric to rank transplant programs. The risk-adjusted assessments that form the basis of the 5-tier system are necessary to allow programs to take on varying degrees of risk without the fear of poor outcome evaluations. Actual patient survival percentages are available in the program reports, but they do not form the sole basis of the quality evaluation.
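A small numeric sketch of the Hospital A/Hospital B scenario, with invented counts:

```python
# Hypothetical: Hospital A has lower raw survival but beats expectations for
# its riskier case mix; Hospital B has higher raw survival yet underperforms.
hospitals = {
    # name: (transplants, observed_failures, expected_failures)
    "Hospital A (riskier patients)":    (200, 24, 32.0),
    "Hospital B (lower-risk patients)": (200, 16, 11.0),
}
for name, (n, obs, exp) in hospitals.items():
    print(f"{name}: raw survival {1 - obs / n:.0%}, O/E ratio {obs / exp:.2f}")
# Hospital A: 88% survival, O/E 0.75 (better than expected)
# Hospital B: 92% survival, O/E 1.45 (worse than expected)
```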

How does the 5-tier assessment differ from statistical hypothesis tests?

Statistical hypothesis tests were designed to address the strength of evidence against a given null hypothesis, rather than place programs into tiers of relative posttransplant outcomes. The choice of an appropriate null hypothesis is especially problematic. For example, it is not clear whether the null hypothesis should be a hazard ratio of 1 or the hazard ratio of a program with better or worse outcomes. Thus, the 5-tier assessment avoids statistical hypothesis testing and instead creates tiers of programs with similar posttransplant outcomes.

Where can I find more information?