ROC Analysis
ROC methodology is appropriate in situations where there are 2 possible "truth states" (e.g., diseased/normal, event/non-event, or some other binary outcome), "truth" is known for each case, and "truth" is determined independently of the diagnostic tests, predictor variables, or other measurements under study.
In this subdirectory, you will find a number of programs (mostly in FORTRAN) used in ROC analysis. They are briefly described below, along with general guidelines to help you decide which program is most appropriate for your data. First, some basic terminology to help you make your decision (note that "disease" can be replaced with "condition" or "event"):
Rating data vs Continuous data
The term "rating data" is used to describe data based on an ordinal scale. For example, it is common in radiology studies to use a 5-point scale such as 1=disease definitely absent, 2=disease probably absent, 3=disease possibly present, 4=disease probably present, 5=disease definitely present. "Continuous data" refers to either truly continuous measurements or "percent confidence" scores (0-100).
Interpreting the Area Under the ROC Curve (AUC)
The area under the ROC curve (AUC) is commonly used as a summary measure of diagnostic accuracy. It can take values from 0.0 to 1.0. The AUC can be interpreted as the probability that a randomly selected diseased case (or "event") will be regarded with greater suspicion (in terms of its rating or continuous measurement) than a randomly selected nondiseased case (or "non-event"). So, for example, in a study involving rating data, an AUC of 0.84 implies that there is an 84% likelihood that a randomly selected diseased case will receive a more-suspicious (higher) rating than a randomly selected nondiseased case. Note that an AUC of 0.50 means that the diagnostic accuracy in question is equivalent to that which would be obtained by flipping a coin (i.e., random chance). It is possible but not common to run into AUCs less than 0.50. It is often informative to report a 95% confidence interval for a single AUC in order to determine whether the lower endpoint is > 0.50 (i.e., whether the diagnostic accuracy in question is, with some certainty, any better than random chance).
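The probability interpretation above can be checked directly by counting pairs. The following is a minimal Python sketch, not one of the programs in this subdirectory; the function name and the ratings are hypothetical, and tied pairs are counted as one half (a common convention that is assumed here).

```python
from itertools import product

def auc_pairwise(diseased, nondiseased):
    """Fraction of (diseased, nondiseased) pairs in which the diseased case
    receives the higher score; tied pairs count as one half."""
    total = 0.0
    for d, n in product(diseased, nondiseased):
        if d > n:
            total += 1.0
        elif d == n:
            total += 0.5
    return total / (len(diseased) * len(nondiseased))

# Hypothetical 5-point ratings (1 = definitely absent ... 5 = definitely present)
diseased_ratings = [5, 4, 4, 3, 5, 2]
nondiseased_ratings = [1, 2, 3, 1, 2, 4]

auc = auc_pairwise(diseased_ratings, nondiseased_ratings)
print(round(auc, 3))
```

With these made-up ratings, 30.5 of the 36 possible pairs are ordered correctly, so the estimated AUC is about 0.85.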
Designing an ROC study: Which scale to use?
While ordinal (1-5) rating scales are probably the most widely used in radiology studies, there are advantages to using "percent confidence" (0-100) scales. (Of course, if you are dealing with a continuous measurement, you don't have to worry about which scale to use.) For continuous data, nonparametric methods are quite reasonable. With rating data, parametric methods are recommended, as nonparametric methods will be biased (i.e., tend to underestimate the true AUC). The standard error of the estimated area under the ROC curve is smaller using a continuous scale.
Parametric vs Nonparametric methodology
"Parametric" methodology refers to inference (maximum-likelihood estimates, or MLEs) based on the binormal model (i.e., the estimate assumes one normal distribution for cases with the disease and one normal distribution for cases without, or that the data can be monotonically transformed to normality). When this assumption holds, the MLE is asymptotically unbiased.
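Under the two-normal assumption, the AUC has a closed form: if nondiseased scores follow N(mu0, sd0^2) and diseased scores follow N(mu1, sd1^2), then AUC = Phi((mu1 - mu0) / sqrt(sd0^2 + sd1^2)), where Phi is the standard normal CDF. The Python sketch below plugs sample means and standard deviations into that formula. This is a simplification for illustration only: it is not the maximum-likelihood fit that the parametric programs listed later perform, and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def binormal_auc(mu0, sd0, mu1, sd1):
    """AUC when nondiseased scores are N(mu0, sd0^2) and diseased scores are
    N(mu1, sd1^2), with higher scores indicating disease."""
    return NormalDist().cdf((mu1 - mu0) / sqrt(sd0 ** 2 + sd1 ** 2))

# Hypothetical continuous measurements
nondiseased = [10.2, 11.5, 9.8, 12.1, 10.9, 11.0]
diseased = [12.4, 13.9, 11.7, 14.2, 13.0, 12.8]

# Plug-in estimate (assumption: sample moments stand in for the MLEs)
auc_hat = binormal_auc(mean(nondiseased), stdev(nondiseased),
                       mean(diseased), stdev(diseased))
print(round(auc_hat, 3))
```

When the two distributions are identical the formula returns 0.50, which matches the coin-flip interpretation.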
"Nonparametric" refers to inference based on the trapezoidal rule (which is equal to the Wilcoxon estimate of the area under the ROC curve, which in turn is equal to the "c"-statistic in SAS PROC LOGISTIC output). Nonparametric estimates of the area under the ROC curve (AUC) tend to underestimate the "smooth curve" area (i.e., parametric estimates), but this bias is negligible for continuous data.
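The equality between the trapezoidal area and the Wilcoxon estimate can be verified numerically. The Python sketch below builds the empirical ROC points from hypothetical ratings, applies the trapezoidal rule, and compares the result with the pairwise (Wilcoxon) statistic. All function names are illustrative and do not correspond to any program in this subdirectory.

```python
def roc_points(diseased, nondiseased):
    """Empirical ROC points (FPR, TPR). A case is called positive when its
    score is >= the threshold; thresholds run from strictest to most lenient."""
    thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(d >= t for d in diseased) / len(diseased)
        fpr = sum(n >= t for n in nondiseased) / len(nondiseased)
        points.append((fpr, tpr))
    return points

def trapezoidal_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def wilcoxon_auc(diseased, nondiseased):
    """Proportion of correctly ordered pairs, with ties counted as one half."""
    wins = sum((d > n) + 0.5 * (d == n) for d in diseased for n in nondiseased)
    return wins / (len(diseased) * len(nondiseased))

# Hypothetical 5-point ratings
diseased_ratings = [5, 4, 4, 3, 5, 2]
nondiseased_ratings = [1, 2, 3, 1, 2, 4]

trap = trapezoidal_auc(roc_points(diseased_ratings, nondiseased_ratings))
wilc = wilcoxon_auc(diseased_ratings, nondiseased_ratings)
print(round(trap, 4), round(wilc, 4))
```

The two numbers agree because each tied pair contributes half a unit to the Wilcoxon count, which is exactly the area of the triangle the trapezoidal rule adds for a diagonal segment.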
Recommendations
For rating data, try a parametric method first. The bias inherent in the nonparametric method might be problematic. If the data are sparse (i.e., nondiseased patients and diseased patients tend to be rated at opposite ends of the scale), then parametric methods may not work well. Using the nonparametric approach is an option in these cases, but may provide even more biased results than it normally would.
For continuous data, either the parametric or nonparametric approach is fine.
Correlated data
"Correlated data" refers to multiple observations obtained from the same "region of interest" (ROI). For example, in a study of appendicitis screening, each patient may be imaged by two different "modalities" (e.g., plain X-ray film versus digitized images). There will then be 2 different images (plain film vs digitized image) of Patient X's appendix, and each image will be assigned a separate rating. A single ROI (Patient X's appendix) therefore yields 2 observations (1 from each imaging modality). When comparing the accuracy (AUC) of plain film to that of digitized imaging, we must take into account the fact that the 2 AUCs are correlated because they are based on the same sample of cases.
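One widely used nonparametric way to account for this correlation is the approach of DeLong, DeLong, and Clarke-Pearson, which estimates the covariance of the two AUCs from per-case "placement values." The Python sketch below illustrates that idea under stated assumptions: it is not a port of any program in this subdirectory, the names and ratings are hypothetical, and it assumes that element i of each modality's list refers to the same case.

```python
from math import sqrt
from statistics import NormalDist

def psi(x, y):
    """Score one (diseased, nondiseased) pair: 1 if ordered correctly, 0.5 if tied."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def placement_values(diseased, nondiseased):
    """Per-case components: one value per diseased case, one per nondiseased case."""
    v10 = [sum(psi(x, y) for y in nondiseased) / len(nondiseased) for x in diseased]
    v01 = [sum(psi(x, y) for x in diseased) / len(diseased) for y in nondiseased]
    return v10, v01

def sample_cov(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def compare_correlated_aucs(dis_a, non_a, dis_b, non_b):
    """Return (auc_a, auc_b, z, two-sided p). dis_a[i] and dis_b[i] must be the
    same diseased case on modalities A and B (likewise non_a[j] and non_b[j])."""
    m, n = len(dis_a), len(non_a)
    v10_a, v01_a = placement_values(dis_a, non_a)
    v10_b, v01_b = placement_values(dis_b, non_b)
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m

    def cov_auc(p10, p01, q10, q01):
        return sample_cov(p10, q10) / m + sample_cov(p01, q01) / n

    # Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B)
    var_diff = (cov_auc(v10_a, v01_a, v10_a, v01_a)
                + cov_auc(v10_b, v01_b, v10_b, v01_b)
                - 2.0 * cov_auc(v10_a, v01_a, v10_b, v01_b))
    if var_diff <= 0.0:
        return auc_a, auc_b, None, None  # no usable variance estimate
    z = (auc_a - auc_b) / sqrt(var_diff)
    return auc_a, auc_b, z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical ratings: the same 6 diseased and 6 nondiseased cases on two modalities
film_dis, film_non = [5, 4, 4, 3, 5, 2], [1, 2, 3, 1, 2, 4]
digital_dis, digital_non = [4, 4, 3, 3, 5, 1], [2, 2, 3, 1, 1, 4]

auc_film, auc_digital, z, p_value = compare_correlated_aucs(
    film_dis, film_non, digital_dis, digital_non)
print(auc_film, auc_digital, z, p_value)
```

Ignoring the covariance term would treat the two AUCs as if they came from independent samples, which typically overstates the variance of their difference when the same cases are read on both modalities.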
Clustered data
"Clustered data" refers to situations in which (one or more) patients have two or more "regions of interest" (ROIs), each of which contributes a separate measurement. For example, in a brain study, measurements may be obtained from the left and right hemispheres in each patient. In a mammography study, each breast image may be subdivided into 5 ROIs, and thus 10 separate ratings may be obtained for each patient. In such cases, it is important to account for intrapatient correlation between measurements obtained on ROIs within the same patient.
Of course, it is possible to have data that is both clustered and correlated. For example, in the mammography example mentioned above, if each breast is imaged in 2 modalities, then there are 10 ROIs per patient, and 2 ratings per ROI. A comparison of the accuracies (AUCs) between the two modalities will have to take into account (a) the fact that the 10 ratings from each patient (for each modality) are correlated, and (b) the fact that the 2 ratings from the 2 modalities (for each ROI) are correlated.
One reader versus multiple readers
The preceding discussion assumes either (a) that the underlying variable is a measurement or rating obtained from only a single source (i.e., one reader) or (b) that AUCs will only be reported *separately* for each reader. If it is desired that the AUCs of multiple readers (e.g., 3 doctors independently assigning ratings) be averaged in order to arrive at one overall "average AUC" per modality, then special methods must be used to handle inter-rater correlation (correlation between the ratings of different readers due to the fact that the same set of cases is rated by each reader).
Partial ROC area
In some cases, rather than looking at the area under the entire ROC curve, it is more helpful to look at the area under only a portion of the curve, for example, within a certain range of false-positive (FP) rates (i.e., restricted to a portion of the X-axis). Often, interest does not lie in the entire range of FP rates, and consequently, only part of the area under the curve is relevant. For example, if we know ahead of time that a particular diagnostic test would not be useful if its FP rate is greater than 0.25, we might want to restrict our attention to that portion of the ROC curve where FP rates are less than or equal to 0.25.
Another possible reason to analyze the partial AUC rather than the entire AUC was discussed by Dwyer (Radiology 1997;202:621-625):
"A major drawback to the area below the entire ROC plot as an index of performance is its global nature. Its summarization of the entire ROC plot fails to consider the plot as a composite of different segments with different diagnostic implications. ROC plots that cross may have similar total areas but differ in their diagnostic efficacy in specific diagnostic situations. Prominent differences between ROC plots in specific regions may be muted or reversed when the total area is considered. Moreover, plots with different total areas may be similar in specific regions...A solution to the problems of a global assessment of the entire ROC plot is the assessment of specific regions...on the ROC plot."
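To make the idea of a partial area concrete, the Python sketch below computes the trapezoidal area under an empirical ROC curve restricted to FP rates at or below a cutoff (0.25 here), interpolating linearly where the cutoff falls between two ROC points. This is a simple illustration with hypothetical names and ratings. It is neither the parametric approach nor a published nonparametric estimator, and it provides no standard error.

```python
def roc_points(diseased, nondiseased):
    """Empirical ROC points (FPR, TPR), from the strictest threshold to the
    most lenient; a case is positive when its score is >= the threshold."""
    thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(d >= t for d in diseased) / len(diseased)
        fpr = sum(n >= t for n in nondiseased) / len(nondiseased)
        points.append((fpr, tpr))
    return points

def partial_auc(points, max_fpr):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break  # the rest of the curve lies beyond the cutoff
        if x1 > max_fpr:
            # Clip the segment at the cutoff by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical 5-point ratings
diseased_ratings = [5, 4, 4, 3, 5, 2]
nondiseased_ratings = [1, 2, 3, 1, 2, 4]

points = roc_points(diseased_ratings, nondiseased_ratings)
pauc = partial_auc(points, 0.25)
print(round(pauc, 4))
```

Note that the largest possible value of this partial area is the cutoff itself (0.25 in this example), and the chance diagonal gives 0.25 * 0.25 / 2 = 0.03125, so a partial area should be judged against those bounds rather than against 0.50 and 1.0.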
Which program should you use?
For ROC sample size calculations:
ROCPOWER.SAS (1 reader; 1 or 2 ROC curves)
(for help, look at ROCPOWER_HELP.TXT and/or ROCPOWER_DOC.WPD)
DESIGNROC.FOR (for partial ROC area or SENS at fixed FPR)
(for help, look at DESIGNROC_HELP.TXT)
MULTIREADER_POWER.SAS (if multiple readers)
For plotting ROC curves using S-PLUS:
rocPlot.s (written by Hemant Ishwaran, PhD, Cleveland Clinic) (for this program, please refer to: http://www.bio.ri.ccf.org/Resume/Pages/Ishwaran/rocPlot.s)
For plotting ROC curves using SAS/Graph:
CREATE_ROC.SAS (for help, look at CREATEROC_HELP.TXT)
For inference on *partial* ROC area:
PARTAREA.FOR (for help, look at PARTAREA_HELP.TXT)
PARTAREA.FOR uses a parametric approach to estimate partial AUC. Margaret Pepe has developed Stata software to implement a nonparametric method of estimating partial (or full) AUC. This program can be obtained from:
http://www.fhcrc.org/labs/pepe/book/
(Click on the Programs link and download: aucbs.ado and aucbs.hlp)
Reference: Dodd LE, Pepe MS. Partial AUC estimation and regression. Biometrics. 59(3):614-23, 2003 Sep.
For inference on the area(s) under one or more ROC curves:
(SINGLE READER; 1-2 "MODALITIES")
Is data clustered?
If Y ===> Correlated data?
(For rating data) Would you prefer a parametric or nonparametric method?
If parametric: Is data correlated?
    Y ===> CORROC2.F
    N ===> ROCFIT.F (for 1 curve) or INDROC.F (for 2 curves)
If nonparametric: Is data correlated?
    Y ===> DELONG.FOR
    N ===> DELONG.FOR
NOTE: I have also included in this subdirectory a SAS program, ROC_MASTERPIECE.SAS, which computes nonparametric estimates of ROC area for 2 or more (possibly) correlated modalities and performs a global chi-square test comparing all the AUCs.
(For continuous data) Would you prefer a parametric or nonparametric method?
If parametric: Is data correlated?
    Y ===> CLABROC
    N ===> LABROC4
If nonparametric: Is data correlated?
    Y ===> DELONG.FOR
    N ===> DELONG.FOR
NOTE: I have also included in this subdirectory a SAS program, ROC_MASTERPIECE.SAS, which computes nonparametric estimates of ROC area for 2 or more (possibly) correlated modalities and performs a global chi-square test comparing all the AUCs.
(for rating or continuous data) ===> OBUMRM2.FOR
This is a FORTRAN program written by Nancy Obuchowski which implements the Obuchowski-Rockette method for analyzing multireader and multimodality ROC data. The program produces nonparametric estimates of ROC area, but allows for the comparison of user-input parametric AUCs, partial AUCs, or sensitivities at a fixed FPR (see "Format B" for examples).
Note: the assumption that there can only be one test result per patient (i.e., no clustered data) is needed only for "Format A." For "Format B," one could first use software that handles clustered data to get estimates of the ROC areas and their variance-covariance matrix, and then these estimates could be entered into OBUMRM2 via "Format B."
(for rating or continuous data) ===> MULTIVARIATEROC.S
(for rating or continuous data) ===> LABMRMC
"In general, a degenerate data set involves empty cells in the data matrix that represents the outcome of an ROC experiment that employs a discrete (e.g., five-category) confidence-rating scale, and particular patterns of these empty data cells inherently cause iterative maximization procedures...to fail to converge... Dorfman and Berbaum (see below) developed an approach and a corresponding computer program, RSCORE4, that is able to estimate ROC curves for degenerate data sets. Their approach is still based on the conventional binormal ROC model, but it eliminates degeneracy by assigning small positive values to all empty cells. ... We have proposed an alternative approach to the problem of data degeneracy that employs a "proper" binormal model, and we have developed a corresponding computer program, PROPROC, for maximum-likelihood estimation of proper binormal ROC curves."
Lerner Research Institute
Cleveland Clinic, Mail Code NB21
9500 Euclid Avenue
Cleveland, Ohio 44195