ROC Analysis

ROC methodology is appropriate in situations where there are 2 possible "truth states" (i.e., diseased/normal, event/non-event, or some other binary outcome), "truth" is known for each case, and "truth" is determined independently of the diagnostic tests / predictor variables / etc. under study.

In this subdirectory, you will find a number of programs (mostly in FORTRAN) used in ROC analysis. They are briefly described below, along with general guidelines to help you decide which program is most appropriate for your data. First, some basic terminology to help you make your decision (note that "disease" can be replaced with "condition" or "event"):

Rating data vs Continuous data

The term "rating data" is used to describe data based on an ordinal scale. For example, it is common in radiology studies to use a 5-point scale such as 1=disease definitely absent, 2=disease probably absent, 3=disease possibly present, 4=disease probably present, 5=disease definitely present. "Continuous data" refers to either truly continuous measurements or "percent confidence" scores (0-100).

Interpreting the Area Under the ROC Curve (AUC)

The area under the ROC curve (AUC) is commonly used as a summary measure of diagnostic accuracy. It can take values from 0.0 to 1.0. The AUC can be interpreted as the probability that a randomly selected diseased case (or "event") will be regarded with greater suspicion (in terms of its rating or continuous measurement) than a randomly selected nondiseased case (or "non-event"). So, for example, in a study involving rating data, an AUC of 0.84 implies that there is an 84% likelihood that a randomly selected diseased case will receive a more-suspicious (higher) rating than a randomly selected nondiseased case. Note that an AUC of 0.50 means that the diagnostic accuracy in question is equivalent to that which would be obtained by flipping a coin (i.e., random chance). It is possible but not common to run into AUCs less than 0.50. It is often informative to report a 95% confidence interval for a single AUC in order to determine whether the lower endpoint is > 0.50 (i.e., whether the diagnostic accuracy in question is, with some certainty, any better than random chance).

Designing an ROC study: Which scale to use?

While ordinal (1-5) rating scales are probably the most widely used in radiology studies, there are advantages to using "percent confidence" (0-100) scales. (Of course, if you are dealing with a continuous measurement, you don't have to worry about which scale to use.) For continuous data, nonparametric methods are quite reasonable. With rating data, parametric methods are recommended, as nonparametric methods will be biased (i.e., tend to underestimate the true AUC). The standard error of the estimated area under the ROC curve is smaller using a continuous scale.

Parametric vs Nonparametric methodology

"Parametric" methodology refers to inference (MLEs) based on the bivariate normal distribution (i.e., this estimate assumes one normal distribution for cases with the disease and one normal distribution for cases without, or that the data has been monotonically transformed to normal). When this assumption is true, the MLE is unbiased.

"Nonparametric" refers to inference based on the trapezoidal rule (which is equal to the Wilcoxon estimate of the area under the ROC curve, which in turn is equal to the "c"-statistic in SAS PROC LOGISTIC output). Nonparametric estimates of the area under the ROC curve (AUC) tend to underestimate the "smooth curve" area (i.e., parametric estimates), but this bias is negligible for continuous data.


For rating data, try a parametric method first. The bias inherent in the nonparametric method might be problematic. If the data are sparse (i.e., nondiseased patients and diseased patients tend to be rated at opposite ends of the scale), then parametric methods may not work well. Using the nonparametric approach is an option in these cases, but may provide even more biased results than it normally would.

For continuous data, either the parametric or nonparametric approach is fine.

Correlated data

"Correlated data" refers to multiple observations obtained from the same "region of interest" (ROI). For example, in a study of appendicitis screening, each patient may be imaged by two different "modalities" (i.e., plain X-ray film versus digitized images). So, then, there will be 2 different images (plain film vs digitized image) of Patient X's appendix, and each image will be assigned a separate rating. Therefore, a single ROI (Patient X's appendix) yields 2 observations (1 from each imaging modality). When comparing the accuracy (AUC) of plain film to that of digitized imaging, we must take into account the fact that the 2 AUCs are correlated because they are based on the same sample of cases.

Clustered data

"Clustered data" refers to situations in which (one or more) patients have two or more "regions of interest" (ROIs), each of which contributes a separate measurement. For example, in a brain study, measurements may be obtained from the left and right hemispheres in each patient. In a mammography study, each breast image may be subdivided into 5 ROIs, and thus 10 separate ratings may be obtained for each patient. In such cases, it is important to account for intrapatient correlation between measurements obtained on ROIs within the same patient.

Of course, it is possible to have data that is both clustered and correlated. For example, in the mammography example mentioned above, if each breast is imaged in 2 modalities, then there are 10 ROIs per patient, and 2 ratings per ROI. A comparison of the accuracies (AUCs) between the two modalities will have to take into account (a) the fact that the 10 ratings from each patient (for each modality) are correlated, and (b) the fact that the 2 ratings from the 2 modalities (for each ROI) are correlated.

One reader versus multiple readers

The preceding discussion assumes either (a) that the underlying variable is a measurement or rating obtained from only a single source (i.e., one reader) or (b) that AUCs will only be reported *separately* for each reader. If it is desired that the AUCs of multiple readers (i.e., 3 doctors independently assigning ratings) be averaged in order to arrive at one overall "average AUC" per modality, then special methods must be used to handle inter-rater correlation (correlation between the ratings of different readers due to the fact that the same set of cases is rated by each reader).

Partial ROC area

In some cases, rather than looking at the area under the entire ROC curve, it is more helpful to look at the area under only a portion of the curve- for example, within a certain range of false-positive (FP) rates (i.e., restricted to a portion of the X-axis). Often, interest does not lie in the entire range of FP rates, and consequently, only part of the area under the curve is relevant. For example, if we know ahead of time that a particular diagnostic test would not be useful if its FP rate is greater than 0.25, we might want to restrict our attention to that portion of the ROC curve where FP rates are less than or equal to 0.25.

Another possible reason to analyze the partial AUC rather than the entire AUC was discussed by Dwyer (Radiology 1997;202:621-625):

"A major drawback to the area below the entire ROC plot as an index of performance is its global nature. Its summarization of the entire ROC plot fails to consider the plot as a composite of different segments with different diagnostic implications. ROC plots that cross may have similar total areas but differ in their diagnostic efficacy in specific diagnostic situations. Prominent differences between ROC plots in specific regions may be muted or reversed when the total area is considered. Moreover, plots with different total areas may be similar in specific regions...A solution to the problems of a global assessment of the entire ROC plot is the assessment of specific regions...on the ROC plot."

Which program should you use?

  1. For ROC sample size calculations:
    ROCPOWER.SAS (1 reader; 1 or 2 ROC curves)
    (for help, look at ROCPOWER_HELP.TXT and/or ROCPOWER_DOC.WPD)

    DESIGNROC.FOR (for partial ROC area or SENS at fixed FPR)
    (for help, look at DESIGNROC_HELP.TXT)

    MULTIREADER_POWER.SAS (if multiple readers)

  2. For plotting ROC curves using S-PLUS:
    rocPlot.s (written by Hemant Ishwaran, PhD, Cleveland Clinic) (for this program, please refer to: rocPlot.s)

    For plotting ROC curves using SAS/Graph:
    CREATE_ROC.SAS (for help, look at CREATEROC_HELP.TXT)

  3. For inference on *partial* ROC area:
    PARTAREA.FOR (for help, look at PARTAREA_HELP.TXT)

    PARTAREA.FOR uses a parametric approach to estimate partial AUC. Margaret Pepe has developed Stata software to implement a nonparametric method of estimating partial (or full) AUC. This program can be obtained from:

    (Click on the Programs link and download: aucbs.ado and aucbs.hlp)

    Reference: Dodd LE, Pepe MS. Partial AUC estimation and regression. Biometrics. 59(3):614-23, 2003 Sep.

  4. For inference on the area(s) under one or more ROC curves:

    1. Is data clustered?

      If Y ===> Correlated data?

      If N ===> If rating data, skip to (B)
      If continuous data, skip to (C)
    2. (For rating data) Would you prefer a parametric or nonparametric method?

      If parametric: Is data correlated?
      Y ===> CORROC2.F
      N ===> ROCFIT.F (for 1 curve) or INDROC.F (for 2 curves)

      If nonparametric: Is data correlated?
      Y ===> DELONG.FOR
      N ===> DELONG.FOR

      NOTE: I have also included in this subdirectory a SAS program, ROC_MASTERPIECE.SAS, which computes nonparametric estimates of ROC area for 2 or more (possibly) correlated modalities and performs a global chi-square test comparing all the AUCs.

    3. (For continuous data) Would you prefer a parametric or nonparametric method?

      If parametric: Is data correlated?
      Y ===> CLABROC
      N ===> LABROC4

      If nonparametric: Is data correlated?
      Y ===> DELONG.FOR
      N ===> DELONG.FOR

      NOTE: I have also included in this subdirectory a SAS program, ROC_MASTERPIECE.SAS, which computes nonparametric estimates of ROC area for 2 or more (possibly) correlated modalities and performs a global chi-square test comparing all the AUCs.

  1. For inference on the area(s) under one or more ROC curves:

    (for rating or continuous data) ===> OBUMRM2.FOR

      (for help, see readme.txt; also see 3 sets of example input and output files: FORMACONT.DAT, FORMACONT_OUT.DAT, FORMAORDIN.DAT, FORMAORDIN_OUT.DAT, FORMB.DAT, and FORMB_OUT.DAT)

      This is a FORTRAN program written by Nancy Obuchowski which implements the Obuchowski-Rockette method for analyzing multireader and multimodality ROC data. The program produces nonparametric estimates of ROC area, but allows for the comparison of user-input parametric AUCs, partial AUCs, or sensitivities at a fixed FPR (see "Format B" for examples).

      Note: the assumption that there can only be one test result per patient (i.e., no clustered data) is needed only for "Format A." For "Format B," one could first use software that handles clustered data to get estimates of the ROC areas and their variance-covariance matrix, and then these estimates could be entered into OBUMRM2 via "Format B."

    (for rating or continuous data) ===> MULTIVARIATEROC.S

      This is an S-PLUS program written by Hemant Ishwaran which uses the nonparametric method of Delong et al (Biometrics, 1988) and allows for multiple observations per patient (i.e. ratings from multiple readers). It does not allow for multiple readers *and* multiple modalities. This program does not handle missing data. This program does not come up with an overall average ROC area among readers; it does provide a global chi-square test comparing the ROC areas of the readers and allows for linear contrasts of selected readers (i.e., '1, 2, and 3 vs 4 and 5'; '4 vs 5'). To access this program, go to: multivariateRoc.s

    (for rating or continuous data) ===> LABMRMC

      This program uses the Dorfman-Berbaum-Metz (DBM) algorithm to compare multiple readers and multiple treatments. The program uses jackknifing and ANOVA techniques to test the statistical significance of the differences between treatments and between readers. LABMRMC can be downloaded from the website of the Kurt Rossman Laboratories at the Univ. of Chicago at:
      LABMRMC is suitable for use with subjective probability-rating scales (0-100); outcomes for each reader-treatment combination are categorized in an attempt to produce an appropriate spread of operating points on each ROC curve, followed by application of the DBM procedure. (Note that all of Dr. Metz's ROC software is available in three different formats: IBM-compatible, Mac-compatible, and generic text-only source code for UNIX/VMS environments.) (Documentation for LABMRMC can be obtained from Dr. Metz's website or in this directory in the file LABMRMC.DOC)

    (for rating data) ===> MRMC

      The MRMC (Dorfman, Berbaum, Abu-Dagga, and Schartz, software implements the procedure for discrete rating data (up to 20 rating categories) as described in Dorfman, Berbaum, and Metz (1992). This program uses the Dorfman-Berbaum-Metz algorithm to compare multiple readers and multiple treatments.

LINKS TO ADDITIONAL ROC SOFTWARE (degenerate data; Bayesian ROC methods; LROC analysis; verification bias; etc.)

    ===> the ROC site of the Kurt Rossman Laboratories at the Univ. of Chicago; download ROCKIT, LABMRMC, PlotROC.xls, ROCFIT, LABROC1, CORROC2, CLABROC, INDROC, ROCPWRPC)
  2. Pan and Metz have developed a program, PROPROC, for "hooked" data (i.e., the ROC curve crosses below the diagonal "chance line") and/or degenerate data. They offer the following definition (Pan X and Metz CE. The "proper" binormal model: Parametric ROC curve estimation with degenerate data. Academic Radiology 1997;4:380-389):

      "ROC data sets are said to be 'degenerate' when they can be fit exactly by a conventional binormal ROC curve that consists of only horizontal and vertical line segments. Data degeneracy can be quite common in circumstances where data sets include only a small number of cases and/or the data are obtained on a discrete ordinal scale in which the categories are few and poorly distributed...

      In general, a degenerate data set involves empty cells in the data matrix that represents the outcome of an ROC experiment that employs a discrete (e.g., five-category) confidence-rating scale, and particular patterns of these empty data cells inherently cause iterative maximization fail to converge... . Dorfman and Berbaum (see below) developed an approach and a corresponding computer program, RSCORE4, that is able to estimate ROC curves for degenerate data sets. Their approach is still based on the conventional binormal ROC model, but it eliminates degeneracy by assigning small positive values to all empty cells. ... We have proposed an alternative approach to the problem of data degeneracy that employs a "proper" binormal model, and we have developed a corresponding computer program, PROPROC, for maximum-likelihood estimation of proper binormal ROC curves."

    PROPROC is a parametric approach, developed for discrete categorical data but also applicable to continuous data. It is a zip file located in: There is no "generic" version of the code. To obtain a UNIX-compatible version, rewrite the file IOFILE.F so that it does not use the WINDOWS API.
    ===> Kevin Berbaum's ROC software site; download MRMC; includes RSCORE for degenerate rating data (see "RSCORE 4.66 User's Manual.pdf" in this directory, or go to Berbaum's site for further information); also includes BIGAMMA for degenerate rating data (both RSCORE and BIGAMMA are parametric and have the same data-input format; Berbaum recommends trying RSCORE before BIGAMMA, and if the ROC curves don't make sense or are difficult to interpret- i.e., cross the "chance line"- then try BIGAMMA) (also see PROPROC, above, for degenerate data)
    ===> Univ. of Arizona Radiology Department website; download most of the Metz and Dorfman/Berbaum software discussed previously; also download Richard Swensson's LROC, which takes target location into account in analyzing ROC data
  5. Those interested in a Bayesian approach to ROC analysis are directed to: This directory contains MATLAB software and an example ("Example 2") from Chapter 5 of Ordinal Data Modeling, by Johnson and Albert, Springer (NY), 2001
  6. Andrew Zhou has developed software to deal with verification bias in ROC analyses. Andrew's COMPROC program is not currently available online, but his email address (as of June 2002) for those interested in this software is:

    Verification bias is a phenomenon that occurs when inclusion in an ROC study depends on the availability of a gold standard which is only obtained on a particular nonrandom subset of patients in the study population (i.e., only women with "suspicious" mammograms get biopsied, and biopsy is the gold standard).
  7. Nancy Obuchowski of the Cleveland Clinic has developed FORTRAN software to provide an ROC-type summary measure of accuracy when the gold standard is ordinal or continuous, rather than dichotomous. (See: Obuchowski NA. An ROC-Type Measure of Accuracy When the Gold Standard is Rank- or Continuous- Scale. Submitted for publication.)

    Artificially dichotomizing ordinal or continuous gold standards in order to fit them into a traditional ROC analysis leads to bias and inconsistency in estimates of diagnostic accuracy. Obuchowski provides nonparametric estimators of accuracy which have interpretations analogous to the AUC, and permit the use of ordinal ( ROC_ORDGOLD.for) or continuous (ROC_CONTGS.for) gold standards.

  8. Two ROC-analysis software packages that may be of particular interest to clinical chemists and persons in related fields are MedCalc ( and EP Evaluator (developed by David Rhoads Associates, Both packages are available by subscription only (for a fee).

    EP Evaluator has a particularly useful functionality in which the user has the ability to move a cursor along the ROC curve and view sensitivity and specificity at each cutoff value.