At a glance
- Confidence intervals for percentiles in survey data are typically calculated using large-sample normal approximations.
- However, this approach can perform poorly when the data are not normally distributed.
- The National Exposure Report uses a method that produces asymmetric confidence intervals consistent with skewed biologic data distributions.
Overview
The method described here includes procedures for percentiles whose estimate falls on a value that is repeated multiple times in the dataset.
A common practice for calculating confidence intervals from survey data is to use large-sample normal approximations. Ninety-five percent confidence intervals on point estimates of percentiles are often computed by adding and subtracting from the point estimate a quantity equal to twice its standard error. This normal approximation method may not be adequate, however, when estimating the proportion of subjects above or below a selected value, especially when the proportion is near 0.0 or 1.0 or when the effective sample size is small. In addition, confidence intervals on proportions deviating from 0.5 are not theoretically expected to be symmetric around the point estimate. Further, adding a multiple of the standard error to, or subtracting it from, an estimate near 0.0 or 1.0 can lead to impossible confidence limits (i.e., proportion estimates below 0.0 or above 1.0). The approach used for the Report data tables (and for previous Reports) produces asymmetric confidence intervals consistent with skewed (non-normal) biologic data distributions.
The method we use to estimate percentiles and their confidence limits for the Report data tables and for previous reports is adapted from a method proposed by Woodruff (1952) for percentile estimation and a method described by Korn and Graubard (1998) for estimating confidence intervals for proportions. This method involves first obtaining an empirical point estimate of the desired percentile by creating a rank-ordered listing of the sampled observations along with their sampling weights. From this listing and the sampling weights, it is possible to determine an empirical percentile estimate for the target population. After this point estimate of the percentile has been obtained, the fraction of results below the estimate is calculated. The fraction below the point estimate should be very close to the proportion corresponding to the desired percentile, but it can deviate from that proportion depending on the frequency of non-unique sampled observations in the vicinity of the empirical percentile estimate and on the sampling weights associated with those observations.
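To illustrate the idea, a minimal SAS sketch of the rank-ordered, weight-based percentile calculation is shown below. This is only a conceptual illustration (the estimates in the steps that follow are obtained with SAS Proc Univariate and SUDAAN), and the dataset and variable names (DOMAIN_DATA, ANALYTE, WT_SUB) are placeholders.

```sas
/* Conceptual sketch: take the first value whose cumulative weight share  */
/* reaches the target proportion (here 0.95); all names are placeholders. */
proc sort data=domain_data;
  by analyte;
run;

proc means data=domain_data noprint;
  var wt_sub;
  output out=totwt(keep=total_wt) sum=total_wt;
run;

data _null_;
  if _n_ = 1 then set totwt;            /* total of the sampling weights   */
  set domain_data;
  cumwt + wt_sub;                        /* running sum of sampling weights */
  if cumwt / total_wt >= 0.95 then do;   /* empirical 95th percentile       */
    put "Empirical 95th percentile estimate: " analyte;
    stop;
  end;
run;
```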
For example, when measuring some compounds as part of NHANES there may be multiple results below a common limit of detection (LOD) or multiple results with identical measured values due to the reporting limitations of the instrument. This phenomenon, coupled with the sampling weight assigned to each measured result, can lead to difficulties in accurately estimating some percentiles and their corresponding confidence limits because an exact percentile may fall within a large group of results with identical measured values. We circumvented this potential problem for the Report data tables by adding a unique, negligibly small number to each measured result. This small number was later subtracted from the percentile estimate without affecting the percentile estimate and without altering any of the original measured results.
By adding a unique, negligibly small number to each sampled observation, it was possible to associate a single sampled observation with the percentile estimate and thus to minimize the difference between the fraction below the point estimate and the proportion corresponding to the desired percentile. However, due to sample weighting, it is still possible to obtain a different point estimate (which will only differ by the difference between numerically adjacent analyte values) depending on how the data are sorted before adding the unique number to each result. We circumvented this potential problem by replacing actual sample weights with an average sample weight where the average is computed across subjects in the same demographic domain who have identical measured results. We computed standard error estimates in a separate step using the original (unaltered) data. Clopper-Pearson 95% confidence intervals around the estimated proportion are obtained using the method described by Korn and Graubard (1998).
We describe below how SAS Proc Univariate and SUDAAN can be used to carry out this method of percentile and confidence interval estimation. SAS code for calculating these confidence intervals can be downloaded from SAS Code Example. In the narrative that follows, the term 'demographic domain' refers to a demographic group of interest, for example non-Hispanic Blacks. The term 'set of subsample weights' refers to the sampling weights that correspond to the variable for which percentiles will be estimated, for example the set of subsample weights associated with total blood mercury measurements. The term 'analyte' refers to the biological or chemical compound measured in a group of subjects and for which percentiles are to be estimated.
Calculating percentile estimates and confidence intervals
Step 1: Obtain a percentile estimate using the original (unaltered) data:
Create a separate file with original data (ORIG_DATA). Use SAS (SAS Institute Inc. 1999) Proc Univariate (with default percentile definition equivalent to option PCTLDEF = 5 and with the Freq option variable equal to the original subsample sampling weight) to obtain a point estimate of the percentile (PTLE_ORIG) of an analyte's original results for the demographic domain of interest, for example, the 95th percentile of total blood mercury results for adults aged 20+ years.
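A minimal SAS sketch of Step 1 is shown below; the dataset and variable names (ORIG_DATA, ANALYTE, WT_SUB) are placeholders for the actual subsample file, analyte variable, and subsample sampling weight.

```sas
/* Step 1 sketch: weighted point estimate of the 95th percentile from the */
/* original (unaltered) data; names are placeholders.                     */
proc univariate data=orig_data pctldef=5 noprint;
  var analyte;                       /* original measured results          */
  freq wt_sub;                       /* subsample sampling weight as Freq  */
  output out=ptle_orig pctlpts=95 pctlpre=PCT;   /* PTLE_ORIG = PCT95      */
run;
```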
Step 2: Obtain a percentile estimate using the altered data:
Create a separate file for use with altered data (ALTR_DATA). Sort the data by analyte measured value separately for each demographic domain and set of subsample weights. Use SAS Proc Means to compute the average sampling weight (WTAVE) for each unique measured result. For each unique measured result, use a counter from 1 to the total number of subjects with identical values to create a unique integer to associate with each measured observation. Each of these numbers should then be divided by 1,000,000,000 and added to the corresponding measured observation. This will result in each measured observation having an additional fractional amount beyond the fourth decimal place as long as there are fewer than 10,000 subjects with the same measured result. Use SAS Proc Univariate (again with default percentile definition equivalent to option PCTLDEF = 5 but now with the Freq option variable equal to WTAVE) to obtain a point estimate of the percentile (PTLE_ALTR) of an analyte's altered results for the demographic domain of interest.
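A minimal SAS sketch of Step 2, assuming the data have already been restricted to one demographic domain; ANALYTE and WT_SUB are the same placeholder names used above.

```sas
/* Step 2 sketch: average the weights within identical measured values,   */
/* then add a unique, negligibly small offset to each observation.        */
proc sort data=altr_data;
  by analyte;
run;

proc means data=altr_data noprint;
  by analyte;
  var wt_sub;
  output out=wtmeans(keep=analyte wtave) mean=wtave;
run;

data altr_data;
  merge altr_data wtmeans;
  by analyte;
  if first.analyte then cnt = 0;       /* counter restarts for each value   */
  cnt + 1;
  analyte_altr = analyte + cnt / 1e9;  /* unique, negligibly small addition */
run;

proc univariate data=altr_data pctldef=5 noprint;
  var analyte_altr;
  freq wtave;                          /* average weight as Freq variable   */
  output out=ptle_altr pctlpts=95 pctlpre=PCT;   /* PTLE_ALTR = PCT95       */
run;
```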
Step 3: Sort the data in the ORIG_DATA file by the stratum (sdmvstra) and primary sampling unit (sdmvpsu) variables. Use SUDAAN (SUDAAN User's Manual, 2001) Proc Descript with Taylor linearization DESIGN = WR (i.e., sampling with replacement), with the proper NEST statement, and with the original subsample sampling weight variable to estimate the proportion (P_ORIG) of subjects with results below the percentile estimate (PTLE_ORIG) obtained in Step 1. Discard P_ORIG but retain its standard error (SEMEAN_ORIG).
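The Report uses SUDAAN Proc Descript for this step. As an illustrative stand-in, the sketch below uses SAS Proc Surveymeans, whose default Taylor series linearization assumes with-replacement sampling of PSUs within strata; PTLE_ORIG is supplied as a placeholder macro variable, and the other names are the placeholders used above.

```sas
/* Step 3 sketch: proportion below PTLE_ORIG and its standard error, using */
/* Proc Surveymeans as a stand-in for SUDAAN Proc Descript (DESIGN = WR).  */
%let ptle_orig = 4.88;                   /* placeholder: estimate from Step 1 */

proc sort data=orig_data;
  by sdmvstra sdmvpsu;
run;

data orig_data;
  set orig_data;
  below_orig = (analyte < &ptle_orig);   /* 0/1 indicator: below the estimate */
run;

proc surveymeans data=orig_data mean stderr;
  strata sdmvstra;                       /* design strata                      */
  cluster sdmvpsu;                       /* primary sampling units             */
  weight wt_sub;                         /* original subsample weight          */
  var below_orig;                        /* Mean = P_ORIG, StdErr = SEMEAN_ORIG */
run;
```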
Step 4: Sort the data in the ALTR_DATA file by the stratum (sdmvstra) and primary sampling unit (sdmvpsu) variables. Use SUDAAN (SUDAAN User's Manual, 2001) Proc Descript with Taylor linearization DESIGN = WR (i.e., sampling with replacement), with the proper NEST statement, and with the average sampling weight variable (WTAVE) to estimate the proportion (P_ALTR) of subjects with results below the percentile estimate (PTLE_ALTR) obtained in Step 2. Keep P_ALTR but discard its standard error (SEMEAN_ALTR). Compute the degrees-of-freedom adjusted effective sample size:
(1) ndf = ((tnum/tdenom)²) × P_ALTR(1 – P_ALTR) / (SEMEAN_ORIG²)
where tnum and tdenom are 0.975 critical values of the Student's t distribution with degrees of freedom equal to the actual sample size minus 1 and the number of primary sampling units (PSUs) minus the number of strata, respectively. Note: the degrees of freedom for tdenom can vary with the demographic domain of interest (e.g., males).
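A sketch of the equation 1 calculation using SAS's TINV function; all input values below are placeholders to be replaced with the quantities obtained in Steps 3 and 4 and with the design counts for the demographic domain of interest.

```sas
/* Equation 1 sketch: degrees-of-freedom adjusted effective sample size.  */
data ndf_calc;
  p_altr      = 0.95;                /* placeholder: proportion from Step 4 */
  semean_orig = 0.005;               /* placeholder: standard error, Step 3 */
  n_actual    = 2000;                /* placeholder: actual sample size     */
  n_psu       = 30;                  /* placeholder: number of PSUs         */
  n_strata    = 15;                  /* placeholder: number of strata       */

  tnum   = tinv(0.975, n_actual - 1);       /* t critical value, n - 1 df   */
  tdenom = tinv(0.975, n_psu - n_strata);   /* t critical value, PSUs - strata df */
  ndf    = ((tnum/tdenom)**2) * p_altr * (1 - p_altr) / (semean_orig**2);
run;
```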
Step 5: After obtaining an estimate of P_ALTR (i.e., the proportion obtained in Step 4), compute the Clopper-Pearson 95% confidence interval (PL(x,ndf), PU(x,ndf)) as follows:
(2) PL(x, ndf) = v1 × F(0.025; v1, v2) / (v2 + v1 × F(0.025; v1, v2))
and
PU(x, ndf) = v3 × F(0.975; v3, v4) / (v4 + v3 × F(0.975; v3, v4))
where x is equal to P_ALTR times ndf, v1 = 2x, v2 = 2(ndf – x + 1), v3 = 2(x + 1), v4 = 2(ndf – x), and F(β; d1, d2) is the β quantile of an F distribution with d1 and d2 degrees of freedom. (Note: If ndf is greater than the actual sample size or if P_ALTR is equal to zero, then the actual sample size should be used in place of ndf.) This step will produce a lower and an upper limit for the estimated proportion obtained in Step 4.
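A sketch of the equation 2 calculation using SAS's FINV function for the F-distribution quantiles; P_ALTR and ndf below are placeholders carried over from the previous steps.

```sas
/* Equation 2 sketch: Clopper-Pearson 95% limits on the estimated proportion. */
data cp_limits;
  p_altr = 0.95;                     /* placeholder: proportion from Step 4   */
  ndf    = 2000;                     /* placeholder: effective size, Step 4   */

  x  = p_altr * ndf;
  v1 = 2*x;        v2 = 2*(ndf - x + 1);
  v3 = 2*(x + 1);  v4 = 2*(ndf - x);

  /* FINV(p, d1, d2) returns the p quantile of an F distribution              */
  pl = v1*finv(0.025, v1, v2) / (v2 + v1*finv(0.025, v1, v2));
  pu = v3*finv(0.975, v3, v4) / (v4 + v3*finv(0.975, v3, v4));
run;
```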
Step 6: Use SAS Proc Univariate (again with default percentile definition equivalent to option PCTLDEF = 5 and with the Freq option variable equal to WTAVE) to determine the analyte values that correspond to the desired percentile (proportion) and to the lower and upper proportion limits obtained in Step 5. Round these results to 2 or 3 decimals depending on the significant figures associated with the original measured values.
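A minimal sketch of Step 6 using the same placeholder names as the Step 2 sketch; the percentile points 93.885 and 95.808 correspond to the confidence limits from the worked example below and would be replaced by 100 × PL and 100 × PU from Step 5.

```sas
/* Step 6 sketch: convert the proportion limits back to analyte values.   */
proc univariate data=altr_data pctldef=5 noprint;
  var analyte_altr;
  freq wtave;
  /* lower limit, target percentile, upper limit (as percentile points)   */
  output out=ci_values pctlpts=93.885 95 95.808 pctlpre=PCT;
run;
```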
Example:
To estimate the 95th percentile of total blood mercury in adults 20+ years of age in the 2013-2014 survey, create two separate files: name one ORIG_DATA and the other ALTR_DATA. For the ORIG_DATA file use SAS Proc Univariate with the Freq option and the subsample sampling weight (or in this case the full sample sampling weight because total blood mercury is the analyte of interest) to get a weighted point estimate of the analyte value that corresponds to the 95th percentile (PTLE_ORIG). For this example, the value is 4.88 µg/L.
Sort the results in the ALTR_DATA file by analyte measured value. Use SAS Proc Means to compute the average sampling weight (WTAVE) for each unique analyte measured value. For each unique measured result, use a counter from 1 to the total number of subjects with identical values to create a unique integer to associate with each measured observation. Divide each counter value by 1,000,000,000 and add this amount to the corresponding measured observation. For this altered data file (ALTR_DATA) use SAS Proc Univariate with the Freq option and WTAVE to get a weighted point estimate of the analyte value that corresponds to the 95th percentile (PTLE_ALTR). For this example, the value is also 4.88 µg/L.
For the ORIG_DATA file use SUDAAN to estimate the weighted proportion (P_ORIG) of subjects with results below the value of PTLE_ORIG (which can differ from 0.95 depending on the number of results with identical values; for this example the proportion is 0.9491) and the standard error (SEMEAN_ORIG) associated with P_ORIG (for this example SEMEAN_ORIG = 0.0044).
For the ALTR_DATA file use SUDAAN to estimate the weighted proportion of subjects (P_ALTR) with results below the value of PTLE_ALTR (which should also be very close to 0.95 regardless of the number of original results with identical values; for this example the proportion is also 0.9491). Then obtain a confidence interval on P_ALTR by computing the weighted Clopper-Pearson 95% confidence limits (equation 2 above) using the degrees-of-freedom adjusted effective sample size as described in equation 1 above. For this example, the effective sample size is 2113.64, resulting in lower and upper confidence limits of 0.93885 and 0.95808, respectively. Then use SAS Proc Univariate (with the Freq option variable equal to WTAVE) to determine the analyte values corresponding to the weighted 93.885th and 95.808th percentiles. These point estimates are the lower and upper confidence limits on the 95th percentile estimate. Round the 95th percentile estimate and its confidence limits to 2 or 3 decimals depending on the significant figures associated with the original measured values. For this example, the rounded point estimate is 4.88 µg/L with lower and upper confidence limits of 4.36 and 5.21 µg/L, respectively.