STATISTICAL METHODS AND SOFTWARE FOR THE
ANALYSIS OF OCCUPATIONAL EXPOSURE DATA WITH
NON-DETECTABLE VALUES
Edward L.
Frome
Computer
Science and Mathematics Division
Oak Ridge
National Laboratory
Paul F.
Wambach
U. S.
Department of Energy
Date
Published: September 2005
ABRIDGED HTML VERSION
CONTAINS: SECTIONS IN RED
Appendix Contains some errors in notation---see original PDF Version
CONTENTS
ABSTRACT ………………………………………………………………………………
1.
INTRODUCTION
…….………………………………………………….………….. 1
2. STATISTICAL ANALYSIS FOR COMPLETE SAMPLES
……………………… 2
2.1 CONFIDENCE LIMITS FOR THE MEAN EXPOSURE
LEVEL ……..…. . 3
2.2 CONFIDENCE LIMIT FOR PTH PERCENTILE
…………………………… 3
2.3 CONFIDENCE LIMITS FOR EXCEEDANCE
FRACTION…………………
4
3. ANALYSIS OF
DATA WITH NON-DETECTS …………………………………… 5
3.1 MAXIMUM LIKLIHOOD ESTIMATEION FOR LOGNORMAL
DATA
WITH NON-DETECTS
………………………………………………………….
5
3.2
CONFIDENCE LIMITS FOR
THE MEAN EXPOSURE LEVELS
WITH NON-DETECTS …………………………………………………………. 6
3.3
CONFIDENCE LIMITS FOR
THE PTH PERCENTILE WITH
NON-DETECTS ………………………………………………………………….. 7
3.4
CONFIDENCE LIMITS FOR
EXCEEDANCE FRACTIONS
WITH NON-DETECTS………………………………………………………….. 7
3.5
NON-PARAMETRIC METHODS
FOR SAMPLES WITH
NON-DETECTS………………………………………………………………….. 8
3.6
NON-PARAMETRIC UPPER
TOLERANCE LIMIT AND
EXCEEDANCE FRACTION…………………………………………………… 9
4.
APPLICATIONS……………………………………………………………………… 9
4.1 EXAMPLE 1. SURFACE WIPE SAMPLES FROM ELEVATED ……….. 10
SEMELTER SURFACES
4.2 EXAMPLE 2.
TWA BERYLLIUM EXPOSURE DATA …………………….. 12
5. DISCUSSION
…………………………………………………………………………. 15
6.
ACKNOWLEDGEMENTS ………………………………………………………….. 19
REFERENCES
……………………………………………………………………..…….. 20
APPENDIX
…………………………………………………………………….…………. 22
ABSTRACT
Environmental exposure measurements are, in general,
positive and may be subject to left censoring; i.e,. the measured value is less
than a “detection limit.” In
occupational monitoring, strategies for assessing workplace exposures typically
focus on the mean exposure level or the probability that any measurement exceeds
a limit. Parametric methods used to determine
acceptable levels of exposure, are often based on a two parameter lognormal
distribution. The mean exposure level, an upper percentile, and the exceedance
fraction are used to characterize exposure levels, and confidence limits are
used to describe the uncertainty in these estimates. Statistical methods for random samples (without non-detects) from
the lognormal distribution are well known for each of these situations. In this report, methods for estimating these
quantities based on the maximum likelihood method for randomly left censored
lognormal data are described and graphical methods are used to evaluate the
lognormal assumption. If the lognormal
model is in doubt and an alternative distribution for the exposure profile of a
similar exposure group is not available, then nonparametric methods for left
censored data are used. The mean exposure level, along with the upper
confidence limit, is obtained using the product limit estimate, and the upper
confidence limit on an upper percentile (i.e., the upper tolerance limit) is
obtained using a nonparametric approach.
All of these methods are well known but
computational complexity has limited their use in routine data analysis with
left censored data. The recent
development of the R environment for statistical data analysis and graphics has
greatly enhanced the availability of high-quality nonproprietary (open source)
software that serves as the basis for implementing the methods in this
paper. Numerical examples are provided
and R(2004) functions are available at the analysis of occupational exposure
data web site http://www.csm.ornl.gov/esh/aoed/
(AOED).
INTRODUCTION
Regulatory and advisory criteria for evaluating the
adequacy of occupational exposure controls are generally expressed as limits
that are not to be exceeded in a work shift or shorter time-period if the agent
is acutely hazardous. Exposure
monitoring results above the limit require minimal interpretation and should
trigger immediate corrective action. Demonstrating compliance with a limit is
more difficult.
The American Industrial Hygiene Association (AIHA)
has published a consensus standard with two basic strategies for evaluating an
exposure profile [Mulhausen and Damiano (1998)]. The first approach is based on the mean of the exposure
distribution, and the second approach considers the “upper tail” of the
exposure profile. Statistical methods
for estimating the mean, an upper percentile of the distribution, the exceedance
fraction, and the uncertainty in each of these parameters are described. Most of the AIHA methods are based on the
assumptions that the exposure data does not contain non-detects, and that a
lognormal distribution can be used to describe the data. Exposure monitoring
results from a compliant workplace tend to contain a high percentage of
non-detected results when the detection limit is close to the exposure limit,
and in some situations, the lognormal assumption may not be reasonable. There are parametric methods for censored
lognormal data and non-parametric methods that can be used with left censored
data to calculate all of the statistics recommended by the AIHA for the
complete data case. However, the only practical way to compute these statistics
is with statistical software. The
recent availability of free, high-quality statistical software means that
complex calculations are no longer a barrier to the statistical analysis of
almost any occupational exposure data set.
This also eliminates the need for special tables and graphical methods
that are used in the complete data case for the lognormal distribution.
Statistical methods for the analysis of right
censored data using various parametric and non-parametric methods are well
known and generally referred to as “survival analysis” [Cox and Oakes (1984) or
Kabfleish and Prentice (1980)]. In this
situation, the dependent or response variable (say T) is usually time to the
occurrence of event; i.e., the “survival time” (or time to failure) of an observational
or experimental unit (e.g., animal, person, or machine). T may be referred to as a “lifetime random
variable” and is by definition positive, and may be subject to “censoring.” As a typical example, let Ti
represent the survival time of the ith
patient in a clinical trial. If the
trial ends and the patient is not known to have “failed” the observed survival
time, say
, is right censored (i.e., it is only known that Ti
is greater than
). This can occur for
several reasons. Suppose, for example,
that all patients enter the trial at the same time and are followed until a
specified end date, then those individuals still at risk have a censored
survival time that is the same for all surviving patients (Type I
censoring). If patients enter the trial
at random and the trial ends at a fixed date, then the value of
is different for each
surviving patient (random censoring).
Statistical methods for the analysis of right censored data are widely
used and computer software for survival analyses is available in most general
purpose statistical programs
(e.g., the R survival library).
In this report, the dependent or response variable
of interest is the amount, say X, of a measured quantity. X is a positive random variable and as the
result of the analytic methods used, the observed value for the ith measurement may be reported as
(left) “censored” and is referred to as a non-detect or as being less than a
“detection limit”(DL) say
(i.e., it is only
known that
is less than
). A frequent
assumption is that the distribution of X is lognormal . See Aitchison and
Brown(1969) and Crow and Shimizu(1988)
for general treatment of the lognormal distribution and its application. Schmoyer et. al. (1996) considered the
lognormal model for contaminant concentrations in environmental risk assessment
for both complete and left censored samples.
Akritas et. al. (1994) provide a detailed discussion of various methods
that have been proposed for parameter estimation for left censored data. Methods were classified into three general
areas as described by Helsel (1990).
They are: (1) simple substitution; e.g., replace a censored observation
with one-half of its value; (2) parametric methods; e.g., censored data maximum
likelihood (ML), and (3) “robust parametric methods” based on variations
of “probability plot regression.” Simulations studies have been done to
compare these methods under various conditions with the primary focus on bias
and mean square error of location and scale parameters [see Helsel and
Cohen(1988), Newman et. al. (1989)].
Akritas et. al. (1994) have reviewed these and other studies, and note
that ML methods under the lognormal model provide expressions for the variance
of the parameter estimates. This is
important when an upper percentile, say Xp , of the exposure distribution is of
interest. For example, the ML estimate
of log(Xp) is a linear combination of the lognormal parameters, and
the standard error of this quantity can be estimated from the ML parameter
covariance matrix. Consequently,
confidence limits for both Xp and the exceedance fraction can be
obtained using the ML approach. Taylor
et. al. (2001) have noted that regarding all non-detected values as censored
outcomes from a lognormal distribution may not always be appropriate. If there is reason to believe that a
non-negligible proportion of the non-detects are “true zero” exposures, then a
censored lognormal mixture model (a zero-inflated lognormal model with
censoring) should be considered. ML
methods for estimation and hypothesis testing are described and the
relationship between ML parameter estimates from the mixture model and those
based on either a left truncated or censored lognormal model are
described. Moulton and Halsey(1995)
emphasize that it is also possible that non-detects may be from a second
(possibly lognormal) distribution rather than a point mass at zero. Fowlkes (1979) has described methods for
studying the mixture of two lognormal distributions, although he did not
consider left censored data. In
situations where the lognormal model (or some other distribution such as the
gamma or Weibull) is not reasonable, a non-parametric approach can be
used. All of the methods just described
can be implemented using R, the ML method being the most difficult. It is also possible to develop procedures for
all of these methods in several proprietary statistical programs that are available
commercially.
In the discussion that follows the generic term
“acceptable” refers to the situation where the distribution of the exposure
measurement X satisfies a specified criteria indicating, for example, that the
workplace is “safe,” or that the surfaces of a survey unit are “clean.” The term survey unit describes all or part
of an entity (e.g., building, piece of equipment) that is being evaluated. The term “unacceptable” means that the
distribution of the exposure measurements indicates that the workplace or
survey unit is “contaminated” or “hazardous.”
The formal statistical procedures used to demonstrate that an exposure
distribution is acceptable is to state a null hypothesis in the form H0
: q ³ L, where q
represents a parameter of the exposure distribution (e.g., the mean, a
percentile, or the exceedance fraction), and q ³ L
indicates that the exposure distribution is unacceptable. Then, based on a random sample from the
exposure distribution, an estimate of q and
an upper confidence limit (UCL) with a specified confidence level, say g, are
calculated. If the 100g%UCL
is less than L, then the null hypothesis is rejected and the exposure
distribution is acceptable. A Type I
error occurs if H0 is rejected when it is true (i.e., the X distribution
is incorrectly considered to be acceptable).
This will occur with a probability (type I error rate) that is less than
or equal to a = (1 - g),
with a = 1 - g when q = L.
These and other related procedures are described in detail in an occupational
exposure context by Mulhausen and Damiano (1998).
4. APPLICATIONS
In several situations of practical interest
statistical analysis of left censored data from a lognormal distribution are
required. The "exact" results
for complete samples described in Section 2 have not been developed for
censored data. The methods presented here are "large sample"
results and follow directly from the properties of ML estimators described in
Section 3. Each of the examples
will describe the censored data equivalent of the exact methods used with
complete samples. The emphasis here is on describing the methods and
software. The focus in the examples is two areas of application that are
part of the Department of Energy (DOE) Chronic Beryllium Disease Prevention
Program. The DOE is concerned with
monitoring objects (e.g., equipment, buildings) for beryllium contamination and
workers for exposure to beryllium in the workplace. The first example describes the results of a survey to evaluate
possible beryllium “contamination” based on surface wipe sampling of a smelter
facility used to recycle metal. The
second example describes the results of a beryllium worker-monitoring program
using 8-hour TWA. In both situations
“limit values” have been established to determine if exposure levels are
acceptable; i.e., the object is “clean” or the workplace is “safe.” In general, the limit value will depend on
the strategy that is being used as described in the introduction. In the examples the null hypothesis of
interest is that the 95th percentile of exposure distribution does
not exceed the specified limit. Hewett
(1996) explains that occupational exposure limits (OELs) are generally single
shift limits used for day-to-day risk management that will also constrain
long-term, working lifetime exposures of each individual worker to protective
levels. OELs are based on health or
toxicology studies that establish protective mean exposure levels. A work environment that rarely exceeds the
OEL will also maintain mean levels well below the OEL. Day-to-day exposure prevention is achieved
through investigations to determine cause and corrective actions for exposure
measurements above the OEL.
Ninety-five-percent
confidence that fewer than 5% of measurements are above a specified limit is a
statistical definition of compliance that has come into widespread use to
determine the monitoring efforts needed to demonstrate compliance [see chapter
7, Mulhausen and Damiano (1998)]. In
this situation the upper tolerance limit (i.e., the 95% UCL for the 95th
percentile) and the UCL for the exceedance fraction are of primary interest
(see Section 2.3) to determine if the exposure distribution is acceptable. The exact results for samples from a
lognormal (or normal) distribution described in Section 2 and the Appendices of
Mulhausen and Damiano(1998) are based
on the assumption of complete samples; i.e., no left censored data. The statistical methods and computer software
for the analysis of left censored data described in Section 3 can be used to
calculate the censored data equivalent of all of the statistics described by
Mulhausen and Damiano (1998). Details
describing R and the R driver functions used to obtain these results are
described in the Appendix and at the AOED website. All of the R functions can be downloaded from the AOED web
site.
In the examples, results obtained using R
interactively are shown in a monospaced font like this (where “>” is the R
prompt).
To duplicate these results in the examples read the Appendix and then visit
the AOED web site and complete steps 1-4.
Note that Exhibit 1 is listed at the end of readall.R at the AOED web
site, and the data frame SESdata and character string IpSESdata
will be in the R working directory.
4.1 Example 1.
Surface Wipe Samples from Elevated Semelter Surfaces
The data in Exhibit 1 are 31 surface wipe
samples from elevated surfaces of a smelter with beryllium contamination. Exhibit 1 illustrates one method that can
be used to enter data into R in the format required for the ML estimation. This would normally be done by using a text
editor to create a file (example1.txt).
All characters on a line to the right of the # sign are comments. The data is entered into the R working
directory using the R function source;
i.e, if the file is in the directory(folder) where the R session was initiated,
source (“example1.txt”) will input the vectors x, det, and the
data.frame SESdata.
Exhibit 1 of Section 4.1
#
Illustrates one method that can be used to enter
#
data into R in the format required for ML estimates
#
#
Surface Wipe Samples from Smelter
IpSESdata <-
"SESdata: Smelter-Elevated
Surfaces "
#
IpSESdata is Character string for Use by qqlognB()
x <-
(15,15,15,25,25,40,40,40,45,50,50,70,75,95,100,125,125,
145,145,150,150,165,270,290,345,395,395,420,495,840,1140)
x <- x/1000 # wipe samples micrograms per 100
cm^2
det<- c(0,0,0,rep(1,28) ) # first three values are censored
SESdata<- data.frame(x,det) # R data
frame for mlndln()
#
ML
estimates of µ, σ, log(
) , and σ2 are obtained using :
>
mlndln(SESdata)
mu sigma logE sig2 -2Log(L) Conver
mle -2.2907643 1.2760000 -1.4766777 1.6281796
-12.852885390 0
semle 0.2311395 0.1754489 0.3137301 0.4477474 -0.002005525 28
The R function mlndln is described in the
Appendix and is available at the AOED website. The data in Exhibit 1 are shown graphically in Fiqure 1. ML estimates of µ, and σ are shown in
title of the plot. To obtain Figure 1,
use the following at the R prompt:
>qqlognB(SESdata,IpSESdata,L=0.2,unit
= “mug/100 cm^2”,p=0.95,gam=0.95)
If equipment is being evaluated for release to the
public or for non beryllium use the DOE has established a release limit for
removable beryllium contamination of L95
= 0.2 µg/100cm2. The ML estimate of the

Figure 1. Results for Surface Wipe Samples in Example
1.
exceedance fraction (see Section 2.3), 95% LCL, and 95% UCL are obtained using R function efclnd based on the method described in Section 3.4;
>
efclnd(SESdata,gam=0.95,L=0.2)
f_MLE LCL_0.95 UCL_0.95
29.66864
19.45963 41.80762
The nonparametric estimates of the exceedance fraction, 95% LCL, and 95% UCL are obtained using efclnp based on the method described in Section 3.5:
>
efclnp(SESdata,gam=0.95,L=0.2)
fnp
LCL_95 UCL_95
29.03226
16.06111 45.19044.
The exceedance fraction is an estimate of the percentage
of surface area that is expected to exceed the release limit Lp =
0.2 µg/100cm2 with p =
0.95. Both the point
estimate and the UCL for F exceed Fo=100( 1-p)= 5%, indicating that the equipment is not
acceptable. In fact, the 95% LCL
indicates that at the 95% confidence level at least 19.5% of the surface area
exceeds the release limit. These
lognormal based and nonparametric estimates of the exceedance fraction and 95%
CLs are shown in the lower right of Figure 1 along with the lognormal based
estimate of the 95th percentile Xp = 0.825, the lower 95% CL = LX(0.95,0.95) = 0.446, and
upper 95% CL= UX(0.95,0.95)= 1.526 (see
Section 3.3). The GM, GSD, ML estimates
of the (arithmetic) mean of X (with
confidence limits) based on the lognormal model, the distribution free
Kaplan-Meier mean (with confidence limits), the percent non-detects, the sample
size (n), the number of detects (m), R2 (as defined in Section 3.5), and the z value
for the limit Lp are in the upper left corner of Figure 1.
> allss(dd=SESdata,L=0.2,p=0.95,gam=0.95)
sstat
Sec
mu -2.291 ML
estimate of mean of y=log(x) Sec
3.1
se.mu 0.231 Estimate
of standard error of mu Sec 3.1
...
Fnp_0.2 29.030
Nonparmetric estimate of F for limit L Sec 3.6
FnLCL_95 16.060
Nonparmetric estimate of LCL for F Sec 3.6
FnUCL_95 45.190
Nonparmetric estimate of UCL for F Sec 3.6
>
4.2
Example 2. TWA Beryllium
Exposure Data
As part of a chronic disease prevention program the
DOE adopted an 8-hour TWA OEL limit value of 0.2 micrograms per cubic meter
proposed by the American Conference of Government Industrial Hygienists (DOE 10
CRF Part 850 and ACGIH 2004). Figure 2
summarizes the results of 280 personal 8-hour TWA beryllium exposure readings
at a DOE facility. This data contains
175 non-detects that range in value from 0.005 to 0.100 µg/m3. This is an example of random (progressive)
left censored data (available at the AOED web site in file beTWA.txt). The q-q plot in Figure 2 is based on the PLE
as described in Section 3.5 and was calculated using the R function plend(beTWA). Figure 2 can be obtained using R utility function qqlognB.
To obtain Figure 2, use the following at the R prompt:
>beTWA <- read.table(“beTWA.txt”)
>qqlognB(beTWA,IpbeTWA,L=0.2, unit = “mug/m^3”).
ML estimates of µ, σ, log(
) , and σ2
are obtained using :
>
mlndln(beTWA)
mu sigma logE sig2 -2Log(L) Conver
mle -5.1786787
1.5357165 -3.9994324 2.3585614 -2.175955e+02 0
semle 0.1340638
0.1155163 0.1485077 0.3548366
-8.918476e-03 105
>
The ML estimate of the 95-95 geometric upper tolerance limit is
calculate using the results in Section 3.3 equation (9), i.e.
=
+ z.95
-2.652 and
|
var( |
= |
Var( |
|
|
= |
0.13412 + 1.6452(0.1155)2 +
2*1.645(-.008918) |
|
|
= |
0.0247 |
Then, using equation 6 , UX(0.95,0.95) = exp[-2.652
+ 1.659 (0.0247)1/2] = 0.091.
The nonparametric upper tolerance limit from Section 3.6 is 0.13. Both estimates are well below the limit L0.95 = 0.2 µg/m3 indicating that
the workplace acceptable (in compliance).
The lognormal based estimate of the exceedance
fraction for L0.95 = 0.2 is
1.01%, and the 95% upper confidence limit Uf(0.2,0.95) = 3.24% < 5% indicating the workplace is
acceptable. The non-parametric estimate
of the exceedance fraction is 1.43% with 95% UCL = 3.24%. All of the above results provide strong
support for the decision that the workplace is in compliance.

Figure
2. Results for 8-hour TWA Data in
Example 2.
APPENDIX
R (2004) is available as Free
Software under the terms of the Free Software Foundation's
GNU General Public License in
source code form. It compiles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows and MacOS. Detailed
documentation on all aspects of R is available at the R home page
http://www.r-project.org/. Sources,
binaries, and documentation for R can be obtained via CRAN,
the “Comprehensive R Archive Network”
(click on CRAN on the R home
page menu ). An Introduction to R and
related manuals edited by the R Development
Core Team are provided
under the “Manuals” link. Additional
manuals, tutorials, etc. are provided by users of R under the “Contributed”
Link. References are provide under the
“Publications” link. [Venables and Ripley(2002)] is a highly recommended book
on statistical data analysis using R.
All of the R functions discussed in this report and the data used in the
examples in Section 4 are available at the website
http://www.csm.ornl.gov/esh/aoed (AOED).
Most of the serious computing is done by R base functions optim
and uniroot. The R driver
functions at AOED are provided to assist the reader that may not have
experience with R. They are not
“formal” R functions, i.e. there is limited error checking and the usual type
of R online “help” files are not provided.
Documentation for each function is provided in this report and as
comments in the function. All of the
files at the AOED web site are ascii (txt) files and can be modified using any
text editor (e.g. xemacs, wordpad, vi).
The most important functions with more detailed documentation are
combined into one file oedmain.R (Exhibit 3).
Additional functions that reflect the authors’ interest and that may
require revisions for other applications are also provided in the file oedutil.R
(Exhibit 4). Both of these are available
at the AOED web site the AOED website.
In Section 4 several examples
of the interactive use of these functions for the analysis of left censored
data were provided. These functions can
be used for complete data sets by providing an input matrix with the data in
column 1 and the censoring indicator for each data value (all equal to 1) in
column 2. The results will be the ML
estimates. For complete data sets when
n is small the “exact “ results
described in Section 2 may be of interest. The R function lnexact in Exhibit 2 illustrates the use
of the functions (see Exhibit 3) extol and efcl for the exact
analysis of complete samples. It is used here to provide an introduction to R
(see below). These functions are
appropriate for complete data sets when the “upper tail” of the lognormal
distribution is of interest (see Mulhausen and Damiano, 1998 Appendix
VII). In this situation the industrial
hygienist picks an upper percentile Xp (often the 95th percentile) and specifies a limit Lp
(e.g., the OEL). The value of p is the
minimum proportion of the exposure distribution that must fall below the limit
Lp for the exposure profile to be considered “acceptable.” The exact 100g% upper confidence limit for the pth percentile Xp
is UX(p, g) = exp(
+ K sy) and is referred to as the upper tolerance
limit ( see Section 2.2). The exact 100g% lower confidence limit for Xp is LX(p, g) = exp(
+ K΄sy).
The point estimate of the exceedance fraction FL and 100g% lower and upper confidence limits are calculate
using efcl(x,gam,L) as described in Section 2.3. The Sapiro-Wilk W statistic recommended by
AIHA can be used on complete data sets using R function shapiro.test for
n between 3 and 5000.
To duplicate the results that
follow visit the AOED web site and complete steps 1-4. The R working directory AIHD will contain
all of the functions described in this report and data frames aiha
(example data from Mulhausen and Damiano, 1998, page 259) and aihand
(example data from Mulhausen and Damiano, 1998, page 244 with 3 non-detected
values). Note that aihand is
used as the default data set in all functions that require a two column matrix
or data frame (e.g., mlndln). The input to lnexact(x,p,gam,L) in
Exhibit 2 is a vector x of positive data xi, i = 1,…, n ,
and the values of p, gam(γ), and L. The first line is the function name and arguments. Lines 2 –18 are comments that describe what
the function does, the arguments, and the values returned. All characters on a line to the right of the
# sign are comments. In a formal R function this information is obtained from a
“help” file by entering ? followed by the name of the function; e.g, typing pt
at the R prompt ( the symbol >) describes the probability density function
for Student’s t distribution that is used by extol and efcl. Line 20 describes the error check on line
21. The mean and standard deviation of
y = log(x) and sample size are calculated on line 23 ( note that the semicolon
separates executable statements on the same line). The next 3 lines calculate the point estimate and confidence
limits for Xp using extol
to calculate the values of K and
K´. Line 27 combines the three values
into a vector xp and names it np. Line 28 uses efcl to calculate the point estimate and
confidence limits for the exceedance fraction, and the next line combines
vector xp and fp into a data frame with names based on the input
values of p, gam, and L. These functions can be revised using the R functions fix
or edit, or by using a text editor (e.g., Notepad or XEmacs) to make a
new file.
Exhibit 2 in the Appendix
lnexact
<-function(x=aiha[,1],p=0.95,gam=0.95,L=5){
# Find point estimates and confidence limits for Xp the pth
# percentile and the exceedance fraction F_L = Pr[ x > L]
# for a (complete) sample
from a lognormal distribution
# USAGE: lnexact(x,p,gam,L)
# ARGUMENTS:
# x is a vector of positive lognormal data
# p is proportion that should fall below L
# gam is the one-sided confidence level
# L is the specified limit of interest
# VALUE: data.frame
containing point estimates
# of Xp and F_L in column 1
# 100gam% lower confidence limits in column 2
# 100gam% upper confidence limits in column 3
# Xp LX(p,gam) UX(p,gam)
# F_L Lf(L,gam) Uf(L,gam)
# NOTE: The combined
confidence limits are an approximate
# 100*[1 - (1-gam)*2] Percent Confidence Interval
#
# The next line is an example of "error checking"
if( any(x) <= 0 ) stop("all data values must be
positive")
# calculate mean(yb) and
standard deviation(sy) of y=log(x)
yb <- mean(log(x)) ; sy<- sd(log(x) ) ; n <- length(x)
xp <- exp( yb +
qnorm(p)*sy ) # point estimate of Xp
lxp <- exp( yb +
extol(n,p,1-gam)*sy) # exact LCL for Xp
uxp <- exp( yb +
extol(n,p,gam)*sy ) # exact UCL for Xp
xp <- c(xp,lxp,uxp) ; nx <-
paste("X",100*p,sep="")
fl <- efcl(x,gam,L,T) ;
nf <- paste("F",L,sep="_")
out<- rbind(xp,fl)
dimnames(out)<- list(
c(nx,nf),c("Est","LCL","UCL"))
out
}
The following script demonstrates the use of lnexact
> #
Use data from Hewett and Ganser (HG) 1997 page 135 to
> #
illustrate use of lnexact to obtain exact confidence
> #
limits for Xp percentile and the exceedance fraction
> #
for complete samples from lognormal distribution the
> #
next line assigns the data to the vector variable xhg
>
> xhg<- c(4.25,1.38,3.11,2.20,2.82)
>
> # now use function lnexact with xhg as
input
>
> lnexact(xhg,p=0.95,gam=0.95,L=5)
Est LCL.95 UCL.95
X95 5.145787 3.6328368 15.10336
F_5 5.744611 0.3795139 35.55304
>
> # HG results--- F_5= 5.7 LCL= 0.38
UCL= 35.55
> # --- X95 not
calculated
>
> # Next use aiha[,1] to check results
in Appendix VII
>
# of Mulhausen and Damiano, 1998 (DM)
> Ipaiha
[1] "Strategy for Assessing &
Managing Occ Exp page 259"
>
>
lnexact(aiha$x,p=0.95, gam=0.95)
Est LCL UCL
X95 4.842716 3.9015308
7.045908
F_5 4.241070 0.8570198 15.282684
>
>
> #
DM Appendix VII page 278-280 results UTL= 7.11
> #
F_5= 4.4% 95% LCL = 2% 95% UCL
= 15%
> #
difference is due to graphical interpolation and
> #
rounding of the mean and SD of log(x)
> mean( log(aiha[,1]) ) ; sd(
log(aiha[,1]) )
[1] 0.9079064 # rounded to 0.91
[1] 0.4070693 # rounded t0 0.41
>
> # DM Appendix VII page 277 mean= 0.91
SD = 0.41
>
The results from lnexact
will agree with those from Odeh and Owen for any reasonable values of p, gam,
and L. Note that if the value of L is changed to the 95% UCL for Xp
>
> #
change L to 7.046 the 95% UCL
for X95
>
> lnexact(aiha[,1],p=0.95, gam=0.95,
L=7.046)
Est
LCL.95 UCL.95
X95
4.8427163 3.90153084 7.045908
F_7.046 0.5143435 0.02923116 4.999718
#
and the 95% UCL for F is 5% as expected
This shows the equivalance of
the relationship between the upper tolerance limit and the exceedance fraction;
i.e, with γ = 0.95 , p= 0.95, L=
7.046, F0 = 5% the
upper tolerance limit is UX(0.95,0.95)
= 7.04559 and the upper confidence
limit for the exceedance fraction is
Uf(7.046,0.95)= 4.99972
H0: Xp ³ L
reject if UX(p, γ) <
L REJECT H0
H0: FL ³ 1-p
reject if Uf(L, γ) < F0 REJECT H0
Likewise,
if the value p = 1 - UCLF_5/100 = .8472
is used
> #
change p to 1 - 0.1528= 0.8472
>
> lnexact(aiha[,1],p=0.8472, gam=0.95,
L=5)
Est LCL.95 UCL.95
X84.72 3.76199 3.1313547 5.000301
F_5
4.24107 0.8570198 15.282684
> #
and the 95% UCL for Xp is 5 as expected
This is equivalent to γ
= 0.95 , p= 0.8472, L= 5, F0 = 15.29% , the upper tolerance limit is UX(0.847,0.95)= 4.999157,
and the upper confidence limit
for the exceedance fraction is
Uf(5,0.95) = 15.282684
H0: Xp ³ L
reject if UX(p, γ)
< L REJECT H0
H0: FL ³ 1-p
reject if Uf(L, γ) < F0 REJECT H0
Exhibit 3 in the Appendix
oedmain.R
Exhibit 4 in the Appendix oedutil.R
.
REFERENCES
Aitchison, J. and J.A.C. Brown, 1969,
The Lognormal Distribution, Cambridge, U.K., Cambridge University Press.
Akritas, M.G., T. F. Ruscitti, and G.S.
Patil, 1994. Statistical Analysis of
Censored Environmental Data, Handbook of Statistics, Vol. 12, G.P Patil and
C.R. Rao (eds), 221-242, Elsevier
Science , New York.
American Conference of Governmental
Industrial Hygienists (ACGIH), Notice of Intended Change In: 2004 TLVs® and
BEIs®, p. 60, ACGIH, Cincinnati, OH.
Armstrong, B. G., 1992. “Confidence Intervals for Arithmetic Means of Lognormally
Distributed Exposures,” American
Industrial Hygiene Association Journal, 53(8), 481-485.
Chambers, J. M., W. S.
Cleveland, B. Kleiner and P. A. Tukey, 1983.
Graphical Methods for Data Analysis, Duxbury Press, Boston.
Clopper, C. J. & Pearson,
E. S. (1934). The Use of Confidence or Fiducial Limits Illustrated in the Case
of the Binomial. Biometrika_, 26, 404-413.
Cohen, A. C,. 1991. Truncated and Censored Samples. Marcel
Dekker, Inc., New York.
Crow, E. L. and K. Shimizu, 1988. Lognormal Distribution, Marcel Decker, New
York.
Cox,
D. R. and D. V. Hinkley, 1979.
Theoretical Statistics. Chapman & Hall, New York.
Cox,
D.R. and D. Oakes, 1984. Analysis of Survival Data. Chapman and Hall,
New York.
Department of Energy, 10 CFR Part 850, Chronic Beryllium
Disease Prevention Program, Federal Register, Vol. 64, No. 235, 68854-68914,
December 1999.
Fowlkes, E. B. 1979. “Some Methods for Studying the Mixture
of Two Normal (Lognormal) Distributions,” Journal of the American Statistical
Association, 74, 561-575.
Hahn, G.J. and W.Q. Meeker, 1991. Statistical
Intervals. John Wiley and Sons, New York.
Hewett, P. and G. H. Ganser,
1997. “Simple Procedures for
Calculating Confidence Intervals Around the Sample Mean and Exceedance Fraction
Derived from Lognormally Distributed Data,” Applied
Occupational and Environmental Hygiene, 12(2), 132-147.
Helsel, D.,1990. “Less Than
Obvious: Statistical Treatment of Date Below the Detection Limit”, Environmental
Science and Technology, 24(12),
1767-1774.
Hesel, D.R, and T.A. Cohn
,1988, “Estimation of Descriptive Statistics for Multiply Censored Water
Quality Data”, Water Resoureces Research, 24, 1997-2004.
Johnson, N. L. and B. L.
Welch, 1940. “Application of the Non-Central t-Distribution,” Biometrika, 31(3/4), 362-389.
Kalbfleisch, J.D. and
R.L. Prentice, 1980. The
Statistical Analysis of Failure Time Data.
John Wiley and Sons, New York..
Kaplan, E. L. and P. Meir,
P.,1958. “Nonparametric Estimation from
Incomplete Observations,” Journal of the
American Statistical Association, 457-481.
Land, C. E., 1972. “An Evaluation of Approximate Confidence
Interval Estimation Methods for Lognormal Means,” Technometrics, 14(1), 145-158.
Meeker, W.Q and L.A. Escobar,
1998. Statistical Methods for Reliability Data. John Wiley and Sons, New
York.
Moulton, L.H. and N.A.
Halsey, 1995. “A Mixture Model with
Detection Limits for Regression Analysis of Antibody Response on Vaccine,” Biometrics,
51, 1570-1578.
Mulhausen, J. R. and J.
Damiano, 1998. A Strategy for Assesing and Managing Occupational Exposures, Second
Edition, AIHA Press, Fairfax, VA.
Neuman, M.C., P.M.
Dixon, B.B. Looney,and J.E. Pinder,
1989, “Estimating Mean and Variance for Environmental Samples with Below
Detection Limit Observations, Water Resources Bulletin,25, 905-916.
Odeh, R. E. and D. B. Owen, 1980. Tables
for Normal Tolerance Limits, Sampling
Plans, and Screening, Marcel Deker, New York.
R Development Core Team (2004). R: A language
and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN
3-900051-00-3, URL http://www.R-project.org.
Schmee,
J., D. Gladstein and W. Nelson, 1985.
“Confidence Limits for Parameters of a Normal Distribution From Singly Censored
Samples, Using Maximum Likelihood.” Technometrics, 27, 119-128.
Schmoyer, R. L., J. J.
Beauchamp, C. C. Brandt and F. O. Hoffman, Jr., 1996. “Difficulties with the Lognormal Model in Mean Estimation and
Testing,” Environmental and Ecological
Statistics, 3, 81-97.
Sommerville, P. N.,
1958. “Tables for Obtaining
Non-Parametric Confidence Limits,” Annals
of Mathematical Statistics, 29, 599-601.
Taylor, D. J., L. L. Kupper,
S. M. Rappaport, and R. H. Lyles, 2001. “A Mixture Model for Occupational
Exposure Mean Testing with a Limit of Detection”, Biometrics, 57, 681-688.
Tuggle, R. M., 1982. “Assessment of Occupational Exposure Using
One-Sided Tolerance Limits,” American
Industrial Hygiene Association Journal, 43, 338-346.
Turnbull, B. W., 1976. “The
Empirical Distribution Function with Arbitrarily Grouped, Censored and
Truncated Data,” Journal of the Royal
Statistical Society, Series B (Methodological), 38(3), 290-295.
Venables, W. N. and B. D.
Ripley, 2002. “Modern Applied Statistics with S,” 4th edition. Springer-Verlag, New York.
Verrill, S. and R. A.
Johnson, 1998. “Tables and Large-Sample
Distribution Theory for Censored-Data Correlation Statistics for Testing
Normality,” Journal of the American
Statistical Association, 83(404), 1192-1197.
Waller, L. A., and B. W.
Turnbull, 1992. “Probability Plotting with Censored Data,”
The American Statistician, 46(1), 5-12.