4.1.1 Skewness and Dispersion
The coefficient of variation and geometric standard deviation are measures of the dispersion of a distribution, and the coefficient of skewness is a measure of its asymmetry. Both the skewness (asymmetry) and dispersion (spread) in the data can affect the confidence in estimates of the mean. Since it is common for environmental data to exhibit positive skewness (i.e., a longer right tail) or a wide range of concentrations, one challenge for sampling designs is to represent the upper and lower tails of the distribution in the proper proportion, thereby yielding a relatively precise estimate of the mean. For data sets generated with discrete sampling, graphical and exploratory data analysis techniques are commonly used to assess the degree of skewness and dispersion. For example, plotting the data using histograms and probability plots may reveal the distribution shape and the presence of multiple populations. This assessment can be supplemented by a statistical analysis of the goodness-of-fit (GOF) to normal, lognormal, or gamma distributions. Summary statistics can also be informative and are readily calculated with both free and commercial statistics software, including (a) the coefficient of skewness; (b) the coefficient of variation (CV), also referred to as the relative standard deviation (RSD), which is the standard deviation (SD) divided by the arithmetic mean; and (c) the geometric standard deviation (GSD), which is comparable to the CV (see footnotes of Table 4-1) and is used specifically with lognormal distributions.
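The three summary statistics listed above can be computed with a few lines of code. The following is a minimal sketch in Python using only the standard library. The function name `summary_dispersion` and the concentration values are hypothetical and serve only as an illustration. The skewness shown is the simple moment coefficient without a small-sample correction, which is an assumption, since statistics packages differ in which skewness estimator they report.

```python
import math
import statistics


def summary_dispersion(values):
    """Return CV, GSD, and coefficient of skewness for positive concentrations.

    Hypothetical helper for illustration; not taken from any specific package.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    cv = sd / mean                 # coefficient of variation (RSD)

    # GSD: exponentiated sample SD of the natural-log-transformed data.
    logs = [math.log(v) for v in values]
    gsd = math.exp(statistics.stdev(logs))

    # Moment coefficient of skewness, m3 / m2**1.5 (assumed estimator,
    # no small-sample correction).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skewness = m3 / m2 ** 1.5

    return {"cv": cv, "gsd": gsd, "skewness": skewness}


# Hypothetical discrete-sample concentrations (e.g., mg/kg).
concentrations = [1.2, 0.8, 2.5, 1.1, 9.7, 1.9, 0.6, 3.4]
print(summary_dispersion(concentrations))
```

A positive skewness value indicates a longer right tail, as is common for environmental data. The GSD calculation requires strictly positive values because it takes logarithms.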
For convenience in this document, the degree of dispersion of the concentration distribution in a DU is classified in terms of “low,” “medium,” and “high,” as shown in Table 4-1. These categories can be used to guide the selection of methods used to calculate the UCL for the mean, as discussed in Section 4.2.
Table 4-1. Data dispersion in terms of CV and GSD
a Coefficient of variation (CV) = standard deviation (SD)/mean.
b Geometric standard deviation (GSD) = exp[sqrt(ln(CV^2 + 1))] for lognormal distributions.
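The relationship in footnote b can be applied in either direction. The sketch below, in Python, implements the footnote formula and its algebraic inverse, CV = sqrt(exp[(ln GSD)^2] - 1), which follows from solving the footnote formula for CV. The function names are hypothetical, and the conversion is only meaningful under the lognormal assumption stated in the footnote.

```python
import math


def gsd_from_cv(cv):
    """Footnote b: GSD = exp[sqrt(ln(CV^2 + 1))], lognormal assumption."""
    return math.exp(math.sqrt(math.log(cv ** 2 + 1.0)))


def cv_from_gsd(gsd):
    """Algebraic inverse of footnote b: CV = sqrt(exp[(ln GSD)^2] - 1)."""
    return math.sqrt(math.exp(math.log(gsd) ** 2) - 1.0)


# Example: convert a few CV values to the corresponding GSD and back.
for cv in (0.5, 1.0, 2.0):
    gsd = gsd_from_cv(cv)
    print(f"CV = {cv:.2f} -> GSD = {gsd:.3f} -> CV = {cv_from_gsd(gsd):.2f}")
```

A CV of 0 corresponds to a GSD of 1 (no dispersion), and the GSD increases monotonically with the CV.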
The contaminant distribution in the DU is different from the distribution of DU means that is characterized by ISM sampling. Table 4-1 provides categories of dispersion for the contaminant distribution throughout the DU rather than for the distribution of the DU means. For data sets generated with ISM, fewer exploratory data analysis options are recommended because of the relatively small number of samples. For example, one would not generate a histogram or perform a GOF test on a data set consisting of three replicates. Nevertheless, summary statistics of replicates can provide a measure of the precision in the estimates of the mean, which can be a useful diagnostic for evaluating laboratory DQOs (see Section 18.104.22.168). The mean and variance of the ISM samples can also be used to calculate the UCL for the grand mean. However, as discussed in Section 4.3, the RSD statistic does not serve as a reliable performance metric of a UCL calculation because the true DU mean is never known. In addition, simulations demonstrate that, for data sets in which the sample mean is less than the true mean, the likelihood that the UCL also underestimates the mean increases as the sample RSD decreases, because of the positive correlation between the estimated mean and the estimated variance.
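The last point can be illustrated with a small simulation. The sketch below is not the simulation referred to in this document. It is a simplified stand-in that rests on several assumptions: three replicate results are drawn directly from a lognormal distribution (mu = 0, sigma = 1), the UCL is a one-sided 95% Student's t UCL, and the trial count and seed are arbitrary. All function names are hypothetical. The code reports the correlation between the sample mean and sample SD. For trials in which the sample mean falls below the true mean, it also reports how often the UCL falls below the true mean in the lower-RSD half versus the higher-RSD half of those trials.

```python
import math
import random
import statistics

N_REPLICATES = 3
T_95_DF2 = 2.920  # one-sided 95% Student's t quantile for 2 degrees of freedom


def simulate(trials=20000, sigma=1.0, seed=1):
    """Draw replicate sets from a lognormal (assumed stand-in distribution)."""
    rng = random.Random(seed)
    true_mean = math.exp(sigma ** 2 / 2.0)  # mean of a lognormal with mu = 0
    results = []
    for _ in range(trials):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(N_REPLICATES)]
        mean = statistics.fmean(sample)
        sd = statistics.stdev(sample)
        ucl = mean + T_95_DF2 * sd / math.sqrt(N_REPLICATES)
        results.append((mean, sd, ucl))
    return true_mean, results


def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def miss_rates_by_rsd(true_mean, results):
    """Among trials with mean < true mean, return the fraction of UCLs below
    the true mean for the lower-RSD half and the higher-RSD half."""
    below = [(m, s, u) for m, s, u in results if m < true_mean]
    rsds = sorted(s / m for m, s, _ in below)
    cutoff = rsds[len(rsds) // 2]  # median RSD of the selected trials
    low = [u < true_mean for m, s, u in below if s / m < cutoff]
    high = [u < true_mean for m, s, u in below if s / m >= cutoff]
    return sum(low) / len(low), sum(high) / len(high)


true_mean, results = simulate()
r = pearson([m for m, _, _ in results], [s for _, s, _ in results])
low_rate, high_rate = miss_rates_by_rsd(true_mean, results)
print(f"correlation(mean, SD) = {r:.2f}")
print(f"UCL below true mean, lower-RSD half:  {low_rate:.2f}")
print(f"UCL below true mean, higher-RSD half: {high_rate:.2f}")
```

Under these assumptions, a low RSD among replicates does not by itself indicate that the UCL exceeds the true mean. Actual ISM replicates are composites of many increments, so the magnitude of the effect in practice will differ from this simplified case.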