Related links:
Data Testing for Normal Distribution Program Page
Normal Distribution Plot Program Page

Explanation
This page supports the programs in the Data Testing for Normal Distribution Program Page, which test the hypothesis that a data set is normally distributed. It also provides a basic description of the data set, similar to the data description procedure in SPSS. The example used here is the default example data in the Data Testing for Normal Distribution Program Page.
Data Description

The following parameters are calculated and presented:
- The sample size, minimum and maximum values, mean, standard deviation, and standard error of the mean
- The median, and percentile values from the 5th to the 95th percentile at 5-percentile intervals
Simple Tests of Normal Distribution

The following were the most commonly used tests of normality before complex algorithms requiring intensive computing became widely available.
- Skewness evaluates the asymmetry of the data around the mean. Data truncated in the lower values and with a long tail in the higher values have a positive skew, and the reverse a negative skew. Calculations produce a measure of skewness and its 95% confidence interval. If this interval includes zero (0), there is no significant skew.
- Kurtosis evaluates whether the fall-off in frequencies away from the mean conforms to the normal distribution. Where excessive data occur near the mean, the distribution curve is excessively peaked; where data are spread evenly across a wide range, the curve is flattened. Calculations produce a measure of kurtosis and its 95% confidence interval. If this interval includes zero (0), there is no significant bias in kurtosis.
- Significant difference between mean and median. This evaluates the probability of z, where z = (mean - median) / (Standard Error of the mean). It is essentially an alternative test of skewness: if p <= 0.05, a significant skew exists.
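As a cross-check outside R, the skewness and kurtosis calculations above can be sketched in Python using only the standard library. The sqrt(6/n) and sqrt(24/n) standard errors are the large-sample approximations described above, and the example data are made up for illustration.

```python
import math

def skew_kurtosis_ci(dat, z=1.96):
    """Skewness and excess kurtosis with approximate 95% confidence intervals.

    Uses SE(skew) = sqrt(6/n) and SE(kurtosis) = sqrt(24/n), the
    large-sample approximations described in the text above.
    """
    n = len(dat)
    m = sum(dat) / n
    s = math.sqrt(sum((x - m) ** 2 for x in dat) / (n - 1))  # sample SD
    skew = sum((x - m) ** 3 for x in dat) / ((n - 1) * s ** 3)
    kurt = sum((x - m) ** 4 for x in dat) / ((n - 1) * s ** 4) - 3
    se_skew = math.sqrt(6 / n)
    se_kurt = math.sqrt(24 / n)
    return {
        "skew": skew, "skew_ci": (skew - z * se_skew, skew + z * se_skew),
        "kurt": kurt, "kurt_ci": (kurt - z * se_kurt, kurt + z * se_kurt),
    }

# Long tail in the higher values -> positive skew
res = skew_kurtosis_ci([1, 2, 2, 3, 3, 4, 10])
```

With such a small sample the confidence intervals are wide, so even clearly skewed-looking data may not reach significance.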
The following are more formal tests of normal distribution.
The Chi Square Goodness of Fit test. The data are divided into groups one standard deviation wide, and the chi square test is used to see whether the counts in the groups differ significantly from those expected if the data are normally distributed.
The program also produces a normal distribution plot so that users can visualize the actual distribution of the data.
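A hedged Python sketch of this binned chi-square test (not the page's own implementation): bins one standard deviation wide with open-ended tails, expected counts from the normal curve fitted to the data, and two degrees of freedom subtracted for the estimated mean and SD. The data are the page's default example values.

```python
import numpy as np
from scipy import stats

def chi_square_normality(dat):
    """Chi-square goodness of fit against a fitted normal distribution.

    Data are binned into groups one standard deviation wide (with
    open-ended tails), and observed counts are compared with the counts
    expected under a normal curve fitted to the data.
    """
    dat = np.asarray(dat, dtype=float)
    m, s = dat.mean(), dat.std(ddof=1)
    inner = m + s * np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # inner bin edges
    observed = np.bincount(np.digitize(dat, inner), minlength=6)
    edges = np.concatenate(([-np.inf], inner, [np.inf]))
    expected = len(dat) * np.diff(stats.norm.cdf(edges, loc=m, scale=s))
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    df = len(observed) - 1 - 2  # 2 parameters (mean, SD) estimated from the data
    return chi2, float(stats.chi2.sf(chi2, df))

chi2, p = chi_square_normality([-0.0140, 0.1533, -0.4360, -0.7235, 1.5131,
                                1.0373, 0.2067, 2.0434, -0.0286, -0.4964,
                                0.4710, -1.2026, -0.5798, 0.5776, -0.8285,
                                0.3922, -0.0775, -1.8929, 0.2269, -0.2602])
```

With only 20 observations the expected counts in the tail bins are small, which is one reason this test has fallen out of favour for small samples.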
The Kolmogorov-Smirnov test is the most commonly accepted test of whether a set of data violates the assumption of normality. The data are first placed in order of magnitude, the cumulative probability for each data point is calculated and matched against the theoretical cumulative probability from a normal distribution, and the largest difference between these two probabilities is tested against the sample size. The result indicates whether the data deviate significantly from normality.
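The D statistic itself is easy to compute by hand. A Python sketch with illustrative data, which reproduces the statistic reported by scipy's one-sample test when the normal parameters are estimated from the data:

```python
import numpy as np
from scipy import stats

def ks_statistic(dat):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    x = np.sort(np.asarray(dat, dtype=float))
    n = len(x)
    theo = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    d_plus = np.max(np.arange(1, n + 1) / n - theo)  # ECDF just above each point
    d_minus = np.max(theo - np.arange(0, n) / n)     # ECDF just below each point
    return float(max(d_plus, d_minus))

data = [-0.2, 0.1, 0.3, 1.5, -1.1, 0.7, -0.4, 2.0]
d = ks_statistic(data)
d_scipy = stats.kstest(data, "norm",
                       args=(np.mean(data), np.std(data, ddof=1))).statistic
```

Note that estimating the mean and SD from the same data makes the standard KS p-value conservative; the Lilliefors correction addresses this.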
The Shapiro-Francia test also tests whether the data deviate significantly from normality, and some have argued that it is a better test than the Kolmogorov-Smirnov test when the sample size is small. This is because, in smaller samples (n < 1000), the maximum difference between the theoretical and actual cumulative probability is more variable, so the Kolmogorov-Smirnov test can be less stable.
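In R the nortest package provides sf.test for Shapiro-Francia (shown, untried, in the R code below). In Python, scipy ships the closely related Shapiro-Wilk test, which serves the same purpose; a sketch using the page's default example data:

```python
from scipy import stats

# Shapiro-Wilk (a close relative of Shapiro-Francia); a small p-value
# indicates a significant deviation from normality
data = [-0.0140, 0.1533, -0.4360, -0.7235, 1.5131, 1.0373, 0.2067, 2.0434,
        -0.0286, -0.4964, 0.4710, -1.2026, -0.5798, 0.5776, -0.8285, 0.3922,
        -0.0775, -1.8929, 0.2269, -0.2602]
w, p = stats.shapiro(data)
```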
The P-P Plot plots the cumulative probability against the data. A data set with an exact normal distribution will plot along the diagonal, so this provides a visual description of the relationship.
The correlation between the actual and theoretical distributions also provides a measure of how close the data are to a normal distribution. This correlation is often used to optimise a transformation towards the normal distribution, but it has limited practical use for deciding whether the assumption of normality is valid in a set of data, because the correlation coefficient tends to be high in any case and it is difficult to determine a cut-off point for the decision.
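scipy's probplot returns this correlation directly as part of its least-squares fit of the ordered data against theoretical normal quantiles. A sketch with illustrative near-normal data, showing why a cut-off is hard to pick (r is close to 1 even for unremarkable samples):

```python
import numpy as np
from scipy import stats

def normal_ppcc(dat):
    """Correlation between ordered data and theoretical normal quantiles."""
    (osm, osr), (slope, intercept, r) = stats.probplot(
        np.asarray(dat, dtype=float), dist="norm")
    return float(r)

r = normal_ppcc([-1.1, -0.6, -0.4, -0.2, 0.05, 0.1, 0.3, 0.7, 1.5, 2.0])
```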
References
Siegel S, Castellan NJ Jr. (1988) Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill Book Company, New York. ISBN 0-07-100326-6. p. 45-51; p. 51-55.
Table of significance from: Massey FJ Jr. (1951) The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46:253 p. 68-78.
Shapiro SS, Francia RS (1972) An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67:337 p. 215-216.
Royston JP (1983) A Simple Test for Evaluating the Shapiro-Francia W' Test of
Non-Normality. The Statistician 32:3 p. 297-300
NIST webpage
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
QStat page on standard error of skewness and kurtosis
http://statgen.ncsu.edu/qtlcart/manual/node71.html
R Code

I have written the following R code to check the accuracy of the PHP programs used in these web pages, and I have included it here for anyone interested in developing their own resources. I have not included the Chi Square Goodness of Fit test, as it is now infrequently used. I found an algorithm for the Shapiro-Francia test, but have not tried it myself.
# Normal Distribution
# Data
dat = c(-0.0140,0.1533,-0.4360,-0.7235,1.5131,1.0373,0.2067,2.0434,-0.0286,-0.4964,
0.4710,-1.2026,-0.5798,0.5776,-0.8285,0.3922,-0.0775,-1.8929,0.2269,-0.2602)
pcCI = 95 #percent confidence interval
# Calculations
# z for the two-tailed confidence interval
zCI = -(qnorm((100 - pcCI) / 200))
# Sample size, mean, SD, SE, confidence intervals for values and mean
n = length(dat)
m = mean(dat)
s = sd(dat)
se = s / sqrt(n)
cat("n=", n, " mean=", round(m, 4), " SD=", round(s, 4), " SE=", round(se, 4), "\n")
cat(pcCI, "% CI for values=", round(m - s * zCI, 4), " to ", round(m + s * zCI, 4), "\n")
cat(pcCI, "% CI for mean=", round(m - se * zCI, 4), " to ", round(m + se * zCI, 4), "\n")
# Skewness, kurtosis, and the combined chi square test
skew = sum((dat - m)^3) / ((n - 1) * s^3)
seSkew = sqrt(6 / n)
cat("skew=", round(skew, 4), " SE=", round(seSkew, 4), " ",
    pcCI, "% CI for skew=", round(skew - seSkew * zCI, 4), " to ", round(skew + seSkew * zCI, 4), "\n")
kurt = sum((dat - m)^4) / ((n - 1) * s^4) - 3
seKurt = sqrt(24 / n)
cat("kurtosis=", round(kurt, 4), " SE=", round(seKurt, 4), " ",
    pcCI, "% CI for kurtosis=", round(kurt - seKurt * zCI, 4), " to ", round(kurt + seKurt * zCI, 4), "\n")
chiSq = n * skew^2 / 6 + n * kurt^2 / 24
prob = pchisq(chiSq, df = 2, lower.tail = FALSE)
cat("Chi Square=", round(chiSq, 4), " p=", round(prob, 4), "\n")
# Percentile values
quantile(dat,c(seq(from = .05, to = .95,by = 0.05)), type = 1)
# Kolmogorov Smirnov Test for deviation from normality
ks.test(dat, "pnorm", mean=mean(dat), sd=sd(dat))
# Shapiro Francia Test
# From an R help page; requires the nortest package (I have not tried it)
# install.packages("nortest")
#library(nortest)
#sf.test(dat)
# Plots. Note: there are two plot programs.
# Run them separately, or the second plot will replace the first.
# Plot 1: QQ plot
qqnorm(dat, pch = 1)
qqline(dat)
# Plot 2: Histogram and comparison against Gaussian curve
h <- hist(dat, breaks = 10)
xfit <- seq(min(dat), max(dat), length = 40)
yfit <- dnorm(xfit, mean = mean(dat), sd = sd(dat))
yfit <- yfit * diff(h$mids[1:2]) * length(dat) # scale density to histogram counts
lines(xfit, yfit, col = "black", lwd = 2)