Related Links:
Sample Size for Unpaired Differences Program Page
Sample Size for Unpaired Differences Tables Page
Unpaired Difference Programs Page
Introduction
Parametric
Nonparametric
Permutation
References
Comparing values between groups is a very common research model in the biomedical domain. The model can be used in
epidemiology and surveys, such as comparing birth weight between
boys and girls, or in randomized controlled trials, such as allocating different medications to groups of patients and
comparing their responses.
The research models themselves are complex and sophisticated, requiring careful control of bias. This page does not
cover these aspects, but focuses only on the statistical procedures. To select the correct procedure, the following
issues are addressed.
 The nature of the data
 If the measurements are continuous and normally distributed, the powerful parametric statistical procedures can be used.
Examples of parametric measurements are height and weight.
 If the measurements are continuous, but not normally distributed, some form of transformation may be needed before
parametric statistical procedures can be used. Examples of transformable measurements are ratios and time to events.
 If the measurements are not continuous, or if they are not normally distributed and cannot be transformed, then the
nonparametric statistical procedures can be used. Examples of nonparametric measurements are 5 point Likert items,
10 point semantic differential scales, and many psychometric measurements.
 If the data are not measurements, such as counts or classifications, then they cannot be analysed as measurements.
Examples of nonmeasurements are number of adverse events, proportion of surgical complications, sex of newborns.
 The sample size. Programs in the Sample Size for Unpaired Differences Program Page and tables in the Sample Size for Unpaired Differences Tables Page can be used to estimate sample size requirements. For proper statistical
inference, the exact sample size required should be calculated and used. Approximations useful in the early stages
of research planning are:
 Pilot studies require 6 to 20 subjects per group.
 The main study requires 150-200 subjects per group to detect a small effect, 15-20 for a large effect, and in most
clinical research situations 60-70 subjects per group for a moderate effect size.
 Nonparametric tests have approximately 95-99% of the power efficiency of the equivalent parametric tests. Approximately,
the sample size calculated for parametric tests should increase by 5%-10% for equivalent nonparametric tests.
Comparisons
Sample Size
Example
Comparison of Variances
Comparison (2 Groups)
Comparison (>2 Groups)
Parametric comparisons of values between groups often assume that the variations in all the groups under
comparison are similar (homogeneous). The following tests are provided to test the homogeneity of variances in the groups. Variance is the square of the Standard Deviation and represents the variation in measurements within a group. Homogeneity means that the differences between
the variances of the groups are not statistically significant.
StatsToDo provides three commonly used tests for homogeneity of variance.
 The F Ratio tests for a significant difference between the variances of two groups. The formula is:
F = (Standard Deviation_{greater})^{2} / (Standard Deviation_{smaller})^{2}
The ratio is the greater value over the smaller value, so it is never less than 1. The degrees of freedom of the two variances
are one less than the sample size of the appropriate group (df = n - 1). The probability of this F value, with the two
degrees of freedom, is calculated in the Probability of F page. The variances are then accepted as not
significantly different, and therefore homogeneous, if the probability is greater than 0.05. The F ratio is easy to compute,
but it is excessively sensitive, particularly when the sample size is more than 30 per group.
 Bartlett's Test is the most commonly used test of significant difference between variances from two or more groups.
The test uses n, mean, and Standard Deviation from each group under comparison, so it can be carried out using summary data.
The result is a Chi Square with degrees of freedom one less than the number of groups. If the Probability of the Chi Square is
greater than 0.05, then the variances are accepted as not significantly different, and therefore homogeneous.
 The Levene Test for significant difference between variances is the most precise one, and is the default test
offered by SPSS. It is less often used, however, as it requires the original data set of values.
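Because Bartlett's Test needs only n and Standard Deviation from each group, it is easy to sketch from summary data. The following is a minimal illustration using the standard textbook form of the statistic (including the usual correction factor), not necessarily the exact algorithm used by the StatsToDo program. The function names `chi2_sf` and `bartlett_from_summary` are hypothetical, and the chi square probability is computed with a closed form that is valid for whole-number degrees of freedom only.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi square value, for integer df (closed form)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i < df/2
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return total


def bartlett_from_summary(ns, sds):
    """Bartlett's chi square, its degrees of freedom, and probability."""
    k = len(ns)
    dfs = [n - 1 for n in ns]
    variances = [sd ** 2 for sd in sds]
    df_total = sum(dfs)
    # Pooled (within group) variance, weighted by degrees of freedom
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_total
    numerator = df_total * math.log(pooled) - sum(
        d * math.log(v) for d, v in zip(dfs, variances)
    )
    correction = 1 + (sum(1 / d for d in dfs) - 1 / df_total) / (3 * (k - 1))
    chi_sq = numerator / correction
    return chi_sq, k - 1, chi2_sf(chi_sq, k - 1)


# Summary data from the two worked examples on this page
print(bartlett_from_summary([60, 68], [22, 19]))
print(bartlett_from_summary([130, 130, 130], [392, 401, 398]))
```

With the summary data from the two examples on this page, the sketch gives values close to those reported there (chi square about 1.3 with p about 0.25, and chi square about 0.07 with p about 0.97); small differences are expected from rounding and from the correction factor.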
Historically, the difference between the groups is expressed as a ratio of its Standard Error (t = difference / Standard Error). As this ratio is assumed to follow a known distribution (approximately normal in large samples), the probability of obtaining a value greater than
the observed t can be estimated. Usually, if the Probability of t is greater than 0.05, the difference is not considered statistically significant.
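The ratio t = difference / Standard Error can be sketched from summary data as follows. This is an illustration only: it assumes a pooled within-group variance, the function names are hypothetical, and the confidence interval uses the normal distribution as a large-sample approximation rather than the t distribution. The figures are those of Example 1 on this page, taking the laparoscopy mean as -1, which is the value consistent with the difference of 16 reported there.

```python
import math
from statistics import NormalDist


def unpaired_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Difference (group 1 - group 2), its Standard Error, t, and degrees of freedom."""
    diff = mean1 - mean2
    df = n1 + n2 - 2
    # Pooled variance, assuming the two variances are homogeneous
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff, se, diff / se, df


def approx_ci(diff, se, level=0.95):
    """Two tail confidence interval, normal approximation (assumption: large sample)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff - z * se, diff + z * se


diff, se, t, df = unpaired_t_from_summary(60, 15, 22, 68, -1, 19)
low, high = approx_ci(diff, se)
print(f"difference={diff}, SE={se:.2f}, t={t:.2f}, df={df}")
print(f"approximate 95% CI: {low:.1f} to {high:.1f}")
```

The exact probability of t requires the t distribution, which is not in the Python standard library; with 126 degrees of freedom the normal approximation used here is close.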
Alternatively, the sample size required is estimated, using a defined nontrivial difference (the Critical Difference, CD) to be tested, the Probabilities of Type I and II Errors to be used for decision making, and an estimate of the Standard Deviation of the measurements. If the difference obtained is greater than the Critical Difference, it can be considered statistically significant.
The procedures for testing a significant difference, however, have been misused. Often the sample size is estimated with a presumed Standard
Deviation, but the Probability of Type I Error is then calculated using the Standard Deviation of the data obtained. This has led to
a loss of confidence in the statistical significance concept as it was originally defined.
More recently, the 95% confidence interval of the difference (95% CI) is increasingly used, as the results depend entirely on the data obtained, and can be flexibly interpreted.
Assuming the difference is the value of group 1 - value of group 2, the following interpretations can be made according to the diagram to the right
 A 95% CI that does not traverse the null value allows the conclusion that the two groups of measurements are significantly different
 A 95% CI that crosses the null value, but not the +CD value, allows the conclusion that values of group 1 are significantly
not greater than those of group 2
 A 95% CI that crosses the null value, but not the -CD value, allows the conclusion that values of group 1 are significantly
not less than those of group 2
 A 95% CI that crosses the null value, but neither the -CD nor the +CD value, allows the conclusion that the two groups are
significantly equivalent, and that the difference between them can be considered trivial.
 A 95% CI that crosses both the -CD and +CD values allows the conclusion that the data lack sufficient power for interpretation, and no statistical conclusion can be drawn.
The use of 95% confidence intervals and their interpretation, however, require careful consideration of the following
 The hypothesis to be tested must be predefined, and tested against the data. It is erroneous to examine the data and then decide which of the conclusions to adopt
 The 95% confidence interval using the two tail model is required to test for equivalence
 The 95% confidence interval using either the one or two tail model can be used for the other decisions.
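The five interpretations of the 95% confidence interval listed above can be written as a small decision function. This is a sketch only: the function name `interpret_ci` and the wording of the returned labels are hypothetical, the difference is taken as group 1 - group 2, and the null value is assumed to be 0 unless stated otherwise.

```python
def interpret_ci(lower, upper, cd, null=0.0):
    """Classify a confidence interval against the null value and +/- the Critical Difference."""
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    crosses_null = lower <= null <= upper
    crosses_pos_cd = lower <= null + cd <= upper
    crosses_neg_cd = lower <= null - cd <= upper

    if not crosses_null:
        return "significantly different"
    if crosses_pos_cd and crosses_neg_cd:
        return "insufficient power, no conclusion"
    if not crosses_pos_cd and not crosses_neg_cd:
        return "significantly equivalent"
    if not crosses_pos_cd:
        return "group 1 significantly not greater than group 2"
    return "group 1 significantly not less than group 2"


# Critical Difference of 10, as in Example 1 on this page
for interval in [(9, 23), (-3, 4), (-15, 4), (-4, 15), (-12, 15)]:
    print(interval, "->", interpret_ci(*interval, cd=10))
```

The equivalence case is checked before the two one-sided cases because an interval lying between -CD and +CD satisfies both of them. The function only labels the interval; which hypothesis is being tested must still be decided before the data are examined.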
When the data presented for analysis contain 3 or more groups, the following analyses are carried out
 One Way Analysis of Variance, partitioning the variance into that between the groups and that within the groups, and testing
(using F) whether the Type I Error (α) is small enough for the null hypothesis to be rejected.
 Three post hoc tests, testing all pairs of groups for significant differences. These are
 Least Significant Difference using Tukey's algorithm, a robust and reliable test of group differences, but one that lacks sensitivity
 Least Significant Difference using Scheffe's algorithm, a much more powerful (sensitive) test of group difference
 The 95% confidence interval for all the differences between pairs of groups.
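The partition of variance in the One Way Analysis of Variance can be carried out from summary data (n, mean, and Standard Deviation of each group). The sketch below is illustrative, with a hypothetical function name, and uses the summary figures from Example 2 on this page. The probability of F is not computed, as the F distribution is not available in the Python standard library.

```python
def one_way_anova_from_summary(ns, means, sds):
    """Sums of squares, mean squares, and F for a one way analysis of variance."""
    k = len(ns)
    total_n = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    # Between groups: spread of group means around the grand mean
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Within groups: pooled spread of values around their own group mean
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    df_between, df_within = k - 1, total_n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


table = one_way_anova_from_summary(
    ns=[130, 130, 130], means=[3520, 3600, 3690], sds=[392, 401, 398]
)
for name, value in table.items():
    print(f"{name}: {value:.1f}")
```

Run on the Example 2 figures, the sketch reproduces the sums of squares, mean squares, and F shown in the Analysis of Variance table of that example.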
Sample size estimations for unpaired differences in measurements are provided in the Sample Size for Unpaired Differences Program Page,
which provides two algorithms.
In the common situation where there are only two groups, sample size estimation is based on the z and t
distributions, and four programs are provided.
 When there is no information available regarding a proposed research project, a pilot study is sometimes useful.
The idea is to carry out a small and preliminary study, to test the feasibility of conducting the project, and to obtain
some basic parameters to plan the
main project. The sample size required here can be much smaller than that required for statistical inference, but
the results can nevertheless be informative. The emphasis is on cost efficiency. The combination of a nominated Standard Deviation and sample
size allows the calculation of the 95% confidence interval of the results. How this confidence interval decreases with
increasing sample size is examined, and a decision on the most cost effective sample size is made. Regardless of the values
used, however, most pilot studies reach peak cost effectiveness between 6 and 30 subjects per group.
 Sample size calculation is used in the planning phase of the main research project, estimating sample size (per group)
required, based on the following parameters.
 The probability of Type I Error (α), to be used to decide whether the null hypothesis is to be rejected. The most
common value is α=0.05
 The power of the research model (1 - β), the probability of detecting the difference if it is present. The most
common value used is 0.8
 The effect size, the ratio between the difference to be detected, and the known population or within group Standard
Deviation. Two approaches are often used
 If an accurate estimation of the Standard Deviation is known, and a clinically meaningful difference is determined,
these values can be used in the calculation
 If these values cannot be accurately determined, a ratio of 0.2 can be used to detect a small effect size,
0.8 for a large effect size, and 0.5 in most clinical scenarios for a moderate effect size. The values to be
entered are then the effect size as the difference to be detected, with the Standard Deviation = 1
 Power estimation is useful at the data analysis phase of a research project, particularly if the Type I Error (α)
is too large (>0.05) to reject the null hypothesis
 If a high α value arises because the difference between the means is smaller than the critical value defined at
the time of planning, then the correct decision is not to reject the null hypothesis.
 Often, however, the difference between the means exceeds the critical value defined during planning, but a high α
value is related to Standard Deviations larger than those envisaged during planning. When this happens, power
estimation provides an estimate of the extent of discrepancy between planning and the reality reflected by the data,
allowing researchers to interpret the results with greater nuance or to provide remedies (such as increasing the
sample size) to validate the results.
 Estimating the confidence interval is used to provide an estimation of 95% confidence interval of the difference found, as
an alternative to the probability of Type I Error (α)
If the hypothesis to be tested is not whether a significant difference exists, but whether a significant no-difference exists (group 1 not less or not more than group 2, or equivalence), the model requires greater power and can relax robustness. A Type I Error of 0.2 instead of 0.05, and a power of 0.95 instead of 0.8, are commonly used.
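For two groups, a sample size calculation from α, power, and effect size can be sketched with the normal (z) distribution. The formula below is an assumption: it is the common normal approximation with an added small-sample correction term (z squared over 4), chosen because it reproduces the figures quoted on this page (64 per group for an effect size of 0.5). It is not claimed to be the exact algorithm in the StatsToDo programs, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80, two_tail=True):
    """Approximate sample size per group for comparing two means.

    effect_size = difference to detect / within group Standard Deviation.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_tail else nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    n += z_alpha ** 2 / 4  # small-sample correction term (assumed, see lead-in)
    return math.ceil(n)


print(sample_size_per_group(0.5))              # moderate effect, alpha 0.05, power 0.8
print(sample_size_per_group(0.8))              # large effect
print(sample_size_per_group(0.5, 0.2, 0.95))   # settings suggested for no-difference models
```

Entering the difference and the Standard Deviation separately, or entering the effect size with a Standard Deviation of 1, gives the same answer, since only their ratio is used.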
When more than two groups are being compared, calculations for the sample size are more complicated. Two approaches
can be used.
 Machin et al., in their book on sample size (see references), suggested that sample size estimation should be the same
as that for two groups, without the need to use Bonferroni's correction for multiple comparisons.
 Cohen, however, provided algorithms for calculating sample size, based on the F distribution, suitable for most analyses
of variance, including the One Way Analysis of Variance used to compare multiple group means. StatsToDo provides this calculation, but echoes Cohen's caution on how it is used.
 The use of a different probability distribution, particularly when calculations involve iterative approximations,
will produce different results because of rounding errors. An example is in estimating sample size for two groups,
α=0.05, power=0.8, and effect size = difference / Standard Deviation = 0.5. Algorithms based on the z
distribution described in Machin's book result in a sample size of 64 subjects per group, but algorithms based on
the F distribution described in Cohen's book result in 63 subjects per group.
 The effect size used in calculations is based on a ratio of the difference between groups and the background Standard
Deviation. When there are more than two groups, the differences between each pair of groups tend to be different, so
the effect size needs to be adjusted. Cohen suggested that the largest distance should be used, but provides 3 models
for adjustment
 The first model (f1) is when there is minimal variability in the other differences, that, other than the
maximum difference used for calculation, all other differences are smaller and similar to each other
 The second model (f2) is when all the differences are different and their sizes evenly distributed
 The third model (f3) is that despite variabilities, the other differences between group means are
similar to those used for calculations
 In the first two models (f1 and f2), the averaged difference between group means is smaller than the maximum
difference used for calculation, so the sample size required per group needs to be adjusted upwards to detect the
smaller difference as the number of groups increases
 In the third model (f3), all the differences are roughly the same as the maximum used in calculation,
so that the sample size required per group decreases as the number of groups increases. This model (f3) therefore
requires the smallest sample size per group. Cohen suggested that this model be used as a default unless there
are reasons to use the other models.
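The adjustment of effect size for the three models can be illustrated numerically. The sketch below assumes that f1, f2, and f3 correspond to the minimum, intermediate, and maximum variability patterns in Cohen's text, and uses the conversion formulae from that text; whether the StatsToDo program uses exactly these formulae is an assumption, and the function name is hypothetical. Here d is the largest difference between group means divided by the Standard Deviation, and k is the number of groups.

```python
import math


def cohen_f_patterns(d, k):
    """Effect size f for k groups under three patterns of group means.

    d: largest difference between means / within group Standard Deviation.
    Returns (f1, f2, f3): minimum, intermediate, and maximum variability.
    """
    if k < 2:
        raise ValueError("at least two groups are required")
    # f1: one mean at each extreme, all other means at the midpoint
    f1 = d * math.sqrt(1 / (2 * k))
    # f2: means evenly spaced between the two extremes
    f2 = (d / 2) * math.sqrt((k + 1) / (3 * (k - 1)))
    # f3: means split as evenly as possible between the two extremes
    if k % 2 == 0:
        f3 = d / 2
    else:
        f3 = d * math.sqrt(k * k - 1) / (2 * k)
    return f1, f2, f3


# Example 2 on this page: largest difference 100g, Standard Deviation 400g, 3 groups
f1, f2, f3 = cohen_f_patterns(100 / 400, 3)
print(f"f1={f1:.4f}  f2={f2:.4f}  f3={f3:.4f}")
```

With three groups, f1 and f2 are identical, because the single middle group sits at the midpoint in both patterns, while f3 is larger. This is consistent with Example 2 on this page, where the f1 and f2 models require the same sample size and the f3 model a smaller one.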
The numbers in these examples are generated by computer to demonstrate the statistics; they are not real.
Example 1 (2 Groups)
Example 2 (>2 Groups)
Sample size per group  95% CI  Decrease in CI  Decrease per case  % Decrease per case
4  17
6  13  4  2  13
8  11  2  1  8
10  9  1  1  6
12  8  1  0  5
Example 1. We wish to compare the physiological stress caused by two methods of performing hysterectomy, by
laparotomy and by laparoscopy. We will use the difference between preoperative heart rate and the average heart rate
the day after operation as the indicator of stress. From experience, we expect the Standard Deviation of heart rate change to
be 20 beats per minute, and we consider a difference of 10 beats per minute, in either direction, to be of clinical importance.
Step 1 : Pilot study. We wish to conduct a pilot study to see if the project is feasible, and to check that our parameters
are at least in the ball park. We used the Standard Deviation of 20 and produced the two tail model table as shown to the right.
We decided to conduct the pilot study using 10 cases for each operation, a total of 20, as the most cost effective sample size,
because the confidence interval was by then within the levels we were considering as meaningful.
Step 2 : Sample Size Determination. After a successful pilot study, a decision was made to conduct the full study. Using
α=0.05, power=0.8, within group Standard Deviation=20, a difference we wish to detect = 10 (effect size es =
difference / Standard Deviation = 10 / 20 = 0.5), and a two tail model, we looked up our sample size table in the
Sample Size for Unpaired Differences Tables Page and decided that the study should have 64 cases in each group, a total of 128 cases.
Group  n  mean increase in heart rate  Standard Deviation
Laparotomy  60  15  22
Laparoscopy  68  -1  19
Step 3 : Using computer generated random numbers, we randomly allocated patients for hysterectomy who volunteered for this study
into the two operation groups. We measured the average pulse rate the day before and the day after the operation, used any increase
as an indicator of stress, and summarised the results as shown in the table to the right.
Step 4 : Results of data analysis
 Homogeneity of variance : F Ratio = 3.0, p<0.0001. Bartlett's chi sq=1.35, p=0.25. The F Ratio was considered too sensitive
because of the large sample size. Instead, Bartlett's test was accepted and the two variances considered homogeneous.
 Difference in increased heart rate (beats per minute) (mean_{laparotomy} - mean_{laparoscopy}) = 16,
Standard Error of the difference = 3.6
 t test : t = 4.42, degrees of freedom = 60 + 68 - 2 = 126, α_{2 tail} p<0.001
 95% confidence interval (2 tail) : 9 to 23
 Conclusion : Those who had hysterectomy by laparotomy, compared with those who had it by laparoscopy, had a greater increase in postoperative
heart rate. The average difference was 16 beats per minute, in excess of our decision criterion of 10, and the 95% confidence
interval was 9 to 23 beats per minute. These results are statistically significant.
Example 2. We wish to study whether at term babies from 3 different ethnic groups (Caucasian, Chinese, Indian) have different birth weights.
As we will conduct the study by searching birth weight records, we decided that a pilot study was not necessary.
From experience and from publications, we accepted that the population Standard Deviation of birth weight is 400g, and we decided that a difference of the means between the groups of 100g would be clinically meaningful. We set our decision criteria at α=0.05
and power=0.8.
Step 1. Sample size : Using α=0.05, power=0.8, Standard Deviation=400g, the largest difference to detect=100g, and the
program in the Sample Size for Unpaired Differences Program Page, the sample sizes required were 173 cases per group for
the f1 and f2 models, and 130 cases per group for the f3 model. As we anticipated that the differences between the groups
would be similar, we chose the f3 model, 130 birth weights per ethnic group.
Group  n  mean birth weight (g)  Standard Deviation (g)
Chinese  130  3520  392
Indian  130  3600  401
Caucasian  130  3690  398
Step 2. Data collection : We searched through our records, and randomly selected 130 birth weights from each of the three ethnic groups,
the summary of which is shown in the table to the right.
Step 3. Analysis of Results :
Source  df  SSQ  MSQ  F  α,p 
Between Grps  2  1880666.667  940333.3  5.97  0.003 
Within Grps  387  61000101  157623.0  
Total  389  62880767.67 
 Homogeneity of variance : Bartlett Test : chi sq=0.07 df=2 p=0.97. The decision was that the variances were homogeneous.
 The table of Analysis of Variance is as shown to the right. Collectively, the differences between groups were statistically significant at the α p=0.003 level.
 Post hoc analysis, at the α p=0.05 level
Ethnicity  Ethnicity  Observed Difference  lsd (Tukey)  lsd (Scheffe)  95% CI
Indian  Chinese  -80g  930  148  -177 to 17
Indian  Caucasian  -170g  930  148  -267 to -73
Chinese  Caucasian  -90g  930  148  -187 to 6.8
 According to Tukey's algorithm, all between group differences are less than the least significant difference,
and the conclusion is that, individually, the groups are not significantly different from each other
 According to Scheffe's algorithm, the difference between Indian and Caucasian mean birth weights is greater than the least
significant difference; the other differences are not.
 According to the 95% confidence intervals, the interval for the difference between Indian and Caucasian birth weights does
not overlap the null (0) value, but the other comparisons do.
 The overall interpretation of the data is :
 Taken together, the three groups are significantly not homogeneous.
 Individual group comparisons indicate that Indian babies have the lowest birth weight, Caucasians the highest,
and the Chinese in between. The difference between Indian and Caucasian birth weights is statistically significant,
but Chinese birth weights are not significantly different from those of the other groups.
Comparisons
Sample Size
Example
When measurements are not continuous and normally distributed, they are not
parametric. Without the assumptions of normal distribution, the powerful methods
using partition of variance cannot be applied.
In many cases, the data have a distribution that is mathematically related to the
normal distribution, such as a squared or exponential relationship, and after some form of mathematical transformation,
the parametric statistical tests can be applied.
In other cases, the nature of the data does not allow for such transformations,
and the data has to be analysed as ordinal or ordered arrays. Examples of these
can be as finely granular as a personality or depression score with a wide
range, or more commonly a 10 point semantic differential scale, a 5 point Likert scale,
or as coarse as a 3 point pain score (none (0), little (1), lots (2)).
StatsToDo offers one nonparametric test comparing two or more groups, two comparing two groups, and one for multiple (>2) groups.
Nonparametric comparisons of two or more groups of measurements : The Median Test
This test evaluates whether the number of cases < and >= the median level in all the groups are similar,
so testing the null hypothesis that all groups have similar median values. The values of all groups are ranked
collectively, and the median value obtained. The number of cases < and >= the median value in each group
are then calculated, and compared.
 When there are only two groups, the Fisher's Exact Probability Test is used if the total sample size
is less than 20, otherwise the Chi Square Test with Yates Correction is used.
 Where there are more than 2 groups, the standard Chi Square Test for goodness of fit is used
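For two groups, the Median Test described above can be sketched as follows. This is an illustration with hypothetical function names; it implements only the Chi Square Test with Yates Correction (not Fisher's Exact Probability Test for small samples), and the probability uses a closed form valid for 1 degree of freedom. The data are the 5 point Likert responses from Example 1 of this section.

```python
import math
from statistics import median


def expand(freq):
    """Turn a {score: count} table into a list of individual scores."""
    return [score for score, count in freq.items() for _ in range(count)]


def median_test_two_groups(group1, group2):
    """Median Test for two groups, Chi Square with Yates Correction (df = 1)."""
    cut = median(group1 + group2)  # median of all values ranked collectively
    a = sum(v < cut for v in group1)   # group 1, below the median
    b = len(group1) - a                # group 1, at or above the median
    c = sum(v < cut for v in group2)   # group 2, below the median
    d = len(group2) - c                # group 2, at or above the median
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    chi_sq = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi_sq / 2))  # chi square probability for df = 1
    return cut, (a, b, c, d), chi_sq, p


af = expand({1: 1, 2: 3, 3: 7, 4: 10, 5: 9})
ca = expand({1: 3, 2: 3, 3: 12, 4: 6, 5: 6})
cut, table, chi_sq, p = median_test_two_groups(af, ca)
print(f"median={cut}, counts={table}, Chi Sq={chi_sq:.4f}, p={p:.4f}")
```

On the Example 1 data this gives the same Chi Square and probability as reported in that example.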
Nonparametric comparisons of two groups of measurements :
The Wilcoxon-Mann-Whitney Test is a test of the null hypothesis for two sets of ordinal data, assuming that the distributions in the two groups are similar.
The Mann-Whitney U Test, described in the 1962 edition of Siegel's Nonparametric Statistics for the Behavioral Sciences, has
been renamed the Robust Rank Ordered Test, as the term Mann-Whitney U Test now refers to another test, described in
Wikipedia. That other test is not provided in StatsToDo,
and the term Mann-Whitney U Test is only used to maintain backwards compatibility of this web site. The correct name of the test
should be the Robust Rank Ordered Test.
The Robust Rank Ordered Test is considered more robust than the Wilcoxon-Mann-Whitney Test, because it makes no assumption that the two groups are from the same population, and it is nearly as powerful as the parametric t test. It tests the null hypothesis that the two median values are not different.
Nonparametric comparisons of three or more groups of measurements : The Kruskal-Wallis One Way Analysis of Variance
This test tests the null hypothesis that all the groups are from a homogeneous population.
 As well as providing a significance test (α) for the null hypothesis, the program also produces the mean rank
values for each group, which can be used in post hoc analysis comparing individual pairs of groups.
 Dunn's Test is one of the post hoc analyses between groups, and is carried out with the main analysis as it
requires the original data for computation
 The more flexible and commonly used post hoc test is the Least Significant Difference between Mean Ranks. This
allows the comparison of the mean ranks between any two groups, assuming that every group is to be compared with
every other group. This test is more flexible as it requires only the total sample size, and the sample sizes
and mean rank values of the two groups being compared.
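The Kruskal-Wallis calculation and the Least Significant Difference between Mean Ranks can be sketched as below. This is an illustration with hypothetical function names. It assumes the usual H statistic with correction for ties, and assumes that the least significant difference is the normal deviate for α / (k(k - 1)) multiplied by the Standard Error of the difference in mean ranks, which reproduces the figures in Example 2 of this section. The data are the Likert responses from that example.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Rank values from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return H (corrected for ties), degrees of freedom, and mean rank of each group."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = midranks(pooled)
    mean_ranks, rank_term, start = [], 0.0, 0
    for g in groups:
        group_ranks = ranks[start:start + len(g)]
        start += len(g)
        rank_term += sum(group_ranks) ** 2 / len(g)
        mean_ranks.append(sum(group_ranks) / len(g))
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    ties = sum(t ** 3 - t for t in (pooled.count(v) for v in set(pooled)))
    h /= 1 - ties / (n_total ** 3 - n_total)
    return h, len(groups) - 1, mean_ranks


def least_significant_rank_diff(n_total, n_a, n_b, k, alpha=0.05):
    """Smallest difference in mean ranks significant at alpha, all pairs compared."""
    z = NormalDist().inv_cdf(1 - alpha / (k * (k - 1)))
    return z * math.sqrt(n_total * (n_total + 1) / 12 * (1 / n_a + 1 / n_b))


def expand(freq):
    return [score for score, count in freq.items() for _ in range(count)]


af = expand({1: 1, 2: 3, 3: 7, 4: 10, 5: 9})
ca = expand({1: 3, 2: 3, 3: 12, 4: 6, 5: 6})
asn = expand({1: 6, 2: 7, 3: 12, 4: 3, 5: 2})
h, df, mean_ranks = kruskal_wallis([af, ca, asn])
print(f"H={h:.2f} df={df} p={math.exp(-h / 2):.4f}")  # closed form valid for df = 2 only
print([round(r, 2) for r in mean_ranks])
print(round(least_significant_rank_diff(90, 30, 30, k=3), 1))
```

The probability of H is taken from the chi square distribution; the one-line closed form used here applies only when there are three groups (2 degrees of freedom).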
Sample size determination is difficult for nonparametric data, as a precise estimate
depends on knowing the distribution patterns in the groups to be tested.
Such an approach is barely possible, using specialised programs, when there are only two groups and when
the range of measurements is not too great. In most cases, only an approximation of the sample size
requirements can be estimated.
In Siegel's book (see references), the term power efficiency is used to represent the difference
in sample size required. The Wilcoxon-Mann-Whitney Test has similar power to the t test, but only when the sample size is
large (>30 per group).
The Robust Rank Ordered Test and the Kruskal-Wallis test each have 95.5% of the power efficiency of
the F or t Test in parametric Analysis of Variance, so the sample size must be calculated with a higher
power requirement. The algorithm for calculating an approximate sample size requirement for nonparametric tests
is described in the Sample Size Introduction and Explanation Page, so will not be repeated here.
The data for the following examples are not real, but made up to demonstrate the statistics.
Example 1 : Comparing two groups
We want to study the attitudes towards educational affirmative action of different
ethnic groups, using a 5 point Likert Scale.
The statement is "Having a quota of university places reserved for each ethnic
group is a good thing". The responses are 1=Strongly disagree, 2=disagree, 3=neutral,
4=agree, and 5=Strongly agree.
Sample size
If we take the range (1-5) as mean ± 1.96SD, then SD = range/3.92 =
5 / 3.92 = 1.28.
We would like to have a sample size capable of detecting a difference of
1 between the two groups. The effect size = difference / SD = 1 / 1.28 = 0.78
We want a power equivalent to 0.8 in a parametric test. As the nonparametric test
has a power efficiency of 95.5%, we define the power required as 1 - 0.2 × 0.955 = 0.81.
Using α = 0.05, power = 0.81, and effect size = 0.78 in the program from the
Sample Size for Unpaired Differences Program Page, the sample size requirement is
28 respondents per group for a two tail study.
The study
Response  Af  Ca
SD  1  3
D  3  3
N  7  12
A  10  6
SA  9  6
We received the responses from 30 African-Americans and 30 Caucasian-Americans, and
found the results as shown on the left. Af = African-Americans, and Ca = Caucasian-Americans
Median Test
Chi Sq=2.4027 df=1 p=0.1211
Wilcoxon-Mann-Whitney Test
z = 1.3524 p = 0.0583
Af : n = 30 W = 1017 Mean rank = 33.9
Ca : n = 30 W = 813 Mean rank = 27.1
Robust Rank Ordered Test
U = 1.663 p=0.0482

The results of analysis (to the right) show no significant difference using the Median Test and the Wilcoxon-Mann-Whitney Test,
but a significant difference using the Robust Rank Ordered Test. The marginal conclusion is therefore that African-Americans and
Caucasian-Americans do not differ significantly in their attitudes toward affirmative action in education.
Example 2 : Comparing 3 groups
What was not shown in Example 1 is the data from the third group, the Asian-Americans,
which will be shown in this example.
Response  Af  Ca  As
SD  1  3  6
D  3  3  7
N  7  12  12
A  10  6  3
SA  9  6  2
We received the responses from 30 African-Americans, 30 Caucasian-Americans, and
30 Asian-Americans, and found the results as shown on the left. Af = African-Americans, Ca = Caucasian-Americans,
As = Asian-Americans
Median Test : Chi Sq=27.4667 df=2 p<0.0001
Kruskal-Wallis One Way Analysis of Variance
Grp  n  mean rank 
Af  30  56.9 
Ca  30  47.1 
As  30  32.5 
Kruskal-Wallis H = 14.09 df = 2 p = 0.0009
Minimum significant diff in rank (Siegel and Castellan)
grp  grp  Diff  Dif(0.05)  Dif(0.01)  Dif(0.005) 
Af  Ca  9.9  16.1  19.8  21.2 
Af  As  24.4  16.1  19.8  21.2 
Ca  As  14.5  16.1  19.8  21.2 
Minimum Significant diff in rank (minimum Q by Dunn)
grp  grp  Q  Q(0.05)  Q(0.01)  Q(0.001) 
Af  Ca  1.5  2.4  2.9  3.6 
Af  As  3.7  2.4  2.9  3.6 
Ca  As  2.2  2.4  2.9  3.6 

The results of analysis are as shown to the right.
The 3 groups have significantly different attitudes towards
affirmative action in university education, both in the Median Test and the Kruskal-Wallis One Way Analysis of Variance.
Post hoc analysis using least significant difference shows that the least significant
difference in mean ranks is 16.1 at the p (α) = 0.05 level.
Caucasian-Americans (grp 2) have a mean rank of 47.1, and its difference from that of
African-Americans (grp 1, mean rank = 56.9) is 9.9, less
than the least significant difference at the p<0.05 level. Its difference from that of Asian-Americans (grp 3,
mean rank = 32.5) is 14.5, also less than the least significant difference at the p<0.05 level.
Caucasian-Americans are therefore not significantly different from the other two groups.
African-Americans (grp 1, mean rank = 56.9), however, are different from Asian-Americans
(grp 3, mean rank = 32.5). The difference is 24.4, which is larger than the
least significant difference, even at the p (α)<0.005 level.
The Dunn Test shows a similar pattern and supports similar conclusions.
The conclusion to be drawn is therefore that African-Americans are significantly more positive towards affirmative
action in education than Asian-Americans, with the Caucasian-Americans in between, not significantly different from
the other two groups.
Introduction
Example
The Permutation Tests are the most basic of statistical tests, from which other models have developed.
StatsToDo presents two models, the significance test for paired differences presented in the
Paired Difference Programs Page, and the significance test comparing two groups presented in the
Unpaired Difference Programs Page.
The general principle is that, in a randomly allocated study, each value obtained could have been in
either of the two groups being compared. The test consists of calculating every
possible permutation of the data, and examining the results. If the result from the original data
is near the extremes (e.g. below the 5th percentile or above the 95th percentile in a one tail model),
then a decision can be made that it is unlikely to be null and is therefore statistically significant.
The advantages of using the Permutation tests are :
 Exhaustive permutation allows the calculation of the precise probability that the data presented is null,
so the tests calculate the Type I Error (α), with a power (1 - β) of 100%.
 The tests are not dependent on any assumption of data distribution, so they can be used on any regular interval data
(where 10 - 9 is the same as 4 - 3). The tests can therefore be used on parametric measurements, ratios, variances, and time.
 Because of the above two characteristics, the tests can be used with a very small sample size
The disadvantages of using the tests are related to the computation intensity required, both in the large memory use,
and the time required for computation. In the unpaired situation, where the sample size of group 1 is n1 and that of
group 2 is n2, and the total nt=n1+n2, the total number of permutations is the Binomial Coefficient of nt and n1
(Number of permutations = nt!/(n1!n2!)). Computation time therefore increases exponentially with increasing sample size,
and a large dataset may either crash the program when available RAM is exhausted, or make the computation time unacceptably
long.
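The growth in the number of permutations can be seen with a few lines of code. This is a simple illustration of the formula nt!/(n1!n2!), using a hypothetical function name.

```python
from math import comb


def n_permutations(n1, n2):
    """Number of ways to split n1 + n2 values into groups of size n1 and n2."""
    return comb(n1 + n2, n1)  # equals (n1 + n2)! / (n1! * n2!)


for n1, n2 in [(3, 3), (5, 10), (10, 10), (13, 13), (20, 20)]:
    print(f"n1={n1:2d} n2={n2:2d} permutations={n_permutations(n1, n2):,}")
```

Two groups of 13 already require more than ten million permutations, and two groups of 20 more than a hundred billion.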
The Permutation Test is therefore ideal for comparing two groups using small sets of interval data with uncertain distributions.
With a larger sample size, the more common nonparametric tests (the Median Test, the Wilcoxon-Mann-Whitney Test or the
Robust Rank Ordered Test) or parametric tests (the Unpaired t test) should be preferred.
In theory, the Permutation Test can cope with any sample size. However, a probability of <0.05 is not possible with
fewer than 3 subjects in each group, and computation will take an unacceptably long time if the total sample size (n1 + n2)
exceeds 26 subjects.
The mathematical argument of the Permutation Test is as follows.
 In two groups of measurements, the null hypothesis is that there is no difference between the means. In other words,
the values observed could equally have been in either group.
The Permutation Test therefore consists of examining the difference between the sums of values in the two groups, preserving
the original sample sizes of the groups but with the data distributed in all possible permutations. The total number of
permutations is nt!/(n1!n2!).
The difference between the sums in the original data is then compared with the differences from all possible permutations, so that its probability can be
estimated.
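The argument can be written compactly as follows. The symbols N, S_1, S_2, T and D are introduced here for illustration only and do not appear in the original text.

```latex
N = \binom{n_t}{n_1} = \frac{n_t!}{n_1!\,n_2!}, \qquad n_t = n_1 + n_2

D = S_1 - S_2 = 2S_1 - T, \qquad T = S_1 + S_2
```

Here S_1 and S_2 are the sums of the values allocated to group 1 and group 2 in a given permutation. Because the total T is the same for every permutation, ranking the permutations by the difference D is equivalent to ranking them by the group 1 sum S_1 alone.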
We use the default data for the program in the Unpaired Difference Programs Page
as an example.
grp  val 
1  50 
1  57 
1  70 
1  60 
1  55 
2  58 
2  65 
2  70 
2  70 
2  72 
2  70 
2  72 
2  60 
2  77 
2  75 
We have two groups, group 1 with 5 cases, and group 2 with 10 cases, a total of 15 cases. The data are in two columns,
as shown above. Column 1 is the group designation, and column 2 the value.
Step 1. The difference between the two groups. The sum of measurements in group 1 is 292 and that in group 2 is 689. The difference (group 1 minus group 2) is 292 - 689 = -397.
Step 2. The mathematics of permutation. With sample sizes of 5 and 10, the Binomial Coefficient = 15!/(5!10!) = 3003.
If we are to use a two tail model at α=0.05, then there is 0.025 in each tail.
At the two tail α=0.05 level therefore, one would expect the 0.025 x 3003 = 75 permutations with the most extreme values in each tail
to be those that can be considered as statistically significant.
Step 3. The differences of sums from the two groups for all 3003 permutations are calculated and compared with the -397 from the data.
There are 18 permutations where the difference between the groups is less than -397, 2974 permutations where the difference is
more than -397, and 11 where the difference is -397.
Step 4. Drawing conclusions.
 As 18 is less than the 75 which define the decision border for α=0.05 (2 tail model),
we can conclude that the difference obtained from the original data is unlikely under the null hypothesis, and therefore statistically significant
at the p<0.05 level (two tails).
 Looking at it another way, there are 18 permutations with values less than the -397 from the data. The data is therefore
the 19th value from the minimum, or at the 19/3003 x 100 = 0.63th percentile. This is less than the 2.5th percentile required
for statistical significance at p=0.05 (two tails), so it is statistically significant. Another way of putting it is
that the observed difference is statistically significant at the level of p=0.0063 (one tail), or doubling to 0.0126
for the two tail model.
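The four steps can be checked with a minimal sketch of exhaustive permutation, using the example data listed above. It is not the code of the program on the Unpaired Difference Programs Page; the function name permutation_counts and the variable names are assumptions made for illustration. The difference is taken as the group 1 sum minus the group 2 sum, and the sketch reproduces the counts of 18, 11 and 2974 quoted in Step 3.

```python
from itertools import combinations

# Example data from the table above
group1 = [50, 57, 70, 60, 55]
group2 = [58, 65, 70, 70, 72, 70, 72, 60, 77, 75]


def permutation_counts(g1, g2):
    """Enumerate every allocation of the pooled values to a group of size
    len(g1), and count how many give a difference of sums (group 1 minus
    group 2) below, equal to, or above the observed difference."""
    pooled = g1 + g2
    total = sum(pooled)
    observed = sum(g1) - sum(g2)
    less = equal = more = 0
    # Choose positions rather than values, so tied values count separately
    for idx in combinations(range(len(pooled)), len(g1)):
        s1 = sum(pooled[i] for i in idx)
        diff = 2 * s1 - total  # s1 - (total - s1)
        if diff < observed:
            less += 1
        elif diff == observed:
            equal += 1
        else:
            more += 1
    return observed, less, equal, more


observed, less, equal, more = permutation_counts(group1, group2)
n_perm = less + equal + more
# Rank of the observed data from the minimum, as used in Step 4
one_tail_p = (less + 1) / n_perm

print("observed difference:", observed)
print("permutations less / equal / more:", less, equal, more)
print("total permutations:", n_perm)
print("one tail p (rank from minimum):", round(one_tail_p, 4))
print("two tail p (doubled):", round(2 * one_tail_p, 4))
```

Enumerating positions in the pooled list, rather than distinct values, keeps each of the tied values (such as the four 70s) as a separate case, which is what the binomial coefficient counts.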
Tests for Homogeneity of Variances
Bartlett's Test : I have not read this reference, but it is quoted in
the NIST web site from which I obtained the algorithm. Snedecor, George W.
and Cochran, William G. (1989), Statistical Methods, Eighth Edition,
Iowa State University Press.
Levene's Test : I have not read this reference, but it is quoted in
the NIST web site from which I obtained the algorithm. Levene, H. (1960).
In Contributions to Probability and Statistics: Essays in Honor of
Harold Hotelling, I. Olkin et al. eds., Stanford University Press, pp. 278-292.
Formulae : I obtained these from the National Institute of Standards
and Technology (NIST) resource website. The pages used are the handbook index,
the Bartlett test page,
and the Levene test page.
Parametric Comparisons
Confidence Intervals : Altman DG, Machin D, Bryant TN and Gardner MJ.
(2000) Statistics with Confidence Second Edition. BMJ Books
ISBN 0 7279 1375 1. p. 28-31
One-Way Analysis of Variance : Armitage P. Statistical Methods in
Medical Research (1971). Blackwell Scientific Publications. Oxford. p. 189-207.
Least significant difference (Tukey) :
 Armitage P. Statistical Methods in Medical Research (1971). Blackwell Scientific Publications. Oxford. p. 189-207
 Steel R.G.D., Torrie J.H., Dickey D.A. Principles and Procedures of Statistics.
A Biometrical Approach. 3rd. Ed. (1997) ISBN 0070610282 p. 191-192
 Studentised range tables : Pearson ES, Hartley HO (1966) Biometrika Tables for Statisticians Ed. 3 Table 29.
Least significant difference (Scheffe) :
 Scheffe H (1959) The Analysis of Variance NY Wiley (quoted by everyone else but I have not read it)
 Pedhazur E.J. Multiple regression in behavioral research: explanation and prediction
(3rd Ed) 1993. Harcourt Brace College Publishers, Orlando Florida. ISBN 0030728312 p. 369-371
 Portney LG, Watkins MP (2000) Foundations of Clinical Research. Applications
to Practice (Second Edition) Prentice Hall Health, New Jersey. ISBN 0838526950. p. 460-461
 Steel R.G.D., Torrie J.H., Dickey D.A. Principles and Procedures of Statistics.
A Biometrical Approach. 3rd. Ed. (1997) ISBN 0070610282 p. 189-190
Confidence interval of difference between means : Bird KD (2002) Confidence Intervals for Effect Sizes in Analysis
of Variance. Educational and Psychological Measurement 62:2:197-226
Nonparametric Tests :
Siegel S and Castellan Jr. N J (1988) Nonparametric Statistics for the
Behavioral Sciences 2nd. Ed. McGraw Hill, Inc. Boston Massachusetts. ISBN 0070573573
 Median Test p. 124 (2 groups), p. 200 (3 or more groups).
 Wilcoxon Mann Whitney Test p. 128-137.
 Robust Rank Order Test (used to be called the Mann Whitney U Test) p. 137-144.
 Kruskal-Wallis One Way Analysis of Variance p. 206-215.
 Least significant difference between mean ranks p. 213-214.
Dunn's Test :
 Zar J.H. (1974) Biostatistical analysis (3rd. Ed) Prentice Hall, New Jersey.
ISBN 0130845426. p. 227-228.
 Table for Q values for Dunn's Test: App. 106 Dunn O.J. (1964) Multiple
contrasts using rank sums. Technometrics 6:241-252
Permutation Test : Siegel S and Castellan Jr. NJ (2000) Nonparametric Statistics for the Behavioral
Sciences. Second Edition. McGraw Hill, Sydney. ISBN 0071003266 p. 151-155
Sample Size
Two Groups : Machin D, Campbell M, Fayers P, Pinol A (1997) Sample Size Tables for Clinical
Studies. Second Ed. Blackwell Science ISBN 0865428700 p. 24-25
Multiple (3+) Groups : Cohen J (1988) Statistical power analysis for the behavioral sciences. Second edition.
Lawrence Erlbaum Associates, Publishers. London. ISBN 0805802835 p. 276-279, p. 550
Equivalence
Rogers JL, Howard KI, Vessey JT. (1993) Using significance tests to evaluate
equivalence between two experimental groups. Psychological Bulletin 113:553-565.
Jones B, Jarvis P, Lewis JA, Ebbutt AF. (1996) Trials to assess equivalence:
the importance of rigorous methods. British Medical Journal 313:36-39
Hwang IK, Morikawa T. (1999) Design issues in noninferiority/equivalence
trials. Drug Information Journal 33:1205-1218
