Content Disclaimer Copyright @2020. All Rights Reserved. |
Links : Home Index (Subjects) Contact StatsToDo
Explanations and References Historical perspectiveThis page provides only a quick summary to provide context for discussions regarding equivalence.In the 19th Century, Fisher developed the idea of Type I Error, based on the Normal Distribution, thus allows a probability estimate for whether the null hypothesis can be rejected. This allows decisions to be made in science and industry on whether a new product or process is better or worse than the current ones available. However, if the null hypothesis cannot be rejected, the researcher cannot draw any statistical conclusion, as a failure to reject null is not the same as an ability to accept null. A generation later, Pearson added the idea of Type II Error and the statistical significance, so that both the ability to reject and accept the null hypothesis can be made. Although this method was widely used in the twentieth century, it was increasingly criticised because results of research are often nor reproducible because of difficulties in determining the population Standard Deviation. To provide robustness to statistical conclusions, researchers increasingly used the 95% confidence interval of the difference, which is an intuitively easier to understand expression of the Type I Error, and to carry out power analysis, which makes no assumption about population parameters. The combination of these two approaches allows researchers to draw confident conclusions whether two sets of observations can be considered significantly different. However, the problem remains that a failure to demonstrate significant difference is not the same as to demonstrate similarity, and the ability to robustly demonstrate similarity is increasingly required, particularly in biomedical research. An example is in cancer treatment. The current treatment may have severe side effects, and a new treatment may have much more acceptable side effects, but the researcher needs to know whether the effectiveness in controlling the cancer is the same, or at least not inferior to the current treatment. The concept of equivalenceGiven the random variations inherent in any set of observations, it is very unlikely to demonstrate two groups to have the same mean values. The term equivalence is therefore used to represent similarity. This is dependent on a pre-determined and arbitrarily assigned Critical Difference (CD) or Tolerance Limit (TL), a difference that can be considered as trivial in the practical sense. Using the 95% confidence interval of the difference to illustrate, the various conclusions that can be drawn are as shown in the diagram (Forest Plot) to the right. Assuming that the difference is that between group 1 and 2 (diff = mean1 - mean2)
One tail or two tail In most text books and published papers, statistics related to equivalence uses the one tail model. This is because most equivalence related research are concerned with non-inferiority, so that not significantly greater or not significantly less are the hypotheses to be tested. As the one tail model allows these conclusions and requires smaller sample size, this is the model to use. StatsToDo however provides calculations for both one and two tails in case any user requires them. Sample size and power calculationSample size and power calculations for equivalence differ from those of significant differences in two ways. Firstly, the decision is based not only on the relationship between the 95% confidence interval and the null value, but in addition with the positive and negative critical values. Secondly, robustness is required for the conclusion of significant equivalence and not on significant difference. The Probability of Type I Error (α) is therefore relaxed, and the common values of 0.1 or 0.2 are used instead of 0.05 or 0.01. The Probability of Type II Error, in terms of power, is made more strict, so the power value of 0.9, 0.95, or 0.99 are used instead of 0.8 Statistical Decisions Users should remember that good statistical practice requires that the hypothesis to be tested is defined at the planning stage, and statistical procedures are used to reject or support that hypothesis. A common malpractice of doing the statistical calculations first, then cherry pick the hypothesis according to how the numbers come together should be avoided. A study to test significant difference or equivalence, the direction of non-inferiority, whether the model should be one or two tail, must be determined at the planning stage before data collection and analysis. Equivalence Between 2 MeansFor equivalence to be validly established, the correct sample size must be estimated and used, and this requires 5 parameters.
The two tail model is only required if the purpose of the study is to establish true equivalence, a confidence interval that is both significantly not less than the negative Critical Difference value, and significantly not greater than the positive Critical Difference value. Analysis of data collected Once the data is collected, analysis requires the following information
The confidence interval is the same as that calculated for comparing the differences between two means, assuming the data to be a sample, and calculations based on the t distribution. This is offered as it is the most common parameter for making statistical decisions, particularly that associated with the 95% confidence interval (α=0.05). Such a decision allows the interpretation for significant difference, significant non inferiority, and equivalence. The power calculation assumes the data to represent population values, so calculations are based on the z distribution. It estimates the probability of detecting the non-inferiority or equivalence at the level of α, if it is truly present. Most clinicians would be content to use the 95% confidence interval to decide whether equivalence exists or not. Statisticians however may wish for more nuance to determine the confidence of the decision, and use the power analysis also. ExampleThe data for the example is artificially created to demonstrate the statistical method, and not based on any real observations.We are offered two nutritional supplements for muscle building, supplement 1 is expensive, and supplement 2 is cheap. We would wish to use supplement 2 if it is not inferior (not building less muscle than supplement 1). We use weight gain over a period of time as the outcome measurement. From experience, we have concluded that the Standard Deviation of weight gain over the trial period with supplements is around 1.5Kg, and decided that difference in weight gain between the two supplement groups is only of interest if it is more than 1Kg. In other words, a difference of less than 1Kg is trivial and has no clinical relevance (CD=1Kg). As we have no difficulty recruiting for either group, we decided to use equal size groups (φ=1) Sample size We decided to use Probability of Type I Error α=0.05 and power=0.90. We anticipated standard deviation=1.5, Critical Difference =1.0, ratio of sample sizes φ=1, and the one tail model. The ratio CD/SD = 1.0 / 1.5 = 0.67. Looking up the table for sample size comparing two means, α=0.05, Power=0.90, CD/SD=0.67, the sample size required is 39 cases per group. Analysis of results
After randomly allocated trial subject into the two groups for the two treatment (group 1 received supplement 1 and group 2 supplement 2), the weight gain obtained are as in the table to the right. These results, with α=0.05 and CD=1.0, are entered into analysis, and the results are as the next table to the right.
Looking at the 95% confidence interval, if the conclusion that supplement 2 is non-inferior to supplement 1, the one tail 95% confidence interval of the difference (meangroup 1 - meangroup 2) must be from -∞ up to +1Kg (positive Critical Difference). The results of the calculation show that the 95% CI is from -∞ to +0.37Kg, not intercepting the positive Critical Difference. We can therefore draw the conclusion that group 1 is significantly not greater than group 2, and that supplement 2 is non-inferior to supplement 1. The data also allow us to draw the conclusion that supplement 1 is non inferior to supplement 2, if we had planned to test that hypothesis, as the one tail 95% confidence interval of the difference would then be -0.77Kg to +∞, not intercepting the negative Critical Difference value. We cannot use these two one tail confidence intervals to conclude that the two supplements are truely equivalent (similar or same), as this will require the two tail confidence interval, which is very much wider. The two tail confidence interval for the difference is -0.88Kg to + 0.48Kg, which allows us to conclude that both supplements are truly equivalent. Another approach to examining the data is to ask the question whether the data has sufficient power to detect equivalence at the α p<0.05 level To test the hypothesis that supplement is non-inferior to supplement 1, we test the power of the difference (meangroup 1 - meangroup 2) not greater than 1Kg., and the power is 0.97. We can therefore conclude with confidence that group 1 is significantly not greater than group 2, and therefore supplement 2 is non-inferior to supplement 1. Power analysis also allows us to test whether supplement 1 is non-inferior to supplement 2, had we planned to test this. The power for this is 0.76, short of the 0.8 required to draw such a conclusion, so the question of whether supplement 1 is non-inferior to supplement 2 remaqins unanswered with the current data sign. The power to test true equivalence requires the two tail model, and the two tail power is 0.66, very much short of the 0.8 required to conclude that the two means are equivalent. ReferencesRogers JL, Howard KI, Vessey JT. (1993) Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin 113:553-565. Jones B, Jarvis P, Lewis JA, Ebbutt AF. (1996) Trials to assess equivalence: the importance of rigorous methods. British Medical Journal 313:36-39 Hwang IK, Morikawa T. (1999) Design issues in noninferiority/equivalence trials. Drug Information Journal 33:1205-1218 Machin D, Campbell M, Fayers, P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second Ed. Blackwell Science IBSN 0-86542-870-0 p. 100-104
Power Analysis :
Sample size Table for equivalence between two means
Power=(1-β), β = Probability of Type II Error α=Probability of Type I Error CD/SD = Ratio of Critical Difference (tolerance limit) / Standard Deviation Cells contain sample size per group, assuming equal size groups
|