Content Disclaimer Copyright @2020. All Rights Reserved. |
Explanations and References

Historical perspective

This page provides only a quick summary, to give context for discussions regarding equivalence.

In the early 20th century, Fisher developed the idea of Type I Error, based on the Normal Distribution, which allows a probability estimate for whether the null hypothesis can be rejected. This allows decisions to be made in science and industry on whether a new product or process is better or worse than those currently available. However, if the null hypothesis cannot be rejected, the researcher cannot draw any statistical conclusion, as a failure to reject the null hypothesis is not the same as an ability to accept it.

Later, Neyman and Pearson added the idea of Type II Error and statistical power, so that decisions both to reject and to accept the null hypothesis can be made. Although this method was widely used in the twentieth century, it was increasingly criticised because results of research are often not reproducible, owing to difficulties in determining the population Standard Deviation.

To provide robustness to statistical conclusions, researchers increasingly used the 95% confidence interval of the difference, which is an intuitively easier to understand expression of the Type I Error, and carried out power analysis, which makes no assumption about population parameters. The combination of these two approaches allows researchers to draw confident conclusions on whether two sets of observations can be considered significantly different.

However, the problem remains that a failure to demonstrate a significant difference is not the same as a demonstration of similarity, and the ability to robustly demonstrate similarity is increasingly required, particularly in biomedical research. An example is in cancer treatment.
The current treatment may have severe side effects, and a new treatment may have much more acceptable side effects, but the researcher needs to know whether its effectiveness in controlling the cancer is the same as, or at least not inferior to, that of the current treatment.

The concept of equivalence

Given the random variations inherent in any set of observations, it is very unlikely that two groups can be shown to have exactly the same mean values. The term equivalence is therefore used to represent similarity. This depends on a pre-determined and arbitrarily assigned Critical Difference (CD) or Tolerance Limit (TL), a difference that can be considered trivial in the practical sense.

Using the 95% confidence interval of the difference to illustrate, the various conclusions that can be drawn are as shown in the diagram (Forest Plot) to the right. The difference is assumed to be that between group 1 and group 2 (diff = mean1 - mean2).
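The comparison of a confidence interval against the Critical Difference can be sketched in code. This is a minimal illustration only: the function name `classify_interval` and the wording of the labels are assumptions made for this sketch, following the conventional reading of an interval against -CD, 0 and +CD, and are not taken from the diagram itself.

```python
def classify_interval(lower, upper, cd):
    """Label a confidence interval for diff = mean1 - mean2.

    lower, upper: bounds of the confidence interval (may be +/- infinity
                  for a one tail interval).
    cd:           the positive Critical Difference (tolerance limit).
    The label wording is hypothetical, chosen for this sketch.
    """
    if cd <= 0:
        raise ValueError("cd must be positive")

    labels = []

    # Conventional test of difference: does the interval exclude 0?
    if lower > 0:
        labels.append("group 1 significantly greater than group 2")
    elif upper < 0:
        labels.append("group 1 significantly less than group 2")
    else:
        labels.append("no significant difference")

    # Equivalence and non-inferiority: compare the interval with -cd and +cd.
    if lower > -cd and upper < cd:
        labels.append("equivalent")
    elif upper < cd:
        labels.append("group 1 not greater than group 2")
    elif lower > -cd:
        labels.append("group 1 not less than group 2")
    else:
        labels.append("inconclusive")

    return labels


# Two tail interval lying wholly inside (-0.2, +0.2)
print(classify_interval(-0.05, 0.10, 0.2))
# One tail interval with no lower bound
print(classify_interval(float("-inf"), 0.141, 0.2))
```

Note that the two questions are separate: an interval can exclude 0 (a significant difference) and still lie inside the tolerance limits, which is why the sketch returns one label for each.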
One tail or two tail

In most text books and published papers, statistics related to equivalence use the one tail model. This is because most equivalence related research is concerned with non-inferiority, so that not significantly greater or not significantly less are the hypotheses to be tested. As the one tail model allows these conclusions and requires a smaller sample size, this is the model to use. StatsToDo however provides calculations for both one and two tails in case any user requires them.

Sample size and power calculation

Sample size and power calculations for equivalence differ from those for significant differences in two ways. Firstly, the decision is based not only on the relationship between the difference between the two groups and the null value, but also on its relationship to the positive and negative critical values. Secondly, robustness is required for the conclusion of significant equivalence rather than of significant difference. The Probability of Type I Error (α) is therefore relaxed, and the common values of 0.1 or 0.2 are used instead of 0.05 or 0.01. The Probability of Type II Error, in terms of power, is made more strict, so power values of 0.9, 0.95, or 0.99 are used instead of 0.8.

Statistical Decisions

Users should remember that good statistical practice requires that the hypothesis to be tested is defined at the planning stage, and that statistical procedures are used to reject or support that hypothesis. The common malpractice of doing the statistical calculations first, then cherry picking the hypothesis according to how the numbers come together, should be avoided. Whether the study is to test significant difference or equivalence, the direction of non-inferiority, and whether the model should be one or two tail, must all be determined at the planning stage, before data collection and analysis.

Equivalence Between 2 Proportions

For equivalence to be validly established, the correct sample size must be estimated and used, and this requires 5 parameters.
The two tail model is only required if the purpose of the study is to establish true equivalence, that is, a confidence interval that is both significantly not less than the negative Critical Difference value and significantly not greater than the positive Critical Difference value.

Analysis of data collected

Once the data are collected, analysis requires the following information
The confidence interval is the same as that calculated for comparing the difference between two proportions. It is offered because it is the most common parameter for making statistical decisions, particularly the 95% confidence interval (α=0.05). Such a decision allows the interpretation of significant difference, significant non-inferiority, and equivalence.

The power calculation is based on a difference calculated using the Maximum Likelihood Model. It estimates the probability of detecting non-inferiority or equivalence at the level of α, if it is truly present.

Most clinicians would be content to use the 95% confidence interval to decide whether equivalence exists or not. Statisticians however may wish for more nuance in determining the confidence of the decision, and use the power analysis.

Example

Please note that the data in this example are artificially created to demonstrate the statistical procedure, and are not based on any observations.

The current chemotherapeutic agent for a particular cancer is fairly effective, with a 1 year recurrence rate of 20% (0.2). However it is rather toxic, producing unacceptable side effects such as marrow depression, infections, and severe nausea. A new agent has been produced which has very few side effects, so that it would obviously be preferable, provided the cure rate is non-inferior. In statistical terms, the 1 year recurrence rate for the new agent should be significantly not higher than that for the old agent.

Sample size calculation

Allowing for variations in observation, we feel that a 20% (0.2) difference in recurrence rate is tolerable, that is, we accept that the new treatment is non-inferior if its recurrence rate is less than 40% (0.4), so we set our critical difference to 0.2. We will use α=0.05 to follow the convention of the 95% confidence interval, and a power (1-β) of 0.9. We looked up the table in the previous panel.
For α=0.05, power=0.9, Proportion=0.2, and CD=0.2, we require 71 cases per group, assuming equal size groups (φ=1).

Analysis of data

We randomly allocated patients to receive the two agents. Group 1 receives the new agent, and group 2 the old. At the end of data collection, we obtained the following results.
The results of analysis are as shown in the next table to the right. The recurrence rate for group 1 (new agent) is 22.5% (0.225) and for group 2 (old agent) is 19.7% (0.197). The difference is 2.8% (0.028). As the hypothesis to be tested is whether the new agent is non-inferior to the old agent, we choose the one tail 95% confidence interval, -∞ to 14.1% (0.141), which does not cross the positive Critical Difference of 20%. We can therefore conclude that group 1 is significantly not greater than group 2, so the new agent is non-inferior to the old agent in terms of one year recurrence rate.
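The one tail interval quoted above can be approximated with a short calculation. The sketch below uses the ordinary normal approximation for the difference between two proportions. The counts of 16 and 14 recurrences out of 71 per group are an assumption: they are simply counts consistent with the reported rates of 22.5% and 19.7%, not figures taken from the results table. The function name `one_tail_upper_bound` is also hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_tail_upper_bound(x1, n1, x2, n2, alpha=0.05):
    """Upper bound of the one tail (1 - alpha) confidence interval
    for diff = p1 - p2, using the normal approximation.

    x1, x2: number of events (here, recurrences) in groups 1 and 2.
    n1, n2: number of cases in groups 1 and 2.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p1 - p2
    # Standard error of the difference between two independent proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # One tail critical value, about 1.645 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha)
    return diff, diff + z * se


# Assumed counts: 16/71 (new agent) and 14/71 (old agent)
diff, upper = one_tail_upper_bound(16, 71, 14, 71)
cd = 0.2
print(f"difference = {diff:.3f}, one tail upper bound = {upper:.3f}")
print("non-inferior" if upper < cd else "non-inferiority not shown")
```

With these assumed counts the upper bound comes to about 0.141, which agrees with the interval reported in the example and lies below the Critical Difference of 0.2.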
The power estimation for group 1 not greater than group 2, using the one tail model, is 0.8, which is less than the 0.9 planned but still marginally acceptable statistically as sufficient power. The same conclusion as that from the 95% confidence interval can be drawn, but with a little less confidence.

References

Rogers JL, Howard KI, Vessey JT. (1993) Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin 113:553-565.

Jones B, Jarvis P, Lewis JA, Ebbutt AF. (1996) Trials to assess equivalence: the importance of rigorous methods. British Medical Journal 313:36-39.

Hwang IK, Morikawa T. (1999) Design issues in noninferiority/equivalence trials. Drug Information Journal 33:1205-1218.

Machin D, Campbell M, Fayers P, Pinol A. (1997) Sample Size Tables for Clinical Studies. Second Ed. Blackwell Science. ISBN 0-86542-870-0. p. 100-104.
Power Analysis :
Table of sample size for equivalence between two proportions
Power = (1-β), where β = Probability of Type II Error
α = Probability of Type I Error
π = Anticipated proportion
CD (tl) = Critical Difference (tolerance limit)
Cells contain sample size per group, assuming equal size groups
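As a rough guide to how a cell of such a table might be obtained, the sketch below uses the simple normal approximation for a one tail (non-inferiority) design in which both groups are anticipated to share the same proportion π. This is an assumption made for illustration, not necessarily the method used to build the table, and the function name `sample_size_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, power, pi, cd):
    """Approximate sample size per group, one tail model, equal groups.

    Simple normal approximation (an assumption of this sketch):
        n = (z_alpha + z_beta)^2 * 2 * pi * (1 - pi) / cd^2
    alpha: Probability of Type I Error
    power: 1 - beta
    pi:    anticipated proportion, assumed the same in both groups
    cd:    Critical Difference (tolerance limit)
    """
    if not (0 < pi < 1) or cd <= 0:
        raise ValueError("pi must be in (0, 1) and cd must be positive")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)  # one tail critical value
    z_beta = z(power)       # value corresponding to the required power
    n = (z_alpha + z_beta) ** 2 * 2 * pi * (1 - pi) / cd ** 2
    return ceil(n)


# Parameters from the worked example
print(sample_size_per_group(alpha=0.05, power=0.9, pi=0.2, cd=0.2))
```

For the example's parameters this approximation gives 69 per group, slightly fewer than the 71 quoted from the table. The gap possibly reflects a different variance estimate in the table's method, so the tabulated value should be preferred when planning a study.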
|