Copyright © 2020. All Rights Reserved.


## Explanations
Comparing two proportions is a common approach in many aspects of clinical research: in comparing efficacies and adverse outcomes under different circumstances, in quality control, and as a basis for meta-analysis to establish evidence for practice.
This page provides the most commonly used methods of comparing proportions in two groups: the Chi Square Test, Fisher's Exact Probability, Risk Difference, Risk Ratio, and Odds Ratio.
## Chi Square Test for 2x2 contingency tables

This tests the probability that the numbers of positive and negative cases in two groups are from a similar population, a goodness of fit test. Until tests using confidence intervals became popular in the 1990s, this was the principal test used to compare two proportions.

For the Chi Square Test to be used, the total number in the study should exceed 30 cases, and each cell should have at least 5 cases. Short of these numbers, the assumptions of the Chi Square distribution cannot be assured and there is a possibility of misinterpretation.
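As an illustration (not part of the original page), the 2x2 chi square statistic can be computed directly from the four cell counts. The sketch below applies Yates' continuity correction, which is the default behaviour of R's `chisq.test` for 2x2 tables; the function name `chi_sq_yates` is our own. Using the first data row of the R program below (12, 5, 4, 16), it reproduces the reported XSq = 7.6313.

```python
def chi_sq_yates(a, b, c, d):
    """Pearson chi square for a 2x2 table [[a, b], [c, d]],
    with Yates' continuity correction (subtract n/2 from |ad - bc|)."""
    n = a + b + c + d
    num = n * (max(abs(a * d - b * c) - n / 2, 0)) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

x2 = chi_sq_yates(12, 4, 5, 16)   # table: 12 Pos / 4 Neg vs 5 Pos / 16 Neg
print(round(x2, 4))               # 7.6313
```

Dropping the correction (omitting the `- n / 2` term) gives the uncorrected Pearson statistic, which is larger; the correction makes the test more conservative for small samples.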
## Fisher's Exact Probability

This estimates the probability of observing the difference in proportions in two groups, departing from the null hypothesis that the two groups are the same. The calculation provided estimates the probability for the two tail test.

As the probability is calculated exactly from the permutations of the table, no assumption about the underlying binomial distribution is made. Because of this, the test remains valid even when the sample size is small, and it is therefore often used when the sample or cell sizes are insufficient for the Chi Square Test. The test uses factorial numbers repeatedly, so that computing time increases rapidly with the number of cases, and traditionally it was only used when the conditions for a Chi Square Test could not be satisfied. With the high speed and large memory capacity of modern computers, however, Fisher's Test can still be easily computed for sample sizes as large as 1000.
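A minimal pure-Python sketch (an illustration, not part of the original page) of the two tail probability: hold the table margins fixed, enumerate every possible table, and sum the hypergeometric probabilities that do not exceed that of the observed table. For the first data row of the R program below (12, 5, 4, 16) this reproduces the reported p = 0.002979686.

```python
from math import comb

def fisher_exact_2tail(p1, n1, p2, n2):
    """Two tail Fisher's exact probability for the 2x2 table
    with rows (p1, n1) and (p2, n2); margins are held fixed."""
    row1, row2 = p1 + n1, p2 + n2
    col1 = p1 + p2            # column total of positives
    n = row1 + row2

    def prob(k):              # P(first cell = k) under fixed margins
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(p1)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # two tail: sum all table probabilities no larger than the observed one
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))

print(round(fisher_exact_2tail(12, 4, 5, 16), 6))
```

The small relative tolerance `1e-7` guards against floating point ties, mirroring the convention used by R's `fisher.test`.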
We repeat the study in the Chi Square Panel, but use Fisher's Exact Probability instead. We wish to study the difference in the preferences for specialty training between the male and female interns in a hospital. We found a group of 16 male interns, 10 of whom chose surgical specialties. We also found a group of 21 female interns, 5 of whom chose surgical specialties. Fisher's Exact Probability = 0.02, and we can conclude that male and female interns differ significantly (at the p<0.05 decision level) in choosing surgical specialties for training.

## Risk Difference

This is often used to evaluate the results of a controlled trial, where risk is the proportion of the group with positive outcomes, so that risk in group 1 is r1 = Pos1/(Pos1+Neg1), and risk in group 2 is r2 = Pos2/(Pos2+Neg2). The risk difference is then rd = r1 - r2. A particularly useful statistic derived from the risk difference is the Number Needed to Treat, NNT = 1 / rd, rounded upwards.

There are two methods to estimate the confidence interval of the risk difference, differing in the assumed distribution of the risk difference. The Standard method assumes that the risk difference is approximately normally distributed. The problem with the Standard method is that this assumption is only true when the sample size is large and the risk is near the 0.5 value. As the risk value approaches 0 or 1, the distribution becomes increasingly asymmetrical, with a narrower tail towards the extreme values (0 or 1) and a longer tail towards the center (0.5). This error is accentuated if the sample size is small. The Exact method instead derives the confidence interval from the binomial distribution of each risk, and so remains valid for small samples and extreme risk values.
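A minimal sketch (an illustration, not part of the original page) of the Standard method, using the first data row (12, 5, 4, 16, Pc=90) of the Risk Difference R program below; it reproduces the reported two tail interval 0.2772 to 0.7466.

```python
from math import ceil, sqrt
from statistics import NormalDist

def risk_difference(p1, n1, p2, n2, pc=95):
    """Risk difference with the Standard (normal approximation) interval."""
    r1 = p1 / (p1 + n1)                 # risk in group 1
    r2 = p2 / (p2 + n2)                 # risk in group 2
    rd = r1 - r2
    se = sqrt(r1 * (1 - r1) / (p1 + n1) + r2 * (1 - r2) / (p2 + n2))
    z = NormalDist().inv_cdf(1 - (1 - pc / 100) / 2)   # two tail z value
    nnt = ceil(1 / rd)                  # Number Needed to Treat, rounded up
    return rd, nnt, rd - z * se, rd + z * se

rd, nnt, lo, hi = risk_difference(12, 4, 5, 16, pc=90)
print(round(rd, 4), nnt, round(lo, 4), round(hi, 4))   # 0.5119 2 0.2772 0.7466
```

`statistics.NormalDist().inv_cdf` plays the role of R's `qnorm` here, so no third-party library is needed.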
We wish to study whether preoperative antibiotics reduce postoperative infections in a randomised controlled trial. Group 1 are 11 patients that received antibiotics, 4 of whom had postoperative infections. Group 2 are 15 patients that received no antibiotics, 12 of whom had postoperative infections. The risk of infection in the group receiving antibiotics is r1 = 4/11 = 0.36 (36%), and the risk in the group that did not receive antibiotics is r2 = 12/15 = 0.8 (80%). Because the sample size is small, the exact method is used. The risk difference is 0.36 - 0.8 = -0.44 (-44%), with the 95% confidence interval -0.68 to -0.06. In other words, those receiving antibiotics were 6% to 68% less likely to have postoperative infections than those not receiving antibiotics. If the Standard method was used, the 95% confidence interval would be -0.79 to -0.09 (9% to 79%).

The Number Needed to Treat is based on the risk difference, which is the same in both methods. NNT = 1 / |RD| = 1 / 0.44 = 2.27, rounded upwards to 3. In other words, for the prevention of 1 case of infection, three more patients will need to receive preoperative antibiotics.

## Risk Ratio (also known as Relative Risk)

Although risk difference is a useful research model, it assumes that the sample sizes in the two groups are roughly the same. Large discrepancies in sample size will therefore introduce bias to the results. The model is therefore less useful in epidemiological studies, where the sample sizes cannot be easily equalized. For example, in a study of death rates comparing smokers with non-smokers, the number of smokers is very much less than that of non-smokers.

The Risk Ratio was therefore introduced, initially to cope with the circumstances of epidemiological studies. The model is so effective that it is now increasingly adopted for the general comparison of 2 proportions. The risk ratio, like all other ratios, is log normally distributed. This means the log(risk ratio) is normally distributed, and is used for all calculations.
The results are then transformed back to non-logarithmic value by the exponential function.
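A minimal sketch (an illustration, not part of the original page) of this log transform and back transform, using the first data row (12, 5, 4, 16, Pc=90) of the Risk Ratio R program below; it reproduces the reported RR = 3.15 with the two tail interval 1.5886 to 6.2462.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def risk_ratio(p1, n1, p2, n2, pc=95):
    """Risk ratio with a confidence interval computed on the log scale."""
    r1 = p1 / (p1 + n1)
    r2 = p2 / (p2 + n2)
    rr = r1 / r2
    # standard error of log(risk ratio)
    se = sqrt(1 / p1 - 1 / (p1 + n1) + 1 / p2 - 1 / (p2 + n2))
    z = NormalDist().inv_cdf(1 - (1 - pc / 100) / 2)   # two tail z value
    # compute the interval on the log scale, then exponentiate back
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

rr, lo, hi = risk_ratio(12, 4, 5, 16, pc=90)
print(round(rr, 4), round(lo, 4), round(hi, 4))   # 3.15 1.5886 6.2462
```

Note that the back-transformed interval is not symmetrical around the risk ratio itself; symmetry holds only on the log scale.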
We wish to study whether men are more likely to have car accidents than women. We collected a database of drivers, and found that out of 500 men 23 had an accident in the past year, while out of 530 women 10 had an accident during the same period. The risk of a car accident amongst men is r1 = 23/500 = 0.046 (4.6%), and the risk amongst women is r2 = 10/530 = 0.019 (1.9%). The Risk Ratio is rr = 0.046/0.019 = 2.44, with 95% CI 1.17 to 5.07. In other words, male drivers have 1.17 to 5.07 times the risk of having a car accident compared with female drivers.

## Odds Ratio

While risk is the proportion of positive cases in the sample (risk = P / (P+N)), odds is the ratio of positive to negative cases in the sample (odds = P / N). Although risks and odds are both conceptual representations of proportion, and risk ratio and odds ratio both represent differences in proportions in 2 groups, they are numerically different, and have different properties.

- Gamblers commonly use odds to represent the ratio of winning against losing, while clinicians often prefer risks as they represent predictions or expected outcomes
- Risk is unidirectional, in that the group (treatment or characteristic) precedes the outcome, while odds represent the relationship between before and after, without requiring a direction. Because of this, the odds ratio can be used more flexibly. An example is the relationship between social class and educational achievement, where it is not clear which is the cause and which is the effect. Another example is in retrospective paired controlled studies, where cases are grouped according to outcome, and the test is whether the precursor variable is different
- Odds, being mathematically simpler, can be used more flexibly, particularly in multivariate statistical models
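As with the risk ratio, the odds ratio interval is computed on the log scale and back transformed. A minimal sketch (an illustration, not part of the original page), using the first data row (12, 5, 4, 16, Pc=90) of the Odds Ratio R program below; it reproduces the reported OR = 9.6 with the two tail interval 2.6969 to 34.1728.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio(p1, n1, p2, n2, pc=95):
    """Odds ratio with a confidence interval computed on the log scale."""
    o1 = p1 / n1                        # odds in group 1
    o2 = p2 / n2                        # odds in group 2
    or_ = o1 / o2
    # standard error of log(odds ratio): sum of reciprocals of all 4 cells
    se = sqrt(1 / p1 + 1 / n1 + 1 / p2 + 1 / n2)
    z = NormalDist().inv_cdf(1 - (1 - pc / 100) / 2)   # two tail z value
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

or_, lo, hi = odds_ratio(12, 4, 5, 16, pc=90)
print(round(or_, 4), round(lo, 4), round(hi, 4))   # 9.6 2.6969 34.1728
```

If the resulting interval excludes the null value of 1, the two groups can be declared significantly different at the chosen confidence level.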
We found 20 babies with occipital posterior position in early labour, 12 of whom had a deflexed head (Pos1=12) and 8 a flexed head (Neg1=8). The odds are o1 = 12 / 8 = 1.5. We also found 35 babies with occipital anterior position, 10 of whom had a deflexed head (Pos2=10) and 25 a flexed head (Neg2=25). The odds are o2 = 10 / 25 = 0.4. The Odds Ratio is or = 1.5 / 0.4 = 3.75, with the 95% confidence interval 1.2 to 11.9 (2 tail). As the confidence interval does not include the null value of 1, we can conclude that occipital posterior position and deflexion of the fetal head in early labour are related.

## References

Altman DG (1994) Practical Statistics for Medical Research. Chapman & Hall, London. ISBN 0 412 27620 5 (First Ed. 1991)

- Chi square : p.252
- Fisher's Exact probability : p.233
Altman DG, Machin D, Bryant TN, Gardner MJ (2000) Statistics with Confidence (2nd Ed). BMJ Books. ISBN 0 7279 1375 1

- Odds ratio : p.60-62
- Relative risks : p.57-59
- Risk difference : p.233
## Chi Square Test for 2x2 contingency table

```R
# Program 1: Chi Square
# data input
dat = ("
P1  P2  N1  N2
12  5   4   16
150 100 210 100
55  34  12  20
")
df <- read.table(textConnection(dat), header=TRUE)  # conversion to data frame
#df                                                 # optional display of input data
# vectors for results
XSq <- vector()
P <- vector()
# Calculate
for(i in 1:nrow(df))
{
  mx <- matrix(c(df$P1[i], df$P2[i], df$N1[i], df$N2[i]), nrow=2, ncol=2)
  test <- chisq.test(mx)
  XSq <- append(XSq, test$statistic)
  P <- append(P, test$p.value)
}
# Add results to data frame for display
df$XSq <- XSq
df$P <- P
df   # input data + chiSq + p
```

The results are

```
> df # input data + chiSq + p
   P1  P2  N1  N2      XSq           P
1  12   5   4  16 7.631331 0.005736296
2 150 100 210 100 3.283567 0.069976688
3  55  34  12  20 4.682993 0.030462625
```

## Fisher's Exact Probability for 2x2 contingency table

```R
# Program 2: Fisher's Exact Probability
# data input
dat = ("
P1 P2 N1 N2
12 5  4  16
3  5  4  1
14 12 11 13
")
df <- read.table(textConnection(dat), header=TRUE)  # conversion to data frame
#df                                                 # optional display of input data
# vector for results
P <- vector()
# Calculate
for(i in 1:nrow(df))
{
  mx <- matrix(c(df$P1[i], df$P2[i], df$N1[i], df$N2[i]), nrow=2, ncol=2)
  test <- fisher.test(mx)
  P <- append(P, test$p.value)
}
# combine result in data frame for display
df$P <- P
df   # input data + Fisher's Exact probability
```

The results are

```
> df # input data + Fisher's Exact probability
  P1 P2 N1 N2           P
1 12  5  4 16 0.002979686
2  3  5  4  1 0.265734266
3 14 12 11 13 0.777511985
```

## Risk Difference

This subpanel provides R algorithms for risk differences, divided into 3 sections

- Section 1 calculates the risks in groups 1 and 2 (R1, R2), the risk difference (RD), and the Number Needed to Treat (NNT). These parameters are common to both the Standard and Exact algorithms
- Section 2 takes the data frame created in section 1, calculates the Standard Error (SE) of the risk difference, and the confidence intervals using the standard model, assuming that the risk difference is approximately normally distributed
- Section 3 takes the data frame created in section 1, and calculates the exact confidence interval, based on the assumption that the risk difference is binomially distributed
Section 1: Input data and the basic risk parameters
```R
# Risk Difference : Section 1. Initial calculations
# data input
dat = ("
P1  P2  N1  N2  Pc
12  5   4   16  90
150 100 210 100 99
55  34  12  20  95
")
df <- read.table(textConnection(dat), header=TRUE)  # conversion to data frame
#df                                                 # optional display of input data
# calculate risk difference
df$R1 <- df$P1 / (df$P1 + df$N1)   # Risk 1
df$R2 <- df$P2 / (df$P2 + df$N2)   # Risk 2
df$RD <- df$R1 - df$R2             # Risk Difference
df$NNT = ceiling(1 / df$RD)        # Number Needed to Treat
df   # show initial risk difference
```

The initial data frame produced is as follows.

- P1, N1, P2, N2 are the numbers of positives and negatives in groups 1 and 2
- Pc = percent confidence for the confidence interval, usually 90, 95, or 99
- R1 and R2 are the risks of being positive in groups 1 and 2
- RD = R1 - R2 = risk difference
- NNT = Number Needed to Treat = ceiling(1 / RD)
```
> df # show initial risk difference
   P1  P2  N1  N2 Pc        R1        R2          RD NNT
1  12   5   4  16 90 0.7500000 0.2380952  0.51190476   2
2 150 100 210 100 99 0.4166667 0.5000000 -0.08333333 -12
3  55  34  12  20 95 0.8208955 0.6296296  0.19126589   6
```

Section 2. Calculation of the Standard Error and confidence interval using the standard model, assuming the risk difference to be approximately normally distributed.
```R
# Risk Difference : Section 2. Calculate SE and CI (standard)
dfS <- df   # copy df from section 1
#dfS        # optional display of input data
# vectors for results
SE <- vector()    # standard error
LL1 <- vector()   # lower limit 1 tail
UL1 <- vector()   # upper limit 1 tail
LL2 <- vector()   # lower limit 2 tail
UL2 <- vector()   # upper limit 2 tail
# calculation for each row of data
for(i in 1:nrow(dfS))
{
  r1 = as.numeric(dfS$R1[i]); dfS$R1[i] = sprintf(r1, fmt="%#.4f")
  r2 = as.numeric(dfS$R2[i]); dfS$R2[i] = sprintf(r2, fmt="%#.4f")
  p1 = dfS$P1[i]
  p2 = dfS$P2[i]
  n1 = dfS$N1[i]
  n2 = dfS$N2[i]
  pc = dfS$Pc[i]
  rd = as.numeric(dfS$RD[i]); dfS$RD[i] = sprintf(rd, fmt="%#.4f")
  se = sqrt((r1 * (1.0 - r1) / (p1 + n1)) + (r2 * (1.0 - r2) / (p2 + n2)))
  SE <- append(SE, sprintf(se, fmt="%#.4f"))
  prob = (1 - pc / 100) / 1   # 1 tail
  z = qnorm(1 - prob)
  LL1 <- append(LL1, sprintf(rd - z * se, fmt="%#.4f"))
  UL1 <- append(UL1, sprintf(rd + z * se, fmt="%#.4f"))
  prob = (1 - pc / 100) / 2   # 2 tail
  z = qnorm(1 - prob)
  LL2 <- append(LL2, sprintf(rd - z * se, fmt="%#.4f"))
  UL2 <- append(UL2, sprintf(rd + z * se, fmt="%#.4f"))
}
dfS$SE <- SE
dfS$LL1 <- LL1
dfS$UL1 <- UL1
dfS$LL2 <- LL2
dfS$UL2 <- UL2
dfS   # input data and results
```

The results for section 2 are as follows

- SE = Standard Error of the risk difference
- LL1 and UL1 = lower and upper limits of confidence interval, 1 tail
- LL2 and UL2 = lower and upper limits of confidence interval, 2 tail
```
> dfS # input data and results
   P1  P2  N1  N2 Pc     R1     R2      RD NNT     SE     LL1    UL1     LL2    UL2
1  12   5   4  16 90 0.7500 0.2381  0.5119   2 0.1427  0.3291 0.6948  0.2772 0.7466
2 150 100 210 100 99 0.4167 0.5000 -0.0833 -12 0.0439 -0.1854 0.0187 -0.1964 0.0297
3  55  34  12  20 95 0.8209 0.6296  0.1913   6 0.0807  0.0585 0.3240  0.0331 0.3494
```

Section 3. Confidence interval (exact model), assuming the risk difference to have the binomial distribution.
```R
# Risk Difference : Section 3. Calculate CI (Exact)
# Subroutine to estimate the confidence limits of a single risk
ExactLimits <- function(np, ng, z)   # numbers of positives and negatives, z
{
  n = np + ng
  q = ng / n
  z2 = z * z
  A = 2 * np + z2
  B = z * sqrt(z2 + 4.0 * np * q)
  C = 2 * (n + z2)
  lower = (A - B) / C
  upper = (A + B) / C
  return (c(lower, upper))
}
# Main program
dfE <- df   # copy df from section 1
#dfE        # optional display of input data
LL1 <- vector()
UL1 <- vector()
LL2 <- vector()
UL2 <- vector()
for(i in 1:nrow(dfE))
{
  r1 = as.numeric(dfE$R1[i]); dfE$R1[i] = sprintf(r1, fmt="%#.4f")
  r2 = as.numeric(dfE$R2[i]); dfE$R2[i] = sprintf(r2, fmt="%#.4f")
  p1 = dfE$P1[i]
  p2 = dfE$P2[i]
  n1 = dfE$N1[i]
  n2 = dfE$N2[i]
  pc = dfE$Pc[i]
  rd = as.numeric(dfE$RD[i]); dfE$RD[i] = sprintf(rd, fmt="%#.4f")
  # 1 tail
  prob = (1 - pc / 100) / 1
  z = qnorm(1 - prob)
  ar = ExactLimits(p1, n1, z); lower1 = ar[1]; upper1 = ar[2]
  ar = ExactLimits(p2, n2, z); lower2 = ar[1]; upper2 = ar[2]
  lower = rd - sqrt((r1 - lower1)^2 + (upper2 - r2)^2)
  upper = rd + sqrt((r2 - lower2)^2 + (upper1 - r1)^2)
  LL1 <- append(LL1, sprintf(lower, fmt="%#.4f"))
  UL1 <- append(UL1, sprintf(upper, fmt="%#.4f"))
  # 2 tail
  prob = (1 - pc / 100) / 2
  z = qnorm(1 - prob)
  ar = ExactLimits(p1, n1, z); lower1 = ar[1]; upper1 = ar[2]
  ar = ExactLimits(p2, n2, z); lower2 = ar[1]; upper2 = ar[2]
  lower = rd - sqrt((r1 - lower1)^2 + (upper2 - r2)^2)
  upper = rd + sqrt((r2 - lower2)^2 + (upper1 - r1)^2)
  LL2 <- append(LL2, sprintf(lower, fmt="%#.4f"))
  UL2 <- append(UL2, sprintf(upper, fmt="%#.4f"))
}
dfE$LL1 <- LL1
dfE$UL1 <- UL1
dfE$LL2 <- LL2
dfE$UL2 <- UL2
dfE   # result output
```

The results are as follows.
Please note that the confidence interval is not symmetrical on either side of the risk difference, as the binomial distribution has a narrower tail towards the extremes (0 or 1) than towards the center (0.5).

```
> dfE # result output
   P1  P2  N1  N2 Pc     R1     R2      RD NNT     LL1    UL1     LL2    UL2
1  12   5   4  16 90 0.7500 0.2381  0.5119   2  0.3043 0.6594  0.2413 0.6887
2 150 100 210 100 99 0.4167 0.5000 -0.0833 -12 -0.1835 0.0183 -0.1939 0.0291
3  55  34  12  20 95 0.8209 0.6296  0.1913   6  0.0578 0.3202  0.0324 0.3436
```

## Risk Ratio

```R
# Risk Ratio
# Data input
dat = ("
P1  P2  N1  N2  Pc
12  5   4   16  90
150 100 210 100 99
55  34  12  20  95
")
df <- read.table(textConnection(dat), header=TRUE)  # conversion to data frame
#df                                                 # optional display of input data
# vectors for calculations
R1 <- vector()
R2 <- vector()
RR <- vector()
LRR <- vector()
SE <- vector()
LL1 <- vector()
UL1 <- vector()
LL2 <- vector()
UL2 <- vector()
# Calculations
for(i in 1:nrow(df))
{
  p1 = df$P1[i]   # number in grp 1 pos
  p2 = df$P2[i]   # number in grp 2 pos
  n1 = df$N1[i]   # number in grp 1 neg
  n2 = df$N2[i]   # number in grp 2 neg
  pc = df$Pc[i]   # % confidence
  r1 = p1 / (p1 + n1)   # Risk Grp 1
  R1 <- append(R1, sprintf(r1, fmt="%#.4f"))
  r2 = p2 / (p2 + n2)   # Risk Grp 2
  R2 <- append(R2, sprintf(r2, fmt="%#.4f"))
  rr = r1 / r2          # risk ratio
  RR <- append(RR, sprintf(rr, fmt="%#.4f"))
  lrr = log(rr)         # log(risk ratio)
  LRR <- append(LRR, sprintf(lrr, fmt="%#.4f"))
  se = sqrt(1.0 / p1 - 1.0 / (p1 + n1) + 1.0 / p2 - 1.0 / (p2 + n2))   # Standard Error of log(rr)
  SE <- append(SE, sprintf(se, fmt="%#.4f"))
  prob = (1 - pc / 100) / 1   # 1 tail
  z = qnorm(1 - prob)
  lower = exp(lrr - z * se)   # lower margin of confidence interval 1 tail
  upper = exp(lrr + z * se)   # upper margin of confidence interval 1 tail
  LL1 <- append(LL1, sprintf(lower, fmt="%#.4f"))
  UL1 <- append(UL1, sprintf(upper, fmt="%#.4f"))
  prob = (1 - pc / 100) / 2   # 2 tail
  z = qnorm(1 - prob)
  lower = exp(lrr - z * se)   # lower margin of confidence interval 2 tail
  upper = exp(lrr + z * se)   # upper margin of confidence interval 2 tail
  LL2 <- append(LL2, sprintf(lower, fmt="%#.4f"))
  UL2 <- append(UL2, sprintf(upper, fmt="%#.4f"))
}
# combine results into data frame for display
df$R1 <- R1
df$R2 <- R2
df$LRR <- LRR
df$SE <- SE
df$RR <- RR
df$LL1 <- LL1
df$UL1 <- UL1
df$LL2 <- LL2
df$UL2 <- UL2
df   # show data frame with results
```

The results are as follows

- R1 and R2 = risks in grp 1 and 2
- RR = Risk Ratio (Relative Risks) = R1 / R2
- LRR and SE = log(Risk Ratio) and its Standard Error
- LL1 and UL1 = lower and upper border of confidence interval, 1 tail
- LL2 and UL2 = lower and upper border of confidence interval, 2 tail
```
> df # show data frame with results
   P1  P2  N1  N2 Pc     R1     R2     LRR     SE     RR    LL1    UL1    LL2    UL2
1  12   5   4  16 90 0.7500 0.2381  1.1474 0.4162 3.1500 1.8479 5.3697 1.5886 6.2462
2 150 100 210 100 99 0.4167 0.5000 -0.1823 0.0943 0.8333 0.6692 1.0377 0.6537 1.0624
3  55  34  12  20 95 0.8209 0.6296  0.2653 0.1190 1.3038 1.0721 1.5855 1.0326 1.6461
```

## Odds Ratio

```R
# Odds Ratio
# Data input
dat = ("
P1  P2  N1  N2  Pc
12  5   4   16  90
150 100 210 100 99
55  34  12  20  95
")
df <- read.table(textConnection(dat), header=TRUE)  # conversion to data frame
#df                                                 # optional display of input data
# vectors for calculations
O1 <- vector()
O2 <- vector()
OR <- vector()
LOR <- vector()
SE <- vector()
LL1 <- vector()
UL1 <- vector()
LL2 <- vector()
UL2 <- vector()
# Calculations
for(i in 1:nrow(df))
{
  p1 = df$P1[i]   # number of Pos in grp 1
  p2 = df$P2[i]   # number of Pos in grp 2
  n1 = df$N1[i]   # number of Neg in grp 1
  n2 = df$N2[i]   # number of Neg in grp 2
  pc = df$Pc[i]   # % confidence
  o1 = p1 / n1    # odds grp 1
  O1 <- append(O1, sprintf(o1, fmt="%#.4f"))
  o2 = p2 / n2    # odds grp 2
  O2 <- append(O2, sprintf(o2, fmt="%#.4f"))
  or = o1 / o2    # odds ratio
  OR <- append(OR, sprintf(or, fmt="%#.4f"))
  lor = log(or)   # log(odds ratio)
  LOR <- append(LOR, sprintf(lor, fmt="%#.4f"))
  se = sqrt(1.0 / p1 + 1.0 / p2 + 1.0 / n1 + 1.0 / n2)   # Standard Error of log(or)
  SE <- append(SE, sprintf(se, fmt="%#.4f"))
  prob = (1 - pc / 100) / 1   # 1 tail
  z = qnorm(1 - prob)
  lower = exp(lor - z * se)   # lower limit of confidence interval 1 tail
  upper = exp(lor + z * se)   # upper limit of confidence interval 1 tail
  LL1 <- append(LL1, sprintf(lower, fmt="%#.4f"))
  UL1 <- append(UL1, sprintf(upper, fmt="%#.4f"))
  prob = (1 - pc / 100) / 2   # 2 tail
  z = qnorm(1 - prob)
  lower = exp(lor - z * se)   # lower limit of confidence interval 2 tail
  upper = exp(lor + z * se)   # upper limit of confidence interval 2 tail
  LL2 <- append(LL2, sprintf(lower, fmt="%#.4f"))
  UL2 <- append(UL2, sprintf(upper, fmt="%#.4f"))
}
# combine results into data frame for display
df$O1 <- O1
df$O2 <- O2
df$LOR <- LOR
df$SE <- SE
df$OR <- OR
df$LL1 <- LL1
df$UL1 <- UL1
df$LL2 <- LL2
df$UL2 <- UL2
df   # show data frame with results
```

The results are as follows

- O1 and O2 = Odds in grp 1 and 2
- OR = Odds Ratio = O1 / O2
- LOR and SE = log(Odds Ratio) and its Standard Error
- LL1 and UL1 = lower and upper border of confidence interval, 1 tail
- LL2 and UL2 = lower and upper border of confidence interval, 2 tail
```
> df # show data frame with results
   P1  P2  N1  N2 Pc     O1     O2     LOR     SE     OR    LL1     UL1    LL2     UL2
1  12   5   4  16 90 3.0000 0.3125  2.2618 0.7719 9.6000 3.5699 25.8160 2.6969 34.1728
2 150 100 210 100 99 0.7143 1.0000 -0.3365 0.1773 0.7143 0.4729  1.0789 0.4524  1.1277
3  55  34  12  20 95 4.5833 1.7000  0.9918 0.4254 2.6961 1.3393  5.4273 1.1713  6.2058
```