This page presents
analysis of covariance in its simplest form, with 2 groups and 1 covariate.
It is assumed that users will have some idea of covariance analysis, so no detailed explanations are offered on this page. Some on line tutorials are offered in the references, and inexperienced users are encouraged to consult expert statisticians before interpreting the results.
The conversation that follows is therefore to orientate users to the programs specific to this page, and help them to follow the calculations and interpretations.
The Javascript algorithm was initially developed in Fortran, using formulae from Armitage's classic text in the 1970s. The calculations are based on the partition of variances.
Analysis of covariance has come a long way since, and the current programs, as represented by the R algorithm, relies on the General Linear Model, using numerical approximations to estimate coefficients. This model greatly increases the ability to handle complex situations, allowing simultaneous analysis of multiple factors (groups) and values (measurements), and able to adapt to outcomes that have different distributions. More discussions and example codes on the General Linear Model can be found in GLM.php
Example used on this page
Grp | GAge | Bwt |
G | 37 | 3048 |
B | 36 | 2813 |
G | 41 | 3622 |
G | 36 | 2706 |
B | 35 | 2581 |
B | 39 | 3442 |
G | 40 | 3453 |
B | 37 | 3172 |
G | 35 | 2386 |
B | 39 | 3555 |
G | 37 | 3029 |
B | 37 | 3185 |
G | 36 | 2670 |
B | 38 | 3314 |
G | 41 | 3596 |
B | 38 | 3312 |
G | 39 | 3200 |
B | 41 | 3667 |
B | 40 | 3643 |
G | 38 | 3212 |
G | 38 | 3135 |
G | 39 | 3366 |
Please note: The data presented on this page is artificially generated to demonstrate the procedures. It is not real.
We purport to study whether boys and girls have different weight at birth. We understand that birth weight is much influence by the gestational age, so we need to use this as a covariate for correction.
We obtained the data as shown in the table to the right. (Please note: that study of birth weight usually have cases in the hundreds and thousands, the small size here is to make the example easier to visualize).
- The variable Grp represents grouping based on gender, G for girls, and B for boys
- The variable Gage represents the covariate gestational age, in weeks since beginning of pregnancy
- The variable Bwt represents the outcome birth weight, in grams
Method 1: Partition of variance
This follows formulae in Armitage's text, and consists of the following steps
- Regression between covariate (Gage) and outcome (Bwt) are calculated for the two groups, and the 2 different regression coefficients tested for significant difference. A significant difference means that there is an interaction between groups and covariate, in this example that the growth rates in the two groups are different, so it cannot be combined as a common one for the two groups, and covariate analysis should be terminated at this point.
The results from the example show that the two regression slopes differed by 1.6 grams per week, and not statistically significant, so covariance analysis can proceed
- The two regressin coefficients are merged into a single one, and the outcome recalculated as the distance from the regression line. The corrected outcomes are then compared. In this example, the difference between the corrected means is 165 grams, with 95% confidence interval 79 to 251 grams, and statistically significant.
This compared with the uncorrected difference of 150 grams with 95% confidence interval -159 to 458 grams, not statistically significant.
The difference in mean outcomes (Bwt) between the two groups was 158g before and 165g after covariate correction, not much changed. The variance, and therefore the confidence interval was however greatly reduced by the use of the covariate, thus redering the difference statistically different.
- The results can be demonstrated in the plot as shown to the right. The dots represent data points and lines regression line. Red represents boys and blue girls. The black line is the combined regression line
Method 2: procedure lm from CRAN
The procedurre
lm is a linear model for Gaussian distribution, available from the R archive CRAN (see reference). The basic formula is
lm(formula = outcome ~ var1 + var2 + var2 + ..., data=data frame or file)
Potentially the independent variables can be either factors (groups represented by text) or values (numerical data). The dependent value is assumed to be normally distributed, and values are either binomial or at least ordinal but not necessarily normally distributed.
The advantage of using R is that a single line of code will produce all the answers required, regardless of the size and complexity of the data set. The disadvantage is the users are constrained by what the packaged program produced.
In this example, the formula used twice
- lm(formula = Bwt ~ Grp + Gage + Grp * Gage, data=df) includes an interaction item Grp * Gage. This testa whether the two slopes are significantly different, and in this example the result is not significant, so the second calculation can proceed
- lm(formula = Bwt ~ Grp + Gage, data=df) does not include the interaction item, as it is not significant, and the program uses the same covariate (Gage) for both groups
The coefficients estimates are the final information required, and in this example girls are 165.33 grams less than boys, and the covariate representing growth rate of 186.22 grams per week
References
Armitage P.(1980) Statistical Methods in Medical Research. Blackwell Scientific
Publications. Oxford UK. ISBN 0 632 05430 1. p.279-301
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm Documentation for linear model from CRAN
R codes for compare 2 regressions using raw data
This page presents 2 meahods for this comparison, using generic R algorithm obtained from CRAN, and a replication of the Javascript program, in R. They both use the same example data set
Data entry using example data
The data is in 3 columns
- Col 1 = group, using characters. It can be more than 1, but usually 1 chr will do. In this example it is gender, B for boys and G for girls
- Col 2 = the x variable (the covariate). In this example it is gestational age in weeks
- Col 3 the y vaiable (the dependent or outcome variable). In this example it is birth weight in grams
# Data entry
myDat = ("
Grp Gage Bwt
G 37 3048
B 36 2813
G 41 3622
G 36 2706
B 35 2581
B 39 3442
G 40 3453
B 37 3172
G 35 2386
B 39 3555
G 37 3029
B 37 3185
G 36 2670
B 38 3314
G 41 3596
B 38 3312
G 39 3200
B 41 3667
B 40 3643
G 38 3212
G 38 3135
G 39 3366
")
df <- read.table(textConnection(myDat),header=TRUE)
#df # optional display of data frame
Program 1: Using Generic R Codes from CRAN
This uses the data set already presented
res1 <- lm(formula = Bwt ~ Grp + Gage + Grp * Gage, data=df)
summary(res1) # linear model with interaction
res2 <- lm(formula = Bwt ~ Grp + Gage, data=df)
summary(res2) # linear model with no interaction
The results of the first linear model, with interaction are as follows. The purpose is to evaluate the statistical significane of the interaction (Grp * Gage). If this is significant, then the relationships between gestational age (Gage) and birth weight (Bwt) are different in the two groups. In this result, the interaction is not statistically significant.
> summary(res1) # linear model with interaction
Call:
lm(formula = Bwt ~ Grp + Gage + Grp * Gage, data = df)
Residuals:
Min 1Q Median 3Q Max
-157.20 -65.56 -3.80 85.68 131.87
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3771.733 683.205 -5.521 3.05e-05 ***
GrpG -226.830 891.411 -0.254 0.802
Gage 185.267 17.960 10.315 5.53e-09 ***
GrpG:Gage 1.617 23.411 0.069 0.946
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 98.37 on 18 degrees of freedom
Multiple R-squared: 0.9383, Adjusted R-squared: 0.9281
F-statistic: 91.31 on 3 and 18 DF, p-value: 4.41e-11
Once interaction is found to be not significant, the algorithm can be repeated without the interaction, assuming a common regression. The emphasis here is the coefficients. It shows that
- coefficient for GrpG = -165.33, interpreted as girls are 165.33 grams lighter than boys
- coefficient for Gage = 186.22, interpreted as growth rate for the two groups combined to be 186.22 grams per week
> summary(res2) # linear model with no interaction
Call:
lm(formula = Bwt ~ Grp + Gage, data = df)
Residuals:
Min 1Q Median 3Q Max
-160.055 -64.447 -5.227 86.543 131.153
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3807.89 427.23 -8.913 3.25e-08 ***
GrpG -165.33 41.01 -4.031 0.000713 ***
Gage 186.22 11.21 16.605 9.09e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 95.76 on 19 degrees of freedom
Multiple R-squared: 0.9383, Adjusted R-squared: 0.9318
F-statistic: 144.5 on 2 and 19 DF, p-value: 3.205e-12
Program 2: Using Native R duplicating the Javascript Program
Partition of Variance
The following is a single program, divided into parts so its easier to follow. Part 1 is the data entry already shown above.
Part 2: Convert input data to series of vectors with 2 elements for 2 groups
dfB <- subset(df,Grp=="B") # data frame with all Bs
dfG <- subset(df,Grp=="G") # data frame with all Gs
# create vectors
arN <- c(nrow(dfB), nrow(dfG)) # sample size
arMeanX <- c(mean(dfB$Gage),mean(dfG$Gage)) # mean x
arSdX <- c(sd(dfB$Gage),sd(dfG$Gage)) # SD x
arMeanY <- c(mean(dfB$Bwt),mean(dfG$Bwt)) # mean Y
arSdY <- c(sd(dfB$Bwt),sd(dfG$Bwt)) # SD y
# do regressions
matrix_coef <- summary(lm(dfB$Bwt ~ dfB$Gage))$coefficients
a1 = matrix_coef[1,1]
b1 = matrix_coef[2,1]
matrix_coef <- summary(lm(dfG$Bwt ~ dfG$Gage))$coefficients
a2 = matrix_coef[1,1]
b2 = matrix_coef[2,1]
arB <- c(b1, b2) # slope, regression coefficient b
arA <- c(a1,a2) # constant
# Derived parameterrs
arSsx = c(0, 0) # sum of squares x
arSsy = c(0, 0) # sum of squares y
arSxy = c(0, 0) # sum products
arRho = c(0, 0) # correlation coefficient rho
arT = c(0, 0) # student t
arP = c(0, 0) # significance p
for(j in 1:2)
{
arSsx[j] = arSdX[j]^2 * (arN[j] - 1)
arSsy[j] = arSdY[j]^2 * (arN[j] - 1)
arSxy[j] = arB[j] * arSsx[j]
arRho[j] = arSxy[j] / sqrt(arSsx[j] * arSsy[j])
arT[j] = arRho[j] * sqrt((arN[j] - 2) / (1 - arRho[j]^2))
arP[j] = 1 - pt(arT[j], arN[j] - 2) # 2 tail
}
# initial output of vectors
arN # sample size
arMeanX # mean x
arSdX
arMeanY # mean Y
arSdY # SD y
# slopes
arB # slope, regression coefficient b
arA # constant a
# correlation coefficients
arRho # correlation coefficient rho
arT # student t
arP # significance p
The initial parameters, as vectors are as follows
> # initial output of vectors
> arN # sample size
[1] 10 12
> arMeanX # mean x
[1] 38.00000 38.08333
> arSdX
[1] 1.825742 1.975225
> arMeanY # mean Y
[1] 3268.400 3118.583
> arSdY # SD y
[1] 351.4447 380.3308
> # slopes
> arB # slope, regression coefficient b
[1] 185.2667 186.8835
> arA # constant a
[1] -3771.733 -3998.563
> # correlation coefficients
> arRho # correlation coefficient rho
[1] 0.9624533 0.9705682
> arT # student t
[1] 10.02857 12.74448
> arP # significance p
[1] 4.154629e-06 8.277119e-08
Part 3: Compare the two xs and ys
diffX = arMeanX[1] - arMeanX[2]
dfX = arN[1] + arN[2] - 2
pooledX = sqrt(((arN[1]-1)*arSdX[1]^2 + (arN[2]-1)*arSdX[2]^2) / dfX)
seX = pooledX * sqrt(1/arN[1] + 1/arN[2])
llX = diffX - 1.96 * seX
ulX = diffX + 1.96 * seX
diffY = arMeanY[1] - arMeanY[2]
dfY = arN[1] + arN[2] - 2;
pooledY = sqrt(((arN[1]-1)*arSdY[1]^2 + (arN[2]-1)*arSdY[2]^2) / dfY)
seY = pooledY * sqrt(1/arN[1] + 1/arN[2]);
llY = diffY - 1.96 * seY
ulY = diffY + 1.96 * seY
# output 95% CI difference in x and y
c(llX, ulX) # 95% CI difference in x
c(llY, ulY) # 95% CI difference in y
The results are as follows
> # output 95% CI difference in x and y
> c(llX, ulX) # 95% CI difference in x
[1] -1.685749 1.519082
> c(llY, ulY) # 95% CI difference in y
[1] -158.6923 458.3256
Part 4: comparing the two slopes
diffSlope = arB[1] - arB[2] # diff slope
s2 = ((arSsy[1] - arSxy[1] * arSxy[1] / arSsx[1]) +
(arSsy[2] - arSxy[2] * arSxy[2] / arSsx[2])) /
(arN[1] + arN[2] - 4)
seSlope = sqrt(s2 * (1 / arSsx[1] + 1 / arSsx[2])) # SE of difference
tSlope = diffSlope / seSlope # t test
dfSlope = arN[1] + arN[2] - 4; # degrees of freedom
pSlope = (1 - pt(abs(tSlope), dfSlope)) * 2 # Type I error 2 tail
# output comparison 2 slopes
diffSlope # difference between slopes
seSlope # standard error of difference
tSlope # t
dfSlope # deg freedom
pSlope # significance p (2 tail)
The results are as follows
> # output comparison 2 slopes
> diffSlope # difference between slopes
[1] -1.616828
> seSlope # standard error of difference
[1] 23.41085
> tSlope # t
[1] -0.0690632
> dfSlope # deg freedom
[1] 18
> pSlope # significance p (2 tail)
[1] 0.9457008
Part 5: Combining the two slopes and compare the two adjusted y variables
commonSlope = (arSxy[1] + arSxy[2]) / (arSsx[1] + arSsx[2]) # common slope
grandMean = (arMeanX[1] * arN[1] + arMeanX[2] * arN[2]) / (arN[1] + arN[2]) # mean of x
adjMean1 = arMeanY[1] + commonSlope * (grandMean - arMeanX[1])
adjMean2 = arMeanY[2] + grandMean * (grandMean - arMeanX[2])
diffMean = arMeanY[1] - arMeanY[2] - commonSlope * (arMeanX[1] - arMeanX[2]) # adjusted diff
s2 = (arSsy[1] + arSsy[2] - (arSxy[1] + arSxy[2]) * (arSxy[1] + arSxy[2]) / (arSsx[1] + arSsx[2])) /
(arN[1] + arN[2] - 3);
seMean = s2 * (1.0 / arN[1] + 1.0 / arN[2] + (arMeanX[1] - arMeanX[2]) * (arMeanX[1] - arMeanX[2]) /
(arSsx[1] + arSsx[2]));
seMean = sqrt(seMean) # SE difference
tMean = diffMean / seMean # t test
dfMean = arN[1] + arN[2] - 3; # degrees of freedom
pMean = (1 - pt(tMean, dfMean)) * 2 # p Type I Error (2 tail)
# 95% CI
t = abs(qt(0.025,dfMean)) # t value for p=0.05 2 tail
ll = diffMean - t * seMean # lower limit 95% CI
ul = diffMean + t * seMean # upper limit 95% CI
# output of combined data
commonSlope # common slope
diffMean # difference between adjusted means
seMean # Standard Error of difference
tMean # t
dfMean # df
pMean # significance p 2 tail
c(ll,ul) # 95% confidence interval of adjusted difference in y
The results are as follows
> commonSlope # common slope
[1] 186.2183
> diffMean # difference between adjusted means
[1] 165.3349
> seMean # Standard Error of difference
[1] 41.0136
> tMean # t
[1] 4.03122
> dfMean # df
[1] 19
> pMean # significance p 2 tail
[1] 0.0007134561
> c(ll,ul) # 95% confidence interval of adjusted difference in y
[1] 79.4924 251.1773
Part 6: plot the dots and regression lines
# Plot
palette(c("red","blue")) # change palette to B=red G=blue (alphabetical order)
par(pin=c(4.2, 3)) # set plotting window to 4.2x3 inches
plot(x = df$Gage, # Gest age on the x axis
y = df$Bwt, # BWt on the y axis
col = df$Grp, # in 2 colors (B and G)
pch = 16, # size of dot
xlab = "Gestation (Gage)", # x label
ylab = "Birth Weight (Bwt)") # y lable
abline(lm(dfB$Bwt~dfB$Gage),col="red") # regression line for B
abline(lm(dfG$Bwt~dfG$Gage),col="blue") # regression line for G
The result plot is as shown to the right. In this plot, the palette is assigned red and blue color in alphabeticl order, so that the group B for boys is represented in red, and G for girls in blue. The dots are the data points, and the lines the regression lines for the two groups