Explanations
This page provides calculations for parametric correlation (Pearson) and linear regression analysis.
Correlation

Pearson's correlation coefficient (ρ) is a measure of association between two normally distributed measurements. It describes the strength of association, but does not indicate which is cause and which is effect.

In earlier algorithms, ρ is assumed to have an approximately normal distribution, with Standard Error sqrt((1 - ρ²) / (n - 2)), where n is the sample size. From this, the t value and its statistical significance based on the t distribution can be estimated. More recently, it is taken into consideration that the value of ρ is constrained between -1 and 1, and that its distribution is symmetrical only when the value is close to 0. As ρ approaches the extremes of -1 or 1, the distribution becomes increasingly asymmetrical, with a long tail towards 0 and a short tail towards -1 or 1. A significance test based on the t distribution is therefore considered approximate at best, and possibly misleading; the confidence interval, calculated after ρ is transformed to Fisher's Z, is considered more appropriate. The calculations on this page provide both, allowing the user to decide which to present as results. The formulae used for Fisher's Z transformation are
Fisher's Z transformation: Z = 0.5 ln((1 + ρ) / (1 - ρ))
Standard Error of Z: SE = 1.0 / sqrt(n - 3), where n = sample size
95% CI of Z: Z ± z × SE, where z = 1.64 for 1 tail and z = 1.96 for 2 tail
Reverse transformation: ρ = (exp(2Z) - 1) / (exp(2Z) + 1)
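As a minimal sketch of these formulae (illustrative only, and not part of the page's own program; the values of rho and n below are assumptions), the following R fragment computes both the t-based test and the Fisher's Z confidence interval:

# Sketch: t test and Fisher's Z CI for a correlation coefficient
rho <- 0.92                             # assumed correlation coefficient
n <- 22                                 # assumed sample size
se <- sqrt((1 - rho^2) / (n - 2))       # normal-approximation SE of rho
t <- rho / se                           # t statistic
p <- 2 * (1 - pt(abs(t), n - 2))        # 2 tail p value from the t distribution
z <- 0.5 * log((1 + rho) / (1 - rho))   # Fisher's Z transformation
seZ <- 1 / sqrt(n - 3)                  # SE of Z
ciZ <- z + c(-1, 1) * 1.96 * seZ        # 95% CI (2 tail) on the Z scale
ciRho <- (exp(2 * ciZ) - 1) / (exp(2 * ciZ) + 1)   # reverse transformation
c(t, p)                                 # t value and p value
ciRho                                   # 95% CI of rho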
Example: applying Pearson's correlation to the example data produces the results shown in the R code section below.
Linear Regression

Linear regression describes the relationship between an independent variable (x) and a dependent variable (y). The relationship is directional in that x comes before y, and the values of x predict or influence the values of y.

The values of y are assumed to be continuous and normally distributed, while x can be binary, ordinal, or continuous measurements that are not necessarily normally distributed.

The regression formula is y = a + bx, where a is the constant (the value of y when x = 0), and b is the regression coefficient (the change in y for each unit of change in x). The regressed value of y is the mean value of y given x. Its error is a combination of the Standard Error of b and the Standard Error of the mean of y. Because of this combination, the error of the regressed y changes with the value of x, being smallest near the mean of x and widening towards the extremes (see the sketch below).

Example: Using the same example as for correlation, the results of the calculations are shown in the R code section below.
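A minimal sketch of this widening error, using base R's lm() and predict(); the x and y values here are made-up illustrative assumptions, not the page's example data:

# Sketch: 95% CI of the regressed (mean) y is narrowest near mean(x)
x <- 1:9                                                   # assumed illustrative x values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0)   # assumed illustrative y values
fit <- lm(y ~ x)                                           # fit y = a + bx
ci <- predict(fit, interval = "confidence")                # regressed y with 95% CI
ci[, "upr"] - ci[, "lwr"]    # CI width: smallest near mean(x), largest at the extremes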
R Codes

The following is a single R program, divided into parts so it is easier to follow.
Part 1: Data entry

# Correlation Regression Analysis
# data entry
dat = ("
X Y
37 3048
36 2813
41 3622
36 2706
35 2581
39 3442
40 3453
37 3172
35 2386
39 3555
37 3029
37 3185
36 2670
38 3314
41 3596
38 3312
38 3414
41 3667
40 3643
33 1398
38 3135
39 3366
")
df <- read.table(textConnection(dat), header=TRUE)   # conversion to data frame

Part 2: Means, SD, and Correlation Analysis

# Correlation
n = nrow(df)
meanX = mean(df$X)
sdX = sd(df$X)
meanY = mean(df$Y)
sdY = sd(df$Y)
# Initial output of parameters
n                       # sample size
c(meanX, sdX)           # mean and SD X
c(meanY, sdY)           # mean and SD Y
cor.test(df$X, df$Y)    # Correlation results

The results are as follows

> # Initial output of parameters
> n                       # sample size
[1] 22
> c(meanX, sdX)           # mean and SD X
[1] 37.772727  2.136571
> c(meanY, sdY)           # mean and SD Y
[1] 3113.955  532.697
> cor.test(df$X, df$Y)    # Correlation results as output by R

        Pearson's product-moment correlation

data:  df$X and df$Y
t = 10.78, df = 20, p-value = 8.814e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8222882 0.9682270
sample estimates:
     cor
0.923674

Part 3: Regression Analysis

# regression
regRes <- lm(df$Y ~ df$X)
mxC <- summary(regRes)$coefficients
a = mxC[1,1]                         # constant
b = mxC[2,1]                         # regression coefficient
ssx = sdX^2 * (n - 1)                # sum square x
ssy = sdY^2 * (n - 1)                # sum square y
sxy = b * ssx                        # sum product
rv = (ssy - sxy^2 / ssx) / (n - 2)   # residual mean squares
seB = sqrt(rv / ssx)                 # SE of slope b
tB = b / seB                         # t
p = (1 - pt(abs(tB), n-2)) * 2       # Type I error 2 tail (from the t value tB)
t95 = abs(qt(0.025, n-2))            # t value for p=0.05 2 tail
ll = b - t95 * seB                   # lower limit 95% CI b
ul = b + t95 * seB                   # upper limit 95% CI b
# output regression analysis
regRes                  # Regression results as output by R
c(a, b, seB, tB, p)     # results of regression: a, b, SE, t and p
c(ll, ul)               # 95% CI regression coefficient b

The results are as follows

> # output regression analysis
> regRes                  # Regression results as output by R

Call:
lm(formula = df$Y ~ df$X)

Coefficients:
(Intercept)         df$X
    -5584.9        230.3

> c(a, b, seB, tB, p)     # results of regression: a, b, SE, t and p
[1] -5584.85917   230.29350    21.36240    10.78032     0.00000
> c(ll, ul)               # 95% CI regression coefficient b
[1] 185.7323 274.8547

Part 4: Add regressed values to the data frame

# create regressed values
df$Regy <- a + b * df$X                                   # regressed y value
df$SE_RY <- sqrt(rv * (1 / n + (df$X - meanX)^2 / ssx))   # SE of regressed y
df$CI_low <- df$Regy - t95 * df$SE_RY                     # 95% CI low
df$CI_high <- df$Regy + t95 * df$SE_RY                    # 95% CI high
df$Zy <- (df$Y - df$Regy) / df$SE_RY                      # z of Y relative to regressed y
df                             # display input data and regressed values

The result data frame is as follows

> df   # display input data and regressed values
    X    Y     Regy     SE_RY   CI_low  CI_high           Zy
1  37 3048 2936.000  47.55016 2836.813 3035.188  2.355397483
2  36 2813 2705.707  58.50335 2583.671 2827.743  1.833964011
3  41 3622 3857.174  82.10704 3685.902 4028.447 -2.864242618
4  36 2706 2705.707  58.50335 2583.671 2827.743  0.005008771
5  35 2581 2475.413  74.14155 2320.757 2630.070  1.424120926
6  39 3442 3396.587  51.72894 3288.683 3504.492  0.877893831
7  40 3453 3626.881  65.21022 3490.855 3762.907 -2.666468366
8  37 3172 2936.000  47.55016 2836.813 3035.188  4.963170022
9  35 2386 2475.413  74.14155 2320.757 2630.070 -1.205983220
10 39 3555 3396.587  51.72894 3288.683 3504.492  3.062357670
11 37 3029 2936.000  47.55016 2836.813 3035.188  1.955819432
12 37 3185 2936.000  47.55016 2836.813 3035.188  5.236565530
13 36 2670 2705.707  58.50335 2583.671 2827.743 -0.610340655
14 38 3314 3166.294  44.85642 3072.725 3259.863  3.292862572
15 41 3596 3857.174  82.10704 3685.902 4028.447 -3.180902422
16 38 3312 3166.294  44.85642 3072.725 3259.863  3.248275865
17 38 3414 3166.294  44.85642 3072.725 3259.863  5.522197931
18 41 3667 3857.174  82.10704 3685.902 4028.447 -2.316177572
19 40 3643 3626.881  65.21022 3490.855 3762.907  0.247185395
20 33 1398 2014.826 111.28225 1782.696 2246.957 -5.542900720
21 38 3135 3166.294  44.85642 3072.725 3259.863 -0.697647721
22 39 3366 3396.587  51.72894 3288.683 3504.492 -0.591303087

Plotting the results

# plot
df <- df[order(df$X),]             # sort data frame by order of x
par(pin=c(4.2, 3))                 # set plotting window to 4.2x3 inches
plot(x = df$X,                     # x = gestational age on the x axis
     y = df$Y,                     # y = birth weight on the y axis
     pch = 16,                     # filled-circle plotting symbol
     xlab = "Gestation (X)",       # x label
     ylab = "Birth Weight (Y)")    # y label
lines(df$X, df$Regy, col = "red")      # regression line
lines(df$X, df$CI_low, col = "red")    # lower 95% CI of the regression line
lines(df$X, df$CI_high, col = "red")   # upper 95% CI of the regression line
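As a cross-check (not part of the original program), the hand-calculated quantities above can be reproduced with R's built-in extractors. This sketch assumes the data frame df and the fitted model regRes from the parts above are already in the session:

# Sketch: built-in equivalents of the manual calculations
summary(regRes)$coefficients                     # a, b, their SEs, t and p values
confint(regRes)                                  # 95% CI of a and b, matching c(ll, ul)
head(predict(regRes, interval = "confidence"))   # regressed y with 95% CI, matching Regy, CI_low, CI_high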