Kappa was first described by Cohen in 1960 as a measurement of concordance, or agreement, between two judges in the way they classify or categorise subjects into different groups or categories. It became a popular method of measuring concordance for nominal data.
Cohen (1968) modified the algorithm by applying a weight to each difference between a pair of measurements, so that the influence of a disagreement increases with its size. This weighting makes the algorithm suitable for ordinal scales. Fleiss, Cohen, and Everitt (1969) later derived the large sample standard error of the weighted Kappa.
The weighted Cohen's Kappa is therefore a measurement of concordance when the data are ordinal.
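In outline (a restatement of what the programs below compute, not a formula quoted from the references): with p_ij the observed proportion of cases in cell (i,j) of the count matrix, r_i and c_j the row and column totals, n the number of subjects, and w_ij the linear disagreement weight,

$$\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}}, \qquad e_{ij} = \frac{r_i\, c_j}{n^2}, \qquad w_{ij} = |i - j|$$

$$SE(\kappa_w) = \sqrt{\frac{\sum_{i,j} w_{ij}^2\, p_{ij} - \left(\sum_{i,j} w_{ij}\, p_{ij}\right)^2}{n \left(\sum_{i,j} w_{ij}\, e_{ij}\right)^2}}$$

The 95% confidence interval is then Kappa ± 1.96 SE.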
Nomenclature
Ordinal data are data sets where the numbers are in order, but the distances between the numbers are unstated. In other words, 3 is bigger than 2 and 2 is bigger than 1, but 3-2 is not necessarily the same as 2-1. A common example of ordinal data is the Likert scale, where 1=strongly disagree, 2=disagree, 3=neutral, 4=agree, and 5=strongly agree. Although these numbers are in order, the difference between strongly agree and agree (5-4) is not necessarily the same as between disagree and strongly disagree (2-1).
In the example on this page, babies are classified as small (1), as expected (2), and large (3). Large (3) is bigger than expected (2), and expected (2) is bigger than small (1); however, the difference between large and expected is not necessarily the same as between expected and small.
Instrument is any method of measurement: for example, a ruler, a Likert scale (a 5 point scale from strongly disagree to strongly agree), or a machine (e.g. ultrasound measurement of bone length). In the example on this page, the instrument is the judgement of the two doctors concerned.
Subjects are the individuals or objects being measured: in this example, the babies.
Example
The data in this example are artificially created to demonstrate the procedure, and do not reflect any real clinical situation. They purport to come from two doctors evaluating the size of 30 babies in their mothers' abdomens, classifying each baby as smaller than expected (1), as expected (2), or larger than expected (3). Cohen's Kappa then evaluates how much the two doctors agree with each other (their concordance).
The data can be entered in one of two ways:
As a table of 30 rows (cases) and two columns (doctors), each cell containing the evaluation (1, 2, or 3)
As a table of counts, with rows representing doctor 1's evaluations (1, 2, or 3) and columns doctor 2's evaluations (1, 2, or 3); each cell contains the number of cases so evaluated. (If the raw two-column data are already in R, the count matrix can be produced directly, as sketched below.)
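If the raw scores are already in R as two columns, base R's table() can build this matrix of counts directly. A minimal sketch with hypothetical scores (the factor levels are fixed so that a category missing from one rater still gets its row or column):
# Sketch: count matrix from two columns of ordinal scores (hypothetical data)
doc1 <- c(1, 1, 2, 3, 2) # hypothetical scores from doctor 1
doc2 <- c(1, 2, 2, 3, 1) # hypothetical scores from doctor 2
lv <- sort(unique(c(doc1, doc2))) # shared set of categories
countMx <- table(factor(doc1, levels=lv), factor(doc2, levels=lv))
countMx # rows: doctor 1, columns: doctor 2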
The results consist firstly of a display of the count matrix, then the Kappa, its Standard Error, and its 95% confidence interval. Two common methods of interpretation can be used:
If the 95% confidence interval does not traverse the null value (0), it can be concluded that concordance is significantly stronger than random chance. In this example the interval includes 0, so significant concordance cannot be concluded.
A rule of thumb (after Landis and Koch, 1977), where a Kappa of less than 0.2 is considered poor agreement, 0.21-0.4 fair, 0.41-0.6 moderate, 0.61-0.8 strong, and more than 0.8 near complete agreement. From our example (Kappa = 0.28), the conclusion is poor to fair concordance. This is not surprising, as the sample size is clearly too small.
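As an illustration only, this rule of thumb can be written as a small R function; interpretKappa is a hypothetical helper, not part of the programs below:
# Hypothetical helper: label a Kappa value using the rule of thumb above
interpretKappa <- function(kappa)
{
if(kappa <= 0.2) return("poor")
if(kappa <= 0.4) return("fair")
if(kappa <= 0.6) return("moderate")
if(kappa <= 0.8) return("strong")
return("near complete")
}
interpretKappa(0.28) # "fair"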
References
Cohen J (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46.
Cohen J (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70:213-220.
Fleiss JL, Cohen J, Everitt BS (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72(5):323-327.
Fleiss JL. Statistical Methods for Rates and Proportions, second edition. Wiley Series in Probability and Mathematical Statistics. Chapter 13, p. 212-236.
Landis JR, Koch GG (1977). The measurement of observer agreement for categorical data. Biometrics 33:159-174.
Data Entry using Table of Raw Scores: The data is a matrix of numbers with 2 columns
- Each row is a subject
- The two columns are the scores from the two ordinal scales
- Each cell contains a score (an ordinal value)
- The data are converted to ranks, then to a table of counts by ranks for analysis
Data Entry using Table of Counts by Ranks: The data is a square matrix of counts
- The numbers of rows and columns are the numbers of ranks of the two ordinal scales
- The lowest scale value is ranked 1
- Each cell contains the count of cases with that pair of ranks
R Codes
This panel presents the algorithm for Cohen's Kappa for ordinal data.
Firstly, the subroutine function that calculates Kappa from a matrix of counts by ranks:
# Cohen Kappa for ordinal data
# function for Kappa Algorithm using matrix of counts by ranks
CalCohenKappa <- function(mx)
{
print("Matrix of count by ranks")
print(mx)
g = nrow(mx) # number of ranks (the count matrix is g x g)
n = 0 # n = total number of paired values
mxSq <- matrix(data=0, nrow=g+1,ncol=g+1, byrow=TRUE) # data matrix with row and col totals added
for(i in 1:g) for(j in 1:g)
{
v = mx[i,j]
n = n + v
mxSq[i,j] = v
mxSq[i,g+1] = mxSq[i,g+1] + v # row total
mxSq[g+1,j] = mxSq[g+1,j] + v # col total
mxSq[g+1,g+1] = mxSq[g+1,g+1] + v # grand total
}
# print(mxSq) # optional print out
# Calculate Cohen's (weighted) Kappa
mxp <- matrix(data=0, nrow=g,ncol=g, byrow=TRUE) # observed proportions
mxpe <- matrix(data=0, nrow=g,ncol=g, byrow=TRUE) # proportions expected by chance
mxw <- matrix(data=0, nrow=g,ncol=g, byrow=TRUE) # disagreement weights
for(i in 1:g)for(j in 1:g)
{
mxp[i,j] = mxSq[i,j] / mxSq[g+1,g+1]
mxpe[i,j] = mxSq[i,g+1] * mxSq[g+1,j] / mxSq[g+1,g+1] / mxSq[g+1,g+1]
mxw[i,j] = abs(i-j) # linear disagreement weight, 0 on the diagonal
}
sumWP = 0
sumWPe = 0
sumW2P = 0
for(i in 1:g) for(j in 1:g)
{
sumWP = sumWP + mxw[i,j] * mxp[i,j]
sumWPe = sumWPe + mxw[i,j] * mxpe[i,j]
sumW2P = sumW2P + mxw[i,j] * mxw[i,j] * mxp[i,j]
}
kappa = 1.0 - sumWP / sumWPe #Cohen Kappa
se = sqrt((sumW2P - sumWP * sumWP) / (n * sumWPe * sumWPe)) # SE
print(paste("Cohen's Kappa=", kappa," SE=", se ))
print(paste0("95% CI = ", (kappa - 1.96 * se), " to ", (kappa + 1.96 * se)))
}
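The function can be called directly with any square matrix of counts; for example, with a hypothetical 2x2 matrix:
# Example call with a hypothetical 2x2 count matrix (illustration only)
CalCohenKappa(matrix(c(10, 2, 3, 15), nrow=2, byrow=TRUE))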
Program 1: data entry is by pairs of values
#Program 1: data entry by 2 columns of paired values
datValues = ("
1 1
1 1
1 1
1 1
1 1
2 2
2 2
2 2
2 2
2 2
3 3
3 3
3 3
3 3
3 3
1 2
1 3
1 3
1 2
1 2
2 1
2 3
2 3
2 1
2 1
3 1
3 2
3 1
3 2
3 2
")
datMx <- read.table(textConnection(datValues),header=FALSE) # two columns of paired ordinal scores
# datMx # optional printout for original data
n = nrow(datMx)
# ranking: convert the raw scores into ranks 1..g,
# ranking by the range of distinct values and not by the number of cases
tmpMx <- datMx # temporary scratch matrix
rankMx <- matrix(data=0, nrow=n,ncol=2, byrow=TRUE) # the scores converted to ranks
minv = 0
rank = 0
cycle = 0
while(minv<1e10 & cycle<2*n) # repeat until every value has been ranked
{
rank = rank + 1
cycle = cycle + 1
minv = min(tmpMx) # smallest value not yet ranked
if(minv<1e10)
{
for(i in 1:n)for(j in 1:2)if(tmpMx[i,j]==minv)
{
rankMx[i,j] = rank # assign the current rank
tmpMx[i,j] = 1e10 # mark this value as done
}
}
}
g = rank - 1 # number of ranks
# rankMx # optional printout of ranks
# Create count matrix
countMx <- matrix(data=0, nrow=g,ncol=g, byrow=TRUE)
for(i in 1:n)countMx[rankMx[i,1],rankMx[i,2]] = countMx[rankMx[i,1],rankMx[i,2]] + 1
# countMx # optional printout of count matrix
CalCohenKappa(countMx) # call function to calculate and present results
The results are
[1] "Matrix of count by ranks"
[,1] [,2] [,3]
[1,] 5 3 2
[2,] 3 5 2
[3,] 2 3 5
[1] "Cohen's Kappa= 0.278481012658228 SE= 0.14691180903751"
[1] "95% CI = -0.00946613305529259 to 0.566428158371748"
Program 2 allows data entry using the count matrix by ranks (if this has already been calculated)
# Program 2: data entry using matrix of counts by ranks
datCount = ("
5 3 2
3 5 2
2 3 5
")
mx <- read.table(textConnection(datCount),header=FALSE) # matrix of counts by ranks
CalCohenKappa(mx)
The results are
[1] "Matrix of count by ranks"
[,1] [,2] [,3]
[1,] 5 3 2
[2,] 3 5 2
[3,] 2 3 5
[1] "Cohen's Kappa= 0.278481012658228 SE= 0.14691180903751"
[1] "95% CI = -0.00946613305529259 to 0.566428158371748"