Copyright © 2020. All Rights Reserved.
Explanations
Introduction and References
This page provides explanations and support for the item analysis of multiple choice questions, as performed in the Calculations panel.
Example
Item analysis is a tool kit used during the development of multiple choice questions. It analyzes each question within the context of the whole test, and estimates two characteristics: how difficult the question is, and how well it discriminates (separates) those who score low and high in the test.

Nomenclature and clarifications

Users should be aware of the following nomenclature and data presentations used on this page
References

The creation and evaluation of multiple choice questions is a very large subject, and this page cannot hope to cover the references adequately. However, the following are easily accessible references on which this page and the accompanying program are based:

https://www.washington.edu/assessment/scanning-scoring/scoring/reports/item-analysis/ A short but excellent tutorial on item analysis
http://www.ericae.net/ft/tamu/Espy.htm A paper presented at a conference, containing the algorithms presented on this page. Please note that the paper contains a few typos, and some of the tables are misaligned. However, the contents are clear enough to follow easily, and detailed enough to be the basis of the algorithm on this page. To guard against loss in the future, as this article is in the public domain and fully referenced here, a copy of this article can be viewed here
https://en.wikipedia.org/wiki/Phi_coefficient The phi (φ) coefficient, by Wikipedia
https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient Point-Biserial Correlation, by Wikipedia
This panel takes the user through the steps of calculation and explains the results produced, using the default example data in the Calculations panel.
Default Example Data

The default example data are artificially created to demonstrate the algorithm, and do not reflect reality. They purport to be the results of a multiple choice test, using 10 questions (items) and 50 students (responses). Three inputs are required.
Table 1. Count for each Answer from each question is a descriptive table, as shown to the right.
Users should note that the data contain only those options that have been chosen at least once. For example, in a question with options A, B, C, and D, if C has never been chosen then only A, B, and D appear in the data and are included in the calculations. Table 1 is therefore important for users to review how often each option in each question has been chosen.

Table 2. Score for each Response
Response 1 correctly answered 9 out of the 10 questions, so p = 9/10 = 0.9, and is designated a high scorer (H).
Table 2 is descriptive, allowing users to check the score of each response and their distribution in the data.
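The scoring step behind Table 2 can be sketched in a few lines of Python. The answer rows and the answer key below are invented for illustration and are not the page's example data; the L/H/M labels come later, from the percentile cut-offs discussed in the Calculations section.

```python
# Score each response (row of answers) against the answer key.
# All data here are hypothetical.
answers = [
    list("ABCDABCDAB"),  # response 1
    list("ABCDABCDBA"),  # response 2 (last two answers wrong)
]
key = list("ABCDABCDAB")

scores = []
for row in answers:
    n_correct = sum(a == k for a, k in zip(row, key))
    scores.append(n_correct / len(key))  # proportion correct, p

print(scores)  # [1.0, 0.8]
```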
Table 3. Proportion Correct

Table 3 displays the proportion of correct answers for each question, and compares this with the proportion if the answer were randomly chosen, and with what it ideally should be. This is shown in the table to the right.
Proportions correct (pcorrect) and when chosen at random (prandom) have already been discussed with Table 1. The ideal proportion (pideal) is calculated as the midpoint between 1 and random. Using question 3 as an example:
pideal = (1 - prandom) / 2 + prandom = (1 - 0.33) / 2 + 0.33 = 0.67

Table 3 provides an initial estimate of the level of difficulty for each question, as the rate of success (pcorrect) can be compared with that from a random guess (prandom) and with the theoretically ideal proportion, midway between a perfect score and a random guess (pideal).

Table 4. Characteristics of Questions
Table 4 displays the formal indices that are used to define the characteristics of each question.
Item analysis is a generic term that applies to many tests that use multiple measurements. On this page, item analysis refers to the evaluation of questions (items) in multiple choice tests intended for the evaluation of students in the educational setting.
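The Table 3 proportions follow directly from the option counts of Table 1. A minimal sketch for a single question, using invented counts (three options were chosen at least once, and B is assumed to be the correct answer):

```python
# Hypothetical option counts for one question (option -> times chosen).
counts = {"A": 10, "B": 25, "C": 15}
correct_option = "B"

n = sum(counts.values())                 # number of responses
p_correct = counts[correct_option] / n   # observed proportion correct
p_random = 1 / len(counts)               # guessing among the offered options
p_ideal = (1 - p_random) / 2 + p_random  # midpoint between 1 and random

print(round(p_correct, 2), round(p_random, 2), round(p_ideal, 2))
```

With these counts, p_random is 0.33 and p_ideal works out to 0.67, matching the question 3 example above.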
Calculations
The development of multiple choice questions, and the selection from a bank of such questions for a particular examination, involve complex methodologies that are beyond the scope of this page, which covers only the statistical aspects concerning difficulty and discrimination. Users are reminded that evaluations of items are context sensitive: the results from the same questions will differ if the responders are from different age groups, educational experience, language, or other parameters that affect the ability to choose the correct answer.

The low (L) and high (H) scoring groups

Some of the indices in item analysis are calculated on a subset of the total population being tested: those with very high and very low total scores from all questions. This results in indices that reflect the importance of extreme values, and are useful to identify outliers for awards or failures. The general recommendation (see references) is to select responses whose total scores are in the top and bottom 27th percentile. The idea is to make the two groups as different as possible. In small data sets, with fewer responses and questions, a larger percentile may be necessary to obtain a sufficient sample size for stable results. In large data sets, a smaller percentile can be used to enhance the difference between those with low and high scores.

Measurement of Difficulty

Difficulty measures how likely the question is to be answered correctly. Two measurements of difficulty are provided
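The percentile grouping described above can be sketched as follows. The total scores are invented, and the handling of ties at the cut-offs is one reasonable choice, not necessarily the one used by the program on this page:

```python
# Label each response L, H, or M using the recommended 27th percentile.
# The total scores below are hypothetical.
scores = [3, 9, 5, 7, 8, 2, 6, 10, 4, 7]

ranked = sorted(scores)
k = max(1, round(0.27 * len(scores)))  # target group size at 27%
low_cut = ranked[k - 1]    # highest total score still in L
high_cut = ranked[-k]      # lowest total score still in H

# Ties at the cut-offs are kept, so groups can slightly exceed 27%.
labels = ["L" if s <= low_cut else "H" if s >= high_cut else "M"
          for s in scores]
print(labels)
```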
Measurement of Discrimination

Discrimination measures how well the question separates the responders with low and high total scores over all questions. Three measures of discrimination are provided
General Comments

The creation and evaluation of multiple choice questions, as well as the selection from a bank of questions for a test, are complex and sophisticated tasks that require the combined efforts of subject experts, educational psychologists, and statisticians. Most of the expertise required is beyond the scope of this page, and inexperienced users are strongly advised to seek guidance from those with appropriate expertise. What follows are simple and elementary comments on how the difficulty and discrimination indices are used.

If an examination is criteria based, evaluating whether the responses have achieved a level of competence represented by correct answers, then selecting questions based on the difficulty index is appropriate.

If an examination is normative based, evaluating the responses against each other, then selecting questions based on discrimination is appropriate. If, in addition, the objective is to rank the responses, then a discrimination index based on the total population, such as the Biserial Correlation ρ, is more appropriate. On the other hand, if the objective is to identify the outliers, to select the high achievers for awards or the low achievers for exclusion, then an index based on the low and high scoring populations, such as Idxdiscrimination or φ, is appropriate.
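For concreteness, the extreme-group discrimination index mentioned above (proportion correct in the high group minus proportion correct in the low group) can be sketched as follows. The response data and group labels are invented for illustration:

```python
# Discrimination index for one question, using the extreme groups only.
# correct[i] is 1 if response i answered this question correctly;
# group[i] is its L/M/H label from the total score. All hypothetical.
correct = [1, 0, 0, 1, 1, 1, 1, 1, 0, 0]
group   = ["H", "H", "L", "M", "H", "L", "M", "M", "L", "M"]

p_h = sum(c for c, g in zip(correct, group) if g == "H") / group.count("H")
p_l = sum(c for c, g in zip(correct, group) if g == "L") / group.count("L")
d = p_h - p_l  # proportion correct in H minus proportion correct in L

# Whole-population indices such as the point-biserial correlation
# instead use every response's total score, not only the extreme groups.
```

Here d comes out to 1/3: two of the three high scorers answered correctly, against one of the three low scorers.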
Hints on Data Entry
This panel provides support for data entry and interpretation of results only. Detailed discussions on Item Analysis for Multiple Choice Questions are provided in the Introduction panel.
Javascript Program
Data Entry

Three (3) sets of data are required.

Multiple Choice Responses is a table of test data for item analysis.
Percentile is the percentile used to identify the high and low scorers for the purpose of estimating the difficulty and discrimination indices. The recommended value is 27% for low scorers (and its corresponding 100 - 27 = 73% for high scorers). This default value should be used unless the user has reasons to change it. More discussions on this issue are provided in the Introduction section.

Default Example Data

The default example data are artificially created to demonstrate the algorithm, and do not reflect reality. They purport to be the results of a multiple choice test, using 10 questions (items) on 50 students (responses).
The percentile setting to identify high and low scorers is set at 27 (and its corresponding 100 - 27 = 73 for high scorers).

Please note: the data were generated using random numbers, so that the proportions of correct answers were set at around 0.5. This results in the calculated indices being lower than expected from a real set of multiple choice questions. Users should not be confused into thinking the levels of difficulty and discrimination presented on this page are at the expected level.

Results

Detailed explanations of results are presented in the Explanation panel. The following are summary descriptions:

Table 1 displays the number (count) and proportion (p) for each answer in each question, matched against the proportion if the answers were randomly chosen.
Table 2 displays the score (n) and proportion of correct answers from each response (row), and the label as low (L), high (H), or medium (M) scorer.
Table 3 displays the number (Ncorrect) and proportion (pcorrect) of correct answers for each question, matching these against the proportion if the question were answered at random (prandom), and the theoretically ideal proportion (pideal).
Table 4 displays the results of item analysis for each question. Detailed descriptions are presented in the Explanation panel.

References

Calculations are based on the algorithm described in the web page http://www.ericae.net/ft/tamu/Espy.htm. To guard against loss in the future, as this article is in the public domain and fully referenced here, a copy of the article can be viewed here
Multiple Choice Responses
The data is a table of answers:
Each row is from a response.
Each column is from a question, separated by spaces or tabs.
Each cell is the answer provided for that question by that response.

Correct Answers

Percentile to Mark High and Low Scores
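The response table format described above (one row per response, answers separated by spaces or tabs) can be read with a short sketch; the answer letters below are invented:

```python
# Parse a whitespace-separated table of answers into rows of cells.
# The raw text here is hypothetical example input.
raw = """A B C D
A C C D
B B C A"""

table = [line.split() for line in raw.splitlines()]
# table[row][col] is the answer given by response `row` to question `col`
print(table[1][1])  # C
```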