Related links:
Classification by Basic Bayes Probability Program Page
Classification by Naive Bayes Probability Program Page
Introduction
Simple Bayes
Basic Bayes
Naive Bayes
Discussions
References
This page provides explanation and support for the two programs in the Classification by Basic Bayes Probability Program Page and the Classification by Naive Bayes Probability Program Page. As the programs and this explanation page use specific terms and abbreviations, and these are best demonstrated with examples, this introduction panel describes the example used and the terminology.
The format of data entry and explanation of results produced are in the Help and Hints panel of the program pages.
The contents of the other panels of this page are
 The Simple Bayes panel provides the calculations used in the one-predictor model, the simplest Bayesian model
 The Basic Bayes panel provides the same calculations in a two-predictor model, the default example in the Classification by Basic Bayes Probability Program Page
 The Naive Bayes panel provides the calculations used in the Naive Bayes model, the default example in the Classification by Naive Bayes Probability Program Page
 The Discussions panel provides detailed explanations and comparisons of the two models, an extension of this Introduction panel
 The References panel presents some references that may be useful to users
The remainder of this panel provides a description of the example data used in this and the two program pages, and brief explanations for the terms used in these pages.
The Example
The same example is used in the two program pages and this explanation page.
Predictor  Outcome 
Hair  Eye  Pattern  French  German  Italian 
Dark (+-)  Brown (+--)  +-+--  1  1  3 
Dark (+-)  Blue (-+-)  +--+-  3  1  2 
Dark (+-)  Others (--+)  +---+  1  1  1 
Light (-+)  Brown (+--)  -++--  2  1  2 
Light (-+)  Blue (-+-)  -+-+-  1  5  1 
Light (-+)  Others (--+)  -+--+  2  1  1 
We wish to develop a Bayesian model to identify the ethnicity of people, based on hair and eye color. To build our model, we recruited 10 people each of known French, German, and Italian ethnicity, and observed their hair and eye color. We then use the Bayesian model to predict ethnicity from hair and eye color, in a community with an expected ratio of French:German:Italian of 3:2:1. In addition, we introduced a bias in our prediction, as cost ratios of 1:2:1 for French:German:Italian. The counts of each combination are presented in the table to the right, and the terms and abbreviations used are explained as follows
Terminology
The following terms are used to explain and present results of the Bayesian Probability models on StatsToDo. Details of calculations and explanations are presented when each model is discussed in the following panels.
The Outcome (o) is what we want to predict, in this example consisting of 3 mutually exclusive ethnicities: French, German, and Italian
The predictors are what we use to predict the outcome, in our example hair and eye color
 Each predictor has two or more mutually exclusive attributes (a), in our example dark and light for hair color, and brown, blue, and others for eye color. If one of the attributes in a predictor is positive (+), all others are negative (-). StatsToDo uses combinations of +s and -s to represent each predictor. In hair color, +- for dark and -+ for light. In eye color, +-- for brown, -+- for blue, and --+ for others
 When there is more than one predictor, the combination of attributes forms a pattern (p). The number of possible patterns is the product of the numbers of attributes in all predictors. StatsToDo represents each predictor as a string of +s and -s, and concatenates these strings to represent a pattern. In our example, there are 2x3=6 patterns of hair and eye color: dark_brown (+-+--), dark_blue (+--+-), dark_others (+---+), light_brown (-++--), light_blue (-+-+-), light_others (-+--+)
 Please note: the use of a text string of +s and -s to represent predictors is not a universal practice, but was developed specifically for StatsToDo as a compromise between the needs of brevity, clarity, and transferring large tables of data to and from web pages (copy and paste)
Prediction. In the Bayesian sense, prediction is not foretelling the future, nor a method of discovering what may be true. It is the mathematical manipulation of probabilities, before and after the application of predictors. In our example, prediction is to estimate the probability of a person being French, German, or Italian, based on observing his/her hair and eye color
Probability is a number between 0 (0%, no confidence) and 1 (100%, certainty). Although computationally it is treated the same as probability in other domains, in the Bayesian sense probability represents how confident we are in our conclusions. The following probabilities and their abbreviations are used in StatsToDo.
 Probabilities calculated during model development
 The probability of an outcome is P(o), of being positive in an attribute P(+) or P(a), and of a pattern of positives and negatives P(p)
 The Bayesian model uses conditional probability, the probability of one thing in the presence of (given) another. This is abbreviated as P(y|x), the probability of y given x
 The probability of being positive in an attribute given an outcome, P(+|o) or P(a|o), is estimated from reference data during model development. A collection of P(+|o)s forms the coefficients of the Bayesian model when there is one predictor
 The probability of a pattern, a combination of positives and negatives in a list of attributes, given an outcome, P(p|o), is estimated from reference data during model development. A collection of P(p|o)s forms the coefficients of the Bayesian model when there are multiple predictors
 The probability of an outcome given a positive attribute, P(o|+), or a pattern, P(o|p), also called the Maximum Likelihood, describes the model. This is calculated from the coefficients of the model, without any other considerations
 Probabilities when using the model
 The a priori probability π is the probability of the outcomes before we apply the Bayesian model. In our example, we expect the background ratio of French:German:Italian to be 3:2:1, transformed (by dividing each by the total) to a priori probabilities of 0.5:0.333:0.167
 The a posteriori probability, P(o|+,π) or P(o|p,π), is the probability we predict, altering the a priori probability using our Bayesian model. This is done with the attribute that is positive when using a single predictor (P(o|+,π)), and with a pattern of attributes when using multiple predictors (P(o|p,π)). The a posteriori probability is the Bayesian Probability, what this page is all about.
 The a posteriori probability may include a bias, using cost coefficients c, P(o|+,π,c) or P(o|p,π,c). In this example, we assigned the relative costs (importance) of French:German:Italian as 1:2:1, transformed (by dividing each by the total) to cost coefficients of 0.25:0.5:0.25
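The transformation of ratios into a priori probabilities and cost coefficients described above can be sketched in a few lines. This is illustrative code only, not part of the StatsToDo programs, and the names are invented for this sketch.

```python
def normalise(ratios):
    """Divide each ratio by the total, so the results sum to 1."""
    total = sum(ratios)
    return [r / total for r in ratios]

# Background ratio French:German:Italian of 3:2:1 -> a priori probabilities
prior = normalise([3, 2, 1])   # approx [0.5, 0.333, 0.167]

# Relative cost (importance) ratio of 1:2:1 -> cost coefficients
cost = normalise([1, 2, 1])    # [0.25, 0.5, 0.25]
```

The same helper serves for both arrays, since both are simply ratios rescaled to sum to 1.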
This panel presents the simplest Bayesian Probability model, using a single predictor to predict the outcomes. The purpose is to familiarize the reader with the basic concepts and mathematics of Bayes probability.
The data from the Introduction panel are compressed to produce two models
 Predicting 3 ethnicities using 2 hair colors
 Predicting 3 ethnicities using 3 eye colors
The master data file and terminology are presented in the Introduction panel, and discussions of concepts are presented in the Discussions panel, so they will not be repeated here. This panel merely takes the reader through the steps of the calculations
Model 1: Hair color and Ethnicity
Table 1:1 Model Information
Hair  French  German  Italian 
Dark (+-)  5  3  6 
Light (-+)  5  7  4 
Total  10  10  10 

A priori  0.5  0.333  0.167 
Cost  0.25  0.5  0.25 
We are using hair color, with 2 attributes (a = dark, light), to predict an outcome with 3 ethnicities (o = French, German, Italian). The reference data we used to build our model are 10 cases from each ethnicity. Once we have built the model, we will apply it in a population where the ratio of French:German:Italian is 3:2:1, converted into proportions in the a priori array as [0.5, 0.333, 0.167]. As we are searching for a German translator, we set the relative bias of French:German:Italian to 1:2:1, converted into proportions in the cost array as [0.25, 0.5, 0.25].
After collecting the data, the necessary information to build our model is presented to the right in table 1.1 Model Information.
Building the model: Coefficients P(a|o)
Table 1:2 P(a|o) : P(hairColor|ethnicity)
Hair  French  German  Italian  Total 
Dark (+-)  5/10=0.5  3/10=0.3  6/10=0.6  0.5+0.3+0.6=1.4 
Light (-+)  5/10=0.5  7/10=0.7  4/10=0.4  0.5+0.7+0.4=1.6 
The model consists of the probability coefficients from the reference data. The probability of each attribute (a) given each outcome (o) is calculated by dividing the count of that attribute for that outcome by the total number of cases with that outcome
P(a_{i,j}|o_{j}) = N_{i,j}/N_{j} for each outcome j
The results are shown in table 1.2 P(hairColor|ethnicity) to the right.
Describing the model: Maximum Likelihood P(o|a)
Table 1:3 P(o|a) : P(ethnicity|hairColor)
Hair  French  German  Italian 
Dark (+-)  0.5/1.4=0.357  0.3/1.4=0.214  0.6/1.4=0.429 
Light (-+)  0.5/1.6=0.313  0.7/1.6=0.438  0.4/1.6=0.250 
The model is described by the probability of the outcomes given each attribute, P(o|a), calculated from the coefficient table.
For each outcome j and attribute i, the calculation is
P(o_{j}|a_{i,j}) = P(a_{i,j}|o_{j}) / sum(P(a_{i,j}|o_{j})) over all outcomes
The results are shown in table 1.3 P(ethnicity|hairColor) to the right. The model suggests that, without knowing the a priori probability or including a bias, those with dark hair are most likely to be Italian (42.9%), and those with light hair most likely to be German (43.8%)
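These two steps, counts to coefficients and coefficients to maximum likelihood, can be verified with a short sketch. This is illustrative Python, not the StatsToDo source; the variable names are invented here.

```python
counts = {"dark": [5, 3, 6], "light": [5, 7, 4]}   # columns: French, German, Italian
n = [10, 10, 10]                                   # reference sample size per outcome

# Coefficients P(a|o): attribute count divided by the outcome's sample size
coef = {a: [c / s for c, s in zip(row, n)] for a, row in counts.items()}

# Maximum Likelihood P(o|a): normalise each attribute's coefficients to sum to 1
ml = {a: [p / sum(row) for p in row] for a, row in coef.items()}
# ml["dark"] reproduces Table 1:3, approx [0.357, 0.214, 0.429]
```

Running the same two lines on the "light" row reproduces the second row of Table 1:3.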
Using the model: Bayes Probability P(o|a,π)
Table 1:4 P(a|o,π) : P(hairColor|ethnicity,π)
Hair  French  German  Italian  Total 
Dark (+-)  0.5x0.5=0.25  0.3x0.333=0.1  0.6x0.167=0.1  0.25+0.1+0.1=0.45 
Light (-+)  0.5x0.5=0.25  0.7x0.333=0.233  0.4x0.167=0.067  0.25+0.233+0.067=0.55 
Table 1.5 Bayes Probability P(o|a,π) : P(ethnicity|hairColor,π)
Hair  French  German  Italian 
Dark (+-)  0.25/0.45=0.556  0.1/0.45=0.222  0.1/0.45=0.222 
Light (-+)  0.25/0.55=0.455  0.233/0.55=0.424  0.067/0.55=0.121 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5 (50%):0.333 (33.3%):0.167 (16.7%), the P(a|o) coefficients are first multiplied by the a priori probabilities before the final P(o|a) are calculated
P(a_{i,j}|o_{j},π_{j}) = P(a_{i,j}|o_{j}) x π_{j}
P(o_{j}|a_{i,j},π_{j}) = P(a_{i,j}|o_{j},π_{j}) / sum(P(a_{i,j}|o_{j},π_{j})) over all outcomes
The results are shown above and to the right, in table 1.4 P(hairColor|ethnicity,π) and table 1.5 Bayes Probability P(ethnicity|hairColor,π). The Bayesian probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7%, being French is the most likely for both dark hair (55.6%) and light hair (45.5%), although with light hair, being German comes a close second at 42.4%.
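The multiply-then-normalise step above can be sketched as follows, taking the Table 1:2 coefficients as given. Again this is illustrative code with invented names, not the StatsToDo implementation.

```python
coef = {"dark": [0.5, 0.3, 0.6], "light": [0.5, 0.7, 0.4]}  # P(a|o) from Table 1:2
prior = [3/6, 2/6, 1/6]   # a priori probabilities from the 3:2:1 ratio

def bayes(p_a_given_o, prior):
    """Multiply P(a|o) by the a priori probabilities, then normalise over outcomes."""
    w = [p * q for p, q in zip(p_a_given_o, prior)]
    total = sum(w)
    return [x / total for x in w]

# bayes(coef["dark"], prior) reproduces Table 1.5, approx [0.556, 0.222, 0.222]
```

Note that normalising at the end means the prior only needs to be proportional to the population ratio; using [3, 2, 1] directly would give the same answer.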
Adding a bias: Bayes Probability P(o|a,π,c)
Table 1:6 P(a|o,π,c) : P(hairColor|ethnicity,π,c)
Hair  French  German  Italian  Total 
Dark (+-)  0.25x0.25=0.063  0.1x0.5=0.05  0.1x0.25=0.025  0.063+0.05+0.025=0.138 
Light (-+)  0.25x0.25=0.063  0.233x0.5=0.117  0.067x0.25=0.017  0.063+0.117+0.017=0.196 
Table 1.7 Probability with bias P(o|a,π,c) : P(ethnicity|hairColor,π,c)
Hair  French  German  Italian 
Dark (+-)  0.063/0.138=0.455  0.05/0.138=0.364  0.025/0.138=0.182 
Light (-+)  0.063/0.196=0.319  0.117/0.196=0.596  0.017/0.196=0.085 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5 (50%):0.333 (33.3%):0.167 (16.7%), with a bias towards deciding the outcome as German expressed as cost coefficients for French:German:Italian of 0.25:0.5:0.25, each P(a|o,π) from Table 1:4 is multiplied by the corresponding cost coefficient before the final P(o|a) are calculated
P(a_{i,j}|o_{j},π_{j},c_{j}) = P(a_{i,j}|o_{j},π_{j}) x c_{j}
P(o_{j}|a_{i,j},π_{j},c_{j}) = P(a_{i,j}|o_{j},π_{j},c_{j}) / sum(P(a_{i,j}|o_{j},π_{j},c_{j})) over all outcomes
The results are shown above and to the right, in table 1.6 P(hairColor|ethnicity,π,c) and table 1.7 Probability with bias P(ethnicity|hairColor,π,c). The biased probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7% and a bias of 25%:50%:25%, we would conclude a person with dark hair to be French at 45.5%, and one with light hair to be German at 59.6%.
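Adding the cost coefficients is one extra multiplication before normalising. A minimal sketch, with invented names and assuming the Table 1:2 coefficients:

```python
coef = {"dark": [0.5, 0.3, 0.6], "light": [0.5, 0.7, 0.4]}  # P(a|o) from Table 1:2
prior = [3/6, 2/6, 1/6]     # a priori probabilities
cost = [0.25, 0.5, 0.25]    # cost coefficients biased towards German

def biased_bayes(p_a_given_o, prior, cost):
    """Weight P(a|o) by prior and cost, then normalise over the outcomes."""
    w = [p * q * c for p, q, c in zip(p_a_given_o, prior, cost)]
    total = sum(w)
    return [x / total for x in w]

# biased_bayes(coef["light"], prior, cost) reproduces Table 1.7,
# approx [0.319, 0.596, 0.085]: light hair is now classified as German
```

The bias changes the light-hair decision from French (Table 1.5) to German, which is the intended effect of weighting German twice as heavily.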
Model 2: Eye color and Ethnicity
Table 2:1 Model Information
Eye  French  German  Italian 
Brown (+--)  3  2  5 
Blue (-+-)  4  6  3 
Others (--+)  3  2  2 
Total  10  10  10 

A priori  0.5  0.333  0.167 
Cost  0.25  0.5  0.25 
This section repeats the calculations and results, predicting the outcomes of ethnicity from the predictor eye color. The purpose is to demonstrate that the model works with multiple attributes as well as multiple outcomes.
We are using eye color, with 3 attributes (a = brown, blue, and others), to predict an outcome with 3 ethnicities (o = French, German, Italian). The reference data we used to build our model are 10 cases from each ethnicity. Once we have built the model, we will apply it in a population where the ratio of French:German:Italian is 3:2:1, converted into proportions in the a priori array as [0.5, 0.333, 0.167]. As we are searching for a German translator, we set the relative bias of French:German:Italian to 1:2:1, converted into proportions in the cost array as [0.25, 0.5, 0.25].
After collecting the data, the necessary information to build our model is presented to the right in table 2.1 Model Information.
Building the model: Coefficients P(a|o)
Table 2:2 P(a|o) : P(eyeColor|ethnicity)
Eye  French  German  Italian  Total 
Brown (+--)  3/10=0.3  2/10=0.2  5/10=0.5  0.3+0.2+0.5=1.0 
Blue (-+-)  4/10=0.4  6/10=0.6  3/10=0.3  0.4+0.6+0.3=1.3 
Others (--+)  3/10=0.3  2/10=0.2  2/10=0.2  0.3+0.2+0.2=0.7 
The model consists of the probability coefficients from the reference data. The probability of each attribute (a) given each outcome (o) is calculated by dividing the count of that attribute for that outcome by the total number of cases with that outcome
P(a_{i,j}|o_{j}) = N_{i,j}/N_{j} for each outcome j
The results are shown in table 2.2 P(eyeColor|ethnicity) to the right.
Describing the model: Maximum Likelihood P(o|a)
Table 2:3 P(o|a) : P(ethnicity|eyeColor)
Eye  French  German  Italian 
Brown (+--)  0.3/1.0=0.3  0.2/1.0=0.2  0.5/1.0=0.5 
Blue (-+-)  0.4/1.3=0.308  0.6/1.3=0.462  0.3/1.3=0.231 
Others (--+)  0.3/0.7=0.429  0.2/0.7=0.286  0.2/0.7=0.286 
The model is described by the probability of the outcomes given each attribute, P(o|a), calculated from the coefficient table.
For each outcome j and attribute i, the calculation is
P(o_{j}|a_{i,j}) = P(a_{i,j}|o_{j}) / sum(P(a_{i,j}|o_{j})) over all outcomes
The results are shown in table 2.3 P(ethnicity|eyeColor) to the right. The model suggests that, without knowing the a priori probability or including a bias, those with brown eyes are most likely to be Italian (50.0%), those with blue eyes German (46.2%), and those with other color eyes French (42.9%)
Using the model: Bayes Probability P(o|a,π)
Table 2:4 P(a|o,π) : P(eyeColor|ethnicity,π)
Eye  French  German  Italian  Total 
Brown (+--)  0.3x0.5=0.15  0.2x0.333=0.067  0.5x0.167=0.083  0.15+0.067+0.083=0.3 
Blue (-+-)  0.4x0.5=0.2  0.6x0.333=0.2  0.3x0.167=0.05  0.2+0.2+0.05=0.45 
Others (--+)  0.3x0.5=0.15  0.2x0.333=0.067  0.2x0.167=0.033  0.15+0.067+0.033=0.25 
Table 2.5 Bayes Probability P(o|a,π) : P(ethnicity|eyeColor,π)
Eye  French  German  Italian 
Brown (+--)  0.15/0.3=0.5  0.067/0.3=0.222  0.083/0.3=0.278 
Blue (-+-)  0.2/0.45=0.444  0.2/0.45=0.444  0.05/0.45=0.111 
Others (--+)  0.15/0.25=0.6  0.067/0.25=0.267  0.033/0.25=0.133 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5 (50%):0.333 (33.3%):0.167 (16.7%), the P(a|o) coefficients are first multiplied by the a priori probabilities before the final P(o|a) are calculated
P(a_{i,j}|o_{j},π_{j}) = P(a_{i,j}|o_{j}) x π_{j}
P(o_{j}|a_{i,j},π_{j}) = P(a_{i,j}|o_{j},π_{j}) / sum(P(a_{i,j}|o_{j},π_{j})) over all outcomes
The results are shown above and to the right, in table 2.4 P(eyeColor|ethnicity,π) and table 2.5 Bayes Probability P(ethnicity|eyeColor,π). The Bayesian probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7%, those with blue eyes are equally likely to be French or German (44.4% each), and those with brown or other color eyes most likely to be French (50% and 60%).
Adding a bias: Bayes Probability P(o|a,π,c)
Table 2:6 P(a|o,π,c) : P(eyeColor|ethnicity,π,c)
Eye  French  German  Italian  Total 
Brown (+--)  0.15x0.25=0.038  0.067x0.5=0.033  0.083x0.25=0.021  0.038+0.033+0.021=0.092 
Blue (-+-)  0.2x0.25=0.05  0.2x0.5=0.1  0.05x0.25=0.013  0.050+0.100+0.013=0.163 
Others (--+)  0.15x0.25=0.038  0.067x0.5=0.033  0.033x0.25=0.008  0.038+0.033+0.008=0.079 
Table 2.7 Probability with bias P(o|a,π,c) : P(ethnicity|eyeColor,π,c)
Eye  French  German  Italian 
Brown (+--)  0.038/0.092=0.409  0.033/0.092=0.364  0.021/0.092=0.227 
Blue (-+-)  0.05/0.163=0.308  0.1/0.163=0.615  0.013/0.163=0.077 
Others (--+)  0.038/0.079=0.474  0.033/0.079=0.421  0.008/0.079=0.105 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5 (50%):0.333 (33.3%):0.167 (16.7%), with a bias towards deciding the outcome as German expressed as cost coefficients for French:German:Italian of 0.25:0.5:0.25, each P(a|o,π) from Table 2:4 is multiplied by the corresponding cost coefficient before the final P(o|a) are calculated
P(a_{i,j}|o_{j},π_{j},c_{j}) = P(a_{i,j}|o_{j},π_{j}) x c_{j}
P(o_{j}|a_{i,j},π_{j},c_{j}) = P(a_{i,j}|o_{j},π_{j},c_{j}) / sum(P(a_{i,j}|o_{j},π_{j},c_{j})) over all outcomes
The results are shown above and to the right, in table 2.6 P(eyeColor|ethnicity,π,c) and table 2.7 Probability with bias P(ethnicity|eyeColor,π,c). The biased probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7% and a bias of 25%:50%:25%, we would conclude a person with blue eyes to be German at 61.5%, and one with brown or other color eyes to be French at 40.9% and 47.4%.
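Both single-predictor models above follow the same four steps, which can be collected into one function running from raw counts to the final probabilities. This is a sketch with invented names, not the StatsToDo implementation:

```python
def classify(counts, n, prior, cost=None):
    """Counts -> P(a|o) coefficients -> weight by prior (and optional cost) -> P(o|a)."""
    coef = [c / s for c, s in zip(counts, n)]
    if cost is None:
        cost = [1.0] * len(coef)          # no bias: all outcomes weighted equally
    w = [p * q * k for p, q, k in zip(coef, prior, cost)]
    total = sum(w)
    return [x / total for x in w]

prior = [3/6, 2/6, 1/6]
cost = [0.25, 0.5, 0.25]

# Blue eyes, counts 4:6:3 out of 10 per outcome -> German at about 61.5% (Table 2.7)
blue = classify([4, 6, 3], [10, 10, 10], prior, cost)
```

Omitting the `cost` argument reproduces the unbiased Bayes probabilities of Table 2.5.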
This panel presents the calculations used in the Classification by Basic Bayes Probability Program Page, with the addition of coefficients for the a priori probabilities and costs. The data are the same as those described in the Introduction panel of this page.
The Basic Bayesian model is structurally and computationally identical to the simple Bayesian model described in the previous panel. The only difference is that individual attributes are replaced with patterns of combined attributes, as shown in table 1.
Table 1. Model Information
Patterns  French  German  Italian 
Dark_Brown (+-+--)  1  1  3 
Dark_Blue (+--+-)  3  1  2 
Dark_Others (+---+)  1  1  1 
Light_Brown (-++--)  2  1  2 
Light_Blue (-+-+-)  1  5  1 
Light_Others (-+--+)  2  1  1 

A priori  0.5  0.333  0.167 
Cost  0.25  0.5  0.25 
A note to repeat the definitions used
 A predictor is a variable used to predict, e.g. hair color, eye color
 An attribute (a) is a feature of a predictor, e.g. dark hair, brown eyes
 A pattern (p) is a combination of attributes, e.g. Dark_Brown, +-+--
 An outcome (o) is what is being predicted, e.g. French, German, Italian
We are using patterns of hair and eye color to predict outcomes of ethnicity. There are 6 patterns, being the combinations of 2 hair color attributes (dark, light) and 3 eye color attributes (brown, blue, other). The reference data we used to build our model are 10 cases from each ethnicity. Once we have built the model, we will apply it in a population where the ratio of French:German:Italian is 3:2:1, converted into proportions in the a priori array as [0.5, 0.333, 0.167]. As we are searching for a German translator, we set the relative bias of French:German:Italian to 1:2:1, converted into proportions in the cost array as [0.25, 0.5, 0.25].
After collecting the data, the necessary information to build our model is presented to the right in table 1 Model Information.
Building the model: Coefficients P(p|o)
Table 2. P(p|o) : P(pattern|ethnicity)
Pattern  French  German  Italian  Total 
Dark_Brown (+-+--)  1/10=0.1  1/10=0.1  3/10=0.3  0.1+0.1+0.3=0.5 
Dark_Blue (+--+-)  3/10=0.3  1/10=0.1  2/10=0.2  0.3+0.1+0.2=0.6 
Dark_Others (+---+)  1/10=0.1  1/10=0.1  1/10=0.1  0.1+0.1+0.1=0.3 
Light_Brown (-++--)  2/10=0.2  1/10=0.1  2/10=0.2  0.2+0.1+0.2=0.5 
Light_Blue (-+-+-)  1/10=0.1  5/10=0.5  1/10=0.1  0.1+0.5+0.1=0.7 
Light_Others (-+--+)  2/10=0.2  1/10=0.1  1/10=0.1  0.2+0.1+0.1=0.4 
The model consists of the probability coefficients from the reference data. The probability of each pattern (p) given each outcome (o) is calculated by dividing the count of that pattern for that outcome by the total number of cases with that outcome
P(p_{i,j}|o_{j}) = N_{i,j}/N_{j} for each outcome j
The results are shown in table 2 P(pattern|ethnicity) to the right.
Describing the model: Maximum Likelihood P(o|p)
Table 2:3 P(o|p) : P(ethnicity|pattern)
Pattern  French  German  Italian 
Dark_Brown (+-+--)  0.1/0.5=0.2  0.1/0.5=0.2  0.3/0.5=0.6 
Dark_Blue (+--+-)  0.3/0.6=0.5  0.1/0.6=0.167  0.2/0.6=0.333 
Dark_Others (+---+)  0.1/0.3=0.333  0.1/0.3=0.333  0.1/0.3=0.333 
Light_Brown (-++--)  0.2/0.5=0.4  0.1/0.5=0.2  0.2/0.5=0.4 
Light_Blue (-+-+-)  0.1/0.7=0.143  0.5/0.7=0.714  0.1/0.7=0.143 
Light_Others (-+--+)  0.2/0.4=0.5  0.1/0.4=0.25  0.1/0.4=0.25 
The model is described by the probability of the outcomes given each pattern, P(o|p), calculated from the coefficient table.
For each outcome j and pattern i, the calculation is
P(o_{j}|p_{i,j}) = P(p_{i,j}|o_{j}) / sum(P(p_{i,j}|o_{j})) over all outcomes
The results are shown in table 2.3 P(ethnicity|pattern) to the right. The model suggests that, without knowing the a priori probability or including a bias, those with dark hair and brown eyes are most likely to be Italian (60.0%), those with light hair and blue eyes German (71.4%), and those with dark hair and blue eyes or light hair and other color eyes French (50% each), while the other patterns do not clearly discriminate between the 3 ethnicities.
Using the model: Bayes Probability P(o|p,π)
Table 2:4 P(p|o,π) : P(pattern|ethnicity,π)
Pattern  French  German  Italian  Total 
Dark_Brown (+-+--)  0.1x0.5=0.05  0.1x0.333=0.033  0.3x0.167=0.05  0.05+0.033+0.05=0.133 
Dark_Blue (+--+-)  0.3x0.5=0.15  0.1x0.333=0.033  0.2x0.167=0.033  0.15+0.033+0.033=0.217 
Dark_Others (+---+)  0.1x0.5=0.05  0.1x0.333=0.033  0.1x0.167=0.017  0.05+0.033+0.017=0.1 
Light_Brown (-++--)  0.2x0.5=0.1  0.1x0.333=0.033  0.2x0.167=0.033  0.1+0.033+0.033=0.167 
Light_Blue (-+-+-)  0.1x0.5=0.05  0.5x0.333=0.167  0.1x0.167=0.017  0.05+0.167+0.017=0.233 
Light_Others (-+--+)  0.2x0.5=0.1  0.1x0.333=0.033  0.1x0.167=0.017  0.1+0.033+0.017=0.15 
Table 2.5 Bayes Probability P(o|p,π) : P(ethnicity|pattern,π)
Pattern  French  German  Italian 
Dark_Brown (+-+--)  0.05/0.133=0.375  0.033/0.133=0.25  0.05/0.133=0.375 
Dark_Blue (+--+-)  0.15/0.217=0.692  0.033/0.217=0.154  0.033/0.217=0.154 
Dark_Others (+---+)  0.05/0.1=0.5  0.033/0.1=0.333  0.017/0.1=0.167 
Light_Brown (-++--)  0.1/0.167=0.6  0.033/0.167=0.2  0.033/0.167=0.2 
Light_Blue (-+-+-)  0.05/0.233=0.214  0.167/0.233=0.714  0.017/0.233=0.071 
Light_Others (-+--+)  0.1/0.15=0.667  0.033/0.15=0.222  0.017/0.15=0.111 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5 (50%):0.333 (33.3%):0.167 (16.7%), the P(p|o) coefficients are first multiplied by the a priori probabilities before the final P(o|p) are calculated
P(p_{i,j}|o_{j},π_{j}) = P(p_{i,j}|o_{j}) x π_{j}
P(o_{j}|p_{i,j},π_{j}) = P(p_{i,j}|o_{j},π_{j}) / sum(P(p_{i,j}|o_{j},π_{j})) over all outcomes
The results are shown above and to the right, in table 2.4 P(pattern|ethnicity,π) and table 2.5 Bayes Probability P(ethnicity|pattern,π). The Bayesian probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7%, those with light hair and blue eyes are most likely to be German at 71.4%, and all other combinations most likely to be French, although those with dark hair and brown eyes are equally likely to be Italian at 37.5%.
Adding a bias: Bayes Probability P(o|p,π,c)
Table 2:6 P(p|o,π,c) : P(pattern|ethnicity,π,c)
Pattern  French  German  Italian  Total 
Dark_Brown (+-+--)  0.05x0.25=0.013  0.033x0.5=0.017  0.05x0.25=0.013  0.013+0.017+0.013=0.042 
Dark_Blue (+--+-)  0.15x0.25=0.038  0.033x0.5=0.017  0.033x0.25=0.008  0.038+0.017+0.008=0.063 
Dark_Others (+---+)  0.05x0.25=0.013  0.033x0.5=0.017  0.017x0.25=0.004  0.013+0.017+0.004=0.033 
Light_Brown (-++--)  0.1x0.25=0.025  0.033x0.5=0.017  0.033x0.25=0.008  0.025+0.017+0.008=0.050 
Light_Blue (-+-+-)  0.05x0.25=0.013  0.167x0.5=0.083  0.017x0.25=0.004  0.013+0.083+0.004=0.1 
Light_Others (-+--+)  0.1x0.25=0.025  0.033x0.5=0.017  0.017x0.25=0.004  0.025+0.017+0.004=0.046 
Table 2.7 Probability with bias P(o|p,π,c) : P(ethnicity|pattern,π,c)
Pattern  French  German  Italian 
Dark_Brown (+-+--)  0.013/0.042=0.3  0.017/0.042=0.4  0.013/0.042=0.3 
Dark_Blue (+--+-)  0.038/0.063=0.6  0.017/0.063=0.267  0.008/0.063=0.133 
Dark_Others (+---+)  0.013/0.033=0.375  0.017/0.033=0.5  0.004/0.033=0.125 
Light_Brown (-++--)  0.025/0.05=0.5  0.017/0.05=0.333  0.008/0.05=0.167 
Light_Blue (-+-+-)  0.013/0.1=0.125  0.083/0.1=0.833  0.004/0.1=0.042 
Light_Others (-+--+)  0.025/0.046=0.545  0.017/0.046=0.364  0.004/0.046=0.091 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5 (50%):0.333 (33.3%):0.167 (16.7%), with a bias towards deciding the outcome as German expressed as cost coefficients for French:German:Italian of 0.25:0.5:0.25, each P(p|o,π) from Table 2:4 is multiplied by the corresponding cost coefficient before the final P(o|p) are calculated
P(p_{i,j}|o_{j},π_{j},c_{j}) = P(p_{i,j}|o_{j},π_{j}) x c_{j}
P(o_{j}|p_{i,j},π_{j},c_{j}) = P(p_{i,j}|o_{j},π_{j},c_{j}) / sum(P(p_{i,j}|o_{j},π_{j},c_{j})) over all outcomes
The results are shown above and to the right, in table 2.6 P(pattern|ethnicity,π,c) and table 2.7 Probability with bias P(ethnicity|pattern,π,c). The biased probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7% and a bias of 25%:50%:25%, those identified as French would have dark hair/blue eyes (60%), light hair/brown eyes (50%), or light hair/other color eyes (54.5%). Those identified as German would have dark hair/brown eyes (40%), dark hair/other color eyes (50%), or light hair/blue eyes (83.3%). No one would be identified as Italian.
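The pattern-based calculation is the same pipeline as the simple model, with pattern counts in place of attribute counts. A sketch over the full table (illustrative code, with the counts as given in table 1):

```python
pattern_counts = {   # from Table 1: French, German, Italian (sample size 10 each)
    "+-+--": [1, 1, 3],  # dark hair, brown eyes
    "+--+-": [3, 1, 2],  # dark hair, blue eyes
    "+---+": [1, 1, 1],  # dark hair, other color eyes
    "-++--": [2, 1, 2],  # light hair, brown eyes
    "-+-+-": [1, 5, 1],  # light hair, blue eyes
    "-+--+": [2, 1, 1],  # light hair, other color eyes
}
prior = [3/6, 2/6, 1/6]
cost = [0.25, 0.5, 0.25]

biased = {}
for pattern, row in pattern_counts.items():
    # P(p|o) x prior x cost, then normalise over the three outcomes
    w = [(c / 10) * q * k for c, q, k in zip(row, prior, cost)]
    total = sum(w)
    biased[pattern] = [x / total for x in w]
# biased["-+-+-"] (light hair, blue eyes) -> German at about 83.3% (Table 2.7)
```

With only 6 patterns the whole table can be held in one dictionary; larger models would compute each row on demand.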
This panel presents explanations for the calculations involved in the Naive Bayes model, in support of the program in the Classification by Naive Bayes Probability Program Page. This panel focuses on the technical aspects, and leaves general discussions of the model and its comparison with the Basic Bayes model to the Discussions panel.
The Example Data
Table 1. Model Information
Patterns  French  German  Italian 
Dark_Brown (+-+--)  1  1  3 
Dark_Blue (+--+-)  3  1  2 
Dark_Others (+---+)  1  1  1 
Light_Brown (-++--)  2  1  2 
Light_Blue (-+-+-)  1  5  1 
Light_Others (-+--+)  2  1  1 

Sample size  10  10  10 
A priori  0.5  0.333  0.167 
Cost  0.25  0.5  0.25 
The Naive Bayes model is only necessary when there are multiple predictors for a set of outcomes. The same data as used in the Classification by Basic Bayes Probability Program Page and the Classification by Naive Bayes Probability Program Page, as well as in all other panels of this page, will therefore be used.
We are using patterns of hair and eye color to predict outcomes of ethnicity. There are 6 patterns, being the combinations of 2 hair color attributes (dark, light) and 3 eye color attributes (brown, blue, other). The reference data we used to build our model are 10 cases from each ethnicity. Once we have built the model, we will apply it in a population where the ratio of French:German:Italian is 3:2:1, converted into proportions in the a priori array as [0.5, 0.333, 0.167]. As we are searching for a German translator, we set the relative bias of French:German:Italian to 1:2:1, converted into proportions in the cost array as [0.25, 0.5, 0.25].
After collecting the data, the necessary information to build our model is presented to the right in table 1 Model Information.
Format of Input Data
In the Basic Bayes model, both the outcomes and the patterns of attributes can be presented as names (text strings), so the format of the input data is flexible, providing it is consistent throughout.
As the Naive Bayes model naively assumes that all attributes in the prediction pattern are uncorrelated and independent of each other, predictions are made by combining those attributes that are present (+) in any individual pattern. This means that the input data for the predictors (the pattern) must contain information on the presence or absence of all attributes of all predictors. Different applications have their own format for presenting predictor patterns, but the two programs, the Classification by Basic Bayes Probability Program Page and the Classification by Naive Bayes Probability Program Page, and this page use the following format
 Each predictor is represented by a text string with as many columns of +s or -s as it has attributes. In our example data:
 Hair color has 2 attributes, so 2 columns: +- for dark (and not light), -+ for (not dark and) light
 Eye color has 3 attributes, so 3 columns: +-- for brown (and not blue or others), -+- for (not brown and) blue (and not others), --+ for (not brown or blue and) others
 The attribute strings are then concatenated in the same order: +-+-- for dark hair brown eyes, +--+- for dark hair blue eyes, +---+ for dark hair other color eyes, -++-- for light hair brown eyes, -+-+- for light hair blue eyes, and -+--+ for light hair other color eyes.
The patterns for input in the example data are shown (in brackets) in table 1, Model Information
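The encoding described above can be illustrated with a small helper. This is hypothetical code, written for this explanation only, not part of the StatsToDo programs:

```python
def encode(observed, attributes):
    """One predictor: '+' for the observed attribute, '-' for all the others."""
    return "".join("+" if a == observed else "-" for a in attributes)

HAIR = ["dark", "light"]
EYE = ["brown", "blue", "others"]

def pattern(hair, eye):
    """Concatenate the predictor strings in a fixed order: hair first, then eye."""
    return encode(hair, HAIR) + encode(eye, EYE)

# pattern("dark", "brown") gives "+-+--"; pattern("light", "others") gives "-+--+"
```

The fixed attribute order is what makes the strings comparable between cases, so it must be the same for every row of data.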
Creating the Model P(+|o) or P(a|o)
Table 2. P(+|o)
Attribute +  French  German  Italian 
Hair Dark (col 1)  5/10=0.5  3/10=0.3  6/10=0.6 
Hair Light (col 2)  5/10=0.5  7/10=0.7  4/10=0.4 
Eye Brown (col 3)  3/10=0.3  2/10=0.2  5/10=0.5 
Eye Blue (col 4)  4/10=0.4  6/10=0.6  3/10=0.3 
Eye Others (col 5)  3/10=0.3  2/10=0.2  2/10=0.2 
The number of positives for each attribute in each outcome is counted, and this is divided by the sample size of that outcome to form the probability of being positive given the outcome for each attribute, P(+|o).
P(+_{i,j}|o_{j}) = N^{+}_{i,j} / N_{j}
The results are shown in table 2 P(+|o), and these are the coefficients of the model, to be used in calculating probabilities in the future.
Calculating the Probability of a Pattern Given Outcome P(po)
Table 3. P(po)
Patterns  Col^{+}  French  German  Italian  Total 
Dark_Brown (++)  1,3  0.5x0.3=0.15  0.3x0.2=0.06  0.6x0.5=0.3  0.15+0.06+0.3=0.51 
Dark_Blue (++)  1,4  0.5x0.4=0.2  0.3x0.6=0.18  0.6x0.3=0.18  0.2+0.18+0.18=0.56 
Dark_Others (++)  1,5  0.5x0.3=0.15  0.3x0.2=0.06  0.6x0.2=0.12  0.15+0.06+0.12=0.33 
Light_Brown (++)  2,3  0.5x0.3=0.15  0.7x0.2=0.14  0.4x0.5=0.2  0.15+0.14+0.2=0.49 
Light_Blue (++)  2,4  0.5x0.4=0.2  0.7x0.6=0.42  0.4x0.3=0.12  0.2+0.42+0.12=0.74 
Light_Others (++)  2,5  0.5x0.3=0.15  0.7x0.2=0.14  0.4x0.2=0.08  0.15+0.14+0.08=0.37 
The probability of a pattern given an outcome, P(p|o), is the product of P(+|o) over all positive attributes in that pattern
P(p_{i,j}|o_{j}) = product(P(+_{i,j}|o_{j}))
For example, in the pattern of dark hair and brown eyes, the pattern is +-+--, so attributes 1 and 3 are positive. From the P(+|o) in Table 2, the calculations of P(p|o) are: for P(p|French) 0.5x0.3=0.15, for German 0.3x0.2=0.06, and for Italian 0.6x0.5=0.3
Given that the example data has only 2 predictors containing 5 attributes between them, and there are only 6 patterns, the complete P(p|o) table can be constructed, as shown to the right in Table 3. In practice, prediction models can be very much more complex, with tens of attributes and hundreds or thousands of patterns, and arrays of P(p|o) for any combination of pattern and outcome are usually computed dynamically according to the data presented.
Once the P(p|o) array for an outcome is calculated, the rest of the procedure is the same as in the Basic Bayes model.
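A minimal Python sketch of this dynamic computation (the function name and data layout are illustrative, with the P(+|o) values hard-coded from Table 2):

```python
# P(+|o) coefficients, as in Table 2
p_pos = {
    "French":  {"dark": 0.5, "light": 0.5, "brown": 0.3, "blue": 0.4, "others": 0.3},
    "German":  {"dark": 0.3, "light": 0.7, "brown": 0.2, "blue": 0.6, "others": 0.2},
    "Italian": {"dark": 0.6, "light": 0.4, "brown": 0.5, "blue": 0.3, "others": 0.2},
}

def p_pattern_given_outcome(pattern, outcome):
    """P(p|o): product of P(+|o) over the positive attributes in the pattern."""
    prob = 1.0
    for attribute in pattern:
        prob *= p_pos[outcome][attribute]
    return prob

# Dark hair and brown eyes: attributes 1 and 3 are positive
print(round(p_pattern_given_outcome(["dark", "brown"], "Italian"), 2))  # 0.3
```

Only the pattern actually presented needs to be computed, which is why large models can avoid storing the full pattern table.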
Describing the model: Maximum Likelihood P(o|p)
Table 4 P(o|p) : P(ethnicity|pattern)
Patterns  French  German  Italian 
Dark_Brown (+-+--)  0.15/0.51=0.294  0.06/0.51=0.118  0.3/0.51=0.588 
Dark_Blue (+--+-)  0.2/0.56=0.357  0.18/0.56=0.321  0.18/0.56=0.321 
Dark_Others (+---+)  0.15/0.33=0.455  0.06/0.33=0.182  0.12/0.33=0.364 
Light_Brown (-++--)  0.15/0.49=0.306  0.14/0.49=0.286  0.2/0.49=0.408 
Light_Blue (-+-+-)  0.2/0.74=0.27  0.42/0.74=0.568  0.12/0.74=0.162 
Light_Others (-+--+)  0.15/0.37=0.405  0.14/0.37=0.378  0.08/0.37=0.216 
The model consists of the probabilities of outcomes given each pattern, P(o|p), calculated from the dynamically computed array of P(p|o).
For each outcome j, the probability that it is predicted by pattern i (p_{i,j}) is
P(o_{j}|p_{i,j}) = P(p_{i,j}|o_{j}) / sum (P(p_{i,j}|o_{j})) for all outcomes
The results are shown in Table 4 P(ethnicity|pattern) to the right. The model suggests that, without knowing the a priori probability or including a bias, those with brown eyes are most likely to be Italians (58.8%, 40.8%), those with other color eyes are most likely to be French (45.5%, 40.5%), those with light hair and blue eyes are most likely to be Germans (56.8%), while the combination of dark hair and blue eyes is not decisive in predicting ethnicity.
Please note: the full table of maximum likelihoods can be produced here because of the very limited number of combinations of attributes, patterns, and outcomes. In practice, with tens of attributes and hundreds or thousands of patterns, the computer program usually produces the Maximum Likelihood for individual cases in the reference data, rather than for the exhaustive combinations of all patterns
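The normalization step can be sketched as follows (illustrative Python, using the dark hair and brown eyes row of Table 3):

```python
p_pattern = {  # P(p|o) for dark hair and brown eyes, from Table 3
    "French": 0.15, "German": 0.06, "Italian": 0.30,
}

def maximum_likelihood(p_pattern):
    """P(o|p): each P(p|o) divided by the sum over all outcomes."""
    total = sum(p_pattern.values())
    return {o: v / total for o, v in p_pattern.items()}

ml = maximum_likelihood(p_pattern)
print({o: round(v, 3) for o, v in ml.items()})
# {'French': 0.294, 'German': 0.118, 'Italian': 0.588}
```

The same division-by-total is applied to whichever pattern the case presents, so the full table never needs to be held in memory.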
Using the model: Bayes Probability P(o|p,π)
Table 5 P(o|p,π) : P(outcome|pattern,π)
Pattern  French  German  Italian  Total 
Dark_Brown (+-+--)  0.15x0.5=0.075  0.06x0.333=0.02  0.3x0.167=0.05  0.075+0.02+0.05=0.145 
Dark_Blue (+--+-)  0.2x0.5=0.1  0.18x0.333=0.06  0.18x0.167=0.03  0.1+0.06+0.03=0.19 
Dark_Others (+---+)  0.15x0.5=0.075  0.06x0.333=0.02  0.12x0.167=0.02  0.075+0.020+0.020=0.115 
Light_Brown (-++--)  0.15x0.5=0.075  0.14x0.333=0.047  0.2x0.167=0.033  0.075+0.047+0.033=0.155 
Light_Blue (-+-+-)  0.2x0.5=0.1  0.42x0.333=0.14  0.12x0.167=0.02  0.1+0.14+0.02=0.26 
Light_Others (-+--+)  0.15x0.5=0.075  0.14x0.333=0.047  0.08x0.167=0.013  0.075+0.047+0.013=0.135 
Table 6 Bayes Probability P(o|p,π) : P(ethnicity|pattern,π)
Pattern  French  German  Italian 
Dark_Brown (+-+--)  0.075/0.145=0.517  0.02/0.145=0.138  0.05/0.145=0.345 
Dark_Blue (+--+-)  0.1/0.19=0.526  0.06/0.19=0.316  0.03/0.19=0.158 
Dark_Others (+---+)  0.075/0.115=0.652  0.02/0.115=0.174  0.02/0.115=0.174 
Light_Brown (-++--)  0.075/0.155=0.484  0.047/0.155=0.301  0.033/0.155=0.215 
Light_Blue (-+-+-)  0.1/0.26=0.385  0.14/0.26=0.538  0.02/0.26=0.077 
Light_Others (-+--+)  0.075/0.135=0.556  0.047/0.135=0.346  0.013/0.135=0.099 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5(50%):0.333(33.3%):0.167(16.7%), the P(p|o) coefficients are first multiplied by the a priori probabilities before the final P(o|p) are calculated
P(p_{i,j}|o_{j},π_{j}) = P(p_{i,j}|o_{j}) x π_{j}
P(o_{j}|p_{i,j},π_{j}) =
P(p_{i,j}|o_{j},π_{j}) / sum (P(p_{i,j}|o_{j},π_{j})) for all outcomes
The results are shown above and to the right, in Table 5 P(outcome|pattern,π) and Table 6 Bayes Probability P(ethnicity|pattern,π). The Bayesian probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7%, those with light hair and blue eyes are most likely to be German at 53.8%, and all other combinations are most likely to be French.
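A sketch of this weighting-and-normalizing in Python (illustrative names; the light hair and blue eyes pattern is used, with a priori probabilities in the ratio 3:2:1):

```python
p_pattern_given_o = {  # P(p|o) for light hair and blue eyes, from Table 3
    "French": 0.5 * 0.4, "German": 0.7 * 0.6, "Italian": 0.4 * 0.3,
}
priors = {"French": 0.5, "German": 1 / 3, "Italian": 1 / 6}  # 3:2:1

# Multiply each P(p|o) by its a priori probability, then normalize
weighted = {o: p_pattern_given_o[o] * priors[o] for o in priors}
total = sum(weighted.values())
posterior = {o: w / total for o, w in weighted.items()}

print({o: round(v, 3) for o, v in posterior.items()})
# {'French': 0.385, 'German': 0.538, 'Italian': 0.077}
```

The printed values reproduce the light hair and blue eyes row of Table 6.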
Adding a bias: Bayes Probability P(o|p,π,c)
Table 7 P(o|p,π,c) : P(outcome|pattern,π,c)
Pattern  French  German  Italian  Total 
Dark_Brown (+-+--)  0.075x0.25=0.019  0.02x0.5=0.01  0.05x0.25=0.013  0.019+0.01+0.013=0.041 
Dark_Blue (+--+-)  0.1x0.25=0.025  0.06x0.5=0.03  0.03x0.25=0.008  0.025+0.03+0.008=0.063 
Dark_Others (+---+)  0.075x0.25=0.019  0.02x0.5=0.01  0.02x0.25=0.005  0.019+0.01+0.005=0.034 
Light_Brown (-++--)  0.075x0.25=0.019  0.047x0.5=0.023  0.033x0.25=0.008  0.019+0.023+0.008=0.05 
Light_Blue (-+-+-)  0.1x0.25=0.025  0.14x0.5=0.07  0.02x0.25=0.005  0.025+0.07+0.005=0.1 
Light_Others (-+--+)  0.075x0.25=0.019  0.047x0.5=0.023  0.013x0.25=0.003  0.019+0.023+0.003=0.045 
Table 8 Probability with bias P(o|p,π,c) : P(ethnicity|pattern,π,c)
Pattern  French  German  Italian 
Dark_Brown (+-+--)  0.019/0.041=0.455  0.01/0.041=0.242  0.013/0.041=0.303 
Dark_Blue (+--+-)  0.025/0.063=0.4  0.03/0.063=0.48  0.008/0.063=0.12 
Dark_Others (+---+)  0.019/0.034=0.556  0.01/0.034=0.296  0.005/0.034=0.148 
Light_Brown (-++--)  0.019/0.05=0.372  0.023/0.05=0.463  0.008/0.05=0.165 
Light_Blue (-+-+-)  0.025/0.1=0.25  0.07/0.1=0.7  0.005/0.1=0.05 
Light_Others (-+--+)  0.019/0.045=0.413  0.023/0.045=0.514  0.003/0.045=0.073 
To use the model in a population we believe to have proportions of French:German:Italian of 0.5(50%):0.333(33.3%):0.167(16.7%), and a bias towards deciding the outcome as German with cost coefficients for French:German:Italian of 0.25:0.5:0.25, each P(p|o,π) value from Table 5 is multiplied by the corresponding cost coefficient before the final P(o|p) are calculated
P(p_{i,j}|o_{j},π_{j},c_{j}) = P(p_{i,j}|o_{j},π_{j}) x c_{j}
P(o_{j}|p_{i,j},π_{j},c_{j}) =
P(p_{i,j}|o_{j},π_{j},c_{j}) / sum (P(p_{i,j}|o_{j},π_{j},c_{j})) for all outcomes
The results are shown above and to the right, in Table 7 P(outcome|pattern,π,c) and Table 8 Probability with bias P(ethnicity|pattern,π,c). The biased probabilities suggest that, in a population of French:German:Italian of 50%:33.3%:16.7%, with a bias of 25%:50%:25%, those identified as French would have dark hair with brown or other color eyes (45.5%, 55.6%). Those with all other patterns would be identified as Germans.
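Because the bias step is just another multiply-and-normalize, it can be sketched as follows (illustrative Python; the posterior values for light hair and blue eyes are taken from Table 6, so the results carry the small rounding of those inputs):

```python
# P(o|p,pi) for light hair and blue eyes, from Table 6 (already normalized)
posterior = {"French": 0.385, "German": 0.538, "Italian": 0.077}
costs = {"French": 0.25, "German": 0.5, "Italian": 0.25}  # 1:2:1 normalized

# Multiply by the cost coefficients and re-normalize; because normalization
# cancels constant factors, this is equivalent to weighting P(p|o,pi) directly
weighted = {o: posterior[o] * costs[o] for o in costs}
total = sum(weighted.values())
biased = {o: round(w / total, 2) for o, w in weighted.items()}
print(biased)  # {'French': 0.25, 'German': 0.7, 'Italian': 0.05}
```

The output reproduces the light hair and blue eyes row of Table 8.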
The mathematical aspects of Bayes Probability are covered in the previous panels, and only conceptual discussions will be presented on this panel.
Components of Bayes Probability Analysis
Bayesian Probability Theory is a mathematical model for making decisions based on experience. The process is to predict, using a set of predictors, the probabilities of alternative outcomes. In the Bayesian context, prediction is not to forecast the future, nor to establish what may be true, but to logically determine how sure we can be, in terms of probabilities (a number between 0 and 1, or a percentage), about each of the alternative outcomes contained in our model, based on a set of observed predictors.
The process of Bayesian decision making can be separated into the following stages
 We begin by nominating the a priori probabilities (π, P(o)), a belief of how likely each of the alternative outcomes should be, before knowing the values of predictors. This can be established by a number of means
 We can declare that we do not know, and assign the same value as a priori probabilities to all outcomes
 We can base the a priori probabilities on knowledge, from experience, research, previously collected data, hearsay, or cultural belief
 We can propose a priori probabilities as a hypothesis to explore, such as "if the a priori probabilities are ...., then ....."
 We then use the patterns of predictors to change these a priori probabilities to a new set of probabilities, the a posteriori probabilities (P(o|p,π)). We use the coefficients of our Bayesian model to do this.
 Under some circumstances, we may in addition impose a bias on our decisions, if we hold that the outcomes have different values (P(o|p,π,c)). For example, headache may predict anxiety or brain tumour, but missing a brain tumour has far graver consequences than missing anxiety, so we can insert a cost coefficient to bias our decisions towards brain tumour. The term cost refers to the cost of wrongly failing to identify a particular outcome, reflecting the importance of that outcome. The process inserts a deliberate, considered, and calibrated bias into our decisions.
Prior to this, we will need to develop the coefficients of our model. The basic logic is that, if a particular attribute (a) of a predictor is commonly seen whenever we observe an outcome (o), then we can conclude that the outcome is also common when we see that attribute. We therefore collect an adequate sample that contains the outcomes of interest, count the number of attributes in each outcome group, and establish the probabilities of an attribute given the outcome, P(attribute|outcome) or P(a|o). The collection of P(a|o)s forms the coefficients of our model.
Once we have the model, we can describe it by estimating the Maximum Likelihoods, which are the probabilities of an outcome when an attribute is observed (P(outcome|attribute) or P(o|a)). These are estimated from the attributes only, without any other considerations. Using the Maximum Likelihood calculated in the first section of the Simple Bayes panel, 50%, 30%, and 60% of French, German, and Italians have dark hair. From this, we can conclude that someone with dark hair is most likely to be an Italian, with 42.9% confidence. Similarly, 50%, 70%, and 40% of French, German, and Italians have light hair, so we can also conclude that someone with light hair is most likely a German, with a confidence of 43.8%.
The a posteriori Probability P(o|a,π), the main subject under discussion, arises when we use this model to convert a priori to a posteriori probabilities. Using the same example, we are going to use the coefficients we established in a population where the ratios of French, German, and Italian are 3:2:1. Before we apply our model, our a priori probabilities are 50%, 33.3%, and 16.7% for French, German, and Italian. We convert these to a posteriori probabilities using hair color as the predictor, with the following a priori>a posteriori changes.
 If the hair is dark, 50%>55.6% for French, 33.3%>22.2% for German, and 16.7%>22.2% Italian
 If the hair is light, 50%>45.5% for French, 33.3%>42.4% for German, and 16.7%>12.1% for Italian
 In other words, using our model in this population, being French remains most likely regardless of hair color (55.6% for dark hair and 45.5% for light hair). However, German comes a very close second when the hair is light (42.4%)
The a posteriori Probability with Bias P(o|a,π,c). When a bias is introduced, the probability for outcomes with a higher cost coefficient increases. In our example, we are recruiting German translators, so we think not missing a German is twice as important as the other two ethnicities. The relative costs of 1:2:1 are normalized to the proportions 25%:50%:25%. The a priori>Bayes>plus bias probabilities are as follows.
 If the hair is dark, 50%>55.6%>45.5% for French, 33.3%>22.2%>36.4% for German, and 16.7%>22.2%>18.2% for Italian
 If the hair is light, 50%>45.5%>31.9% for French, 33.3%>42.4%>59.6% for German, and 16.7%>12.1%>8.5% for Italian
 In other words, using our bias model in this population, we would identify someone with dark hair as French (45.5%), and light hair as German (59.6%)
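The whole single-predictor chain above can be sketched in Python (illustrative names; the hair coefficients, 3:2:1 priors, and 1:2:1 costs are from the example):

```python
# Single-predictor (hair color) update, then bias, as described in the text
p_light = {"French": 0.5, "German": 0.7, "Italian": 0.4}   # P(light hair | o)
priors  = {"French": 0.5, "German": 1 / 3, "Italian": 1 / 6}  # 3:2:1
costs   = {"French": 0.25, "German": 0.5, "Italian": 0.25}    # 1:2:1 normalized

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# a priori -> a posteriori (light hair observed) -> plus bias
posterior = normalize({o: priors[o] * p_light[o] for o in priors})
biased    = normalize({o: posterior[o] * costs[o] for o in priors})

print({o: round(v, 3) for o, v in posterior.items()})
# {'French': 0.455, 'German': 0.424, 'Italian': 0.121}
print({o: round(v, 3) for o, v in biased.items()})
# {'French': 0.319, 'German': 0.596, 'Italian': 0.085}
```

The two printed dictionaries are the light-hair a posteriori and biased probabilities quoted in the bullets above.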
Please note that many discussions of Bayesian Probability do not include bias in the algorithm. Its inclusion in StatsToDo is intended to show that it is possible to include a value judgement in decision making, not to suggest that it needs to be used. In the programs, if bias is not intended, the same value should be given to all the bias coefficients.
Bayes Probability with Multiple Predictors: Basic Bayes
The Bayes model remains fundamentally the same when there is more than one predictor, and the term "Basic Bayes" is used in StatsToDo only to avoid confusion with the Naive Bayes model. The attributes (a) from each predictor are combined to form patterns of attributes, and the patterns are mathematically treated in the same way as if they were attributes from a single predictor.
In our example, the 2 attributes of hair color (dark, light) are combined with the 3 attributes of eye color (brown, blue, others) to form 6 patterns: dark_brown, dark_blue, dark_others, light_brown, light_blue, and light_others. Using the arithmetic shown in the Basic Bayes panel of this page and the Classification by Basic Bayes Probability Program Page
, the a priori>Bayesian>bias probabilities are as follows
 For dark hair and brown eyes, 50%>37.5%>30% for French, 33.3%>25%>40% for German, 16.7%>37.5%>30% for Italian.
In other words, being French is most likely (50%) before any observations. With dark hair and brown eyes, being French and Italian are equally likely (37.5% each). Adding a bias, we would decide for German (40%)
 For dark hair and blue eyes, 50%>69.2%>60% for French, 33.3%>15.4%>26.7% for German, and 16.7%>15.4%>13.3% for Italians.
In other words, being French is most likely (50%) before any observations. With dark hair and blue eyes, we are more certain, at 69.2%. Adding a bias towards German reduces our certainty to 60%, but our conclusion remains that being French is most likely.
 For dark hair and other color eyes, 50%>50%>37.5% for French, 33.3%>33.3%>50% for German, and 16.7%>16.7%>12.5% for Italians.
In other words, being French is most likely (50%) before any observations. With dark hair and other color eyes, we have not changed our minds. Adding a bias towards German, we decide that being German is most likely (50%).
 For light hair and brown eye, 50%>60%>50% for French, 33.3%>20%>33.3% for German, 16.7%>20%>16.7% for Italian.
In other words, being French is most likely (50%) before any observations. With light hair and brown eyes, we increase our certainty for French to 60%. Adding a bias, our certainty is reduced to 50%, but we would still decide that being French is most likely.
 For light hair and blue eyes, 50%>21.4%>12.5% for French, 33.3%>71.4%>83.3% for German, and 16.7%>7.1%>4.2% for Italians.
In other words, being French is most likely (50%) before any observations. With light hair and blue eyes, we decide being German is most likely (71.4%). Adding a bias towards German increases that certainty to 83.3%.
 For light hair and other color eyes, 50%>66.7%>54.5% for French, 33.3%>22.2%>36.4% for German, and 16.7%>11.1%>9.1% for Italians.
In other words, being French is most likely (50%) before any observations. With light hair and other color eyes, we increase that certainty to 66.7%. Adding a bias towards German, we reduce the certainty of being French to 54.5%, with German second at 36.4%.
It can be seen that the basic Bayesian model is simple, elegant, and intuitively easy to understand and accept. What follows are discussions on its usage in practice.
Advantages
The main advantage is that the model is self-evidently valid, as it is the mathematical form of the accepted practice of basing decisions on knowledge and experience. Given that each pattern and each outcome is unique, there is no confusion about what the relationships and conclusions refer to. Users can therefore be confident in the decisions from the model, provided the model is based on valid data and the a priori probabilities are well chosen.
Another advantage is the ease of use. Tables of coefficients, and tables relating patterns of predictors to probabilities of outcomes, can be produced by researchers, and distributed to users. Users do not need to understand the mathematics, they just need to consult the published tables to make decisions.
Disadvantages
The major disadvantage of the basic Bayesian model is scalability: it cannot cope with too many predictors. In our example, with 2 predictors of 2 hair and 3 eye colors, the number of patterns is 2x3=6. If we add skin color (say pale, tan, and dark), the number of patterns becomes 6x3=18. If we then add body build (skinny, fat, muscular), then temperament (stoic, emotional), and so on, each additional predictor multiplies the number of patterns required. Even if all predictors are binary, the number of patterns would exceed 1000 if more than 9 predictors are included. Some of the combinations may be uncommon (say fat, calm, dark skin, light hair, blue eyes, and French), and obtaining sufficient cases to fill all combinations with enough numbers to produce stable predictions may require sample sizes that are impractical to collect.
The model fails when presented with missing data. For example, we cannot make a prediction when the person is too far away for us to know the color of his/her eyes, or if the person has no hair. Missing data is common in situations where decisions need to be made with incomplete information. For example, when a patient arrives in a hospital emergency department, the triage nurse has to make a decision based on the major symptom. Once the patient is admitted, decisions are based on additional history and physical examination. After a few hours, decisions are modified with test results. A day later, there is more modification based on the progress observed. This means that missing data can be normal, and a single model cannot cope with this. In these situations, an individual model will need to be produced for every conceivable combination of data and missing data that may arise, and this magnifies the scalability problem.
The model is inflexible, in that each pattern includes attributes from all predictors. This means that no predictor can be added or removed, nor its relationship with outcomes altered, without changing the whole model. If any change is required, a new model must be constructed. For example, after using our example model for a while, we observe that the 3 ethnic groups have different temperaments, and wish to include this in our model. We cannot simply add temperament to the model, as we would have to create 12 new patterns of hair_eye_temperament, essentially building a new model with new data. This inflexibility is particularly important in rapidly changing environments, such as medical care, where disease patterns and technologies evolve rapidly.
Usage
The basic Bayesian model is preferred when the relationship between predictors and outcomes is stable, when the total number of patterns is not too great (say 100 patterns, or the equivalent of 5 or fewer binary predictors), where no pattern/outcome combination is rare, where missing data is not expected, and where a large population is available to build the model.
Bayes Probability with Multiple Predictors: Naive Bayes
Naive Bayes probability is used only when there are multiple predictors. It differs from the basic model by assuming, naively, that the predictors are independent of each other.
Conceptually, the model starts with the nominated a priori probability and computes the first a posteriori probability using an attribute of the first predictor. This first a posteriori probability is then used as the a priori probability to compute the second a posteriori probability using an attribute of the second predictor. This is then used to compute the third, and so on, until all predictors have been included.
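This sequential updating is mathematically equivalent to multiplying all the coefficients in one pass, which a short Python sketch can verify (illustrative names; coefficients taken from Table 2, restricted to the two attributes used):

```python
p_pos = {  # P(+|o) coefficients, from Table 2
    "French":  {"dark": 0.5, "brown": 0.3},
    "German":  {"dark": 0.3, "brown": 0.2},
    "Italian": {"dark": 0.6, "brown": 0.5},
}
priors = {"French": 0.5, "German": 1 / 3, "Italian": 1 / 6}

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def update(prior, attribute):
    """One Bayes step: this step's a posteriori is the next step's a priori."""
    return normalize({o: prior[o] * p_pos[o][attribute] for o in prior})

# Sequential: update on hair first, then use the result as the prior for eyes
sequential = update(update(priors, "dark"), "brown")

# One pass: multiply all the coefficients at once (the Naive Bayes shortcut)
one_pass = normalize({o: priors[o] * p_pos[o]["dark"] * p_pos[o]["brown"]
                      for o in priors})

# The two routes agree (up to floating point error)
assert all(abs(sequential[o] - one_pass[o]) < 1e-12 for o in priors)
```

Because intermediate normalizations cancel, the program only needs one multiplication per attribute and a single normalization at the end.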
The algorithm of Naive Bayes probability combines the whole sequence and removes unnecessary repetition to produce the final results. The model calculates the probability of being positive for each attribute in each predictor, P(+|o), for each outcome, and this list becomes the coefficients of the model, as shown in Table 2 in the Naive Bayes panel of this page, and in the results from the Classification by Naive Bayes Probability Program Page
.
When presented with a pattern of predictors, the program multiplies all the P(+|o) of the attributes that are positive in the pattern, to produce the probability of the pattern given the outcome (P(p|o) = product(P(+|o)) of all positive attributes).
Using our example, the P(p|o) for dark hair and brown eyes is P(+|o)_{dark hair} x P(+|o)_{brown eye}, which becomes the pattern coefficient used to convert the a priori probability to the a posteriori probabilities P(o|p,π) and P(o|p,π,c), in the same manner as in the basic Bayes model.
In other words, the Naive Bayes model differs from the basic model in the following ways
 The naive assumption that all predictors in the model are independent of each other
 The coefficients are P(+|o), for each attribute of each predictor related to each outcome, and not P(p|o) for each pattern related to each outcome
 The probability of a pattern given outcome (P(p|o)) is computed dynamically when a pattern is presented, and not selected from a fixed list
 All other calculations in the two models are the same.
The results obtained from the example are already presented in the Naive Bayes panel of this page and from the program in the Classification by Naive Bayes Probability Program Page
. The explanation and interpretations for these results are the same as that for the basic Bayes probability. Therefore results from the example will not be further discussed here.
The advantages and disadvantages of using the Naive Bayes model are the mirror image of the basic Bayes model.
Advantages
The Naive Bayes model is easier to scale to include a greater number of predictors.
 The number of P(a|o) coefficients required increases linearly for the Naive Bayes model, compared with exponentially for the basic Bayes model. In our example, 2 hair and 3 eye colors require 2+3=5 P(a|o) cells for each outcome in Naive Bayes, instead of 2x3=6 P(p|o) for basic Bayes. Using n binary predictors, Naive Bayes requires 2xn attribute coefficients, but basic Bayes requires 2^{n} patterns
 By assuming that all attributes are independent, the model can be built provided a sufficient number of cases is available for each attribute, compared with basic Bayes, where sufficient numbers are required for each combination of attributes
 In other words, the model can be built using a smaller and more practical sample size.
The Naive Bayes model copes better with missing data. As all attributes are assumed independent, the absence of an attribute merely means that the influence of that attribute is not included. Missing data, where all attributes from a predictor are unavailable, means that the a posteriori probability is calculated without its influence. The results are less precise with missing data, but the algorithm will produce an answer with whatever data it has. In other words, the quality of the result degrades and becomes increasingly approximate with more missing data, but a result will be calculated.
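A sketch of this behaviour in Python (illustrative names; the classifier simply skips any predictor that was not observed):

```python
p_pos = {  # P(+|o) coefficients, from Table 2
    "French":  {"dark": 0.5, "light": 0.5, "brown": 0.3, "blue": 0.4, "others": 0.3},
    "German":  {"dark": 0.3, "light": 0.7, "brown": 0.2, "blue": 0.6, "others": 0.2},
    "Italian": {"dark": 0.6, "light": 0.4, "brown": 0.5, "blue": 0.3, "others": 0.2},
}
priors = {"French": 0.5, "German": 1 / 3, "Italian": 1 / 6}

def classify(observed):
    """A posteriori probabilities using only the attributes actually observed;
    a missing predictor simply contributes no factor to the product."""
    weighted = {}
    for o in priors:
        prob = priors[o]
        for attribute in observed:  # anything unobserved is skipped
            prob *= p_pos[o][attribute]
        weighted[o] = prob
    total = sum(weighted.values())
    return {o: w / total for o, w in weighted.items()}

# Eye color unknown (the person is too far away): fall back on hair alone
far_away = classify(["dark"])
print({o: round(v, 3) for o, v in far_away.items()})
# {'French': 0.556, 'German': 0.222, 'Italian': 0.222}
```

The result is less precise than with both predictors, but a valid probability distribution is still produced.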
The Naive Bayes model is more flexible: it is easier to add, delete, or change a predictor. Given the assumption that all predictors are independent, and that P(p|o) is calculated dynamically, the addition, removal, or change of any predictor should not affect the performance of the other predictors. Most of the calculations would remain the same, but the results would be modified by the changes. This flexibility is particularly useful in situations where the relationship between predictors and outcomes may change, or where technological development requires additional predictors to be added, such as in medical care.
Disadvantages
The main disadvantage, or objection, lies in the naive assumption that the predictors are independent, because this assumption is rarely correct. Two types of dependency may exist in any set of predictors.
 Correlation may exist between predictors, as many predictors have common precursors in genetics, geography, history, culture, and so on. When a correlation exists, the common precursor is used more than once in the calculation, creating a tautology that inserts an unaccounted bias into the results. An example would be using height, weight, and skin color to predict red cell concentration in the blood of children. Height and weight both depend on age, so the model imposes the influence of age twice, via height and weight, and skin color only once, resulting in a biased prediction.
 Interaction is when predictors have synergistic or inhibitory effects on each other in their relationship with outcomes.
 An example of synergism is in the treatment of cancer. If surgery has a cure rate of 10% and chemotherapy 10%, then we would expect that providing both will result in a cure rate of 19% (1 - 0.9x0.9 = 0.19) if there is no interaction. In many cases, however, surgery makes the remaining cancer cells more sensitive to chemotherapy, and the cure rate of combined therapy is more than 19%.
 An example of inhibition is in the use of antibiotics to treat a particular infection. An antibiotic, say penicillin, may cure the infection in 10% of cases, and another, say tetracycline, also 10%. In the absence of interaction we would expect a 19% cure rate if both are used. However, penicillin works by destroying bacterial cell walls and is most effective when bacteria are actively growing, whereas tetracycline works by slowing the growth of bacteria. Giving both may therefore result in one inhibiting the effect of the other, and a cure rate of less than 19%.
The error caused by the naive assumption in the Bayes model is difficult to quantify, so the precision and accuracy can only be estimated when the model is used after development.
A less important disadvantage is that Bayesian probabilities are calculated dynamically when the model is used. The calculations are often tedious and, if carried out manually, prone to error. This means that using the model requires a computer with the appropriate software, and a user capable of operating that software.
Usage
Given the problems of the naive assumption of predictor independence, the basic Bayes model should always be preferred, and the Naive Bayes model chosen only when the basic model is impractical: when the number of predictors is too large, when some combinations of attributes are too rare, when decisions are often required with incomplete data, or when the model requires frequent modification.
When chosen, the Naive Bayes model should not be viewed as self-evidently valid. Rather, it should be considered a tool for developing a prediction method, which requires validation and frequent monitoring during usage after development.
During development, much attention must be paid to selecting the predictors and their attributes, to make them as independent of each other as possible. If a lack of independence between any two or more predictors is suspected, combining their attributes into a single predictor may be required.
During usage after development, results should be recorded, and these reviewed regularly to detect and calibrate errors, in the light of which the model can be modified and improved.
The following references are mainly introductory in nature, intended to help the inexperienced.
Basic Bayes
https://en.wikipedia.org/wiki/Bayes%27_theorem Bayes Probability on Wikipedia. This provides a clear definition, and provides links to references for more reading
https://arbital.com/p/bayes_rule/?l=1zq An online introduction to Bayes theorem. This provides good detailed explanations, and links to additional pages that cater to different levels of knowledge and needs
Naive Bayes
https://en.wikipedia.org/wiki/Naive_Bayes_classifier Naive Bayes on Wikipedia. A concise and clear description, which provides references for more reading
https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/ A teaching page with explanations for beginners
Mueller JP and Massaron L (2016) Machine Learning for Dummies. John Wiley and Sons, Inc, New Jersey, ISBN 978119245513. p.158-163.
This gives only a brief introduction to both Bayes algorithms. However, it provides perspective on how Bayes probability is used in data analysis, in comparison with many other methods. There is not much guidance on calculations, however, as the book relies heavily on the use of R and Python packages.
Old references
The following 2 references introduced me to Naive Bayes many years ago, before the term naive became commonly used. I include them partly out of sentiment, but mostly because they discussed the practical problems of handling predictors with multiple attributes without using multiple and convoluted approaches, techniques that evolved into the +s and -s I used in this and the two programming pages.
Warner HR, Toronto AF, Veasey LG, and Stephenson R (1961) A Mathematical Approach to Medical Diagnosis. Application to Congenital Heart Disease. JAMA 117:3 p.177-183.
This explained the use of Naive Bayes to diagnose congenital heart disease in the newborn, using multiple medical observations as predictors. The term Naive was not used in the paper, but the formula is unmistakable. The method of handling data was a bit complicated, and the authors used different methods of computation depending on whether the predictors were binomial or multinomial. The pitfalls of the naive assumption of predictor independence were also discussed, accompanied by suggested methods to reduce the associated risks.
Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology. McGraw Hill Book Company, New York. Library of Congress No. 73147164070479356. p.400-412.
Both the basic and Naive Bayes were explained in this book, which I came across many years ago. The book discussed many multivariate methods of handling psychiatric data, and Bayesian classification was used to distinguish between depression and schizophrenia. The term naive was not yet in use, and the authors used "pattern probability" for the method now generally called Naive Bayes. The book also discussed different options for handling multinomial predictors to reduce computational complexity. The terms "attribute" and "pattern" used in this page and the programs are derived from those discussions, and the use of a string of +s and -s also evolved from the strategies suggested. This book is old and may be out of print, but major libraries should have a copy, as it was a standard textbook for multivariate statistics at the Masters level in the 1970s.