Related links:
Classification by Basic Bayes Probability Program Page
Classification by Naive Bayes Probability Program Page
Introduction
Basic Bayes
Naive Bayes
Discussions
References
This page provides explanation and support for the two programs in the Classification by Basic Bayes Probability Program Page and the Classification by Naive Bayes Probability Program Page. As the programs and this explanation page use specific terms and abbreviations, and these are best demonstrated with examples, this introduction panel will describe the example used and the terminology.
The format of data entry and explanation of results produced are in the Help and Hints panel of the program pages.
The contents of the other panels of this page are Basic Bayes, Naive Bayes, Discussions, and References.
The remainder of this panel provides a description of the example data used in this and the two program pages, and brief explanations of the overall concepts of Bayesian probability and the terms used in these pages.
Before we start: modern computers perform calculations with precision to 14 decimal places. The two programs associated with this page display results to 4 decimal places. On this page, to conserve space and make reading easier, probability values are displayed to 2 decimal places. Minor differences from the results of the programs may occasionally arise, and some of the probabilities do not total to 1. Readers should be aware of this and not be confused by it.
The Example
The same example is used in the two program pages and this explanation page.
Predictor | Outcome |
Hair | Eye | French | German | Italian |
Dark | Blue | 3 | 1 | 2 |
Dark | Brown | 1 | 1 | 3 |
Dark | Others | 1 | 1 | 1 |
Light | Blue | 1 | 5 | 1 |
Light | Brown | 2 | 1 | 2 |
Light | Others | 2 | 1 | 1 |
a priori | 0.5 | 0.33 | 0.17 |
We wish to develop a Bayesian model to identify the ethnicity of people, based on hair color and eye color. To build our model, we recruited 10 each of known French, Germans, and Italians, and observed their hair and eye colors. We then use the Bayesian model to predict ethnicity from hair and eye colors, in a community with an expected ratio of French:German:Italian of 3:2:1, normalized to a priori probabilities of 0.5:0.33:0.17.
The counts of each combination and the a priori probabilities are presented in the table to the right, and the terms and abbreviations used are explained as follows
Bayesian Probability
Bayesian Probability Theory is a mathematical model of making decisions based on experience. The process is to predict, using a set of predictors to determine the probabilities of alternative outcomes. In the Bayesian context, prediction is not to forecast the future, nor to establish what may be true, but to logically apply the observed values of predictors to calculate how confident we can be, in terms of probabilities (a number between 0 and 1, or a percentage), for each of the alternative outcomes contained in our model.
The process of Bayesian decisions can be separated into the following stages
- We begin by nominating the a priori probabilities (π), our confidence in believing each of the alternative outcomes to be correct, before taking predictors into consideration. This can be established by the following
- We can decide that we do not know, and assign the same value as a priori probability to all outcomes
- We can base the a priori probabilities on knowledge, from experience, research, previously collected data, hearsay, cultural belief, or simply a guess
- We can propose a priori probabilities as a hypothesis to explore, such as "if the a priori probabilities are ...., then ....."
- From our example: in the community where we will use our Bayesian model (the north west of Switzerland), the census informs us that the ratio of French:German:Italian is 3:2:1. These are normalized to probabilities of 0.5:0.33:0.17 by dividing each value by the total
- We then use the coefficients of our model to apply the attributes of predictors to change a priori probabilities to a posteriori probabilities. The coefficients are developed using a set of reference data, in our example 10 cases of each ethnicity. Each coefficient is the probability of seeing an attribute given the outcome P(a|o), obtained by dividing the number of cases with each pair of attribute/outcome by the sample size of that outcome in the reference data. Both the Basic and Naive Bayes models use P(a|o) as coefficients, but they are calculated, presented, and used differently. Details are presented in the 2 subsequent panels.
- The coefficients P(a|o) interact with the attributes of the predictor(s) to estimate the a posteriori probability, commonly referred to as the Bayesian probability
- When there is only 1 predictor, as in the Basic Bayes model, attribute (a) represents each alternative of the predictor, and the Bayesian probability is probability given attribute πP(o|a)
- When there are more than 1 predictor, as in the Naive Bayes model, pattern (p) represents an array of attributes, one from each predictor, and the Bayesian probability is probability given pattern πP(o|p)
- Two types of a posteriori probability can therefore be calculated using the coefficients we developed (a small worked sketch follows this list)
- Probability of outcome using only the predictor(s), without taking a priori probability into consideration. In the Basic Bayes model with 1 predictor, this is probability given attribute P(o|a), and in the Naive Bayes model with multiple predictors probability given pattern P(o|p). This probability is also termed Maximum Likelihood, and the table of Maximum Likelihood describes the behaviour of the model.
- Probability of outcome using the predictor(s) and the a priori probabilities π. In the Basic Bayes model with 1 predictor, this is probability given attribute and a priori probability πP(o|a), and in the Naive Bayes model with multiple predictors probability given pattern and a priori probability πP(o|p). This probability is also termed Bayes or Bayesian Probability, and is the major and most commonly used a posteriori probability.
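To make these two calculations concrete, here is a minimal Javascript sketch (the function and variable names are illustrative only, and are not taken from the program pages) that applies both to the DarkBlue attribute of the example:

function normalize(values) {
  // divide each value by the total, so that the array sums to 1
  const total = values.reduce((sum, v) => sum + v, 0);
  return values.map(v => v / total);
}

// P(a|o) for the attribute DarkBlue, one value per outcome [French, German, Italian]
const pAO = [0.3, 0.1, 0.2];
// a priori probabilities π for [French, German, Italian]
const prior = [0.5, 0.3333, 0.1667];

// Maximum Likelihood P(o|a): the coefficients alone, normalized
const maxLikelihood = normalize(pAO);                        // ≈ [0.5, 0.17, 0.33]
// Bayesian probability πP(o|a): weight the coefficients by π, then normalize
const bayesian = normalize(pAO.map((p, j) => p * prior[j])); // ≈ [0.69, 0.15, 0.15]

console.log(maxLikelihood, bayesian);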
Summary and Technical Notes
The terminology and abbreviations used in this page and the two associated program pages are adapted from diverse sources, and may not be the same as in other publications. Users should be aware of this peculiarity when comparing these pages with other sources of information. These are chosen to prefer clarity over brevity, hoping that, by doing so, the inexperienced will be less confused. In particular, the following should be noted.
- Predictor is a conceptual term representing things used to predict, and no abbreviation is provided in these pages. In other publications, a variety of terms and abbreviations, such as independent variable, x, or j, are used
- Attribute is the value of a predictor, and is abbreviated as a. In other publications, predictor, independent variable, x, j, and so on are used
- Pattern is an array of attributes, one from each predictor, is abbreviated as p, and is used only in the Naive Bayes model. In other publications, predictor, independent variable, x, j, and so on are used
- Outcome is used both as a concept of things to predict and as the values (probabilities) predicted, and is abbreviated as o. In other publications, dependent variable, a posteriori, posterior probability, y, z, θ are used
- The abbreviation P(x|y), representing the probability of x given y, generally known as conditional probability, is the same in these pages as in most publications. However, in most publications, the same abbreviations are used (with different letters) to represent different types of conditional probabilities, while in these pages
- P(a|o) and P(p|o) represent probability of attribute or pattern given outcome. Other publications use P(x|y), P(x|θ), or names of predictors and outcomes
- P(o|a) and P(o|p) represent probability of outcome given attribute or pattern, without consideration of a priori probabilities. Other publications use P(y|x), P(θ|x), or names of predictors and outcomes. This represents Maximum Likelihood, a term used in these pages as in most publications
- πP(o|a) and πP(o|p) represent Bayesian Probability, with π representing a priori probability. The term is an old one (see references), and is used in these pages to distinguish it from Maximum Likelihood. In most publications the same abbreviation as Maximum Likelihood is used, and what the abbreviation means depends on the context described.
This panel discusses the basic Bayes model generally, and the calculations used in the Classification by Basic Bayes Probability Program Page.
Table B1: Counts
Attributes | Outcome |
| French | German | Italian |
DarkBlue | 3 | 1 | 2 |
DarkBrown | 1 | 1 | 3 |
DarkOthers | 1 | 1 | 1 |
LightBlue | 1 | 5 | 1 |
LightBrown | 2 | 1 | 2 |
LightOthers | 2 | 1 | 1 |
a priori π | 0.5 | 0.33 | 0.17 |
Basic Bayes is used to represent the Bayes Probability model, as originally described by Bayes and in the Wikipedia page on Bayes Theorem. The term "Basic" is used to avoid confusion with the Naive Bayes model, and is not applicable outside of StatsToDo.
The model uses a single predictor with 2 or more attributes to predict the probabilities of 2 or more outcomes. Where more than one variable is used to predict, they are combined into a single compound predictor. In the default example in the Classification by Basic Bayes Probability Program Page, the data presented in the Introduction panel are restructured, so that the two variables, hair color (2 attributes of Dark and Light) and eye color (3 attributes of Blue, Brown and Others), are combined into a single compound predictor HairEye, with 2x3=6 attributes of DarkBlue, DarkBrown, DarkOthers, LightBlue, LightBrown, and LightOthers. The restructured table is used in the program, and is shown to the right.
Building the model: Attribute (a): compound of hair and eye color; Outcome (o): ethnicity
Table B2. Model Coefficients P(a|o)
Attributes | Outcome |
| French | German | Italian | Total |
DarkBlue | 3/10=0.3 | 1/10=0.1 | 2/10=0.2 | 0.3+0.1+0.2=0.6 |
DarkBrown | 1/10=0.1 | 1/10=0.1 | 3/10=0.3 | 0.1+0.1+0.3=0.5 |
DarkOthers | 1/10=0.1 | 1/10=0.1 | 1/10=0.1 | 0.1+0.1+0.1=0.3 |
LightBlue | 1/10=0.1 | 5/10=0.5 | 1/10=0.1 | 0.1+0.5+0.1=0.7 |
LightBrown | 2/10=0.2 | 1/10=0.1 | 2/10=0.2 | 0.2+0.1+0.2=0.5 |
LightOthers | 2/10=0.2 | 1/10=0.1 | 1/10=0.1 | 0.2+0.1+0.1=0.4 |
The coefficients of the model, used to convert a priori to a posteriori probabilities, are the probabilities of attribute given outcome P(a|o). For the i th attribute and the j th outcome, P(a|o) is calculated by dividing the number of cases with that attribute/outcome pair (Ni,j) by the sample size of that outcome (Nj).
In this example, the attributes are the hair eye color combinations, and the outcomes ethnicity. The results are shown in Table B2 to the right.
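As an illustration, the derivation of Table B2 from the Table B1 counts can be sketched in a few lines of Javascript (the variable names are illustrative only, not those used by the program):

const counts = {                       // rows are attributes, columns are [French, German, Italian]
  DarkBlue:    [3, 1, 2],
  DarkBrown:   [1, 1, 3],
  DarkOthers:  [1, 1, 1],
  LightBlue:   [1, 5, 1],
  LightBrown:  [2, 1, 2],
  LightOthers: [2, 1, 1]
};
const outcomeSizes = [10, 10, 10];     // Nj, the sample size of each outcome

const pAO = {};
for (const attribute in counts) {
  pAO[attribute] = counts[attribute].map((n, j) => n / outcomeSizes[j]);   // Ni,j / Nj
}
console.log(pAO.DarkBlue);             // [0.3, 0.1, 0.2], as in Table B2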
Prediction 1. Maximum Likelihood P(o|a)
Table B3. Maximum Likelihood P(o|a)
Attribute | Outcome |
| French | German | Italian |
DarkBlue | 0.3/0.6=0.5 | 0.1/0.6=0.17 | 0.2/0.6=0.33 |
DarkBrown | 0.1/0.5=0.2 | 0.1/0.5=0.2 | 0.3/0.5=0.6 |
DarkOthers | 0.1/0.3=0.33 | 0.1/0.3=0.33 | 0.1/0.3=0.33 |
LightBlue | 0.1/0.7=0.14 | 0.5/0.7=0.71 | 0.1/0.7=0.14 |
LightBrown | 0.2/0.5=0.4 | 0.1/0.5=0.2 | 0.2/0.5=0.4 |
LightOthers | 0.2/0.4=0.5 | 0.1/0.4=0.25 | 0.1/0.4=0.25 |
If a posteriori probability is calculated without the inclusion of a priori probability π, the result is probability of outcome given attribute P(o|a), also called Maximum Likelihood. This describes the model, and demonstrates the relationship between attributes and outcomes.
P(o|a) is calculated from P(a|o) in Table B2. For each outcome j, its probability to be predicted by an attribute i (ai), the calculation is
P(oj|ai) = P(ai|oj) / Σ(P(ai|oj)) for all outcomes
The calculations and results are shown in Table B3 to the right. They suggest that, without including a priori probabilities, those with light hair and blue eyes are most likely to be German (0.71), those with dark hair and blue eyes or with light hair and other colored eyes French (0.5), while the other combined attributes do not clearly discriminate between the 3 ethnicities.
Prediction 2. Bayesian Probability πP(o|a)
Table B4a. πP(a|o)
Attributes | Outcomes |
| French | German | Italian | Total |
DarkBlue | 0.3x0.5=0.15 | 0.1x0.33=0.03 | 0.2x0.17=0.03 | 0.15+0.033+0.033=0.22 |
DarkBrown | 0.1x0.5=0.05 | 0.1x0.33=0.03 | 0.3x0.17=0.05 | 0.05+0.033+0.05=0.13 |
DarkOthers | 0.1x0.5=0.05 | 0.1x0.33=0.03 | 0.1x0.17=0.02 | 0.05+0.033+0.02=0.1 |
LightBlue | 0.1x0.5=0.05 | 0.5x0.33=0.17 | 0.1x0.17=0.017 | 0.05+0.17+0.017=0.233 |
LightBrown | 0.2x0.5=0.1 | 0.1x0.33=0.03 | 0.2x0.17=0.033 | 0.1+0.03+0.03=0.17 |
LightOthers | 0.2x0.5=0.1 | 0.1x0.33=0.03 | 0.1x0.17=0.017 | 0.1+0.03+0.017=0.15 |
Table B4b. Bayesian Probability πP(o|a)
Attributes | Outcomes |
| French | German | Italian |
DarkBlue | 0.15/0.22=0.69 | 0.03/0.22=0.15 | 0.03/0.22=0.15 |
DarkBrown | 0.05/0.13=0.38 | 0.03/0.13=0.25 | 0.05/0.13=0.38 |
DarkOthers | 0.05/0.1=0.5 | 0.03/0.1=0.33 | 0.02/0.1=0.17 |
LightBlue | 0.05/0.23=0.21 | 0.17/0.23=0.71 | 0.02/0.23=0.07 |
LightBrown | 0.1/0.17=0.6 | 0.03/0.17=0.2 | 0.03/0.17=0.2 |
LightOthers | 0.1/0.15=0.67 | 0.03/0.15=0.22 | 0.02/0.15=0.11 |
If a posteriori probability is calculated by changing the a priori probability π, the result is probability of outcome given attribute and a priori probability πP(o|a). This is usually referred to as Bayesian Probability, as it follows the descriptions first made by Bayes.
πP(o|a) is calculated from P(a|o) in Table B2 and the a priori probabilities π from Table B1. The calculations are in 2 steps. Firstly, the coefficient is adjusted by the a priori probability of each outcome; then the adjusted coefficients are normalized by the total for all outcomes. For each outcome j, its probability to be predicted by an attribute i (ai) and a priori probability πj, the calculations are as follows
- πjP(ai|oj) = P(ai|oj) x πj
- πjP(oj|ai) = πjP(ai|oj) / Σall j(πjP(ai|oj))
The calculations and results are shown in tables B4a (step 1) and B4b (step 2). Bayesian Probabilities suggest that, in a population of French:German:Italian of 0.5:0.33:0.17, those with light hair and blue eyes are most likely to be German at 0.71 and all other combinations likely to be French, although those with dark hair and brown eyes are equally likely to be Italians at 0.38.
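A minimal Javascript sketch of these 2 steps, applied to every attribute to regenerate Table B4b, might look like the following (names are illustrative only; the coefficients are those of Table B2 and the a priori probabilities those of Table B1):

const prior = [0.5, 0.3333, 0.1667];   // π for [French, German, Italian]
const pAO = {                          // Table B2 coefficients P(a|o)
  DarkBlue:    [0.3, 0.1, 0.2],
  DarkBrown:   [0.1, 0.1, 0.3],
  DarkOthers:  [0.1, 0.1, 0.1],
  LightBlue:   [0.1, 0.5, 0.1],
  LightBrown:  [0.2, 0.1, 0.2],
  LightOthers: [0.2, 0.1, 0.1]
};

const bayesTable = {};
for (const attribute in pAO) {
  const weighted = pAO[attribute].map((p, j) => p * prior[j]);   // step 1: πjP(ai|oj)
  const total = weighted.reduce((sum, v) => sum + v, 0);
  bayesTable[attribute] = weighted.map(v => v / total);          // step 2: normalize
}
console.log(bayesTable.LightBlue);     // ≈ [0.21, 0.71, 0.07], as in Table B4b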
The program
The structure of the program, and how to use it, are presented in the Help and Hints panel of the Classification by Basic Bayes Probability Program Page. The results produced, using the default example data, are as follows.
Attribute | French | German | Italian |
DarkBlue | 3 | 1 | 2 |
DarkBrown | 1 | 1 | 3 |
DarkOthers | 1 | 1 | 1 |
LightBlue | 1 | 5 | 1 |
LightBrown | 2 | 1 | 2 |
LightOthers | 2 | 1 | 1 |
The matrix of counts is produced if the program commences with the modelling data in Program 1. The program counts the numbers in each attribute/outcome combination, and tabulates the results as shown to the right. This is the same as Table B1 previously shown in this panel.
Attribute | French | German | Italian |
DarkBlue | 0.3 | 0.1 | 0.2 |
DarkBrown | 0.1 | 0.1 | 0.3 |
DarkOthers | 0.1 | 0.1 | 0.1 |
LightBlue | 0.1 | 0.5 | 0.1 |
LightBrown | 0.2 | 0.1 | 0.2 |
LightOthers | 0.2 | 0.1 | 0.1 |
The Model Coefficients are calculated if the program commences with Program 1, and copied from the text area if it commences with Program 2. This is the same as Table B2 (Model Coefficients P(a|o)) previously shown in this panel, but without the calculations.
Regardless of whether computation begins at Program 1, 2, or 3, the following results are shown.
| French | German | Italian |
a priori π | 0.5 | 0.3333 | 0.1667 |
The array of a priori probabilities is as shown to the right. The a priori probabilities are normalized by dividing each value by the total of the array.
Attribute | French | German | Italian |
DarkBlue | 0.6923 | 0.1538 | 0.1539 |
DarkBrown | 0.375 | 0.25 | 0.3751 |
DarkOthers | 0.5 | 0.3333 | 0.1667 |
LightBlue | 0.2143 | 0.7143 | 0.0714 |
LightBrown | 0.6 | 0.2 | 0.2 |
LightOthers | 0.6667 | 0.2222 | 0.1111 |
The table of a posteriori probabilities is as shown to the right. The type of table created depends on how the a priori probabilities are set by the user.
- Maximum Likelihood (P(o|a)) if a priori probabilities are not set
- Bayesian (πP(o|a)) if a priori probabilities are set
As the a priori probabilities are set in the reference example, the table is πP(o|a), the same as Table B4b previously shown in this panel, but without the calculations.
The highest probability for each row is marked bold to indicate the outcome chosen by the model.
Row | Attribute | French | German | Italian |
1 | DarkBlue | 0.6923 | 0.1538 | 0.1539 |
2 | LightOthers | 0.6667 | 0.2222 | 0.1111 |
3 | DarkBrown | 0.375 | 0.25 | 0.3751 |
.... etc |
When running Program 1, or when data are available in the Data text area for Programs 2 or 3, each attribute is processed by the model to produce the array of probabilities, as shown in the table to the right.
Program 4 creates a Javascript function that allows the user to calculate a posteriori probabilities from attributes. This function can be incorporated into any html page, or adapted to another computer language and incorporated into users' own programs.
StatsToDo provides a template, BasicBayesTemplate.html, which demonstrates how this function can be used. As the template contains all the explanations, they will not be further elaborated here. Users are encouraged to access, examine, rename, and modify the html page and explore how the function can be used.
This panel discusses the Naive Bayes model generally, and the calculations used in the Classification by Naive Bayes Probability Program Page.
Table N1: Counts
Attributes | Outcome |
Col | Attribute | French | German | Italian |
1 | Dark | 5 | 3 | 6 |
1 | Light | 5 | 7 | 4 |
2 | Blue | 4 | 6 | 3 |
2 | Brown | 3 | 2 | 5 |
2 | Others | 3 | 2 | 2 |
a priori π | 0.5 | 0.33 | 0.17 |
Naive Bayes is used to represent the Bayes Probability model, a modification of the original Bayes model described by Bayes and in the Wikipedia page on Bayes Theorem, and explained in the Wikipedia page on Naive Bayes.
The model uses 2 or more predictors, each with 2 or more attributes, to predict the probabilities of 2 or more outcomes. The term Naive refers to the naive assumption that the predictors are independent of each other.
In the default example in the Classification by Naive Bayes Probability Program Page, the data presented in the Introduction panel are restructured, so that the two predictors, hair color (2 attributes of Dark and Light) and eye color (3 attributes of Blue, Brown and Others), are counted separately. The restructured table is used in the program, and is shown to the right.
Please note: The term Col is used in all tables to represent predictor. Col1 is predictor 1 (in this example hair color), Col2 is predictor 2 (in this example eye color), and so on.
Building the model: Attribute (a) : Outcome (o)
Table N2. Model Coefficients P(a|o)
Col | Attribute | French | German | Italian |
1 | Dark | 5/10=0.5 | 3/10=0.3 | 6/10=0.6 |
1 | Light | 5/10=0.5 | 7/10=0.7 | 4/10=0.4 |
2 | Blue | 4/10=0.4 | 6/10=0.6 | 3/10=0.3 |
2 | Brown | 3/10=0.3 | 2/10=0.2 | 5/10=0.5 |
2 | Others | 3/10=0.3 | 2/10=0.2 | 2/10=0.2 |
The coefficients of the model, used to convert a priori to a posteriori probabilities, are the probabilities of attribute given outcome P(a|o). For the i th attribute from each predictor and the j th outcome, P(a|o) is calculated by dividing the number of cases with that attribute/outcome pair (Ni,j) by the sample size of that outcome (Nj).
In this example, the attributes are the hair color (Col 1) and eye color (Col 2), and the P(a|o) for each attribute from each predictor (Col) are calculated separately. The results are shown in Table N2 to the right.
Creating the pattern coefficient: pattern(p) : Outcome (o)
Table N2x. Pattern Coefficients P(p|o)
Pattern | Outcome |
Col 1 | Col 2 | French | German | Italian | Total |
Dark | Blue | 0.5x0.4=0.2 | 0.3x0.6=0.18 | 0.6x0.3=0.18 | 0.2+0.18+0.18=0.56 |
Dark | Brown | 0.5x0.3=0.15 | 0.3x0.2=0.06 | 0.6x0.5=0.3 | 0.15+0.06+0.3=0.51 |
Dark | Others | 0.5x0.3=0.15 | 0.3x0.2=0.06 | 0.6x0.2=0.12 | 0.15+0.06+0.12=0.33 |
Light | Blue | 0.5x0.4=0.2 | 0.7x0.6=0.42 | 0.4x0.3=0.12 | 0.2+0.42+0.12=0.74 |
Light | Brown | 0.5x0.3=0.15 | 0.7x0.2=0.14 | 0.4x0.5=0.2 | 0.15+0.14+0.2=0.49 |
Light | Others | 0.5x0.3=0.15 | 0.7x0.2=0.14 | 0.4x0.2=0.08 | 0.15+0.14+0.08=0.37 |
When presented with an array of attributes (a pattern), Naive Bayes creates the coefficient for that particular pattern/outcome combination P(p|o). This is created by multiplying the P(a|o) of each attribute/outcome combination in the array of attributes (pattern). An example using the first pattern:
- Col 1, hair color:Dark and outcome 1 French: P(a|o) = 0.5
- Col 2, eye color:Blue and outcome 1 French: P(a|o) = 0.4
- For pattern [Dark,Blue]/French P(p|o) = 0.5x0.4 = 0.2
- Equation : P(p|o) = Π(all P(a|o) in the array of attributes)
The coefficients are shown in Table N2x for demonstration purposes here. In practice, the P(p|o) coefficients are calculated dynamically, depending on the attributes in the pattern array. As this is a dynamic intermediate step, P(p|o) are not presented as results, but are used to produce the a posteriori probabilities in the same manner as P(a|o) in the basic Bayes model.
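A minimal Javascript sketch of this multiplication, using the Table N2 coefficients, might look like the following (the names are illustrative only, not those used by the program):

const pAO = {                          // Table N2 coefficients P(a|o), per outcome [French, German, Italian]
  hair: { Dark: [0.5, 0.3, 0.6], Light: [0.5, 0.7, 0.4] },                           // Col 1
  eye:  { Blue: [0.4, 0.6, 0.3], Brown: [0.3, 0.2, 0.5], Others: [0.3, 0.2, 0.2] }   // Col 2
};

function patternCoefficient(hair, eye) {
  // element-wise product of the P(a|o) of each attribute in the pattern
  return pAO.hair[hair].map((p, j) => p * pAO.eye[eye][j]);
}
console.log(patternCoefficient("Dark", "Blue"));   // ≈ [0.2, 0.18, 0.18], as in Table N2x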
Prediction 1. Maximum Likelihood P(o|p)
Table N3. Maximum Likelihood P(o|p)
Pattern | Outcome |
Col 1 | Col 2 | French | German | Italian |
Dark | Blue | 0.2/0.56=0.36 | 0.18/0.56=0.32 | 0.18/0.56=0.32 |
Dark | Brown | 0.15/0.51=0.29 | 0.06/0.51=0.12 | 0.3/0.51=0.59 |
Dark | Others | 0.15/0.33=0.45 | 0.06/0.33=0.18 | 0.12/0.33=0.36 |
Light | Blue | 0.2/0.74=0.27 | 0.42/0.74=0.57 | 0.12/0.74=0.16 |
Light | Brown | 0.15/0.49=0.31 | 0.14/0.49=0.29 | 0.2/0.49=0.41 |
Light | Others | 0.15/0.37=0.41 | 0.14/0.37=0.38 | 0.08/0.37=0.22 |
If a posteriori probability is calculated without the inclusion of a priori probability π, the result is probability of outcome given pattern P(o|p), also called Maximum Likelihood. This describes the model, and demonstrates the relationship between patterns and outcomes.
P(o|p) is calculated from the dynamically created P(p|o), with the values demonstrated in Table N2x. For each outcome j, its probability to be predicted by a pattern i (pi), the calculation is
P(oj|pi) = P(pi|oj) / Σ(P(pi|oj)) for all outcomes
The calculations and results are shown in Table N3 to the right. They suggest that, without including a priori probability, those with brown eyes are most likely to be Italians (0.59 for dark hair and 0.41 for light hair), those with light hair and blue eyes German (0.57), and the rest French (0.36, 0.45, 0.41).
Prediction 2. Bayesian Probability πP(o|p)
Table N4a. πP(p|o)
Pattern | Outcome |
Col 1 | Col 2 | French | German | Italian | Total |
Dark | Blue | 0.2x0.5=0.1 | 0.18x0.33=0.06 | 0.18x0.17=0.03 | 0.10+0.06+0.03=0.19 |
Dark | Brown | 0.15x0.5=0.08 | 0.06x0.33=0.02 | 0.3x0.17=0.05 | 0.08+0.02+0.05=0.15 |
Dark | Others | 0.15x0.5=0.08 | 0.06x0.33=0.02 | 0.12x0.17=0.02 | 0.08+0.02+0.02=0.12 |
Light | Blue | 0.2x0.5=0.1 | 0.42x0.33=0.14 | 0.12x0.17=0.02 | 0.10+0.14+0.02=0.26 |
Light | Brown | 0.15x0.5=0.08 | 0.14x0.33=0.05 | 0.2x0.17=0.03 | 0.08+0.05+0.03=0.16 |
Light | Others | 0.15x0.5=0.08 | 0.14x0.33=0.05 | 0.08x0.17=0.01 | 0.08+0.05+0.01=0.13 |
Table N4b. Bayesian Probability πP(o|p)
Pattern | Outcome |
Col 1 | Col 2 | French | German | Italian |
Dark | Blue | 0.1/0.19=0.53 | 0.06/0.19=0.32 | 0.03/0.19=0.16 |
Dark | Brown | 0.08/0.15=0.52 | 0.02/0.15=0.14 | 0.05/0.15=0.34 |
Dark | Others | 0.08/0.12=0.65 | 0.02/0.12=0.17 | 0.02/0.12=0.17 |
Light | Blue | 0.1/0.26=0.38 | 0.14/0.26=0.54 | 0.02/0.26=0.08 |
Light | Brown | 0.08/0.16=0.48 | 0.05/0.16=0.3 | 0.03/0.16=0.22 |
Light | Others | 0.08/0.13=0.56 | 0.05/0.13=0.35 | 0.01/0.13=0.1 |
If a posteriori probability is calculated by changing the a priori probability π, as in the majority of predictions, the result is probability of outcome given pattern and a priori probability πP(o|p). This is usually referred to as Naive Bayesian Probability.
πP(o|p) is calculated from the dynamically calculated P(p|o) as shown in Table N2x and the a priori probabilities π from Table N1. The calculations are in 2 steps. Firstly, the coefficient P(p|o) is adjusted by the a priori probability of each outcome; then the adjusted coefficients are normalized by the total for all outcomes. For each outcome j, its probability to be predicted by a pattern i (pi) and a priori probability πj, the calculations are as follows
- πjP(pi|oj) = P(pi|oj) x πj
- πjP(oj|pi) = πjP(pi|oj) / Σall j(πjP(pi|oj))
The calculations and results are shown in tables N4a (step 1) and N4b (step 2). Bayesian Probabilities suggest that, in a population of French:German:Italian of 0.5:0.33:0.17, those with light hair and blue eyes are most likely to be German at 0.54, and all other combinations likely to be French.
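Putting the two steps together, a minimal Javascript sketch of a complete Naive Bayes prediction for one pattern might look like the following (names are illustrative only):

const prior = [0.5, 0.3333, 0.1667];   // π for [French, German, Italian]
const pAO = {                          // Table N2 coefficients P(a|o)
  hair: { Dark: [0.5, 0.3, 0.6], Light: [0.5, 0.7, 0.4] },
  eye:  { Blue: [0.4, 0.6, 0.3], Brown: [0.3, 0.2, 0.5], Others: [0.3, 0.2, 0.2] }
};

function naiveBayes(pattern) {         // pattern = { hair: ..., eye: ... }
  let weighted = prior.slice();        // start from π
  for (const predictor in pattern) {
    const coef = pAO[predictor][pattern[predictor]];   // P(a|o) for this attribute
    weighted = weighted.map((v, j) => v * coef[j]);
  }
  const total = weighted.reduce((sum, v) => sum + v, 0);
  return weighted.map(v => v / total); // πP(o|p)
}
console.log(naiveBayes({ hair: "Light", eye: "Blue" }));   // ≈ [0.38, 0.54, 0.08], as in Table N4b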
The program
The structure of the program, and how to use it, are presented in the Help and Hints panel of the Classification by Naive Bayes Probability Program Page. The results produced, using the default example data, are as follows.
Col | Attribute | French | German | Italian |
1 | Dark | 5 | 3 | 6 |
1 | Light | 5 | 7 | 4 |
2 | Blue | 4 | 6 | 3 |
2 | Brown | 3 | 2 | 5 |
2 | Others | 3 | 2 | 2 |
The matrix of counts is produced if the program commences with the modelling data in Program 1. The program counts the numbers in each attribute/outcome combination, and tabulates the results as shown to the right. This is the same as Table N1 previously shown in this panel.
Col | Attribute | French | German | Italian |
1 | Dark | 0.5 | 0.3 | 0.6 |
1 | Light | 0.5 | 0.7 | 0.4 |
2 | Blue | 0.4 | 0.6 | 0.3 |
2 | Brown | 0.3 | 0.2 | 0.5 |
2 | Others | 0.3 | 0.2 | 0.2 |
The Model Coefficients are calculated if the program commences with Program 1, and copied from the text area if it commences with Program 2. This is the same as Table N2 (Model Coefficients P(a|o)) previously shown in this panel, but without the calculations.
Regardless of whether computation begins at Program 1, 2, or 3, the following results are shown.
| French | German | Italian |
a priori π | 0.5 | 0.3333 | 0.1667 |
The array of a priori probabilities is as shown to the right. The a priori probabilities are normalized by dividing each value by the total of the array.
Attributes | Outcomes |
Col 1 | Col 2 | French | German | Italian |
Dark | Blue | 0.5263 | 0.3158 | 0.1579 |
Dark | Brown | 0.5172 | 0.1379 | 0.3449 |
Dark | Others | 0.6522 | 0.1739 | 0.1739 |
Light | Blue | 0.3846 | 0.5384 | 0.0769 |
Light | Brown | 0.4839 | 0.301 | 0.2151 |
Light | Others | 0.5556 | 0.3456 | 0.0988 |
The table of a posteriori probabilities is as shown to the right. The type of table created depends on how the a priori probabilities are set by the user.
- Maximum Likelihood (P(o|p)) if a priori probabilities are not set
- Bayesian (πP(o|p)) if a priori probabilities are set
As the a priori probabilities are set in the reference example, the table is πP(o|p), the same as Table N4b previously shown in this panel, but without the calculations.
The highest probability for each row is marked bold to indicate the outcome chosen by the model.
Please Note: A Naive Bayes model may include many predictors. As the number of possible combinations increases exponentially with the number of predictors, it may not be possible to present all combinations before the time limit set by the server expires and the program fails. This table is therefore only presented when the total number of combinations does not exceed 500, as is the case with the default example, where there are 2x3=6 combinations.
Col 1 | Col 2 | French | German | Italian |
Dark | Blue | 0.5263 | 0.3158 | 0.1579 |
Light | Others | 0.5556 | 0.3456 | 0.0988 |
Dark | Brown | 0.5172 | 0.1379 | 0.3449 |
.... etc |
When running Program 1, or when data are available in the Data text area for Programs 2 or 3, each pattern is processed by the model to produce the array of probabilities, as shown in the table to the right.
Please Note: To avoid the program failing because the time limit set by the server expires, only the first 500 rows of data are calculated and presented. This is not a problem for the default example, which has only 30 rows of data.
Program 4 creates a Javascript function that allows the user to calculate a posteriori probabilities from attributes. This function can be incorporated into any html page, or adapted to another computer language and incorporated into users' own programs.
StatsToDo provides a template, NaiveBayesTemplate.html, which demonstrates how this function can be used. As the template contains all the explanations, they will not be further elaborated here. Users are encouraged to access, examine, rename, and modify the html page and explore how the function can be used.
Explanations for the terms and abbreviations used, and the mathematical algorithms, have already been discussed in the previous panels. This panel discusses, in conceptual terms, the differences between the Basic and Naive Bayesian models, and how they may be used.
Basic Bayes Probability Model
The Basic Bayes model is based on the original Bayes Theorem. The term "Basic" is used in StatsToDo to distinguish it from the Naive Bayes model. In this discussion therefore, Basic Bayes and Bayes mean the same thing.
Conditional probability is estimating and changing our estimate of probability when presented with conditions we know are associated with the outcome of interest. For example, if we know that a particular poker player blinks a lot when he is bluffing, then we can conclude he is likely to be bluffing when he blinks a lot. The model is simple, elegant, and intuitively easy to understand and accept. What follows are discussions on its usage in practice.
Advantages
The main advantage is that the model is self-evidently valid, as it is the mathematical form of an accepted practice of basing decisions on knowledge and experience. Given that each attribute and each outcome is unique, their relationship is also unique, so there is no ambiguity in what the results represent. The user can therefore be confident in translating the numerical results to decisions and actions, provided the model is based on valid data and the a priori probabilities are well chosen.
The model is also easy to use. Tables of coefficients, and tables translating attributes to probabilities of outcomes, can be produced by researchers and distributed to end decision makers, either physically as reference tables or electronically as computer applications. The accompanying html page BasicBayesTemplate.html is a template demonstrating how the coefficients developed can be translated into a tool for front line decision makers.
Disadvantages
The major disadvantage of the basic Bayesian model is its inability to cope with too many predictors.
- In our example, with 2 predictors of 2 hair and 3 eye colors, the number of combined attributes is 2x3=6. If we are to add skin color (say pale, tan, and dark), the number of attributes will be 6x3=18. If we then add body build (skinny, fat, muscular), 18x3=54. Then temperament (stoic, emotional), 54x2=108, and so on. Each additional predictor multiplies the number of combinations, and this can exceed 1000 if more than 9 binary predictors are included.
- The problem is not so much computational complexity, as the high speed and large memory of modern computers can always cope.
- The main problem is finding sufficient cases to build a valid reference model in the first place. If probabilities are to be accurate to 0.1 (10%), then at least 10 cases are needed in each attribute/outcome pair. Some of the combinations may be uncommon (say fat, calm, dark skin, light hair, blue eyes, and French), and to obtain sufficient cases to fill all combinations with sufficient numbers may require sample sizes that are impractical to collect.
The model cannot cope with missing data, as each attribute is a combination of all the predictors, and all observations must be present for the attribute to be valid. For example, we cannot make a prediction when the person is too far away for us to know the color of his/her eyes, or if the person has no hair. Missing data is common in situations where decisions need to be made with incomplete data. For example, when a patient arrives in the emergency department of a hospital, the triage nurse has to make a decision based on the major symptom. Once admitted, decisions are based on additional history and physical examination. After a few hours, decisions are modified with test results. A day later, there is more modification based on the progress observed. This means that missing data can be normal, and a single model cannot cope with this. In these situations, individual models will need to be produced for every conceivable combination of data and missing data that may arise, and this magnifies the scalability problem.
The model is inflexible, in that each attribute in a compound predictor is actually a combination of information from different original predictors. This means that no original predictor can be added, removed, or its relationship with outcomes altered without changes to attributes in the model. If any change is required, then a new model will need to be constructed. For example, after we used our example model for a while, we observed that the 3 ethnic groups have different temperaments, and wish to include that in our model. We cannot simply add temperament to the model, as we will have to create 12 new patterns of hair_eye_temperament, essentially building a new model with new data. The problem of stability is particularly important in rapidly changing environments, such as medical care where disease patterns and technologies evolve rapidly.
Usage
The basic Bayesian model is preferred if the following conditions can be met
- When the relationship between predictors and outcomes are stable
- When the total number of attributes is not so great that an adequate sample size to build the model cannot be obtained
- Where there is no attribute/outcome combination that is so uncommon that the total sample size required cannot be reached
- When missing data is not expected
- Where a sufficiently large sample size is available to build the model.
Naive Bayes Probability
Naive Bayes probability is used only when there are multiple predictors. It differs from the basic model by assuming, naively, that the predictors are independent of each other.
Conceptually, the model can be considered as a sequence of basic Bayesian calculations. We start with the a priori probabilities π and modify them with the attribute of the first predictor to produce the first a posteriori probabilities; these become the a priori probabilities for the attribute of the second predictor, the results of which become the a priori for the third predictor, and so on.
The computations are simplified by multiplying all the coefficients P(a|o) representing the attributes to create the pattern coefficient P(p|o), and using this in the same way as in the basic Bayesian model.
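A small Javascript sketch (illustrative names only) can demonstrate that the sequential updating described above and the multiplied pattern coefficient give the same answer:

const prior = [0.5, 0.3333, 0.1667];   // π for [French, German, Italian]
const hairLight = [0.5, 0.7, 0.4];     // P(a|o) for Light hair
const eyeBlue = [0.4, 0.6, 0.3];       // P(a|o) for Blue eyes

const normalize = v => { const t = v.reduce((s, x) => s + x, 0); return v.map(x => x / t); };

// sequential: prior -> update with hair -> the result becomes the prior for eyes
const afterHair = normalize(prior.map((p, j) => p * hairLight[j]));
const sequential = normalize(afterHair.map((p, j) => p * eyeBlue[j]));

// pattern coefficient form: multiply the coefficients first, then apply the prior once
const product = normalize(prior.map((p, j) => p * hairLight[j] * eyeBlue[j]));

console.log(sequential, product);      // both ≈ [0.38, 0.54, 0.08]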
The results obtained from the example are already presented in the Naive Bayes panel of this page and from the program in the Classification by Naive Bayes Probability Program Page. The explanations and interpretations of these results are the same as those for the basic Bayes probability, and therefore the results from the example will not be further discussed here.
The advantages and disadvantages of using the Naive Bayes model are the mirror image of the basic Bayes model.
Advantages
The Naive Bayes model can include a large number of predictors.
- The number of P(a|o) values required increases linearly with the number of predictors for the Naive Bayes model, compared with exponentially for the basic Bayes model. From our example, 2 hair and 3 eye colors require 2+3=5 P(a|o) cells for each outcome in Naive Bayes, instead of 2x3=6 P(p|o) in basic Bayes. Using n binary predictors, Naive Bayes requires 2xn individual attributes, but basic Bayes requires 2^n combinations of attributes (see the small sketch after this list)
- By assuming that all attributes are independent of each other, a workable model can be built provided a sufficient number of cases is available for each attribute, compared with basic Bayes, where sufficient numbers are required for each combination of attributes
- In other words, the model can be built using a smaller and more practical sample size.
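As a small illustration of this difference in growth, assuming binary predictors only (illustrative code, not part of the programs):

// number of coefficient cells per outcome needed for n binary (2-attribute) predictors
function cellsNeeded(n) {
  return { naiveBayes: 2 * n, basicBayes: Math.pow(2, n) };
}
console.log(cellsNeeded(3));    // { naiveBayes: 6, basicBayes: 8 }
console.log(cellsNeeded(10));   // { naiveBayes: 20, basicBayes: 1024 }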
The Naive Bayes model copes better with missing data. As all attributes are assumed independent, the absence of an attribute merely means that the influence of that attribute is not included. The results will be less precise, but the algorithm will calculate the a posteriori with what data it has, instead of not computing at all as the Basic Bayesian model does when it confronts missing data.
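A minimal Javascript sketch (illustrative names only, not the program's code) of how such a prediction can be made when one predictor is missing:

const prior = [0.5, 0.3333, 0.1667];   // π for [French, German, Italian]
const pAO = {
  hair: { Dark: [0.5, 0.3, 0.6], Light: [0.5, 0.7, 0.4] },
  eye:  { Blue: [0.4, 0.6, 0.3], Brown: [0.3, 0.2, 0.5], Others: [0.3, 0.2, 0.2] }
};

function naiveBayesWithMissing(pattern) {
  let weighted = prior.slice();
  for (const predictor in pAO) {
    const attribute = pattern[predictor];
    if (attribute == null) continue;   // missing: this predictor is simply left out
    weighted = weighted.map((v, j) => v * pAO[predictor][attribute][j]);
  }
  const total = weighted.reduce((sum, v) => sum + v, 0);
  return weighted.map(v => v / total);
}

// eye color unknown (the person is too far away): only hair color is used
console.log(naiveBayesWithMissing({ hair: "Dark" }));   // ≈ [0.56, 0.22, 0.22]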
The Naive Bayes model is more flexible. It is easier to add, delete, or change a predictor. Given the assumption that all predictors are independent, and the P(p|o) is calculated dynamically, the addition, removal, or change in any predictor should not affect the performance of the other predictors. Most of the calculations would remain the same, but the results would be modified by the changes. This flexibility is particularly useful in situations where the relationship between predictors and outcome may change, or technological development requires additional predictors to be added from time to time, such as in medical care.
Disadvantages
The main disadvantage of, or objection to, the Naive Bayes model is the naive assumption that the predictors are independent, because this assumption is rarely correct. Two types of dependency may exist in any set of predictors.
- Correlation may exist between predictors, as many predictors have common precursors in genetics, geography, history, culture, and so on. When a correlation exists, the common precursor is used more than once in the calculation, and this inserts an unaccounted bias into the results. In our example, hair and eye color may be correlated, as they are both dependent on pigmentation generally. If we combine these with, say, temperament in a model, we will use the influence of pigmentation twice (two colors) against temperament once, resulting in a biased decision.
- Interaction is when predictors have synergistic or inhibitory effects on each other in their relationship with outcomes.
- An example of synergism is in the treatment of cancer. If surgery has a cure rate of 10% and chemotherapy 10%, then we would expect that providing both will result in a cure rate of 19% (1-0.9x0.9=0.19) if there is no interaction. In many cases however, surgery makes the remaining cancer cells more sensitive to chemotherapy, and the cure rate of combined therapy is more than 19%.
- An example of inhibition is in the use of antibiotics to treat a particular infection. An antibiotic, say penicillin, may cure this infection in 10% of cases, and another, say tetracycline, also 10%. In the absence of interaction we would expect a 19% cure rate if both are used. However, penicillin works by destroying bacterial cell walls and is most effective when bacteria are actively growing, while tetracycline works by slowing the growth of bacteria. Giving both may therefore result in one inhibiting the effect of the other, and a cure rate of less than 19%.
The error caused by the naive assumption in the Bayes model is difficult to quantify, so the precision and accuracy can only be estimated when the model is deployed after development.
Choice between models
The Basic Bayes model should be the first option, given its theoretical validity and ease of use.
Given that the assumption of predictor independence is naive and contains unquantifiable bias, the Naive Bayes model is a compromise solution, chosen when the Basic Bayes model is not practicable. When chosen, the following should be observed
- The Naive Bayes model should be considered only as an approach to develop a prediction model; its validity and usefulness should not be presumed.
- The choice of predictors should be carefully considered to ensure that they are as independent from each other as possible. When correlation or interaction is suspected, predictors can be combined into compound predictors before they are added to the model
- The performance of the completed model requires constant review, and the model should be adjusted when inaccuracies are detected or when circumstances change.
The following references are introductory in nature, mainly to help the inexperienced.
Basic Bayes
https://en.wikipedia.org/wiki/Bayes%27_theorem Bayes Probability from Wikipedia. This provides a clear definition, and links to references for more reading
https://arbital.com/p/bayes_rule/?l=1zq An online introduction to Bayes theorem. This provides good detailed explanations, and links to additional pages that cater to different levels of knowledge and needs
http://jim-stone.staff.shef.ac.uk/BookBayes2012/bookbayesch01WithR.pdf The first chapter of a book on Bayes Theorem, and provides a clear explanation and examples of Bayes that can be understood by beginners
Naive Bayes
https://en.wikipedia.org/wiki/Naive_Bayes_classifier Naive Bayes from Wikipedia. A concise and clear description, and provides references to more reading
https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/ a teaching page with explanations for beginners
Mueller JP and Massaron L (2016) Machine Learning for Dummies. John Wiley and Sons, Inc., New Jersey. ISBN 978-1-119-24551-3. p.158-163.
This gives only a brief introduction to both Bayes algorithms. However, it provides perspectives on how Bayes probability is used in data analysis, in comparison with many other methods. There is not much guidance on calculations, as the book relies heavily on the use of R and Python packages.
Old references
The following 2 references introduced me to Naive Bayes many years ago, before the term naive became commonly used. I include them partly out of sentiment, but mostly because they have major influences on how the models are presented on this and the two programming web pages.
Warner HR, Toronto AF, Veasey LG, and Stephenson R (1961) A Mathematical Approach to Medical Diagnosis. Application to Congenital Heart Disease. JAMA 117:3 p.177-183.
This paper used the formula that is the same as what is now Naive Bayes probability, but neither Bayes nor Naive was mentioned. The paper quoted Lusted as the source of the mathematical approach.
The paper discussed the difficulty of combining multiple predictors into a single compound predictor, as the list of observations needed to differentiate between congenital heart diseases was too great to do so in practice (my interpretation).
The paper presented a complicated set of mathematical arrangements to deal with a mix of binomial and multinomial predictors. It was easy enough to understand, but difficult to present concisely and dynamically in the web page format, and I had tried a number of different ways to present the calculations and results before ending with the format on these pages.
The paper went on to warn the reader about the pitfalls of correlation between predictors and the consequent tautology, and recommended their careful selection and management.
Over the next few years, the authors published a number of additional papers on the subject, reviewing the results and validating the use of the model they built.
This paper therefore provided the initial prompt that led me to develop the arguments listed in the advantages/disadvantages and usage sections of the discussion panel. It also prompted the evolution of how the concept and results are presented, until the current form on these pages.
Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology. McGraw Hill Book Company, New York. Library of Congress No. 73-147164, ISBN 07-047935-6. p.400-412.
This is an old book that may be out of print. In its days it was a recommended text for multivariate statistics at the Masters level.
Included in the book was a brief chapter on Bayes Probability, and that chapter presented the formulae which are the same as those currently known as Naive Bayes. The term "pattern probability" was used, and my guess is that the term Naive was at the time not in common use.
I began to understand Bayesian probability from reading this book, as by using different abbreviations, the book clarifies the difference between the basic and the naive models.
From this book, I acquired the terms attribute and pattern, and the abbreviation πP(o|p) to represent Bayes Probability. The book used j for outcomes, x for attributes, and arrays of attributes for patterns. I converted these to a for attribute, p for pattern, and o for outcome, making it easier for the inexperienced to follow the explanations and algorithms.