Backpropagation

Copyright @2020.
All Rights Reserved.

StatsToDo: Backpropagation Neural Network


Explanations
This page provides the program and explanations for the basic Backpropagation Neural Net.
Neural networks are a vast subject that has developed rapidly in the 21st century, as they form the basis of machine learning and artificial intelligence. Backpropagation is one of the earliest of these algorithms to be developed, and forms the basic framework for many others.
Backpropagation began as a simple adaptive learning algorithm, and it is presented on this page in that form, as a Javascript program that users can run directly or, if required, copy and adapt into their own programs. The program is best viewed as a form of non-parametric regression, where the variables are based on Fuzzy Logic, each a number between 0 (false) and 1 (true).
Fuzzy Logic The Greek philosopher Aristotle stated that things can be true or not true, but cannot be both. Fuzzy logic replaces this with the view that true and false are only extremes that seldom exist, while reality is mostly somewhere in between. Mathematically this is represented as a number (y) between 0 (false) and 1 (true), related to a linear measurement (x) by the logistic curve y = 1/(1+exp(-x)), as shown in the plot to the left, where a value of 0 is translated to a probability of 0.5, -∞ to 0, and +∞ to 1. If we accept a value <=0.05 as unlikely to be true and >=0.95 as likely to be true, then we can rescale any measurement so that these two thresholds fall at -2.9444 and +2.9444, which are then logistically transformed to 0.05 and 0.95. A program for logistic transformation is available at Transformation.php.
An example of this is shown in the plot to the right, which translates the measurement of fetal blood pH into a diagnosis of acidosis, by first rescaling the normally accepted non-acidosis value of 7.35 to -2.9444 (logistic value 0.05) and the normally accepted acidosis value of 7.2 to +2.9444 (logistic value 0.95). This rescaling changes an otherwise normally distributed measurement into a bimodal one of acidosis and non-acidosis, compressing the values less than 7.2 and more than 7.35, while stretching the values in between.
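The rescaling described above can be sketched in Javascript. This is only an illustration and not the code behind Transformation.php: the names logistic and fuzzyValue are hypothetical, and a straight-line rescaling between the two reference values is assumed.

```javascript
// Logistic curve: maps any number x to a value between 0 and 1.
function logistic(x) {
  return 1 / (1 + Math.exp(-x));
}

// logit(0.95) = ln(0.95 / 0.05), approximately 2.9444
const LOGIT_95 = Math.log(0.95 / 0.05);

// Hypothetical helper: rescale a measurement linearly so that falseValue
// maps to -2.9444 and trueValue to +2.9444, then apply the logistic curve.
function fuzzyValue(value, falseValue, trueValue) {
  // position is 0 at falseValue and 1 at trueValue
  const position = (value - falseValue) / (trueValue - falseValue);
  const x = LOGIT_95 * (2 * position - 1);
  return logistic(x);
}

// Fetal blood pH: 7.35 is taken as non-acidosis, 7.2 as acidosis.
for (const pH of [7.4, 7.35, 7.275, 7.2, 7.1]) {
  console.log(pH, fuzzyValue(pH, 7.35, 7.2).toFixed(3));
}
```

Because the false reference (7.35) is larger than the true reference (7.2), the same helper handles measurements where a lower value means "more true".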
Neurone The processing unit in a Backpropagation neuronet is the perceptron, based on the concept of the nerve cell, the neurone. The unit receives one or more inputs (dendrites) and processes them to produce an output (axon). Mathematically, this is divided into two processes.

The first is to combine the inputs so that y = Σ w_i v_i + c, where v_i are the input values, w_i the weights given to each input, and c the bias value
The combined value (y) is then transformed into a Fuzzy Logic value between 0 (false) and 1 (true). This can be binary (>0.5=1, <0.5=0), but most commonly the logistic transform is used.
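The two processes can be sketched as a single Javascript function. This is a minimal illustration rather than the code used by the program on this page; the function names and the example coefficients are made up.

```javascript
// Logistic transform: converts the combined value into a number between 0 and 1.
function logistic(y) {
  return 1 / (1 + Math.exp(-y));
}

// A single neurone. v: input values, w: one weight per input, c: bias value
function neurone(v, w, c) {
  let y = c;
  for (let i = 0; i < v.length; i++) {
    y += w[i] * v[i];          // process 1: y = sum of w_i * v_i, plus c
  }
  return logistic(y);          // process 2: transform to a Fuzzy Logic value
}

// Illustrative (made-up) coefficients for a neurone with 2 inputs
const output = neurone([1, 0], [2.0, -1.5], -0.5);
console.log(output.toFixed(4));
```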

Neuronet and Backpropagation
The Backpropagation neuronet is an arrangement of neurones as shown to the right, and consists of the following

The input layer, which contains as many neurones as there are inputs. In this example, there are 2 input neurones
One or more middle layers, each containing a number of neurones. In this example there is 1 middle layer containing 3 neurones
The output layer, which contains as many neurones as there are outputs. In this example, there is 1 output neurone

Training the neuronet
The coefficients (w and c) in all of the neurones in a backpropagation neuronet consist of random numbers when the neuronet is initially constructed. Training consists of presenting a series of templates (input and output) to the neuronet, which adapts (learns) through the following processes

Forward Propagation

Each input entered via the input layer is passed to each neurone of the middle layer. Each neurone then processes all of its inputs (dendrites) and produces its output (axon)
If there is more than 1 middle layer, the outputs from each layer become the inputs of the next layer, until the output layer is reached, the neurones of which produce the final output values.
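Forward propagation can be sketched as a loop over the layers. The sketch below assumes a particular layout (each neurone stored as its bias followed by one weight per input) and uses made-up coefficients for a network with 2 inputs, 3 middle neurones and 1 output; it is not the code of the program on this page.

```javascript
function logistic(y) {
  return 1 / (1 + Math.exp(-y));
}

// Assumed layout: net[k] is a layer after the input layer; each neurone is an
// array [c, w_1, ..., w_n] holding its bias followed by one weight per input.
function forwardPropagate(net, inputs) {
  let values = inputs;
  for (const layer of net) {
    // The outputs of this layer become the inputs of the next one.
    values = layer.map(coef => {
      let y = coef[0];
      values.forEach((v, i) => { y += coef[i + 1] * v; });
      return logistic(y);
    });
  }
  return values;   // outputs of the last (output) layer
}

// Made-up coefficients for a 2-3-1 network
const net = [
  [[0.1, 0.5, -0.4], [-0.3, 0.8, 0.2], [0.2, -0.6, 0.9]],  // middle layer: 3 neurones, 2 inputs each
  [[-0.2, 1.0, -1.2, 0.7]]                                  // output layer: 1 neurone, 3 inputs
];
console.log(forwardPropagate(net, [1, 0]));
```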

Backward Propagation

The output values are compared with the template output values. The coefficients in each neurone (w and c) are then changed so that the results would be closer to the template output values
Going backwards through the layers, each preceding layer is similarly altered so that each neurone would produce an output that is closer to the required value

For each template in the training data set, the error is estimated as the difference between the output values produced and the output values in the template
The maximum error for each iteration of the whole dataset is estimated, and compared with the acceptable error value. The training is re-iterated until the maximum error for each iteration is less than the acceptable error. At this point, the training is completed, and the values in the coefficients represent the "memory" of the training, and can be used to reproduce the template output values from inputs.
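The training cycle described above can be sketched as follows. The page does not state the exact formula used to change the coefficients, so this sketch assumes the standard delta rule for logistic neurones; the data layout, the function names and the small logical OR data set are illustrative assumptions.

```javascript
function logistic(y) { return 1 / (1 + Math.exp(-y)); }

// Assumed layout: net[k][j] = [c, w_1, ..., w_n] for neurone j of layer k
// (layers after the input layer). Returns the values of every layer.
function forward(net, inputs) {
  const values = [inputs];
  for (const layer of net) {
    const prev = values[values.length - 1];
    values.push(layer.map(coef =>
      logistic(prev.reduce((y, v, i) => y + coef[i + 1] * v, coef[0]))));
  }
  return values;
}

// Present one template and adjust the coefficients (standard delta rule,
// assumed here). Returns the largest output error before the adjustment.
function trainTemplate(net, inputs, targets, rate) {
  const values = forward(net, inputs);
  const out = values[values.length - 1];
  const maxError = Math.max(...out.map((o, j) => Math.abs(targets[j] - o)));
  // Error signal of each output neurone: error times slope of the logistic curve
  let deltas = out.map((o, j) => (targets[j] - o) * o * (1 - o));
  for (let k = net.length - 1; k >= 0; k--) {
    const prev = values[k];
    // Pass the error signal back to the preceding layer before changing weights
    const prevDeltas = k === 0 ? [] : prev.map((v, i) =>
      v * (1 - v) * net[k].reduce((s, coef, j) => s + coef[i + 1] * deltas[j], 0));
    net[k].forEach((coef, j) => {
      coef[0] += rate * deltas[j];
      prev.forEach((v, i) => { coef[i + 1] += rate * deltas[j] * v; });
    });
    deltas = prevDeltas;
  }
  return maxError;
}

// Repeat over the whole data set until the maximum error is acceptable.
function train(net, templates, nInputs, rate, acceptableError, maxIterations) {
  let cycle = 0, maxError = Infinity;
  while (cycle < maxIterations && maxError >= acceptableError) {
    maxError = 0;
    for (const row of templates) {
      const e = trainTemplate(net, row.slice(0, nInputs), row.slice(nInputs), rate);
      maxError = Math.max(maxError, e);
    }
    cycle++;
  }
  return { cycle, maxError };
}

// Example: a 2-2-1 net with random starting coefficients learning logical OR
const rnd = () => Math.random() * 2 - 1;
const net = [[[rnd(), rnd(), rnd()], [rnd(), rnd(), rnd()]], [[rnd(), rnd(), rnd()]]];
const templates = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]];
console.log(train(net, templates, 2, 0.8, 0.05, 20000));
```

The number of cycles reported will differ from run to run, because the starting coefficients are random.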

Using the trained neuronet
At the end of training, the set of coefficients represents the "memory" that has been trained, and can be used to produce outputs from sets of inputs. A simple neuronet can be processed manually, but usually the set of coefficients is incorporated into a computer program or hardwired into machinery.
From the Javascript program on this page, the trained neural network can be presented as a program (html and Javascript code) that the user can copy to a text editor and save as an html file. The html program can then be used to interpret future data
References

Users should be aware that neural networks generally, and backpropagation in particular, have undergone dramatic development in the 21st century, and the current complexity and capability of these algorithms greatly exceed the content of this page.
The program on this page is a simple and primitive one, and can probably be used for diagnostic or therapeutic decision making in clearly defined clinical domains, with 5-20 inputs, 10-20 patterns to learn, and a training dataset of no more than a few hundred templates. It is insufficient for complex patterns that require large datasets, such as predicting share prices, company profitability, or the weather, where ambiguous data, multiple causal inputs and outputs, unknown patterns, and massive training data are involved.
The following are references for beginners. They introduce the concept, and lead to further reading.
Mueller J P and Massaron L (2019) Deep Learning for Dummies. John Wiley and Sons, Inc., New Jersey. ISBN 978-1-119-54303-9. Chapter 7 and 8 p.131-162. A very good introduction to neuronet and Backpropagation
On Line

https://en.wikipedia.org/wiki/Backpropagation Wikipedia on Backpropagation
https://blog.revolutionanalytics.com/2017/07/nnets-from-scratch.html An introduction to the concepts
https://www.datacamp.com/community/tutorials/neural-network-models-r A tutorial in using one of the R packages
https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf The R resource for a really sophisticated Backpropagation package
https://www.rdocumentation.org/packages/nnet/versions/7.3-16/topics/nnet Documentation for the neural net presented in the R panel of this page

Calculations
Hints and Suggestions
This panel explains how the program in the program panel can be run, and provides some suggestions on how to make the program run efficiently
The Structure
This is a column of numbers which represents the number of neurones in each layer. The minimum is 2 rows, input and output. The most common is 3 rows, with a single middle layer. In theory there can be any number of middle layers, and any number of neurones in each layer. The following general approach can be used, although each network is unique, and some trial and error may be necessary

More neurones (in terms of layers or neurones per layer) are required when there are more inputs and outputs, when the patterns in the training data are more complex, when more data values lie away from 0 and 1, and when similar patterns of input values are more often related to different outcome values.
Where the model is similar to regression, with linear relationships between inputs and outputs, only 2 layers (input and output) are necessary.
Where a limited number of cause and effect patterns are clearly represented in the training data, one, or at most 2, middle layers should suffice. Where many or unrecognized patterns exist in the training data, such as when training a network to predict share prices, many layers and neurones are required, needing high speed computers with dedicated processing over a long time
The number of neurones in the middle layers can be determined by trial and error. Although not obligatory, it is useful to have at least the same number of neurones as inputs in the first middle layer, and more neurones than the number of outputs in the last middle layer.
In our example (a simple XOR simulator plus a switch), there is 1 middle layer which contains 4 neurones, 1 more than the number of inputs and 3 more than the number of outputs. The example net will still train with 2 or 3 neurones in the middle layer, but requires many more iterations to reach the same precision.

The values representing the structure are placed in the Network Structure text area. In our example, there are 3 inputs and one output. The middle layer contains 4 neurones.
Data : Input and Output
The data is a table of numbers representing the template patterns. It can have any number of rows, but the number of columns is the number of inputs (first row in the structure) plus the number of outputs (last row in the structure). For training, the number of columns must conform to inputs + outputs. To use a trained net to interpret a set of data, only the input columns are required.
All data used by backpropagation, whether input parameters or result outputs, represent Fuzzy Logic, and are numerically represented as values between 0 for false and 1 for true. Real data will therefore need to be edited to conform with this. Fuzzy Logic is discussed in the introduction panel, and will not be elaborated here.

For the simplest binary groups (no/yes, false/true, female/male), 0 for false and 1 for true can be used
For multiple groups, the easiest is to have an input for each group. For example 1 0 0 0 for group A, 0 1 0 0 for group B, 0 0 1 0 for group C and 0 0 0 1 for group D.
A more abbreviated set of dummy variables can also be used. For example 0 0 for group A, 0 1 for group B, 1 0 for group C and 1 1 for group D. This makes for a smaller neural net with shorter training runs, but the results are intuitively more difficult to use, as group names will first need to be converted into a different format
Measurements must be transformed to a conceptual value between 0 and 1 for false and true. We will use the height of a person to demonstrate 3 common methods of transformation, using 155cms for short and 170cms for tall. One of the following options can be used

The use of cut offs to transform the measurement into two input values. Those with 155cms or less would be 1 0, 170cms or more 0 1, in between 0 0, and 1 1 does not exist.
Using a straight line gradient as a single value. 155cms or less = 0, 170cms or more = 1, and the rest (ht-155) / (170-155)
Conversion to Fuzzy Logic values using logistic transformation, which clusters values near the extremes and stretches the distances between the values in between, producing a bi-modal distribution of probability values between 0 and 1. In our height example, 155cms is given the probability of 0.05 and 170cms 0.95 for the transform. The logic of Fuzzy Logic is discussed in the Introduction panel of this page, and will not be further elaborated here.
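The encoding options above can be sketched in Javascript, using the 155cms and 170cms reference heights from the text. The function names are hypothetical, and the logistic version assumes a straight-line rescaling of height to the range -2.9444 to +2.9444 before the logistic transform.

```javascript
const SHORT = 155, TALL = 170;   // reference heights in cms, from the text

// Multiple groups: one input per group, e.g. group C of A to D becomes 0 0 1 0
function oneInputPerGroup(group, groups) {
  return groups.map(g => (g === group ? 1 : 0));
}

// Method 1: cut offs giving two inputs (short, tall)
function cutOff(ht) {
  return [ht <= SHORT ? 1 : 0, ht >= TALL ? 1 : 0];
}

// Method 2: straight line gradient, limited to the range 0 to 1
function straightLine(ht) {
  return Math.min(1, Math.max(0, (ht - SHORT) / (TALL - SHORT)));
}

// Method 3: logistic transformation, so that 155cms gives 0.05 and 170cms gives 0.95
function logisticValue(ht) {
  const logit95 = Math.log(0.95 / 0.05);                 // about 2.9444
  const x = logit95 * (2 * (ht - SHORT) / (TALL - SHORT) - 1);
  return 1 / (1 + Math.exp(-x));
}

console.log(oneInputPerGroup("C", ["A", "B", "C", "D"]).join(" "));
for (const ht of [150, 155, 162.5, 170, 180]) {
  console.log(ht, cutOff(ht).join(" "), straightLine(ht).toFixed(2), logisticValue(ht).toFixed(2));
}
```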

The data consists of a table, each row a case, and the columns are firstly the input values, then the output values. The columns are separated by spaces or tabs. The number of columns must be compatible with the structure of the net. In our example, there are 3 inputs and one output. This means that the data must have 4 columns for training and 3 to interpret with the trained network. The data is placed in the Data Matrix text area.
Training Schedule These are parameters that control the speed and precision of training the neural network. They do not matter much when the training data is brief, conceptually clear, and all the values are close to 0 and 1 (as in the example). They become increasingly important when the training data is large, where the values are heterogeneous, and when the same set of inputs is linked to different outcomes

The learning rate is a value between 0, where no learning occurs, and 1, where the weights in the neurones are fully corrected by the values of the error found. When the training set is simple, the same learning rate can be used throughout. When the training data is complex, with values not close to 0 and 1, and where the same inputs are related to different outcomes, there is a need to reduce the learning rate as training progresses, so that the result converges better. In most cases users should adjust these parameters by trial and error. They do not affect the final outcome of training, but govern the speed (number of iterations) required.

Maximum learning rate is the rate set for the start of the training. The value 1 can be used, but this tends to over-correct and thus prolong training. In most cases it should begin at about 0.8 (the default setting).
Minimum rate is the smallest learning rate used, and should be the same as or lower than the acceptable error
Decrement is the size of each reduction, as a proportion of the current learning rate, as the training progresses. The value is usually set between 0.1 and 0.5, although smaller values can also be used.
Number of iterations (of the training data) per decrement is the number of cycles between reductions of the learning rate. Given that most backpropagation training requires 500 to 50,000 cycles, this can be set at about 1/10th of the expected number of cycles required.

Acceptable Error is the error acceptable to end the training. Neural nets are based on Fuzzy Logic, where false (0) and true (1) are unattainable extremes, so the user has to determine how close to 0 and 1 they would accept. The default is set to 0.05, meaning that outcomes >= 0.95 are accepted as 1 and <= 0.05 are accepted as 0. In many cases, especially when the training data is complex, this level of precision is unattainable. There is also a need to avoid over-training, as the neural net will then model the trivial variations in the data. For practical purposes, a precision of 0.2 or better is considered workable, and 0.1 precise.
Maximum iteration. Training will cease when the acceptable error is attained or when the maximum number of iterations is reached. A maximum is required to stop training if the required precision is not attainable, or if the duration of training exceeds the time allowed by the browser.
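The way the learning rate falls during training can be sketched as below. This is an interpretation rather than the program's own code: it assumes that at each reduction the current rate is multiplied by (1 - decrement) and is never allowed to fall below the minimum rate. The field names are hypothetical, and apart from the default maximum rate of 0.8 the values are only illustrative.

```javascript
// Hypothetical schedule object; the field names are illustrative only.
const schedule = {
  maxRate: 0.8,                 // learning rate at the start of training
  minRate: 0.05,                // smallest learning rate used
  decrement: 0.5,               // proportion of the current rate removed at each reduction (assumed meaning)
  iterationsPerDecrement: 100   // cycles between reductions
};

// Learning rate in force at a given training cycle (cycle 0 is the first)
function learningRateAt(cycle, s) {
  const reductions = Math.floor(cycle / s.iterationsPerDecrement);
  const rate = s.maxRate * Math.pow(1 - s.decrement, reductions);
  return Math.max(s.minRate, rate);
}

for (const cycle of [0, 99, 100, 250, 400, 1000]) {
  console.log(cycle, learningRateAt(cycle, schedule));
}
```

With these illustrative settings the rate halves every 100 cycles until it reaches the minimum rate, after which it stays there.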

The neural net text area contains the neural net, a table of coefficients, each row containing the coefficients of a neurone. When the program begins, this area is blank, as no neural network exists as yet, and the program creates the neural net using random numbers.
If the neural net already exists, either because the user has pasted it into the text area, or because some training has already occurred, the coefficients are used and further modified if more training is performed
The program produces coefficients to 10 decimal places, in excess of the precision required in most cases, but allowing users to truncate them as preferred. In general, the number of decimal places should be 1 or 2 more than the precision of results required by the user. In our example, truncating the coefficients to 3 decimal places will produce the same results.
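How a table of coefficients could be created from a structure can be sketched as follows. The exact layout used by the program is not stated on this page, so the sketch assumes one row per neurone after the input layer, holding the bias followed by one weight per neurone of the preceding layer, with random starting values between -1 and 1 (an assumed range). Note that toFixed rounds rather than truncates.

```javascript
// structure: number of neurones in each layer, e.g. [3, 4, 1]
// Assumed layout: one row per neurone after the input layer, holding the
// bias value followed by one weight for each neurone in the preceding layer.
function createNet(structure) {
  const rows = [];
  for (let layer = 1; layer < structure.length; layer++) {
    for (let n = 0; n < structure[layer]; n++) {
      const row = [];
      for (let i = 0; i <= structure[layer - 1]; i++) {
        row.push(Math.random() * 2 - 1);   // random start between -1 and 1 (assumed range)
      }
      rows.push(row);
    }
  }
  return rows;
}

// Text form: one neurone per row, coefficients rounded to the chosen decimal places
function netToText(rows, decimals) {
  return rows.map(row => row.map(c => c.toFixed(decimals)).join(" ")).join("\n");
}

const net = createNet([3, 4, 1]);
console.log(netToText(net, 10));   // 10 decimal places
console.log(netToText(net, 3));    // shortened to 3 decimal places
```

Under this layout a 3 4 1 structure has 4 rows of 4 coefficients and 1 row of 5 coefficients, 21 in all.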
The Default Example
The default example is a backpropagation network with 3 layers

In the Structure box, the 3 rows are 3 for 3 inputs, 4 for 4 neurones in a single middle layer, and 1 for a single output
In the Data Matrix text area is the training data, which demonstrates a decision making algorithm based on the XOR pattern, which cannot be reproduced by a simple linear combination of the inputs.

There are 4 columns. The first 3 are the inputs, and the last the output
In the first 4 rows, where the value in the third column is 0, the network should return 0 if the values in the first two columns are both 0 or both 1, and 1 if they are different (0 1 or 1 0)
In the last 4 rows, where the value in the third column is 1, the return values are reversed.
Thus the return value depends on 2 patterns:

Are A and B both true or both false (1 1 or 0 0), or are they opposite to each other (0 1 or 1 0)
Does the third input represent true (1) or false (0)
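The rule behind the default training data can be written out directly. The short sketch below (variable names are illustrative) rebuilds the 8 rows of the default Data Matrix from the two patterns just described.

```javascript
// Exclusive OR of two 0/1 values: 1 if they differ, 0 if they are the same
function xor(a, b) {
  return a !== b ? 1 : 0;
}

// Build the 8 training rows: inputs A, B, the switch C, then the expected output
const rows = [];
for (const c of [0, 1]) {
  for (const a of [0, 1]) {
    for (const b of [0, 1]) {
      const pattern = xor(a, b);        // pattern 1: do A and B differ?
      const output = xor(pattern, c);   // pattern 2: the third input reverses the result
      rows.push([a, b, c, output]);
    }
  }
}
console.log(rows.map(r => r.join(" ")).join("\n"));
```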

Suggestions for training The following schedules are suggested to help users not familiar with the program

Make sure the neural net structure and the training data are compatible: the number of columns in the training data must be the number of inputs plus the number of outputs.
Leave the default settings for training, but initially set a low value (e.g. 1000) for maximum iterations
Click the Commence Training button to train the network
At the end of initial training, examine the neural net produced. Adjust the training parameters, and click the Commence Training button again. This will use the existing neural net and modify it further.
Repeat adjusting and re-training until the required solution is obtained, or until no further improvement is possible

Using the Trained Neural Net This page also provides a platform for using the Backpropagation network once it is trained.

Requirements The structure of the network must be stated in the Network Structure text area, and the trained network in the Neural Net Matrix text area. The two sets of numbers must be compatible, in that the number of inputs and outputs are the same
Clicking the Produce Program button will produce a Javascript function that will calculate the output from a row of inputs. This program can be copied and pasted into any web page that acts as an interpreter of the neural net, or used as a basis for writing a function in any other computer language in another application.
To Calculate Results a set of data is required in the Data Matrix text box. For interpretation, only the input values are required (the rest of the columns will be ignored). Clicking the Calculate Results button will produce a table, each row containing the input values, followed by the resulting output values.

Exporting the Trained Neural Net The trained neural net can be exported as a html program

Requirements The structure of the network must be stated in the Network Structure text area, and the trained network in the Neural Net Matrix text area. The two sets of numbers must be compatible, in that the number of inputs and outputs are the same
Clicking the Export Trained Neuralnet button will produce the source code for a complete html web page, including the input/output interface and the Javascript program representing the trained neural network.
The source code can be copied and pasted into a text editor, and saved as an html file. The html file can then be used via a web browser to interpret future data.

Javascript Program

Network Structure
3
4
1

Training schedule
Learning rate maximum
Learning rate minimum
Decrement
Number of iterations per decrement
Acceptable error
Maximum Iterations

Data Matrix
0 0 0 0
0 1 0 1
1 0 0 1
1 1 0 0
0 0 1 1
0 1 1 0
1 0 1 0
1 1 1 1


R Codes
The R code on this panel is based on the nnet library from https://www.rdocumentation.org/packages/nnet/versions/7.3-16/topics/nnet. The algorithm supports backpropagation calculations for any number of inputs and outputs, but only allows 1 middle layer with any number of neurones.
If more than 1 middle layer is required, users can go to https://cran.r-project.org/web/packages/neuralnet/index.html to download the package neuralnet and its instructions pdf file.
The following 2 examples calculate a simple backpropagation neural network
The Data: consists of a matrix of 8 rows, with 3 inputs (I1, I2, and I3) and 2 outputs (O1 and O2). After the data frame is created, the library nnet is then called.
    myDat = ("
    I1 I2 I3 O1 O2
    0  0  0  0  1
    0  1  0  1  0
    1  0  0  1  0
    1  1  0  0  1
    0  0  1  1  0
    0  1  1  0  1
    1  0  1  0  1
    1  1  1  1  0
    ")
    myDataFrame <- read.table(textConnection(myDat),header=TRUE)
    myDataFrame
    library(nnet)

Test 1
input=I1,I2,I3, 1 middle layer with 4 neurones, output = O1, tolerance=0.05, maximum iterations=200
    x<-subset(myDataFrame, select=I1:I3)   #subset x=I1,I2,I3
    x
    y<-c(myDataFrame$O1)                   #subset y=O1
    y
    nn <- nnet(x,y, size=4, abstol=0.05, maxit = 200)   # backpropagation
    summary(nn)
    predict(nn)
The subset x consists of the 3 inputs I1 to I3. y the single output O1
The function nnet is called.

x and y are the inputs and outputs
size=4 means a middle layer with 4 neurones
abstol=0.05 means processing stops when the error is smaller than 0.05
maxit=200 means processing stops after 200 iterations
The results are
    > x
      I1 I2 I3
    1  0  0  0
    2  0  1  0
    3  1  0  0
    4  1  1  0
    5  0  0  1
    6  0  1  1
    7  1  0  1
    8  1  1  1
    > y<-c(myDataFrame$O1)   #subset y=O1
    > y
    [1] 0 1 1 0 1 0 0 1
    > nn <- nnet(x,y, size=4, abstol=0.05, maxit = 200)
    # weights:  21
    initial  value 2.050771
    iter  10 value 1.999513
    iter  20 value 1.913413
    iter  30 value 1.597108
    iter  40 value 0.216418
    final  value 0.033959
    converged
    > summary(nn)
    a 3-4-1 network with 21 weights
    options were -
     b->h1 i1->h1 i2->h1 i3->h1
      2.16  -5.72  -0.09   1.12
     b->h2 i1->h2 i2->h2 i3->h2
      5.10  -1.99  -4.27  -4.27
     b->h3 i1->h3 i2->h3 i3->h3
      2.64  -1.68   1.87   5.01
     b->h4 i1->h4 i2->h4 i3->h4
     -4.91  -9.70  10.28   9.01
      b->o  h1->o  h2->o  h3->o  h4->o
     -3.18  -6.82   7.57  -0.95   7.93
    > predict(nn)
            [,1]
    1 0.06983661
    2 0.95256979
    3 0.96103373
    4 0.08941800
    5 0.91531074
    6 0.07498991
    7 0.05624830
    8 0.96314380

x and y are shown.
Summary presents the values of weights in each neurone
Predict shows the calculated values of each O1 as calculated by the trained neural net using I1 to I3
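To show how the weights in the summary produce the predicted values, the Javascript sketch below recalculates O1 for the first two rows of data from the Test 1 coefficients. It assumes that both the middle and output neurones use the logistic transform, which is the default behaviour of nnet. Because the summary shows the weights rounded to 2 decimal places, the results agree with predict(nn) only approximately.

```javascript
function logistic(y) { return 1 / (1 + Math.exp(-y)); }

// Coefficients copied from the summary(nn) output of Test 1: bias first, then i1, i2, i3
const hidden = [
  [ 2.16, -5.72, -0.09, 1.12],   // h1
  [ 5.10, -1.99, -4.27, -4.27],  // h2
  [ 2.64, -1.68,  1.87, 5.01],   // h3
  [-4.91, -9.70, 10.28, 9.01]    // h4
];
const outputNeurone = [-3.18, -6.82, 7.57, -0.95, 7.93];   // bias, then h1 to h4

// One neurone: bias plus weighted sum of the inputs, then logistic transform
function neurone(coef, v) {
  return logistic(v.reduce((y, x, i) => y + coef[i + 1] * x, coef[0]));
}

function predictO1(inputs) {
  const h = hidden.map(coef => neurone(coef, inputs));
  return neurone(outputNeurone, h);
}

console.log(predictO1([0, 0, 0]));   // compare with row 1 of predict(nn)
console.log(predictO1([0, 1, 0]));   // compare with row 2 of predict(nn)
```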
After training, a new set of data is created and tested with the trained network
    newDat = ("
    I1  I2  I3
    0.1 0.2 0.05
    0.3 0.9 0.0
    0.8 0.1 0.1
    ")
    newData<-read.table(textConnection(newDat),header=TRUE)
    predict(nn, newData)
The results are
    > predict(nn, newData)
              [,1]
    [1,] 0.1086763
    [2,] 0.9636761
    [3,] 0.9212004

Test 2
This is the same as test 1, except that there are now 2 outputs O1 and O2.
The program is
    x<-subset(myDataFrame, select=I1:I3)   #subset x=I1,I2,I3
    x
    y<-subset(myDataFrame, select=O1:O2)   #subset y=O1,O2
    y
    nn <- nnet(x,y, size=4, abstol=0.05, maxit=200)
    summary(nn)
    predict(nn)
The results are
    > x
      I1 I2 I3
    1  0  0  0
    2  0  1  0
    3  1  0  0
    4  1  1  0
    5  0  0  1
    6  0  1  1
    7  1  0  1
    8  1  1  1
    > y<-subset(myDataFrame, select=O1:O2)   #subset y=O1,O2
    > y
      O1 O2
    1  0  1
    2  1  0
    3  1  0
    4  0  1
    5  1  0
    6  0  1
    7  0  1
    8  1  0
    > nn <- nnet(x,y, size=4, abstol=0.05, maxit=200)
    # weights:  26
    initial  value 4.071980
    iter  10 value 3.999929
    iter  20 value 3.990630
    iter  30 value 1.902310
    final  value 0.037599
    converged
    > summary(nn)
    a 3-4-2 network with 26 weights
    options were -
     b->h1 i1->h1 i2->h1 i3->h1
     -1.96  -2.40   0.04   0.15
     b->h2 i1->h2 i2->h2 i3->h2
     12.68  65.74 -40.83 -65.08
     b->h3 i1->h3 i2->h3 i3->h3
      3.57   3.40  -0.89  -2.10
     b->h4 i1->h4 i2->h4 i3->h4
     -0.58  -1.89   4.56   2.13
     b->o1 h1->o1 h2->o1 h3->o1 h4->o1
     -0.32  -7.82 -16.71  20.34 -15.00
     b->o2 h1->o2 h2->o2 h3->o2 h4->o2
      1.63   7.85  17.30 -22.59  15.04
    > predict(nn)
                O1          O2
    1 2.689003e-02 0.965504577
    2 9.521576e-01 0.023248336
    3 8.834675e-01 0.086618520
    4 3.778867e-05 0.999949062
    5 9.400405e-01 0.038104978
    6 3.346196e-02 0.963676744
    7 3.884451e-02 0.948116245
    8 9.914175e-01 0.003671725
There are now two outputs O1 and O2.
As with Test 1, the trained neural net is tested on a new set of data, and produces two outputs
    newDat = ("
    I1  I2  I3
    0.1 0.2 0.05
    0.3 0.9 0.0
    0.8 0.1 0.1
    ")
    newData<-read.table(textConnection(newDat),header=TRUE)
    predict(nn, newData)
The results are
    > predict(nn, newData)
                  O1          O2
    [1,] 0.001618309 0.997914331
    [2,] 0.989863726 0.004413234
    [3,] 0.546131182 0.375853575