Back-Propagation is very simple. Who made it complicated?

Learning Outcome: You will be able to build your own neural network on paper.

Prakash Jay
Apr 20, 2017 · 9 min read
[Figure: Feed Forward Neural Network]

Almost 6 months back, when I first wanted to try my hand at neural networks, I scratched my head for a long time over how back-propagation works. When I talk to peers around my circle, I see a lot of people facing this problem. Most people treat it as a black box and use libraries like Keras, TensorFlow and PyTorch, which provide automatic differentiation. Though it is not necessary to write your own code to compute gradients and backprop errors, understanding it helps you grasp concepts like vanishing gradients, saturation of neurons and the reasons for random initialization of weights.

More on why it is important to understand:

Andrej Karpathy wrote a blog post on it, and I found it useful.

Approach

  • Build a small neural network as defined in the architecture below.
  • Initialize the weights and biases randomly.
  • Fix the input and output.
  • Forward-pass the inputs and calculate the cost.
  • Compute the gradients and errors.
  • Backpropagate and adjust the weights and biases accordingly.

Architecture:

  • Build a feed-forward neural network with 2 hidden layers. All the layers have 3 neurons each.
  • The 1st and 2nd hidden layers use ReLU and sigmoid respectively as activation functions. The final layer uses softmax.
  • The error is calculated using cross-entropy.

Initializing the network

I have taken the inputs, weights and biases randomly.

[Figure: Initializing the network]
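Here is a minimal NumPy sketch of this setup. The exact numbers used in the figures are not reproduced here; the inputs, weights and biases below are just random placeholders with the same shapes as the 3-3-3-3 architecture described above.

```python
import numpy as np

np.random.seed(0)  # placeholder values only; the post uses its own hand-picked numbers

# One training example with 3 features, and the one-hot target used later in the post
x = np.random.rand(3)            # input vector, shape (3,)
y = np.array([1.0, 0.0, 0.0])    # target output

# Weight matrices and biases for the 3-3-3-3 network
W_ij = np.random.rand(3, 3)      # input layer    -> hidden layer 1
W_jk = np.random.rand(3, 3)      # hidden layer 1 -> hidden layer 2
W_kl = np.random.rand(3, 3)      # hidden layer 2 -> output layer
b1, b2, b3 = np.random.rand(3), np.random.rand(3), np.random.rand(3)
```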

Layer-1

Matrix Operation:

[Figure: Layer-1 Matrix Operation]

ReLU operation:

[Figure: Layer-1 ReLU Operation]

Example:

[Figure: Layer-1 Example]
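In code, continuing the NumPy sketch above (placeholder values, not the numbers from the figures), layer-1 is just a matrix multiplication followed by an element-wise ReLU:

```python
def relu(z):
    # ReLU: element-wise max(0, z)
    return np.maximum(0.0, z)

z1 = x @ W_ij + b1   # layer-1 matrix operation
h1 = relu(z1)        # layer-1 ReLU operation
```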

Layer-2

Matrix operation:

[Figure: Layer-2 Matrix Operation]

Sigmoid operation:

[Figure: Sigmoid Operation]

Example:

[Figure: Layer-2 Example]
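Continuing the same sketch, layer-2 is another matrix multiplication followed by an element-wise sigmoid:

```python
def sigmoid(z):
    # Sigmoid: 1 / (1 + e^(-z)), element-wise
    return 1.0 / (1.0 + np.exp(-z))

z2 = h1 @ W_jk + b2  # layer-2 matrix operation
h2 = sigmoid(z2)     # layer-2 sigmoid operation
```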

Layer-3

Matrix operation:

[Figure: Layer-3 Matrix Operation]

Softmax operation:

[Figure: Softmax formula]

Example:

[Figure: Layer-3 Output Example]

Edit 1: As Jmuth pointed out in the comments, the output from softmax should be [0.19858, 0.28559, 0.51583] instead of [0.26980, 0.32235, 0.40784]. I had done a/sum(a), while the correct computation is exp(a)/sum(exp(a)). Please adjust your calculations from here on using these values. Thank you.
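A small sketch of that correction, again using the placeholder NumPy variables rather than the numbers in the figures: softmax exponentiates before normalizing, which is exactly the difference between a/sum(a) and exp(a)/sum(exp(a)).

```python
def softmax(z):
    # Correct softmax: exp(z) / sum(exp(z)).
    # Subtracting max(z) first is only for numerical stability; it does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z3 = h2 @ W_kl + b3       # layer-3 matrix operation
y_hat = softmax(z3)       # correct: exp(a) / sum(exp(a))
wrong = z3 / z3.sum()     # the mistake in the original figures: a / sum(a)
```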

Analysis:

  • The actual output should be [1.0, 0.0, 0.0], but we got [0.2698, 0.3223, 0.4078].
  • To calculate the error, let us use cross-entropy.

Error:

Cross-Entropy:

[Figure: Cross-Entropy Formula]

Example:

[Figure: Cross-Entropy calculation]
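As a sketch in the same running NumPy code, the cross-entropy error for one example is the negative sum of target times log-prediction; with a one-hot target it reduces to minus the log of the probability assigned to the correct class.

```python
def cross_entropy(y_true, y_pred):
    # E = -sum_i y_i * log(p_i); for a one-hot target this is -log(p_correct)
    return -np.sum(y_true * np.log(y_pred))

error = cross_entropy(y, y_hat)
```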

We are done with the forward pass. Now let us look at the backward pass.

Important Derivatives:

Sigmoid:

[Figure: Derivative of Sigmoid]

ReLU:

[Figure: Derivative of ReLU]

Softmax:

[Figure: Derivative of Softmax]
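These three derivatives, written as small helper functions in the same NumPy sketch (d_sigmoid takes the sigmoid output, d_relu takes the pre-activation input, and d_softmax returns the full Jacobian given the softmax output):

```python
def d_sigmoid(a):
    # derivative of sigmoid expressed through its output a: a * (1 - a)
    return a * (1.0 - a)

def d_relu(z):
    # derivative of ReLU: 1 where the input is positive, 0 elsewhere
    return (z > 0).astype(float)

def d_softmax(p):
    # Jacobian of softmax given its output p:
    # diagonal entries p_i * (1 - p_i), off-diagonal entries -p_i * p_j
    return np.diag(p) - np.outer(p, p)
```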

BackPropagating the error — (Hidden Layer2 — Output Layer) Weights

Let us calculate a few derivatives upfront so they become handy and we can reuse them whenever necessary. Here we are using only one example (batch_size=1); if there are more examples, we just need to average everything.

[Figure: Example: Derivative of Cross-Entropy]

By symmetry we can calculate the other derivatives as well.

[Figure: Matrix of cross-entropy derivatives wrt output]

In our example,

[Figure: Values of derivative of cross-entropy wrt output]

Next let us calculate the derivative of each output with respect to their input.

[Figure: Example: Derivative of softmax wrt output layer input]

By symmetry we can calculate the other derivatives as well.

[Figure: Matrix of derivative of softmax wrt output layer input]

In our example,

[Figure: Values of derivative of softmax wrt output layer input]

For each input to a neuron, let us calculate the derivative with respect to each weight. Now let us look at the final derivative.

[Figure: Example: Derivative of input to output layer wrt weight]

By symmetry we can calculate the other derivatives as well.

[Figure: Values of derivative of input to output layer wrt weights]

Finally, let us calculate the change in

[Figure: Weight from k1 to l1 neuron]

which will simply be

[Figure: Derivative of error wrt weight]

Using Chain Rule:

[Figure: Chain rule breakdown of Error derivative]

By symmetry:

All of the above values have been calculated before; we just need to substitute the results.

Considering a learning rate of 0.01, we get our final weight matrix as:

[Figure: Modified weights of kl neurons after backprop]
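The same chain rule in code, continuing the NumPy sketch (placeholder values, so the numbers will not match the hand calculation): multiply the derivative of the error wrt the softmax output by the softmax Jacobian, then take the outer product with the layer-2 outputs. For softmax followed by cross-entropy, the first two factors collapse to the well-known y_hat - y.

```python
# dE/dW_kl = (dE/d y_hat) * (d y_hat/d z3) * (d z3/d W_kl), matching the steps above
dE_dyhat = -y / y_hat                  # derivative of cross-entropy wrt softmax output
dE_dz3 = d_softmax(y_hat) @ dE_dyhat   # through the softmax Jacobian; equals y_hat - y here
grad_W_kl = np.outer(h2, dE_dz3)       # d z3/d W_kl brings in the hidden layer 2 outputs
grad_b3 = dE_dz3

lr = 0.01
W_kl_new = W_kl - lr * grad_W_kl       # weight update with learning rate 0.01
```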

So, we have calculated the new weight matrix for W_{kl}. Now let us move to the next layer.

BackPropagating the error — (Hidden Layer1 — Hidden Layer 2) Weights

Let us calculate a few handy derivatives before we actually calculate the error derivatives wrt weights in this layer.

[Figure: Example: Derivative of sigmoid output wrt layer 2 input]

In our example:

[Figure: Values of derivative of output of layer-2 wrt input of layer-2]

For each input to a neuron, let us calculate the derivative with respect to each weight. Now let us look at the final derivative.

[Figure: Derivative of layer 2 input wrt weight]

By symmetry we can calculate:

[Figure: Values of derivative of layer 2 input wrt weight]

Now we will calculate the derivative of

[Figure: weight from j3 to k1]

which will simply be:

[Figure: Derivative of Error wrt weight j3-k1]

Using chain rule,

[Figure: chain rule of derivative of error wrt weight]

By symmetry we get the final matrix.

We have already calculated the 2nd and 3rd terms in each cell. We still need to work out the 1st term. If we look at the matrix, the first term is common across all the columns, so there are only three values. Let us look into one of them.

[Figure: Breakdown of error]

Let's see what each individual term boils down to.

[Figure: breakdown of each error derivative]

By symmetry we get the final matrix as:

[Figure: Derivative of error wrt output of hidden layer 2]

Again, the first two values were already calculated when dealing with the derivatives of W_{kl}. We just need to calculate the third one, which is the derivative of the input to each output-layer neuron wrt the output of hidden layer-2. It is nothing but the corresponding weight connecting the two layers.

[Figure: Derivative of input of output layer wrt hidden layer-2]
[Figure: Final Matrix of derivative of total error wrt output of hidden layer-2]

All values have been calculated before; we just need to plug in the corresponding values for our example.

[Figure: calculations using an example]

Let us look at the final matrix

In our example,

[Figure: Calculations from our examples]

Considering a learning rate (lr) of 0.01, we get our final weight matrix as:

[Figure: Final modified matrix of W_{jk}]
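In code (continuing the sketch, with placeholder values): first backpropagate the error to the output of hidden layer 2 by summing over all output neurons through W_kl, then go through the sigmoid derivative, and finally take the outer product with the layer-1 outputs.

```python
dE_dh2 = W_kl @ dE_dz3            # derivative of the error wrt the output of hidden layer 2
dE_dz2 = dE_dh2 * d_sigmoid(h2)   # through the sigmoid derivative
grad_W_jk = np.outer(h1, dE_dz2)  # d z2/d W_jk brings in the hidden layer 1 outputs
grad_b2 = dE_dz2

W_jk_new = W_jk - lr * grad_W_jk  # same learning rate of 0.01
```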

So, we have calculated the new weight matrix for W_{jk}. Now let us move to the next layer.

BackPropagating the error — (Input Layer — Hidden Layer 1) Weights.

Edit 1: The following calculations from here on are wrong. I took only w_{j1k1} and ignored w_{j1k2} and w_{j1k3}. This was pointed out by a user in the comments. I would like someone to edit the Jupyter notebook attached at the end. Please refer to some other implementations if you still do not understand back-prop from here.

Let us calculate a few handy derivatives before we actually calculate the error derivatives wrt weights in this layer.

[Figure: Derivative of hidden layer 1 output wrt its input]

We already know the derivative of ReLU (we saw it earlier in the post). Since all the inputs are positive, we will get an output of 1.

[Figure: Calculations]

For each input to a neuron, let us calculate the derivative with respect to each weight. Now let us look at the final derivative.

[Figure: Derivative of input to hidden layer wrt weights]

By symmetry we can write,

[Figure: Final derivative calculations]

Now we will calculate the change in

[Figure: Weight connecting i2 neuron to j1]

and generalize it to all the variables. This will simply be:

[Figure: Derivative of error wrt weight]

Using chain rule,

[Figure: chain rule for calculating error]

By symmetry,

We know the 2nd and 3rd derivatives in each cell of the above matrix. Let us look at how to get the derivative of the 1st term in each cell.

[Figure: Watching for the first term]

We have calculated all the values previously except the last one in each cell, which is a simple derivative of linear terms.

[Figure: calculations in our example]

In our example,

[Figure: calculations in our example]

Considering a learning rate (lr) of 0.01, we get our final weight matrix as:

[Figure: using learning rate we get final matrix]
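As noted in the edit above, the hand calculation for this layer only followed the path through w_{j1k1}. Here is a sketch of the corrected computation, using the same placeholder NumPy variables as before: the error reaching each hidden-layer-1 neuron has to be summed over all the layer-2 neurons it feeds into, which is exactly what the matrix product with W_jk does.

```python
dE_dh1 = W_jk @ dE_dz2            # sum over ALL paths into hidden layer 2, not just one weight
dE_dz1 = dE_dh1 * d_relu(z1)      # through the ReLU derivative (1 where the input was positive)
grad_W_ij = np.outer(x, dE_dz1)   # d z1/d W_ij brings in the inputs
grad_b1 = dE_dz1

W_ij_new = W_ij - lr * grad_W_ij  # same learning rate of 0.01
```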

The End of Calculations

Our Initial Weights:

[Figure: initial weights]

Our Final Weights:

[Figure: final weights]

Important Notes:

  • I have completely eliminated the bias terms when differentiating. Do you know why?
  • Backprop for the bias should be straightforward. Try it on your own.
  • I have taken only one example. What will happen if we take a batch of examples?
  • Though I have not directly mentioned vanishing gradients, do you see why they occur?
  • What would happen if all the weights were the same number instead of random?

References Used:

Code available on GitHub:

I have hand-calculated everything. Let me know your feedback. If you like it, please recommend and share it. Thank you.

Written by Prakash Jay, Senior Data Scientist @FractalAI.