What is Gradient Descent?

Gradient descent is an iterative algorithm for finding a local minimum of a differentiable function.
This blog post will be a quick explanation of how it works and its application in deep learning.

In deep learning, we have something called a loss function, which measures how far off the model's predictions are from their true values.

Common loss functions include Mean Squared Error (MSE), Mean Absolute Error (MAE) and Cross-Entropy.
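As a quick illustration, here is a minimal sketch of Mean Squared Error in plain Python (the helper name mse and the sample numbers are just for illustration, not from the original post):

def mse(predictions, targets):
    # Mean Squared Error: average of the squared differences
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# toy example: three predictions against their true values
print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # 0.1666...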

In order for the predictions to be more accurate, we have to minimize the loss function through gradient descent.


Gradient Descent on a Single-Variable Function

Let’s start off with something simple to get an intuition of how Gradient Descent works.

We have a function f(x), and we will use gradient descent to find a local minimum of f(x). Its derivative f'(x) is also shown below for reference.

f(x) = x² − 8x + 4
f'(x) = 2x − 8

The gradient descent algorithm is given by this update equation:

x_new = x_old − n·f'(x_old)

In this equation, the new value of x is given by the current value of x minus n (the learning rate) multiplied by the gradient at the current value of x. The first value of x is usually a random guess within the range of x. n can be thought of as a "step size" and is an arbitrary positive constant.

If a function is concave upwards and the initial x value lies to the right of the minimum point, the derivative there is positive.

This means that each subsequent x value will be smaller than the previous one, and the iterates converge towards the minimum point of the function. (On the left of the minimum, the derivative is negative and the update pushes x to the right instead.)

I have added an animation to help you visualize this better.
The algorithm is run for 30 iterations with n = 0.2.
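The original code is not shown in this post, but a minimal Python sketch of the loop might look like this (the function name gradient_descent_1d and the print formatting are illustrative):

def f(x):
    return x**2 - 8*x + 4

def df(x):
    return 2*x - 8

def gradient_descent_1d(x, n=0.2, iterations=30):
    # repeatedly step against the gradient: x <- x - n * f'(x)
    for _ in range(iterations):
        print(f"X_value: {round(x, 6)}, Y_value: {round(f(x), 6)}, dy/dx= {round(df(x), 6)}")
        x = x - n * df(x)
    return x

gradient_descent_1d(x=28.52)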

Here's the output of the algorithm.
Notice that in the first few iterations it converges very quickly.
However, it converges more and more slowly as the iterations increase, because the gradient shrinks as x approaches the minimum.

X_value: 28.52, Y_value: 589.2304, dy/dx= 49.04
X_value: 18.712, Y_value: 204.442944, dy/dx= 29.424
X_value: 12.8272, Y_value: 65.91946, dy/dx= 17.6544
X_value: 9.29632, Y_value: 16.051006, dy/dx= 10.59264
X_value: 7.177792, Y_value: -1.901638, dy/dx= 6.355584
X_value: 5.906675, Y_value: -8.36459, dy/dx= 3.81335
X_value: 5.144005, Y_value: -10.691252, dy/dx= 2.28801
X_value: 4.686403, Y_value: -11.528851, dy/dx= 1.372806
X_value: 4.411842, Y_value: -11.830386, dy/dx= 0.823684
X_value: 4.247105, Y_value: -11.938939, dy/dx= 0.49421
X_value: 4.148263, Y_value: -11.978018, dy/dx= 0.296526
X_value: 4.088958, Y_value: -11.992087, dy/dx= 0.177916
X_value: 4.053375, Y_value: -11.997151, dy/dx= 0.106749
X_value: 4.032025, Y_value: -11.998974, dy/dx= 0.06405
X_value: 4.019215, Y_value: -11.999631, dy/dx= 0.03843
X_value: 4.011529, Y_value: -11.999867, dy/dx= 0.023058
X_value: 4.006917, Y_value: -11.999952, dy/dx= 0.013835
X_value: 4.00415, Y_value: -11.999983, dy/dx= 0.008301
X_value: 4.00249, Y_value: -11.999994, dy/dx= 0.004981
X_value: 4.001494, Y_value: -11.999998, dy/dx= 0.002988
X_value: 4.000896, Y_value: -11.999999, dy/dx= 0.001793
X_value: 4.000538, Y_value: -12.0, dy/dx= 0.001076
X_value: 4.000323, Y_value: -12.0, dy/dx= 0.000645
X_value: 4.000194, Y_value: -12.0, dy/dx= 0.000387
X_value: 4.000116, Y_value: -12.0, dy/dx= 0.000232
X_value: 4.00007, Y_value: -12.0, dy/dx= 0.000139
X_value: 4.000042, Y_value: -12.0, dy/dx= 8.4e-05
X_value: 4.000025, Y_value: -12.0, dy/dx= 5e-05
X_value: 4.000015, Y_value: -12.0, dy/dx= 3e-05
X_value: 4.000009, Y_value: -12.0, dy/dx= 1.8e-05
X_value: 4.000005, Y_value: -12.0, dy/dx= 1.1e-05

You can think of the gradient descent algorithm as rolling a ball down the curve.
The algorithm ensures that the "ball" ends up at a local minimum.


Common Challenges

The algorithm might not give the global minimum value

In the example above, f(x) was a quadratic function and therefore has only one minimum point.

However, as functions get more complex,
there might be multiple minimum points.

Here's an example of a function g(x) with two minimum points.

g(x) = 0.001x⁴ − 0.03x³ + 0.1x² + x + 2
g'(x) = 0.004x³ − 0.09x² + 0.2x + 1

From the graph above, we can clearly see that the global minimum occurs at roughly x = 20.

However, when we run the same algorithm on g(x) for 30 iterations with n = 0.2, starting near x = −11, it converges to the local minimum point at roughly x = −2.3 instead.

X_value: -11.26, Y_value: 62.322707, dy/dx= -18.373398
X_value: -7.58532, Y_value: 16.572022, dy/dx= -7.441151
X_value: -6.09709, Y_value: 7.801995, dy/dx= -4.471749
X_value: -5.20274, Y_value: 4.461727, dy/dx= -3.040036
X_value: -4.594733, Y_value: 2.872184, dy/dx= -2.206997
X_value: -4.153334, Y_value: 2.018626, dy/dx= -1.669766
X_value: -3.819381, Y_value: 1.523662, dy/dx= -1.29963
X_value: -3.559455, Y_value: 1.220957, dy/dx= -1.032555
X_value: -3.352944, Y_value: 1.028504, dy/dx= -0.833168
X_value: -3.18631, Y_value: 0.902499, dy/dx= -0.680391
X_value: -3.050232, Y_value: 0.818095, dy/dx= -0.560915
X_value: -2.938049, Y_value: 0.760527, dy/dx= -0.465948
X_value: -2.844859, Y_value: 0.720686, dy/dx= -0.389458
X_value: -2.766968, Y_value: 0.692786, dy/dx= -0.32718
X_value: -2.701532, Y_value: 0.673056, dy/dx= -0.276017
X_value: -2.646328, Y_value: 0.658991, dy/dx= -0.23367
X_value: -2.599594, Y_value: 0.648897, dy/dx= -0.1984
X_value: -2.559914, Y_value: 0.641612, dy/dx= -0.168869
X_value: -2.52614, Y_value: 0.636328, dy/dx= -0.144034
X_value: -2.497334, Y_value: 0.632482, dy/dx= -0.123068
X_value: -2.47272, Y_value: 0.629671, dy/dx= -0.105311
X_value: -2.451658, Y_value: 0.627612, dy/dx= -0.090232
X_value: -2.433611, Y_value: 0.6261, dy/dx= -0.077396
X_value: -2.418132, Y_value: 0.624987, dy/dx= -0.066448
X_value: -2.404843, Y_value: 0.624166, dy/dx= -0.057094
X_value: -2.393424, Y_value: 0.62356, dy/dx= -0.04909
X_value: -2.383606, Y_value: 0.623111, dy/dx= -0.042234
X_value: -2.375159, Y_value: 0.622779, dy/dx= -0.036353
X_value: -2.367888, Y_value: 0.622534, dy/dx= -0.031304
X_value: -2.361628, Y_value: 0.622351, dy/dx= -0.026967
X_value: -2.356234, Y_value: 0.622216, dy/dx= -0.023238

This means that the gradient descent algorithm does not necessarily
converge to the global minimum point and thus it has limitations in
minimizing the loss function.

One way to get around this is to repeat the algorithm several times with different starting points and keep the best result. For example, here is the output of the same algorithm started near x = 11: this time it converges to the global minimum at roughly x = 19.2. (A small multi-start sketch follows the trace below.)

X_value: 11.33, Y_value: -0.987125, dy/dx= -2.469522
X_value: 11.823904, Y_value: -2.241394, dy/dx= -2.605493
X_value: 12.345003, Y_value: -3.630622, dy/dx= -2.721449
X_value: 12.889293, Y_value: -5.137301, dy/dx= -2.808797
X_value: 13.451052, Y_value: -6.731203, dy/dx= -2.858723
X_value: 14.022797, Y_value: -8.369282, dy/dx= -2.86323
X_value: 14.595443, Y_value: -9.998061, dy/dx= -2.816446
X_value: 15.158732, Y_value: -11.558874, dy/dx= -2.71597
X_value: 15.701926, Y_value: -12.995409, dy/dx= -2.563888
X_value: 16.214704, Y_value: -14.261939, dy/dx= -2.367094
X_value: 16.688123, Y_value: -15.329918, dy/dx= -2.136654
X_value: 17.115454, Y_value: -16.190947, dy/dx= -1.886278
X_value: 17.492709, Y_value: -16.855321, dy/dx= -1.630279
X_value: 17.818765, Y_value: -17.347041, dy/dx= -1.381572
X_value: 18.09508, Y_value: -17.697269, dy/dx= -1.15023
X_value: 18.325126, Y_value: -17.938281, dy/dx= -0.942837
X_value: 18.513693, Y_value: -18.099222, dy/dx= -0.762597
X_value: 18.666212, Y_value: -18.20397, dy/dx= -0.609945
X_value: 18.788202, Y_value: -18.270697, dy/dx= -0.483367
X_value: 18.884875, Y_value: -18.312459, dy/dx= -0.380196
X_value: 18.960914, Y_value: -18.338226, dy/dx= -0.297253
X_value: 19.020365, Y_value: -18.353942, dy/dx= -0.231297
X_value: 19.066624, Y_value: -18.363441, dy/dx= -0.1793
X_value: 19.102484, Y_value: -18.369142, dy/dx= -0.138584
X_value: 19.130201, Y_value: -18.372544, dy/dx= -0.106868
X_value: 19.151575, Y_value: -18.374565, dy/dx= -0.082265
X_value: 19.168028, Y_value: -18.375762, dy/dx= -0.063238
X_value: 19.180675, Y_value: -18.376469, dy/dx= -0.048561
X_value: 19.190387, Y_value: -18.376886, dy/dx= -0.03726
X_value: 19.197839, Y_value: -18.377131, dy/dx= -0.02857
X_value: 19.203553, Y_value: -18.377276, dy/dx= -0.021897
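A minimal sketch of this multi-start idea on g(x) could look like the following (the number of restarts, the sampling range and the helper name descend are assumptions for illustration):

import random

def g(x):
    return 0.001*x**4 - 0.03*x**3 + 0.1*x**2 + x + 2

def dg(x):
    return 0.004*x**3 - 0.09*x**2 + 0.2*x + 1

def descend(x, grad, n=0.2, iterations=30):
    # plain gradient descent from a single starting point
    for _ in range(iterations):
        x = x - n * grad(x)
    return x

# run gradient descent from several random starting points
# and keep the result with the lowest value of g(x)
starts = [random.uniform(-15, 25) for _ in range(5)]
best_x = min((descend(x0, dg) for x0 in starts), key=g)
print(best_x, g(best_x))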

Too small/large learning rates

When deciding the learning rate, we must ensure that it is not too small, as a very small learning rate results in slow convergence towards the minimum point.

Here, the algorithm is run for 30 iterations on f(x) with n = 0.01. Even after 30 iterations, x is still far from the minimum at x = 4.

X_value: 28.7, Y_value: 598.09, dy/dx= 49.4
X_value: 28.206, Y_value: 573.930436, dy/dx= 48.412
X_value: 27.72188, Y_value: 550.727591, dy/dx= 47.44376
X_value: 27.247442, Y_value: 528.443578, dy/dx= 46.494885
X_value: 26.782494, Y_value: 507.042012, dy/dx= 45.564987
X_value: 26.326844, Y_value: 486.487949, dy/dx= 44.653687
X_value: 25.880307, Y_value: 466.747826, dy/dx= 43.760614
X_value: 25.442701, Y_value: 447.789412, dy/dx= 42.885401
X_value: 25.013847, Y_value: 429.581751, dy/dx= 42.027693
X_value: 24.59357, Y_value: 412.095114, dy/dx= 41.187139
X_value: 24.181698, Y_value: 395.300947, dy/dx= 40.363397
X_value: 23.778064, Y_value: 379.17183, dy/dx= 39.556129
X_value: 23.382503, Y_value: 363.681426, dy/dx= 38.765006
X_value: 22.994853, Y_value: 348.804441, dy/dx= 37.989706
X_value: 22.614956, Y_value: 334.516585, dy/dx= 37.229912
X_value: 22.242657, Y_value: 320.794528, dy/dx= 36.485314
X_value: 21.877804, Y_value: 307.615865, dy/dx= 35.755607
X_value: 21.520248, Y_value: 294.959077, dy/dx= 35.040495
X_value: 21.169843, Y_value: 282.803497, dy/dx= 34.339685
X_value: 20.826446, Y_value: 271.129279, dy/dx= 33.652892
X_value: 20.489917, Y_value: 259.917359, dy/dx= 32.979834
X_value: 20.160119, Y_value: 249.149432, dy/dx= 32.320237
X_value: 19.836916, Y_value: 238.807915, dy/dx= 31.673832
X_value: 19.520178, Y_value: 228.875921, dy/dx= 31.040356
X_value: 19.209774, Y_value: 219.337235, dy/dx= 30.419549
X_value: 18.905579, Y_value: 210.17628, dy/dx= 29.811158
X_value: 18.607467, Y_value: 201.378099, dy/dx= 29.214934
X_value: 18.315318, Y_value: 192.928327, dy/dx= 28.630636
X_value: 18.029012, Y_value: 184.813165, dy/dx= 28.058023
X_value: 17.748431, Y_value: 177.019364, dy/dx= 27.496863
X_value: 17.473463, Y_value: 169.534197, dy/dx= 26.946925

However, if the learning rate is too large, the x value overshoots and diverges, and the algorithm fails.

Here, the algorithm is run for 30 iterations on f(x) with n = 1.06. Notice how x oscillates around the minimum with ever-growing magnitude.

X_value: 2.1, Y_value: -8.39, dy/dx= -3.8
X_value: 6.128, Y_value: -7.471616, dy/dx= 4.256
X_value: 1.61664, Y_value: -6.319595, dy/dx= -4.76672
X_value: 6.669363, Y_value: -4.8745, dy/dx= 5.338726
X_value: 1.010313, Y_value: -3.061773, dy/dx= -5.979374
X_value: 7.348449, Y_value: -0.787888, dy/dx= 6.696898
X_value: 0.249737, Y_value: 2.064473, dy/dx= -7.500526
X_value: 8.200295, Y_value: 5.642475, dy/dx= 8.400589
X_value: -0.70433, Y_value: 10.130721, dy/dx= -9.40866
X_value: 9.26885, Y_value: 15.760777, dy/dx= 10.537699
X_value: -1.901112, Y_value: 22.823118, dy/dx= -11.802223
X_value: 10.609245, Y_value: 31.682119, dy/dx= 13.21849
X_value: -3.402354, Y_value: 42.79485, dy/dx= -14.804709
X_value: 12.290637, Y_value: 56.73466, dy/dx= 16.581274
X_value: -5.285513, Y_value: 74.220758, dy/dx= -18.571027
X_value: 14.399775, Y_value: 96.155319, dy/dx= 20.79955
X_value: -7.647748, Y_value: 123.670032, dy/dx= -23.295496
X_value: 17.045478, Y_value: 158.184488, dy/dx= 26.090955
X_value: -10.610935, Y_value: 201.479422, dy/dx= -29.22187
X_value: 20.364247, Y_value: 255.788587, dy/dx= 32.728494
X_value: -14.327957, Y_value: 323.914003, dy/dx= -36.655914
X_value: 24.527312, Y_value: 409.370526, dy/dx= 41.054623
X_value: -18.990589, Y_value: 516.567187, dy/dx= -45.981178
X_value: 29.74946, Y_value: 651.03468, dy/dx= 51.49892
X_value: -24.839395, Y_value: 819.710703, dy/dx= -57.67879
X_value: 36.300122, Y_value: 1031.297905, dy/dx= 64.600245
X_value: -32.176137, Y_value: 1296.712892, dy/dx= -72.352274
X_value: 44.517274, Y_value: 1629.649452, dy/dx= 81.034547
X_value: -41.379346, Y_value: 2047.285073, dy/dx= -90.758693
X_value: 54.824868, Y_value: 2571.167195, dy/dx= 101.649736
X_value: -52.923852, Y_value: 3228.32493, dy/dx= -113.847704
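If you want to experiment with this yourself, here is a small sketch (not from the original post) that sweeps a few learning rates with the same loop and compares where x ends up; the helper name final_x and the exact values are illustrative:

def f(x):
    return x**2 - 8*x + 4

def df(x):
    return 2*x - 8

def final_x(x, n, iterations=30):
    # run 30 gradient descent steps and return where x ends up
    for _ in range(iterations):
        x = x - n * df(x)
    return x

# compare a too-small, a reasonable, and a too-large learning rate
for n in [0.01, 0.2, 1.06]:
    x = final_x(28.52, n)
    print(f"n={n}: x = {x:.4f}, f(x) = {f(x):.4f}")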

Gradient Descent on a Two-Variable Function

Now that we have some idea of how gradient descent works with a single variable (x), we are going to take a step further and see how it works with two variables (x, y).

Here's h(x, y) shown on a 3D plot, with its partial derivatives given below for reference.

h(x, y) = x² + y²
∂h/∂x = 2x
∂h/∂y = 2y

Just like we did with the single-variable function, we are going to apply the same algorithm.

But now that we have two variables, we have to apply the update to both x and y, each using its own partial derivative.

Below is a minimal sketch of this two-variable update, followed by the output of running the algorithm for 30 iterations with n = 0.1.
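A sketch of this in Python might look like the following (the starting point is taken from the first row of the trace below; the function name gradient_descent_2d is illustrative):

def h(x, y):
    return x**2 + y**2

def gradient_descent_2d(x, y, n=0.1, iterations=30):
    # update x and y simultaneously, each with its own partial derivative
    for _ in range(iterations):
        dz_dx = 2 * x  # partial derivative of h with respect to x
        dz_dy = 2 * y  # partial derivative of h with respect to y
        print(f"X_value: {x:.6f},  Y_value:{y:.6f},")
        print(f"dz/dx= {dz_dx:.6f},  dz/dy= {dz_dy:.6f},  Z_value: {h(x, y):.6f}\n")
        x = x - n * dz_dx
        y = y - n * dz_dy
    return x, y

gradient_descent_2d(x=-3.333333, y=4.303030)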

X_value: -3.333333,  Y_value:4.303030, 
dz/dx= -6.666667,  dz/dy= 8.606061,  Z_value: 29.627181

X_value: -2.666667,  Y_value:3.442424,
dz/dx= -5.333333,  dz/dy= 6.884848,  Z_value: 18.961396

X_value: -2.133333,  Y_value:2.753939,
dz/dx= -4.266667,  dz/dy= 5.507879,  Z_value: 12.135293

X_value: -1.706667,  Y_value:2.203152,
dz/dx= -3.413333,  dz/dy= 4.406303,  Z_value: 7.766588

X_value: -1.365333,  Y_value:1.762521,
dz/dx= -2.730667,  dz/dy= 3.525042,  Z_value: 4.970616

X_value: -1.092267,  Y_value:1.410017,
dz/dx= -2.184533,  dz/dy= 2.820034,  Z_value: 3.181194

X_value: -0.873813,  Y_value:1.128014,
dz/dx= -1.747627,  dz/dy= 2.256027,  Z_value: 2.035964

X_value: -0.699051,  Y_value:0.902411,
dz/dx= -1.398101,  dz/dy= 1.804822,  Z_value: 1.303017

X_value: -0.559241,  Y_value:0.721929,
dz/dx= -1.118481,  dz/dy= 1.443857,  Z_value: 0.833931

X_value: -0.447392,  Y_value:0.577543,
dz/dx= -0.894785,  dz/dy= 1.155086,  Z_value: 0.533716

X_value: -0.357914,  Y_value:0.462034,
dz/dx= -0.715828,  dz/dy= 0.924069,  Z_value: 0.341578

X_value: -0.286331,  Y_value:0.369627,
dz/dx= -0.572662,  dz/dy= 0.739255,  Z_value: 0.218610

X_value: -0.229065,  Y_value:0.295702,
dz/dx= -0.458130,  dz/dy= 0.591404,  Z_value: 0.139910

X_value: -0.183252,  Y_value:0.236562,
dz/dx= -0.366504,  dz/dy= 0.473123,  Z_value: 0.089543

X_value: -0.146602,  Y_value:0.189249,
dz/dx= -0.293203,  dz/dy= 0.378499,  Z_value: 0.057307

X_value: -0.117281,  Y_value:0.151399,
dz/dx= -0.234562,  dz/dy= 0.302799,  Z_value: 0.036677

X_value: -0.093825,  Y_value:0.121120,
dz/dx= -0.187650,  dz/dy= 0.242239,  Z_value: 0.023473

X_value: -0.075060,  Y_value:0.096896,
dz/dx= -0.150120,  dz/dy= 0.193791,  Z_value: 0.015023

X_value: -0.060048,  Y_value:0.077517,
dz/dx= -0.120096,  dz/dy= 0.155033,  Z_value: 0.009615

X_value: -0.048038,  Y_value:0.062013,
dz/dx= -0.096077,  dz/dy= 0.124026,  Z_value: 0.006153

X_value: -0.038431,  Y_value:0.049611,
dz/dx= -0.076861,  dz/dy= 0.099221,  Z_value: 0.003938

X_value: -0.030745,  Y_value:0.039688,
dz/dx= -0.061489,  dz/dy= 0.079377,  Z_value: 0.002520

X_value: -0.024596,  Y_value:0.031751,
dz/dx= -0.049191,  dz/dy= 0.063502,  Z_value: 0.001613

X_value: -0.019677,  Y_value:0.025401,
dz/dx= -0.039353,  dz/dy= 0.050801,  Z_value: 0.001032

X_value: -0.015741,  Y_value:0.020320,
dz/dx= -0.031482,  dz/dy= 0.040641,  Z_value: 0.000661

X_value: -0.012593,  Y_value:0.016256,
dz/dx= -0.025186,  dz/dy= 0.032513,  Z_value: 0.000423

X_value: -0.010074,  Y_value:0.013005,
dz/dx= -0.020149,  dz/dy= 0.026010,  Z_value: 0.000271

X_value: -0.008060,  Y_value:0.010404,
dz/dx= -0.016119,  dz/dy= 0.020808,  Z_value: 0.000173

X_value: -0.006448,  Y_value:0.008323,
dz/dx= -0.012895,  dz/dy= 0.016647,  Z_value: 0.000111

X_value: -0.005158,  Y_value:0.006659,
dz/dx= -0.010316,  dz/dy= 0.013317,  Z_value: 0.000071

X_value: -0.004126,  Y_value:0.005327,
dz/dx= -0.008253,  dz/dy= 0.010654,  Z_value: 0.000045

The results show that it converges towards (0, 0), which is both the local and the global minimum of the function h(x, y).


Gradient Descent in Deep Learning

The examples shown above are very much oversimplified.

In a neural network, the loss function can depend on thousands or even millions of parameters (the weights and biases), which makes it much harder to minimize.

This also means that it takes longer to train the model.

To speed up the training process, we could explore other variants of gradient descent, such as stochastic gradient descent and mini-batch gradient descent.
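As a rough, illustrative sketch of the difference (not from the original post): full-batch gradient descent averages the gradient over every data point before each update, while stochastic gradient descent updates using one randomly chosen point at a time. On a toy one-parameter model y ≈ w·x with a squared-error loss, that contrast might look like this:

import random

# toy data generated from y = 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def grad_single(w, x, y):
    # gradient of (w*x - y)**2 with respect to w, for one data point
    return 2 * (w * x - y) * x

w_batch, w_sgd, n = 0.0, 0.0, 0.01

for _ in range(100):
    # full-batch: average the gradient over all points, then update once
    g = sum(grad_single(w_batch, x, y) for x, y in zip(xs, ys)) / len(xs)
    w_batch -= n * g

    # stochastic: update using a single randomly chosen point
    x, y = random.choice(list(zip(xs, ys)))
    w_sgd -= n * grad_single(w_sgd, x, y)

print(w_batch, w_sgd)  # both should approach 3.0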

Feel free to look them up on your own if you are interested.

Thank you for reading! 😀
