During my time in Term 4 of CSD, I was introduced to various algorithms, and I was particularly interested in the applications of graph algorithms to real-world contexts.
This was my motivation for this project.
One of the graph algorithms that I learnt was Dijkstra’s algorithm, which finds the shortest path from a source node to a destination node in a graph with non-negative edge weights.
I am also someone who frequently takes public transport, and most of my transport time is spent on the train.
Since I have quite a few minutes to spare on the MRT, I spend my commute time coding (P.S. I am writing this while on the MRT as well, haha).
One day, an idea sprang up in my head: build a map of the MRT stations and run Dijkstra’s algorithm on it. That idea became this project.
How is the map created?
Each station on the map is represented by a “Station” object in Vertex.py. A Station is defined by its name, colour, edges, distance, and a reference to the previous station. The first three attributes are crucial for creating a unique station, and the last two are crucial for storing information while running Dijkstra’s algorithm.
Each ‘connection’ is represented by an “Edge” object defined by its origin, destination, and weight, i.e. the time taken to travel from origin to destination.
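Here's a minimal sketch of what these two classes might look like (the attribute names here are illustrative and may differ slightly from the ones in Vertex.py):

### minimal sketch of the Station and Edge classes (illustrative, not the exact repo code)
class Station:
    def __init__(self, name, colour):
        self.name = name                  # e.g. "Dhoby Ghaut"
        self.colour = colour              # MRT line colour
        self.edges = []                   # connections to neighbouring stations
        self.distance = float("inf")      # tentative travel time used by Dijkstra's algorithm
        self.previous = None              # previous station on the shortest path found so far

class Edge:
    def __init__(self, origin, destination, weight):
        self.origin = origin              # Station object
        self.destination = destination    # Station object
        self.weight = weight              # time taken to travel from origin to destination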
Using multiple helper functions, the stations and edges are created by looping over the lists of station names and travel durations. Each MRT line is created in isolation, and once all the lines are created, we connect one line to another using the interchanges.
Tada! We have our MAP!
Functions
search (string)
- returns a list of station names containing the input string
getAdjacencyListOf(station)
- returns the adjacency list of the input station. The adjacency list is the list of neighbouring nodes in the graph; in this case, the stations directly reachable from the input station in every direction.
getShortestPath(origin, destination)
- returns a list of stations along the shortest path from origin to destination
getShortestTime(origin, destination)
- returns time taken for shortest path from origin to destination
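For reference, here's a simplified sketch of how getShortestPath could be implemented with Dijkstra's algorithm using the Station and Edge objects above (an illustration of the idea, not the exact code in the repo):

### simplified Dijkstra sketch (illustrative, not the exact repo code)
import heapq

def getShortestPath(origin, destination):
    origin.distance = 0
    queue = [(0, id(origin), origin)]      # (travel time, tie-breaker, station)
    visited = set()
    while queue:
        time, _, station = heapq.heappop(queue)
        if station in visited:
            continue
        visited.add(station)
        if station is destination:
            break
        for edge in station.edges:
            new_time = time + edge.weight
            if new_time < edge.destination.distance:
                edge.destination.distance = new_time
                edge.destination.previous = station
                heapq.heappush(queue, (new_time, id(edge.destination), edge.destination))
    # walk back through the 'previous' references to recover the path of stations
    path, current = [], destination
    while current is not None:
        path.append(current)
        current = current.previous
    return list(reversed(path))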
If you would like to play around with the code, here’s the link to the GitHub repo. Try it!
The code implemented here only accounts for the time taken for the MRT to travel from A to B; it does not account for walking time at the interchanges, so the results might not be perfectly accurate.
Deep Learning at first glance sounds very complicated. What if I told you it is not as complicated as you think?
This blog will explain the basics of how Deep Learning works. You will need to be comfortable with seeing a lot of derivatives and know a little bit of linear algebra.
Perceptron
What is a perceptron?
Perceptrons in Deep Learning are loosely modelled on the neurons in the human brain.
How does the human brain work?
At this point, it is important to explain how the human brain works. The human brain consists of a vast network containing billions of neurons.
When we receive information, certain neurons fire to process it. And when we receive similar information in the future, the same sets of neurons fire, and that is how we recognize that this information is something we have seen before.
This is what we call learning.
One example of this is the picture books that children use to learn to identify animals.
How can we apply the concept of neurons to the perceptron?
Just like how certain neurons fire when similar information is provided, we can define perceptrons such that certain perceptrons in the neural network get activated when we give them similar information.
When information is provided, a single neuron has only two states: it fires or it doesn’t.
We can apply this to the perceptron in binary form: assign “1” to a perceptron that fires and “0” to one that doesn’t.
How does a perceptron work?
Let’s start off with a simple example. In the diagram below, there are 3 inputs (x) and a bias term connected to a perceptron, and each connection has its own weight (w).
From the diagram, both the inputs and the respective weights can be written as matrices:
From this, we can express ŷ as follows:
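With the three inputs collected into a vector x = (x₁, x₂, x₃)ᵀ and the weights into w = (w₁, w₂, w₃)ᵀ, writing w₀ for the weight on the bias term and g for the activation function introduced in the next section, the standard form is:

\[
\hat{y} = g\left(w_0 + \mathbf{w}^{\top}\mathbf{x}\right) = g\left(w_0 + \sum_{i=1}^{3} w_i x_i\right)
\]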
Non-Linear Activation Function
After attaining the weighted sum, we apply a non-linear activation function.
The most commonly used and easiest to understand would be “ReLU”, but there are other activation functions such as “tanh” and “sigmoid” as well.
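As a quick illustration, here's how these three activation functions can be written with NumPy (standard definitions, not tied to any particular deep learning library):

### common activation functions (standard definitions, written with NumPy)
import numpy as np

def relu(z):
    return np.maximum(0, z)        # 0 for negative inputs, identity otherwise

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # squashes the weighted sum into (0, 1)

def tanh(z):
    return np.tanh(z)              # squashes the weighted sum into (-1, 1)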
All this calculation is just for one perceptron, and we would have to compute this for every perceptron in our neural network.
And we can have several hidden layers and it might look something like this:
Backpropagation
This is the most crucial part of Deep Learning as this is essentially where the “learning” takes place.
During the first forward propagation, the weights are assigned randomly, and the loss would be large.
If you ask the computer to classify images of dogs and cats, the results would be no better than a coin flip. (Well, that’s just not good.)
That’s why we have to work backwards to update our weights so that the predictions fit the expected results.
The expression for the gradient of the loss function with respect to a weight can be obtained by applying the chain rule, as shown below.
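Written out for a single weight wᵢ (with ŷ denoting the perceptron’s output and J(W) the loss), the chain rule gives:

\[
\frac{\partial J(W)}{\partial w_i} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_i}
\]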
This gradient can then be used to update all the weights in the neural network (w1, w2, …) so that J(W) moves towards its minimum value.
This algorithm is also known as Gradient Descent. I have a blog on that too! Check it out here.
As we repeat this forward and backward propagation for several iterations, the computer learns to recognize patterns in the data and the predictions become increasingly accurate.
Gradient descent is an iterative algorithm for finding a local minimum of a differentiable function. This blog post will be a quick explanation of how it works and its application in deep learning.
In deep learning, we have something called a loss function, which measures how far off the predictions are from their true values.
Some of the loss functions include Mean Squared Error(MSE), Mean Absolute Error(MAE) and Cross-Entropy.
In order for the predictions to be more accurate, we have to minimize the loss function through gradient descent.
Gradient Descent on single variable function
Let’s start off with something simple to get an intuition of how Gradient Descent works.
We have a function f(x) and we will be using gradient descent to get the local minimum of f(x). f'(x) is also shown below for reference.
f(x) = x² − 8x + 4
f'(x) = 2x − 8
The algorithm for gradient descent is given by this update rule:
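In symbols (with n as the learning rate, matching the explanation below):

\[
X_{\text{new}} = X_{\text{current}} - n \, f'(X_{\text{current}})
\]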
In this equation, the new value of X is given by the current value of X minus n (the learning rate) multiplied by the gradient at the current value of X. The first value of X is usually a random guess within the range of x. n can be thought of as a “step size” and is an arbitrary positive constant.
If a function is concave upwards and the initial X value is to the right of the minimum point, the derivative will be positive.
This means that the subsequent X value will be less than the previous X value, and this results in convergence towards the minimum point of the function.
I have added an animation to help you visualize this better. The algorithm is run for 30 iterations with n = 0.2.
Here’s the output of the algorithm. Notice that in the first few iterations it converges very quickly; however, it converges at a gradually slower rate as the iterations increase.
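For reference, here's a small reimplementation of the loop behind that animation (same f'(x), 30 iterations and n = 0.2; the starting value is an arbitrary assumption):

### gradient descent on f(x) = x^2 - 8x + 4 (sketch of the loop behind the animation)
def f_prime(x):
    return 2 * x - 8

x = 12.0          # arbitrary starting guess (assumption)
n = 0.2           # learning rate (step size)
for i in range(30):
    x = x - n * f_prime(x)
print(x)          # converges towards the minimum at x = 4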
You can think of the gradient descent algorithm as rolling a ball down the curve. The algorithm ensures that the “ball” ends up at a local minimum.
Common Challenges
The algorithm might not give the global minimum value
In the example above, the function was quadratic and therefore has only one minimum point.
However, as functions get more complex, there might be multiple minimum points.
Here’s a sample of a function g(x) with two minimum points.
From the graph above, we can clearly see that the global minimum occurs at roughly x = 20.
However, when we run the same algorithm for 30 iterations on the graph of g(x), with n = 0.2, the result shows the local minimum point at roughly x = -2.
This means that the gradient descent algorithm does not necessarily converge to the global minimum point and thus it has limitations in minimizing the loss function.
One way to get around this is by repeating the algorithm several times with different starting points.
When deciding the learning rate, we must ensure that it is not too small, as a very small learning rate results in slower convergence towards the minimum point.
The algorithm is run for 30 iterations on the graph of f(x) with n = 0.01.
Now that we have some idea of how gradient descent works with a single variable (x), we are going to take a step further and see how it works with two variables (x, y).
Here’s a plot of h(x, y) on a 3D plot, and its partial derivatives for reference.
h(x, y) = x² + y²
∂h/∂x = 2x
∂h/∂y = 2y
Just like we did with the single variable function, we are going to apply the same algorithm.
But now that we have two variables, we have to apply the algorithm to both x and y.
Applying the algorithm to both x and y for 30 iterations at n = 0.1, we converge towards the minimum point just as in the single-variable case.
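A minimal sketch of this two-variable update (my own illustration, using the partial derivatives above; the starting point is arbitrary):

### gradient descent on h(x, y) = x^2 + y^2 (two-variable sketch)
x, y = 8.0, -6.0     # arbitrary starting point (assumption)
n = 0.1              # learning rate
for i in range(30):
    # update both variables using their partial derivatives
    x, y = x - n * (2 * x), y - n * (2 * y)
print(x, y)          # converges towards the minimum at (0, 0)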
In this blog post, we are going to compare 3 different types of plots to help you visualize the distribution of data using seaborn. We will also mention the advantages and disadvantages of each plot so that you can choose the most suitable one when you are plotting.
### import relevant libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
For this blog post, we will be using Student Performance in Exam dataset from kaggle here.
Just for reference, this is what the data looks like.
### reading the csv file
df = pd.read_csv("StudentsPerformance.csv")
df
Box plot
This is the simplest and easiest to understand of all the plots in this blog.
The box represents the interquartile range (IQR) of the data, from the 25th percentile (Q1) to the 75th percentile (Q3).
The line within the box represents the median value.
The two tails represent the Lower Adjacent Value(Q1 – 1.5*IQR) and Upper Adjacent Value(Q3 + 1.5*IQR).
The data points outside the tails represent the outliers in the dataset.
When we plot the math scores across the different races from the dataset, this is what we get.
fig, ax = plt.subplots(figsize=(16,8))
fig = sns.boxplot(y= "math score", x= 'race/ethnicity',data= df.sort_values('race/ethnicity'), ax=ax).set_title("Distribution of Math Scores")
We can subdivide these box plots of the math scores of different races into male and female math scores as well.
fig, ax = plt.subplots(figsize=(16,8))
fig = sns.boxplot(y= "math score", x= 'race/ethnicity', hue = 'gender' ,data= df.sort_values('race/ethnicity'), ax=ax).set_title("Distribution of Math Scores")
Advantages – Simple and easy to understand
Disadvantages – May not be as visually appealing as other plots
Violin plot
The violin plot is very similar to the box plot in terms of what it represents, but with much more to offer.
At the center of the violin, the white dot represents the median value, the bold line represents the interquartile range, and the faint line represents the Lower and Upper Adjacent Values.
You may be wondering what additional information this violin shape contains. Well, it is pretty simple.
The width of the violin (the density curve) represents how frequently the data takes values around that point, which roughly corresponds to the probability of attaining that value.
Moreover, we can split the violin to show the data separately for a category with two labels.
For this dataset, we can split the violin and have one side represent the scores of male students and the other side the scores of female students.
For the plot below, I have plotted the violin plots of the math scores across the different races, further split by gender.
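The plotting code isn't shown above, but a sketch using seaborn's violinplot (mirroring the box plot calls earlier) would look something like this:

### violin plot of math scores by race/ethnicity, split by gender (sketch)
fig, ax = plt.subplots(figsize=(16, 8))
sns.violinplot(y="math score", x="race/ethnicity", hue="gender", split=True,
               data=df.sort_values("race/ethnicity"), ax=ax).set_title("Distribution of Math Scores")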
Advantages
– Includes the frequency of the values in the data
– Reflects the probability of attaining a given value
– Can be split to represent two demographics
Disadvantages
– The median, IQR, and Lower and Upper Adjacent Values are not as explicit, and outlier values are difficult to see
– Gives the misleading impression that a student can score lower than 0 or higher than 100
Swarm plot
Swarm plots convey essentially the same information as violin plots, but instead of using a smooth curve to represent the frequency of the data, a swarm plot draws every individual data point.
As before, let us plot the distribution of the math scores to see what it looks like.
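A sketch of the corresponding swarm plot call (same pattern as the plots above):

### swarm plot of math scores by race/ethnicity (sketch)
fig, ax = plt.subplots(figsize=(16, 8))
sns.swarmplot(y="math score", x="race/ethnicity",
              data=df.sort_values("race/ethnicity"), ax=ax).set_title("Distribution of Math Scores")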
When we plot the same data as in the violin plot above, the resulting plot also reveals the uneven distribution of race/ethnicity groups in the sample.
Advantages
– Outlier values are shown clearly
– Accounts for differences in the sizes of the categories in the sample
– The rough distribution of the data can still be seen
Disadvantages
– The median, IQR, and Lower and Upper Adjacent Values are no longer shown
– Cannot be split in half to compare two groups the way a violin plot can
Conclusion
To be very honest, if you were to ask me “What is the best plot?”, I would say it really depends on what you want to achieve.
Some plots offer features that others don’t, so there is a trade-off to consider when choosing between these plots.
This is a brief blog post on how to use deep learning to predict whether tweets are talking about actual disasters.
The objective is to determine if a given tweet is referring to an actual disaster or not.
I used Google Colab for this project so that I could use their GPU to run the code faster.
Disclaimer: This is code I have written myself, and it may or may not be the most efficient approach. If you have any suggestions, I welcome constructive feedback.
Importing relevant libraries
Here are the purposes of each library imported (the import block is shown after this list):
pandas for processing data
tensorflow and keras for deep learning
nltk, natural language toolkit for natural language processing
matplotlib for plotting graphs
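The import block looks roughly like this (the exact submodules may differ from my original notebook):

### import relevant libraries (exact submodules may differ from the original notebook)
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import nltk
import matplotlib.pyplot as plt

nltk.download("stopwords")   # needed once so nltk's stopword list is available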
Reading the csv file
Using pandas, we can read the csv file. “train.head()” will return the first 5 rows of the dataset.
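Assuming the Kaggle files are named train.csv and test.csv, the reading step looks like this:

### reading the csv files (assuming the Kaggle file names train.csv and test.csv)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()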
Here’s a sample of what is inside
From here, we can see that disaster tweets are denoted by a target value of “1”, and non-disaster tweets are assigned a target value of “0”.
Data Cleaning
As we can see from the figure above, the data contains some not-so-useful information, such as the columns “keyword” and “location”, which are mostly empty values.
So before we can get the computer to understand what we are saying, we have to clean up the mess in the data.
We will do just a simple round of data cleaning, which consists of:
Dropping the columns “keyword” and “location”.
Converting all text to lower case
Removing hyperlinks and special characters (#, @, %, … )
Removing stopwords.
Before we move on, here’s a very brief introduction to stopwords. Stopwords are words that do not carry much information; they are mostly pronouns, prepositions, and conjunctions. Here are some examples of stopwords:
Let’s write some functions to make our data cleaning a little bit easier.
Then we apply these functions to the training and test datasets, as sketched below.
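Here's what those cleaning steps might look like in code (the helper names and regexes are illustrative, not necessarily the exact ones I used):

### data cleaning helpers (illustrative names and regexes)
import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def remove_special(text):
    text = re.sub(r"https?://\S+", "", text)      # remove hyperlinks
    return re.sub(r"[^a-z0-9\s]", "", text)       # remove #, @, % and other special characters

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in stop_words)

def clean(df):
    df = df.drop(columns=["keyword", "location"])   # mostly empty columns
    df["text_v1"] = df["text"].str.lower()          # lower case
    df["text_v2"] = df["text_v1"].apply(remove_special)
    df["text_v3"] = df["text_v2"].apply(remove_stopwords)
    return df

train_clean = clean(train)
test_clean = clean(test)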
Here’s a sample of how an entry looks after data cleaning.
id 38
text Was in NYC last week! #original text
text_v1 was in nyc last week! #lower case
text_v2 was in nyc last week #special character removed
text_v3 nyc last week #stopwords removed
target 0
Tokenization
Now that we have our data cleaned, we can now give it to the computer to do the ‘learning’.
Well not exactly.
Computers understand numbers better than they understand English, so we have to somehow convert our texts to numbers. That’s when we have to do Tokenization.
How it works.
You can think of tokenization as a one-to-one function: there exists one and ONLY one number corresponding to each English word.
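A sketch of the tokenization step with Keras (the vocab_size and oov_tok values here are illustrative assumptions):

### tokenization (vocab_size and oov token values are illustrative)
from tensorflow.keras.preprocessing.text import Tokenizer

vocab_size = 10000          # keep only the most common words (illustrative value)
oov_tok = "<OOV>"           # token assigned to out-of-vocabulary words

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok, char_level=False)
tokenizer.fit_on_texts(train_clean.text_v3)   # builds the word-to-number dictionary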
In the code above, we use the words from “train_clean.text_v3” to form a dictionary consisting of English words and their corresponding numerical values.
“oov_tok” is a token that is assigned to words that are out-of-vocabulary.
“vocab_size” is the number of tokenized words you want to keep. Only the most common words are kept.
“char_level” is set to False. If True, every character would be treated as a token.
When training the model, we have to divide the training dataset into two separate groups: training data and validation data.
You can think of machine learning / deep learning as drawing a best-fit curve on a scatterplot. With just the training dataset, you could draw a curve passing through every point and obtain a model with very high accuracy on the training data, since the mean squared error would be zero.
However, real-world data is subject to noise, so a model fitted this way may not reflect the general trend that occurs in reality.
This is called overfitting. To avoid it, we set aside a validation dataset to check that the model can be applied to a larger context.
Here’s the code for “train_test_split”:
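A sketch of that step (the 80/20 split ratio and random_state are assumptions):

### splitting into training and validation sets (80/20 split is an assumption)
from sklearn.model_selection import train_test_split

X_train, X_valid, Y_train, Y_valid = train_test_split(
    train_clean.text_v3, train_clean.target, test_size=0.2, random_state=42)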
X_train, X_valid: “text_v3” training and validation
Y_train, Y_valid: “target” training and validation
Padding
The final step we have to do before we train the model is padding.
Language consists of sentences of varying lengths, but the model can only learn when every input has the same length. So we have to fill in the gaps for shorter sentences by padding them, as sketched after the parameter list below.
“max_len”: maximum length of the sentence
“padding_type” : sentences will be padded at the end
“trunc_type”: sentences longer than max_len will be cut short from the back
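Putting the tokenization and padding together (the max_len value is an illustrative assumption):

### converting text to padded sequences (max_len value is illustrative)
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 30              # maximum sentence length (illustrative)
padding_type = "post"     # pad shorter sentences at the end
trunc_type = "post"       # truncate longer sentences from the back

X_train_pad = pad_sequences(tokenizer.texts_to_sequences(X_train),
                            maxlen=max_len, padding=padding_type, truncating=trunc_type)
X_valid_pad = pad_sequences(tokenizer.texts_to_sequences(X_valid),
                            maxlen=max_len, padding=padding_type, truncating=trunc_type)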
Building a model
The model consists of 4 layers: 2 Bidirectional LSTM layers and 2 Dense layers.
Embedding is a way to represent words as dense vectors in a continuous vector space.
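A sketch of such a model in Keras (the layer sizes and embedding dimension are assumptions; the layer types follow the description above):

### model sketch: Embedding + 2 Bidirectional LSTM layers + 2 Dense layers
### (layer sizes and embedding dimension are assumptions)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

embedding_dim = 64   # vocab_size and max_len come from the tokenization and padding steps above

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_len),
    Bidirectional(LSTM(64, return_sequences=True)),   # first LSTM returns sequences for the second
    Bidirectional(LSTM(32)),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),                   # binary output: disaster (1) or not (0)
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()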