Why we use the activation function
Episode 2 - How a non-linear function widens the range of information the model can capture
In the previous episode we discussed, at a high level, the main elements that constitute an Artificial Neural Network (ANN), and we noted that the raw output computed by each node can't be used directly by the rest of the network. That's because the computation inside each node is just a simple linear transformation:
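Using the usual notation, where x_1, …, x_n are the inputs arriving at the node, w_1, …, w_n the corresponding weights, and b the bias, this transformation is nothing more than a weighted sum:

```latex
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \sum_{i=1}^{n} w_i x_i + b
```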
A transformation like this can't capture all the information available in the dataset. You can convince yourself of this simply by looking at the world around us: the environment rarely follows a perfectly linear pattern. Even without plotting complex graphs, think about pressing the accelerator while driving a car. The velocity doesn't increase linearly, which means that the relationship between time and velocity is not strictly proportional.
So to fix this problem we have to introduce a non-linear function (also called an activation function) whose purpose is to widen the spectrum of information that can be "learnt" by the algorithm.
The XOR counterexample
If you are not convinced by the car example, let's consider the XOR operator.
In case you don't know it, the XOR (exclusive OR) operator is a binary operator that returns true if and only if exactly one of its two operands is true, and it's defined as follows:
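Here is its truth table, where A and B stand for the two operands:

A   B   A XOR B
0   0   0
0   1   1
1   0   1
1   1   0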
If we tried to reproduce the XOR function using just a linear function, linear algebra comes to our aid: we would have to solve the following system of linear equations, written in matrix form:
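One way to set it up: we look for two weights w_1, w_2 and a bias b such that w_1·A + w_2·B + b reproduces the XOR column of the truth table above. Row by row, that requirement reads:

```latex
\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ b \end{pmatrix}
=
\begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}
```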
If you try to solve the system, you will find that it has no solution: the first row forces b = 0, the next two rows then force w_1 = w_2 = 1, but the last row requires w_1 + w_2 + b = 0, which is a contradiction. In other words, the XOR function can't be computed by a purely linear model.
This is just one of many examples showing that a purely linear function can't describe all the complexity of the world that surrounds us.
The activation function
There are many kinds of activation functions; this post covers just a few of them. All of them are non-linear functions that transform the scalar number z produced by each node into a new scalar. The most typical are:
ReLU (Rectified Linear Unit):
Despite containing the word "Linear" in its name, this is one of the simplest non-linear functions out there. It is defined as follows:
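Written as a formula, with z being the raw output of the node:

```latex
\mathrm{ReLU}(z) = \max(0, z) =
\begin{cases}
z & \text{if } z > 0 \\
0 & \text{otherwise}
\end{cases}
```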
So given the output z of the node, if z is a negative number this function passes zero on to the next nodes; otherwise it passes z through unchanged. Graphically, it can be seen as:
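If you want to draw it yourself, a minimal Python sketch (assuming NumPy and Matplotlib are available) is enough:

```python
import numpy as np
import matplotlib.pyplot as plt

# ReLU: zero for negative inputs, identity for positive inputs
z = np.linspace(-5, 5, 200)
relu = np.maximum(0, z)

plt.plot(z, relu)
plt.xlabel("z (raw node output)")
plt.ylabel("ReLU(z)")
plt.title("Rectified Linear Unit")
plt.grid(True)
plt.show()
```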
This function is very commonly used because it avoids many of the optimization problems that the functions below suffer from.
Logistic:
This is one of the most important functions in statistics. Outside of neural networks, it is commonly used to classify binary categorical variables (as in logistic regression), and in every case its output can be read as the success probability of a Bernoulli trial. The node's linear combination z appears as the exponent of Euler's number e:
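Written explicitly:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
```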
Because the result can be interpreted as a probability of success, the function always outputs values between 0 and 1 (more precisely, those two extremes are horizontal asymptotes and are never actually reached).
This function is used inside an ANN when the goal is to perform a binary classification.
Softmax:
The softmax is used when the classification has more than two possible outputs (think, for example, of simple digit recognition, which has 10 different possible outputs, one for each digit). It is very similar to the previous logistic function, of which it is the multi-class generalization, and it is defined as follows:
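Here K denotes the number of possible classes and z_1, …, z_K the scores produced for them, so the i-th output of the softmax is:

```latex
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{for } i = 1, \dots, K
```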
There are two graphs worth describing here. The first one plots the output of the function for each input value; since each output is an exponential divided by a constant normalizing sum, its shape is quite similar to the well-known exponential function.
If we instead cumulate the outputs produced by the function over the input values, the total sum is exactly 1, and the resulting cumulative curve looks like the sigmoid (logistic) function.
These three activation functions are not the only ones, but they are probably the most important and the most commonly used. If you're interested in learning more, I suggest looking into the Exponential Linear Unit (ELU), the Leaky ReLU, and the hyperbolic tangent (tanh). The first two are quite similar to the ReLU seen before, while the last one is similar to the logistic function.
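To make the three definitions concrete, here is a minimal NumPy sketch of the activations discussed above (the function names are mine, chosen only for this illustration):

```python
import numpy as np

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, z)

def logistic(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into a probability distribution that sums to 1.
    # Subtracting the maximum is a common trick for numerical stability.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

print(relu(np.array([-2.0, 0.0, 3.0])))      # [0. 0. 3.]
print(logistic(np.array([-2.0, 0.0, 3.0])))  # ~[0.119 0.5   0.953]
print(softmax(np.array([1.0, 2.0, 3.0])))    # ~[0.090 0.245 0.665]
```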
Computing XOR
The counterexample seen before can be resolved simply by applying the ReLU function inside each node. First, let's build a simple ANN with three neurons, each of which sends its output onward only after it has been processed by the activation function.
More precisely, the function applied in each node is a linear combination, with weights and biases already set, followed by the ReLU:
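One set of values that does the job (many different choices would work; this one uses two hidden neurons h_1, h_2 and an output neuron y, all passing through the ReLU) is:

```latex
\begin{aligned}
h_1 &= \mathrm{ReLU}(1 \cdot x_1 + 1 \cdot x_2 + 0) \\
h_2 &= \mathrm{ReLU}(1 \cdot x_1 + 1 \cdot x_2 - 1) \\
y   &= \mathrm{ReLU}(1 \cdot h_1 - 2 \cdot h_2 + 0)
\end{aligned}
```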
It is immediate to check that this little network returns the right XOR value for each of the four inputs (0,0), (0,1), (1,0), and (1,1).
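A few lines of Python are enough to check it, using the same weights as in the sketch above:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def xor_network(x1, x2):
    # Hidden layer: two ReLU neurons with fixed weights and biases
    h1 = relu(1 * x1 + 1 * x2 + 0)   # weights (1, 1), bias 0
    h2 = relu(1 * x1 + 1 * x2 - 1)   # weights (1, 1), bias -1
    # Output neuron: linear combination of the hidden activations, again through ReLU
    return relu(1 * h1 - 2 * h2 + 0)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"{a} XOR {b} = {xor_network(a, b)}")
# 0 XOR 0 = 0
# 0 XOR 1 = 1
# 1 XOR 0 = 1
# 1 XOR 1 = 0
```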
In this example I have already set the value of each weight and bias. In practice, the process of finding the best value for each parameter is pretty tedious and involves Gradient Descent and Backpropagation. But those are the topics of the next lessons…
(Probably if you click on my account you can already see the next episode)