Part of Artificial Intelligence is about allowing a machine to learn concepts and infer decisions from that knowledge. One of the most powerful basic concepts in AI and machine learning is the Artificial Neural Network. Thanks to neural networks, we are able to train computers in ways that were not possible with simpler mathematical tools such as logistic regression.

In this article, we want to talk about the basics of neural networks and how they work.

What are Artificial Neural Networks?

As the name suggests, they are inspired by the biological human brain. Information is processed by neurons, which can be thought of as the brain's computational units. Each neuron communicates with other neurons, so the human brain contains a network of neurons. Communication takes place across synapses: processed data is transferred from one neuron (the presynaptic neuron) to another (the postsynaptic neuron) to be processed further [1].

In an Artificial Neural Network (from now on I will simply call it a neural network), neurons are our computational units. Neurons are organized in layers, and the weights play the role of the synapses. Neurons emit signals so that communication can occur between neurons of adjacent layers, but not between neurons in the same layer (intra-layer connections are not allowed).

A neural network has three types of layers:

  • Input layer: receives the feature vectors;
  • Hidden layer: transforms the data and performs the actual computation;
  • Output layer: produces the model's prediction, which is then evaluated.

In a common network there is one input layer, one output layer and as many hidden layers as we want! Adding hidden layers extends the depth of the network, which is a perfectly valid and often important design choice. In a classification problem, the output layer must have exactly as many neurons as there are classes, because each output neuron is associated with one class.

As previously mentioned, each neuron emits a signal. Each neuron is linked to every neuron in the adjacent layer. It is important to note that these connections are parameters of the network: each connection carries one of the weights of the model. There is also a bias term, usually drawn as one extra unit per layer that feeds every neuron in the next layer, so that each neuron receives its own bias value.
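
To make the idea of "connections as parameters" concrete, here is a minimal sketch that counts the parameters of a fully connected network. The function name `count_parameters` and the layer sizes are ours, chosen only for illustration, and we assume one bias value per non-input neuron, which is a common convention.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected feedforward network.

    layer_sizes lists the number of neurons per layer, input layer first.
    Assumption: every neuron is connected to every neuron of the next layer,
    and every non-input neuron has its own bias value.
    """
    # One weight per connection between two adjacent layers.
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron in the hidden and output layers.
    biases = sum(layer_sizes[1:])
    return weights, biases


# Hypothetical network: 2 inputs, 3 hidden neurons, 1 output neuron.
weights, biases = count_parameters([2, 3, 1])
print(weights, biases)
```

For this small network there are 2·3 + 3·1 = 9 weights and 3 + 1 = 4 biases, so 13 parameters to learn in total.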

Neuron and activation

It goes without saying that the most important element is the neuron. It acts like a logistic regressor: a linear combination of the weights and the input data is computed and then transformed by an activation function (image from [2]).

z = b + \sum_i x_i w_i

out = f(z)
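
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the example values for x, w and b are made up, and we pick the sigmoid as f simply because it makes the neuron behave like a logistic regressor.

```python
import math


def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, f=sigmoid):
    """Compute out = f(z) with z = b + sum_i x_i * w_i."""
    z = b + sum(x_i * w_i for x_i, w_i in zip(x, w))
    return f(z)


# Illustrative values: z = 0.5 + (1.0 * 0.5) + (2.0 * -0.5) = 0.0
out = neuron(x=[1.0, 2.0], w=[0.5, -0.5], b=0.5)
print(out)  # sigmoid(0.0) is 0.5
```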

Why should we use so many computational units? Each activation helps the model generalize: it breaks the linearity constraint, so the output becomes a non-linear function of the input. More informally, this can be called “feature abstraction”: at each layer we abstract the data further, and the more levels of abstraction we have, the closer the representation gets to a generalization of the entire problem. There are many activation functions, so we need to choose carefully based on how our model responds. Some of the most famous functions are:

  • Logistic/Sigmoid:

\frac{1}{1 +e^{-z}}

  • Tanh (hyperbolic tangent):

\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}

  • ReLU (Rectified Linear Unit):

max(0, z)

  • Softmax:

\frac{e^{z_j}}{\sum_i e^{z_i}}

Today the most used activation is probably ReLU and its variations: it is a very simple function that works incredibly well, especially in deep neural networks. Softmax is normally used only on the output layer, because it turns the output values into probabilities that sum to 1. This means that the output neuron with the highest value indicates the most probable category.
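
As a rough sketch, the four activations can be written with the Python standard library alone. The input values below are arbitrary, and subtracting the maximum inside `softmax` is a common numerical-stability trick that we add here; it does not change the result.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)  # shifting by the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary output-layer scores for a hypothetical 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # index of the most probable class
print(probs, predicted_class)
```

The predicted class is simply the position of the largest probability, which is also the position of the largest raw score, since softmax preserves the ordering of its inputs.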

Learning type

Neural networks, like many other models, can be used with various types of learning:

  • Supervised learning: for each sample we have a target label that tells us the true nature of the data. Our model estimates the class label of a sample and compares it with the true label. The error is then used to improve the model;
  • Unsupervised learning: the network is trained without any labels for the input samples. It is also used to perform feature extraction, which makes it possible to obtain features that are not handcrafted;
  • Semi-supervised learning: a very popular and widely used type of learning. An unsupervised step comes first, and the target labels are then used to fine-tune the model;

Network training

Neural network training consists of two main steps:

  1. Feedforward;
  2. Backward;

In the feedforward step, data starts at the input layer and passes through all consecutive hidden layers until it reaches the output layer. At the end, the output is compared with the target labels in order to estimate how the model is performing. Performance is measured with a cost function. At this stage, the sensitivity of the error to each parameter is propagated back using a technique called backpropagation. It consists of propagating the error back towards the input layer by computing the derivatives of all the operations that were performed during the forward step.

This approach relies on a property of derivatives, the chain rule. Without going into the mathematical details, each weight and bias in the network is updated according to its contribution to the error in the last step. At this point, when a new forward pass is executed, we expect a smaller error than the previous one and, ideally, a model that generalizes better.
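
To give a feeling for the forward/backward loop, here is a minimal sketch that trains a single sigmoid neuron on the logical OR function. Everything in it is our own choice for illustration: the dataset, the squared-error cost, plain gradient descent with a learning rate of 0.5, and the names `forward`, `mean_cost` and `train`. A real network repeats the same chain-rule step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset (assumption for this sketch): logical OR of two bits.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def forward(x, w, b):
    """Forward step: out = sigmoid(b + sum_i x_i * w_i)."""
    return sigmoid(b + sum(xi * wi for xi, wi in zip(x, w)))


def mean_cost(w, b):
    """Squared-error cost, averaged over the dataset."""
    return sum(0.5 * (forward(x, w, b) - y) ** 2 for x, y in DATA) / len(DATA)


def train(epochs=5000, lr=0.5):
    w, b = [0.0, 0.0], 0.0
    costs = [mean_cost(w, b)]
    for _ in range(epochs):
        for x, y in DATA:
            out = forward(x, w, b)
            # Backward step (chain rule):
            # dC/dz = dC/dout * dout/dz = (out - y) * out * (1 - out)
            delta = (out - y) * out * (1.0 - out)
            # dz/dw_i = x_i and dz/db = 1, so each parameter moves
            # against its own gradient.
            w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
            b -= lr * delta
        costs.append(mean_cost(w, b))
    return w, b, costs


w, b, costs = train()
predictions = [round(forward(x, w, b)) for x, _ in DATA]
print(costs[0], costs[-1], predictions)
```

The cost after training is lower than the initial one, and rounding the outputs reproduces the OR labels, which is what we expect on a linearly separable problem.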

Solving complex problems

One of the aims of neural networks is to solve complex real-world problems. Usually, these problems are not linearly separable, which is one of the major limitations of linear models such as logistic regression. The Universal Approximation Theorem [3] [4] shows that neural networks can address this problem: using only one hidden layer with a finite number of hidden units, neural networks are able to represent a wide variety of functions. The main problem is that the theorem does not tell us how to find the “right” parameters, and there is no unique set of them (this is one of the reasons for the birth of deep learning).

Case study

We want to show how important it is to choose the right network configuration, and why multiple hidden units are necessary. The most famous case study is the non-linear XOR problem.

We have an input represented by two bits 0/1 that can be classified into two classes. As shown in the table below we have:

  • Class 0: two equal bits;
  • Class 1: opposite bits:

Y = X_1 \oplus X_2

X_1 X_2 Y
0 0 0
0 1 1
1 0 1
1 1 0

Now we are going to provide two video examples of how a network behaves with a single neuron and with multiple neurons (all the videos come from [5]).

  • A single neuron is not able to model complex problems; it can only represent linear decision boundaries. A Rosenblatt Perceptron is shown, with two input nodes and a single output neuron, without hidden units.
  • Multiple neurons are able to model real-world problems and complex decision boundaries. The XOR problem is solved by a Multi-Layer Perceptron with a single hidden layer and two hidden units.
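
The same contrast can be sketched in code. This is not what the videos show: instead of training, we use a step activation and hand-picked weights, both of which are assumptions made for this illustration. The first part searches a grid of weights for a single neuron and never gets more than three of the four XOR samples right; the second part wires two hidden units (an OR detector and an AND detector) into a network that gets all four.

```python
from itertools import product

XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def step(z):
    """Threshold activation, as in a Rosenblatt Perceptron."""
    return 1 if z > 0 else 0


def single_neuron(x, w1, w2, b):
    """One output neuron, no hidden units: a linear decision boundary."""
    return step(w1 * x[0] + w2 * x[1] + b)


def mlp_xor(x):
    """2-2-1 network with hand-picked (not learned) weights."""
    h_or = step(x[0] + x[1] - 0.5)    # fires if at least one bit is 1
    h_and = step(x[0] + x[1] - 1.5)   # fires only if both bits are 1
    return step(h_or - h_and - 0.5)   # OR but not AND, which is XOR


# Try every (w1, w2, b) on a coarse grid from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
best_single = max(
    sum(single_neuron(x, w1, w2, b) == y for x, y in XOR_DATA)
    for w1, w2, b in product(grid, repeat=3)
)

mlp_outputs = [mlp_xor(x) for x, _ in XOR_DATA]
print(best_single, mlp_outputs)
```

The grid search is only a finite check, but the result holds in general: no single straight line separates the two XOR classes, while two hidden units are enough to combine two lines into the required boundary.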

References

1. Clipart from:
3. Cybenko, G. (1989) “Approximation by superpositions of a sigmoidal function”, Mathematics of Control, Signals, and Systems, 2(4), 303–314.
4. Hornik, K. (1991) “Approximation Capabilities of Multilayer Feedforward Networks”, Neural Networks, 4(2), 251–257. doi:10.1016/0893-6080(91)90009-T