Nowadays there are plenty of tools and tutorials that let us use sophisticated mathematical concepts with ease. They also bring machine learning and computer science to a very large audience, making it possible to work in this field without a considerable background. Still, I think a lot of people ask themselves “why is one method recommended over another?” and “where does this solution come from?”. Well, I’m going to try to give you some hints to answer these questions.

It seems that there is a strong relationship between cross entropy and what we call likelihood! In the next paragraphs, I’ll try to explain this relationship.

N.B. Some concepts in this article are deliberately not explained in full depth, so that everyone can follow along easily.

Statistics and likelihood

Likelihood is the function that, starting from the parameters of a model, gives as a result the probability that the input is associated with the reference value. To introduce the likelihood we need to take a few steps back.

Let’s start with some basic notation from statistics so we can define probabilities. Given two events X and Y, we denote by P(X) the prior (or known) probability of the event X and by P(Y) the prior probability of the event Y.

If the two events can occur together, we can define their intersection as the joint probability of X and Y, written P(X , Y) = P(Y | X) * P(X). The first term on the right, P(Y | X), is the conditional probability of Y given X, that is, the probability that the event Y happens given that the event X has happened. Looking more closely, the joint probability depends on both events symmetrically, so we can also write it as P(X , Y) = P(X | Y) * P(Y). Ultimately, the two factorizations are equal.
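
To make the relation between joint and conditional probabilities concrete, here is a minimal Python sketch with a made-up joint table for two binary events; the numbers are purely illustrative and not taken from anything above.

# A minimal sketch with made-up numbers: two binary events X and Y.
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

# Marginals P(X=1) and P(Y=1), obtained by summing the joint over the other variable.
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)

# Conditional probabilities P(Y=1 | X=1) and P(X=1 | Y=1).
p_y1_given_x1 = joint[(1, 1)] / p_x1
p_x1_given_y1 = joint[(1, 1)] / p_y1

# The joint factors both ways: P(X, Y) = P(Y | X) P(X) = P(X | Y) P(Y).
assert abs(p_y1_given_x1 * p_x1 - p_x1_given_y1 * p_y1) < 1e-12

print(p_x1, p_y1, p_y1_given_x1)  # roughly 0.5, 0.6, 0.8 (up to float rounding)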

Now that we know about all the probabilities explained so far, we should ask how the conditional probability can be expressed. This is Bayes’ theorem:

  • P(X | Y) = \frac{P(Y | X) * P(X)}{P(Y)};
  • P(Y | X) = \frac{P(X | Y) * P(Y)}{P(X)};

The left-hand term is called the posterior probability, because we see it as depending on the “opposite” conditional probability and on the known probability of the other event. The denominator is called the evidence and it is the sum of the joint probabilities over all possible events; we don’t really care about it, since it is just a normalization factor that, for our purposes, can be dropped. Putting all the previous elements together, we can summarize the last equation as:

posterior\_probability \propto likelihood * prior\_probability.
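
Here is a small Python sketch of this rule for a hypothetical two-class problem (the prior and likelihood values are invented for illustration); it also shows that the evidence is just the normalizer we can safely drop when comparing classes.

# A minimal Bayes-rule sketch with hypothetical numbers.
prior = {"spam": 0.3, "ham": 0.7}       # P(Y), assumed values
likelihood = {"spam": 0.8, "ham": 0.1}  # P(X | Y) for one observed feature X

# Unnormalized posterior: likelihood * prior.
unnormalized = {y: likelihood[y] * prior[y] for y in prior}

# Evidence P(X): the sum of the joint probabilities over all classes.
evidence = sum(unnormalized.values())

# Normalized posterior P(Y | X). Dropping the evidence does not change which
# class has the highest posterior, which is why we can ignore it.
posterior = {y: p / evidence for y, p in unnormalized.items()}
print(posterior)  # {'spam': ~0.77, 'ham': ~0.23}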

What!? We found the likelihood! So it is time to compute this value! The main problem is that, if we try to compute the joint probability with the chain rule, we must multiply the probability of each group of events conditioned on all the remaining ones and on the event Y. It looks like:

  • P(X_1,X_2,\cdots ,X_n|Y) * P(Y);
  • P(X_1,X_2,\cdots ,X_{n-1}|X_n,Y) * P(X_n|Y) * P(Y);
  • P(X_1,X_2,\cdots ,X_{n-2}|X_{n-1},X_n,Y) * P(X_{n-1}|X_n,Y) * P(X_n|Y) * P(Y);
  • and so on, until P(X_1|X_2,\cdots ,X_n,Y) * P(X_2|X_3,\cdots ,X_n,Y) * \cdots * P(X_n|Y) * P(Y).

It is not an easy task… To help ourselves we can assume that all the events X_i are conditionally independent of each other given Y, so that we can simply multiply their conditional probabilities together! In this way the equation becomes:

P(Y | X) \propto P(X_1 | Y) * P(X_2 | Y) * \cdots * P(X_n | Y) * P(Y) = P(Y) \prod_{i=1}^{n} P(X_i | Y)

In the end, the term we are really interested in is the likelihood, i.e. the product of the P(X_i | Y).
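
As a quick illustration, this is what the product above looks like in Python for a handful of hypothetical conditional probabilities P(X_i | Y) and a hypothetical prior P(Y).

# Naive-Bayes-style sketch: unnormalized posterior = P(Y) * prod_i P(X_i | Y).
from math import prod

prior_y = 0.4                               # P(Y), assumed value
feature_likelihoods = [0.9, 0.7, 0.2, 0.8]  # P(X_i | Y), assumed values

unnormalized_posterior = prior_y * prod(feature_likelihoods)
print(unnormalized_posterior)  # 0.04032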

At this point, how we compute the individual probabilities only depends on the kind of distribution we are interested in: it could be the Normal distribution, the Bernoulli, the Poisson and so on. Eventually, there is one last problem: a product of many (small) probabilities is numerically wasteful and unstable, so it is common to apply a function to the likelihood. That function is the logarithm. It is monotonic, so all the properties we care about are preserved and, moreover, the logarithm of a product becomes a sum:

log(P(X | Y)) = log\left(\prod_{i=1}^{n} P(X_i | Y)\right) = \sum_{i=1}^{n} log(P(X_i | Y))

It is called log-likelihood.
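
A short Python sketch (with made-up probabilities) shows why the logarithm is so convenient in practice: the raw product of many small terms underflows to zero, while the sum of their logs stays perfectly usable.

import math

probs = [0.01] * 500  # 500 hypothetical terms P(X_i | Y)

product = math.prod(probs)                        # underflows to 0.0
log_likelihood = sum(math.log(p) for p in probs)  # sum of the logs

print(product)         # 0.0
print(log_likelihood)  # about -2302.6, still perfectly usable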

Cross entropy loss function

As mentioned in a previous post, a loss function is what allows our models to learn, and it gives us reasonable results in many tasks such as classification and prediction. One of the most widely used loss functions is the so-called cross entropy. It can be used both in binary and in multi-class problems; in the binary case it is formulated as:

CE(\hat{y}, y) = - \sum_{i=1}^{n} \left[ y_i log(\hat{y}_i) + (1 - y_i) log(1 - \hat{y}_i) \right]

where y_i is the class label and \hat{y}_i is the prediction made by our hypothesis, which can be written as \hat{y} = h(x). This means that the output is obtained from a hypothesis function that tries to estimate the true value of y. Estimation is the keyword! It refers to the operation of finding a value of \hat{y} that is as plausible as possible given y. The proper name for this kind of operation is likelihood (maximization); in fact, cross entropy is also called the negative log-likelihood.
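
Written directly from the formula above, a minimal Python version of the binary cross entropy could look like the sketch below; the labels and predictions are hypothetical, and the small eps clipping is just a practical guard against log(0).

import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Sum of y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i), negated.
    total = 0.0
    for yi, pi in zip(y, y_hat):
        pi = min(max(pi, eps), 1 - eps)  # avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total

y = [1, 0, 1, 1]              # hypothetical class labels
y_hat = [0.9, 0.2, 0.8, 0.6]  # hypothetical predictions h(x)
print(binary_cross_entropy(y, y_hat))  # about 1.06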

How can we derive this function from the likelihood explained above? Without being too formal, and to make things easier, we address the problem with the Bernoulli distribution, whose general form is:

P(y; x) = x^{y} (1-x)^{1-y}
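
As a tiny sanity check, here is the Bernoulli form in Python, where x plays the role of the success probability (the value 0.7 is just an example).

def bernoulli_pmf(y, x):
    # P(y; x) = x^y * (1 - x)^(1 - y), with y in {0, 1}.
    return x ** y * (1 - x) ** (1 - y)

print(bernoulli_pmf(1, 0.7))  # 0.7
print(bernoulli_pmf(0, 0.7))  # 0.3 (up to float rounding)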

Plugging it into the likelihood, we can rewrite the latter as:

P(y | x) = \prod_{i=1}^{n} P(y | x_i)^{y_i} (1-P(y | x_i))^{1-y_i}
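
For concreteness, the product form can be evaluated like this in Python, again with hypothetical labels y_i and hypothetical predicted probabilities standing in for P(y | x_i).

from math import prod

y = [1, 0, 1, 1]          # hypothetical labels
p = [0.9, 0.2, 0.8, 0.6]  # hypothetical P(y | x_i)

# prod_i P(y | x_i)^y_i * (1 - P(y | x_i))^(1 - y_i)
likelihood = prod(pi ** yi * (1 - pi) ** (1 - yi) for yi, pi in zip(y, p))
print(likelihood)  # about 0.3456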

We have used the “pure” likelihood so far, but in this case too it is possible to apply the logarithm; thanks to the properties of logarithms we obtain:

log(P(y | x)) = \sum_{i=1}^{n} y_i log(P(y | x_i)) + (1 - y_i) log(1 - P(y | x_i))
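
In Python this log-likelihood is just a sum over the examples; here the labels and the probabilities P(y | x_i) are the same hypothetical values used above.

import math

y = [1, 0, 1, 1]          # hypothetical labels
p = [0.9, 0.2, 0.8, 0.6]  # hypothetical P(y | x_i)

log_likelihood = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                     for yi, pi in zip(y, p))
print(log_likelihood)  # about -1.06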

Rewriting the cross entropy in this same generic notation, it follows that:

CE(x, y) = - \sum_{i=1}^{n} \left[ y_i log(P(y | x_i)) + (1 - y_i) log(1 - P(y | x_i)) \right]

We obtained the same function! The minus sign is added because in training we minimize the loss, which is exactly equivalent to maximizing the log-likelihood.
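
As a final check, a few lines of Python (with the same hypothetical labels and predictions used above) confirm that the cross entropy is exactly the negative of the Bernoulli log-likelihood, so minimizing one is the same as maximizing the other.

import math

y = [1, 0, 1, 1]          # hypothetical labels
p = [0.9, 0.2, 0.8, 0.6]  # hypothetical P(y | x_i)

# Log-likelihood of the Bernoulli model.
log_likelihood = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                     for yi, pi in zip(y, p))

# Cross entropy computed from its own definition.
cross_entropy = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                     for yi, pi in zip(y, p))

assert abs(cross_entropy + log_likelihood) < 1e-12
print(log_likelihood, cross_entropy)  # about -1.06 and 1.06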