The cost function is a function that computes the cost of a mathematical model based on how far it is from the data, or more precisely, how well the model represents a set of data. In a nutshell, it is a general-purpose tool whose exact form depends on the kind of problem we need to solve.

The cost function is used to estimate the state of an optimization model when solving problems such as regression, prediction, and classification. It measures how effective the model's behavior is, that is, the error it makes in fitting the data.

Model example

Take as an example a simple linear regression with one variable and two parameters, represented as follows:

h_\theta(x) = \theta_0 + \theta_1 x_1

In the figure below we can see an example of a model (red line) that tries to represent our data (blue circles). The parameter \theta_1 is chosen arbitrarily, while \theta_0 is always set to 0, so we are essentially working with a one-parameter function.

[Figure: data points (blue circles) and the linear model (red line)]
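As a minimal sketch, the hypothesis above can be written as a one-line Python function; the value of \theta_1 used here is an arbitrary placeholder, not the one used in the figure:

```python
def hypothesis(x, theta_0=0.0, theta_1=1.0):
    """Simple linear model: h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# Hypothetical usage: theta_0 is kept at 0, theta_1 is arbitrary.
print(hypothesis(0.5, theta_0=0.0, theta_1=0.8))  # -> 0.4
```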

Cost estimation

The important question is:

how does this model fit the data?

The answer is given by its “cost”! The closer the cost/error is to zero, the better our model fits the data. The error is computed over the whole training set as the mean of the squared differences between the model's estimates and the known samples. As the cost function we chose the Mean Squared Error (with an extra factor of 2 in the denominator, a common normalization that simplifies the derivative):

MSE = \frac{1}{2n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right)^2,\qquad n = \text{number of samples}
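A direct NumPy translation of this formula might look as follows; the data points are invented placeholders, so the printed cost will not match the 0.0542 reported below:

```python
import numpy as np

def mse(theta_0, theta_1, x, y):
    """Mean Squared Error with the extra 1/2 factor: (1/(2n)) * sum((h(x_i) - y_i)^2)."""
    predictions = theta_0 + theta_1 * x
    return np.mean((predictions - y) ** 2) / 2.0

# Hypothetical data points (placeholders, not the data from the plots above).
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([0.15, 0.35, 0.45, 0.80, 0.85])

print(mse(theta_0=0.0, theta_1=1.0, x=x, y=y))
```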

For our particular case, the resulting error is about 0.0542. Unfortunately, this answer alone is not enough. We should also ask ourselves: what happens if we vary the parameters that define the linear regression? Let us try choosing random parameters twice:

[Figure: two models with randomly chosen parameters plotted against the data]

What we obtain is two new models, with different intercepts and slopes, that represent our data differently. They do not seem to perform reasonably well, so we compute the two costs in order to compare them effectively (a small sketch of this comparison follows the list):

  • First example has a cost c = 0.0557;
  • Second example has a cost c = 0.1305;
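Such a comparison can be sketched by evaluating the same cost function on two arbitrary parameter pairs; the data and parameters below are invented, so the printed costs will not match 0.0557 and 0.1305:

```python
import numpy as np

def mse(theta_0, theta_1, x, y):
    """Halved Mean Squared Error, as defined above."""
    return np.mean((theta_0 + theta_1 * x - y) ** 2) / 2.0

# Hypothetical data and two arbitrary (intercept, slope) pairs.
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([0.15, 0.35, 0.45, 0.80, 0.85])

for theta_0, theta_1 in [(0.2, 0.6), (-0.1, 1.5)]:
    print(f"theta_0={theta_0}, theta_1={theta_1} -> cost={mse(theta_0, theta_1, x, y):.4f}")
```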

So the first example unquestionably gives us a better representation of our data. Now that we have seen a few more evaluations of the cost function, let us vary the \theta_1 parameter, compute the error for each value, and plot it, in order to visualize what the cost function actually looks like:

In the upper and lower plots, \theta_1 starts from a value that gives a very high cost, cost(\theta_1) \approx 4.2, and its value changes at each step until the cost becomes very low, at \theta_1 \approx 0.1.
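The sweep itself can be sketched with NumPy and matplotlib; the data here is again hypothetical, so the resulting curve only illustrates the shape of the cost function, not the exact values 4.2 and 0.1:

```python
import numpy as np
import matplotlib.pyplot as plt

def mse(theta_1, x, y):
    """Cost as a function of theta_1 only (theta_0 is fixed to 0)."""
    return np.mean((theta_1 * x - y) ** 2) / 2.0

# Hypothetical data points (placeholders for the blue circles).
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([0.15, 0.35, 0.45, 0.80, 0.85])

# Sweep theta_1 over a range and record the cost for each value.
thetas = np.linspace(-2.0, 4.0, 100)
costs = [mse(t, x, y) for t in thetas]

plt.plot(thetas, costs)
plt.xlabel(r"$\theta_1$")
plt.ylabel(r"cost($\theta_1$)")
plt.show()
```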


How many cost functions?

The cost function is not unique: there are many functions that can be used, depending on the problem you need to address. In our example we used the Mean Squared Error, but it is not a good choice for every task.

For example, in a classification problem we need to find a boundary between two or more classes, so our model should not be a perfect representation of the data. This means that MSE misses the point: it measures the error as the distance between the predicted output and the data, so the closer the output is to the data, the lower the error.

Unfortunately, as mentioned above, in classification tasks we do not need to fit the data but to find boundaries between classes. Usually the most important property of a cost function is differentiability, because in order to minimize the loss we need to compute its partial derivatives with respect to the model's parameters.
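As one concrete example of an alternative, here is a sketch of binary cross-entropy, a standard loss for classification problems (it is not discussed above, it simply illustrates that different tasks call for different cost functions):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p)); eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Toy labels and predicted probabilities, for illustration only.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.6, 0.3])
print(binary_cross_entropy(y_true, y_pred))
```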