9  Loss functions

The concept of a loss function is essential to machine learning. At any iteration, the current loss value indicates how far the estimate is from the target. It is then used to update the parameters in a direction that will decrease the loss.

In our applied example, we have already made use of a loss function: mean squared error – or rather, its close relative, the sum of squared errors – computed manually as

library(torch)

# sum of squared differences between predictions and targets
loss <- (y_pred - y)$pow(2)$sum()

As you might expect, here is another area where this kind of manual effort is not needed.

In this final conceptual chapter before we refactor our running examples, we want to talk about two things: first, how to make use of torch’s built-in loss functions; and second, which function to choose.

9.1 torch loss functions

In torch, loss functions start with nn_ or nnf_.

Using nnf_, you directly call a function. Correspondingly, its arguments (estimate and target) are both tensors. For example, here is nnf_mse_loss(), the built-in analog to what we coded manually:

nnf_mse_loss(torch_ones(2, 2), torch_zeros(2, 2) + 0.1)
torch_tensor
0.81
[ CPUFloatType{} ]
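
For comparison, here is the same quantity computed by hand, this time averaging (rather than summing) the squared differences – a quick check that should reproduce the 0.81 above:

(torch_ones(2, 2) - (torch_zeros(2, 2) + 0.1))$pow(2)$mean()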

With nn_, in contrast, you create an object:

l <- nn_mse_loss()

This object can then be called on tensors to yield the desired loss:

l(torch_ones(2, 2), torch_zeros(2, 2) + 0.1)
torch_tensor
0.81
[ CPUFloatType{} ]

Whether to choose object or function is mainly a matter of preference and context. In larger models, you may end up combining several loss functions, and then, creating loss objects can result in more modular, more maintainable code. In this book, I’ll mainly use the functional (nnf_) interface, unless there are compelling reasons to do otherwise.
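
To illustrate the point about combining losses, here is a minimal sketch; the 0.5 weighting is arbitrary, and combined_loss() is just a hypothetical helper, not a torch function:

# create the component losses once ...
mse <- nn_mse_loss()
mae <- nn_l1_loss()

# ... then reuse them inside a single helper
# (hypothetical; the 0.5 weight is chosen for illustration only)
combined_loss <- function(y_pred, y) {
  mse(y_pred, y) + 0.5 * mae(y_pred, y)
}

combined_loss(torch_ones(2, 2), torch_zeros(2, 2) + 0.1)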

On to the second question.

9.2 What loss function should I choose?

In deep learning, or machine learning overall, most applications aim to do one (or both) of two things: predict a numerical value, or estimate a probability. The regression task of our running example does the former; real-world applications might forecast temperatures, estimate employee churn rates, or predict sales. In the second group, the prototypical task is classification. To categorize, say, an image according to its most salient content, what we really compute are per-class probabilities. Then, when the probability for “dog” is 0.7, while that for “cat” is 0.3, we say it’s a dog.

9.2.1 Maximum likelihood

In both classification and regression, the most commonly used loss functions are built on the maximum likelihood principle. Maximum likelihood means: We want to choose model parameters in such a way that the data, the things we have observed or could have observed, are maximally likely. This principle is not “just” fundamental; it is also intuitively appealing. Imagine a simple example.

Say we have the values 7.1, 22.14, and 11.3, and we know that the underlying process follows a normal distribution. Then it is much more likely that these data have been generated by a distribution with mean 14 and standard deviation 7 than by one with mean 20 and standard deviation 1.
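
To make this concrete, here is a quick base-R check (not part of the running example) comparing the log-likelihoods of these three values under the two candidate distributions; the first sum should come out far higher (less negative) than the second:

obs <- c(7.1, 22.14, 11.3)

# log-likelihood of the observations under N(14, 7)
sum(dnorm(obs, mean = 14, sd = 7, log = TRUE))

# log-likelihood of the observations under N(20, 1)
sum(dnorm(obs, mean = 20, sd = 1, log = TRUE))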

9.2.2 Regression

In regression (which implicitly assumes the target distribution to be normal1), to maximize likelihood, we just keep using mean squared error – the loss we’ve been computing all along. Maximum likelihood estimators have all kinds of desirable statistical properties. However, in concrete applications, there may be reasons to prefer a different loss.

For example, say a dataset has outliers where, for some reason, prediction and target deviate substantially. Mean squared error will allocate high importance to these outliers. In such cases, possible alternatives are mean absolute error (nnf_l1_loss()) and smooth L1 loss (nnf_smooth_l1_loss()). The latter is a mixture type that, by default, computes the absolute (L1) error, but switches to squared (L2) error whenever the absolute errors get very small.
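
Here is a quick sketch of the effect, on made-up values where a single prediction is far off the mark:

y <- torch_tensor(c(1, 2, 3, 4, 5))
y_pred <- torch_tensor(c(1.1, 1.9, 3.2, 3.9, 25)) # last prediction is way off

nnf_mse_loss(y_pred, y) # dominated by the outlier
nnf_l1_loss(y_pred, y) # much less affected
nnf_smooth_l1_loss(y_pred, y) # L1-like for large errors, L2-like for small ones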

9.2.3 Classification

In classification, we are comparing two distributions. The estimate is a probability by design, and the target can be viewed as one, too. In that light, maximum likelihood estimation is equivalent to minimizing the Kullback-Leibler divergence (KL divergence).

KL divergence is a measure of how two distributions differ. It depends on two things: the likelihood of the data under the true data-generating process, and their likelihood under the model. The former, however, does not depend on the model’s parameters, so in the machine learning scenario we can ignore it. What remains to be minimized is the cross-entropy between the two distributions. And cross-entropy loss is exactly what is commonly used in classification tasks.
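
A tiny base-R illustration of that relationship, using made-up class probabilities p (the “true” distribution) and q (the model’s):

p <- c(0.7, 0.2, 0.1) # "true" class probabilities (made up)
q <- c(0.5, 0.3, 0.2) # class probabilities as estimated by the model (made up)

cross_entropy <- -sum(p * log(q))
entropy <- -sum(p * log(p)) # does not depend on the model

# KL divergence equals cross-entropy minus entropy; since the entropy term
# is fixed, minimizing cross-entropy minimizes KL divergence
kl_divergence <- cross_entropy - entropy

cross_entropy
kl_divergence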

In torch, there are several variants of loss functions that calculate cross-entropy. With this topic, it’s nice to have a quick reference around; so here is a lookup table (tbl. 9.1 abbreviates the – rather long-ish – function names; see tbl. 9.2 for the mapping):

Table 9.1: Loss functions, by type of data they work on (binary vs. multi-class) and expected input (raw scores, probabilities, or log probabilities).

        Data                      Input
        binary   multi-class      raw scores   probabilities   log probs
BCeL    Y                         Y
Ce               Y                Y
BCe     Y                                      Y
Nll              Y                                             Y

Table 9.2: Abbreviations used to refer to torch loss functions.

BCeL    nnf_binary_cross_entropy_with_logits()
Ce      nnf_cross_entropy()
BCe     nnf_binary_cross_entropy()
Nll     nnf_nll_loss()

To pick the function applicable to your use case, there are two things to consider.

First, are there just two possible classes (“dog vs. cat”, “person present / person absent”, etc.), or are there several?

And second, what is the type of the estimated values? Are they raw scores (in theory, any value between plus and minus infinity)? Are they probabilities (values between 0 and 1)? Or (finally) are they log probabilities, that is, probabilities to which a logarithm has been applied? (In the final case, all values should be either negative or equal to zero.)

9.2.3.1 Binary data

Starting with binary data, our example classification vector is a sequence of zeros and ones. When thinking in terms of probabilities, it is most intuitive to imagine the ones standing for presence, the zeros for absence of one of the classes in question – cat or no cat, say.

target <- torch_tensor(c(1, 0, 0, 1, 1))

The raw scores could be anything. For example:

unnormalized_estimate <-
  torch_tensor(c(3, 2.7, -1.2, 7.7, 1.9))

To turn these into probabilities, all we need to do is pass them to nnf_sigmoid(). nnf_sigmoid() squishes its argument to values between zero and one:

probability_estimate <- nnf_sigmoid(unnormalized_estimate)
probability_estimate
torch_tensor
 0.9526
 0.9370
 0.2315
 0.9995
 0.8699
[ CPUFloatType{5} ]
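
In case you’re wondering what nnf_sigmoid() does under the hood: it computes 1 / (1 + exp(-x)), element-wise. The following should yield the same values:

1 / (1 + torch_exp(-unnormalized_estimate))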

From the above table, we see that both unnormalized_estimate and probability_estimate can serve as inputs to a loss function – we just have to pick the one appropriate to each. Provided we do that, the resulting loss will be the same in both cases.

Let’s see (raw scores first):

nnf_binary_cross_entropy_with_logits(
  unnormalized_estimate, target
)
torch_tensor
0.643351
[ CPUFloatType{} ]

And now, probabilities:

nnf_binary_cross_entropy(probability_estimate, target)
torch_tensor
0.643351
[ CPUFloatType{} ]

That worked as expected. What does this mean in practice? It means that when we build a model for binary classification, and the final layer computes an unnormalized score, we don’t need to attach a sigmoid layer to obtain probabilities. We can just call nnf_binary_cross_entropy_with_logits() when training the network. In fact, doing so is the preferred way, not least for reasons of numerical stability.
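
To make that concrete, here is a minimal sketch (architecture and data are made up for illustration) of how training and prediction would differ:

# a binary classifier whose last layer emits raw scores -- no nn_sigmoid()
net <- nn_sequential(
  nn_linear(10, 8),
  nn_relu(),
  nn_linear(8, 1)
)

x <- torch_randn(5, 10)
y <- torch_tensor(c(1, 0, 0, 1, 1))$unsqueeze(2)

# during training, feed the raw scores to the loss directly
loss <- nnf_binary_cross_entropy_with_logits(net(x), y)

# only when we want actual probabilities do we apply the sigmoid
probs <- nnf_sigmoid(net(x))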

9.2.3.2 Multi-class data

Moving on to multi-class data, the most intuitive framing now really is in terms of (several) classes, not presence or absence of a single class. Think of classes as class indices (indexing, perhaps, into some look-up table). Being indices, they technically start at 1:

target <- torch_tensor(c(2, 1, 3, 1, 3), dtype = torch_long())

In the multi-class scenario, raw scores are a two-dimensional tensor. Each row contains the scores for one observation, and each column corresponds to one of the classes. Here’s how the raw estimates could look:

unnormalized_estimate <- torch_tensor(
  rbind(c(1.2, 7.7, -1),
    c(1.2, -2.1, -1),
    c(0.2, -0.7, 2.5),
    c(0, -0.3, -1),
    c(1.2, 0.1, 3.2)
  )
)

As per the above table, given this estimate, we should be calling nnf_cross_entropy() (and we will, when we compare results below).

So that’s the first option, and it works exactly as with binary data. For the second, there is an additional step.

First, we again turn raw scores into probabilities, using nnf_softmax(). For most practical purposes, nnf_softmax() can be seen as the multi-class equivalent of nnf_sigmoid(). Strictly though, their effects are not the same. In a nutshell, nnf_sigmoid() maps each score independently, while nnf_softmax() normalizes over all scores in a row, accentuating the distance between the top score and the remaining ones (“winner takes all”).
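
To see the difference on a toy vector (values made up, unrelated to our example):

scores <- torch_tensor(c(1, 2, 3))

nnf_sigmoid(scores) # each value mapped on its own
nnf_softmax(scores, dim = 1) # normalized: values sum to 1, gaps get accentuated

With that in mind, back to our raw estimates: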

probability_estimate <- nnf_softmax(unnormalized_estimate,
  dim = 2
)
probability_estimate
torch_tensor
 0.0015  0.9983  0.0002
 0.8713  0.0321  0.0965
 0.0879  0.0357  0.8764
 0.4742  0.3513  0.1745
 0.1147  0.0382  0.8472
[ CPUFloatType{5,3} ]

The second step, the one that was not required in the binary case, consists in transforming the probabilities to log probabilities. In our example, this could be accomplished by calling torch_log() on the probability_estimate we just computed. Alternatively, both steps together are taken care of by nnf_log_softmax():

logprob_estimate <- nnf_log_softmax(unnormalized_estimate,
  dim = 2
)
logprob_estimate
torch_tensor
-6.5017 -0.0017 -8.7017
-0.1377 -3.4377 -2.3377
-2.4319 -3.3319 -0.1319
-0.7461 -1.0461 -1.7461
-2.1658 -3.2658 -0.1658
[ CPUFloatType{5,3} ]
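
As a quick sanity check, taking the logarithm of the probabilities computed above should yield these same values:

torch_log(probability_estimate)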

Now that we have estimates in both possible forms, we can again compare results from applicable loss functions. First, nnf_cross_entropy() on the raw scores:

nnf_cross_entropy(unnormalized_estimate, target)
torch_tensor
0.23665
[ CPUFloatType{} ]

And second, nnf_nll_loss() on the log probabilities:

nnf_nll_loss(logprob_estimate, target)
torch_tensor
0.23665
[ CPUFloatType{} ]

Application-wise, what was said for the binary case applies here as well: In a multi-class classification network, there is no need to have a softmax layer at the end.
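
Analogously to the binary sketch above, a multi-class classifier can end in a plain linear layer emitting raw scores (again, architecture and data are made up):

n_classes <- 3

net <- nn_sequential(
  nn_linear(10, 8),
  nn_relu(),
  nn_linear(8, n_classes) # no softmax layer
)

x <- torch_randn(5, 10)
y <- torch_tensor(c(2, 1, 3, 1, 3), dtype = torch_long())

# during training, raw scores go straight into the loss
loss <- nnf_cross_entropy(net(x), y)

# softmax is applied only when we actually want probabilities
probs <- nnf_softmax(net(x), dim = 2)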

Before we end this chapter, let’s address a question that might have come to mind. Is not binary classification a sub-type of the multi-class setup? Should we not, in that case, arrive at the same result, whatever the method chosen?

9.2.3.3 Check: Binary data, multi-class method

Let’s see. We re-use the binary-classification scenario employed above. Here it is again:

target <- torch_tensor(c(1, 0, 0, 1, 1))

unnormalized_estimate <- 
  torch_tensor(c(3, 2.7, -1.2, 7.7, 1.9))

probability_estimate <- nnf_sigmoid(unnormalized_estimate)

nnf_binary_cross_entropy(probability_estimate, target)
torch_tensor
0.64335
[ CPUFloatType{} ]

We hope to get the same value doing things the multi-class way. We already have the probabilities (namely, probability_estimate); we just need to put them into the “observation by class” format expected by nnf_nll_loss():

# probabilities, arranged as one row per observation and one column per
# class (column 1: class "0", column 2: class "1"); values copied, rounded,
# from probability_estimate above
multiclass_probability <- torch_tensor(rbind(
  c(1 - 0.9526, 0.9526),
  c(1 - 0.9370, 0.9370),
  c(1 - 0.2315, 0.2315),
  c(1 - 0.9995, 0.9995),
  c(1 - 0.8699, 0.8699)
))

Now, we still want to apply the logarithm. And there is one other thing to be taken care of: In the binary setup, classes were coded as 0 and 1; now, we’re dealing with class indices, which start at 1. This means we add 1 to the target tensor:

target <- target + 1

Finally, we can call nnf_nll_loss():

nnf_nll_loss(
  torch_log(multiclass_probability),
  target$to(dtype = torch_long())
)
torch_tensor
0.643275
[ CPUFloatType{} ]

There we go. Up to the small discrepancy caused by the rounded probabilities we typed in, the results are indeed the same.
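
If you want to avoid the rounding altogether, the two-column tensor can be built directly from probability_estimate (a sketch; note that at this point, target has already been incremented). This should reproduce the binary result exactly:

# stack (1 - p) and p column-wise: one row per observation, one column per class
multiclass_probability <- torch_stack(
  list(1 - probability_estimate, probability_estimate),
  dim = 2
)

nnf_nll_loss(
  torch_log(multiclass_probability),
  target$to(dtype = torch_long())
)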


  1. For cases where that assumption seems unlikely, distribution-adequate loss functions are provided (e.g., Poisson negative log likelihood, available as nnf_poisson_nll_loss()).↩︎