# 11  Modularizing the neural network

Let’s recall the network we built a few chapters ago. Its purpose was regression, but its method was not linear. Instead, an activation function (ReLU, for “rectified linear unit”) introduced a nonlinearity, located between the single hidden layer and the output layer. The “layers”, in this original implementation, were just tensors: weights and biases. You won’t be surprised to hear that these will be replaced by modules.
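For reference, here is a minimal sketch of that tensor-only approach (a reconstruction, not a verbatim quote; it uses `d_in`, `d_hidden`, and `d_out` as defined later in this chapter):

```r
# weights and biases are plain tensors, flagged for gradient tracking
w1 <- torch_randn(d_in, d_hidden, requires_grad = TRUE)
b1 <- torch_zeros(1, d_hidden, requires_grad = TRUE)
w2 <- torch_randn(d_hidden, d_out, requires_grad = TRUE)
b2 <- torch_zeros(1, d_out, requires_grad = TRUE)

# the forward pass chains tensor operations: affine, ReLU, affine
y_pred <- x$mm(w1)$add(b1)$relu()$mm(w2)$add(b2)
```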

How will the training process change? Conceptually, we can distinguish four phases: the forward pass, loss computation, backpropagation of gradients, and weight updating. Let’s think about where our new tools will fit in (a skeleton of the resulting loop follows the list):

• The forward pass, instead of calling functions on tensors, will call the model.

• In computing the loss, we now make use of `torch`’s `nnf_mse_loss()`.

• Backpropagation of gradients is, in fact, the only operation that remains unchanged.

• Weight updating is taken care of by the optimizer.
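Put together, a single iteration of the new loop will look like this (just a skeleton; the complete, runnable version follows in section 11.3):

```r
y_pred <- net(x)                  # forward pass: call the model
loss <- nnf_mse_loss(y_pred, y)   # loss, using torch's built-in function
opt$zero_grad()                   # zero out gradients from the previous iteration
loss$backward()                   # backpropagation (unchanged)
opt$step()                        # weight update, handled by the optimizer
```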

Once we’ve made those changes, the code will be more modular, and a lot more readable.

## 11.1 Data

As a prerequisite, we generate the data, same as last time.

```r
library(torch)

# input dimensionality (number of input features)
d_in <- 3
# number of observations in training set
n <- 100

x <- torch_randn(n, d_in)
coefs <- c(0.2, -1.3, -0.5)
y <- x$matmul(coefs)$unsqueeze(2) + torch_randn(n, 1)
```
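If you’d like to verify the shapes involved (an optional check, not part of the workflow proper), `x` should be `n` by `d_in`, and `y`, `n` by 1:

```r
x$shape  # 100 3
y$shape  # 100 1
```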

## 11.2 Network

Since the network just stacks two linear layers, connected by ReLU activation, the easiest choice is a sequential module, very similar to the one we saw in the introduction to modules:

```r
# dimensionality of hidden layer
d_hidden <- 32
# output dimensionality (number of predicted features)
d_out <- 1

net <- nn_sequential(
  nn_linear(d_in, d_hidden),
  nn_relu(),
  nn_linear(d_hidden, d_out)
)
```
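Like any module, `net` is callable, and the container collects the parameters of its sub-modules automatically. An optional quick check:

```r
# one prediction per observation
net(x)$shape  # 100 1

# four parameter tensors: a weight matrix and a bias vector per linear layer
length(net$parameters)  # 4
```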

## 11.3 Training

Here is the updated training process. For the weight updates, we use Adam, a popular optimizer; it is instantiated with the parameters it is going to update. One thing does not change: gradients still need to be zeroed out before each backward pass, only now we do it on the optimizer, via `opt$zero_grad()`.

```r
opt <- optim_adam(net$parameters)

### training loop --------------------------------------

for (t in 1:200) {

  ### -------- Forward pass --------
  y_pred <- net(x)

  ### -------- Compute loss --------
  loss <- nnf_mse_loss(y_pred, y)
  if (t %% 10 == 0)
    cat("Epoch: ", t, "   Loss: ", loss$item(), "\n")

  ### -------- Backpropagation --------
  # gradients accumulate by default; zero them out each iteration,
  # now conveniently on the optimizer object
  opt$zero_grad()
  loss$backward()

  ### -------- Update weights --------
  opt$step()

}
```
```
Epoch:  10    Loss:  2.549933
Epoch:  20    Loss:  2.422556
Epoch:  30    Loss:  2.298053
Epoch:  40    Loss:  2.173909
Epoch:  50    Loss:  2.0489
Epoch:  60    Loss:  1.924003
Epoch:  70    Loss:  1.800404
Epoch:  80    Loss:  1.678221
Epoch:  90    Loss:  1.56143
Epoch:  100    Loss:  1.453637
Epoch:  110    Loss:  1.355832
Epoch:  120    Loss:  1.269234
Epoch:  130    Loss:  1.195116
Epoch:  140    Loss:  1.134008
Epoch:  150    Loss:  1.085828
Epoch:  160    Loss:  1.048921
Epoch:  170    Loss:  1.021384
Epoch:  180    Loss:  1.0011
Epoch:  190    Loss:  0.9857832
Epoch:  200    Loss:  0.973796
```

In addition to shortening and streamlining the code, our changes have substantially improved training performance.
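As an optional sanity check (not part of the training workflow proper), we can compare the network’s predictions with the noiseless signal underlying the data. Since the added noise has variance 1, a training loss approaching 1 suggests the network has captured most of the signal:

```r
# evaluate without tracking gradients
with_no_grad({
  signal <- x$matmul(coefs)$unsqueeze(2)
  cat("MSE vs. noiseless signal: ", nnf_mse_loss(net(x), signal)$item(), "\n")
})
```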

## 11.4 What’s to come

You now know a lot about how `torch` works, and how to use it to minimize a cost function in various settings: for example, to train a neural network. But for real-world applications, there is a lot more `torch` has to offer. The next – and most voluminous – part of the book focuses on deep learning.