```
library(torch)

# input dimensionality (number of input features)
d_in <- 3
# number of observations in training set
n <- 100

x <- torch_randn(n, d_in)
coefs <- c(0.2, -1.3, -0.5)
y <- x$matmul(coefs)$unsqueeze(2) + torch_randn(n, 1)
```

# 11 Modularizing the neural network

Let’s recall the network we built a few chapters ago. Its purpose was regression, but its method was not *linear*. Instead, an activation function (ReLU, for “rectified linear unit”) introduced a nonlinearity, located between the single hidden layer and the output layer. The “layers”, in this original implementation, were just tensors: weights and biases. You won’t be surprised to hear that these will be replaced by *modules*.

How will the training process change? Conceptually, we can distinguish four phases: the forward pass, loss computation, backpropagation of gradients, and weight updating. Let’s think about where our new tools will fit in:

- The forward pass, instead of calling functions on tensors, will call the model.
- In computing the loss, we now make use of `torch`’s `nnf_mse_loss()`.
- Backpropagation of gradients is, in fact, the only operation that remains unchanged.
- Weight updating is taken care of by the optimizer.

Once we’ve made those changes, the code will be more modular, and a lot more readable.
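To make the last point concrete, here is a minimal sketch of what "weight updating is taken care of by the optimizer" means. It performs one gradient-descent step by hand on one network, and the same step via `optim_sgd()` on an identically initialized copy. Plain SGD is used here instead of Adam only because its update rule is simple enough to write out in one line. The helper `make_net()`, the learning rate, and the random data are assumptions made for illustration.

```r
library(torch)

# hypothetical helper: builds the same small network each time it is called
make_net <- function() {
  nn_sequential(
    nn_linear(3, 32),
    nn_relu(),
    nn_linear(32, 1)
  )
}

# illustrative random data and learning rate
x <- torch_randn(100, 3)
y <- torch_randn(100, 1)
lr <- 0.1

# resetting the seed gives both networks identical initial weights
torch_manual_seed(1)
net_manual <- make_net()
torch_manual_seed(1)
net_optim <- make_net()

# --- manual update: subtract lr * gradient from every parameter ---
loss <- nnf_mse_loss(net_manual(x), y)
loss$backward()
with_no_grad({
  for (p in net_manual$parameters) {
    p$sub_(lr * p$grad)
  }
})

# --- the same step, delegated to an optimizer ---
opt <- optim_sgd(net_optim$parameters, lr = lr)
loss <- nnf_mse_loss(net_optim(x), y)
opt$zero_grad()
loss$backward()
opt$step()

# compare the two sets of parameters, tensor by tensor
same <- mapply(
  function(a, b) torch_allclose(a, b),
  net_manual$parameters, net_optim$parameters
)
all(same)
```

Under these assumptions, both routes should leave the parameters in the same state. The optimizer version, however, does not need to know how many parameters there are, or what the update rule looks like.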

## 11.1 Data

As a prerequisite, we generate the data, same as last time.

## 11.2 Network

With two linear layers connected via ReLU activation, the easiest choice is a sequential module, very similar to the one we saw in the introduction to modules:

```
# dimensionality of hidden layer
d_hidden <- 32
# output dimensionality (number of predicted features)
d_out <- 1

net <- nn_sequential(
  nn_linear(d_in, d_hidden),
  nn_relu(),
  nn_linear(d_hidden, d_out)
)
```
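Before training, it can help to check what the sequential module contains. The following is a small sketch that rebuilds the network on its own, lists its parameters, and runs an untrained forward pass; the input `x_check` is a stand-in made up for this check.

```r
library(torch)

d_in <- 3
d_hidden <- 32
d_out <- 1

net <- nn_sequential(
  nn_linear(d_in, d_hidden),
  nn_relu(),
  nn_linear(d_hidden, d_out)
)

# each linear layer contributes a weight and a bias tensor;
# nn_relu() has no parameters, so there are four tensors in total
n_params <- length(net$parameters)
n_params

# shapes of the individual parameter tensors
lapply(net$parameters, dim)

# calling the module on a batch runs the forward pass;
# x_check is a hypothetical stand-in input of 100 observations
x_check <- torch_randn(100, d_in)
y_check <- net(x_check)
dim(y_check)
```

The list returned by `net$parameters` is what gets handed to the optimizer, so no weight or bias has to be tracked by hand.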

## 11.3 Training

Here is the updated training process. We use the Adam optimizer, a popular choice.

```
opt <- optim_adam(net$parameters)

### training loop --------------------------------------
for (t in 1:200) {

  ### -------- Forward pass --------
  y_pred <- net(x)

  ### -------- Compute loss --------
  loss <- nnf_mse_loss(y_pred, y)
  if (t %% 10 == 0)
    cat("Epoch: ", t, " Loss: ", loss$item(), "\n")

  ### -------- Backpropagation --------
  opt$zero_grad()
  loss$backward()

  ### -------- Update weights --------
  opt$step()
}
```

```
Epoch: 10 Loss: 2.549933
Epoch: 20 Loss: 2.422556
Epoch: 30 Loss: 2.298053
Epoch: 40 Loss: 2.173909
Epoch: 50 Loss: 2.0489
Epoch: 60 Loss: 1.924003
Epoch: 70 Loss: 1.800404
Epoch: 80 Loss: 1.678221
Epoch: 90 Loss: 1.56143
Epoch: 100 Loss: 1.453637
Epoch: 110 Loss: 1.355832
Epoch: 120 Loss: 1.269234
Epoch: 130 Loss: 1.195116
Epoch: 140 Loss: 1.134008
Epoch: 150 Loss: 1.085828
Epoch: 160 Loss: 1.048921
Epoch: 170 Loss: 1.021384
Epoch: 180 Loss: 1.0011
Epoch: 190 Loss: 0.9857832
Epoch: 200 Loss: 0.973796
```

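In addition to shortening and streamlining the code, our changes have made a big difference performance-wise. After 200 epochs, the loss has dropped just below 1, which is about the variance of the noise we added when generating `y`.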
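Once trained, the network can be applied to new inputs. Here is a minimal, self-contained sketch that repeats the steps above in compact form and then predicts for a few new observations. The tensor `x_new` is made up for illustration, and `coefs` is wrapped in `torch_tensor()` here purely to make the conversion explicit.

```r
library(torch)

# data, generated as in the Data section
d_in <- 3
n <- 100
x <- torch_randn(n, d_in)
coefs <- torch_tensor(c(0.2, -1.3, -0.5))
y <- x$matmul(coefs)$unsqueeze(2) + torch_randn(n, 1)

# network and optimizer
net <- nn_sequential(
  nn_linear(d_in, 32),
  nn_relu(),
  nn_linear(32, 1)
)
opt <- optim_adam(net$parameters)

# same training loop as above, without the logging
for (t in 1:200) {
  loss <- nnf_mse_loss(net(x), y)
  opt$zero_grad()
  loss$backward()
  opt$step()
}

# hypothetical new observations
x_new <- torch_randn(5, d_in)

# no gradients are needed for prediction, so we switch off tracking
with_no_grad({
  y_new <- net(x_new)
})

dim(y_new)
y_new$requires_grad
```

Wrapping the call in `with_no_grad()` is optional for correctness, but it avoids building a computation graph that will never be used.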

## 11.4 What’s to come

You now know a lot about how `torch` works, and how to use it to minimize a cost function in various settings: for example, to train a neural network. But for real-world applications, there is a lot more `torch` has to offer. The next – and most voluminous – part of the book focuses on deep learning.