Loss Functions and Fitting Models

Lecture 5
Author

Harun Pirim

Published

August 19, 2024

Loss Functions

We discussed linear regression, shallow NNs, and deep NNs, each of which represents a family of functions mapping input to output; a particular member of the family is determined by the parameters[1]. In this lecture, we discuss loss functions, which define the best possible mapping: training seeks the parameters that minimize them.

That requires a training dataset \(\{x_i, y_i\}\) where \(x_i\) is the input and \(y_i\) is the output. A loss function \(L[\phi]\) returns a single number that measures the mismatch between the model predictions \(f[x_i, \phi]\) and the ground truth \(y_i\). The goal is to find the parameters \(\phi\) that minimize the loss function.

Maximum Likelihood

The recipe for constructing a loss function for training data using the maximum likelihood principle is as follows:

  1. choose a suitable probability distribution \(Pr(y|\theta)\) defined over the domain of the output \(y\) with distribution parameters \(\theta\) [1].
  2. set the ML model \(f[x,\phi]\) to predict one or more of these parameters \(\theta\). So, \(f[x,\phi] = \theta\) and \(Pr(y|\theta) = Pr(y|f[x,\phi])\)[1].
  3. to train the model, find the network parameters \(\hat{\phi}\) that maximize the likelihood of the training data. This is equivalent to minimizing the negative log-likelihood of the training dataset \(\{x_i,y_i\}\)[1]:

\[ \hat{\phi} = \arg\min_{\phi} \left[ L[\phi] \right] = \arg\min_{\phi} \left[ -\sum_{i=1}^{I} \log \left[ Pr(y_i \mid \textbf{f}[x_i, \phi]) \right] \right]. \]

  4. to perform inference for a new input \(x\), return either the full distribution \(Pr(y \mid f[x,\hat{\phi}])\) or the maximum of this distribution.
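The recipe above can be sketched numerically. As a toy illustration (not from the lecture), suppose the model predicts the mean of a univariate normal distribution with fixed variance, \(Pr(y \mid \mu = f[x,\phi], \sigma^2)\); the negative log-likelihood then differs from the sum of squared errors only by constants. All parameter values below are hypothetical:

```python
import numpy as np

def nll_normal(y, mu, sigma=1.0):
    """Negative log-likelihood of observations y under N(mu, sigma^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - mu)**2 / (2 * sigma**2))

# Linear model f[x, phi] = phi0 + phi1 * x with made-up parameters and data
phi = np.array([0.5, 2.0])
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.4, 2.6, 4.5])

mu = phi[0] + phi[1] * x      # predicted distribution parameters theta = mu
loss = nll_normal(y, mu)      # L[phi]: smaller when predictions fit the data
```

Minimizing `loss` over `phi` recovers least-squares regression, which is why the squared-error loss can be read as a maximum-likelihood criterion under Gaussian noise.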

Fitting Models

How do we find parameter values that minimize the loss function? This process is called learning the network's parameters, training, or fitting the model. First, initial parameter values are chosen. Then, iteratively:

  1. compute the gradient of the loss function with respect to the parameters.
  2. adjust the parameters based on the gradients to decrease the loss[1].

After several iterations, we hope to reach the minimum of the loss function.

Gradient Descent

There are different optimization algorithms for adjusting the parameters based on the gradients. The most common is gradient descent, an iterative algorithm that proceeds as follows:

Initial parameters are

\[ \phi = \begin{bmatrix} \phi_0, \phi_1, \dots, \phi_N \end{bmatrix}^T \]

  1. Compute the gradient of the loss function with respect to the parameters:

\[ \frac{\partial L}{\partial \phi} = \begin{bmatrix} \frac{\partial L}{\partial \phi_0} \\ \frac{\partial L}{\partial \phi_1} \\ \vdots \\ \frac{\partial L}{\partial \phi_N} \end{bmatrix} \]

  2. Adjust the parameters based on the gradients to decrease the loss:

\[ \phi \leftarrow \phi - \alpha \cdot \frac{\partial L}{\partial \phi}, \]

where \(\alpha\) is the learning rate, a hyperparameter that determines the size of the step taken in the parameter space.
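The two-step loop above can be sketched in a few lines. This is an illustrative example (not the lecture's code): fitting a line \(y = \phi_0 + \phi_1 x\) to made-up data by minimizing the sum of squared errors with full-batch gradient descent:

```python
import numpy as np

# Hypothetical training data for a least-squares line fit
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

phi = np.zeros(2)     # initial parameter values [phi0, phi1]
alpha = 0.05          # learning rate (hyperparameter)

for _ in range(2000):
    # Loss L = 0.5 * sum((f[x, phi] - y)^2)
    residual = (phi[0] + phi[1] * x) - y
    # Step 1: gradient of L with respect to each parameter
    grad = np.array([residual.sum(),           # dL/dphi0
                     (residual * x).sum()])    # dL/dphi1
    # Step 2: move downhill
    phi -= alpha * grad
```

If \(\alpha\) is too large the iterates overshoot and diverge; too small and convergence is slow, which is why the learning rate is typically tuned.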

Stochastic gradient descent reduces the dependency on the initial parameter values by adding noise to the gradient.

The mechanism for introducing randomness is simple. At each iteration, the algorithm chooses a random subset of the training data and computes the gradient from these examples alone. This subset is known as a minibatch, or batch for short. The update rule for the model parameters \(\phi_t\) at iteration \(t\) is hence[1]:

\[ \phi_{t+1} \leftarrow \phi_t - \alpha \cdot \sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\phi_t]}{\partial \phi}, \]

where \(\mathcal{B}_t\) is the set of examples in the minibatch at iteration \(t\).

A single pass through the entire training dataset is referred to as an epoch. The learning rate \(\alpha\) starts at a high value and is decreased by a constant factor every N epochs[1].

References

1: Understanding Deep Learning