Shallow Neural Networks
Neural Network Example
Shallow NNs describe piecewise linear functions expressive enough to approximate complex relationships between multi-dimensional inputs and outputs[1].
They are functions \(y = f[x,\phi]\) that map input \(x\) to output \(y\) using parameters \(\phi\). For example:
\[ y = f[x, \phi] = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2 a[\theta_{20} + \theta_{21} x] + \phi_3 a[\theta_{30} + \theta_{31} x]. \]
This expression does three things:
1- It computes three linear functions of the input data.
2- It passes each linear function through an activation function \(a\).
3- It combines the results linearly to produce the output.
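The three steps above can be sketched in NumPy; the particular values of \(\theta\) and \(\phi\) below are arbitrary illustrations, not taken from the text:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f(x, phi, theta):
    """Shallow network: three linear functions of x, each passed through
    ReLU, then combined linearly with the weights phi."""
    h = relu(theta[:, 0] + theta[:, 1] * x)   # hidden units h1, h2, h3
    return phi[0] + phi[1:] @ h               # linear combination

# Illustrative parameters (arbitrary values): rows of theta are
# (theta_d0, theta_d1); phi holds (phi_0, phi_1, phi_2, phi_3).
theta = np.array([[0.0, 1.0], [-1.0, 2.0], [1.0, -0.5]])
phi = np.array([0.5, 1.0, -1.0, 2.0])

print(f(1.5, phi, theta))
```

Together the ten entries of `theta` and `phi` are the parameters that pick out one member of the family of functions.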
ReLU is a popular activation function. It is defined as:
$ a[z] = \max(0, z) $. In other words, it is zero for negative inputs and linear for positive inputs (see figure 3.1). The name ReLU stands for rectified linear unit.
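A minimal NumPy sketch of ReLU:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity for positive."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```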
The equation above, which maps input to output, represents a family of functions, where the particular function depends on the ten parameters \(\phi\) and \(\theta\) (see figure 3.2).
Given a training dataset \(\{x_i, y_i\}\), we can find the best parameters by minimizing the difference between the function output and the target value \(y_i\) for each input \(x_i\) in the training dataset. This is done by minimizing a loss function \(L[\phi]\), such as least squares.
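A least-squares loss over a training set can be sketched as follows; the toy linear model and data here are hypothetical stand-ins for the network and dataset:

```python
import numpy as np

def least_squares_loss(model, params, xs, ys):
    """Sum of squared differences between model outputs and targets."""
    preds = np.array([model(x, params) for x in xs])
    return np.sum((preds - ys) ** 2)

# Toy example: a linear model standing in for the network f[x, phi]
linear = lambda x, p: p[0] + p[1] * x
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([1.0, 3.0, 5.0])

print(least_squares_loss(linear, [1.0, 2.0], xs, ys))  # perfect fit -> 0.0
print(least_squares_loss(linear, [0.0, 0.0], xs, ys))  # poor fit -> 35.0
```

Training amounts to searching for the parameters that make this quantity as small as possible.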
Depicting Neural Networks
Figure 3.4a shows a NN with one input, one output, and three hidden units. Each edge (connection) represents one of the ten parameters.
Universal Approximation Theorem
If we consider \(D\) hidden units, the \(d\)-th hidden unit can be written as:
\[ h_d = a[\theta_{d0} + \theta_{d1} x] \]
These are combined linearly to produce the output:
\[ y = \phi_0 + \sum_{d=1}^{D} \phi_d h_d \]
The number of hidden units is a measure of the network's capacity. With ReLU activation functions, the output has at most D joints, making it a piecewise linear function with at most D+1 linear regions. As we add more hidden units, the model can approximate more complex functions; figure 3.5 shows the effect of adding more hidden units on the approximation. The universal approximation theorem proves that for any continuous function, there exists a shallow network that can approximate this function to any specified precision[1].
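The general D-unit network above can be sketched as follows; the random parameter values are arbitrary illustrations:

```python
import numpy as np

def shallow_net(x, phi, theta):
    """y = phi_0 + sum_d phi_d * relu(theta_d0 + theta_d1 * x)."""
    h = np.maximum(0.0, theta[:, 0] + theta[:, 1] * x)  # D hidden units
    return phi[0] + phi[1:] @ h

rng = np.random.default_rng(0)
D = 5
theta = rng.normal(size=(D, 2))  # rows are (theta_d0, theta_d1)
phi = rng.normal(size=D + 1)     # (phi_0, ..., phi_D)

# Each hidden unit's ReLU switches on/off at one input value, so the
# graph of y is piecewise linear with at most D + 1 linear regions.
xs = np.linspace(-3, 3, 7)
print([round(float(shallow_net(x, phi, theta)), 3) for x in xs])
```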
The universal approximation theorem holds for mapping multivariate inputs to multivariate outputs as well.
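In the multivariate case, the slope parameters become matrices and the biases become vectors; a hedged sketch (the shapes `n`, `D`, `m` and the random values are illustrative assumptions):

```python
import numpy as np

def shallow_net_multi(x, phi0, Phi, theta0, Theta):
    """Multivariate shallow network:
    h = relu(theta0 + Theta @ x)   # D hidden units, x in R^n
    y = phi0 + Phi @ h             # output y in R^m
    """
    h = np.maximum(0.0, theta0 + Theta @ x)
    return phi0 + Phi @ h

rng = np.random.default_rng(1)
n, D, m = 2, 4, 3  # input dim, hidden units, output dim (arbitrary)
x = rng.normal(size=n)
y = shallow_net_multi(x, rng.normal(size=m), rng.normal(size=(m, D)),
                      rng.normal(size=D), rng.normal(size=(D, n)))
print(y.shape)  # (3,)
```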
Terminology
Figure 3.12 illustrates the input layer, hidden layer, and output layer. The values fed into the hidden units, before the activation function is applied, are termed pre-activations; the values after the activation function is applied are called activations. A NN with at least one hidden layer is called a multi-layer perceptron (MLP). A NN with exactly one hidden layer is called a shallow NN, while NNs with multiple hidden layers are called deep NNs. NNs whose connections form acyclic graphs are called feedforward NNs. They are called fully connected if each unit in a layer is connected to every unit in the next layer. Connections represent slope parameters in the underlying equations, also called weights; the offset parameters are called biases.