Deep Neural Networks

Lecture 4
Author

Harun Pirim

Published

August 16, 2024

Composing Neural Networks

Both shallow and deep NNs with ReLU activation functions describe piecewise linear mappings from inputs to outputs[1].

Deep networks can produce many more linear regions than shallow networks for a given number of parameters. Hence, from a practical standpoint, they can be used to describe a broader family of functions[1].

First, let's consider composing two shallow NNs so that the output of the first becomes the input of the second. Figure 4.1a shows this relationship. This is a special case of a deep network with two hidden layers[1].


Composing two shallow networks yields a special case of a deep network with two hidden layers[1]. Now we consider the general case of a deep network with two hidden layers, each containing three hidden units [1] (figure 4.4). The first layer is defined by:

\[ \begin{align*} h_1 &= \text{a}[\theta_{10} + \theta_{11} x] \\ h_2 &= \text{a}[\theta_{20} + \theta_{21} x] \\ h_3 &= \text{a}[\theta_{30} + \theta_{31} x], \end{align*} \]

the second layer by:

\[ \begin{align*} h'_1 &= \text{a}\left[\psi_{10} + \psi_{11} h_1 + \psi_{12} h_2 + \psi_{13} h_3\right] \\ h'_2 &= \text{a}\left[\psi_{20} + \psi_{21} h_1 + \psi_{22} h_2 + \psi_{23} h_3\right] \\ h'_3 &= \text{a}\left[\psi_{30} + \psi_{31} h_1 + \psi_{32} h_2 + \psi_{33} h_3\right], \end{align*} \]

and the output by:

\[ y' = \phi'_0 + \phi'_1 h'_1 + \phi'_2 h'_2 + \phi'_3 h'_3. \]
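Putting the three sets of equations together, the forward pass of this two-layer network can be sketched in a few lines of NumPy (the parameter values below are made up purely for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Illustrative (made-up) parameters for the network defined above
theta = np.array([[0.5, -1.0], [-0.2, 1.0], [0.3, 0.5]])  # rows: [theta_k0, theta_k1]
psi_0 = np.array([0.1, -0.3, 0.2])                        # biases psi_k0
Psi = np.array([[1.0, -0.5, 0.2],
                [0.4, 0.6, -1.0],
                [-0.3, 0.8, 0.5]])                        # weights psi_kj
phi = np.array([0.2, 1.0, -0.5, 0.7])                     # [phi'_0, phi'_1, phi'_2, phi'_3]

def f(x):
    h = relu(theta[:, 0] + theta[:, 1] * x)   # first hidden layer
    h_prime = relu(psi_0 + Psi @ h)           # second hidden layer
    return phi[0] + phi[1:] @ h_prime         # linear output y'

print(f(1.5))  # scalar output y' for input x = 1.5
```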

How does the network construct a complicated function?

1- The three hidden units of the first layer are computed by forming linear functions of the input and passing them through the activation function a (e.g., ReLU).

2- The hidden units of the second layer are computed as linear functions of the first-layer hidden units and passed through the activation function a. Before the activation function is applied, the pre-activations can be thought of as a shallow network with three outputs: three piecewise linear functions.

3- The second-layer activation function clips the piecewise linear functions of the first layer, adding new joints to each.

4- The output is a linear combination of the second-layer hidden units[1].

In conclusion, we can either think of each layer as "folding" the input space or as creating new functions, which are clipped (creating new regions) and then recombined[1].

It's important not to lose sight of the fact that this is still merely an equation relating input x to output y′[1]:

\[ \begin{align*} y' = & \ \phi'_0 + \phi'_1 \text{a}\left[\psi_{10} + \psi_{11} \text{a}[\theta_{10} + \theta_{11} x] + \psi_{12} \text{a}[\theta_{20} + \theta_{21} x] + \psi_{13} \text{a}[\theta_{30} + \theta_{31} x]\right] \\ & + \phi'_2 \text{a}\left[\psi_{20} + \psi_{21} \text{a}[\theta_{10} + \theta_{11} x] + \psi_{22} \text{a}[\theta_{20} + \theta_{21} x] + \psi_{23} \text{a}[\theta_{30} + \theta_{31} x]\right] \\ & + \phi'_3 \text{a}\left[\psi_{30} + \psi_{31} \text{a}[\theta_{10} + \theta_{11} x] + \psi_{32} \text{a}[\theta_{20} + \theta_{21} x] + \psi_{33} \text{a}[\theta_{30} + \theta_{31} x]\right], \end{align*} \]

Hyperparameters

Modern networks might have more than a hundred layers with thousands of hidden units at each layer[1]. The number of layers (K) is the depth of the network, and the number of hidden units at each layer (\(D_i, \ i=1,\dots,K\)) is the width. Remember that the total number of hidden units is a measure of the network's capacity. These quantities are called hyperparameters, and they are set before training the network (i.e., before learning the slope and intercept terms)[1].

For fixed hyperparameters (e.g., \(K = 2\) layers with \(D_k = 3\) hidden units in each), the model describes a family of functions, and the parameters determine the particular function. Hence, when we also consider the hyperparameters, we can think of neural networks as representing a family of families of functions relating input to output[1].
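As a quick sanity check on these definitions, a small helper (hypothetical, not from [1]) can count the parameters implied by a choice of depth and widths for a fully connected network:

```python
def num_parameters(widths, n_input=1, n_output=1):
    """Count weights and biases of a fully connected network with the given hidden-layer widths."""
    sizes = [n_input] + list(widths) + [n_output]
    # each layer contributes a (fan_in x fan_out) weight matrix plus one bias per output unit
    return sum(d_in * d_out + d_out for d_in, d_out in zip(sizes[:-1], sizes[1:]))

print(num_parameters([3, 3]))  # K = 2 layers of D_k = 3 units -> 22 parameters
```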

Matrix Notation

We can write the equations of a deep network in matrix notation. The first layer, the second layer, and the output (as defined above) become:

\[ \begin{aligned} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} &= \text{a}\left( \begin{bmatrix} \theta_{10} \\ \theta_{20} \\ \theta_{30} \end{bmatrix} + \begin{bmatrix} \theta_{11} \\ \theta_{21} \\ \theta_{31} \end{bmatrix} x \right), \\ \begin{bmatrix} h'_1 \\ h'_2 \\ h'_3 \end{bmatrix} &= \text{a}\left( \begin{bmatrix} \psi_{10} \\ \psi_{20} \\ \psi_{30} \end{bmatrix} + \begin{bmatrix} \psi_{11} & \psi_{12} & \psi_{13} \\ \psi_{21} & \psi_{22} & \psi_{23} \\ \psi_{31} & \psi_{32} & \psi_{33} \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} \right), \\ y' &= \phi'_0 + \begin{bmatrix} \phi'_1 & \phi'_2 & \phi'_3 \end{bmatrix} \begin{bmatrix} h'_1 \\ h'_2 \\ h'_3 \end{bmatrix}. \end{aligned} \]

In matrix notation, the equations of a deep network can be written compactly as:

\[ \begin{aligned} \mathbf{h} &= \text{a}[\boldsymbol{\theta}_0 + \boldsymbol{\theta} x] \\ \mathbf{h}' &= \text{a}[\boldsymbol{\psi}_0 + \boldsymbol{\Psi} \mathbf{h}] \\ y' &= \phi'_0 + \boldsymbol{\phi}' \mathbf{h}'. \end{aligned} \]

A general deep network \(y = f[x, \phi]\) with K layers can now be written as:

\[ \begin{aligned} \mathbf{h}_1 &= \text{a}[\beta_0 + \mathbf{\Omega}_0 \mathbf{x}] \\ \mathbf{h}_2 &= \text{a}[\beta_1 + \mathbf{\Omega}_1 \mathbf{h}_1] \\ \mathbf{h}_3 &= \text{a}[\beta_2 + \mathbf{\Omega}_2 \mathbf{h}_2] \\ &\vdots \\ \mathbf{h}_K &= \text{a}[\beta_{K-1} + \mathbf{\Omega}_{K-1} \mathbf{h}_{K-1}] \\ y &= \beta_K + \mathbf{\Omega}_K \mathbf{h}_K. \end{aligned} \]
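These equations translate directly into code. Below is a minimal sketch of the forward pass of a general K-layer network with ReLU activations (the random parameters and layer widths are arbitrary, purely for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, betas, Omegas):
    """Forward pass; the last (beta, Omega) pair forms the linear output layer."""
    h = x
    for beta, Omega in zip(betas[:-1], Omegas[:-1]):
        h = relu(beta + Omega @ h)          # hidden layers h_1 ... h_K
    return betas[-1] + Omegas[-1] @ h       # linear output y

# A random K = 3 layer network with widths 1 -> 4 -> 4 -> 4 -> 1
rng = np.random.default_rng(0)
sizes = [1, 4, 4, 4, 1]
betas = [rng.standard_normal((d, 1)) for d in sizes[1:]]
Omegas = [rng.standard_normal((d_out, d_in)) for d_in, d_out in zip(sizes[:-1], sizes[1:])]

y = forward(np.array([[0.5]]), betas, Omegas)
print(y.shape)  # (1, 1): one scalar output
```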

Ability to Approximate Any Function

Shallow NNs with enough capacity (enough hidden units) can approximate any continuous function arbitrarily closely. The same argument holds for deep NNs.

A shallow NN with one input, one output, and D > 2 hidden units can form up to D+1 linear regions and is defined by 3D+1 parameters[1].

A deep NN with one input, one output, and K layers of D > 2 hidden units can form up to \((D+1)^K\) linear regions using \(3D+1+(K-1)D(D+1)\) parameters[1].

Deep neural networks create much more complex functions for a fixed parameter budget[1].
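To make the comparison concrete, the two counting formulas above can be evaluated for networks with equal parameter budgets (the specific choices of D and K are arbitrary):

```python
# Maximum linear regions and parameter counts for networks with one input and one output[1].
def shallow_regions(D):
    return D + 1

def shallow_params(D):
    return 3 * D + 1

def deep_regions(D, K):
    return (D + 1) ** K

def deep_params(D, K):
    return 3 * D + 1 + (K - 1) * D * (D + 1)

# A deep network with K = 5 layers of D = 3 units uses 58 parameters
# and can form up to 1024 regions ...
print(deep_params(3, 5), deep_regions(3, 5))
# ... while a shallow network with the same 58-parameter budget (D = 19)
# forms at most 20 regions.
print(shallow_params(19), shallow_regions(19))
```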

So, it’s not clear that the greater number of regions is an advantage unless (i) there are similar symmetries in the real-world functions that we wish to approximate or (ii) we have reason to believe that the mapping from input to output really does involve a composition of simpler functions[1].

Nonlinear Activation Functions

The ReLU activation function is piecewise linear, but it is not the only activation function that can be used in deep networks. Smooth alternatives such as sigmoid and tanh produce smooth (rather than piecewise linear) mappings from input to output.

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor

# Generate synthetic data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X) + np.random.normal(0, 0.1, X.shape)

# Define and train a neural network with one hidden layer of 10 units
# and a sigmoid (logistic) activation function
nn = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                  solver='lbfgs', max_iter=2000, random_state=42)
nn.fit(X, y.ravel())

# Generate predictions
X_test = np.linspace(-3, 3, 1000).reshape(-1, 1)
y_pred = nn.predict(X_test)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X_test, y_pred, color='red', label='NN with Sigmoid Activation')
plt.title('Neural Network with Sigmoid Activation Function')
plt.xlabel('Input X')
plt.ylabel('Output y')
plt.legend()
plt.grid(True)
plt.show()

References

1: Simon J. D. Prince, Understanding Deep Learning, MIT Press, 2023.