With ReLU activation functions, both shallow and deep NNs describe piecewise linear mappings from inputs to outputs[1].
Deep networks can produce many more linear regions than shallow networks for a given number of parameters. Hence, from a practical standpoint, they can be used to describe a broader family of functions[1].
First, let's consider composing two shallow NNs so that the output of the first becomes the input of the second. Figure 4.1a shows this relationship. This is a special case of a deep network with two hidden layers[1].
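This composition can be sketched in a few lines of NumPy. The parameter values below are arbitrary placeholders (not taken from the text), chosen only to illustrate feeding one shallow network's output into another:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# A shallow network: 1 input -> 3 hidden ReLU units -> 1 output.
def shallow(x, theta, phi):
    # theta: (3, 2) rows of [intercept, slope]; phi: output [intercept, 3 weights]
    h = relu(theta[:, 0] + theta[:, 1] * x)   # three hidden-unit activations
    return phi[0] + phi[1:] @ h               # linear combination of hidden units

# Arbitrary illustrative parameters for two shallow networks.
theta1 = np.array([[0.0, 1.0], [-0.5, 1.0], [0.5, -1.0]])
phi1 = np.array([0.1, 1.0, -2.0, 1.5])
theta2 = np.array([[0.2, 1.0], [0.0, -1.0], [-0.3, 2.0]])
phi2 = np.array([-0.2, 0.5, 1.0, -1.0])

# Feeding the first network's output into the second gives a
# special case of a deep network with two hidden layers.
x = 0.7
y = shallow(shallow(x, theta1, phi1), theta2, phi2)
print(y)
```

The outer call never sees the raw input, only the piecewise linear output of the inner network, which is exactly what makes the composed mapping richer than either shallow network alone.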
Deep Neural Networks
Composing two shallow networks yields a special case of a deep network with two hidden layers[1]. Now we consider the general case of a deep network with two hidden layers, each containing three hidden units [1] (figure 4.4). The first layer is defined by:
How does the network construct a complicated function?
1- The three hidden units of the first layer are computed as linear functions of the input and passed through the activation function a (i.e., ReLU).
2- The hidden units of the second layer are computed as linear functions of the first-layer hidden units and passed through the activation function a. Before the activation function is applied, the pre-activations can be thought of as a shallow network with three outputs: three piecewise linear functions.
3- The second hidden layer's activation clips the piecewise linear functions of the first layer, adding new joints to each.
4- The output is a linear combination of the second-layer hidden units[1].
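The four steps above can be traced numerically. A minimal NumPy sketch (with arbitrary random parameters, not values from the text) that counts the linear regions a two-hidden-layer network produces on an interval:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Illustrative parameters: one input, two hidden layers of three ReLU
# units each, one output. Values are random, for demonstration only.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))   # layer 1: [intercept, slope] per unit (step 1)
psi = rng.normal(size=(3, 4))     # layer 2: [intercept, 3 weights] per unit (step 2)
phi = rng.normal(size=4)          # output: [intercept, 3 weights] (step 4)

def activation_pattern(x):
    """Which ReLUs are 'on' at input x; constant within one linear region."""
    pre1 = theta[:, 0] + theta[:, 1] * x       # step 1: first-layer pre-activations
    h1 = relu(pre1)
    pre2 = psi[:, 0] + psi[:, 1:] @ h1         # step 2: second-layer pre-activations
    return tuple(pre1 > 0) + tuple(pre2 > 0)   # step 3: joints appear where these flip

# Count linear regions by counting pattern changes along a fine grid.
xs = np.linspace(-3, 3, 10001)
pats = [activation_pattern(x) for x in xs]
regions = 1 + sum(p != q for p, q in zip(pats, pats[1:]))
print("linear regions found:", regions, "(theoretical max (3+1)^2 = 16)")
```

Each linear region corresponds to one fixed on/off pattern of the six ReLUs, so counting pattern changes along the line counts the joints that clipping introduces.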
In conclusion, we can either think of each layer as “folding” the input space or as creating new functions, which are clipped (creating new regions) and then recombined[1].
It’s important not to lose sight of the fact that this is still merely an equation relating input x to output y′[1].
Modern networks might have more than a hundred layers with thousands of hidden units at each layer[1]. The number of layers (K) is the depth of the network, and the number of hidden units at each layer (\(D_i, \ i=1,\dots,K\)) is the width of the network. Remember that the total number of hidden units is a measure of the network's capacity. These quantities are called hyperparameters, and they are set before training the network (i.e., before learning the slope and intercept terms)[1].
For fixed hyperparameters (e.g., K = 2 layers with \(D_k = 3\) hidden units in each), the model describes a family of functions, and the parameters determine the particular function. Hence, when we also consider the hyperparameters, we can think of neural networks as representing a family of families of functions relating input to output[1].
Matrix Notation
We can write the equations of a deep network in matrix notation. The first layer, the second layer, and the output (as represented before) are defined by:
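The equations were not reproduced here; the following is a reconstruction in the notation of [1], with input x, first-layer hidden vector h, second-layer hidden vector h′, and output y′:

```latex
\begin{aligned}
\mathbf{h}  &= a\!\left[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \, \mathbf{x}\right],\\
\mathbf{h}' &= a\!\left[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \, \mathbf{h}\right],\\
\mathbf{y}' &= \boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2 \, \mathbf{h}',
\end{aligned}
```

where the \(\boldsymbol{\beta}_k\) are bias (intercept) vectors, the \(\boldsymbol{\Omega}_k\) are weight matrices, and the activation function a is applied elementwise.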
Shallow NNs with enough capacity can approximate any continuous function arbitrarily closely, and the same argument holds for deep NNs.
A shallow NN with one input, one output, and D > 2 hidden units can form up to D+1 linear regions and is defined by 3D+1 parameters[1].
A deep NN with one input, one output, and K layers of D > 2 hidden units each can form up to \((D+1)^K\) linear regions using \(3D+1+(K-1)D(D+1)\) parameters[1].
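These two counting formulas are easy to compare directly. A small sketch (the choice D = 5, K = 4 is arbitrary, used only to make the contrast concrete):

```python
# Regions and parameters for shallow vs. deep ReLU networks with one
# input and one output, using the formulas quoted above.
def shallow_regions(D):
    return D + 1

def shallow_params(D):
    return 3 * D + 1

def deep_regions(D, K):
    return (D + 1) ** K

def deep_params(D, K):
    return 3 * D + 1 + (K - 1) * D * (D + 1)

D, K = 5, 4
print(deep_params(D, K), deep_regions(D, K))    # 106 parameters, up to 1296 regions
# A shallow network with a similar budget: 3*35 + 1 = 106 parameters.
print(shallow_params(35), shallow_regions(35))  # 106 parameters, only 36 regions
```

For the same budget of 106 parameters, the deep network can carve out up to 1296 linear regions versus the shallow network's 36, illustrating the claim above.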
Deep neural networks create much more complex functions for a fixed parameter budget[1].
However, it’s not clear that the greater number of regions is an advantage unless (i) there are similar symmetries in the real-world functions that we wish to approximate, or (ii) we have reason to believe that the mapping from input to output really does involve a composition of simpler functions[1].
Nonlinear Activation Functions
The ReLU activation function is piecewise linear, but it is not the only activation function that can be used in deep networks. Others, such as the sigmoid and tanh functions, can also be used to create complex (in their case, smooth rather than piecewise linear) functions.
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor

# Generate synthetic data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X) + np.random.normal(0, 0.1, X.shape)

# Define and train a neural network with one hidden layer and sigmoid activation function
nn = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                  max_iter=2000, solver='lbfgs', random_state=42)
nn.fit(X, y.ravel())

# Generate predictions
X_test = np.linspace(-3, 3, 1000).reshape(-1, 1)
y_pred = nn.predict(X_test)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X_test, y_pred, color='red', label='NN with Sigmoid Activation')
plt.title('Neural Network with Sigmoid Activation Function')
plt.xlabel('Input X')
plt.ylabel('Output y')
plt.legend()
plt.grid(True)
plt.show()