First Words on Neural Networks (NNs) or Deep Learning Fundamentals
What is a Neural Network?
A neural network is a computational model that is inspired by the way biological neural networks in the human brain process information. The key element of this model is the neuron, which is a mathematical function that takes an input and produces an output. The output is then passed to other neurons in the network. The network is composed of layers of neurons, with each layer performing a different type of computation. The neurons in each layer are connected to the neurons in the next layer by weights, which are parameters that determine the strength of the connection between neurons. The network learns by adjusting these weights based on the input data and the desired output.
Why Neural Networks?
Neural networks are capable of learning complex patterns in data.
They can be used to solve a wide range of problems, including image recognition, speech recognition, natural language processing, and game playing.
They have been shown to outperform traditional machine learning algorithms in many tasks.
They are highly flexible and can be adapted to different types of data and problems.
They are scalable and can be trained on large datasets using parallel computing.
Some applications of NNs
Image recognition
Speech recognition
Natural language processing
Game playing
Robotics
Healthcare
Finance
Marketing
Transportation
Manufacturing
Energy
Brief history of NNs or DL
1943: McCulloch and Pitts propose a mathematical model of a neuron.
1958: Rosenblatt introduces the perceptron, a simple neural network model.
1969: Minsky and Papert show the limitations of perceptrons.
1986: Rumelhart, Hinton, and Williams introduce backpropagation, a method for training neural networks.
1990s: Neural networks fall out of favor due to limitations in training algorithms and computational power.
2006: Hinton, Osindero, and Teh introduce deep belief networks, a precursor to modern deep learning.
2012: Krizhevsky, Sutskever, and Hinton win the ImageNet competition using a deep convolutional neural network.
2016: DeepMind’s AlphaGo, a deep reinforcement learning system, defeats Go champion Lee Sedol.
2019: OpenAI’s GPT-2, a large language model, generates human-like text.
2020: DeepMind’s AlphaFold predicts protein structures with high accuracy.
DL Mathematics Basics (reference [2])
Linear Algebra, Calculus, Probability and Statistics, and Optimization are the four main branches of mathematics used in deep learning.
Intro
NumPy is a popular library for linear algebra in Python. It provides support for arrays, matrices, and linear algebra operations such as matrix multiplication, matrix inversion, and eigenvalue decomposition. The code below defines an array from a list and prints its size and shape.
Code
import numpy as np

a = np.array([1,2,3,4])
a.size
a.shape

# 2D array, a list of lists is used
b = np.array([[1,2,3,4],[5,6,7,8]])
print(b)
b.shape

d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(d)
d.shape
[[1 2 3 4]
[5 6 7 8]]
[[1 2 3]
[4 5 6]
[7 8 9]]
(3, 3)
Some examples of modifying arrays are given below. A colon (:) selects all the elements along a specific dimension.
Code
# zeros and ones
a = np.zeros((3,4), dtype="uint32")
a[0,3] = 42
a[1,1] = 66
a
b = 11*np.ones((3,1))
b

# indexing
a = np.arange(12).reshape((3,4))  # similar to the python range function
a[1]       # short for a[1,:]
a[1] = [44,55,66,77]
a
a[:2]      # short for a[:2,:]
a[:2,:3]   # first two rows and first three columns
b = np.arange(12)
b
b[::2]     # every other element
b[::-1]    # reverse
a = np.arange(24).reshape((4,3,2))  # collection of 3x2 matrices, a 3D array
a
a[1,:,:] = [[11,22],[33,44],[55,66]]  # second matrix is replaced
a
a[2,...] = [[99,99],[99,99],[99,99]]  # third matrix is replaced; ... stands for the rest of the dimensions
SciPy uses NumPy under the hood; here we focus on its stats module. Say we have test scores from two classes: how strongly can we believe that the same process generated these two sets of data? The t-test is a classic method for answering this question, and one way to evaluate its result is to look at the p-value. A p-value near 1 means the data provide no evidence that the two sets come from different processes, while a small p-value is evidence that they do not share the same generating process.
Code
import numpy as np
from scipy.stats import ttest_ind  # t-test for independent samples

a = np.random.normal(0, 1, 1000)
b = np.random.normal(0, 0.5, 1000)
c = np.random.normal(0.1, 1, 1000)
ttest_ind(a, b)  # note that a and b have the same mean but different spreads
ttest_ind(a, c)  # a and c have slightly different means
Matplotlib is popular for generating plots, and it can plot NumPy arrays directly. The plot window provides a way to interact with the plot; for example, you can configure subplots.
Code
import numpy as np
import matplotlib.pyplot as plt

x = np.random.random(100)  # 100 random numbers between 0 and 1
plt.plot(x)
plt.show()
The famous Scikit-learn library is used for machine learning in Python. For our case, we can use it to build a simple neural network to classify small 8×8-pixel grayscale images of handwritten digits. This dataset is built into sklearn.
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

d = load_digits()
digits = d["data"]   # 1,797 rows, with 64 columns per row
labels = d["target"] # a vector of 1,797 digit labels

N = 200  # number of test samples
K = 10   # adjust this to visualize more or fewer digits

plt.figure(figsize=(10, 2))  # adjust the figure size as needed
for i in range(K):
    plt.subplot(1, K, i + 1)
    plt.imshow(digits[i].reshape(8, 8), cmap='gray')
    plt.title(f"Label: {labels[i]}")
    plt.axis('off')
plt.show()

idx = np.argsort(np.random.random(len(labels)))      # randomly shuffle the data
x_test, y_test = digits[idx[:N]], labels[idx[:N]]    # about 20 of each digit for testing
x_train, y_train = digits[idx[N:]], labels[idx[N:]]  # roughly 160 images of each digit for training

clf = MLPClassifier(hidden_layer_sizes=(128,))  # one hidden layer with 128 neurons; output layer size is deduced from y_train
clf.fit(x_train, y_train)  # train the model; the learned weights and biases are stored in the model

score = clf.score(x_test, y_test)
pred = clf.predict(x_test)
err = np.where(y_test != pred)[0]
print("score : ", score)
print("errors:")
print("  actual   : ", y_test[err])
print("  predicted: ", pred[err])

# Print the weights and biases
print("Weights between layers:")
for i, coef in enumerate(clf.coefs_):
    print(f"Layer {i + 1} to Layer {i + 2} weights shape: {coef.shape}")
    print(coef)
print("\nBiases for each layer:")
for i, intercept in enumerate(clf.intercepts_):
    print(f"Layer {i + 2} biases shape: {intercept.shape}")
    print(intercept)
What’s your accuracy? How does it compare with random guessing, which would score about 10 percent over ten digits?
Probability
Each event is a sample from the sample space, and the sample space represents all the possible events. The sample space for the roll of a standard die is the set {1, 2, 3, 4, 5, 6}. If a feature input to a neural network can take on any value in the range [0, 1], then [0, 1] is the sample space for that feature. Probabilities always sum to 1.0 over all possible values of the sample space. If \(A_i\) represents event i, then summing over all n possible events:
\(\sum_{i=1}^{n} P(A_i) = 1\)
What if we roll two dice and sum them? The sample space is the set of integers from 2 through 12. By counting the ways an event can happen and dividing by the total number of possible outcomes, we can determine its probability. If you flip three coins simultaneously, what is the probability of getting no heads, one head, two heads, or three heads? We can enumerate the possible outcomes and see.
Code
import numpy as np

N = 1000000
M = 3
heads = np.zeros(M+1)
for i in range(N):
    flips = np.random.randint(0, 2, M)  # 1 = heads, 0 = tails
    h = flips.sum()                     # number of heads in this trial
    heads[h] += 1
prob = heads / N
print("Probabilities: %s" % np.array2string(prob))
The sum rule is applied to find the probability of the union of two events. The probability of the union of two events is the sum of the probabilities of the individual events minus the probability of their intersection. If the events are mutually exclusive, the sum rule is formulated as follows:
\(P(A \cup B) = P(A) + P(B)\)
While the sum rule tells us about A or B happening, the product rule tells us about A and B happening. If the events are independent, the probability of their intersection is the product of the probabilities of the individual events:
\(P(A \cap B) = P(A)P(B)\)
If we have a dependent situation, the rule becomes:
\(P(A \cap B) = P(A)P(B|A)\)
Joint probability can be found using the chain rule. Assume A, B, C are three random variables. The joint probability can be found as follows:
\(P(A,B,C) = P(A|B,C)\,P(B|C)\,P(C)\)
Because the joint probability of A and B doesn’t depend on which event we call A and which we call B, we can write the joint probability as:
\(P(A,B) = P(A|B)P(B) = P(B|A)P(A)\)
Dividing both expressions by P(A), we get Bayes’ rule:
\(P(B|A) = \frac{P(A|B)P(B)}{P(A)}\)
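As a quick sketch of Bayes’ rule in action (all numbers here are made up for illustration), consider a diagnostic test: B is having a condition, A is testing positive.

```python
# Hypothetical numbers, for illustration only
p_b = 0.01              # P(B): prior probability of the condition
p_a_given_b = 0.95      # P(A|B): test positive given the condition
p_a_given_not_b = 0.05  # P(A|not B): false positive rate

# P(A) by summing the two ways a positive result can occur
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 3))  # 0.161: a positive test is far from certain
```

Even with an accurate test, the low prior keeps the posterior small; this is the classic base-rate effect.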
Statistics
A statistic is a number that characterizes a sample of data. There are four types of data: nominal (categorical), ordinal, interval, and ratio.
Nominal data: Categories with no inherent order, such as colors or names. Qualitative data.
Ordinal data: Categories with an inherent order, such as rankings or ratings. Qualitative data.
Interval data: Numerical data with meaningful differences but no true zero point, such as temperature in Celsius. Quantitative data.
Ratio data: Numerical data with meaningful differences and a true zero point (zero means absence of the quantity), such as weight or height. Quantitative data.
In DL one-hot encoding is used to convert categorical data into numerical data. For example, if we have a list of colors, we can convert them into a list of binary values.
Code
import numpy as np

colors = ["red", "green", "blue", "red", "green", "green"]
unique_colors = np.unique(colors)
one_hot = np.zeros((len(colors), len(unique_colors)))
for i, color in enumerate(colors):
    one_hot[i, np.where(unique_colors == color)[0]] = 1
print(one_hot)
Summary statistics are used to make sense of the data. The mean, weighted mean, geometric mean, harmonic mean, and median are common summary statistics. The harmonic mean appears in DL as the F1 score, which is the harmonic mean of precision and recall.
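As a minimal sketch, the F1 score as a harmonic mean (the precision and recall values here are made up):

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.6), 4))  # 0.6857, pulled below the arithmetic mean of 0.7
```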
Correlation describes the associations between variables or features of a dataset. In classical ML, highly correlated features were undesirable: they added no new information and only served to confuse the models. The entire art of feature selection was developed, in part, to remove this effect [2]. For DL, uncorrelated features are less critical, as the network can learn to ignore redundant information.
The Pearson correlation coefficient is used to measure the linear relationship between two variables. It ranges from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The closer the correlation coefficient is to zero, either positive or negative, the weaker the correlation between the features [2]. The formula using the expected values is given below:
\(\rho_{XY} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}\)
The Pearson correlation looks for a linear relationship, whereas the Spearman looks for any monotonic association between the inputs [2] using the ranks of the feature values.
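A quick sketch comparing the two on a monotonic but nonlinear relationship, using SciPy’s pearsonr and spearmanr:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 100)
y = np.exp(x)  # strictly increasing in x, but far from linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(round(rho, 6))  # 1.0: Spearman sees a perfect monotonic association
print(r < 1.0)        # True: Pearson is below 1 because the relationship is not linear
```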
Linear Algebra
A scalar has no dimensions, a vector has one, and a matrix has two. As you might suspect, we don’t need to stop there. A mathematical object with more than two dimensions is colloquially referred to as a tensor. A matrix is a tensor of order 2. A vector is an order-1 tensor, and a scalar is an order-0 tensor[2].
Code
import numpy as np

t = np.arange(36).reshape((3,3,4))  # a stack of 3x4 images, for example
print(t[0])      # first image
print(t[0,0])    # first row of the first image
print(t[0,0,0])  # first element of the first row of the first image
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[0 1 2 3]
0
Consider a bookshelf analogy for accessing a word in a particular book: locate the shelf, the book, the page, the line, and finally the word. That is five indices, an order-5 tensor.
Code
w = np.zeros((9,9,9,9,9))
w[4,1,2,0,1]  # second word, first line, third page, second book, fifth shelf
0.0
We can treat a scalar (order-0 tensor) as an order-5 tensor, like this:
Code
t = np.array(42).reshape((1,1,1,1,1))
print(t)
t.shape
[[[[[42]]]]]
(1, 1, 1, 1, 1)
Element-wise multiplication on two matrices (a and b) is often known as the Hadamard product.
Code
a = np.array([[1,2,3],[4,5,6]])
b = np.array([[7,8,9],[10,11,12]])
c = a*b
print(c)
[[ 7 16 27]
[40 55 72]]
The inner product, also called the dot product, is found by:
Code
import numpy as np

def inner(a, b):
    s = 0.0
    for i in range(len(a)):
        s += a[i]*b[i]
    return s

a = np.array([1,2,3,4])
b = np.array([5,6,7,8])
print(inner(a,b))
print(np.dot(a,b))
print(a.dot(b))
(a*b).sum()
Or, if the vectors are interpreted geometrically, the dot product is the product of the magnitudes of the two vectors and the cosine of the angle between them:
\(a \cdot b = ||a||\,||b|| \cos\theta\)
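A small sketch recovering the angle from the geometric form of the dot product:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# cos(theta) = (a . b) / (||a|| ||b||)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))
print(theta)  # approximately 90 degrees: the vectors are perpendicular
```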
The outer product appears as a mixing of two different embedding vectors. Embeddings are the vectors generated by lower layers of a network, for example, the next to last fully connected layer before the softmax layer’s output of a traditional convolutional neural network (CNN) [2].
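A minimal np.outer sketch; the outer product of a length-3 and a length-2 vector is a 3×2 matrix:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5])

c = np.outer(a, b)  # c[i, j] = a[i] * b[j]
print(c.shape)
print(c)
```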
If we want to multiply matrices A and B to find C, the elements of C are found by taking the dot products of the rows of A with the columns of B.
\(C_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj}\)
Code
import numpy as np

def matrixmul(A, B):
    I, K = A.shape
    J = B.shape[1]
    C = np.zeros((I,J), dtype=A.dtype)
    for i in range(I):
        for j in range(J):
            for k in range(K):
                C[i,j] += A[i,k]*B[k,j]
    return C

A = np.array([[1,2,3],[4,5,6]])
B = np.array([[7,8],[9,10],[11,12]])
C = matrixmul(A, B)
print(C)
[[ 58 64]
[139 154]]
We can use a matrix to transform points between spaces. For example, a matrix can map a point from 2D to 3D space.
An affine transformation maps a set of points into another set of points so that points on a line in the original space are still on a line in the mapped space. The transformation is [2]:
\(y = Ax + b\)
In fact, we can view a feedforward neural network as a series of affine transformations, where the transformation matrix is the weight matrix between the layers, and the bias vector provides the translation[2].
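A minimal sketch of a single affine layer, y = Ax + b, with made-up weights mapping a 2D input into 3D:

```python
import numpy as np

# Hypothetical weight matrix and bias vector, for illustration only
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.5, -0.5, 0.0])

x = np.array([2.0, 3.0])
y = A @ x + b  # linear map followed by a translation
print(y)
```

Stacking such transformations, with a nonlinearity between them, is exactly the feedforward computation described above.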
We can characterize a matrix B by how it maps vectors using the inner product between the mapped vector and the original vector[2]. B is positive definite if:
\(x^TBx > 0, \forall x \neq 0\)
B is positive semidefinite if:
\(x^TBx \geq 0, \forall x\)
B is negative definite if:
\(x^TBx < 0, \forall x \neq 0\)
B is negative semidefinite if:
\(x^TBx \leq 0, \forall x\)
B is indefinite if it is none of the above.
If a symmetric matrix is positive definite, then all of its eigenvalues are positive. Similarly, a symmetric negative definite matrix has all negative eigenvalues. Positive and negative semidefinite symmetric matrices have eigenvalues that are all positive or zero or all negative or zero, respectively [2].
A is a square matrix, v is a column vector and λ is a scalar. If Av = λv, then v is an eigenvector (eigen comes from German self) of A and λ is the corresponding eigenvalue. The vector, v, is mapped by A back into a scalar multiple of itself [2].
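A sketch connecting eigenvalues and definiteness with np.linalg.eig on a small symmetric matrix:

```python
import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric

vals, vecs = np.linalg.eig(B)
print(np.sort(vals))  # both eigenvalues positive, so B is positive definite

# Check Bv = lambda*v for the first eigenpair
v = vecs[:, 0]
lam = vals[0]
print(np.allclose(B @ v, lam * v))  # True
```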
For an n-dimensional vector x, the \(L_p\) norm is defined as:
\(||x||_p = (\sum_{i=1}^{n} |x_i|^p)^{1/p}\)
For p=2, the norm is called the Euclidean norm. For p=1, the norm is called the Manhattan norm. For p=\(\infty\), the norm is called the maximum norm.
L2-norm is calculated explicitly as:
\(||x||_2 = \sqrt{\sum_{i=1}^{n} x_i^2}\)
If we replace x by the difference of two vectors, we can treat norms as distance measures between the vectors. So, we can calculate the Euclidean distance between two vectors as:
\(||x-y||_2 = \sqrt{\sum_{i=1}^{n} (x_i-y_i)^2}\)
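These norms, and the norm of a difference as a distance, can be sketched with np.linalg.norm:

```python
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([0.0, 0.0])

print(np.linalg.norm(x, 2))       # L2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, 1))       # L1 (Manhattan) norm: 7.0
print(np.linalg.norm(x, np.inf))  # maximum norm: 4.0
print(np.linalg.norm(x - y))      # Euclidean distance between x and y: 5.0
```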
For example, weight decay, used in deep learning as a regularizer, uses the L2-norm of the weights of the model to keep the weights from getting too large [2].
The covariance matrix describes how the features vary with respect to each other. It is a square matrix with the variances of the features along the diagonal and the covariances between pairs of features off the diagonal. The covariance between two features is calculated as:
\(\text{cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]\)
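A small np.cov sketch; with rowvar=False each column is treated as a feature, so a 2-feature dataset yields a 2×2 covariance matrix:

```python
import numpy as np

# Five samples of two features; the second feature is exactly twice the first
X = np.array([[1.0,  2.0],
              [2.0,  4.0],
              [3.0,  6.0],
              [4.0,  8.0],
              [5.0, 10.0]])

cv = np.cov(X, rowvar=False)
print(cv)  # variances on the diagonal, covariance off the diagonal
```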
Assume each row of X is a feature vector; each n-dimensional point in the space is represented by a row vector. Principal component analysis (PCA) is a technique to learn the directions of the scatter in the dataset, starting with the direction aligned along the greatest scatter. This direction is called the principal component [2]. The PCA steps are summarized as follows:
Subtract the mean from the data (mean centering).
Calculate the covariance matrix.
Calculate the eigenvectors and eigenvalues of the covariance matrix.
Sort the eigenvectors by their eigenvalues in decreasing order.
Project the data onto the eigenvectors (using the transformation matrix that has the eigenvectors).
Code
# download the iris dataset: 150 samples, 4 features
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris().data.copy()

# mean centering
m = iris.mean(axis=0)
ir = iris - m

# covariance matrix, 4x4
cv = np.cov(ir, rowvar=False)

# eigenvalues and eigenvectors
val, vec = np.linalg.eig(cv)

# sort by eigenvalues
val = np.abs(val)
idx = np.argsort(val)[::-1]

# fraction explained; eigenvalues are proportional to the variance along each eigenvector (PC)
ex = val[idx] / val.sum()
print("fraction explained: ", ex)

# projection of the data onto the eigenvectors; keep the first 2 components
# since they explain about 98% of the variance
# w is 2x4, d is 150x2
w = np.vstack((vec[:,idx[0]], vec[:,idx[1]]))
d = np.zeros((ir.shape[0], 2))
for i in range(ir.shape[0]):
    d[i,:] = np.dot(w, ir[i])
During PCA, you may lose a critical feature allowing class separation. As with most things in machine learning, experimentation is vital [2].
A more compact code for PCA is given below:
Code
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
d = pca.fit_transform(ir)  # ir is the mean-centered iris data from above
Singular Value Decomposition (SVD)
For deep learning, you’ll most likely encounter SVD when calculating the pseudoinverse of a nonsquare matrix [2]. A matrix A (m×n) can be decomposed into three matrices U (m×m), Σ (m×n), and V (n×n), where U and V are orthogonal matrices and Σ is a diagonal matrix. The diagonal elements of Σ are the singular values of A. The SVD is given as:
\(A = U \Sigma V^T\)
The “singular” in “singular value decomposition” comes from the singular values of the matrix A. The singular values are the square roots of the positive eigenvalues of \(AA^T\) (equivalently, of \(A^TA\)).
Code
import numpy as np
from scipy.linalg import svd

a = np.array([[3,2,2],[2,3,-2]])
u, s, vt = svd(a)
print(u)
SVD can be used to find principal components or the pseudoinverse of a matrix that is not square. We can calculate the pseudoinverse of any general matrix as follows:
\(A^+ = V \Sigma^+ U^T\)
where Σ+ is the pseudoinverse of Σ. The pseudoinverse of a diagonal matrix is the reciprocal of the diagonal elements. If the diagonal element is zero, the reciprocal is zero.
Code
import numpy as np
from scipy.linalg import svd

A = np.array([[3,2,2],[2,3,-2]])
u, s, vt = svd(A)
Splus = np.array([[1/s[0], 0], [0, 1/s[1]], [0, 0]])  # pseudoinverse of Sigma
Aplus = vt.T @ Splus @ u.T
print(Aplus)
print(A @ Aplus @ A)  # recovers A, a defining property of the pseudoinverse
Calculus
The chain rule is used to find the derivative of a composite function. If y = f(g(x)), then the derivative of y with respect to x is given by:
\(\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}\)
where u = g(x). The chain rule is used in backpropagation to calculate the gradient of the loss function with respect to the weights of the network.
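A numerical sanity check of the chain rule for y = sin(x²), so u = x² and dy/dx = cos(x²)·2x:

```python
import numpy as np

def f(x):
    return np.sin(x**2)

x = 1.3
analytic = np.cos(x**2) * 2*x            # chain rule: (dy/du) * (du/dx)
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2*h)  # central finite difference
print(np.isclose(analytic, numeric))     # True
```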
Minima and maxima of a function are called its extrema. Derivatives at these points are zero. We can use the derivative as a pointer that tells us how to move closer and closer to an extremum; this is what gradient descent does [2].
Gradients tell us the direction and magnitude of the change in the function value at a point. The direction of the maximum change in a function at any point is the gradient at that point [2]. Recall, the gradient is a vector field, so each point on the xy-plane has an associated vector pointing in the direction of the greatest change in the function value[2].
Optimization
Training a neural network is, to a first approximation, an optimization problem—the goal is to find the weights and biases leading to a minimum in the loss function landscape[2].
A first-order optimization algorithm uses the gradient of the loss function to update the weights and biases. The simplest first-order algorithm is gradient descent. The update rule for gradient descent is given as:
\(w_{t+1} = w_t - \alpha \nabla L\)
where \(w_t\) is the weight at time t, α is the learning rate, and ∇L is the gradient of the loss function. The learning rate is a hyperparameter that controls the size of the update to the weights. If the learning rate is too large, the weights may oscillate around the minimum or even diverge. If the learning rate is too small, the weights may take a long time to converge to the minimum.
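A minimal sketch of this update rule on the one-dimensional loss L(w) = w², whose gradient is 2w and whose minimum is at w = 0:

```python
def grad(w):
    return 2.0 * w  # gradient of L(w) = w**2

w = 5.0      # initial weight
alpha = 0.1  # learning rate

for t in range(100):
    w = w - alpha * grad(w)  # w_{t+1} = w_t - alpha * grad(L)

print(abs(w) < 1e-6)  # True: w has converged to (very near) the minimum
```

Try making alpha larger than 1.0 to watch the updates diverge, or much smaller to watch convergence slow down.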
Let’s find the derivative of the weights and bias value for a single node of a hidden layer in a feedforward network[2].
Outputs of the previous layer, x, multiplied by the weights, w, and added to the bias, b, are passed through an activation function, such as ReLU, to get the output of the node, y. In order to apply backpropagation we need derivatives of y with respect to w and b. y is given as:
\(y = ReLU(wx + b)\)
The derivative of this function with respect to w uses two facts:
1. The derivative of the dot product of w and x with respect to w is x.
2. The derivative of the ReLU function is 1 if its input is positive and 0 otherwise.
Now the chain rule can be applied: the derivative of y with respect to w equals the derivative of y with respect to z times the derivative of z with respect to w, where z = wx + b and y = ReLU(z). The same logic gives the derivative of y with respect to b.
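A sketch of these derivatives for a single node with a scalar input (the values are made up):

```python
w, b = 0.7, 0.1
x = 2.0

z = w * x + b    # pre-activation
y = max(0.0, z)  # ReLU(z)

# Chain rule: dy/dw = (dy/dz) * (dz/dw), dy/db = (dy/dz) * (dz/db)
dy_dz = 1.0 if z > 0 else 0.0  # derivative of ReLU
dy_dw = dy_dz * x              # dz/dw = x
dy_db = dy_dz * 1.0            # dz/db = 1
print(dy_dw, dy_db)  # 2.0 1.0, since z > 0
```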
Training a deep neural network is, fundamentally, an optimization problem, so the potential utility of the Jacobian and Hessian is clear, even if the latter can’t be easily used for large neural networks[2].