Topic 8 - Machine Learning in Graphs

lecture
Author

Harun Pirim

Published

November 27, 2023

Introduction

Graph learning tasks include:

Node classification: predicting the label of a node in a graph.
Link prediction: predicting the existence of an edge between two nodes.
Graph classification: predicting the label of a graph.

Figure 1 illustrates these tasks.

Figure 1: Graph learning tasks. Courtesy of [1].

Figure 2 shows a representation of a node based on its neighbors.

Figure 2: Node 5’s representation is updated based on its neighbors. Courtesy of [1].

In node classification, a node’s representation combines its own features with information from its edges and from the rest of the graph.

Node representation

The DeepWalk architecture is used to learn node representations. It is a two-step approach: first, it generates random walks from the graph (analogous to sentences of words); then, it uses SkipGram to learn representations of the nodes.

Word2Vec (2013) uses a neural network to translate words into vectors. The skip-gram variant predicts context words given a target word.

Assume N words: \(w_1, w_2, ..., w_N\). The goal is to maximize the probability of the context words given the target word. This probability is given by:

\[ P(w_{i+j}|w_i) = \frac{exp(v_{w_{i+j}}^T v_{w_i})}{\sum_{w=1}^N exp(v_w^T v_{w_i})} \]

where \(v_w\) is the vector representation of word \(w\). N is the vocabulary size.

\[\frac{1}{N}\sum_{n=1}^{N} \sum_{-c\leq j \leq c, j\ne 0} \log P(w_{n+j} | w_n)\]

where \(c\) is the size of the context window.
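
To make the skip-gram probability concrete, here is a minimal NumPy sketch that evaluates the softmax above for toy random vectors; the vocabulary size, embedding dimension, and word indices are arbitrary assumptions.

Code
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3                       # toy vocabulary size and embedding dimension (assumptions)
V = rng.normal(size=(N, d))       # v_w: one d-dimensional vector per word

def skipgram_prob(context, target):
    """P(w_context | w_target) via the softmax over all N words."""
    scores = V @ V[target]                     # v_w^T v_{w_i} for every word w
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context]

print(skipgram_prob(context=2, target=0))      # one probability; all N of them sum to 1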

Here is a nice video that explains Word2Vec.

Code
import numpy as np
CONTEXT_SIZE = 2
text = """Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Nunc eu sem scelerisque, dictum eros
aliquam, accumsan quam. Pellentesque tempus, lorem ut
semper fermentum, ante turpis accumsan ex, sit amet
ultricies tortor erat quis nulla. Nunc consectetur ligula
sit amet purus porttitor, vel tempus tortor scelerisque.
Vestibulum ante ipsum primis in faucibus orci luctus
et ultrices posuere cubilia curae; Quisque suscipit
ligula nec faucibus accumsan. Duis vulputate massa sit
amet viverra hendrerit. Integer maximus quis sapien id
convallis. Donec elementum placerat ex laoreet gravida.
Praesent quis enim facilisis, bibendum est nec, pharetra
ex. Etiam pharetra congue justo, eget imperdiet diam
varius non. Mauris dolor lectus, interdum in laoreet
quis, faucibus vitae velit. Donec lacinia dui eget
maximus cursus. Class aptent taciti sociosqu ad litora
torquent per conubia nostra, per inceptos himenaeos.
Vivamus tincidunt velit eget nisi ornare convallis.
Pellentesque habitant morbi tristique senectus et netus
et malesuada fames ac turpis egestas. Donec tristique
ultrices tortor at accumsan.
""".split()
Code
skipgrams = []
for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE):
    array = [text[j] for j in np.arange(i - CONTEXT_SIZE, i + CONTEXT_SIZE + 1) if j != i]
    skipgrams.append((text[i], array))
print(skipgrams[0:3])
[('dolor', ['Lorem', 'ipsum', 'sit', 'amet,']), ('sit', ['ipsum', 'dolor', 'amet,', 'consectetur']), ('amet,', ['dolor', 'sit', 'consectetur', 'adipiscing'])]
Code
vocab = set(text)
VOCAB_SIZE = len(vocab)
print(f"Length of vocabulary = {VOCAB_SIZE}")
Length of vocabulary = 121
Code
#!pip install -qU gensim
from gensim.models.word2vec import Word2Vec

model = Word2Vec([text],
                 sg=1,   # Skip-gram
                 vector_size=10,
                 min_count=0,
                 window=2,
                 workers=2,
                 seed=0)

print(f'Shape of W_embed: {model.wv.vectors.shape}')
Shape of W_embed: (121, 10)
Code
model.train([text], total_examples=model.corpus_count, epochs=10)
WARNING:gensim.models.word2vec:Effective 'alpha' higher than previous training cycles
(690, 1560)
Code
print('Word embedding =')
print(model.wv[0])
Word embedding =
[ 0.07156403  0.03257632  0.00209916 -0.04374931 -0.03398107 -0.08656936
 -0.09047253 -0.09552431 -0.06482638  0.0660186 ]
Code
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import random
random.seed(0)
Code
G = nx.erdos_renyi_graph(10, 0.3, seed=1, directed=False)
Code
def random_walk(start, length):
    walk = [str(start)]  # starting node
    for i in range(length):
        neighbors = [node for node in G.neighbors(start)]
        next_node = np.random.choice(neighbors, 1)[0]
        walk.append(str(next_node))
        start = next_node
    return walk
Code
print(random_walk(0, 10))
['0', '1', '9', '1', '0', '4', '6', '7', '6', '5', '6']
Code
G = nx.karate_club_graph()
Code
labels = []
for node in G.nodes:
    label = G.nodes[node]['club']
    labels.append(1 if label == 'Officer' else 0)
Code
plt.figure(figsize=(12,12), dpi=300)
plt.axis('off')
nx.draw_networkx(G,
                 pos=nx.spring_layout(G, seed=0),
                 node_color=labels,
                 node_size=800,
                cmap='coolwarm',
                font_size=14,
                font_color='white'
                )

Code
walks = []
for node in G.nodes:
    for _ in range(80):
        walks.append(random_walk(node, 10))
print(walks[0])
['0', '10', '0', '17', '0', '2', '13', '0', '2', '9', '33']

Now we can train the Word2Vec model on the walks.

Code
model = Word2Vec(walks,
                 hs=1,   # Hierarchical Softmax
                 sg=1,   # Skip-gram
                 vector_size=10,
                 min_count=0,
                 window=10,
                 workers=2,
                 seed=0)
model.train(walks, total_examples=model.corpus_count, epochs=30, report_delay=1)
print(model.wv[0])
WARNING:gensim.models.word2vec:Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
WARNING:gensim.models.word2vec:Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
WARNING:gensim.models.word2vec:Effective 'alpha' higher than previous training cycles
[ 0.32936254  0.06917843  0.08624159 -0.5531198  -0.33145267  0.50137293
 -0.64187    -0.2095948   0.23543341  0.64617795]

Let’s store the node embedding and labels to apply machine learning algorithms.

Code
nodes_wv = np.array([model.wv.get_vector(str(i)) for i in range(len(model.wv))])
labels = np.array(labels)

Apply a random forest classifier to the node embeddings.

Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(nodes_wv, labels, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f'Accuracy = {accuracy_score(y_test, y_pred):.2f}')
Accuracy = 1.00
Code
# 5 fold cross validation
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
scores = cross_val_score(clf, nodes_wv, labels, cv=5)
print(f'Accuracy = {np.mean(scores):.2f}')
Accuracy = 0.94

The node embedding problem turns out to be foundational in graph machine learning tasks. We can use an encoder-decoder perspective to learn node embeddings. Figures 3 and 4 illustrate this perspective.

Figure 3: Encoder and decoder perspective. Courtesy of [2].

Here the goal is to learn an encoder that maps a node to a low-dimensional vector representation. The decoder is used to reconstruct the graph from the node embeddings.

\[ ENC: V \rightarrow R^d \]

\[ ENC(v_i) = z_i \]

\[ DEC: R^d \times R^d \rightarrow R^+ \]

\[ DEC(z_i, z_j) \approx S_{ij} \]

where \(z_i\) and \(z_j\) are the embeddings of nodes \(v_i\) and \(v_j\), respectively. \(S_{ij}\) is the actual similarity between nodes \(v_i\) and \(v_j\).

We can use the following loss function to train the encoder and decoder.

\[ L = \sum_{(i,j) \in E} (S_{ij} - DEC(ENC(v_i), ENC(v_j)))^2 \]
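
As a toy illustration of this loss, here is a minimal NumPy sketch with a shallow (lookup-table) encoder and an inner-product decoder, trained by gradient descent; the similarity matrix, embedding dimension, and learning rate are arbitrary assumptions.

Code
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 4, 2, 0.05
S = np.array([[1., 1., 0., 0.],
              [1., 1., 1., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])        # toy similarity (here: adjacency with self-loops)
Z = rng.normal(scale=0.1, size=(n, d))  # shallow encoder: one embedding z_i per node

for _ in range(500):
    recon = Z @ Z.T                     # DEC(z_i, z_j) = z_i^T z_j
    grad = 4 * (recon - S) @ Z          # gradient of sum_ij (S_ij - z_i^T z_j)^2 for symmetric S
    Z -= lr * grad

print(np.round(Z @ Z.T, 2))             # approximately reconstructs S (up to the rank-2 constraint)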

Figure 4: Encoder and decoder perspective. Courtesy of [2].

These types of node embeddings are shallow, since node features are not used. Some shallow embedding methods are summarized in the following table.

Table 1: Summary of some shallow embedding algorithms. Courtesy of [2].

Node2Vec

Node2Vec (Grover and Leskovec, 2016) [3] uses the main components of DeepWalk: random walks and Word2Vec. However, its random walks are biased, defining the neighborhood by interpolating between BFS and DFS.

The transition probability from node \(v_i\) to \(v_j\) is given by:

\[ P(v_j | v_i) = \alpha_{pq}(i,j) \frac{w_{ij}}{Z} \]

where \(w_{ij}\) is the weight of the edge between \(v_i\) and \(v_j\). \(Z\) is the normalization constant. \(\alpha_{pq}(i,j)\) is the search bias. It is defined as:

\[ \alpha_{pq}(i,j) = \begin{cases} \frac{1}{p} & \text{if } d_{ij} = 0 \\ 1 & \text{if } d_{ij} = 1 \\ \frac{1}{q} & \text{if } d_{ij} = 2 \end{cases} \]

where \(d_{ij}\) is the shortest-path distance between \(v_i\) and \(v_j\). \(p\) and \(q\) are hyperparameters: \(p\) controls the likelihood of immediately revisiting a node in the walk, and \(q\) controls the likelihood of exploring outward from the previous node. Greater q values result in exploring nodes close to the start node, thus approximating BFS. Smaller q values result in exploring nodes farther away from the start node, thus approximating DFS.
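
As an illustration, here is a minimal sketch of a single biased step, computing \(\alpha_{pq}\) from the shortest-path distance between the previous node and each candidate neighbor; an unweighted graph and the p, q values are arbitrary assumptions.

Code
import networkx as nx
import numpy as np

def biased_step(G, prev, curr, p=1.0, q=0.5):
    """Sample the next node of a node2vec-style walk from curr, given the previous node."""
    neighbors = list(G.neighbors(curr))
    alphas = []
    for nxt in neighbors:
        d = nx.shortest_path_length(G, prev, nxt)        # d is 0, 1, or 2 for a neighbor of curr
        alpha = 1/p if d == 0 else (1.0 if d == 1 else 1/q)
        alphas.append(alpha)                             # unweighted graph: w_ij = 1
    probs = np.array(alphas) / np.sum(alphas)            # normalization constant Z
    return np.random.choice(neighbors, p=probs)

G = nx.karate_club_graph()
print(biased_step(G, prev=0, curr=1, p=1.0, q=0.5))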

Let’s implement Node2Vec using the node2vec Python library.

Code
!pip3 install -qU node2vec
from node2vec import Node2Vec
Code
G = nx.karate_club_graph()
Code
node2vec = Node2Vec(G, dimensions=10, walk_length=10, num_walks=80, p=1, q=0.5, workers=2)
Generating walks (CPU: 2): 100%|██████████| 40/40 [00:00<00:00, 1766.00it/s]
Generating walks (CPU: 1): 100%|██████████| 40/40 [00:00<00:00, 1754.08it/s]
Code
model = node2vec.fit(window=10, min_count=1, batch_words=4)
Code
nodes_wv = np.array([model.wv.get_vector(str(i)) for i in range(len(model.wv))])
labels = np.array(labels)
Code
X_train, X_test, y_train, y_test = train_test_split(nodes_wv, labels, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f'Accuracy = {accuracy_score(y_test, y_pred):.2f}')
Accuracy = 1.00

Graph Neural Networks

A GNN is an optimizable transformation on all attributes of the graph (nodes, edges, global-context) that preserves graph symmetries (permutation invariances)[4].

In a basic neural network, input node A with features \(x_A\) is transformed with a weight matrix \(W\). The output is given by:

\[ h_A = \sigma(Wx_A) \]

where \(\sigma\) is the activation function.

In a GNN, the output of a node is a function of its neighbors. The output of node A is given by:

\[ h_A = \sigma(\sum_{B \in N(A)} Wx_B) \]

where \(N(A)\) is the set of neighbors of node A.

In matrix representation, a graph linear layer is given by:

\[ H = \hat A^T X W^T \]

where \(\hat A\) is the adjacency matrix with self-loops. \(X\) is the feature matrix. \(W\) is the weight matrix. Here is a sample code to implement a graph linear layer.

Code
import torch
!pip install torch_geometric
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root=".", name="Cora")
from torch.nn import Linear
import torch.nn.functional as F
data = dataset[0]

def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)

class VanillaGNNLayer(torch.nn.Module):
  def __init__(self, dim_in, dim_out):
    super().__init__()
    self.linear = Linear(dim_in, dim_out, bias=False)
  def forward(self,x,adjacency):
    x = self.linear(x)
    x = torch.sparse.mm(adjacency, x) #multiplication with A
    return x

from torch_geometric.utils import to_dense_adj
adjacency = to_dense_adj(data.edge_index)[0]  # edge_index converted to a dense adjacency matrix
adjacency += torch.eye(len(adjacency))        # self-loops added

class VanillaGNN(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.gnn1 = VanillaGNNLayer(dim_in, dim_h)
        self.gnn2 = VanillaGNNLayer(dim_h, dim_out)
    def forward(self, x, adjacency):
        h = self.gnn1(x, adjacency)
        h = torch.relu(h)
        h = self.gnn2(h, adjacency)
        return F.log_softmax(h, dim=1)
    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01, weight_decay=5e-4)
        self.train()
        for epoch in range(epochs+1):
            optimizer.zero_grad()
            out = self(data.x, adjacency)
            loss = criterion(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask].argmax(dim=1), data.y[data.train_mask])
            loss.backward()
            optimizer.step()
            if epoch % 20 == 0:
                val_loss = criterion(out[data.val_mask], data.y[data.val_mask])
                val_acc = accuracy(out[data.val_mask].argmax(dim=1), data.y[data.val_mask])
                print(f'Epoch {epoch:>3} | Train Loss:{loss:.3f} | Train Acc: {acc*100:>5.2f}% | Val Loss:{val_loss:.2f} | Val Acc: {val_acc*100:.2f}%')
    def test(self, data):
        self.eval()
        out = self(data.x, adjacency)
        acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
        return acc

gnn = VanillaGNN(dataset.num_features, 16, dataset.num_classes)
print(gnn)
gnn.fit(data, epochs=100)
acc = gnn.test(data)
print(f'\nGNN test accuracy: {acc*100:.2f}%')
VanillaGNN(
  (gnn1): VanillaGNNLayer(
    (linear): Linear(in_features=1433, out_features=16, bias=False)
  )
  (gnn2): VanillaGNNLayer(
    (linear): Linear(in_features=16, out_features=7, bias=False)
  )
)
Epoch   0 | Train Loss:2.187 | Train Acc: 13.57% | Val Loss:2.24 | Val Acc: 9.40%
Epoch  20 | Train Loss:0.058 | Train Acc: 100.00% | Val Loss:1.48 | Val Acc: 74.40%
Epoch  40 | Train Loss:0.012 | Train Acc: 100.00% | Val Loss:1.96 | Val Acc: 74.40%
Epoch  60 | Train Loss:0.005 | Train Acc: 100.00% | Val Loss:2.05 | Val Acc: 74.40%
Epoch  80 | Train Loss:0.004 | Train Acc: 100.00% | Val Loss:2.06 | Val Acc: 74.40%
Epoch 100 | Train Loss:0.003 | Train Acc: 100.00% | Val Loss:2.04 | Val Acc: 74.80%

GNN test accuracy: 75.90%

Graph Convolutional Networks

Graph convolutional networks (GCNs) are a type of GNN. The vanilla GNN above does not take into account differences in neighborhood sizes; GCNs apply a normalization based on node degrees. GCNs are based on the convolution operation. A convolution layer with \(\hat D\) and \(\hat A\) is given by:

\[ H^{(l+1)} = \sigma(\hat D^{-1/2} \hat A \hat D^{-1/2} H^{(l)} W^{(l)}) \]

where \(H^{(l)}\) is the hidden state matrix at layer \(l\). \(W^{(l)}\) is the weight matrix at layer \(l\). \(\hat A\) is the adjacency matrix with self-loops. \(\hat D\) is the degree matrix with self-loops. \(\sigma\) is the activation function.
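
The normalized adjacency matrix \(\hat D^{-1/2} \hat A \hat D^{-1/2}\) is easy to compute directly; here is a small NumPy sketch on a toy graph (the adjacency matrix is an arbitrary assumption).

Code
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(len(A))                  # adjacency with self-loops
deg = A_hat.sum(axis=1)                     # degrees with self-loops (diagonal of D_hat)
D_inv_sqrt = np.diag(1 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # D^{-1/2} A D^{-1/2}
print(np.round(A_norm, 2))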

Here is example code implementing a GCN with PyG’s GCNConv layer.

Code
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

def accuracy(pred_y, y):
    return ((pred_y == y).sum() / len(y)).item()

class GCN(torch.nn.Module):
    """Graph Convolutional Network"""
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.gcn1 = GCNConv(dim_in, dim_h)
        self.gcn2 = GCNConv(dim_h, dim_out)
    def forward(self, x, edge_index):
        h = self.gcn1(x, edge_index)
        h = torch.relu(h)
        h = self.gcn2(h, edge_index)
        return F.log_softmax(h, dim=1)
    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(),
                                    lr=0.01,
                                    weight_decay=5e-4)
        self.train()
        for epoch in range(epochs+1):
            optimizer.zero_grad()
            out = self(data.x, data.edge_index)
            loss = criterion(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask].argmax(dim=1), data.y[data.train_mask])
            loss.backward()
            optimizer.step()
            if(epoch % 20 == 0):
                val_loss = criterion(out[data.val_mask],data.y[data.val_mask])
                val_acc = accuracy(out[data.val_mask].argmax(dim=1), data.y[data.val_mask])
                print(f'Epoch {epoch:>3} | Train Loss:{loss:.3f} | Train Acc: {acc*100:>5.2f}% | Val Loss:{val_loss:.2f} | Val Acc: {val_acc*100:.2f}%')
    @torch.no_grad()
    def test(self, data):
        self.eval()
        out = self(data.x, data.edge_index)
        predictions = out.argmax(dim=1)
        acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
        return acc, predictions

gcn = GCN(dataset.num_features, 16, dataset.num_classes)
print(gcn)
gcn.fit(data, epochs=100)

acc, pred = gcn.test(data)
print(f'GCN test accuracy: {acc*100:.2f}%')
print(f'GCN predictions: {pred}')
GCN(
  (gcn1): GCNConv(1433, 16)
  (gcn2): GCNConv(16, 7)
)
Epoch   0 | Train Loss:1.952 | Train Acc:  7.86% | Val Loss:1.95 | Val Acc: 8.40%
Epoch  20 | Train Loss:0.098 | Train Acc: 100.00% | Val Loss:0.79 | Val Acc: 77.00%
Epoch  40 | Train Loss:0.013 | Train Acc: 100.00% | Val Loss:0.75 | Val Acc: 77.80%
Epoch  60 | Train Loss:0.013 | Train Acc: 100.00% | Val Loss:0.73 | Val Acc: 77.80%
Epoch  80 | Train Loss:0.016 | Train Acc: 100.00% | Val Loss:0.72 | Val Acc: 77.80%
Epoch 100 | Train Loss:0.015 | Train Acc: 100.00% | Val Loss:0.73 | Val Acc: 78.00%
GCN test accuracy: 80.00%
GCN predictions: tensor([3, 4, 4,  ..., 1, 3, 3])

Graph Attention Networks

Graph attention networks (GATs) are a type of GNN. GATs use an attention mechanism to learn the importance of neighbors. The attention coefficients are given by:

\[ \alpha_{ij} = \frac{exp(LeakyReLU(\vec a^T [Wx_i || Wx_j]))}{\sum_{k \in N(i)} exp(LeakyReLU(\vec a^T [Wx_i || Wx_k]))} \]

where \(x_i\) and \(x_j\) are the features of nodes \(i\) and \(j\), respectively. \(W\) is the weight matrix. \(||\) is the concatenation operator. \(\vec a\) is the attention vector. \(N(i)\) is the set of neighbors of node \(i\).

The output of node \(i\) is given by:

\[ h_i = \sigma(\sum_{j \in N(i)} \alpha_{ij} Wx_j) \]

where \(\sigma\) is the activation function. Multiple attention heads are used to learn multiple representations of nodes. The output of node \(i\) is given by:

\[ h_i = \sigma(\sum_{k=1}^K \sum_{j \in N(i)} \alpha_{ij}^k W^kx_j) \]

where \(K\) is the number of attention heads. Let’s implement GAT in NumPy.

Code
import numpy as np

np.random.seed(0)
A = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 1]])
X = np.random.uniform(-1, 1, (4, 4))
W = np.random.uniform(-1, 1, (2, 4)) # (dim_h, num_features)
W_att = np.random.uniform(-1, 1, (1, 4)) #1 by dim_h*2
connections = np.where(A > 0)
np.concatenate([(X @ W.T)[connections[0]], (X @ W.T)[connections[1]]], axis=1) # concatenate hidden vectors of source and destination nodes
a = W_att @ np.concatenate([(X @ W.T)[connections[0]], (X @ W.T)[connections[1]]], axis=1).T
def leaky_relu(x, alpha=0.2):
    return np.maximum(alpha*x, x)
e = leaky_relu(a)
E = np.zeros(A.shape)
E[connections[0], connections[1]] = e[0]
def softmax2D(x, axis):
    e = np.exp(x - np.expand_dims(np.max(x, axis=axis), axis))
    sum = np.expand_dims(np.sum(e, axis=axis), axis)
    return e / sum
W_alpha = softmax2D(E, 1)
H = A.T @ W_alpha @ X @ W.T
H
array([[-1.10126376,  1.99749693],
       [-0.33950544,  0.97045933],
       [-1.03570438,  1.53614075],
       [-1.03570438,  1.53614075]])

Let’s implement GAT using PyG.

Code
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root=".", name="Cora")
data = dataset[0]
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATv2Conv
from torch.nn import Linear, Dropout

def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)
class GAT(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out, heads=8):
        super().__init__()
        self.gat1 = GATv2Conv(dim_in, dim_h, heads=heads)
        self.gat2 = GATv2Conv(dim_h*heads, dim_out, heads=1)
    def forward(self, x, edge_index):
        h = F.dropout(x, p=0.6, training=self.training)
        h = self.gat1(h, edge_index)
        h = F.elu(h)
        h = F.dropout(h, p=0.6, training=self.training)
        h = self.gat2(h, edge_index)
        return F.log_softmax(h, dim=1)
    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01, weight_decay=0.01)
        self.train()
        for epoch in range(epochs+1):
            optimizer.zero_grad()
            out = self(data.x, data.edge_index)
            loss = criterion(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask].argmax(dim=1), data.y[data.train_mask])
            loss.backward()
            optimizer.step()
            if(epoch % 20 == 0):
                val_loss = criterion(out[data.val_mask], data.y[data.val_mask])
                val_acc = accuracy(out[data.val_mask].argmax(dim=1), data.y[data.val_mask])
                print(f'Epoch {epoch:>3} | Train Loss:{loss:.3f} | Train Acc: {acc*100:>5.2f}% | Val Loss:{val_loss:.2f} | Val Acc: {val_acc*100:.2f}%')
    @torch.no_grad()
    def test(self, data):
        self.eval()
        out = self(data.x, data.edge_index)
        acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
        return acc
gat = GAT(dataset.num_features, 32, dataset.num_classes)
gat.fit(data, epochs=100)

acc = gat.test(data)
print(f'GAT test accuracy: {acc*100:.2f}%')
Epoch   0 | Train Loss:1.965 | Train Acc:  9.29% | Val Loss:1.94 | Val Acc: 15.20%
Epoch  20 | Train Loss:0.211 | Train Acc: 99.29% | Val Loss:0.94 | Val Acc: 72.20%
Epoch  40 | Train Loss:0.175 | Train Acc: 97.86% | Val Loss:0.88 | Val Acc: 73.00%
Epoch  60 | Train Loss:0.160 | Train Acc: 98.57% | Val Loss:0.90 | Val Acc: 73.60%
Epoch  80 | Train Loss:0.133 | Train Acc: 100.00% | Val Loss:0.93 | Val Acc: 71.00%
Epoch 100 | Train Loss:0.128 | Train Acc: 97.86% | Val Loss:0.92 | Val Acc: 74.40%
GAT test accuracy: 82.40%

GraphSAGE

GraphSAGE (Hamilton et al., 2017) is a type of GNN that uses a neighborhood aggregation function to combine the features of a node’s neighbors. The aggregation is given by:

\[ h_i^{(l)} = \sigma(W^l \cdot AGG(\{h_j^{(l-1)}, \forall j \in N(i)\})) \]

where \(h_i^{(l)}\) is the hidden state of node \(i\) at layer \(l\). \(W^l\) is the weight matrix at layer \(l\). \(\sigma\) is the activation function. \(AGG\) is the aggregation function (e.g., mean, LSTM, or pooling). \(N(i)\) is the set of neighbors of node \(i\). In the original formulation, the aggregated neighbor embedding is concatenated with the node’s own embedding \(h_i^{(l-1)}\) before the linear transformation.
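
As a toy illustration, here is a minimal PyTorch sketch of one GraphSAGE-style layer with a mean aggregator; it is a simplified version, not the PyG implementation, and the dense adjacency input and dimensions are assumptions.

Code
import torch
import torch.nn as nn

class MeanSAGELayer(nn.Module):
    """One GraphSAGE-style layer: average neighbor features, concatenate with the
    node's own features, then apply a linear map and a nonlinearity."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(2 * dim_in, dim_out)
    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ x) / deg                              # mean of neighbor features
        return torch.relu(self.linear(torch.cat([x, agg], dim=1)))

# toy usage: 4 nodes, 3 features, a small undirected adjacency matrix
adj = torch.tensor([[0., 1., 1., 0.],
                    [1., 0., 1., 0.],
                    [1., 1., 0., 1.],
                    [0., 0., 1., 0.]])
x = torch.randn(4, 3)
print(MeanSAGELayer(3, 8)(x, adj).shape)   # torch.Size([4, 8])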

Figure 5 illustrates the GraphSAGE computation graph for a node.

Figure 5: Computation graph for node 0. Courtesy of [1].

GraphSAGE uses a sampling strategy to make the computation more efficient. The computation graph for node 0 is given in Figure 6.

Figure 6: Computation graph with sampling for node 0. Courtesy of [1].

GraphSAGE is an example of inductive learning: the trained model can generalize to unseen nodes and graphs.

Let’s implement GraphSAGE using PyG.

‘Inductive learning on protein-protein interactions. This dataset is a collection of 24 graphs, where nodes (21,557) are human proteins and edges (342,353) are physical interactions between proteins in a human cell. The goal of the dataset is to perform multi-label classification with 121 labels.’[1]

Code
!pip install torch_geometric
from torch_geometric.datasets import PPI
train_dataset = PPI(root=".", split='train')
val_dataset = PPI(root=".", split='val')
test_dataset = PPI(root=".", split='test')
print(len(train_dataset), len(val_dataset), len(test_dataset))
train_dataset[19]
#Unify training graph in a single set to apply neighbor sampling
from torch_geometric.data import Batch
from torch_geometric.loader import NeighborLoader
train_data = Batch.from_data_list(train_dataset)
loader = NeighborLoader(train_data, batch_size=2048, shuffle=True,
                        num_neighbors=[20, 10], num_workers=2,
                        persistent_workers=True)
train_data[19]
from torch_geometric.loader import DataLoader
train_loader = DataLoader(train_dataset, batch_size=2)
val_loader = DataLoader(val_dataset, batch_size=2)
test_loader = DataLoader(test_dataset, batch_size=2)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from torch_geometric.nn import GraphSAGE
model = GraphSAGE(
    in_channels=train_dataset.num_features,
    hidden_channels=512,
    num_layers=2,
    out_channels=train_dataset.num_classes,
).to(device)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
def fit():
    model.train()
    total_loss = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = criterion(out, data.y)
        total_loss += loss.item() * data.num_graphs
        loss.backward()
        optimizer.step()
    return total_loss / len(train_loader.dataset)
from sklearn.metrics import f1_score
@torch.no_grad()
def test(loader):
    model.eval()
    data = next(iter(loader))
    out = model(data.x.to(device), data.edge_index.to(device))
    preds = (out > 0).float().cpu()
    y, pred = data.y.numpy(), preds.numpy()
    return pred, f1_score(y, pred, average='micro') if pred.sum() > 0 else 0
for epoch in range(301):
    loss = fit()
    preds, val_f1 = test(val_loader)
    if epoch % 50 == 0:
        print(f'Epoch {epoch:>3} | Train Loss: {loss:.3f}| Val F1 score: {val_f1:.4f}')
preds.shape
preds, test_f1 = test(test_loader)
print(f'Test F1 score: {test_f1:.4f}')
20 2 2
Epoch   0 | Train Loss: 0.590| Val F1 score: 0.4100
Epoch  50 | Train Loss: 0.190| Val F1 score: 0.8428
Epoch 100 | Train Loss: 0.142| Val F1 score: 0.8771
Epoch 150 | Train Loss: 0.122| Val F1 score: 0.8945
Epoch 200 | Train Loss: 0.107| Val F1 score: 0.9041
Epoch 250 | Train Loss: 0.097| Val F1 score: 0.9086
Epoch 300 | Train Loss: 0.090| Val F1 score: 0.9149
Test F1 score: 0.9351

Heterogeneous Networks

Heterogeneous networks are networks with different types of nodes and edges. For example, a social network can have users, posts, and comments as nodes; the edges can be friendship, like, and comment relations. Heterogeneous networks are also called knowledge graphs. We can define such a graph G with mapping functions \(f_v\) and \(f_e\):

\[ G = (V, E, f_v, f_e) \]

where \(V\) is the set of nodes, \(E\) is the set of edges, \(f_v\) is the mapping function for nodes, and \(f_e\) is the mapping function for edges. Mapping functions can be used to define the type of nodes and edges. For example, \(f_v\) can be used to define the type of nodes as user, post, and comment. \(f_e\) can be used to define the type of edges as friendship, like, and comment.
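
Conceptually, the mapping functions are just lookups from each node or edge to its type; here is a tiny sketch with hypothetical names.

Code
# hypothetical toy graph with typed nodes and edges
V = ['alice', 'bob', 'post_1', 'comment_1']
E = [('alice', 'bob'), ('alice', 'post_1'), ('bob', 'comment_1')]

f_v = {'alice': 'user', 'bob': 'user', 'post_1': 'post', 'comment_1': 'comment'}
f_e = {('alice', 'bob'): 'friendship',
       ('alice', 'post_1'): 'like',
       ('bob', 'comment_1'): 'comment'}

print(f_v['alice'], f_e[('alice', 'bob')])   # user friendship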

PyTorch Geometric supports heterogeneous networks. Let’s implement a heterogeneous network using PyG, sticking to the example in reference [1]. The figure below illustrates the heterogeneous network.

Figure 8: Heterogeneous network. Courtesy of [1].

Code
#!pip install torch_geometric
import torch
from torch_geometric.data import HeteroData
data = HeteroData()
data['user'].x = torch.Tensor([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]) #num users, num features
data['game'].x = torch.Tensor([[1, 1], [2, 2]])
data['dev'].x = torch.Tensor([[1], [2]])
data['user', 'follows', 'user'].edge_index = torch.Tensor([[0, 1], [1, 2]]) # [2, num_edges_follows]
data['user', 'plays', 'game'].edge_index = torch.Tensor([[0, 1, 1, 2], [0, 0, 1, 1]])
data['dev', 'develops', 'game'].edge_index = torch.Tensor([[0, 1], [0, 1]])
data['user', 'plays', 'game'].edge_attr = torch.Tensor([[2], [0.5], [10], [12]])
print(data)

from torch import nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.datasets import DBLP
from torch_geometric.nn import GAT
from torch_geometric.nn import GATConv, Linear, to_hetero

#The DBLP computer science bibliography offers a dataset that contains four types of nodes – papers (14,328), terms (7,723), authors (4,057), and conferences (20). This dataset’s goal is to correctly classify the authors into four categories – database, data mining, artificial intelligence, and information retrieval. The authors’ node features are a bag-of-words (“0” or “1”) of 334 keywords they might have used in their publications.

dataset = DBLP(root='.')
data = dataset[0]
data['conference'].x = torch.zeros(20, 1)

class GAT(torch.nn.Module):
    def __init__(self, dim_h, dim_out):
        super().__init__()
        self.conv = GATConv((-1, -1), dim_h, add_self_loops=False)
        self.linear = nn.Linear(dim_h, dim_out)
    def forward(self, x, edge_index):
        h = self.conv(x, edge_index).relu()
        h = self.linear(h)
        return h
model = GAT(dim_h=64, dim_out=4)
model = to_hetero(model, data.metadata(), aggr='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data, model = data.to(device), model.to(device)
@torch.no_grad()
def test(mask):
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict)['author'].argmax(dim=-1)
    acc = (pred[mask] == data['author'].y[mask]).sum() / mask.sum()
    return float(acc)
for epoch in range(101):
    model.train()
    optimizer.zero_grad()
    out = model(data.x_dict, data.edge_index_dict)['author']
    mask = data['author'].train_mask
    loss = F.cross_entropy(out[mask], data['author'].y[mask])
    loss.backward()
    optimizer.step()
    if epoch % 20 == 0:
        train_acc = test(data['author'].train_mask)
        val_acc = test(data['author'].val_mask)
        print(f'Epoch: {epoch:>3} | Train Loss: {loss:.4f} | Train Acc: {train_acc*100:.2f}% | Val Acc:{val_acc*100:.2f}%')
test_acc = test(data['author'].test_mask)
print(f'Test accuracy: {test_acc*100:.2f}%')
HeteroData(
  user={ x=[3, 4] },
  game={ x=[2, 2] },
  dev={ x=[2, 1] },
  (user, follows, user)={ edge_index=[2, 2] },
  (user, plays, game)={
    edge_index=[2, 4],
    edge_attr=[4, 1],
  },
  (dev, develops, game)={ edge_index=[2, 2] }
)
Epoch:   0 | Train Loss: 1.3907 | Train Acc: 18.50% | Val Acc:22.75%
Epoch:  20 | Train Loss: 1.2019 | Train Acc: 94.25% | Val Acc:67.00%
Epoch:  40 | Train Loss: 0.8634 | Train Acc: 95.50% | Val Acc:68.25%
Epoch:  60 | Train Loss: 0.5128 | Train Acc: 97.75% | Val Acc:72.00%
Epoch:  80 | Train Loss: 0.2734 | Train Acc: 99.50% | Val Acc:76.25%
Epoch: 100 | Train Loss: 0.1511 | Train Acc: 100.00% | Val Acc:75.75%
Test accuracy: 78.81%
Downloading https://www.dropbox.com/s/yh4grpeks87ugr2/DBLP_processed.zip?dl=1
Extracting ./raw/DBLP_processed.zip
Processing...
Done!

Lastly, it is good to read the PyG documentation [5] to learn more about its data structures, including those for heterogeneous networks.

References

[1] Maxime Labonne, Hands-On Graph Neural Networks Using Python, 2023

[2] William L. Hamilton, Graph Representation Learning, 2020

[3] Aditya Grover and Jure Leskovec, node2vec: Scalable Feature Learning for Networks, 2016

[4] A Gentle Introduction to Graph Neural Networks

[5] PyG Documentation