Node classification: predicting the label of a node in a graph. Link prediction: predicting the existence of an edge between two nodes. Graph classification: predicting the label of a graph.
Figure 1 illustrates these tasks.
Figure 1: Graph learning tasks. Courtesy of [1].
Figure 2 shows a representation of a node based on its neighbors.
Figure 2: node 5’s representation is updated based on its neighbors. Courtesy of [1].
In node classification, a node's representation combines its own features with information from its edges and from the rest of the graph.
Node representation
The DeepWalk architecture is used to learn node representations. It is a two-step approach: first, it generates random walks from the graph (analogous to sequences of words); then, it uses skip-gram to learn the representations of the nodes.
Word2Vec (2013) uses a neural network to map words to vectors. The skip-gram variant predicts context words given a target word.
Assume a corpus of \(N\) words \(w_1, w_2, ..., w_N\). The goal is to maximize the probability of the context words given each target word:

\[ \frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{n+j} \mid w_n) \]

where \(c\) is the context size.
```python
import numpy as np

CONTEXT_SIZE = 2
text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu sem scelerisque, dictum eros aliquam, accumsan quam. Pellentesque tempus, lorem ut semper fermentum, ante turpis accumsan ex, sit amet ultricies tortor erat quis nulla. Nunc consectetur ligula sit amet purus porttitor, vel tempus tortor scelerisque. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Quisque suscipit ligula nec faucibus accumsan. Duis vulputate massa sit amet viverra hendrerit. Integer maximus quis sapien id convallis. Donec elementum placerat ex laoreet gravida. Praesent quis enim facilisis, bibendum est nec, pharetra ex. Etiam pharetra congue justo, eget imperdiet diam varius non. Mauris dolor lectus, interdum in laoreet quis, faucibus vitae velit. Donec lacinia dui eget maximus cursus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Vivamus tincidunt velit eget nisi ornare convallis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Donec tristique ultrices tortor at accumsan.""".split()
```
```python
skipgrams = []
for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE):
    array = [text[j] for j in np.arange(i - CONTEXT_SIZE, i + CONTEXT_SIZE + 1) if j != i]
    skipgrams.append((text[i], array))
print(skipgrams[0:3])
```
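The cell that actually trains a skip-gram model on this toy corpus is not shown above (only its warnings are). A minimal sketch with gensim, using assumed parameter values, would look like this; enabling both hierarchical softmax (`hs=1`) and negative sampling (`negative=10`) is what triggers warnings like the ones below.

```python
# Sketch (assumed, not the original cell): skip-gram Word2Vec on the toy corpus.
from gensim.models.word2vec import Word2Vec

model = Word2Vec([text],              # one "sentence": the tokenized corpus
                 sg=1,                # skip-gram
                 hs=1,                # hierarchical softmax...
                 negative=10,         # ...and negative sampling (both on -> warning)
                 vector_size=10,
                 window=CONTEXT_SIZE,
                 min_count=1,
                 workers=1,
                 seed=0)
print(model.wv.vectors.shape)         # (vocabulary size, embedding dimension)
```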
WARNING:gensim.models.word2vec:Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them.
WARNING:gensim.models.word2vec:Effective 'alpha' higher than previous training cycles
The node embedding problem turns out to be foundational in graph machine learning tasks. We can use an encoder-decoder perspective to learn node embeddings. Figures 3 and 4 illustrate this perspective.
Figure 3: Encoder and decoder perspective. Courtesy of [2].
Here the goal is to learn an encoder that maps a node to a low-dimensional vector representation. The decoder is used to reconstruct the graph from the node embeddings.
\[ ENC: V \rightarrow R^d \]
\[ ENC(v_i) = z_i \]
\[ DEC: R^d \times R^d \rightarrow R^+ \]
\[ DEC(z_i, z_j) \approx S_{ij} \]
where \(z_i\) and \(z_j\) are the embeddings of nodes \(v_i\) and \(v_j\), respectively, and \(S_{ij}\) is the actual similarity between nodes \(v_i\) and \(v_j\).
We can use the following loss function to train the encoder and decoder:

\[ \mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(DEC(z_i, z_j), S_{ij}\big) \]

where \(\ell\) measures the discrepancy between the decoded and the true similarity (for example, the squared error) and \(\mathcal{D}\) is the set of training node pairs.
Figure 4: Encoder and decoder perspective. Courtesy of [2].
These types of node embeddings are called shallow because node features are not used. Some of the shallow embedding methods are summarized in the following table.
Table 1: Summary of some of the shallow embedding algorithms. Courtesy of [2].
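As a toy illustration of this encoder-decoder view (not from the original notes), we can treat a learnable embedding table as \(ENC\), the dot product as \(DEC\), and minimize a squared reconstruction loss against a given similarity matrix \(S\):

```python
# Toy sketch (assumed values): shallow embeddings trained to reconstruct S.
import torch

num_nodes, dim = 10, 4
S = torch.rand(num_nodes, num_nodes)    # stand-in for the true pairwise similarities
S = (S + S.T) / 2                       # make it symmetric

Z = torch.nn.Embedding(num_nodes, dim)  # ENC: one learnable vector per node
optimizer = torch.optim.Adam(Z.parameters(), lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    z = Z.weight                        # embeddings z_i, shape [num_nodes, dim]
    dec = z @ z.T                       # DEC(z_i, z_j) = z_i . z_j
    loss = ((dec - S) ** 2).mean()      # squared reconstruction loss over all pairs
    loss.backward()
    optimizer.step()

print(f'final reconstruction loss: {loss.item():.4f}')
```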
Node2Vec
Node2Vec (Grover and Leskovec, 2016) uses the main components of DeepWalk: random walks and Word2Vec. However, the random walks are biased, defining the neighborhood of a node by interpolating between BFS and DFS.
Suppose a walk has just traversed the edge \((v_i, v)\) and now sits at node \(v\). The transition probability from \(v\) to a neighbor \(v_j\) is given by:

\[ P(v_j \mid v) = \frac{\alpha_{pq}(i,j)\, w_{vj}}{Z} \]

where \(w_{vj}\) is the weight of the edge between \(v\) and \(v_j\), \(Z\) is the normalization constant, and \(\alpha_{pq}(i,j)\) is the search bias, defined as:

\[ \alpha_{pq}(i,j) = \begin{cases} \frac{1}{p} & \text{if } d_{ij} = 0 \\ 1 & \text{if } d_{ij} = 1 \\ \frac{1}{q} & \text{if } d_{ij} = 2 \end{cases} \]

where \(d_{ij}\) is the shortest-path distance between the previous node \(v_i\) and the candidate node \(v_j\) (it can only take the values 0, 1, or 2). \(p\) and \(q\) are hyperparameters: \(p\) controls the likelihood of immediately revisiting a node in the walk, and \(q\) controls the likelihood of exploring outward from the previous node. Greater \(q\) values keep the walk exploring nodes close to the start node, thus approximating BFS; smaller \(q\) values push the walk toward nodes further away from the start node, thus approximating DFS.
Let's implement Node2Vec in Python.
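The original Node2Vec cell is not reproduced here; below is a minimal sketch that combines hand-rolled biased random walks with gensim's Word2Vec, using the karate club graph and assumed hyperparameters (\(p\), \(q\), walk length, dimensions). A dedicated `node2vec` Python package also exists.

```python
# Minimal Node2Vec sketch (assumed setup): biased random walks + skip-gram.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()
p, q = 1.0, 0.5  # return (p) and in-out (q) hyperparameters

def biased_walk(G, start, length, p, q):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(G.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbors))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:              # d(prev, nxt) = 0 -> bias 1/p
                weights.append(1 / p)
            elif G.has_edge(prev, nxt):  # d(prev, nxt) = 1 -> bias 1
                weights.append(1.0)
            else:                        # d(prev, nxt) = 2 -> bias 1/q
                weights.append(1 / q)
        walk.append(random.choices(neighbors, weights=weights)[0])
    return [str(n) for n in walk]

walks = [biased_walk(G, node, 20, p, q) for _ in range(80) for node in G.nodes()]
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=1, epochs=5)
print(model.wv[str(0)][:5])  # first dimensions of node 0's embedding
```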
Graph Neural Networks
A GNN is an optimizable transformation on all attributes of the graph (nodes, edges, global context) that preserves graph symmetries (permutation invariances) [4].
In a basic neural network, input node A with features \(x_A\) is transformed with a weight matrix \(W\). The output is given by:
\[ h_A = \sigma(Wx_A) \]
where \(\sigma\) is the activation function.
In a GNN, the output of a node is a function of its neighbors. The output of node A is given by:
\[ h_A = \sigma(\sum_{B \in N(A)} Wx_B) \]
where \(N(A)\) is the set of neighbors of node A.
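To make this concrete, here is a tiny numerical check (with assumed feature and weight values, not from the original notes) of the neighbor-sum update for a single node A:

```python
# Toy check of h_A = sigma(sum_{B in N(A)} W x_B) for one node A (assumed values).
import numpy as np

W = np.array([[1.0, 0.0], [0.5, 0.5]])        # weight matrix (2 x 2)
x_neighbors = [np.array([1.0, 2.0]),           # features of A's neighbors
               np.array([0.0, 1.0]),
               np.array([3.0, 1.0])]

h_A = np.maximum(0, sum(W @ x_B for x_B in x_neighbors))  # ReLU activation
print(h_A)  # [4. 4.]
```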
In matrix representation, a graph linear layer is given by:
\[ H = \hat A^T X W^T \]
where \(\hat A\) is the adjacency matrix with self-loops, \(X\) is the feature matrix, and \(W\) is the weight matrix. Here is sample code implementing a graph linear layer and a vanilla GNN.
```python
# !pip install torch_geometric
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import to_dense_adj

dataset = Planetoid(root=".", name="Cora")
data = dataset[0]

def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)

class VanillaGNNLayer(torch.nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = Linear(dim_in, dim_out, bias=False)

    def forward(self, x, adjacency):
        x = self.linear(x)
        x = torch.sparse.mm(adjacency, x)  # multiplication with A
        return x

adjacency = to_dense_adj(data.edge_index)[0]  # edge_index converted to an adjacency matrix
adjacency += torch.eye(len(adjacency))        # self-loops added

class VanillaGNN(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.gnn1 = VanillaGNNLayer(dim_in, dim_h)
        self.gnn2 = VanillaGNNLayer(dim_h, dim_out)

    def forward(self, x, adjacency):
        h = self.gnn1(x, adjacency)
        h = torch.relu(h)
        h = self.gnn2(h, adjacency)
        return F.log_softmax(h, dim=1)

    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01, weight_decay=5e-4)
        self.train()
        for epoch in range(epochs + 1):
            optimizer.zero_grad()
            out = self(data.x, adjacency)
            loss = criterion(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask].argmax(dim=1), data.y[data.train_mask])
            loss.backward()
            optimizer.step()
            if epoch % 20 == 0:
                val_loss = criterion(out[data.val_mask], data.y[data.val_mask])
                val_acc = accuracy(out[data.val_mask].argmax(dim=1), data.y[data.val_mask])
                print(f'Epoch {epoch:>3} | Train Loss: {loss:.3f} | Train Acc: {acc*100:>5.2f}% | '
                      f'Val Loss: {val_loss:.2f} | Val Acc: {val_acc*100:.2f}%')

    def test(self, data):
        self.eval()
        out = self(data.x, adjacency)
        acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
        return acc

gnn = VanillaGNN(dataset.num_features, 16, dataset.num_classes)
print(gnn)
gnn.fit(data, epochs=100)
acc = gnn.test(data)
print(f'\nGNN test accuracy: {acc*100:.2f}%')
```
VanillaGNN(
(gnn1): VanillaGNNLayer(
(linear): Linear(in_features=1433, out_features=16, bias=False)
)
(gnn2): VanillaGNNLayer(
(linear): Linear(in_features=16, out_features=7, bias=False)
)
)
Epoch 0 | Train Loss:2.187 | Train Acc: 13.57% | Val Loss:2.24 | Val Acc: 9.40%
Epoch 20 | Train Loss:0.058 | Train Acc: 100.00% | Val Loss:1.48 | Val Acc: 74.40%
Epoch 40 | Train Loss:0.012 | Train Acc: 100.00% | Val Loss:1.96 | Val Acc: 74.40%
Epoch 60 | Train Loss:0.005 | Train Acc: 100.00% | Val Loss:2.05 | Val Acc: 74.40%
Epoch 80 | Train Loss:0.004 | Train Acc: 100.00% | Val Loss:2.06 | Val Acc: 74.40%
Epoch 100 | Train Loss:0.003 | Train Acc: 100.00% | Val Loss:2.04 | Val Acc: 74.80%
GNN test accuracy: 75.90%
Graph Convolutional Networks
Graph convolutional networks (GCNs) are a type of GNN based on the convolution operation. The vanilla GNN above does not take differences between neighborhoods into account: nodes with many neighbors accumulate much larger sums than nodes with few. GCNs therefore apply a normalization that considers node degrees. A convolution layer with \(\hat D\) and \(\hat A\) is given by:

\[ H^{(l+1)} = \sigma\left( \hat D^{-1/2} \hat A \hat D^{-1/2} H^{(l)} W^{(l)} \right) \]
where \(H^{(l)}\) is the hidden state matrix at layer \(l\). \(W^{(l)}\) is the weight matrix at layer \(l\). \(\hat A\) is the adjacency matrix with self-loops. \(\hat D\) is the degree matrix with self-loops. \(\sigma\) is the activation function.
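The notebook cell that produced the output below is not included. A minimal sketch of a comparable two-layer GCN on Cora, built with PyG's `GCNConv` and assumed hyperparameters, would look like this:

```python
# Sketch (assumed, not the original cell): two-layer GCN on Cora with PyG.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root=".", name="Cora")
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.gcn1 = GCNConv(dim_in, dim_h)
        self.gcn2 = GCNConv(dim_h, dim_out)

    def forward(self, x, edge_index):
        h = self.gcn1(x, edge_index).relu()
        h = self.gcn2(h, edge_index)
        return F.log_softmax(h, dim=1)

model = GCN(dataset.num_features, 16, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(101):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
test_acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()
print(f'GCN test accuracy: {test_acc*100:.2f}%')
print('GCN predictions:', pred)
```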
GCN(
(gcn1): GCNConv(1433, 16)
(gcn2): GCNConv(16, 7)
)
Epoch 0 | Train Loss:1.952 | Train Acc: 7.86% | Val Loss:1.95 | Val Acc: 8.40%
Epoch 20 | Train Loss:0.098 | Train Acc: 100.00% | Val Loss:0.79 | Val Acc: 77.00%
Epoch 40 | Train Loss:0.013 | Train Acc: 100.00% | Val Loss:0.75 | Val Acc: 77.80%
Epoch 60 | Train Loss:0.013 | Train Acc: 100.00% | Val Loss:0.73 | Val Acc: 77.80%
Epoch 80 | Train Loss:0.016 | Train Acc: 100.00% | Val Loss:0.72 | Val Acc: 77.80%
Epoch 100 | Train Loss:0.015 | Train Acc: 100.00% | Val Loss:0.73 | Val Acc: 78.00%
GCN test accuracy: 80.00%
GCN predictions: tensor([3, 4, 4, ..., 1, 3, 3])
Graph Attention Networks
Graph attention networks (GATs) are a type of GNN that uses an attention mechanism to learn the importance of each neighbor. The attention coefficient between nodes \(i\) and \(j\) is given by:

\[ \alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec a^{\,T} [W x_i \,\Vert\, W x_j]\big)\big)}{\sum_{k \in N(i)} \exp\big(\mathrm{LeakyReLU}\big(\vec a^{\,T} [W x_i \,\Vert\, W x_k]\big)\big)} \]
where \(x_i\) and \(x_j\) are the features of nodes \(i\) and \(j\), respectively. \(W\) is the weight matrix. \(||\) is the concatenation operator. \(\vec a\) is the attention vector. \(N(i)\) is the set of neighbors of node \(i\).
The updated embedding of node \(i\) is then a weighted sum of its neighbors' transformed features:

\[ h_i = \sigma\left( \sum_{j \in N(i)} \alpha_{ij} W x_j \right) \]

where \(\sigma\) is the activation function. Multiple attention heads are used to learn multiple representations of nodes; with \(K\) heads, the output of node \(i\) is the concatenation:

\[ h_i = \big\Vert_{k=1}^{K}\, \sigma\left( \sum_{j \in N(i)} \alpha_{ij}^{k} W^{k} x_j \right) \]
```python
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root=".", name="Cora")
data = dataset[0]

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATv2Conv
from torch.nn import Linear, Dropout

def accuracy(y_pred, y_true):
    return torch.sum(y_pred == y_true) / len(y_true)

class GAT(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out, heads=8):
        super().__init__()
        self.gat1 = GATv2Conv(dim_in, dim_h, heads=heads)
        self.gat2 = GATv2Conv(dim_h * heads, dim_out, heads=1)

    def forward(self, x, edge_index):
        h = F.dropout(x, p=0.6, training=self.training)
        h = self.gat1(h, edge_index)
        h = F.elu(h)
        h = F.dropout(h, p=0.6, training=self.training)
        h = self.gat2(h, edge_index)
        return F.log_softmax(h, dim=1)

    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01, weight_decay=0.01)
        self.train()
        for epoch in range(epochs + 1):
            optimizer.zero_grad()
            out = self(data.x, data.edge_index)
            loss = criterion(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask].argmax(dim=1), data.y[data.train_mask])
            loss.backward()
            optimizer.step()
            if epoch % 20 == 0:
                val_loss = criterion(out[data.val_mask], data.y[data.val_mask])
                val_acc = accuracy(out[data.val_mask].argmax(dim=1), data.y[data.val_mask])
                print(f'Epoch {epoch:>3} | Train Loss: {loss:.3f} | Train Acc: {acc*100:>5.2f}% | '
                      f'Val Loss: {val_loss:.2f} | Val Acc: {val_acc*100:.2f}%')

    @torch.no_grad()
    def test(self, data):
        self.eval()
        out = self(data.x, data.edge_index)
        acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
        return acc

gat = GAT(dataset.num_features, 32, dataset.num_classes)
gat.fit(data, epochs=100)
acc = gat.test(data)
print(f'GAT test accuracy: {acc*100:.2f}%')
```
Epoch 0 | Train Loss:1.965 | Train Acc: 9.29% | Val Loss:1.94 | Val Acc: 15.20%
Epoch 20 | Train Loss:0.211 | Train Acc: 99.29% | Val Loss:0.94 | Val Acc: 72.20%
Epoch 40 | Train Loss:0.175 | Train Acc: 97.86% | Val Loss:0.88 | Val Acc: 73.00%
Epoch 60 | Train Loss:0.160 | Train Acc: 98.57% | Val Loss:0.90 | Val Acc: 73.60%
Epoch 80 | Train Loss:0.133 | Train Acc: 100.00% | Val Loss:0.93 | Val Acc: 71.00%
Epoch 100 | Train Loss:0.128 | Train Acc: 97.86% | Val Loss:0.92 | Val Acc: 74.40%
GAT test accuracy: 82.40%
GraphSAGE
GraphSAGE (Hamilton et al., 2017) is a type of GNN that uses a neighborhood aggregation function to aggregate the features of a node's neighbors. The update is given by:

\[ h_i^{(l+1)} = \sigma\left( W^{(l)} \cdot \big[ h_i^{(l)} \,\Vert\, \mathrm{AGG}\big( \{ h_j^{(l)} : j \in N(i) \} \big) \big] \right) \]

where \(h_i^{(l)}\) is the hidden state of node \(i\) at layer \(l\), \(W^{(l)}\) is the weight matrix at layer \(l\), \(\sigma\) is the activation function, \(\mathrm{AGG}\) is the aggregation function (for example, the mean), and \(N(i)\) is the set of neighbors of node \(i\).
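As a small illustration (toy tensors, not from the original notes), one GraphSAGE update with a mean aggregator can be written directly:

```python
# Toy GraphSAGE update for a single node i with a mean aggregator (assumed values).
import torch

torch.manual_seed(0)
h_i = torch.rand(4)                     # current embedding of node i
h_neighbors = torch.rand(3, 4)          # embeddings of the 3 sampled neighbors of i
W = torch.rand(8, 8)                    # weight matrix applied to [h_i || AGG]

agg = h_neighbors.mean(dim=0)           # AGG = mean of neighbor embeddings
h_new = torch.relu(torch.cat([h_i, agg]) @ W)  # sigma(W [h_i || AGG])
print(h_new.shape)                      # torch.Size([8])
```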
Figure 5 illustrates the GraphSAGE computation graph for a node.
Figure 5: Computation graph for node 0. Courtesy of [1].
GraphSAGE uses a sampling strategy to make the computation more efficient. The computation graph for node 0 is given in Figure 6.
Figure 6: Computation graph with sampling for node 0. Courtesy of [1].
GraphSAGE is an example of inductive learning. Inductive learning is the ability to generalize to unseen nodes.
Let’s implement GraphSAGE using PyG.
‘Inductive learning on protein-protein interactions. This dataset is a collection of 24 graphs, where nodes (21,557) are human proteins and edges (342,353) are physical interactions between proteins in a human cell. The goal of the dataset is to perform multi-label classification with 121 labels.’[1]
```python
# !pip install torch_geometric
import torch
from torch_geometric.datasets import PPI

train_dataset = PPI(root=".", split='train')
val_dataset = PPI(root=".", split='val')
test_dataset = PPI(root=".", split='test')
print(len(train_dataset), len(val_dataset), len(test_dataset))
train_dataset[19]

# Unify the training graphs into a single batch to apply neighbor sampling
from torch_geometric.data import Batch
from torch_geometric.loader import NeighborLoader

train_data = Batch.from_data_list(train_dataset)
loader = NeighborLoader(train_data, batch_size=2048, shuffle=True,
                        num_neighbors=[20, 10], num_workers=2,
                        persistent_workers=True)
train_data[19]

from torch_geometric.loader import DataLoader
train_loader = DataLoader(train_dataset, batch_size=2)
val_loader = DataLoader(val_dataset, batch_size=2)
test_loader = DataLoader(test_dataset, batch_size=2)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

from torch_geometric.nn import GraphSAGE
model = GraphSAGE(
    in_channels=train_dataset.num_features,
    hidden_channels=512,
    num_layers=2,
    out_channels=train_dataset.num_classes,
).to(device)

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

def fit():
    model.train()
    total_loss = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = criterion(out, data.y)
        total_loss += loss.item() * data.num_graphs
        loss.backward()
        optimizer.step()
    return total_loss / len(train_loader.dataset)

from sklearn.metrics import f1_score

@torch.no_grad()
def test(loader):
    model.eval()
    data = next(iter(loader))
    out = model(data.x.to(device), data.edge_index.to(device))
    preds = (out > 0).float().cpu()
    y, pred = data.y.numpy(), preds.numpy()
    return pred, f1_score(y, pred, average='micro') if pred.sum() > 0 else 0

for epoch in range(301):
    loss = fit()
    preds, val_f1 = test(val_loader)
    if epoch % 50 == 0:
        print(f'Epoch {epoch:>3} | Train Loss: {loss:.3f} | Val F1 score: {val_f1:.4f}')

preds.shape
preds, test_f1 = test(test_loader)
print(f'Test F1 score: {test_f1:.4f}')
```
20 2 2
Epoch 0 | Train Loss: 0.590| Val F1 score: 0.4100
Epoch 50 | Train Loss: 0.190| Val F1 score: 0.8428
Epoch 100 | Train Loss: 0.142| Val F1 score: 0.8771
Epoch 150 | Train Loss: 0.122| Val F1 score: 0.8945
Epoch 200 | Train Loss: 0.107| Val F1 score: 0.9041
Epoch 250 | Train Loss: 0.097| Val F1 score: 0.9086
Epoch 300 | Train Loss: 0.090| Val F1 score: 0.9149
Test F1 score: 0.9351
Link Prediction
Link prediction is the task of predicting the existence of an edge between two nodes. It is used to recommend friends in social networks, products in e-commerce, and movies in streaming services.
We can distinguish between traditional link prediction and graph neural network-based link prediction.
Traditional link prediction methods are based on similarity measures. Common similarity measures are:

- Jaccard coefficient: \(\dfrac{|N(A) \cap N(B)|}{|N(A) \cup N(B)|}\)
- Adamic-Adar index: \(\displaystyle\sum_{u \in N(A) \cap N(B)} \frac{1}{\log |N(u)|}\)
- Katz index: \(\displaystyle\sum_{l=1}^{\infty} \beta^{l}\, |walks(A,B,l)|\)

where \(A\) and \(B\) are the two candidate nodes, \(N(u)\) is the set of neighbors of node \(u\), \(walks(A,B,l)\) is the set of walks of length \(l\) between \(A\) and \(B\), and \(\beta\) is a damping hyperparameter.
The Katz index for all node pairs can be computed in closed form using the following matrix expression:

\[ (I - \beta A)^{-1} - I \]
where \(I\) is the identity matrix. \(A\) is the adjacency matrix. Let’s implement these similarity measures.
```python
import networkx as nx

# Create a simple graph
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6)])

# Calculate the Jaccard coefficient for potential new edges
potential_edges = []
for u in G.nodes():
    for v in G.nodes():
        if u != v and not G.has_edge(u, v):
            jaccard_coeff = len(list(nx.common_neighbors(G, u, v))) / len(set(G[u]) | set(G[v]))
            potential_edges.append((u, v, jaccard_coeff))

# Sort potential edges by Jaccard coefficient
potential_edges = sorted(potential_edges, key=lambda x: x[2], reverse=True)

# Predict links based on the Jaccard coefficient
for edge in potential_edges[:5]:  # Predict top 5 edges
    print(f"Predicted edge between nodes {edge[0]} and {edge[1]} with Jaccard coefficient {edge[2]:.4f}")

# networkx-based implementation (iterates over non-edges only)
potential_edges = []
for u, v in nx.non_edges(G):
    jaccard_coeff = len(list(nx.common_neighbors(G, u, v))) / len(set(G[u]) | set(G[v]))
    potential_edges.append((u, v, jaccard_coeff))
print(potential_edges)
```
Predicted edge between nodes 5 and 6 with Jaccard coefficient 1.0000
Predicted edge between nodes 6 and 5 with Jaccard coefficient 1.0000
Predicted edge between nodes 2 and 4 with Jaccard coefficient 0.5000
Predicted edge between nodes 4 and 2 with Jaccard coefficient 0.5000
Predicted edge between nodes 1 and 5 with Jaccard coefficient 0.3333
[(1, 5, 0.3333333333333333), (1, 6, 0.3333333333333333), (2, 4, 0.5), (2, 5, 0.0), (2, 6, 0.0), (3, 5, 0.3333333333333333), (3, 6, 0.3333333333333333), (5, 6, 1.0)]
```python
import networkx as nx
import math

# Create a simple graph
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6)])

# Calculate the Adamic-Adar index for potential new edges
potential_edges = []
for u in G.nodes():
    for v in G.nodes():
        if u != v and not G.has_edge(u, v):
            adamic_adar = sum(1 / math.log(G.degree(w)) for w in nx.common_neighbors(G, u, v))
            potential_edges.append((u, v, adamic_adar))

# Sort potential edges by Adamic-Adar index
potential_edges = sorted(potential_edges, key=lambda x: x[2], reverse=True)

# Predict links based on the Adamic-Adar index
for edge in potential_edges[:5]:  # Predict top 5 edges
    print(f"Predicted edge between nodes {edge[0]} and {edge[1]} with Adamic-Adar index {edge[2]:.4f}")
```
Predicted edge between nodes 2 and 4 with Adamic-Adar index 1.8205
Predicted edge between nodes 4 and 2 with Adamic-Adar index 1.8205
Predicted edge between nodes 1 and 5 with Adamic-Adar index 0.7213
Predicted edge between nodes 1 and 6 with Adamic-Adar index 0.7213
Predicted edge between nodes 3 and 5 with Adamic-Adar index 0.7213
```python
import networkx as nx
import numpy as np

# Create a simple graph
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6)])

# Get the mapping of nodes to indices
node_to_index = {node: index for index, node in enumerate(G.nodes())}

# Compute the Katz matrix: (I - beta*A)^-1 - I
beta = 0.1  # Adjust the beta value for the Katz index
adjacency_matrix = nx.to_numpy_array(G, nodelist=node_to_index.keys())
I = np.identity(len(G.nodes()))
katz_matrix = np.linalg.inv(I - beta * adjacency_matrix) - I

# Calculate the Katz index for potential new edges
potential_edges = []
for u in G.nodes():
    for v in G.nodes():
        if u != v and not G.has_edge(u, v):
            u_idx, v_idx = node_to_index[u], node_to_index[v]
            katz_index = katz_matrix[u_idx, v_idx]
            potential_edges.append((u, v, katz_index))

# Sort potential edges by Katz index
potential_edges = sorted(potential_edges, key=lambda x: x[2], reverse=True)

# Predict links based on the Katz index
for edge in potential_edges[:5]:  # Predict top 5 edges
    print(f"Predicted edge between nodes {edge[0]} and {edge[1]} with Katz index {edge[2]:.4f}")

# Playing with adjacency matrices to observe the relationship between
# powers of A and the number of walks of length l

# Define the adjacency matrix for a simple graph with 6 nodes
# (1s for connections, 0s for no connections)
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0]
])

# Calculate A^2 and A^3
A_squared = np.dot(A, A)
A_cubed = np.dot(A_squared, A)

# Display the matrices
print("Adjacency Matrix A:")
print(A)
print("\nA^2 (A squared):")
print(A_squared)
print("\nA^3 (A cubed):")
print(A_cubed)

# Convert the adjacency matrix to a NetworkX graph and draw it
G = nx.from_numpy_array(A)
nx.draw(G, with_labels=True)
```
Predicted edge between nodes 2 and 4 with Katz index 0.0237
Predicted edge between nodes 4 and 2 with Katz index 0.0237
Predicted edge between nodes 1 and 5 with Katz index 0.0119
Predicted edge between nodes 1 and 6 with Katz index 0.0119
Predicted edge between nodes 6 and 1 with Katz index 0.0119
Adjacency Matrix A:
[[0 1 1 0 0 0]
[1 0 1 1 0 0]
[1 1 0 1 1 0]
[0 1 1 0 1 1]
[0 0 1 1 0 1]
[0 0 0 1 1 0]]
A^2 (A squared):
[[2 1 1 2 1 0]
[1 3 2 1 2 1]
[1 2 4 2 1 2]
[2 1 2 4 2 1]
[1 2 1 2 3 1]
[0 1 2 1 1 2]]
A^3 (A cubed):
[[2 5 6 3 3 3]
[5 4 7 8 4 3]
[6 7 6 9 8 3]
[3 8 9 6 7 6]
[3 4 8 7 4 5]
[3 3 3 6 5 2]]
These methods consider neither node features nor structural similarity, and they lack inductive learning capability. Therefore, GNN-based link prediction methods have been proposed.
Variational Graph Autoencoders (VGAE)
VGAEs learn normal distributions that are then sampled to produce embeddings. The encoder has two GCNs that share the first layer to:
1. learn the mean of the latent normal distribution,
2. learn the variance of the latent normal distribution.
The decoder samples embeddings \(z\) from the learned distributions. The inner product \(\hat A = \sigma(z z^T)\) is used to approximate the adjacency matrix.
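The VGAE training cell is not shown above; the following is a minimal sketch on Cora using PyG's `GCNConv` and `VGAE` wrapper, with an assumed `RandomLinkSplit` setup and hyperparameters:

```python
# Minimal VGAE sketch on Cora (assumed setup, not the original notebook cell).
import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, VGAE

# RandomLinkSplit creates train/val/test edge sets with negative samples for evaluation
transform = T.Compose([
    T.NormalizeFeatures(),
    T.RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True,
                      split_labels=True, add_negative_train_samples=False),
])
dataset = Planetoid(root=".", name="Cora", transform=transform)
train_data, val_data, test_data = dataset[0]

class Encoder(torch.nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv1 = GCNConv(dim_in, 2 * dim_out)          # shared first layer
        self.conv_mu = GCNConv(2 * dim_out, dim_out)       # mean of the latent distribution
        self.conv_logstd = GCNConv(2 * dim_out, dim_out)   # log std of the latent distribution

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv_mu(h, edge_index), self.conv_logstd(h, edge_index)

model = VGAE(Encoder(dataset.num_features, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(301):
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)
    # reconstruction loss (inner-product decoder) + KL divergence
    loss = model.recon_loss(z, train_data.pos_edge_label_index) \
           + (1 / train_data.num_nodes) * model.kl_loss()
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        model.eval()
        with torch.no_grad():
            z = model.encode(val_data.x, val_data.edge_index)
            auc, ap = model.test(z, val_data.pos_edge_label_index, val_data.neg_edge_label_index)
        print(f'Epoch {epoch:>3} | Loss: {loss:.4f} | Val AUC: {auc:.4f} | Val AP: {ap:.4f}')
```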
Epoch 0 | Loss: 3.5207 | Val AUC: 0.6133 | Val AP: 0.6447
Epoch 50 | Loss: 1.3293 | Val AUC: 0.6947 | Val AP: 0.7049
Epoch 100 | Loss: 1.1834 | Val AUC: 0.7303 | Val AP: 0.7335
Epoch 150 | Loss: 1.0535 | Val AUC: 0.7918 | Val AP: 0.7905
Epoch 200 | Loss: 1.0234 | Val AUC: 0.8266 | Val AP: 0.8192
Epoch 250 | Loss: 1.0053 | Val AUC: 0.8284 | Val AP: 0.8230
Epoch 300 | Loss: 0.9894 | Val AUC: 0.8306 | Val AP: 0.8257
Test AUC: 0.8359 | Test AP 0.8356
torch.Size([2708, 2708])
Subgraph-based algorithms - SEAL
Enclosing subgraphs are extracted based on the target nodes and their k-hop neighbors. SEAL proceeds in three steps:
1. Enclosing subgraph extraction, which consists of taking a set of real links and a set of fake links (negative sampling) to form the training data.
2. Node information matrix construction, which involves three components: node labels, node embeddings, and node features.
3. GNN training, which takes the node information matrices as input and outputs link likelihoods.
Figure 7 illustrates the procedure.
Figure 7: SEAL steps. Courtesy of [1].
SEAL assigns node labels based on their distances from the two target nodes \(x\) and \(y\). This is double-radius node labeling, which uses the pair \((d(i,x), d(i,y))\). The label of a neighbor node \(i\) is given by:
\[ f(i) = 1 + \min\big(d(i,x), d(i,y)\big) + (d/2)\big[(d/2) + (d\%2) - 1\big] \]

where \(d = d(i,x) + d(i,y)\), and \((d/2)\) and \((d\%2)\) are the integer quotient and remainder of \(d\) divided by 2, respectively.
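As a small illustration (hypothetical helper, not part of the original notes), the labeling function can be written directly from this formula:

```python
# Sketch of double-radius node labeling for a neighbor node i,
# given dx = d(i, x) and dy = d(i, y) to the two target nodes.
def drnl_label(dx: int, dy: int) -> int:
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * ((d // 2) + (d % 2) - 1)

# Example: a node adjacent to x (dx=1) and two hops from y (dy=2)
print(drnl_label(1, 2))  # -> 3
```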
Optional node embeddings can be calculated using another algorithm, such as Node2Vec. Then, they are concatenated with the node features and one-hot encoded labels to build the final node information matrix.
The original SEAL implementation is based on a GCN variant named the Deep Graph Convolutional Neural Network (DGCNN). Several GCN layers compute node embeddings that are then concatenated. A global sort pooling layer sorts these embeddings in a consistent order before feeding them into convolutional layers. Traditional convolutional and dense layers are applied to the sorted graph representations and output a link probability.
Epoch 0 | Loss: 0.7031 | Val AUC: 0.6922 | Val AP: 0.6855
Epoch 1 | Loss: 0.6411 | Val AUC: 0.7725 | Val AP: 0.8044
Epoch 2 | Loss: 0.6129 | Val AUC: 0.7728 | Val AP: 0.8105
Epoch 3 | Loss: 0.6069 | Val AUC: 0.7729 | Val AP: 0.8136
Epoch 4 | Loss: 0.6047 | Val AUC: 0.7676 | Val AP: 0.8070
Epoch 5 | Loss: 0.6033 | Val AUC: 0.7680 | Val AP: 0.8097
Epoch 6 | Loss: 0.6015 | Val AUC: 0.7687 | Val AP: 0.8028
Epoch 7 | Loss: 0.6005 | Val AUC: 0.7653 | Val AP: 0.8002
Epoch 8 | Loss: 0.5980 | Val AUC: 0.7679 | Val AP: 0.7949
Epoch 9 | Loss: 0.5985 | Val AUC: 0.7673 | Val AP: 0.7957
Epoch 10 | Loss: 0.5973 | Val AUC: 0.7678 | Val AP: 0.7980
Epoch 11 | Loss: 0.5966 | Val AUC: 0.7654 | Val AP: 0.7973
Epoch 12 | Loss: 0.5960 | Val AUC: 0.7663 | Val AP: 0.7996
Epoch 13 | Loss: 0.5955 | Val AUC: 0.7659 | Val AP: 0.7977
Epoch 14 | Loss: 0.5943 | Val AUC: 0.7688 | Val AP: 0.8001
Epoch 15 | Loss: 0.5942 | Val AUC: 0.7696 | Val AP: 0.8046
Epoch 16 | Loss: 0.5929 | Val AUC: 0.7620 | Val AP: 0.7848
Epoch 17 | Loss: 0.5938 | Val AUC: 0.7702 | Val AP: 0.8010
Epoch 18 | Loss: 0.5921 | Val AUC: 0.7630 | Val AP: 0.7865
Epoch 19 | Loss: 0.5931 | Val AUC: 0.7671 | Val AP: 0.7988
Epoch 20 | Loss: 0.5916 | Val AUC: 0.7692 | Val AP: 0.8046
Epoch 21 | Loss: 0.5923 | Val AUC: 0.7674 | Val AP: 0.7917
Epoch 22 | Loss: 0.5915 | Val AUC: 0.7655 | Val AP: 0.7962
Epoch 23 | Loss: 0.5908 | Val AUC: 0.7669 | Val AP: 0.7937
Epoch 24 | Loss: 0.5903 | Val AUC: 0.7688 | Val AP: 0.8001
Epoch 25 | Loss: 0.5903 | Val AUC: 0.7686 | Val AP: 0.7992
Epoch 26 | Loss: 0.5907 | Val AUC: 0.7674 | Val AP: 0.7941
Epoch 27 | Loss: 0.5889 | Val AUC: 0.7665 | Val AP: 0.7935
Epoch 28 | Loss: 0.5889 | Val AUC: 0.7683 | Val AP: 0.7976
Epoch 29 | Loss: 0.5894 | Val AUC: 0.7699 | Val AP: 0.7973
Epoch 30 | Loss: 0.5891 | Val AUC: 0.7682 | Val AP: 0.7942
Test AUC: 0.7939 | Test AP 0.7868
Heterogeneous Networks
Heterogeneous networks are networks with different types of nodes and edges. For example, a social network can have users, posts, and comments as nodes, with friendship, like, and comment relations as edges. Heterogeneous networks are also called knowledge graphs. We can define a graph \(G\) with mapping functions \(f_v\) and \(f_e\) as:
\[ G = (V, E, f_v, f_e) \]
where \(V\) is the set of nodes, \(E\) is the set of edges, \(f_v\) is the mapping function for nodes, and \(f_e\) is the mapping function for edges. Mapping functions can be used to define the type of nodes and edges. For example, \(f_v\) can be used to define the type of nodes as user, post, and comment. \(f_e\) can be used to define the type of edges as friendship, like, and comment.
PyTorch Geometric supports heterogeneous networks. Let's implement one using PyG, sticking to the example in reference [1]. The figure below illustrates the heterogeneous network.
Figure 8: Heterogeneous network. Courtesy of [1].
```python
# !pip install torch_geometric
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data['user'].x = torch.Tensor([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])  # [num_users, num_features]
data['game'].x = torch.Tensor([[1, 1], [2, 2]])
data['dev'].x = torch.Tensor([[1], [2]])
data['user', 'follows', 'user'].edge_index = torch.Tensor([[0, 1], [1, 2]])  # [2, num_edges_follows]
data['user', 'plays', 'game'].edge_index = torch.Tensor([[0, 1, 1, 2], [0, 0, 1, 1]])
data['dev', 'develops', 'game'].edge_index = torch.Tensor([[0, 1], [0, 1]])
data['user', 'plays', 'game'].edge_attr = torch.Tensor([[2], [0.5], [10], [12]])
print(data)

from torch import nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.datasets import DBLP
from torch_geometric.nn import GATConv, Linear, to_hetero

# The DBLP computer science bibliography offers a dataset that contains four types of
# nodes: papers (14,328), terms (7,723), authors (4,057), and conferences (20). The goal
# is to correctly classify the authors into four categories: database, data mining,
# artificial intelligence, and information retrieval. The authors' node features are a
# bag-of-words ("0" or "1") of 334 keywords they might have used in their publications.
dataset = DBLP(root='.')
data = dataset[0]
data['conference'].x = torch.zeros(20, 1)  # conference nodes have no features

class GAT(torch.nn.Module):
    def __init__(self, dim_h, dim_out):
        super().__init__()
        self.conv = GATConv((-1, -1), dim_h, add_self_loops=False)
        self.linear = nn.Linear(dim_h, dim_out)

    def forward(self, x, edge_index):
        h = self.conv(x, edge_index).relu()
        h = self.linear(h)
        return h

model = GAT(dim_h=64, dim_out=4)
model = to_hetero(model, data.metadata(), aggr='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data, model = data.to(device), model.to(device)

@torch.no_grad()
def test(mask):
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict)['author'].argmax(dim=-1)
    acc = (pred[mask] == data['author'].y[mask]).sum() / mask.sum()
    return float(acc)

for epoch in range(101):
    model.train()
    optimizer.zero_grad()
    out = model(data.x_dict, data.edge_index_dict)['author']
    mask = data['author'].train_mask
    loss = F.cross_entropy(out[mask], data['author'].y[mask])
    loss.backward()
    optimizer.step()
    if epoch % 20 == 0:
        train_acc = test(data['author'].train_mask)
        val_acc = test(data['author'].val_mask)
        print(f'Epoch: {epoch:>3} | Train Loss: {loss:.4f} | Train Acc: {train_acc*100:.2f}% | '
              f'Val Acc: {val_acc*100:.2f}%')

test_acc = test(data['author'].test_mask)
print(f'Test accuracy: {test_acc*100:.2f}%')
```