Motivation
As a part of this blog series, this time we’ll be looking at a spectral convolution technique introduced in the paper by M. Defferrard, X. Bresson, and P. Vandergheynst, on “Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering”.
As mentioned in our previous blog on A Review : Graph Convolutional Networks (GCN), the spatial convolution and pooling operations are well-defined only for the Euclidean domain. Hence, we cannot apply the convolution directly on the irregular structured data such as graphs.
The technique proposed in this paper provide us with a way to perform convolution on graph like data, for which they used convolution theorem. According to which, Convolution in spatial domain is equivalent to multiplication in Fourier domain. Hence, instead of performing convolution explicitly in the spatial domain, we will transform the graph data and the filter into Fourier domain. Do element-wise multiplication and the result is converted back to spatial domain by performing inverse Fourier transform. Following figure illustrates the proposed technique:
But How to Take This Fourier Transform?
As mentioned we have to take a fourier transform of graph signal. In spectral graph theory, the important operator used for Fourier analysis of graph is the Laplacian operator. For the graph $G=(V,E)$, with set of vertices $V$ of size $n$ and set of edges $E$. The Laplacian is given by
$$Δ=D−A$$
where $D$ denotes the diagonal degree matrix and $A$ denotes the adjacency matrix of the graph.
When we do eigen-decomposition of the Laplacian, we get the orthonormal eigenvectors, as the Laplacian is real symmetric positive semi-definite matrix (side note: positive semidefinite matrices have orthogonal eigenvectors and symmetric matrix has real eigenvalues). These eigenvectors are denoted by ${ϕ_l}^n_{l=0}$ and also called as Fourier modes. The corresponding eigenvalues ${λ_l}^n_{l=0}$ acts as frequencies of the graph.
The Laplacian can be diagonalized by the Fourier basis.
$$Δ=ΦΛΦ^T$$
where, $Φ={ϕ_l}^n_{l=0}$ is a matrix with eigenvectors as columns and $Λ$ is a diagonal matrix of eigenvalues.
Now the graph can be transformed to Fourier domain just by multiplying by the Fourier basis. Hence, the Fourier transform of graph signal $x:V→R$ which is defined on nodes of the graph $x∈R^n$ is given by:
$\hat{x}=Φ^Tx$, where $\hat{x}$ denotes the graph Fourier transform. Hence, the task of transforming the graph signal to Fourier domain is nothing but the matrix-vector multiplication.
Similarly, the inverse graph Fourier transform is given by:
$x=Φ\hat{x}$.
This formulation of Fourier transform on graph gives us the required tools to perform convolution on graphs.
Filtering of Signals on Graph
As we now have the two necessary tools to define convolution on non-Euclidean domain:
-
Way to transform graph to Fourier domain.
-
Convolution in Fourier domain, the convolution operation between graph signal $x$ and filter $g$ is given by the graph convolution of the input signal $x$ with a filter $g∈R^n$ defined as:
$x∗_Gg=ℱ^{−1}(ℱ(x)⊙ℱ(g))=Φ(Φ^Tx⊙Φ^Tg)$,
where $⊙$ denotes the element-wise product. If we denote a filter as $g_θ=diag(Φ^Tg)$, then the spectral graph convolution is simplified as $x∗_Gg_θ=Φg_θΦ^Tx$
Why can't we go forward with this scheme?
All spectral-based ConvGNNs follow this definition. But, this method has three major problems:
-
The number of filter parameters to learn depends on the dimensionality of the input which translates into O(n) complexity and filter is non-parametric.
-
The filters are not localized i.e. filters learnt for graph considers the entire graph, unlike traditional CNN which takes only nearby local pixels to compute convolution.
-
The algorithm needs to calculate the eigen-decomposition explicitly and multiply signal with Fourier basis as there is no Fast Fourier Transform algorithm defined for graphs, hence the computation is $O(n^2)$. (Fast Fourier Transform defined for Euclidean data has $O(nlogn)$ complexity)
Polynomial Parametrization of Filters
To overcome these problems they used an polynomial approximation to parametrize the filter.
Now, filter is of the form of:
$g_θ(Λ) =\sum_{k=0}^{K-1}θ_kΛ_k$, where the parameter $θ∈R^K$ is a vector of polynomial coefficients.
These spectral filters represented by $Kth$-order polynomials of the Laplacian are exactly $K$-localized. Besides, their learning complexity is $O(K)$, the support size of the filter, and thus the same complexity as classical CNNs.
Is everything fixed now?
No, the cost to filter a signal is still high with $O(n^2)$ operations because of the multiplication with the Fourier basis U. (calculating the eigen-decomposition explicitly and multiply signal with Fourier basis)
To bypass this problem, the authors parametrize $g_θ(Δ)$ as a polynomial function that can be computed recursively from $Δ$. One such polynomial, traditionally used in Graph Signal Processing to approximate kernels, is the Chebyshev expansion. The Chebyshev polynomial $T_k(x)$ of order $k$ may be computed by the stable recurrence relation $T_k(x) = 2xT_{k−1}(x)−T_{k−2}(x)$ with $T_0=1$ and $T_1=x$.
The spectral filter is now given by a truncated Chebyshev polynomial:
$$g_θ(\barΔ)=Φg(\barΛ)Φ^T=\sum_{k=0}^{K-1}θ_kT_k(\barΔ)$$
where, $Θ∈R^K$ now represents a vector of the Chebyshev coefficients, the $\barΔ$ denotes the rescaled $Δ$. (This rescaling is necessary as the Chebyshev polynomial form orthonormal basis in the interval [-1,1] and the eigenvalues of original Laplacian lies in the interval $[0,λ_{max}]$). Scaling is done as $\barΔ= 2Δ/λ_{max}−I_n$.
The filtering operation can now be written as $y=g_θ(Δ)x=\sum_{k=0}^{K-1}θ_kT_k(\barΔ)x$, where, $x_{i,k}$ are the input feature maps, $Θ_k$ are the trainable parameters.
Pooling Operation
In case of images, the pooling operation consists of taking a fixed size patch of pixels, say 2x2, and keeping only the pixel with max value (assuming you apply max pooling) and discarding the other pixels from the patch. Similar concept of pooling can be applied to graphs.
Defferrard et al. address this issue by using the coarsening phase of the Graclus multilevel clustering algorithm. Graclus’ greedy rule consists, at each coarsening level, in picking an unmarked vertex $i$ and matching it with one of its unmarked neighbors $j$ that maximizes the local normalized cut $Wij(1/di+ 1/dj)$. The two matched vertices are then marked and the coarsened weights are set as the sum of their weights. The matching is repeated until all nodes have been explored. This is an very fast coarsening scheme which divides the number of nodes by approximately two from one level to the next coarser level. After coarsening, the nodes of the input graph and its coarsened version are rearranged into a balanced binary tree. Arbitrarily aggregating the balanced binary tree from bottom to top will arrange similar nodes together. Pooling such a rearranged signal is much more efficient than pooling the original. The following figure shows the example of graph coarsening and pooling.
Implementing ChebNET in PyTorch
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## Imports
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import torch.nn as nn
import collections
import time
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
import sys
import os
|
1
2
3
4
5
6
7
8
9
10
|
if torch.cuda.is_available():
print('cuda available')
dtypeFloat = torch.cuda.FloatTensor
dtypeLong = torch.cuda.LongTensor
torch.cuda.manual_seed(1)
else:
print('cuda not available')
dtypeFloat = torch.FloatTensor
dtypeLong = torch.LongTensor
torch.manual_seed(1)
|
cuda available
Data Prepration
1
2
3
4
5
6
7
8
9
|
# load data in folder datasets
mnist = input_data.read_data_sets('datasets', one_hot=False)
train_data = mnist.train.images.astype(np.float32)
val_data = mnist.validation.images.astype(np.float32)
test_data = mnist.test.images.astype(np.float32)
train_labels = mnist.train.labels
val_labels = mnist.validation.labels
test_labels = mnist.test.labels
|
1
2
3
4
5
6
7
8
9
10
11
12
|
from grid_graph import grid_graph
from coarsening import coarsen
from coarsening import lmax_L
from coarsening import perm_data
from coarsening import rescale_L
# Construct graph
t_start = time.time()
grid_side = 28
number_edges = 8
metric = 'euclidean'
A = grid_graph(grid_side,number_edges,metric) # create graph of Euclidean grid
|
nb edges: 6396
1
2
3
|
# Compute coarsened graphs
coarsening_levels = 4
L, perm = coarsen(A, coarsening_levels)
|
Heavy Edge Matching coarsening with Xavier version
Layer 0: M_0 = |V| = 976 nodes (192 added), |E| = 3198 edges
Layer 1: M_1 = |V| = 488 nodes (83 added), |E| = 1619 edges
Layer 2: M_2 = |V| = 244 nodes (29 added), |E| = 794 edges
Layer 3: M_3 = |V| = 122 nodes (7 added), |E| = 396 edges
Layer 4: M_4 = |V| = 61 nodes (0 added), |E| = 194 edges
1
2
3
4
5
|
# Compute max eigenvalue of graph Laplacians
lmax = []
for i in range(coarsening_levels):
lmax.append(lmax_L(L[i]))
print('lmax: ' + str([lmax[i] for i in range(coarsening_levels)]))
|
lmax: [1.3857538, 1.3440963, 1.1994357, 1.0239158]
1
2
3
4
5
6
7
8
9
10
11
|
# Reindex nodes to satisfy a binary tree structure
train_data = perm_data(train_data, perm)
val_data = perm_data(val_data, perm)
test_data = perm_data(test_data, perm)
print(train_data.shape)
print(val_data.shape)
print(test_data.shape)
print('Execution time: {:.2f}s'.format(time.time() - t_start))
del perm
|
(55000, 976)
(5000, 976)
(10000, 976)
Execution time: 4.18s
Model
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
|
# class definitions
class my_sparse_mm(torch.autograd.Function):
"""
Implementation of a new autograd function for sparse variables,
called "my_sparse_mm", by subclassing torch.autograd.Function
and implementing the forward and backward passes.
"""
def forward(self, W, x): # W is SPARSE
self.save_for_backward(W, x)
y = torch.mm(W, x)
return y
def backward(self, grad_output):
W, x = self.saved_tensors
grad_input = grad_output.clone()
grad_input_dL_dW = torch.mm(grad_input, x.t())
grad_input_dL_dx = torch.mm(W.t(), grad_input )
return grad_input_dL_dW, grad_input_dL_dx
class Graph_ConvNet_LeNet5(nn.Module):
def __init__(self, net_parameters):
print('Graph ConvNet: LeNet5')
super(Graph_ConvNet_LeNet5, self).__init__()
# parameters
D, CL1_F, CL1_K, CL2_F, CL2_K, FC1_F, FC2_F = net_parameters
FC1Fin = CL2_F*(D//16)
# graph CL1
self.cl1 = nn.Linear(CL1_K, CL1_F)
Fin = CL1_K; Fout = CL1_F;
scale = np.sqrt( 2.0/ (Fin+Fout) )
self.cl1.weight.data.uniform_(-scale, scale)
self.cl1.bias.data.fill_(0.0)
self.CL1_K = CL1_K; self.CL1_F = CL1_F;
# graph CL2
self.cl2 = nn.Linear(CL2_K*CL1_F, CL2_F)
Fin = CL2_K*CL1_F; Fout = CL2_F;
scale = np.sqrt( 2.0/ (Fin+Fout) )
self.cl2.weight.data.uniform_(-scale, scale)
self.cl2.bias.data.fill_(0.0)
self.CL2_K = CL2_K; self.CL2_F = CL2_F;
# FC1
self.fc1 = nn.Linear(FC1Fin, FC1_F)
Fin = FC1Fin; Fout = FC1_F;
scale = np.sqrt( 2.0/ (Fin+Fout) )
self.fc1.weight.data.uniform_(-scale, scale)
self.fc1.bias.data.fill_(0.0)
self.FC1Fin = FC1Fin
# FC2
self.fc2 = nn.Linear(FC1_F, FC2_F)
Fin = FC1_F; Fout = FC2_F;
scale = np.sqrt( 2.0/ (Fin+Fout) )
self.fc2.weight.data.uniform_(-scale, scale)
self.fc2.bias.data.fill_(0.0)
# nb of parameters
nb_param = CL1_K* CL1_F + CL1_F # CL1
nb_param += CL2_K* CL1_F* CL2_F + CL2_F # CL2
nb_param += FC1Fin* FC1_F + FC1_F # FC1
nb_param += FC1_F* FC2_F + FC2_F # FC2
print('nb of parameters=',nb_param,'\n')
def init_weights(self, W, Fin, Fout):
scale = np.sqrt( 2.0/ (Fin+Fout) )
W.uniform_(-scale, scale)
return W
def graph_conv_cheby(self, x, cl, L, lmax, Fout, K):
# parameters
# B = batch size
# V = nb vertices
# Fin = nb input features
# Fout = nb output features
# K = Chebyshev order & support size
B, V, Fin = x.size(); B, V, Fin = int(B), int(V), int(Fin)
# rescale Laplacian
lmax = lmax_L(L)
L = rescale_L(L, lmax)
# convert scipy sparse matric L to pytorch
L = L.tocoo()
indices = np.column_stack((L.row, L.col)).T
indices = indices.astype(np.int64)
indices = torch.from_numpy(indices)
indices = indices.type(torch.LongTensor)
L_data = L.data.astype(np.float32)
L_data = torch.from_numpy(L_data)
L_data = L_data.type(torch.FloatTensor)
L = torch.sparse.FloatTensor(indices, L_data, torch.Size(L.shape))
L = Variable( L , requires_grad=False)
if torch.cuda.is_available():
L = L.cuda()
# transform to Chebyshev basis
x0 = x.permute(1,2,0).contiguous() # V x Fin x B
x0 = x0.view([V, Fin*B]) # V x Fin*B
x = x0.unsqueeze(0) # 1 x V x Fin*B
def concat(x, x_):
x_ = x_.unsqueeze(0) # 1 x V x Fin*B
return torch.cat((x, x_), 0) # K x V x Fin*B
if K > 1:
x1 = my_sparse_mm()(L,x0) # V x Fin*B
x = torch.cat((x, x1.unsqueeze(0)),0) # 2 x V x Fin*B
for k in range(2, K):
x2 = 2 * my_sparse_mm()(L,x1) - x0
x = torch.cat((x, x2.unsqueeze(0)),0) # M x Fin*B
x0, x1 = x1, x2
x = x.view([K, V, Fin, B]) # K x V x Fin x B
x = x.permute(3,1,2,0).contiguous() # B x V x Fin x K
x = x.view([B*V, Fin*K]) # B*V x Fin*K
# Compose linearly Fin features to get Fout features
x = cl(x) # B*V x Fout
x = x.view([B, V, Fout]) # B x V x Fout
return x
# Max pooling of size p. Must be a power of 2.
def graph_max_pool(self, x, p):
if p > 1:
x = x.permute(0,2,1).contiguous() # x = B x F x V
x = nn.MaxPool1d(p)(x) # B x F x V/p
x = x.permute(0,2,1).contiguous() # x = B x V/p x F
return x
else:
return x
def forward(self, x, d, L, lmax):
# graph CL1
x = x.unsqueeze(2) # B x V x Fin=1
x = self.graph_conv_cheby(x, self.cl1, L[0], lmax[0], self.CL1_F, self.CL1_K)
x = F.relu(x)
x = self.graph_max_pool(x, 4)
# graph CL2
x = self.graph_conv_cheby(x, self.cl2, L[2], lmax[2], self.CL2_F, self.CL2_K)
x = F.relu(x)
x = self.graph_max_pool(x, 4)
# FC1
x = x.view(-1, self.FC1Fin)
x = self.fc1(x)
x = F.relu(x)
x = nn.Dropout(d)(x)
# FC2
x = self.fc2(x)
return x
def loss(self, y, y_target, l2_regularization):
loss = nn.CrossEntropyLoss()(y,y_target)
l2_loss = 0.0
for param in self.parameters():
data = param* param
l2_loss += data.sum()
loss += 0.5* l2_regularization* l2_loss
return loss
def update(self, lr):
update = torch.optim.SGD( self.parameters(), lr=lr, momentum=0.9 )
return update
def update_learning_rate(self, optimizer, lr):
for param_group in optimizer.param_groups:
param_group['lr'] = lr
return optimizer
def evaluation(self, y_predicted, test_l):
_, class_predicted = torch.max(y_predicted.data, 1)
return 100.0* (class_predicted == test_l).sum()/ y_predicted.size(0)
|
1
2
3
4
5
6
7
8
9
|
# network parameters
D = train_data.shape[1]
CL1_F = 32
CL1_K = 25
CL2_F = 64
CL2_K = 25
FC1_F = 512
FC2_F = 10
net_parameters = [D, CL1_F, CL1_K, CL2_F, CL2_K, FC1_F, FC2_F]
|
1
2
3
4
5
|
# instantiate the object net of the class
net = Graph_ConvNet_LeNet5(net_parameters)
if torch.cuda.is_available():
net.cuda()
print(net)
|
Graph ConvNet: LeNet5
nb of parameters= 2056586
Graph_ConvNet_LeNet5(
(cl1): Linear(in_features=25, out_features=32, bias=True)
(cl2): Linear(in_features=800, out_features=64, bias=True)
(fc1): Linear(in_features=3904, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=10, bias=True)
)
1
2
|
# Weights
L_net = list(net.parameters())
|
Hyper parameters setting
1
2
3
4
5
6
7
8
9
|
# learning parameters
learning_rate = 0.05
dropout_value = 0.5
l2_regularization = 5e-4
batch_size = 100
num_epochs = 20
train_size = train_data.shape[0]
nb_iter = int(num_epochs * train_size) // batch_size
print('num_epochs=',num_epochs,', train_size=',train_size,', nb_iter=',nb_iter)
|
num_epochs= 20 , train_size= 55000 , nb_iter= 11000
Training & Evaluation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
|
# Optimizer
global_lr = learning_rate
global_step = 0
decay = 0.95
decay_steps = train_size
lr = learning_rate
optimizer = net.update(lr)
# loop over epochs
indices = collections.deque()
for epoch in range(num_epochs): # loop over the dataset multiple times
# reshuffle
indices.extend(np.random.permutation(train_size)) # rand permutation
# reset time
t_start = time.time()
# extract batches
running_loss = 0.0
running_accuray = 0
running_total = 0
while len(indices) >= batch_size:
# extract batches
batch_idx = [indices.popleft() for i in range(batch_size)]
train_x, train_y = train_data[batch_idx,:], train_labels[batch_idx]
train_x = Variable( torch.FloatTensor(train_x).type(dtypeFloat) , requires_grad=False)
train_y = train_y.astype(np.int64)
train_y = torch.LongTensor(train_y).type(dtypeLong)
train_y = Variable( train_y , requires_grad=False)
# Forward
y = net.forward(train_x, dropout_value, L, lmax)
loss = net.loss(y,train_y,l2_regularization)
loss_train = loss.data
# Accuracy
acc_train = net.evaluation(y,train_y.data)
# backward
loss.backward()
# Update
global_step += batch_size # to update learning rate
optimizer.step()
optimizer.zero_grad()
# loss, accuracy
running_loss += loss_train
running_accuray += acc_train
running_total += 1
# print
if not running_total%100: # print every x mini-batches
print('epoch= %d, i= %4d, loss(batch)= %.4f, accuray(batch)= %.2f' % (epoch+1, running_total, loss_train, acc_train))
# print
t_stop = time.time() - t_start
print('epoch= %d, loss(train)= %.3f, accuracy(train)= %.3f, time= %.3f, lr= %.5f' %
(epoch+1, running_loss/running_total, running_accuray/running_total, t_stop, lr))
# update learning rate
lr = global_lr * pow( decay , float(global_step// decay_steps) )
optimizer = net.update_learning_rate(optimizer, lr)
# Test set
running_accuray_test = 0
running_total_test = 0
indices_test = collections.deque()
indices_test.extend(range(test_data.shape[0]))
t_start_test = time.time()
while len(indices_test) >= batch_size:
batch_idx_test = [indices_test.popleft() for i in range(batch_size)]
test_x, test_y = test_data[batch_idx_test,:], test_labels[batch_idx_test]
test_x = Variable( torch.FloatTensor(test_x).type(dtypeFloat) , requires_grad=False)
y = net.forward(test_x, 0.0, L, lmax)
test_y = test_y.astype(np.int64)
test_y = torch.LongTensor(test_y).type(dtypeLong)
test_y = Variable( test_y , requires_grad=False)
acc_test = net.evaluation(y,test_y.data)
running_accuray_test += acc_test
running_total_test += 1
t_stop_test = time.time() - t_start_test
print(' accuracy(test) = %.3f %%, time= %.3f' % (running_accuray_test / running_total_test, t_stop_test))
|
You can find our implementation made using PyTorch at ChebNet.
References