Wednesday, 30 June 2021

How to split the Cora dataset to train a GCN model only on training part?

I am training a GCN (Graph Convolutional Network) on the Cora dataset.

The Cora dataset has the following attributes:

Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True

Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])

Since my code is very long, I have included only the relevant parts here. First, I split the Cora dataset as follows:

import torch

def to_mask(index, size):
    mask = torch.zeros(size, dtype=torch.bool)
    mask[index] = True
    return mask

def cora_splits(data, num_classes):
    indices = []

    for i in range(num_classes):
        # indices of all nodes whose label equals i
        index = (data.y == i).nonzero().view(-1)

        # shuffle them using a random permutation of 0 .. index.size(0) - 1
        index = index[torch.randperm(index.size(0))]

        # indices is a list of tensors and it has a length of 7
        indices.append(index)

    # select 20 nodes from each class for training
    train_index = torch.cat([i[:20] for i in indices], dim=0)

    rest_index = torch.cat([i[20:] for i in indices], dim=0)
    rest_index = rest_index[torch.randperm(len(rest_index))]

    data.train_mask = to_mask(train_index, size=data.num_nodes)
    data.val_mask = to_mask(rest_index[:500], size=data.num_nodes)
    data.test_mask = to_mask(rest_index[500:], size=data.num_nodes)

    return data

The training function is as follows (taken from here with a few modifications):


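import time

import torch.nn.functional as F

def train(model, optimizer, data, epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    # forward pass over the whole graph (all node features and edges)
    output = model(data)
    # only the labels of the training nodes enter the loss
    loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
    acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
    loss_train.backward()
    optimizer.step()

    # validation metrics reuse the training-mode output (dropout still active)
    loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
    acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])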

def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)
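
Note that train computes the validation loss and accuracy from the output produced in training mode, so dropout is still active for those numbers. A minimal sketch of a separate evaluation step follows; the evaluate function, the TinyModel stand-in and the toy data are hypothetical and only illustrate switching to model.eval() under torch.no_grad(), they are not the code I actually ran.

```python
from types import SimpleNamespace

import torch
import torch.nn.functional as F


@torch.no_grad()  # gradients are not needed for evaluation
def evaluate(model, data, mask):
    """Return (loss, accuracy) on the nodes selected by mask, with dropout off."""
    model.eval()  # dropout becomes a no-op in eval mode
    output = model(data)
    loss = F.nll_loss(output[mask], data.y[mask])
    acc = (output[mask].argmax(dim=1) == data.y[mask]).float().mean()
    return loss.item(), acc.item()


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for the GCN: one linear layer, no graph convolution."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, num_classes)

    def forward(self, data):
        x = F.dropout(data.x, p=0.5, training=self.training)
        return F.log_softmax(self.lin(x), dim=-1)


# Toy data standing in for the Cora Data object (assumed attributes: x, y, val_mask)
torch.manual_seed(0)
data = SimpleNamespace(
    x=torch.randn(6, 4),
    y=torch.tensor([0, 1, 2, 0, 1, 2]),
    val_mask=torch.tensor([False, False, True, True, True, False]),
)
model = TinyModel(in_dim=4, num_classes=3)
loss_val, acc_val = evaluate(model, data, data.val_mask)
print(f"val loss {loss_val:.4f}, val acc {acc_val:.4f}")
```

In a training loop this would be called after optimizer.step(); the next call to train switches back to training mode through model.train().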

When I ran my code for 200 epochs over 10 runs, I obtained:

tensor([0.7690, 0.8030, 0.8530, 0.8760, 0.8600, 0.8550, 0.8850, 0.8580, 0.8940, 0.8830])

Val Loss: 0.5974, Test Accuracy: 0.854 ± 0.039

where each value in the tensor is the model's test accuracy for one run, and the mean accuracy over the 10 runs is 0.854 with a standard deviation of 0.039.
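
As a quick check of the summary line, the mean and standard deviation can be recomputed from the ten values with the standard library. This sketch assumes the reported deviation is the sample standard deviation (dividing by n - 1), which is also the default of torch.Tensor.std; the variable names are illustrative.

```python
import statistics

# Test accuracies of the 10 runs, copied from the tensor above
accs = [0.7690, 0.8030, 0.8530, 0.8760, 0.8600,
        0.8550, 0.8850, 0.8580, 0.8940, 0.8830]

mean = statistics.mean(accs)
std = statistics.stdev(accs)  # sample standard deviation (divides by n - 1)

print(f"Test Accuracy: {mean:.3f} ± {std:.3f}")  # Test Accuracy: 0.854 ± 0.039
```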

As can be seen, the accuracy increases substantially from the first run to the tenth, so I think the model is overfitting. One possible cause is that the model sees the test data at training time: the train function calls output = model(data), so the forward pass runs over the whole graph, even though the loss is computed only on the nodes in data.train_mask. What I want is to train the model on only part of the data (something like data[data.train_mask]), but I cannot pass data[data.train_mask] because of the forward function of the GCN model (from this repository):

def forward(self, data):
    x, edge_index = data.x, data.edge_index
    x = F.relu(self.conv1(x, edge_index))
    for conv in self.convs:
        x = F.relu(conv(x, edge_index))
    x = F.relu(self.lin1(x))
    x = F.dropout(x, p=0.5, training=self.training)
    x = self.lin2(x)
    return F.log_softmax(x, dim=-1)

If I pass data[data.train_mask] to the GCN model, the line x, edge_index = data.x, data.edge_index in the forward function above fails, because x and edge_index cannot be retrieved from data[data.train_mask]. I therefore need a way to split the Cora dataset so that I can pass a specific part of it to the model, together with its nodes, edge index and other attributes. How can I do this?
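
To make concrete what "a specific part of the graph" would mean, here is a minimal sketch in plain PyTorch that keeps only the masked nodes, keeps the edges whose two endpoints are both kept, and renumbers the node ids. The function name induced_subgraph and the toy graph are hypothetical; PyTorch Geometric provides torch_geometric.utils.subgraph for the same purpose, which I have not wired in here.

```python
import torch


def induced_subgraph(x, y, edge_index, node_mask):
    """Keep the nodes selected by node_mask and the edges between them."""
    # Map old node ids to new consecutive ids; dropped nodes stay at -1
    new_id = torch.full((node_mask.size(0),), -1, dtype=torch.long)
    new_id[node_mask] = torch.arange(int(node_mask.sum()))

    src, dst = edge_index
    # An edge survives only if both of its endpoints are kept
    keep = node_mask[src] & node_mask[dst]
    sub_edge_index = torch.stack([new_id[src[keep]], new_id[dst[keep]]], dim=0)
    return x[node_mask], y[node_mask], sub_edge_index


# Toy undirected path graph 0 - 1 - 2 - 3 - 4, each edge stored in both directions
x = torch.arange(10, dtype=torch.float).view(5, 2)
y = torch.tensor([0, 1, 0, 1, 0])
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 4],
                           [1, 0, 2, 1, 3, 2, 4, 3]])
train_mask = torch.tensor([True, True, False, True, True])

sub_x, sub_y, sub_edge_index = induced_subgraph(x, y, edge_index, train_mask)
print(sub_x.shape)              # torch.Size([4, 2])
print(sub_edge_index.tolist())  # [[0, 1, 2, 3], [1, 0, 3, 2]]
```

A Data-like object built from sub_x, sub_y and sub_edge_index has the x and edge_index attributes that forward expects. With only 140 training nodes out of 2708, most edges would probably be dropped by such a split, so the resulting training graph may be very sparse.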

Also, any suggestions about k-fold cross-validation would be much appreciated.



from How to split the Cora dataset to train a GCN model only on training part?
