Tuesday, 2 October 2018

In SVC from Sklearn, why is training time not strictly linear in max_iter when the label size is big?

I have been doing an analysis of the relation between training time and the maximum number of iterations in SVC. The data is randomly generated, and I plotted training time against the max_iter of the SVC fit. I checked the logs, and each binary classifier did reach max_iter (I captured all console output, which showed a detailed convergence warning for each binary classifier, and counted them). I assumed training time would be strictly linear in the number of iterations, but when the training data has many labels (say 40), the plot is not linear.

[plot: training time vs max_iter, label_size = 40]

It seems that as the maximum iteration count goes up, each iteration takes slightly less time than before. By contrast, if we change label_size to 2 (so each fit contains only one binary classifier), the line is straight.

[plot: training time vs max_iter, label_size = 2]

What causes that to happen?
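For context on the label-size effect: SVC handles multiclass problems with a one-vs-one scheme, so the number of binary sub-problems grows quadratically in the number of labels. A quick sketch of the counts involved (the helper name is mine):

```python
from math import comb  # Python 3.8+


def n_ovo_classifiers(n_classes):
    # One-vs-one: SVC fits one binary classifier per unordered pair of classes
    return comb(n_classes, 2)


print(n_ovo_classifiers(40))  # 780 binary sub-problems
print(n_ovo_classifiers(2))   # 1 binary sub-problem
```

So with 40 labels, one call to fit runs 780 small binary problems (each on roughly 1000 of the 20000 rows), whereas with 2 labels it runs a single problem on the full dataset.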

Here is my source code:

# -*- coding: utf-8 -*-
import time

import numpy as np
import pandas as pd
from sklearn.svm import SVC


def main(row_size, label_size):
    """Generate a random (X, y) dataset with roughly equal-sized classes."""
    np.random.seed(2019)
    y = np.array([i for i in range(label_size)
                  for j in range(row_size // label_size)])
    if len(y) < row_size:
        # Pad with the last label so len(y) == row_size
        y = np.append(y, [y[-1]] * (row_size - len(y)))
    X = np.random.rand(row_size, 300)
    print(X.shape, y.shape)
    return X, y


def train_svm(X, y, max_iter):
    best_params = {'C': 1}
    clf = SVC(
        C=best_params['C'],
        kernel='linear',
        probability=False,
        class_weight='balanced',
        max_iter=max_iter,
        random_state=2018,
        verbose=True,
    )
    start = time.time()
    clf.fit(X, y)
    end = time.time()
    return end - start


if __name__ == '__main__':
    row_size = 20000
    m_iter = range(10, 401, 20)
    label_size = [40]
    data = {
        'label_size': [],
        'max_iter': [],
        'row_size': [],
        'time': [],
    }
    for it in m_iter:
        for l in label_size:
            X, y = main(row_size, l)
            t = train_svm(X, y, max_iter=it)
            data['label_size'].append(l)
            data['max_iter'].append(it)
            data['row_size'].append(row_size)
            data['time'].append(t)
            # Rewrite the CSV after every run so partial results are kept
            df = pd.DataFrame(data)
            df.to_csv('svc_iter.csv', index=False)
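In case it helps to reproduce the figures above, here is a minimal plotting sketch that reads the svc_iter.csv produced by the script (the function name and output path are my own choices):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_timing(csv_path="svc_iter.csv", out_path="svc_iter.png"):
    """Plot training time against max_iter, one line per label_size."""
    df = pd.read_csv(csv_path)
    for label_size, group in df.groupby("label_size"):
        plt.plot(group["max_iter"], group["time"], marker="o",
                 label="label_size=%d" % label_size)
    plt.xlabel("max_iter")
    plt.ylabel("training time (s)")
    plt.legend()
    plt.savefig(out_path)
```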



