Saturday, 3 June 2023

A data resampler based on support vectors

I am implementing a data resampler based on support vectors. The idea is to fit an SVM classifier, get the support vectors of each class, and then balance the data by keeping only the data points nearest to each class's support vectors, so that the classes end up with an equal number of examples. All other points (those far from the support vectors) are discarded.
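The selection step described above can be sketched in plain Python. This is only a minimal illustration, not the resampler itself: `nearest_to_support_vectors` is a hypothetical helper, the points are toy values, and the support vectors are hard-coded here, whereas in practice they would come from a fitted `SVC`.

```python
from math import dist

def nearest_to_support_vectors(points, support_vectors, k):
    """Return the k points closest to any support vector (hypothetical helper)."""
    # score each point by the distance to its nearest support vector
    scored = [(min(dist(p, sv) for sv in support_vectors), idx)
              for idx, p in enumerate(points)]
    scored.sort()  # ascending, so the nearest points come first
    keep = sorted(idx for _, idx in scored[:k])  # restore original order
    return [points[idx] for idx in keep]

majority = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]
minority = [(2.0, 2.0), (2.5, 2.0)]
support_vectors = [(1.5, 1.5)]  # toy value, assumed for illustration

# keep as many majority points as there are minority points
kept = nearest_to_support_vectors(majority, support_vectors, k=len(minority))
print(kept)
```

After this step the two classes have the same number of examples, and the majority points that were dropped are the ones farthest from the support vectors.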

I am doing this in a multi-class setting, so I need to resample the classes pairwise (i.e. one-against-one). I know that in sklearn's SVM "...internally, one-vs-one (‘ovo’) is always used as a multi-class strategy to train models". However, since I am not sure how to change the training behaviour of sklearn's SVM so that each pair is resampled during training, I implemented a custom class to do that.
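For reference, the pairwise (one-against-one) split can be sketched with the standard library alone. The labels below are toy values assumed for illustration. With five classes there are ten pairs, and each pair sees only the examples of its two classes.

```python
from collections import Counter
from itertools import combinations

# toy labels for 5 classes (assumed values, not the real data)
y = [0] * 6 + [1] * 3 + [2] * 2 + [3] * 4 + [4] * 1

# every unordered pair of distinct classes
pairs = list(combinations(sorted(set(y)), 2))

for a, b in pairs:
    # restrict the labels to the two classes of this pair
    pair_labels = [label for label in y if label in (a, b)]
    print((a, b), Counter(pair_labels))
```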

Currently, the custom class runs. However, my implementation has a bug (a logic error): each pair of class labels gets changed into 0 and 1, which messes up my class labels. The code below illustrates this with an MWE:

# required imports
import random
from collections import Counter
from math import dist
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import check_random_state
from sklearn.multiclass import OneVsOneClassifier
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

np.random.seed(7)
random.seed(7)

# resampler class
class DataUndersampler():
  def __init__(self, random_state=None):
    self.random_state = random_state
    print('DataUndersampler()')

  def fit_resample(self, X, y):
    random_state = check_random_state(self.random_state)
    # class distribution
    counter = Counter(y)
    print(f'Original class distribution: {counter}')
    maj_class = counter.most_common()[0][0]
    min_class = counter.most_common()[-1][0]
    # number of minority examples
    num_minority = len(X[ y == min_class])
    #num_majority = len(X[ y == maj_class]) # check on with maj now
    svc = SVC(kernel='rbf', random_state=32)
    svc.fit(X,y)
    # majority class support vectors: support_vectors_ has one row per
    # support vector (not per class), so select rows through svc.support_
    maj_sup_vectors = svc.support_vectors_[y[svc.support_] == maj_class]
    #min_sup_vectors = svc.support_vectors_[y[svc.support_] == min_class] # minority sup vect
    # distance from each majority point to its nearest majority support vector
    distances = []
    for i, x in enumerate(X[y == maj_class]):
      d = min(dist(sv, x) for sv in maj_sup_vectors)
      distances.append((i, d))
    # sort distances (reverse=False -> ascending)
    distances.sort(reverse=False, key=lambda tup: tup[1])
    index = [i for i, d in distances][:num_minority] 
    X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
    y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))
    print(f"Resampled class distribution ('ovo'): {Counter(y_ds)} \n")

    return X_ds, y_ds

So, working with this:

# synthetic data
X, y = make_classification(n_samples=10_000, n_classes=5, weights=[22.6, 3.7, 16.4, 51.9],
                           n_informative=4)

# actual class distribution
Counter(y)
Counter({0: 9924, 1: 22, 2: 15, 3: 13, 4: 26})

resampler = DataUndersampler(random_state=234)
rf_clf = model = RandomForestClassifier()

pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)
DataUndersampler()

classifier.fit(X, y)

Original class distribution: Counter({0: 9924, 1: 22})  
Resampled class distribution ('ovo'): Counter({0: 22, 1: 22}) 

Original class distribution: Counter({0: 9924, 1: 15}) # should be -> {0: 9924, 2: 15}
Resampled class distribution ('ovo'): Counter({0: 15, 1: 15}) # should be -> {0: 15, 2: 15}

Original class distribution: Counter({0: 9924, 1: 13}) # should be -> {0: 9924, 3: 13}
Resampled class distribution ('ovo'): Counter({0: 13, 1: 13}) # should be -> {0: 13, 3: 13}

Original class distribution: Counter({0: 9924, 1: 26}) # should be -> {0: 9924, 4: 26}
Resampled class distribution ('ovo'): Counter({0: 26, 1: 26}) # should be -> {0: 26, 4: 26}

Original class distribution: Counter({0: 22, 1: 15}) # should be -> {1: 22, 2: 15}
Resampled class distribution ('ovo'): Counter({0: 15, 1: 15}) # should be -> {1: 15, 2: 15}

Original class distribution: Counter({0: 22, 1: 13}) # should be -> {1: 22, 3: 13}
Resampled class distribution ('ovo'): Counter({0: 13, 1: 13}) # should be -> {1: 13, 3: 13}

Original class distribution: Counter({1: 26, 0: 22}) # should be -> {4: 26, 1: 22}
Resampled class distribution ('ovo'): Counter({1: 22, 0: 22}) # should be -> {4: 22, 1: 22}

Original class distribution: Counter({0: 15, 1: 13}) # should be -> {2: 15, 3: 13}
Resampled class distribution ('ovo'): Counter({0: 13, 1: 13}) # should be -> {2: 13, 3: 13}

Original class distribution: Counter({1: 26, 0: 15}) # should be -> {4: 26, 2: 15}
Resampled class distribution ('ovo'): Counter({1: 15, 0: 15}) # should be -> {4: 15, 2: 15}

Original class distribution: Counter({1: 26, 0: 13}) # should be -> {4: 26, 3: 13}
Resampled class distribution ('ovo'): Counter({1: 13, 0: 13}) # should be -> {4: 13, 3: 13}
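The 0/1 labels in this output are consistent with each class pair being relabelled before it reaches the pipeline, which appears to be how scikit-learn's OneVsOneClassifier prepares each binary subproblem. The sketch below only imitates that relabelling in plain Python. `binarize_pair` is a hypothetical stand-in, not scikit-learn's own code, and the labels are toy values.

```python
from collections import Counter

def binarize_pair(y, i, j):
    """Keep only labels i and j, relabelled as 0 and 1 (illustrative stand-in)."""
    return [0 if label == i else 1 for label in y if label in (i, j)]

y = [0] * 5 + [2] * 3 + [4] * 2  # toy labels (assumed values)

# what an estimator fitted on the pair (2, 4) would receive
y_pair = binarize_pair(y, 2, 4)
print(Counter(y_pair))

# the original labels can be recovered only if the pair is known
restored = [(2, 4)[label] for label in y_pair]
print(Counter(restored))
```

Under this assumption, the resampler never sees the original labels at all, so the counts it prints can only ever be keyed by 0 and 1.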

How do I fix this?


