Tuesday, 19 January 2021

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input

I'm trying to reproduce this GitHub project on Topological Data Analysis (TDA) on my machine.

My steps:

  • get best parameters from a cross validation output
  • load my dataset (after feature selection)
  • extract topological features from the dataset for prediction
  • create a Random Forest Classifier model built with the best parameters
  • calculate probabilities on test data

Background:

  1. Feature selection

In order to decide which attributes belong to which group, we created a correlation matrix. From this, we saw that there were two big groups in which player attributes were strongly correlated with each other. Therefore, we decided to split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones. Finally, since the goalkeeper has completely different statistics from the other players, we decided to take into account only the overall rating. The 24 features used for each player are listed below:

Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"

Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"

Goalkeeper: "overall_rating"

Next, for each non-goalkeeper player, we computed the mean of the attack attributes and the mean of the defensive ones.

Finally, for each team in a given match, we computed the mean and the standard deviation of the attack and defense scores of the team's players, as well as the best attack and best defense.

In this way, a match is described by 14 features, 7 per team (GK overall value, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match into a space that reflects the characteristics of the two teams.
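As a rough illustration of the aggregation described above, here is a minimal sketch in plain Python. It is not the project's code: the function names, the input layout (one attack score and one defense score per outfield player) and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def team_features(gk_overall, players):
    """Build the 7 per-team features (hypothetical helper).

    players: list of (attack_score, defense_score) pairs, one per
    non-goalkeeper player, where each score is already the mean of
    that player's attack or defense attributes.
    """
    attack = [a for a, _ in players]
    defense = [d for _, d in players]
    # Order follows the list in the text: GK, best/std/mean attack,
    # best/std/mean defense. pstdev is an assumption; the project may
    # use the sample standard deviation instead.
    return [
        gk_overall,
        max(attack), pstdev(attack), mean(attack),
        max(defense), pstdev(defense), mean(defense),
    ]


def match_features(home, away):
    """Concatenate both teams' features into one 14-value match vector."""
    return team_features(*home) + team_features(*away)


# Toy example with two outfield players per team
home = (80, [(70, 60), (80, 50)])
away = (75, [(65, 72), (75, 68)])
features = match_features(home, away)
print(len(features))
```

With 7 values per team, the concatenated vector has the 14 features that make up the first 14 columns of the dataset.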


  2. Feature extraction

The aim of TDA is to capture the structure of the space underlying the data. In our project we assume that the neighbourhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for this kind of correlation.
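To make the idea of a "neighbourhood of a data point" concrete, the sketch below picks the k nearest points to a given match vector using Euclidean distance. This is only an illustration under my own assumptions: the helper name is made up, and the project's SubSpaceExtraction step (with its k_min, k_max and dist_percentage parameters) may select neighbours differently.

```python
def local_point_cloud(points, center_index, k):
    """Return the k points closest to points[center_index] (hypothetical helper).

    The center itself is included, since its distance to itself is 0.
    """
    center = points[center_index]

    def squared_distance(p):
        # Squared Euclidean distance is enough for ranking neighbours
        return sum((a - b) ** 2 for a, b in zip(p, center))

    order = sorted(range(len(points)), key=lambda i: squared_distance(points[i]))
    return [points[i] for i in order[:k]]


# Toy 2-D example; real match vectors would have 14 coordinates
points = [(0, 0), (1, 0), (5, 5), (0, 2)]
cloud = local_point_cloud(points, center_index=0, k=3)
print(cloud)
```

A point cloud like this is the kind of input from which a persistence diagram can then be computed.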


Methods:

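# Imports assumed from the calls used below; adjust to match the project's own modules
import numpy as np
import pandas as pd
from pandas import read_pickle
from openml.datasets import get_dataset
from gtda.diagrams import Amplitude
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm

def get_best_params():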
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output

    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]

    return x_train_with_topo, y_train


def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The y used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST',y_test.shape)

    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)

    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        # The last batch may be shorter than `shift`
        if i+shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams
    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df

Working code:

best_pipeline_params, best_model_feat_params = get_best_params()

# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}

pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
#              SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#             ('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])

x_train, y_train = load_dataset()

# x_train.shape ->  (2565, 19)
# y_train.shape -> (2565,)

# x_train[1] ->
# [ 74.588234  76.        64.805885  62.1       87.524254  84.98428
#   78.        67.87059   64.01667   82.03975   78.33702   84.47059
#   81.5       84.         9.106519  48.51588   11.267342 174.84785
#   10.968423]

# x_test[1] (computed below) ->
# [ 88.41176471  69.33333333  63.95882353  55.55        82.60936581
#   75.65425217  70.          70.64705882  64.03333333  86.03288554
#   74.59789773  82.76470588  81.          84.           5.86254644
#    1.01109886  28.72603239   2.42169424  11.58941883   1.19348484
#  133.63215683  11.35063377  54.73722181  20.60696337]

x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

# x_test.shape -> (380, 24)

rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')

But I'm getting the error:

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.

Full Traceback:

  File "FootballTDA.py", line 91, in <module>
    matches_probabilities = get_probabilities(rf_model, x_test, team_ids)
  File "/Volumes/Dados/Documents/Code/Apps/gato_mestre/football-tda-master/notebook_functions.py", line 156, in get_probabilities
    prob_pred = model.predict_proba(x_test)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 674, in predict_proba
    X = self._validate_X_predict(X)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 422, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 403, in _validate_X_predict
    reset=False)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/base.py", line 437, in _validate_data
    self._check_n_features(X, reset=reset)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/base.py", line 366, in _check_n_features
    f"X has {n_features} features, but {self.__class__.__name__} "
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
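
The shapes above can be broken down with simple arithmetic. The sketch below is not part of the project. It only assumes that the first 14 columns are the match features (as the slice x_train[:, :14] suggests) and that the remaining columns are split evenly across the five amplitude metrics. The helper name is made up.

```python
N_BASE = 14  # match features, per the x_train[:, :14] slice
METRICS = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']


def topo_columns_per_metric(n_total_columns, n_base=N_BASE, n_metrics=len(METRICS)):
    """Return how many topological columns each metric contributes (hypothetical helper).

    Assumes every metric contributes the same number of columns.
    """
    extra = n_total_columns - n_base
    if extra < 0 or extra % n_metrics:
        raise ValueError(
            f"{extra} extra columns cannot be split evenly over {n_metrics} metrics"
        )
    return extra // n_metrics


train_per_metric = topo_columns_per_metric(19)  # x_train.shape[1]
test_per_metric = topo_columns_per_metric(24)   # x_test.shape[1]
print(train_per_metric, test_per_metric)
```

If that reading is right, the training set carries one amplitude column per metric while my extracted test set carries two. I have not confirmed where the second column comes from.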

How do I fix the mismatch using the code above?


NOTES:

1 - x_train and y_test are numpy.ndarray objects, not DataFrames

2 - This question is completely reproducible if one clones or downloads the project from the following link:

Github Link


