I'm trying to reproduce this GitHub project on Topological Data Analysis (TDA) on my machine.
My steps:
- get the best parameters from a cross-validation output
- load my dataset (feature selection already applied)
- extract topological features from the dataset for prediction
- create a Random Forest Classifier model built on the best parameters
- calculate probabilities on the test data
- Feature selection
In order to decide which attributes belong to which group, we created a correlation matrix. From this, we saw that there were two big groups in which player attributes were strongly correlated with each other. Therefore, we decided to split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones. Finally, since the goalkeeper has completely different statistics from the other players, we decided to take into account only the overall rating. Below are the 24 features used for each player:
Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"
From this set of features, the next step was to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defensive ones.
Finally, for each team in a given match, we compute the mean and the standard deviation of the attack and the defense from these stats of the team's players, as well as the best attack and the best defense.
In this way, a match is described by 14 features (GK overall value, best attack, std attack, mean attack, best defense, std defense, and mean defense, for each of the two teams), which place the match in a feature space according to the characteristics of the two teams.
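A minimal sketch of the aggregation described above, using NumPy. The helper names `team_features` and `match_features`, the order of the features, the home/away ordering, and the use of the population standard deviation are all assumptions made for illustration; the project's own implementation is not shown here and may differ.

```python
import numpy as np

def team_features(attack, defense, gk_rating):
    """Summarise one team; attack/defense hold one mean score per outfield player."""
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        gk_rating,                                     # goalkeeper overall rating
        attack.max(), attack.std(), attack.mean(),     # best, std, mean attack
        defense.max(), defense.std(), defense.mean(),  # best, std, mean defense
    ])

def match_features(home, away):
    """Concatenate the 7 features of each team into one 14-dimensional vector."""
    return np.concatenate([team_features(*home), team_features(*away)])

# Made-up scores for ten outfield players per team (illustrative only).
rng = np.random.default_rng(0)
home = (rng.uniform(50, 90, 10), rng.uniform(50, 90, 10), 80.0)
away = (rng.uniform(50, 90, 10), rng.uniform(50, 90, 10), 75.0)
x = match_features(home, away)
print(x.shape)  # (14,)
```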
- Feature extraction
The aim of TDA is to capture the structure of the space underlying the data. In our project we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for this kind of correlation.
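To make "the neighborhood of a data point" concrete, here is a rough sketch that collects the k nearest matches around one match into a point cloud. It is a fixed-k simplification written for illustration: `k_nearest_cloud` is a hypothetical helper, and the project's own `SubSpaceExtraction` step (with its `k_min`, `k_max` and `dist_percentage` parameters) is not shown in this post and may work differently.

```python
import numpy as np

def k_nearest_cloud(points, index, k):
    """Return the k points closest to points[index] (the point itself included)."""
    distances = np.linalg.norm(points - points[index], axis=1)
    nearest = np.argsort(distances)[:k]
    return points[nearest]

rng = np.random.default_rng(0)
matches = rng.normal(size=(200, 14))  # stand-in for 200 matches with 14 features
cloud = k_nearest_cloud(matches, index=0, k=50)
print(cloud.shape)  # (50, 14)
```

Each such point cloud would then be passed to a persistent-homology step to produce one diagram per match.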

def get_best_params():
    # Load the cross-validation output and keep the pipeline and model parameters.
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output
    return top_feat_params, top_model_feat_params

def load_dataset():
    # The last column is the label; the other columns are the features.
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]
    return x_train_with_topo, y_train

def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This also requires the train set.

    x_train:
        The x used in the training phase
    y_train:
        The y used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns the x_test with the topological features.
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)
    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        # Collect the amplitudes for this metric so they can be concatenated below.
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        print(range(0, len(x_test), shift))
        # Shrink the last batch if fewer than `shift` rows remain.
        if i + shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        # Keep the features of this batch; they are stacked after the loop.
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test
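
The batching in `extract_features_for_prediction` can be looked at in isolation: the test set is walked in chunks of at most `shift` rows, and the last chunk shrinks when fewer rows remain. The sketch below only reproduces that index arithmetic; `batch_slices` is a hypothetical helper and not part of the project.

```python
def batch_slices(n_test, shift=10):
    """Yield (start, stop) index pairs covering range(n_test) in chunks of at most `shift`."""
    for start in range(0, n_test, shift):
        yield start, min(start + shift, n_test)

print(list(batch_slices(25)))        # [(0, 10), (10, 20), (20, 25)]
print(len(list(batch_slices(380))))  # 38
```

With the 380 test matches used here, that gives 38 full batches, each of which is appended to the training rows before the next batch is processed.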

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set.

    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams

    Returns the probabilities for each match in the test set.
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df

Working code:
best_pipeline_params, best_model_feat_params = get_best_params()
# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}
pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
#                              SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#                             ('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])
x_train, y_train = load_dataset()
# x_train.shape -> (2565, 19)
# y_train.shape -> (2565,)
# x_train[1] ->
# [ 74.588234 76. 64.805885 62.1 87.524254 84.98428
# 78. 67.87059 64.01667 82.03975 78.33702 84.47059
# 81.5 84. 9.106519 48.51588 11.267342 174.84785
# x_test[1] ->
# [ 88.41176471 69.33333333 63.95882353 55.55 82.60936581
# 75.65425217 70. 70.64705882 64.03333333 86.03288554
# 74.59789773 82.76470588 81. 84. 5.86254644
# 1.01109886 28.72603239 2.42169424 11.58941883 1.19348484
# 133.63215683 11.35063377 54.73722181 20.60696337]
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)
# x_test.shape -> (380, 24)
rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids) # <-- breaks here
compute_final_standings(matches_probabilities, 'premier league')
But I'm getting the error:
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
Full Traceback:
  File "FootballTDA.py", line 91, in <module>
    matches_probabilities = get_probabilities(rf_model, x_test, team_ids)
  File "/Volumes/Dados/Documents/Code/Apps/gato_mestre/football-tda-master/notebook_functions.py", line 156, in get_probabilities
    prob_pred = model.predict_proba(x_test)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 674, in predict_proba
    X = self._validate_X_predict(X)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 422, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 403, in _validate_X_predict
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/base.py", line 437, in _validate_data
    self._check_n_features(X, reset=reset)
  File "/Users/vitorpatalano/anaconda2/envs/ds/lib/python3.7/site-packages/sklearn/base.py", line 366, in _check_n_features
    f"X has {n_features} features, but {self.__class__.__name__} "
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
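
The counts in the error are consistent with the shapes printed above. Both arrays start with the same 14 match features, so `x_train` carries 5 topological columns and `x_test` carries 10. One possible reading, which is an assumption and not verified against the library, is that the stored training set holds one amplitude per metric, while the feature extraction run locally returns one amplitude per metric and per homology dimension (two dimensions, hence two values per metric). The sketch below only reproduces that arithmetic and adds an early shape check; `n_columns` and `check_same_width` are hypothetical helpers.

```python
import numpy as np

N_BASE = 14    # hand-crafted match features
N_METRICS = 5  # bottleneck, wasserstein, landscape, betti, heat

def n_columns(values_per_metric):
    """Total width when each metric contributes `values_per_metric` columns."""
    return N_BASE + N_METRICS * values_per_metric

def check_same_width(x_train, x_test):
    """Raise early, reporting both widths, if the column counts differ."""
    if x_train.shape[1] != x_test.shape[1]:
        raise ValueError(
            f"x_train has {x_train.shape[1]} columns, x_test has {x_test.shape[1]}"
        )

print(n_columns(1), n_columns(2))  # 19 24

x_train = np.zeros((2565, n_columns(1)))
x_test = np.zeros((380, n_columns(2)))
try:
    check_same_width(x_train, x_test)
except ValueError as err:
    print(err)  # x_train has 19 columns, x_test has 24
```

Calling such a check right before `rf_model.fit` and `predict_proba` would surface the mismatch before scikit-learn does.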
How do I fix the mismatch using the code above?
1 - x_train and y_test are not DataFrames but numpy.ndarray objects.
2 - This question is completely reproducible if one clones or downloads the project from the following link: