I am using the scikit-learn decision tree classifier to predict the "effort size" of a migration project. I also need to find the features that influence each prediction.
I trained the model and got a hierarchical tree with the features at different nodes.
I assumed the same tree would be used to predict the size when I supply a test record. To my surprise, that does not seem to be the case!
After predicting, I printed the decision_path to see the "features considered in that prediction".
This decision path is completely different from the tree built by the model.
If the tree is not used for predictions, what is the tree for?
How can I use the decision path to get the significant features for that prediction?
If I export the ruleset and use it to trace the decision path, it gives me the wrong features, or at least features that do not match the output of decision_path.
Edit 1
Added generic code. It gives similar output.
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree
# Create tree object
import graphviz
import pydotplus
import collections
file_path = "sample_data_generic.csv"
data = pd.read_csv( file_path )
data.head()
df = data.copy()
cols = df.columns
col_len = len(cols)
features_category = []
for col_index in range( col_len ):
    if df[ cols[col_index] ].dtype == 'object' or df[ cols[col_index] ].dtype == 'float64':
        df[ cols[col_index] ] = df[ cols[col_index] ].astype('category')
        features_category.append( cols[col_index] )
#redefining the variable value as it is throwing some error in the below lines due to the presence of next line char?!
features_category = ['Cloud Provider', 'OS Upgrade Path', 'Target_OS_NAME', 'Target_OS_VERSION', 'os_version']
# create dataframe for target variable
df_target = df['Size']
df.drop('Size', axis=1, inplace=True)
df = pd.get_dummies(df, columns=features_category, dtype='int')
df.head()
df_x_data = df.copy()
df_x_data.head()
y_data = df_target
target_classes = y_data.unique()
target_classes = target_classes.astype('category')
test_size_val = 0.3
x_train, x_test, y_train, y_test = train_test_split(df_x_data, y_data, test_size=test_size_val, random_state=1)
print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
x_train.sort_values(['Comps'], ascending=[True]) #, 'Estimation'
model = tree.DecisionTreeClassifier()
model = model.fit(x_train, y_train)
model.score(x_train, y_train)
dot_data = tree.export_graphviz(model, out_file=None,
                                feature_names=x_train.columns,
                                class_names=target_classes,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
print('graph: ', graph)
colors = ('white','red', 'green')
edges = collections.defaultdict(list)
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
print( edges )
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
graph.write_png('decision_tree_2019_generic.png')
from IPython.display import Image
Image(filename = 'decision_tree_2019_generic.png')
to_predict = x_test[3:4]
model.predict( to_predict )
to_predict.values
applied = model.apply( to_predict )
applied
to_predict
decision_path = model.decision_path( to_predict )
print( decision_path.indices, '\n' )
print( decision_path[:1][:1])
predict_cols = decision_path.indices
predicted_row = to_predict
cols = predicted_row.columns
#print("len of cols: ", len(cols) )
for col in predict_cols:
    print( cols[col], predicted_row[ cols[col] ].values )
Sample data (generated for now):
Cloud Provider,Comps,env,hosts,OS Upgrade Path,Target_OS_NAME,Target_OS_VERSION,Size,os_version
AWS,11,2,3833,Not Direct,Linux,4,M,2
Google Cloud,16,6,4779,Direct,Mac,3,S,1
AWS,18,6,6677,Not Direct,Linux,7,S,8
Google Cloud,34,2,1650,Direct,Windows,5,B,1
AWS,35,6,9569,Direct,Windows,6,M,3
AWS,36,6,7421,Not Direct,Windows,3,B,5
Google Cloud,49,4,3469,Direct,Mac,6,B,1
AWS,54,5,5677,Direct,Mac,4,M,8
But the predicted test data's decision path is: Comps [206] --> env [3] --> hosts [637]
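One likely source of the mismatch, offered as a sketch rather than a confirmed diagnosis: `decision_path(...).indices` holds the ids of the tree nodes a sample passes through, not column positions in the feature matrix. Using those ids to index `to_predict.columns` would list whichever columns happen to sit at those positions, not the features the tree actually split on. The feature tested at each node is stored in `model.tree_.feature`, and the cut-off in `model.tree_.threshold`. The example below illustrates the mapping with a hypothetical helper, `features_on_path`; the column names are borrowed from the post, but the data values and the labelling rule are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: three numeric features named after the post's columns.
rng = np.random.RandomState(0)
feature_names = ["Comps", "env", "hosts"]
X = np.column_stack([
    rng.randint(1, 300, size=200),      # Comps
    rng.randint(1, 7, size=200),        # env
    rng.randint(100, 10000, size=200),  # hosts
])
# Made-up labelling rule so the tree has something to learn.
y = np.where(X[:, 2] > 6000, "B", np.where(X[:, 0] > 150, "M", "S"))

model = DecisionTreeClassifier(random_state=1).fit(X, y)


def features_on_path(model, sample, feature_names):
    """Return (feature name, sample value, direction, threshold) per split node visited."""
    sample = np.asarray(sample).reshape(1, -1)
    node_ids = model.decision_path(sample).indices  # node ids, NOT column indices
    leaf_id = model.apply(sample)[0]                # id of the leaf the sample ends in
    tree = model.tree_
    steps = []
    for node_id in node_ids:
        if node_id == leaf_id:
            continue  # the leaf holds the prediction, not a split
        feat = tree.feature[node_id]    # column index tested at this node
        thr = tree.threshold[node_id]   # split threshold at this node
        value = sample[0, feat]
        direction = "<=" if value <= thr else ">"
        steps.append((feature_names[feat], value, direction, thr))
    return steps


steps = features_on_path(model, X[3], feature_names)
for name, value, direction, thr in steps:
    print(f"{name} = {value} {direction} {thr:.1f}")
```

With the post's variables, the equivalent call would presumably be `features_on_path(model, to_predict.values[0], list(x_train.columns))`, which should list the same splits that appear along one root-to-leaf branch of the exported tree image.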
Thanks in advance
from Is trained "Decision Tree" not used for prediction?
