Saturday 7 September 2019

Funny results from a manual split of test and train data as opposed to Python's splitting function

If I run a simple decision tree regression model on data split with the train_test_split function, I get nice r2 scores and low mse values.

import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

training_data = pandas.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])
y = training_data.iloc[:, 0]
x = training_data.iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
regressor = DecisionTreeRegressor(random_state=0)
# fit the regressor with X and Y data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
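For reference, the scores can be computed as in the sketch below. Since data.csv is not shown here, the sketch assumes a synthetic stand-in with the same column names; the coefficients, noise level, and the fixed random_state on the split are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for data.csv (assumption: the real file is not available).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.uniform(0, 1, size=(1000, 3)), columns=["x1", "x2", "x3"])
target = (3 * features["x1"] + 2 * features["x2"] - features["x3"]
          + rng.normal(0, 0.05, size=1000))

# Random split, as in the snippet above (random_state fixed only for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=0
)

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Score the predictions on the held-out rows.
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"r2={r2:.3f}  mse={mse:.4f}")
```

Both metrics take the true values first and the predictions second, so the same two calls can be reused unchanged for any other way of splitting the data.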

Yet if I split the data file manually into two files (2/3 train and 1/3 test), I get negative r2 scores and a high mse.

training_data = pandas.read_csv("train.csv", usecols=['y', 'x1', 'x2', 'x3'])
testing_data = pandas.read_csv("test.csv", usecols=['y', 'x1', 'x2', 'x3'])

# select the target and feature columns by name
y_train = training_data.iloc[:, training_data.columns.str.contains('y')]
X_train = training_data.iloc[:, training_data.columns.str.contains('|'.join(['x1', 'x2', 'x3']))]
y_test = testing_data.iloc[:, testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:, testing_data.columns.str.contains('|'.join(['x1', 'x2', 'x3']))]

# convert the single-column frames to Series
y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

I was expecting more or less the same results, and all the data types seem to be the same for both calls.

Have I missed anything obvious?
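One thing that might be worth checking is whether the rows in data.csv are stored in some order, because train_test_split shuffles the rows by default while cutting a file at the 2/3 mark does not. The sketch below only illustrates that effect and is not a confirmed diagnosis: the data is synthetic, it is deliberately sorted by y, and the helper fit_and_score is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit a decision tree on the training rows and return r2 on the test rows."""
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Synthetic data (assumption), sorted by y to mimic an ordered file.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, size=(900, 3)), columns=["x1", "x2", "x3"])
df["y"] = 3 * df["x1"] + 2 * df["x2"] - df["x3"] + rng.normal(0, 0.05, size=900)
df = df.sort_values("y").reset_index(drop=True)

X, y = df[["x1", "x2", "x3"]], df["y"]

# Split 1: shuffled split, the default behaviour of train_test_split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
r2_shuffled = fit_and_score(X_tr, y_tr, X_te, y_te)

# Split 2: "manual" split, first 2/3 of the rows for training, the rest for testing.
cut = len(df) * 2 // 3
r2_ordered = fit_and_score(X.iloc[:cut], y.iloc[:cut], X.iloc[cut:], y.iloc[cut:])

print(f"shuffled split r2: {r2_shuffled:.3f}")
print(f"ordered split r2:  {r2_ordered:.3f}")
```

In this sketch the ordered split scores below zero because every test target lies above the largest training target, and a decision tree cannot predict outside the range it was trained on. If the real file turns out to be ordered, shuffling the rows before writing train.csv and test.csv (for example with DataFrame.sample(frac=1)) should bring the two approaches closer together.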


