This question helped me realize that I can split my data into training and validation sets. Here is the code I use to load my training and test data:
    import numpy as np
    import pandas as pd

    def load_data(datafile):
        training_data = pd.read_csv(datafile, header=0, low_memory=False)
        # Separate the target column from the features.
        training_y = training_data[['job_performance']]
        training_x = training_data.drop(['job_performance'], axis=1)
        # Treat infinities as missing, fill with column means, then with 0.
        training_x.replace([np.inf, -np.inf], np.nan, inplace=True)
        training_x.fillna(training_x.mean(numeric_only=True), inplace=True)
        training_x.fillna(0, inplace=True)
        # One-hot encode every categorical or object column.
        categorical_data = training_x.select_dtypes(
            include=['category', object]).columns
        training_x = pd.get_dummies(training_x, columns=categorical_data)
        return training_x, training_y
Here datafile is my training file. I have another file, test.csv, that has the same columns as the training file, except that it may be missing some categories. How can I run get_dummies on the test file and ensure the categories are encoded in the same way?
Additionally, my test data is missing the job_performance column. How can I handle this in the function?
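To make the goal concrete, here is a minimal sketch of one way this might be done, not a definitive answer. It assumes that reindexing the encoded test frame to the training columns is acceptable: categories missing from the test file become all-zero columns, and categories that only appear in the test file are dropped. The function name encode_features and its target and columns parameters are hypothetical, and it returns the target as a Series (or None) rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd


def encode_features(frame, target="job_performance", columns=None):
    """Return (x, y, columns); y is None when the target column is absent.

    Hypothetical helper: `columns` is the column index saved from the
    training run, used to align the test frame.
    """
    x = frame.copy()
    # The test file has no target column, so pop it only when present.
    y = x.pop(target) if target in x.columns else None
    # Clean numeric columns only: infinities -> NaN -> column mean -> 0.
    numeric = x.select_dtypes(include="number").columns
    cleaned = x[numeric].replace([np.inf, -np.inf], np.nan)
    x[numeric] = cleaned.fillna(cleaned.mean()).fillna(0)
    # With no `columns` argument, get_dummies encodes every object,
    # string, or category column.
    x = pd.get_dummies(x, dtype=int)
    if columns is not None:
        # Align to the training layout: add missing dummies as zeros,
        # drop dummies that were never seen in training.
        x = x.reindex(columns=columns, fill_value=0)
    return x, y, x.columns


# Tiny made-up frames standing in for the training file and test.csv.
train = pd.DataFrame({
    "age": [1.0, np.inf, 3.0],
    "color": ["red", "blue", "red"],
    "job_performance": [0.1, 0.2, 0.3],
})
test = pd.DataFrame({"age": [5.0, 6.0], "color": ["red", "green"]})

train_x, train_y, train_columns = encode_features(train)
test_x, test_y, _ = encode_features(test, columns=train_columns)
```

The training call returns the column index, which is then passed to the test call so both frames end up with identical columns in the same order.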
from How can I align pandas get_dummies across training and test data?