Thursday, 27 June 2019

How can I align pandas get_dummies across training and test data?

This question was helpful in realizing that I can split training and validation data. Here is the code I use to load my training and test data.

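import numpy as np
import pandas as pd

def load_data(datafile):
    training_data = pd.read_csv(datafile, header=0, low_memory=False)
    # Separate the target column from the features.
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)

    # Treat infinities as missing, fill numeric gaps with the column mean,
    # then fill whatever is left (e.g. non-numeric columns) with 0.
    training_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    training_x.fillna(training_x.mean(numeric_only=True), inplace=True)
    training_x.fillna(0, inplace=True)
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns

    # One-hot encode every categorical column.
    training_x = pd.get_dummies(training_x, columns=categorical_data)
    return training_x, training_y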

Here, datafile is my training file. I have another file, test.csv, that has the same columns as the training file, except that it may be missing some categories. How can I apply get_dummies to the test file and ensure the categories are encoded in the same way?
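One possible approach, sketched here with small made-up frames rather than the real files, is to encode each frame separately and then reindex the test frame onto the training columns. The frames `train` and `test` and the columns `age` and `dept` are hypothetical stand-ins; `dtype=int` is only there so the filled-in zeros match the dummy columns' type.

```python
import pandas as pd

# Hypothetical toy data standing in for the training and test files.
train = pd.DataFrame({"age": [25, 32, 47], "dept": ["sales", "ops", "hr"]})
test = pd.DataFrame({"age": [29, 51], "dept": ["ops", "legal"]})

# Encode each frame on its own; the resulting columns may differ.
train_x = pd.get_dummies(train, columns=["dept"], dtype=int)
test_x = pd.get_dummies(test, columns=["dept"], dtype=int)

# Align the test frame to the training columns: categories missing from
# the test data become all-zero columns, and categories never seen in
# training (here "legal") are dropped.
test_x = test_x.reindex(columns=train_x.columns, fill_value=0)

print(test_x)
```

After the reindex, both frames have the same columns in the same order, so a model fitted on `train_x` can be applied to `test_x`. Note that rows whose category was unseen in training end up with zeros in every dummy column for that feature.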

Additionally, my test data is missing the job_performance column. How can I handle this in the function?
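For the missing job_performance column, one option is to split off the target only when it is present. The helper below is a hypothetical sketch (the name `split_features_target` is not from the original code) that works on an already loaded DataFrame and returns `None` for the target when the column is absent.

```python
import pandas as pd

def split_features_target(data, target="job_performance"):
    """Return (x, y); y is None when the target column is absent."""
    if target in data.columns:
        return data.drop(columns=[target]), data[[target]]
    return data, None

# Hypothetical toy frames: one with the target column, one without.
train = pd.DataFrame({"age": [25, 32], "job_performance": [3.5, 4.1]})
test = pd.DataFrame({"age": [29, 51]})

train_x, train_y = split_features_target(train)
test_x, test_y = split_features_target(test)
```

Inside load_data, the same check could replace the unconditional drop, so that a single function serves both files.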


