Hemant Vishwakarma: Removing *NEARLY* Duplicate Observations

Monday, 22 February 2021

Removing NEARLY Duplicate Observations - Python

I am attempting to remove some observations in a data frame where the similarities are ALMOST 100% but not quite. See frame below:

Notice how "John", "Mary", and "Wesley" have nearly identical observations, but have one column being different. The real data set have 15 columns, and 215,000+ observations. In all of the cases I could visually verify, the similarities were likewise: out of 15 columns, the other observation would match up to 14 columns, every time. For the purpose of the project I have decided to remove the repeated observations, (and store them into another data frame just in case my boss asks to see them).

I have evidently thought of remove_duplicates(keep='something'), but that would not work since the observations are not ENTIRELY similar. Has anyone ever encounter such an issue? Any idea on a remedy?

from Removing *NEARLY* Duplicate Observations - Python

Hemant Vishwakarma

Monday, 22 February 2021

Removing NEARLY Duplicate Observations - Python

No comments:

Post a Comment