Monday, 22 February 2021

Removing *NEARLY* Duplicate Observations - Python

I am attempting to remove some observations in a data frame where the similarities are ALMOST 100% but not quite. See frame below:

enter image description here

Notice how "John", "Mary", and "Wesley" have nearly identical observations, but have one column being different. The real data set have 15 columns, and 215,000+ observations. In all of the cases I could visually verify, the similarities were likewise: out of 15 columns, the other observation would match up to 14 columns, every time. For the purpose of the project I have decided to remove the repeated observations, (and store them into another data frame just in case my boss asks to see them).

I have evidently thought of remove_duplicates(keep='something'), but that would not work since the observations are not ENTIRELY similar. Has anyone ever encounter such an issue? Any idea on a remedy?



from Removing *NEARLY* Duplicate Observations - Python

No comments:

Post a Comment