Monday, 17 July 2023

Deduplicate pandas dataset by index value without using `networkx`

Please note that I have already reviewed this related question:

Pandas and python: deduplication of dataset by several fields

I would like each `code` value to appear only once per `id`.

import pandas as pd

df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']}, index=[1, 1, 1, 2, 3, 3])
df.index.name = 'id'

df:

id code
1 A
1 A
1 B
2 C
3 D
3 A

My desired output is:

id code
1 A
1 B
2 C
3 D
3 A

I managed to accomplish this as follows, but I don't love it.

i=df.index.name
df.reset_index().drop_duplicates().set_index(i)

Here's why:

  • This will fail if the index has no name
  • I shouldn't need to re-set and set an index
  • This is a fairly common operation, and there is way too much ink here.
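
For what it's worth, the first two objections can be avoided with a boolean mask, at the cost of building a temporary MultiIndex. This is only a sketch: the names `keys` and `deduped` are made up for illustration, and the index is deliberately left unnamed here to show that no name is required.

```python
import pandas as pd

# Same data as above, but the index is intentionally left without a name.
df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']},
                  index=[1, 1, 1, 2, 3, 3])

# Append every column to the existing index, giving (index, code) pairs.
keys = df.set_index(list(df.columns), append=True).index

# Index.duplicated() flags repeats of a pair after its first occurrence;
# inverting the mask keeps the first row of each pair.
deduped = df[~keys.duplicated()]
print(deduped)
```

It still builds an intermediate object, so it is not obviously less ink than the original.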

What I want to say is:

df.groupby('id').drop_duplicates()

This is not currently supported.
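
The closest working spelling of that idea I know of goes through `groupby(level=0)` and `apply`. A minimal sketch, assuming the index has a single level; `per_id` is just an illustrative name, and `apply` with a lambda may be slow on large frames.

```python
import pandas as pd

df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']},
                  index=[1, 1, 1, 2, 3, 3])
df.index.name = 'id'

# Group on the index itself (level 0), so no index name is needed,
# and drop duplicate rows within each group.
# group_keys=False stops pandas from adding the group key as an
# extra index level in the result.
per_id = df.groupby(level=0, group_keys=False).apply(
    lambda g: g.drop_duplicates()
)
print(per_id)
```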

Is there a more Pythonic way to do this?
