Tuesday, 16 April 2019

Python 3: remove overlaps in table

I have a table (simplified output from a program) that I need to filter:

id   hit from   to value
A   hit1    56  102 0.00085
B   hit2    89  275 0.00034
B   hit3    240 349 0.00034
C   hit4    332 480 3.40E-15
D   hit5    291 512 3.80E-24
D   hit6    287 313 0.00098
D   hit7    381 426 0.00098
D   hit8    287 316 0.0029
D   hit9    373 422 0.0029
D   hit10   514 600 0.0021

For each id, the rows should be sorted by from and, where hits overlap, only the one with the lower value should be kept.

So far, this is my code, which sorts first by from and then by value:

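import pandas

# Raw string so that \s is passed to the regex engine unchanged
df = pandas.read_csv("table", sep=r'\s+', names=["id", "hit", "from", "to", "value"])
# Sort within each id by start position, then by value, and keep the result
df = df.sort_values(["id", "from", "value"])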

But how do I check for overlap (from to to) and remove the hit with the higher value?

This is my expected output:

id   hit from   to value
A   hit1    56  102 0.00085
C   hit4    332 480 3.40E-15
D   hit5    291 512 3.80E-24
D   hit10   514 600 0.0021

Please note that id B has two overlapping hits with equal value, so both entries should be removed.
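
One possible reading of the rule, as a minimal sketch rather than a definitive answer: within each id, a hit is kept only if its value is strictly lower than that of every other hit it overlaps. This assumes from and to are inclusive coordinates (so 291-512 and 514-600 do not overlap) and that a tie between overlapping hits removes both. The helper name drop_overlaps and the inline TABLE string are hypothetical, used here only so the example is self-contained.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the example table (hypothetical stand-in for the "table" file)
TABLE = """\
A   hit1    56  102 0.00085
B   hit2    89  275 0.00034
B   hit3    240 349 0.00034
C   hit4    332 480 3.40E-15
D   hit5    291 512 3.80E-24
D   hit6    287 313 0.00098
D   hit7    381 426 0.00098
D   hit8    287 316 0.0029
D   hit9    373 422 0.0029
D   hit10   514 600 0.0021
"""


def drop_overlaps(group):
    """Keep hits whose value is strictly lower than all hits they overlap."""
    start = group["from"].to_numpy()
    end = group["to"].to_numpy()
    value = group["value"].to_numpy()
    # overlap[i, j] is True when hits i and j share at least one position
    # (assumption: both interval ends are inclusive)
    overlap = (start[:, None] <= end[None, :]) & (start[None, :] <= end[:, None])
    np.fill_diagonal(overlap, False)  # a hit does not compete with itself
    # beaten[i] is True when an overlapping hit j has value[j] <= value[i];
    # using <= means that ties remove both hits
    beaten = (overlap & (value[None, :] <= value[:, None])).any(axis=1)
    return group[~beaten]


df = pd.read_csv(io.StringIO(TABLE), sep=r"\s+",
                 names=["id", "hit", "from", "to", "value"])
df = df.sort_values(["id", "from", "value"])

# Filter each id separately, then stitch the surviving rows back together
kept = [drop_overlaps(group) for _, group in df.groupby("id")]
result = pd.concat(kept).reset_index(drop=True)
print(result)
```

The pairwise comparison is quadratic in the number of hits per id, which should be fine for small groups like these. Note that under this reading a hit can be removed by a neighbour that is itself removed by a third hit; if a greedy "best hit first" selection is wanted instead, the result can differ for chains of overlaps.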


