I have a dataframe with the columns: [id, range_start, range_stop, score]. If two rows have a range overlap by x percentage I retain the row with the higher score. However, I am confused how to pull out rows with no overlap to other ranges. I am using a nested loop and recursion to condense overlapping ranges into a new dataframe. However, this structure causes all rows to be retained when I am looking for the non overlapping rows.
## This is my function to recursively select the highest scoring overlapping regions
def overlap_retention(df_overlap, threshold, df_nonoverlap=None):
if df_nonoverlap != None:
df_nonoverlap = pd.DataFrame()
df_overlap = pd.DataFrame()
for index, row in x.iterrows():
rs = row['range_start']
re = row['range_end']
## Silly nested loop to compare ranges between all rows
for index2, row2 in x.drop(index).iterrows():
rs2 = row2['range_start']
re2 = row2['range_end']
readRegion=[*range(rs,re,1)]
refRegion=[*range(rs2,re2,1)]
regionUnion = set(readRegion).intersection(set(refRegion))
overlap_length = len(regionUnion)
overlap_min = min(rs, rs2)
overlap_max = max(re, re2)
overlap_full_range = overlap_max-overlap_min
overlap_percentage = (overlap_length/overlap_full_range)*100
## Check if they overlap by x_percentage and retain the higher score
if overlap_percentage>x_percentage:
evalue = row['score']
evalue_2 = row2['score']
if evalue_2 > evalue:
df_overlap = df_overlap.append(row2)
else:
df_overlap = df_overlap.append(row)
#----------------------------------------------------------
## How to find non-overlapping rows without pulling everything?
else:
df_nonoverlap = df_nonoverlap.append(row)
# ---------------------------------------------
### Recursion here to condense overlapped list further
if len(df_overlap)>1:
overlap_retention(df_overlap, threshold, df_nonoverlap)
else:
return(df_nonoverlap)
An example input is below:
data = {'id':['id1', 'id2', 'id3', 'id4', 'id5'],
'range_start':[1,12,11,6,20],
'range_end':[4,15,15,6,23],
'score':[3,1,8,2,5]}
input = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
The desired output can change based on the overlap threshold. In the example above id1 and id4 may both be retained or simply id1 depending on the overlap threshold:
data = {'id':['id1', 'id3', 'id5'],
'range_start':[1,11,20],
'range_end':[4,15,23],
'score':[3,8,5]}
output = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
from Retaining rows that have percent overlapping ranges in Pandas
No comments:
Post a Comment