Friday, 26 February 2021

parsing and avoiding nested loops in Python

I have a SQL table with 12000 entries stored in a dataframe df1, which looks like this:

id     name
00001  angiocarcoma
00261  shrimp allergy

I have another table with 20000 entries, which is stored in dataframe df:

Entry_name  CA
TRGV2       3BHS1 HSD3B1 3BH HSDB3
TRGJ1       3BP1 SH3BP1 IF

The aim is to find, within a sentence, every combination of a name from df1 and a CA value from df (each CA cell is split on " " spaces), with the condition that a CA value must be longer than 2 characters. The simplest logic would be to search the sentence for all the name values from df1 and, if a match is found, to search the same sentence for the CA values. But doing it that way uses a lot of resources.
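To make the splitting and the length condition concrete, here is a minimal sketch using only the two sample CA cells shown above. The list `ca_cells` is a stand-in for `df["CA"]` and is an assumption for illustration, not part of my actual code.

```python
# CA cells copied from the two sample rows above (a stand-in for df["CA"]).
ca_cells = ["3BHS1 HSD3B1 3BH HSDB3", "3BP1 SH3BP1 IF"]

# Split each cell on single spaces and keep only tokens longer than 2 characters.
ca_tokens = [
    [token for token in cell.split(" ") if len(token) > 2]
    for cell in ca_cells
]

print(ca_tokens)
# [['3BHS1', 'HSD3B1', '3BH', 'HSDB3'], ['3BP1', 'SH3BP1']]
```

Here "IF" is dropped because it has only 2 characters, while "3BH" is kept because it has 3.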

Below is the code I have tried; I can only think of nested loops to accomplish the task. If I use two functions, I incur function-call overhead, and if I try to do it recursively, I exceed Python's recursion limit, which forces the kernel to shut down. The following function is called with a sentence as its argument (I have to parse 500k sentences):

    import re

    def disease_search(nltk_tokens_sen):
        # Check every disease name in df1 against the sentence.
        for dis_index in range(len(df1)):
            disease_name = df1.at[dis_index, 'name']
            # re.escape guards against regex metacharacters in the name.
            regex_for_dis = rf"\b{re.escape(disease_name)}\b"
            matches_for_dis = re.findall(regex_for_dis, nltk_tokens_sen, re.IGNORECASE | re.MULTILINE)
            if len(matches_for_dis) != 0:
                disease_marker(nltk_tokens_sen, disease_name)

and this function is called if the above function finds a match:

    def disease_marker(nltk_tokens_sen, disease_name):
        for zz in range(len(df)):
            biomarker_txt = df.at[zz, 'CA']
            biomarker = biomarker_txt.split(" ")
            for tt in range(len(biomarker)):
                if len(biomarker[tt]) > 2:
                    matches_for_marker = re.findall(rf"\b{re.escape(biomarker[tt])}\b", nltk_tokens_sen)
                    if len(matches_for_marker) != 0:
                        print("Match_found:", disease_name, biomarker[tt])

Do I need to change my logic completely, or is there a Pythonic, runtime-efficient way to achieve it?
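One direction that might avoid rebuilding a regex for every row on every sentence is to compile a single alternation pattern per table once, and then scan each sentence at most twice. The following is only a sketch under assumptions: `build_pattern` and `find_pairs` are hypothetical helper names, the plain lists stand in for `df1["name"]` and `df["CA"]`, and the example sentence is made up. It also differs from the nested loops above in two ways: overlapping terms are reported only once (the longest match at a position wins), and disease names are returned lowercased as matched rather than as stored in df1. Whether this is fast enough for 500k sentences would need to be measured.

```python
import re

def build_pattern(terms, flags=0):
    """Compile one regex that matches any of the given terms as a whole word."""
    # Longest terms first, so a longer term is tried before a shorter prefix of it.
    unique_terms = sorted(set(terms), key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in unique_terms)
    return re.compile(rf"\b(?:{alternation})\b", flags)

def find_pairs(sentence, disease_pattern, marker_pattern):
    """Return sorted (disease, marker) pairs that both occur in the sentence."""
    diseases = {m.group(0).lower() for m in disease_pattern.finditer(sentence)}
    if not diseases:
        return []  # no disease found, so the marker scan is skipped
    markers = {m.group(0) for m in marker_pattern.finditer(sentence)}
    return [(d, m) for d in sorted(diseases) for m in sorted(markers)]

# Stand-ins for df1["name"] and df["CA"], using the sample rows from the question.
disease_names = ["angiocarcoma", "shrimp allergy"]
ca_cells = ["3BHS1 HSD3B1 3BH HSDB3", "3BP1 SH3BP1 IF"]
marker_terms = [t for cell in ca_cells for t in cell.split(" ") if len(t) > 2]

# Compile once, then reuse for every sentence.
disease_re = build_pattern(disease_names, re.IGNORECASE)
marker_re = build_pattern(marker_terms)

# Made-up sentence, for illustration only.
sentence = "Shrimp allergy was linked to HSD3B1 and IF levels."
print(find_pairs(sentence, disease_re, marker_re))
# [('shrimp allergy', 'HSD3B1')]
```

The two patterns are built once outside the per-sentence call, so the work per sentence no longer depends on looping over all 12000 and 20000 rows in Python.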


