Friday, 16 July 2021

Optimize Python code with basic libraries

I'm trying to do a non-equi self join in basic Python on a table that has 1.7 million rows and 4 variables. The data looks like this:

product     position_min     position_max      count_pos
A.16        167804              167870              20
A.18        167804              167838              15
A.15        167896              167768              18
A.20        238359              238361              33
A.35        167835              167837              8

Here is the code I used:

import csv
from collections import defaultdict

# Load the whole file once; each row is [product, position_min, position_max, count_pos].
list_csv = []
with open(r'product.csv', 'r', newline='') as file1:
    my_reader1 = csv.reader(file1, delimiter=';')
    for row in my_reader1:
        list_csv.append(row)

# Read the file a second time and compare each row against every loaded row.
with open(r'product.csv', 'r', newline='') as file2:
    my_reader2 = csv.reader(file2, delimiter=';')
    with open('product_p.csv', 'w', newline='') as csvfile_write:
        ecriture = csv.writer(csvfile_write, delimiter=';',
                              quotechar='"', quoting=csv.QUOTE_ALL)
        for row in my_reader2:
            res = defaultdict(list)
            for comp in list_csv:
                try:
                    if (int(row[1]) >= int(comp[1]) and int(row[2]) <= int(comp[2])
                            and row[0] != comp[0]):
                        res[row[0]].append([comp[0], comp[3]])
                except (ValueError, IndexError):
                    # Skip the header row and any malformed line.
                    pass

            for key, value in res.items():
                # Group the count_pos values by matching product.
                sublists = defaultdict(list)
                for sublist in value:
                    sublists[sublist[0]].append(int(sublist[1]))
                # Keep the matching product with the smallest count_pos.
                best = min(sublists.keys(), key=lambda k: sublists[k])
                ecriture.writerow([str(key) + ";" + str(best)])

I should get this in the "product_p.csv" file:

'A.18'; 'A.16'
'A.15'; 'A.18'
'A.35'; 'A.18' 

The code reads the same file twice: the first time completely, converting it into a list, and the second time line by line. For each product (1st variable), it finds all the products it belongs to according to the condition on position_min and position_max, and then keeps only one of them: the product with the minimum count_pos.

I tried it on a sample of the original data and it works, but with 1.7 million rows it runs for hours without giving any results. Is there a way to do this without loops, or with fewer loops? Could anyone help optimize this with basic Python libraries?

Thank you in advance


