Saturday 6 March 2021

Speeding up a comparison function for comparing sentences

I have a data frame with shape (789174, 9). It has a column called resolution containing sentences that are less than 139 characters long. I built a function to find sentences with a similarity score above 0.9, computed with SequenceMatcher from the difflib library. I have a virtual machine with 96 CPUs and 384 GB of RAM. The function has now been running for more than 2 hours and still has not reached i = 1000. I am concerned that this will take too long to process, and I am wondering if there is a way to speed it up.
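For reference, the similarity score in question is difflib's SequenceMatcher ratio; here is a minimal illustration (the two strings are made up for the example):

import difflib

a = 'replaced scanner'
b = 'replaced the scanner'
# ratio() returns 2*M/T, where M is the number of matched characters
# and T is the combined length of both strings
print(difflib.SequenceMatcher(None, a, b).ratio())  # 0.888..., just under the 0.9 cutoff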

import difflib
import time

def replace_similars(input_list):
    # Replace strings that are 90% or more similar with the first-seen variant
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time() - start_time:.2f} - index = {i}')
        for j in range(i + 1, len(input_list)):
            if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # work on a copy
    replace_similars(new_list)

    # Map each original string to the canonical string it was replaced with
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Clearly, since the nested loop compares every string in the column against every other, this is O(n^2). I am not sure if there is a way to make it faster. Any suggestions would be greatly appreciated.
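For a sense of scale (a rough back-of-the-envelope, not a benchmark): the nested loop performs one SequenceMatcher comparison per unique pair of rows:

n = 789174
pairs = n * (n - 1) // 2  # unique (i, j) pairs with i < j
print(f'{pairs:,}')       # about 3.1e11 comparisons

Even at 100,000 comparisons per second on one core, that is on the order of a month of single-threaded work.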

EDIT:

I have attempted a speed-up using difflib and fuzzywuzzy. The function goes through the column only once, but it does iterate over the dictionary keys for each string.

import difflib
from fuzzywuzzy import fuzz

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        # First try difflib against the existing cluster keys
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)

        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            # Start a new cluster, then also attach the string to any
            # existing cluster whose key it partially matches
            clusters[string] = [string]
            for m in clusters.keys():
                if m == string:
                    continue  # skip the cluster just created, otherwise the string is appended twice
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters

mappings = cluster_resolution(df_sample)
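For what it's worth, cluster_resolution returns clusters ({key: [members]}) rather than the flat string-to-string mapping the first version produced. If a flat mapping is wanted, the result can be flattened afterwards; a minimal sketch, assuming each cluster key serves as the canonical string:

def clusters_to_mapping(clusters):
    # Flatten {canonical: [members]} into {member: canonical}.
    # A string that was appended to several clusters keeps the last one seen.
    mapping = {}
    for canonical, members in clusters.items():
        for member in members:
            mapping[member] = canonical
    return mapping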

Is it possible to speed up the latter function?

Here is an example of some data in a dataframe

import pandas as pd

d = {'resolution': [
    'replaced scanner',
    'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use',
    'tc reimage',
    'updated pc',
    'deploying replacement scanner',
    'upgraded and rebooted station',
    'printer has been reconfigured',
    'cleared linux print queue and now it is working',
    'user reset her password successfully closing tt',
    'have reset the printer to get it to print again',
    'i plugged usb cable into port and scanner works',
    'reconfigured hand scanner and linked to station',
    'replaced the scanner with station is functional',
    'laptops battery needed to be reset asset serial',
    'reconfigured scanner confirmed that it scans as intended',
    'reimaging laptop corrected the anyconnect software issue',
    'printer was unplugged from usb port working properly now',
    'reconnected usb cable and reassign printer ports on port',
    'reconfigured scanner to base and tested with aa all fine',
    'replaced the defective device with a fresh imaged laptop',
    'reconfigured the printer and the media to print properly',
    'tested printer at station connected and working resolved',
    'red scanner reconfigured and base rebooted via usb joint',
    'station scanner was synced to base and station and is now working',
    'printer offlineswitched usb portprinter is now online and working',
    'replaced the barcode label with one reflecting the tcs ip address',
    'restarted the thin client by using ssh to run the restart command',
    'printer reconfigured and test they are functioning normally again',
    'removed old printer for service installed replacement tested good',
    'tc required reboot rebooted tc had aa signin dp is now functional',
    'resetting the printer to factory settings and then reconfigure it',
    'updated windows os forced update and the laptop operated normally',
    'printer settings are set correct and printer is working correctly',
    'power to printer was disconnected reconnected and is working fine',
    'power cycled equipment and restocked spooler with plastic bubbles',
    'laptop checked ive logged into paskiplacowepl without any problem',
    'reseated scanner cables connection into usb port to resolve issue',
    'the scanner has been replaced and the station is working well now',
]}

df = pd.DataFrame(data=d)
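Note that cluster_resolution reads a resolution_modified column, while this sample frame only has resolution. Assuming resolution_modified is a cleaned-up copy of resolution (the question does not show how it is produced), the sample can be run as-is like this:

# resolution_modified is assumed here to be a plain copy of resolution;
# in the real data it is presumably a preprocessed version
df['resolution_modified'] = df['resolution']
mappings = cluster_resolution(df)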

How I define similarity:

Similarity is really defined by the overall action taken, e.g. replaced scanner and replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use. The longer string's overall action was replacing the scanner, so those two are very similar, which is why I chose to use the partial_ratio function, since that pair has a score of 100.
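For concreteness, that pair can be checked like this (the 100 is the score reported above; exact numbers can vary with the fuzzywuzzy backend):

from fuzzywuzzy import fuzz

short_str = 'replaced scanner'
long_str = ('replaced the scanner for the user with a properly working one '
            'from the cage replaced the wire on the damaged one and stored it for later use')

# partial_ratio scores only the best-matching window of the longer string,
# so the large length difference does not drag the score down
print(fuzz.partial_ratio(short_str, long_str))  # the question reports 100 for this pair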

Attention:

Please refer to the second function, cluster_resolution, as this is the function I would like to speed up. The first function is no longer useful.



from Speeding up a comparison function for comparing sentences
