Saturday 17 October 2020

Why does _.shutdown() take longer in implementation 1 as compared to implementation 2?

Implementation 1:

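def pickle_data(input_list, lang_code):
    # The large list is built inside this function, so it is a local variable here.
    strings_list = filter_content(input_list)
    logging.info("Dumping Created Pickle (Step 2 of 2)")
    # Write the pickle to a bz2-compressed file named after the language code.
    with bz2.BZ2File("{x}.pickle".format(x=lang_code), "w") as f:
        c_pickle.dump(strings_list, f)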

if __name__ == "__main__":
    pickle_data(input_files, lang_code)
    logging.debug("Pickle Dumped. Closing any stranded threads.")

Implementation 2:

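def pickle_data(strings_list, lang_code):
    # The large list is built by the caller and passed in as an argument.
    logging.info("Dumping Created Pickle (Step 2 of 2)")
    # Write the pickle to a bz2-compressed file named after the language code.
    with bz2.BZ2File("{x}.pickle".format(x=lang_code), "w") as f:
        c_pickle.dump(strings_list, f)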

if __name__ == "__main__":
    pickle_data(filter_content(input_files), lang_code)
    logging.debug("Pickle Dumped. Closing any stranded threads.")

Other relevant pieces of code:

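from bs4 import BeautifulSoup as bs  # assumed import behind the name `bs`

def parse_data(input_file):
    """
    Reads a given input_file and returns the bs4 processed soup on it
    """
    # The context manager closes the file even if read() raises.
    with open(input_file, "r", encoding="utf-8") as infile:
        data = infile.read()
    return bs(data, 'xml')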

def filter_content(in_files):
    """
    For each file in in_files, extract a relevant piece of text within the 'revision' tag.
    Return a list of all the gathered data across different files.
    """
    strings = []
    for input_file in tqdm.tqdm(in_files, desc="Reading Input Files (Step 1 of 2)", ncols=100):
        soup = parse_data(input_file)
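        # Skip the first 'revision' element and process the rest.
        pages_data = soup.find_all('revision')[1:]
        for page in pages_data:
            # Take the text of the first direct child named "text".
            text = [x for x in page.children if x.name == "text"][0].text
            # Keep only entries that are not empty or whitespace-only.
            if text.strip():
                strings.append(text)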
    return strings

Using a profiler, I can see that Implementation 1 takes about 10 s to shut down the main thread after the last log message, whereas Implementation 2 takes only about 1 s. For reference, the list returned by filter_content() is about 2.5 GB.
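
One way to check whether part of that delay is simply the cost of freeing a large list of strings is to time the release on its own. The following is only a minimal sketch, assuming CPython (where dropping the last reference frees the object immediately); the function name time_release and the list size are hypothetical and not taken from the script above, and the sketch does not by itself explain the difference between the two implementations.

```python
import time


def time_release(n_items=200_000):
    """Build a list of n_items strings, then time how long dropping it takes."""
    # Stand-in data; the real list in the question is far larger (about 2.5 GB).
    data = [str(i) * 10 for i in range(n_items)]
    start = time.perf_counter()
    # In CPython this drops the last reference, so the list and its strings
    # are deallocated before the next line runs.
    del data
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_release()
    print(f"Releasing the list took {elapsed:.4f}s")
```

Running the same measurement around the pickle_data() call in each implementation, and again just before the final log line, would show whether the time is spent before or after that log message.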


