Implementation 1:
def pickle_data(input_list, lang_code):
    strings_list = filter_content(input_list)
    logging.info("Dumping Created Pickle (Step 2 of 2)")
    with bz2.BZ2File("{x}.pickle".format(x=lang_code), "w") as f:
        c_pickle.dump(strings_list, f)
    return

if __name__ == "__main__":
    pickle_data(input_files, lang_code)
    logging.debug("Pickle Dumped. Closing any stranded threads.")

Implementation 2:
def pickle_data(strings_list, lang_code):
    logging.info("Dumping Created Pickle (Step 2 of 2)")
    with bz2.BZ2File("{x}.pickle".format(x=lang_code), "w") as f:
        c_pickle.dump(strings_list, f)
    return

if __name__ == "__main__":
    pickle_data(filter_content(input_files), lang_code)
    logging.debug("Pickle Dumped. Closing any stranded threads.")

Other relevant pieces of code:
def parse_data(input_file):
    """
    Reads a given input_file and returns the bs4 processed soup on it
    """
    infile = open(input_file, "r", encoding="utf-8")
    data = infile.read()
    infile.close()
    return bs(data, 'xml')

def filter_content(in_files):
    """
    For each file in in_files, extract a relevant piece of text within the 'revision' tag.
    Return a list of all the gathered data across different files.
    """
    strings = []
    for input_file in tqdm.tqdm(in_files, desc="Reading Input Files (Step 1 of 2)", ncols=100):
        soup = parse_data(input_file)
        pages_data = soup.find_all('revision')[1:]
        for page in pages_data:
            text = [x for x in list(page.children) if x.name == "text"][0].text
            if text.strip().lstrip().rstrip() != "":
                strings.append(text)
    return strings

Using a profiler, I could see that Implementation 1 takes about 10 s to close the main thread after the last log message, whereas Implementation 2 takes only about 1 s. For reference, the list returned by filter_content() is about 2.5 GB.
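
One way to narrow this down is to time how long it takes to release a large list of strings explicitly, separately from interpreter shutdown. The following is only a minimal sketch: build_strings() and time_release() are hypothetical names that are not part of the script above, build_strings() merely stands in for filter_content(), and whether freeing the list is what actually accounts for the delay is an assumption still to be checked.

```python
import gc
import time


def build_strings(n_items, item_size):
    """Hypothetical stand-in for filter_content(): a list of distinct strings."""
    return [str(i).rjust(item_size, "x") for i in range(n_items)]


def time_release(n_items=200_000, item_size=100):
    """Return (build_seconds, release_seconds) for a list of n_items strings."""
    start = time.perf_counter()
    strings = build_strings(n_items, item_size)
    built = time.perf_counter()

    # Drop the only reference and force a collection, so the cost of freeing
    # the list is measured here rather than during interpreter shutdown.
    del strings
    gc.collect()
    released = time.perf_counter()

    return built - start, released - built


if __name__ == "__main__":
    build_s, release_s = time_release()
    print(f"build: {build_s:.3f}s, release: {release_s:.3f}s")
```

Running the same measurement around the real filter_content() result in each implementation would show whether the release cost is comparable to the 10 s versus 1 s gap.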
Why does _.shutdown() take longer in Implementation 1 than in Implementation 2?