Sunday, 27 December 2020

Python string occurrence count regex performance

I was asked to count the occurrences of a substring in a given string, case-insensitively and with or without surrounding punctuation. Some examples:

count_occurrences("Text with", "This is an example text with more than +100 lines") # Should return 1
count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") # Should return 1
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") # Should return 1
count_occurrences("clock", "its 3o'clock in the morning") # Should return 0

I chose a regex over .count() because I needed an exact match, and ended up with:

import re

def count_occurrences(word, text):
    # Match word only when it is not adjacent to a letter or a single apostrophe.
    pattern = f"(?<![a-z])((?<!')|(?<='')){re.escape(word)}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

This returns the correct count in every case, but my code takes 0.10 s while the expected time is 0.025 s. Am I missing something? Is there a better (performance-optimised) way to do this?
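For reference, here is a minimal sketch of one variation that might be worth timing. It is not a confirmed fix. It caches the compiled pattern per word, uses non-capturing groups, and counts with finditer instead of building a list. The helper name _compile and the use of re.escape are assumptions added for illustration. Python's re module already caches recently compiled patterns on its own, so any gain may be modest and should be measured against the real input file.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def _compile(word):
    # Hypothetical helper: build the pattern once per distinct word and cache it.
    # re.escape is an added assumption so punctuation in `word` is matched literally.
    return re.compile(
        r"(?<![a-z])(?:(?<!')|(?<=''))"
        + re.escape(word)
        + r"(?![a-z])(?:(?!')|(?=''))",
        re.IGNORECASE,
    )


def count_occurrences(word, text):
    # Count matches lazily; non-capturing groups mean no group tuples are built.
    return sum(1 for _ in _compile(word).finditer(text))
```

Timing both versions with timeit on the actual file would show whether this gets any closer to the 0.025 s target.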

EDIT: here's the file that I'm working on: Python file


