Monday, 25 February 2019

How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

I'm using Django and Python 3.7. To make parsing more efficient, I have been reading about SoupStrainer objects, and I wrote a custom one to parse only the elements I need:

import re

from bs4 import BeautifulSoup, SoupStrainer

def my_custom_strainer(self, elem, attrs):
    # Debug output: print every attribute of the tag being tested
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attrs and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.article_stats_html_match)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

One of the conditions is that I only want to parse "span" elements whose text matches a certain pattern, hence the

elem == "span" and elem.text == re.compile("my text")

clause. However, this results in an

AttributeError: 'str' object has no attribute 'text'

error when I try to run the above. What's the proper way to write my strainer?
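
For reference, the error message suggests the strainer function is called with the tag name as a plain string, so there is no element object (and no text) to inspect at that point. One possible workaround, offered only as a rough sketch: strain by tag name during parsing, then filter on the text once the tree exists. The sample markup and the names html, only_spans, pattern and matching_spans are made up for illustration, and this version only handles the "span" case.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, standing in for the real page.
html = (
    '<div class="score">10</div>'
    "<span>my text here</span>"
    "<span>something else</span>"
    "<p>ignored</p>"
)

# Step 1: restrict parsing to <span> tags, matching on the tag name only.
only_spans = SoupStrainer("span")
soup = BeautifulSoup(html, features="html.parser", parse_only=only_spans)

# Step 2: once parsed, keep the spans whose text matches the regex.
pattern = re.compile("my text")
matching_spans = soup.find_all("span", string=pattern)

print([span.get_text() for span in matching_spans])
```

The trade-off is that every "span" is still parsed, and the text condition is applied afterwards with find_all rather than inside the strainer itself.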


