I'm using Django and Python 3.7. To make parsing more efficient, I've been reading about SoupStrainer objects. I created a custom one so that only the elements I need get parsed:
    def my_custom_strainer(self, elem, attrs):
        for attr in attrs:
            print("attr:" + attr + "=" + attrs[attr])

        if elem == 'div' and 'class' in attr and attrs['class'] == "score":
            return True
        elif elem == "span" and elem.text == re.compile("my text"):
            return True

    article_stat_page_strainer = SoupStrainer(self.article_stats_html_match)
    soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
One of my conditions is that I only want to parse "span" elements whose text matches a certain pattern, hence the

    elem == "span" and elem.text == re.compile("my text")

clause. However, running the code above raises

    AttributeError: 'str' object has no attribute 'text'

What's the proper way to write my strainer?
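
For reference, here is a minimal sketch of one possible workaround, not a definitive answer. As far as I can tell, when a strainer is used with parse_only it is consulted at the moment a start tag is seen, so the function receives the tag name as a plain str (plus the attributes) and the text between the tags is not yet available. That would explain the AttributeError. Separately, comparing a string to a compiled pattern with == is never true; a pattern has to be applied with something like search(). The sketch below therefore strains by tag name only and applies the text pattern after parsing. The sample markup and the names html, tag_strainer, score_divs and matching_spans are made up for illustration.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup standing in for the real page.
html = """
<html><body>
<div class="score">42</div>
<div class="other">ignored later</div>
<span>my text here</span>
<span>something else</span>
<p>never parsed</p>
</body></html>
"""

# Stage 1: limit parsing to <div> and <span> tags. Only the tag name is
# checked here, because the text inside a tag is not known at this point.
tag_strainer = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, features="html.parser", parse_only=tag_strainer)

# Stage 2: apply the attribute and text conditions on the reduced tree.
score_divs = soup.find_all("div", class_="score")
matching_spans = soup.find_all("span", string=re.compile("my text"))

print([d.get_text() for d in score_divs])
print([s.get_text() for s in matching_spans])
```

The trade-off is that every "div" and "span" is still parsed, so this saves less work than a strainer that could filter on text, but the tree that find_all has to search is much smaller than the full document.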
from How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?