Hemant Vishwakarma: BeautifulSoup find tags based on source position

Thursday, 14 January 2021

BeautifulSoup find tags based on source position

I want to find all of the tags in an HTML document whose text falls within a given character range, say 100:200. The range is an offset into the original HTML file, so f.read()[100:200] is the piece of text I am looking for. BeautifulSoup gets me part of the way there: for each tag I can get its start position with tag.sourcepos, and from that I can pick the element closest to the start of the range. However, I am not sure how to line up the original offset with an offset into the element's text. If I could get the length of the tag itself I could use that, but I don't see any way of doing so.

Here is a minimal attempt, where the second assert fails. The input file, test.htm:

<html><head><title></title></head><body><span style="white-space: pre; font-family: 'Courier New', Courier;"><br><br><br><br><br><br><br>AAA<br><br><br><br><br></span></body></html>

from bs4 import BeautifulSoup

with open('test.htm') as f:
    doc_content = f.read()
    soup = BeautifulSoup(doc_content, 'html5lib')

spans = soup.find_all('span')

start = 137
end = 140
found = doc_content[start:end]
print(found)
assert found == 'AAA'

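# Find the last span that starts before the target range
best_span = max((s for s in spans if s.sourcepos < start), key=lambda s: s.sourcepos)
offset = best_span.sourcepos + 1
# Slice the span's serialized contents at the same relative offsets
found = best_span.encode_contents()[start-offset:end-offset]
found = found.decode('utf8')
print(found)
assert found == 'AAA'  # fails: the serialized contents differ from the original source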

The root cause seems to be that BS4 is 'helping' by serializing each <br> as <br/>, inserting a slash that was not in the original file, so offsets into encode_contents() no longer line up with offsets into the source.
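
One possible way around this is to avoid the serialized output altogether and record source offsets for each text node while parsing. The following is only a rough sketch under several assumptions: it uses the standard library's html.parser rather than BeautifulSoup with html5lib, so it performs none of html5lib's tree repair; it does not record character references such as &amp; as text; and the names TextSpanParser, tags_in_range, VOID_TAGS and SAMPLE are made up for illustration. It relies on HTMLParser.getpos(), which reports the line and column of the token currently being handled.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are kept off the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextSpanParser(HTMLParser):
    """Collect (start, end, tag, tag_start) for each text node, in source offsets."""

    def __init__(self, source):
        # convert_charrefs=False keeps each text chunk identical to the raw
        # source, so len(data) is the number of source characters consumed.
        super().__init__(convert_charrefs=False)
        # Absolute offset at which each line begins (getpos() is line-relative).
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.open_tags = []   # stack of (tag_name, absolute_start_offset)
        self.text_spans = []
        self.feed(source)
        self.close()

    def _abs_pos(self):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self._abs_pos()))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        start = self._abs_pos()
        tag, tag_start = self.open_tags[-1] if self.open_tags else (None, None)
        self.text_spans.append((start, start + len(data), tag, tag_start))


def tags_in_range(source, start, end):
    """Return the text spans whose source range overlaps [start, end)."""
    parser = TextSpanParser(source)
    return [s for s in parser.text_spans if s[0] < end and s[1] > start]


# Same markup as test.htm above, written as one string.
SAMPLE = (
    "<html><head><title></title></head><body>"
    "<span style=\"white-space: pre; font-family: 'Courier New', Courier;\">"
    "<br><br><br><br><br><br><br>AAA<br><br><br><br><br>"
    "</span></body></html>"
)

for s, e, tag, tag_start in tags_in_range(SAMPLE, 137, 140):
    print(tag, tag_start, repr(SAMPLE[s:e]))
```

Because every span is expressed in offsets into the original string, the text is always recovered by slicing the source itself, and the way a library chooses to re-serialize <br> no longer matters. Each entry also carries the name and start offset of the innermost open tag, which could be matched back to a BeautifulSoup tag if that is still needed.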
