Thursday, 14 January 2021

BeautifulSoup find tags based on source position

I want to find all of the tags in an HTML document whose text falls within a certain offset range, say 100:200. The offsets index into the original HTML file, so f.read()[100:200] is the piece of text I am looking for. BeautifulSoup gets me part of the way there: for each tag I can get its position in the source with tag.sourcepos, and from that I can find the element closest to the start of the range. However, I am not sure how to line up the original offset with an offset into the element's text. If I could get the length of the tag itself I could use that, but I don't see any way of doing so.

Here is a minimal attempt, where the second assert fails.

test.htm:

<html><head><title></title></head><body><span style="white-space: pre; font-family: 'Courier New', Courier;"><br><br><br><br><br><br><br>AAA<br><br><br><br><br></span></body></html>

Script:

from bs4 import BeautifulSoup

with open('test.htm') as f:
    doc_content = f.read()
    soup = BeautifulSoup(doc_content, 'html5lib')

spans = soup.find_all('span')

start = 137
end = 140
found = doc_content[start:end]
print(found)
assert found == 'AAA'

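# Find the last span that starts before the target offset.
# Tag objects are not orderable, so compare by sourcepos.
best_span = max((s for s in spans if s.sourcepos < start), key=lambda s: s.sourcepos)
# With html5lib, sourcepos marks the closing '>' of the opening tag,
# so the span's contents begin one character later.
offset = best_span.sourcepos + 1
# encode_contents() re-serializes the children, so this slice is taken
# from BeautifulSoup's output rather than from the original source.
found = best_span.encode_contents()[start-offset:end-offset]
found = found.decode('utf8')
print(found)
assert found == 'AAA'  # this is the assert that fails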

The root cause seems to be that BS4 is 'helping' by serializing each <br> as <br/>, adding a slash that was not in the original file, so offsets into the output of encode_contents() no longer line up with offsets into the source.
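
One possible way around this, sketched below, is to avoid indexing into re-serialized output at all and instead record source offsets for each text run while parsing. This is only a sketch under stated assumptions: it swaps html5lib for the standard library's html.parser, which reports a (line, column) position through getpos() but does not repair markup the way html5lib does. The names TextOffsetParser, tags_in_range and VOID_TAGS are hypothetical helpers, not part of BeautifulSoup, and the result is tag names plus source offsets rather than BeautifulSoup Tag objects. Character references such as &amp; are not counted as text in this version.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextOffsetParser(HTMLParser):
    """Hypothetical helper: records the source offsets of every text run."""

    def __init__(self, source):
        # convert_charrefs=False keeps handle_data() text identical to the source.
        super().__init__(convert_charrefs=False)
        # getpos() reports (line, column); precompute where each line starts
        # so that pair can be turned into an absolute offset.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.open_tags = []   # stack of (tag name, offset of its "<")
        self.text_runs = []   # (start, end, innermost open tag or None)
        self.feed(source)
        self.close()

    def _offset(self):
        line, column = self.getpos()
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        start = self._offset()
        parent = self.open_tags[-1] if self.open_tags else None
        self.text_runs.append((start, start + len(data), parent))


def tags_in_range(source, start, end):
    """Return (tag name, tag offset, overlapping text) for text in source[start:end]."""
    hits = []
    for run_start, run_end, parent in TextOffsetParser(source).text_runs:
        lo, hi = max(run_start, start), min(run_end, end)
        if lo < hi and parent is not None:
            hits.append((parent[0], parent[1], source[lo:hi]))
    return hits
```

Applied to the contents of test.htm above with start = 137 and end = 140, tags_in_range(doc_content, start, end) should return a single hit: the span, the offset of its opening "<", and the text 'AAA'. The offsets are only comparable to f.read() offsets if the parser is fed the same decoded string that was sliced.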



