I wrote the following Python script as a reproducible example of my current PDF-parsing hangup. It:
- downloads a PDF transcript from the web (Cassidy Hutchinson's 9/14/2022 interview transcript with the J6C)
- reads/OCRs that PDF to text
- attempts to split that text into the series of Q&A passages from the interview
- runs a series of tests I wrote based on my manual read of the transcript
Running the Python code below generates the following output:
$ python stack_overflow_q_example.py
Test for passage0 passed.
Test for passage1 passed.
Test for passage7 passed.
Test for passage8 passed.
Traceback (most recent call last):
  File "/home/max/askliz/stack_overflow_q_example.py", line 91, in <module>
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
AssertionError: Failed on passage 10
Your mission, should you choose to accept it: get this passage10 test to pass without breaking any of the previous tests. I'm hoping there's a clever regex or other modification in extract_q_a_locations below that will do the trick, but I'm open to any solution that passes all these tests, as I chose these test passages deliberately.
A little background on this transcript text, in case it's not as fun reading to you as it is to me: sometimes a passage starts with a "Q" or "A", and sometimes it starts with a name (e.g. "Ms. Cheney."). The failing test, for passage 10, covers a question asked by a staff member whose name is redacted. The only way I've managed to get that test to pass has inadvertently broken one of the other tests, because not all redactions indicate the start of a question. (Note: in the PDF/OCR library I'm using, pdfplumber, redacted text usually shows up as just a bunch of extra spaces.)
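To make the failure mode concrete, here is a minimal sketch on synthetic text. The sample string and its column positions are assumptions made up for illustration (they are not copied from pdfplumber's real output); the regexes are the same ones used in the script. The redacted-speaker line is just a line number followed by a long run of spaces, so neither the Q/A branch nor the speaker branch matches it.

```python
import re

# Synthetic stand-in for layout=True text; the spacing is assumed, not real output.
sample = "".join([
    "\n   1" + " " * 8 + "Q    So I do first want to bring up exhibit 1.",
    "\n   2" + " " * 8 + "A    This is correct.",
    "\n   3" + " " * 13 + "Ms. Cheney.  And we also, just as a reminder,",
    # redacted questioner: only whitespace where the name would be
    "\n   4" + " " * 26 + "So at this point, as we discussed earlier,",
])

prefix_regex = r"\n\s+\d+\s+"
qa_regex = r"[QA]\s+"
speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"

# The last whitespace-separated token of each match is the speaker label.
found = [m.group().split()[-1] for m in re.finditer(pattern, sample)]
print(found)  # ['Q', 'A', 'Cheney.']
```

The fourth line never shows up as its own passage, which mirrors the passage 10 failure.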
Code below:
import nltk
import re
import requests
import pdfplumber
def extract_q_a_locations(examination_text: str) -> list:
    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces,
    # then a line number and more spaces
    prefix_regex = r'\n\s+\d+\s+'
    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'
    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    # search the text that was passed in, not the module-level variable
    delims = list(re.finditer(pattern, examination_text))
    return delims
def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]
        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())  # remove extra whitespace
        q_a_list.append(f"{prefix} {text_chunk}")
    return q_a_list
if __name__ == "__main__":
    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"
    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)
    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])
    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]
    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)
    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2
    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")
    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")
    # (Note: for the next two passages/texts, prefix/questioner is captured as "Cheney" &
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.)
    actual_passage7_start = "Cheney. And we also, just as"
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")
    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")
    # HERE'S MY PROBLEM.
    # This test fails because my regex fails to capture the question which starts with the
    # redacted name of the staff/questioner. The only way I've managed to get this test to
    # pass has also broken at least one of the tests above.
    actual_passage10_start = " So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
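One possible direction, sketched under stated assumptions rather than tested against the real transcript: add a third branch that treats a line number followed by an unusually wide run of spaces as a redacted speaker. The `MIN_REDACTION_GAP` threshold, the `redacted` group name, and the synthetic `SAMPLE` text are all hypothetical; the threshold would need tuning against the actual pdfplumber output. The sample also includes a continuation line that begins with a redacted name mid-passage, to show the kind of redaction that should not start a new passage.

```python
import re

# Hypothetical threshold: number of spaces after the line number that signals a
# redacted speaker. This value is an assumption chosen for the synthetic sample.
MIN_REDACTION_GAP = 20

# Synthetic stand-in for layout=True text; column positions are assumed.
SAMPLE = "".join([
    "\n   1" + " " * 8 + "Q    So I do first want to bring up exhibit 1.",
    "\n   2" + " " * 8 + "A    This is correct.",
    "\n   3" + " " * 13 + "Ms. Cheney.  And we also want to thank",
    # continuation line starting with a redacted name (not a new passage)
    "\n   4" + " " * 16 + "for joining us today.",
    # new passage whose speaker name is redacted
    "\n   5" + " " * 26 + "So at this point, as we discussed earlier,",
    "\n   6" + " " * 3 + "I'm going to turn it over.",
])

prefix_regex = r"\n\s+\d+\s+"
qa_regex = r"[QA]\s+"
speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
# Extra branch: line number, then at least MIN_REDACTION_GAP literal spaces,
# then some non-space text (checked with a lookahead so it is not consumed).
redacted_regex = r"(?P<redacted>\n\s+\d+ {%d,})(?=\S)" % MIN_REDACTION_GAP
# The original alternatives are tried first; the redacted branch is a fallback.
pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})|{redacted_regex}"

labels = []
for m in re.finditer(pattern, SAMPLE):
    if m.group("redacted"):
        labels.append("")  # speaker name is unknown, so use an empty label
    else:
        labels.append(m.group().split()[-1])
print(labels)  # ['Q', 'A', 'Cheney.', '']
```

On this sample the gap threshold separates the two redaction cases only because the continuation line is indented less than a speaker line. Redacted names of different lengths, or indented new paragraphs within an answer, could blur that boundary, so the threshold should be checked against all of the existing tests.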
from Trouble parsing interview transcript (Q&As) where questioner name is sometimes redacted