Friday, 24 February 2023

Trouble parsing interview transcript (Q&As) where questioner name is sometimes redacted

I wrote the following Python script as a reproducible example of my current PDF-parsing hangup. It:

  • downloads a PDF transcript from the web (Cassidy Hutchinson's 9/14/2022 interview transcript with the J6C)
  • reads/OCRs that PDF to text
  • attempts to split that text into the series of Q&A passages from the interview
  • runs a series of tests I wrote based on my manual read of the transcript

Running the Python code below generates the following output:

$ python stack_overflow_q_example.py
Test for passage0 passed.
Test for passage1 passed.
Test for passage7 passed.
Test for passage8 passed.
Traceback (most recent call last):
  File "/home/max/askliz/stack_overflow_q_example.py", line 91, in <module>
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
AssertionError: Failed on passage 10

Your mission, should you choose to accept it: get this passage10 test to pass without breaking any of the previous tests. I'm hoping there's a clever regex or other modification to extract_q_a_locations below that will do the trick, but I'm open to any solution that passes all these tests, as I chose these test passages deliberately.

A little background on this transcript text, in case it's not as fun reading to you as it is to me: sometimes a passage starts with a "Q" or "A", and sometimes it starts with a name (e.g. "Ms. Cheney."). The failing test, for passage 10, covers a question asked by a staff member whose name is redacted. The only way I've managed to get that test to pass has inadvertently broken one of the other tests, because not all redactions indicate the start of a question. (Note: pdfplumber, the PDF library I'm using, usually renders redacted text as just a run of extra spaces.)
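To make the failure mode concrete, here is a toy reproduction. The sample lines are synthetic and only assume the layout described above (line number, then either Q/A, a named speaker, or a run of spaces where a redacted name would be); they are not taken from the real transcript. The regex pieces are the same ones used in the script.

```python
import re

# Synthetic stand-in for pdfplumber's layout=True output (NOT the real transcript).
# Line 4 mimics a question whose speaker name was redacted: only spaces remain.
sample = (
    "\n    1        Q    Did you attend the meeting?"
    "\n    2        A    Yes, I did."
    "\n    3             Ms. Cheney.  Thank you for that."
    "\n    4                           So at this point, I'm going to move on."
    "\n    5        A    Okay."
)

PREFIX = r"\n\s+\d+\s+"                      # newline, spaces, line number, spaces
QA = r"[QA]\s+"                              # a bare Q or A marker
SPEAKER = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"    # a named speaker, or a dash
pattern = re.compile(f"{PREFIX}(?:{SPEAKER}|{QA})")

# Last whitespace-separated token of each match is the speaker label.
starts = [m.group().split()[-1] for m in pattern.finditer(sample)]
print(starts)
```

Four of the five lines are detected as passage starts; the redacted-speaker line is skipped, so its text gets folded into the preceding "Cheney." passage. That is the behaviour the passage 10 test trips over.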

Full script below:

import nltk
import re
import requests
import pdfplumber


def extract_q_a_locations(examination_text: str) -> list:
    """Return the regex matches marking the start of each Q/A passage."""

    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces,
    # then a line number and more spaces
    prefix_regex = r'\n\s+\d+\s+'

    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'

    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"

    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"

    # search the argument, not the module-level `text` variable
    delims = list(re.finditer(pattern, examination_text))
    return delims

def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = delim.group().strip().split()[-1]

        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.end():next_delim.start()]
        
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
        
        q_a_list.append(f"{prefix} {text_chunk}")

    return q_a_list

if __name__ == "__main__":

    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"

    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)

    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])

    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]

    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)

    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2

    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")

    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")

    # (Note: for the next two passages/tests, the prefix/questioner is captured as "Cheney" &
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.)
    actual_passage7_start = "Cheney. And we also, just as" 
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")

    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")

    # HERE'S MY PROBLEM. 
    # This test fails because my regex fails to capture the question which starts with the 
    # redacted name of the staff/questioner. The only way I've managed to get this test to 
    # pass has also broken at least one of the tests above. 
    actual_passage10_start = " So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg

