Wednesday, 7 December 2022

Using lxml to parse text and break it into a list of sentences using some tags to add structure

Consider the following text in a custom XML format:

<?xml version="1.0"?>
<body>
    <heading><b>This is a title</b></heading>
    <p>This is a first <b>paragraph</b>.</p>
    <p>This is a second <b>paragraph</b>. With a list: 
        <ul>
            <li>first item</li>
            <li>second item</li>
        </ul>
    And the end.
    </p>
    <p>This is a third paragraph.
        <ul>
            <li>This is a first long sentence.</li>
            <li>This is a second long sentence.</li>
        </ul>
    And the end of the paragraph.</p>
</body>

I would like to convert that into a list of plain strings, following these rules:

  • Discard some tags like <b></b>
  • Each heading and each paragraph is a distinct element of the list. Add a final period if the element does not already end with one.
  • When a list is preceded by a colon ":", keep it inside the paragraph: put each item on its own line, prefixed with a dash.
  • When a list is not preceded by a colon, treat each item as a separate paragraph, splitting the surrounding paragraph around the list.

The result would be:

[
    "This is a title.", # Note the period
    "This is a first paragraph.",
    "This is a second paragraph. With a list:\n- first item\n- second item\nAnd the end.",
    "This is a third paragraph.",
    "This is a first long sentence.",
    "This is a second long sentence.",
    "And the end of the paragraph."
]

I would like to do this by iterating over the tree returned by lxml's etree.fromstring(text). My first few attempts are overly complicated and slow, and I'm sure there is a cleaner approach to this problem.

How can I do this?
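
For reference, here is a minimal sketch of one possible approach, not a definitive answer. It assumes that only <heading> and <p> are block elements, that <ul> appears only as a direct child of a block, and that every other child tag (such as <b>) is inline and can be flattened to its text. The helper names (norm, add_period, split_block, xml_to_strings) are made up for illustration. It relies only on the .text, .tail and itertext() parts of the element API, so it falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the standard library exposes the same .text/.tail API
    import xml.etree.ElementTree as etree

BLOCK_TAGS = {"heading", "p"}  # assumption: the only block-level tags

SAMPLE = b"""<?xml version="1.0"?>
<body>
    <heading><b>This is a title</b></heading>
    <p>This is a first <b>paragraph</b>.</p>
    <p>This is a second <b>paragraph</b>. With a list:
        <ul>
            <li>first item</li>
            <li>second item</li>
        </ul>
    And the end.
    </p>
    <p>This is a third paragraph.
        <ul>
            <li>This is a first long sentence.</li>
            <li>This is a second long sentence.</li>
        </ul>
    And the end of the paragraph.</p>
</body>
"""


def norm(text):
    """Collapse runs of whitespace and strip both ends."""
    return " ".join((text or "").split())


def add_period(text):
    """Append a final period if the string does not end with one."""
    return text if text.endswith(".") else text + "."


def split_block(block):
    """Turn one <heading> or <p> element into one or more plain strings."""
    results = []
    lines = []                     # finished lines of the string being built
    current = [block.text or ""]   # raw text pieces of the line in progress

    def end_line():
        line = norm("".join(current))
        current.clear()
        if line:
            lines.append(line)

    def flush():
        end_line()
        if lines:
            results.append(add_period("\n".join(lines)))
            lines.clear()

    for child in block:
        if child.tag == "ul":
            items = [norm("".join(li.itertext())) for li in child.findall("li")]
            if norm("".join(current)).endswith(":"):
                # Colon before the list: keep the items inside the paragraph.
                end_line()
                lines.extend("- " + item for item in items)
            else:
                # No colon: close the paragraph, emit each item on its own.
                flush()
                results.extend(add_period(item) for item in items)
        else:
            # Any other child is treated as inline: keep its text, drop the tag.
            current.append("".join(child.itertext()))
        current.append(child.tail or "")  # text that follows the child
    flush()
    return results


def xml_to_strings(xml):
    """Parse the document and return the list of plain strings."""
    root = etree.fromstring(xml)
    results = []
    for block in root:
        if block.tag in BLOCK_TAGS:
            results.extend(split_block(block))
    return results


if __name__ == "__main__":
    for s in xml_to_strings(SAMPLE):
        print(repr(s))
```

The sketch visits each element once and only concatenates strings, so it should stay linear in the size of the document. The key point is that text following a child element lives in that child's .tail rather than in the parent's .text, which is why the tail is appended after every child, including <ul>.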


