Saturday, 17 December 2022

Regex match up to the LAST occurrence of a pattern (e.g. </div>) BEFORE another matching pattern (e.g. </span>)

In other words, there can be no other occurrence of the first pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.

In my specific case I have a page of HTML and need to extract all the content between

<w-block-content><span><div>

and

</div></span></w-block-content>

where

  • the elements might have attributes
  • the HTML might be formatted or not - there might be extra white space and newlines
  • there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
    • contains ONLY ONE direct <span> child (though it may contain other non-span children)
      • which contains ONLY ONE direct <div> child
        • which wraps the content that must be extracted
  • 🚩 the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if that </div> has no matching opening <div>.
  • the solution must be a pure ECMAScript-spec regex. No JavaScript code can be used

Thus the problem stated in the question at the top.

The following regex successfully matches as long as there are NO internal </div> tags:

(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.

I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
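
To make the failure concrete, here is a small harness that runs the regex above against a made-up one-block input. The input string and the variable names are hypothetical and are not taken from the regex101 test; the JavaScript exists only to execute the pattern.

```javascript
// The regex from the question, unchanged.
const attempt = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Hypothetical minimal input: the outer <div> contains one inner <div>...</div>.
const html =
  '<w-block-content><span>\n<div>\nA<div>B</div>C\n</div>\n</span></w-block-content>';

// Group 1 is the extracted content.
const captured = attempt.exec(html)[1];

// The lazy capture stops at the FIRST </div>, so "</div>C\n" is lost.
console.log(JSON.stringify(captured)); // "\nA<div>B"
```

The wanted capture for this input would be "\nA<div>B</div>C\n", that is, everything up to the last </div> before </span>.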

Here is sample test data:

</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><div><b>Master č. 2</b>                  </div><br>

                  </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>
</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><b>Master č. 2</b><br>
                  
                   </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>

which I've been testing here: https://regex101.com/r/jekZhr/3

The first extracted chunk should be:


Další master<br><div><b>Master č. 2</b>                  </div><br>

                  

I know that regex is not the best tool for handling XML/HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.
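
One direction that might work, offered as a sketch and not as a confirmed answer: keep the capture lazy, but after the closing </div> only allow text that contains no further </div> before the </span>. A negative lookahead, (?!<\/div>), can express that, so the engine is forced to extend the capture until it reaches the last </div>. The pattern name `candidate`, the [^>]* attribute handling, and the two-block input below are all assumptions made for illustration. The JavaScript is only a harness; the pattern itself is a single regex.

```javascript
// Candidate pattern (an assumption, not taken from the question):
//  - <tag\b[^>]*> tolerates attributes on each opening tag
//  - ([\s\S]*?)<\/div> captures lazily up to some </div>
//  - (?:(?!<\/div>)[\s\S])*?<\/span> succeeds only if no other </div>
//    sits between that </div> and the next </span>
const candidate =
  /<w-block-content\b[^>]*>[\s\S]*?<span\b[^>]*>[\s\S]*?<div\b[^>]*>([\s\S]*?)<\/div>(?:(?!<\/div>)[\s\S])*?<\/span>[\s\S]*?<\/w-block-content>/g;

// Hypothetical input: the first block has an inner <div>, the second does not.
const html = [
  '<w-block-content data-id="1"><span class="tip">',
  '  <div>',
  'A<br><div><b>B</b> </div><br>',
  '  </div>',
  '</span></w-block-content>',
  '<w-block-content data-id="2"><span class="tip">',
  '  <div>',
  'C<br><b>D</b><br>',
  '  </div>',
  '</span></w-block-content>',
].join('\n');

// Collect group 1 of every match.
const chunks = Array.from(html.matchAll(candidate), (m) => m[1]);

console.log(JSON.stringify(chunks[0])); // "\nA<br><div><b>B</b> </div><br>\n  "
console.log(JSON.stringify(chunks[1])); // "\nC<br><b>D</b><br>\n  "
```

A known limit of this sketch: if the extracted content itself contains a </span> that follows a </div> with no later </div> in between, the match would still end early, so it relies on the structural assumptions listed above.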


