Hemant Vishwakarma: Regex match up to the LAST occurrence of a pattern (e.g. </div>) BEFORE another matching pattern (e.g. </span>)

Saturday, 17 December 2022

Regex match up to the LAST occurrence of a pattern (e.g. </div>) BEFORE another matching pattern (e.g. </span>)

In other words, there must be no further occurrence of the first pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.

In my specific case I have a page of HTML and need to extract all the content between

<w-block-content><span><div>

and

</div></span></w-block-content>

where

- the elements might have attributes
- the HTML might or might not be formatted: there might be extra whitespace and newlines
- there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
  - contains ONLY ONE direct <span> child (i.e. it may contain other non-span children)
    - which contains ONLY ONE direct <div> child
      - which wraps the content that must be extracted
- 🚩 the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if it is not paired with an opening <div>
- the solution must be a pure ECMAScript-spec regex; no JavaScript code can be used

Thus the problem stated in the question at the top.

The following regex successfully matches as long as there are NO internal </div> tags:

(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.

I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
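
A minimal reproduction of the failure, assuming a shortened stand-in for the sample data. The `html` string and its `data-id` attribute are made up for illustration, and the JavaScript wrapper only drives the regex, which is copied unchanged from above:

```javascript
// The regex from the question, copied verbatim.
const lazyRegex = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Hypothetical, shortened stand-in for the sample data (one inner <div>).
const html = [
  '<w-block-content data-id="1"><span class="tip">',
  '  <div>',
  'A<br><div><b>B</b></div><br>',
  '  </div>',
  '</span></w-block-content>',
].join('\n');

// The lazy group stops at the FIRST </div>, i.e. the inner one.
const captured = lazyRegex.exec(html)[1];
console.log(JSON.stringify(captured)); // "\nA<br><div><b>B</b>"
```

The capture is cut off at the inner `</div>`, so the trailing `<br>` and everything else up to the outer `</div>` is lost.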

Here is sample test data:

</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><div><b>Master č. 2</b>                  </div><br>

                  </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>
</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><b>Master č. 2</b><br>
                  
                   </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>

which I've been testing here: https://regex101.com/r/jekZhr/3

The first extracted chunk should be:


Další master<br><div><b>Master č. 2</b>                  </div><br>

I know that regex is not the best tool for handling XML/HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.
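
One direction that might work is sketched here. It is not part of the original question, and it assumes the extracted content never contains a `</span>` of its own and that attribute values never contain `>`. The idea is to make the capture greedy but forbid it from crossing `</span>`, so that backtracking lands on the last `</div>` before it. The name `lastDivBeforeSpan` and the shortened `sample` data are hypothetical, and the JavaScript wrapper is only there to exercise the pattern:

```javascript
// Hypothetical pattern: the capture group is greedy, but the negative lookahead
// (?!<\/span\s*>) keeps it from running past the first </span>. When the engine
// then needs <\/div\s*>, it backtracks to the LAST </div> before that </span>.
const lastDivBeforeSpan =
  /<w-block-content\b[^>]*>[\s\S]*?<span\b[^>]*>[\s\S]*?<div\b[^>]*>((?:(?!<\/span\s*>)[\s\S])*)<\/div\s*>[\s\S]*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g;

// Hypothetical, shortened stand-in for the two sample blocks.
const sample = [
  '<w-block-content data-id="1"><span class="tip">',
  '  <div>',
  'A<br><div><b>B</b></div><br>',
  '  </div>',
  '</span></w-block-content>',
  '<w-block-content data-id="2"><span class="tip">',
  '  <div>',
  'C<br><b>D</b><br>',
  '  </div>',
  '</span></w-block-content>',
].join('\n');

// Collect group 1 of every match; trim() is only for readable output.
const chunks = [...sample.matchAll(lastDivBeforeSpan)].map((m) => m[1].trim());
console.log(chunks); // [ 'A<br><div><b>B</b></div><br>', 'C<br><b>D</b><br>' ]
```

Because the capture ends at the last `</div>` regardless of nesting, an unpaired `</div>` inside the content is included as well. If the content can contain inner `<span>` elements, this sketch stops too early, and changing the data structure or using a parser is probably the safer route.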

